Ext3 related oops and a crash



We have a knfs fileserver here running ext3 on a 2.4.18 kernel with three 
filesystems:

<CLIP>
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda3             10080520   3609632   6368476  37% /
/dev/hda4             16437332  12295408   3974932  76% /home
/dev/md1             1024872060 409906136 614441636  41% /fs
</CLIP>

The /fs filesystem lives on a software RAID5 md1 partition consisting of 
7 SCSI disks across two SCSI cables, attached to a QLogic SCSI controller 
(qla1280 driver). Quotas are turned on, but I have been getting crashes 
with quotas off too. Almost all of the load is on the /fs filesystem.

The kernel is tainted by Intel's e1000 gigabit ethernet driver. However, 
these problems have also occurred when the Intel driver is not in use.

I have tried to rule out motherboard/CPU/memory problems by running kernel 
compilations in a loop for 24 hours (trying to provoke a sig11). No 
problems there.

The machine has an Intel dual-CPU motherboard. Because of these stability 
problems we are already running a single-CPU kernel, but that does not 
seem to have helped.

About once every two weeks the machine crashes, always giving a 
filesystem-related stack trace (it has a serial console, so I get the 
stack trace even when it can no longer be saved to disk).

Here is the latest oops and a panic occurring right after it (it seems 
that the system was badly messed up after the oops, so the last two 
traces might not be interesting). Does anyone have any idea what the 
problem might be? I also have other traces from when the system was still 
running SMP.

The oops:

Oops: 0000
CPU:    0
EIP:    0010:[<c0165e69>]    Tainted: P 
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000   ebx: dbffe1a0   ecx: eb270ba0   edx: dbffe1a0
esi: 00000000   edi: f1f4ed90   ebp: f1f4edc0   esp: f630be70
ds: 0018   es: 0018   ss: 0018
Process kjournald (pid: 1274, stackpage=f630b000)
Stack: dbffe1a0 f1f4e970 c01614c3 dbffe1a0 f1f4e970 00000000 00000000 00000000
       00000023 eb270ba0 e3254940 005cb334 0100010a c01e93d3 f783dda0 00000001
       c351dda0 c351d260 cbc28c00 cbc28540 d882c260 c05e8ce0 da0c0180 da0c06c0
Call Trace: [<c01614c3>] [<c01e93d3>] [<c0164375>] [<c01641e0>] [<c0105726>]
   [<c0164200>]
Code: 8b 56 04 85 d2 79 23 68 c7 fd 22 c0 68 bc 06 00 00 68 19 fb 

>>EIP; c0165e69 <__journal_remove_journal_head+9/e0>   <=====
Trace; c01614c3 <journal_commit_transaction+343/119a>
Trace; c01e93d3 <ip_rcv+313/3a0>
Trace; c0164375 <kjournald+175/2b0>
Trace; c01641e0 <commit_timeout+0/10>
Trace; c0105726 <kernel_thread+26/30>
Trace; c0164200 <kjournald+0/2b0>
Code;  c0165e69 <__journal_remove_journal_head+9/e0>
00000000 <_EIP>:
Code;  c0165e69 <__journal_remove_journal_head+9/e0>   <=====
   0:   8b 56 04                  mov    0x4(%esi),%edx   <=====
Code;  c0165e6c <__journal_remove_journal_head+c/e0>
   3:   85 d2                     test   %edx,%edx
Code;  c0165e6e <__journal_remove_journal_head+e/e0>
   5:   79 23                     jns    2a <_EIP+0x2a> c0165e93 <__journal_remove_journal_head+33/e0>
Code;  c0165e70 <__journal_remove_journal_head+10/e0>
   7:   68 c7 fd 22 c0            push   $0xc022fdc7
Code;  c0165e75 <__journal_remove_journal_head+15/e0>
   c:   68 bc 06 00 00            push   $0x6bc
Code;  c0165e7a <__journal_remove_journal_head+1a/e0>
  11:   68 19 fb 00 00            push   $0xfb19
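
As a rough aid to reading the oops above, the address touched by the faulting instruction can be worked out from the register dump. The sketch below is only an illustration: the helper names parse_registers and effective_address are hypothetical, and the register values and the 0x4 displacement are copied from the dump and from the decoded instruction "mov 0x4(%esi),%edx". It shows what address was read, not why %esi was zero.

```python
import re

# Register values copied from the first oops above.
REGISTER_DUMP = """
eax: 00000000   ebx: dbffe1a0   ecx: eb270ba0   edx: dbffe1a0
esi: 00000000   edi: f1f4ed90   ebp: f1f4edc0   esp: f630be70
"""


def parse_registers(dump):
    """Turn 'name: hexvalue' pairs from an oops register dump into a dict."""
    pairs = re.findall(r"(\w+): ([0-9a-f]{8})", dump)
    return {name: int(value, 16) for name, value in pairs}


def effective_address(registers, base, displacement):
    """Address used by a 'disp(%base)' operand, wrapped to 32 bits."""
    return (registers[base] + displacement) & 0xFFFFFFFF


regs = parse_registers(REGISTER_DUMP)

# Faulting instruction: mov 0x4(%esi),%edx
fault_address = effective_address(regs, "esi", 0x4)
print(hex(fault_address))
```

With %esi equal to zero the load goes to a very low address, which would be consistent with a field being read through a NULL pointer in __journal_remove_journal_head. That is one reading of the dump, not a confirmed diagnosis.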


Here is the second stack trace:

invalid operand: 0000
CPU:    0
EIP:    0010:[<c011ab96>]    Tainted: P 
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000   ebx: 00000000   ecx: 00000000   edx: 00000000
esi: f630a000   edi: 0000000b   ebp: 00000004   esp: f630bd44
ds: 0018   es: 0018   ss: 0018
Process kjournald (pid: 0, stackpage=f630b000)
Stack: c022c170 f630be3c c0114b9e c023ff2a 00000000 c01074b6 00000000 c02406e6
       c0225f4f c022c170 00000000 00000001 00000000 c0165e69 c0165e69 c0114f87
       0000000b f630be3c 00000000 00000282 f630a000 00000000 f630a000 00000000
Call Trace: [<c0114b9e>] [<c01074b6>] [<c0165e69>] [<c0165e69>] [<c0114f87>]
   [<f899b662>] [<f8906195>] [<f8960a80>] [<c01a456a>] [<c010b30d>] [<c0135f05>]
   [<c0114be0>] [<c0107024>] [<c0160018>] [<c0165e69>] [<c01614c3>] [<c01e93d3>]
   [<c0164375>] [<c01641e0>] [<c0105726>] [<c0164200>]
Code: 0f 0b e9 a9 fe ff ff 8d 76 00 53 8b 44 24 08 8b 5c 24 0c 85 

>>EIP; c011ab96 <do_exit+1b6/1c0>   <=====
Trace; c0114b9e <bust_spinlocks+3e/50>
Trace; c01074b6 <die+46/60>
Trace; c0165e69 <__journal_remove_journal_head+9/e0>
Trace; c0165e69 <__journal_remove_journal_head+9/e0>
Trace; c0114f87 <do_page_fault+3a7/4eb>
Trace; f899b662 <[nfsd]nfsd_proc_rename+52/110>
Trace; f8906195 <[md]md_make_request+35/70>
Trace; f8960a80 <[sd_mod]sd_template+0/0>
Trace; c01a456a <generic_make_request+13a/150>
Trace; c010b30d <call_apic_timer_interrupt+5/18>
Trace; c0135f05 <__refile_buffer+55/60>
Trace; c0114be0 <do_page_fault+0/4eb>
Trace; c0107024 <error_code+34/40>
Trace; c0160018 <journal_get_undo_access+68/110>
Trace; c0165e69 <__journal_remove_journal_head+9/e0>
Trace; c01614c3 <journal_commit_transaction+343/119a>
Trace; c01e93d3 <ip_rcv+313/3a0>
Trace; c0164375 <kjournald+175/2b0>
Trace; c01641e0 <commit_timeout+0/10>
Trace; c0105726 <kernel_thread+26/30>
Trace; c0164200 <kjournald+0/2b0>
Code;  c011ab96 <do_exit+1b6/1c0>
00000000 <_EIP>:
Code;  c011ab96 <do_exit+1b6/1c0>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c011ab98 <do_exit+1b8/1c0>
   2:   e9 a9 fe ff ff            jmp    fffffeb0 <_EIP+0xfffffeb0> c011aa46 <do_exit+66/1c0>
Code;  c011ab9d <do_exit+1bd/1c0>
   7:   8d 76 00                  lea    0x0(%esi),%esi
Code;  c011aba0 <complete_and_exit+0/20>
   a:   53                        push   %ebx
Code;  c011aba1 <complete_and_exit+1/20>
   b:   8b 44 24 08               mov    0x8(%esp,1),%eax
Code;  c011aba5 <complete_and_exit+5/20>
   f:   8b 5c 24 0c               mov    0xc(%esp,1),%ebx
Code;  c011aba9 <complete_and_exit+9/20>
  13:   85 00                     test   %eax,(%eax)

And here is the final crash:

<1>Unable to handle kernel paging request at virtual address fffffa50
c012e2d2
*pde = 00001063
Oops: 0000
CPU:    0
EIP:    0010:[<c012e2d2>]    Tainted: P 
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: fffffa38   ebx: 00000002   ecx: fffffa38   edx: 00000000
esi: f26ff3c0   edi: f26ff3c0   ebp: f77b53e0   esp: f630bad8
ds: 0018   es: 0018   ss: 0018
Process kjournald (pid: 0, stackpage=f630b000)
Stack: c01d865e f26ff3c0 00000000 c01d869b f26ff3c0 00000000 c01d8809 f26ff3c0 
       f26ff3c0 c020723c f26ff3c0 2030d680 00000006 00000000 f630bb28 f77b53e0 
       f77b53e0 00010000 f630b038 f630b038 cc0ad680 2030d680 f26ff3c0 ed9da000
Call Trace: [<c01d865e>] [<c01d869b>] [<c01d8809>] [<c020723c>] [<c011ee90>] 
   [<c01dc7da>] [<c010831a>] [<c011bca3>] [<c01084cc>] [<c010a4e8>] [<c0110018>] 
   [<c0117c20>] [<c011aa21>] [<c01074c2>] [<c0107700>] [<c0107780>] [<c011ab96>] 
   [<c01084bc>] [<c010a4e8>] [<c0107024>] [<c011ab96>] [<c0114b9e>] [<c01074b6>] 
   [<c0165e69>] [<c0165e69>] [<c0114f87>] [<f899b662>] [<f8906195>] [<f8960a80>] 
   [<c01a456a>] [<c010b30d>] [<c0135f05>] [<c0114be0>] [<c0107024>] [<c0160018>] 
   [<c0165e69>] [<c01614c3>] [<c01e93d3>] [<c0164375>] [<c01641e0>] [<c0105726>] 
   [<c0164200>] 
Code: 8b 41 18 a9 00 40 00 00 75 14 ff 49 14 0f 94 c0 84 c0 74 0a 

>>EIP; c012e2d2 <__free_pages+2/30>   <=====
Trace; c01d865e <skb_release_data+3e/70>
Trace; c01d869b <kfree_skbmem+b/70>
Trace; c01d8809 <__kfree_skb+109/110>
Trace; c020723c <arp_rcv+44c/460>
Trace; c011ee90 <update_process_times+20/b0>
Trace; c01dc7da <net_rx_action+12a/210>
Trace; c010831a <handle_IRQ_event+3a/70>
Trace; c011bca3 <do_softirq+53/a0>
Trace; c01084cc <do_IRQ+9c/b0>
Trace; c010a4e8 <call_do_IRQ+5/d>
Trace; c0110018 <centaur_get_mcr+18/90>
Trace; c0117c20 <panic+e0/f0>
Trace; c011aa21 <do_exit+41/1c0>
Trace; c01074c2 <die+52/60>
Trace; c0107700 <do_invalid_op+0/90>
Trace; c0107780 <do_invalid_op+80/90>
Trace; c011ab96 <do_exit+1b6/1c0>
Trace; c01084bc <do_IRQ+8c/b0>
Trace; c010a4e8 <call_do_IRQ+5/d>
Trace; c0107024 <error_code+34/40>
Trace; c011ab96 <do_exit+1b6/1c0>
Trace; c0114b9e <bust_spinlocks+3e/50>
Trace; c01074b6 <die+46/60>
Trace; c0165e69 <__journal_remove_journal_head+9/e0>
Trace; c0165e69 <__journal_remove_journal_head+9/e0>
Trace; c0114f87 <do_page_fault+3a7/4eb>
Trace; f899b662 <[nfsd]nfsd_proc_rename+52/110>
Trace; f8906195 <[md]md_make_request+35/70>
Trace; f8960a80 <[sd_mod]sd_template+0/0>
Trace; c01a456a <generic_make_request+13a/150>
Trace; c010b30d <call_apic_timer_interrupt+5/18>
Trace; c0135f05 <__refile_buffer+55/60>
Trace; c0114be0 <do_page_fault+0/4eb>
Trace; c0107024 <error_code+34/40>
Trace; c0160018 <journal_get_undo_access+68/110>
Trace; c0165e69 <__journal_remove_journal_head+9/e0>
Trace; c01614c3 <journal_commit_transaction+343/119a>
Trace; c01e93d3 <ip_rcv+313/3a0>
Trace; c0164375 <kjournald+175/2b0>
Trace; c01641e0 <commit_timeout+0/10>
Trace; c0105726 <kernel_thread+26/30>
Trace; c0164200 <kjournald+0/2b0>
Code;  c012e2d2 <__free_pages+2/30>
00000000 <_EIP>:
Code;  c012e2d2 <__free_pages+2/30>   <=====
   0:   8b 41 18                  mov    0x18(%ecx),%eax   <=====
Code;  c012e2d5 <__free_pages+5/30>
   3:   a9 00 40 00 00            test   $0x4000,%eax
Code;  c012e2da <__free_pages+a/30>
   8:   75 14                     jne    1e <_EIP+0x1e> c012e2f0 <__free_pages+20/30>
Code;  c012e2dc <__free_pages+c/30>
   a:   ff 49 14                  decl   0x14(%ecx)
Code;  c012e2df <__free_pages+f/30>
   d:   0f 94 c0                  sete   %al
Code;  c012e2e2 <__free_pages+12/30>
  10:   84 c0                     test   %al,%al
Code;  c012e2e4 <__free_pages+14/30>
  12:   74 0a                     je     1e <_EIP+0x1e> c012e2f0 <__free_pages+20/30>
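
A quick consistency check on this last trace, written as a small sketch: the decoded instruction is "mov 0x18(%ecx),%eax", so %ecx plus 0x18 should equal the address in the "Unable to handle kernel paging request" line. The values are copied from the dump; the helper name to_signed32 is hypothetical and only there to show %ecx as a signed 32-bit number.

```python
def to_signed32(value):
    """Interpret a 32-bit register value as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


# Values copied from the final crash above.
ecx = 0xFFFFFA38             # register dump: ecx: fffffa38
reported_fault = 0xFFFFFA50  # "paging request at virtual address fffffa50"

# Faulting instruction: mov 0x18(%ecx),%eax
computed = (ecx + 0x18) & 0xFFFFFFFF
matches = computed == reported_fault
ecx_signed = to_signed32(ecx)

print(hex(computed), matches, ecx_signed)
```

The computed address agrees with the reported one, so the decoded instruction and the register dump are at least consistent with each other. Read as a signed value, %ecx is a small negative number rather than something that looks like a pointer; what that says about the state of the system after the first oops is not something this check can settle.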

- Jani