[Linux-cluster] oops after 12 hours during umount
Daniel McNeil
daniel at osdl.org
Wed Apr 13 21:56:08 UTC 2005
On Mon, 2005-04-11 at 20:30, David Teigland wrote:
> On Mon, Apr 11, 2005 at 05:13:06PM -0700, Daniel McNeil wrote:
> > I started my mount/tar/rm/ tests on Apr 4 17:41 and I hit
> > a problem at Apr 6 05:30. So the test ran for 36 hours.
> > cl030 and cl031 were getting "SM: process_reply invalid"
> > messages and cl032 got "No response" and "Missed too many
> > heartbeats"
>
> The SM messages are an effect of CMAN removing nodes. There's a fair
> chance that this recent fix will help:
> http://sources.redhat.com/ml/cluster-cvs/2005-q2/msg00018.html
Good news and bad news.
Good news: I think my previous problem was caused by a
network upgrade that accidentally cut off one of my nodes.
Bad news: after upgrading to the latest CVS I hit an oops after
12 hours. The oops below looks like we are accessing freed memory.
I have slab debug and spin lock debug configured.
Here's the oops:
Unable to handle kernel paging request at virtual address 6b6b6bbf
printing eip:
c03e8682
*pde = 00000000
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod video
CPU: 0
EIP: 0060:[<c03e8682>] Not tainted VLI
EFLAGS: 00010246 (2.6.11)
EIP is at _spin_lock+0x22/0x90
eax: 00000000 ebx: 6b6b6bbf ecx: 00000001 edx: cdc82000
esi: cdc82000 edi: 6b6b6bbf ebp: cdc82ea4 esp: cdc82e9c
ds: 007b es: 007b ss: 0068
Process umount (pid: 14022, threadinfo=cdc82000 task=cc113a60)
Stack: d2bee958 d2beea7c cdc82ebc c0162f06 d2bee958 d2bee968 d2bee958 6b6b6b6b
cdc82edc c017bb24 d2bee958 00004192 00000001 cdc82eec ce844050 f90314e0
cdc82efc c017bc14 cbd665d0 cdc82eec d2bee4ec cbe47b3c cbd66544 ce844050
Call Trace:
[<c01041ff>] show_stack+0x7f/0xa0
[<c01043b2>] show_registers+0x162/0x1e0
[<c01045de>] die+0xfe/0x190
[<c0115892>] do_page_fault+0x3b2/0x6f2
[<c0103e57>] error_code+0x2b/0x30
[<c0162f06>] invalidate_inode_buffers+0x46/0x90
[<c017bb24>] invalidate_list+0x44/0xe0
[<c017bc14>] invalidate_inodes+0x54/0x90
[<c0167974>] generic_shutdown_super+0x74/0x140
[<f9010aee>] gfs_kill_sb+0x2e/0x69 [gfs]
[<c0167821>] deactivate_super+0x81/0xa0
[<c017ed5c>] sys_umount+0x3c/0xa0
[<c017edd9>] sys_oldumount+0x19/0x20
[<c010335d>] sysenter_past_esp+0x52/0x75
Code: 00 00 00 8d bf 00 00 00 00 55 89 e5 83 ec 08 89 1c 24 89 c3 b8 01 00 00 00 89 74 24 04 e8 47 06 d3 ff be 00 f0 ff ff 21 e6 31 c0 <86> 03 84 c0 7e 0b 8b 1c 24 8b 74 24 04 89 ec 5d c3 b8 01 00 00
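
For anyone reading the trace above: the faulting address 6b6b6bbf and
ebx/edi both sit a small offset past 6b6b6b6b, and 0x6b is the byte that
slab debugging writes over freed objects. The marked instruction bytes
(<86> 03) appear to be an xchg on (%ebx), i.e. the spin lock itself was
being taken through a pointer read out of poisoned memory. A minimal
sketch of that arithmetic follows; the helper names (poison_offset,
decode_i386_fault) and the 0x1000 offset cutoff are made up for
illustration and are not part of any kernel tool.

```python
POISON_FREE = 0x6B  # byte slab debugging writes over freed objects


def poison_offset(addr, width=4):
    """Return addr's offset from an all-0x6b pointer of `width` bytes.

    Returns None when addr is not within a small (assumed < 0x1000)
    offset of the poison pattern, i.e. probably not a poisoned pointer.
    """
    poison_ptr = int.from_bytes(bytes([POISON_FREE]) * width, "big")
    offset = addr - poison_ptr
    return offset if 0 <= offset < 0x1000 else None


def decode_i386_fault(error_code):
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        "page_present": bool(error_code & 0x1),  # 0 means page not found
        "write": bool(error_code & 0x2),         # 1 means write access
        "user_mode": bool(error_code & 0x4),     # 0 means kernel mode
    }


fault_addr = 0x6B6B6BBF  # "virtual address 6b6b6bbf" from the oops
print(hex(poison_offset(fault_addr)))  # offset of the lock in the freed struct
print(decode_i386_fault(0x0002))       # "Oops: 0002"
```

Under these assumptions the fault is a kernel-mode write to a
not-present page at poison + 0x54, which fits a freed structure being
dereferenced from invalidate_inode_buffers during umount. Which
structure was freed cannot be determined from the offset alone.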
Daniel