[Linux-cluster] oops after 12 hours during umount

Wed Apr 13 21:56:08 UTC 2005

On Mon, 2005-04-11 at 20:30, David Teigland wrote:
> On Mon, Apr 11, 2005 at 05:13:06PM -0700, Daniel McNeil wrote:
> > I started my mount/tar/rm/ tests on Apr  4 17:41 and I hit
> > a problem at Apr  6 05:30.  So the test ran for 36 hours.
> > cl030 and cl031 were getting "SM: process_reply invalid"
> > messages and cl032 got "No response" and "Missed too many
> > heartbeats"
> 
> The SM messages are an effect of CMAN removing nodes.  There's a fair
> chance that this recent fix will help:
> http://sources.redhat.com/ml/cluster-cvs/2005-q2/msg00018.html

Good news and bad news. 

Good news: I think my previous problem was caused by a
network upgrade that accidentally cut off one of my nodes.

Bad news: after upgrading to the latest CVS I hit an oops after
12 hours.  The oops below looks like we are accessing freed memory.
I have slab debug and spin lock debug configured.

Here's the oops:

Unable to handle kernel paging request at virtual address 6b6b6bbf
 printing eip:
c03e8682
*pde = 00000000
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod video
CPU:    0
EIP:    0060:[<c03e8682>]    Not tainted VLI
EFLAGS: 00010246   (2.6.11)
EIP is at _spin_lock+0x22/0x90
eax: 00000000   ebx: 6b6b6bbf   ecx: 00000001   edx: cdc82000
esi: cdc82000   edi: 6b6b6bbf   ebp: cdc82ea4   esp: cdc82e9c
ds: 007b   es: 007b   ss: 0068
Process umount (pid: 14022, threadinfo=cdc82000 task=cc113a60)
Stack: d2bee958 d2beea7c cdc82ebc c0162f06 d2bee958 d2bee968 d2bee958 6b6b6b6b
       cdc82edc c017bb24 d2bee958 00004192 00000001 cdc82eec ce844050 f90314e0
       cdc82efc c017bc14 cbd665d0 cdc82eec d2bee4ec cbe47b3c cbd66544 ce844050
Call Trace:
 [<c01041ff>] show_stack+0x7f/0xa0
 [<c01043b2>] show_registers+0x162/0x1e0
 [<c01045de>] die+0xfe/0x190
 [<c0115892>] do_page_fault+0x3b2/0x6f2
 [<c0103e57>] error_code+0x2b/0x30
 [<c0162f06>] invalidate_inode_buffers+0x46/0x90
 [<c017bb24>] invalidate_list+0x44/0xe0
 [<c017bc14>] invalidate_inodes+0x54/0x90
 [<c0167974>] generic_shutdown_super+0x74/0x140
 [<f9010aee>] gfs_kill_sb+0x2e/0x69 [gfs]
 [<c0167821>] deactivate_super+0x81/0xa0
 [<c017ed5c>] sys_umount+0x3c/0xa0
 [<c017edd9>] sys_oldumount+0x19/0x20
 [<c010335d>] sysenter_past_esp+0x52/0x75
Code: 00 00 00 8d bf 00 00 00 00 55 89 e5 83 ec 08 89 1c 24 89 c3 b8 01 00 00 00 89 74 24 04 e8 47 06 d3 ff be 00 f0 ff ff 21 e6 31 c0 <86> 03 84 c0 7e 0b 8b 1c 24 8b 74 24 04 89 ec 5d c3 b8 01 00 00
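
Why this looks like freed memory: with slab debug on, freed objects are
filled with the byte 0x6b, so a pointer loaded from a freed object reads
back as 6b6b6b6b.  The faulting address (also in ebx and edi) is 6b6b6bbf,
and a bare 6b6b6b6b word shows up in the stack dump.  The faulting
instruction in the Code: line (<86> 03) appears to be the xchg on the lock
byte at (%ebx).  Below is a minimal sketch of the arithmetic; the helper
name poison_offset and the 0x1000 window are my own assumptions for
illustration, not anything from the kernel:

```python
# Byte the slab debug code writes over freed objects.
POISON_FREE = 0x6B


def poison_offset(addr, max_offset=0x1000):
    """Return addr's offset from a 32-bit all-0x6b pointer, or None.

    max_offset is an arbitrary window (assumption): a small positive
    offset suggests a struct member reached through a poisoned pointer.
    """
    poisoned_ptr = int.from_bytes(bytes([POISON_FREE] * 4), "little")  # 0x6b6b6b6b
    offset = addr - poisoned_ptr
    return offset if 0 <= offset < max_offset else None


fault_addr = 0x6B6B6BBF  # "virtual address" reported by the oops
offset = poison_offset(fault_addr)
if offset is None:
    print("address is not near the slab poison pattern")
else:
    print(hex(offset))  # prints 0x54
```

So the lock being taken sits 0x54 bytes into whatever structure the
poisoned pointer was assumed to point at, presumably one reached through
an inode that had already been freed when invalidate_inode_buffers ran.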

Daniel