[Linux-cluster] GFS2 panic on current release

Wed Nov 18 16:29:02 UTC 2009

Hi,

Can you open a bz for that one? It's not related to any issues on disk
and it looks like a new bug to me. It is a fairly rare event though, I
think...

You need a rename which is going to unlink a target inode, combined with
the requirement to allocate a block in order to satisfy the space
requirement for adding the new entry to the directory, combined with
there being an "unlinked but not deallocated" inode sitting in the same
resource group as was chosen for the allocation.

If you unmount on all nodes, run fsck.gfs2 (the very latest one with the
#500483 fix - that's very important) on the filesystem, then that will
greatly reduce the chances of you hitting this again in the near future.

The message refers to a bug trap for recursive locking which was added
some time back to catch any cases like this before any corruption
occurs.
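
As a rough illustration of what such a trap checks for, here is a
simplified user-space sketch. It is not the actual GFS2 code: the struct
and function names are made up for the example, and it returns an error
where the kernel hits a BUG. The idea is that a request for a lock is
refused if the same pid already has a holder queued on it, which matches
the "original" and "new" entries in the trace both showing pid 12810.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for a glock and its holders. */
struct holder {
    pid_t owner;            /* process that queued this holder */
    struct holder *next;    /* next holder queued on the same lock */
};

struct glock {
    struct holder *holders; /* holders currently queued on this lock */
};

/*
 * Queue gh on gl. Returns 0 on success, or -1 if the requesting process
 * already has a holder on this lock (a recursive request). The sketch
 * only reports the error rather than stopping the machine.
 */
static int glock_enqueue(struct glock *gl, struct holder *gh)
{
    for (struct holder *h = gl->holders; h != NULL; h = h->next) {
        if (h->owner == gh->owner)
            return -1;      /* same pid asking for the same lock again */
    }
    gh->next = gl->holders;
    gl->holders = gh;
    return 0;
}
```

With this sketch, a second request from the same pid on the same lock is
rejected, while a request from a different pid is queued as normal.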

Steve.

On Wed, 2009-11-18 at 11:06 -0500, Allen Belletti wrote:
> Hi All,
> 
> A few weeks ago I discovered that I'd had an obsolete gfs2 kernel module 
> loaded and removed it, thus bringing it up to the revision included in 
> the current kernel.  Was hoping that all was well, but then yesterday 
> morning one of the nodes panicked as follows:
> 
> original: gfs2_rename+0x19d/0x63b [gfs2]
> pid : 12810
> lock type: 3 req lock state : 1
> new: gfs2_rlist_alloc+0x5c/0x6a [gfs2]
> pid: 12810
> lock type: 3 req lock state : 1
>   G:  s:EX n:3/33d0327 f:y t:EX d:EX/0 l:0 a:5 r:4
>    H: s:EX f:H e:0 p:12810 [imap] gfs2_rename+0x19d/0x63b [gfs2]
>    R: n:54330151 f:05 b:274/274 i:1121
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at fs/gfs2/glock.c:1074
> invalid opcode: 0000 [1] SMP
> last sysfs file: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/irq
> CPU 1
> Modules linked in: nfs fscache nfs_acl lock_dlm gfs2 dlm configfs lockd 
> sunrpc ipv6 xfrm_nalgo crypto_api ipt_LOG xt_state ip_conntrack 
> nfnetlink xt_tcpudp iptable_filter ip_tables x_tables 8021q dm_multipath 
> scsi_dh video backlight sbs i2c_ec button battery asus_acpi 
> acpi_memhotplug ac parport_pc lp parport i2c_amd756 k8temp ide_cd 
> i2c_core hwmon sg amd_rng cdrom k8_edac pcspkr tg3 floppy edac_mc e1000 
> dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero 
> dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc shpchp mptspi mptscsih 
> mptbase scsi_transport_spi sd_mod scsi_mod raid1 ext3 jbd uhci_hcd 
> ohci_hcd ehci_hcd
> Pid: 12810, comm: imap Not tainted 2.6.18-164.6.1.el5 #1
> RIP: 0010:[<ffffffff8862a6df>]  [<ffffffff8862a6df>] 
> :gfs2:gfs2_glock_nq+0x231/0x273
> RSP: 0018:ffff8101ba8d9868  EFLAGS: 00010292
> RAX: 0000000000000000 RBX: ffff8101ba8d9cb0 RCX: 0000000000000461
> RDX: ffff8101ffe27a98 RSI: ffffffff80309c28 RDI: ffffffff80309c20
> RBP: ffff8101860b1340 R08: ffffffff80309c28 R09: 000000000000003f
> R10: ffff8101ba8d9368 R11: 0000000000000000 R12: ffff8100e87ea590
> R13: ffff8100e87ea590 R14: ffff8100ed24e000 R15: 0000000000000000
> FS:  00002b18a78ac530(0000) GS:ffff810103901940(0000) knlGS:00000000acbfbb90
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002b70cf5cf000 CR3: 00000001b4d4a000 CR4: 00000000000006e0
> Process imap (pid: 12810, threadinfo ffff8101ba8d8000, task 
> ffff8101ffe277e0)
> Stack:  ffff8101860b1340 0000000000000001 ffff8100b3e1b000 ffff8100b3e1a0e8
>   0000000000000000 ffffffff8862a74e 0000000000000038 ffff810184e88368
>   0000000000000001 ffffffff800caa0b 0000000000000005 ffff810184e88368
> Call Trace:
>   [<ffffffff8862a74e>] :gfs2:gfs2_glock_nq_m+0x2d/0xf4
>   [<ffffffff800caa0b>] __kzalloc+0x9/0x21
>   [<ffffffff88622831>] :gfs2:do_strip+0x175/0x349
>   [<ffffffff886217e2>] :gfs2:recursive_scan+0xf2/0x175
>   [<ffffffff886218fe>] :gfs2:trunc_dealloc+0x99/0xe7
>   [<ffffffff886226bc>] :gfs2:do_strip+0x0/0x349
>   [<ffffffff80090000>] sched_exit+0xb4/0xb5
>   [<ffffffff88638dda>] :gfs2:gfs2_delete_inode+0xdd/0x191
>   [<ffffffff88638d43>] :gfs2:gfs2_delete_inode+0x46/0x191
>   [<ffffffff88628e77>] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a
>   [<ffffffff88638cfd>] :gfs2:gfs2_delete_inode+0x0/0x191
>   [<ffffffff8002f48f>] generic_delete_inode+0xc6/0x143
>   [<ffffffff8863d9a4>] :gfs2:gfs2_inplace_reserve_i+0x63b/0x691
>   [<ffffffff886248c4>] :gfs2:gfs2_dirent_find_space+0x0/0x41
>   [<ffffffff88623983>] :gfs2:gfs2_dirent_search+0x147/0x16e
>   [<ffffffff886377c5>] :gfs2:gfs2_rename+0x3be/0x63b
>   [<ffffffff88637506>] :gfs2:gfs2_rename+0xff/0x63b
>   [<ffffffff8863754c>] :gfs2:gfs2_rename+0x145/0x63b
>   [<ffffffff88637571>] :gfs2:gfs2_rename+0x16a/0x63b
>   [<ffffffff886375a4>] :gfs2:gfs2_rename+0x19d/0x63b
>   [<ffffffff88629e29>] :gfs2:gfs2_holder_uninit+0xd/0x1f
>   [<ffffffff886385bf>] :gfs2:gfs2_permission+0xaf/0xd4
>   [<ffffffff88633124>] :gfs2:gfs2_drevalidate+0x158/0x214
>   [<ffffffff8000d902>] permission+0x81/0xc8
>   [<ffffffff8002a7d9>] vfs_rename+0x2f4/0x471
>   [<ffffffff80036c20>] sys_renameat+0x180/0x1eb
>   [<ffffffff800b66f5>] audit_syscall_entry+0x180/0x1b3
>   [<ffffffff8005d28d>] tracesys+0xd5/0xe0
> 
> 
> Code: 0f 0b 68 f8 27 64 88 c2 32 04 be 01 00 00 00 4c 89 ef e8 df
> RIP  [<ffffffff8862a6df>] :gfs2:gfs2_glock_nq+0x231/0x273
>   RSP <ffff8101ba8d9868>
> <0>Kernel panic - not syncing: Fatal exception
>   Killed by signal 15.
> 
> It seems possible that there would be some filesystem damage from 
> running the old code and I'm going to fsck this weekend, but wanted to 
> post this in case it revealed an obvious problem to anyone.  The 
> "invalid opcode: 0000" makes me think we ended up executing code that 
> was actually data, but beyond that I'm clueless.
> 
> Thanks,
> Allen
>