[Linux-cluster] GFS2 becomes non-responsive, no fencing

Ross Vandegrift ross at kallisti.us
Wed Aug 27 15:13:45 UTC 2008


On Wed, Aug 27, 2008 at 09:30:52AM +0100, Steven Whitehouse wrote:
> There are a few things to check. Firstly compare /proc/slabinfo on a
> slow node with that on a node running at normal speed. That will tell
> you if there is a problem with memory leaking or not being reclaimed
> properly.

Nothing interesting at a quick glance over slabinfo.  The active node
has much more consumed memory than the inactive one, which would seem
normal since the busted node hasn't done any work in hours.

Anything large is expected - icache, dcache, buffer_head.  The active
node has more objects then the busted one.

> If the node seems stuck, then try and echo t >/proc/sysrq-trigger and
> look at the backtraces of any process which has called into gfs2 to see
> where they are waiting. Also a dump of the glocks (you'll need to have
> debugfs mounted) on all nodes should then allow you to work out whether
> something on the stuck nodes is waiting for something on one of the
> other nodes. Sometimes its useful to look at the DLM locks as well.

Okay - processes that are hung due to I/O on GFS2 filesystem all have
a similar call stack:

 =======================
ls            D C981824A  2928 11112  11000 (NOTLB)
       e3065e48 00200086 f8cc4b9d c981824a 00008773 f5126a80 00000008 f5876550 
       c20e7550 c990b3d9 00008773 000f318f 00000001 f587665c c20049e0 00000044 
       f8c4b10b f5327ac0 f8c4b83e ffffffff 00000000 00000000 e3065e74 00000000 
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8c4b10b>] gdlm_ast+0x0/0x2 [lock_dlm]
 [<f8c4b83e>] gdlm_bast+0x0/0x76 [lock_dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d24177>] gfs2_glock_nq_atime+0x164/0x2de [gfs2]
 [<f8d2b7dd>] gfs2_readdir+0x47/0x8b [gfs2]
 [<c047f754>] filldir64+0x0/0xc5
 [<f8d2416f>] gfs2_glock_nq_atime+0x15c/0x2de [gfs2]
 [<c047f935>] vfs_readdir+0x63/0x8d
 [<c047f754>] filldir64+0x0/0xc5
 [<c047f9c2>] sys_getdents64+0x63/0xa5
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
python        D 2A0CCE0D  1676 10551  10175 (NOTLB)
       f468fd7c 00000082 00000096 2a0cce0d 00000828 00000001 00000009 f5023550 
       c20e7550 2a0ce9ae 00000828 00001ba1 00000001 f502365c c20049e0 f40af2c4 
       f8cc4b9d 00000000 f40af2c0 ffffffff 00000000 00000000 f468fda8 00000000 
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d2e911>] gfs2_permission+0x69/0xb4 [gfs2]
 [<f8d2e90a>] gfs2_permission+0x62/0xb4 [gfs2]
 [<f8d2e8a8>] gfs2_permission+0x0/0xb4 [gfs2]
 [<c047b557>] permission+0x78/0xb5
 [<c047c9c0>] __link_path_walk+0x141/0xd33
 [<f8d23322>] gfs2_glock_dq+0x9e/0xb2 [gfs2]
 [<c048d67a>] __mark_inode_dirty+0x13d/0x14f
 [<c047d5fb>] link_path_walk+0x49/0xbd
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047d9c8>] do_path_lookup+0x20e/0x25e
 [<c047ded5>] sys_mkdirat+0x36/0xb6
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047df64>] sys_mkdir+0xf/0x13
 [<c0404eff>] syscall_call+0x7/0xb
 =======================


I've got dumps of the glocks from debugfs, but I'm not really familiar
enough to GFS to understand what I'm reading.  I tried to file a bug
in RH Bugzilla, but am getting 503 errors.  I've posted the glock
dumps here:

http://kallisti.us/~ross/working-glocks
http://kallisti.us/~ross/broken-glocks

Can you point me in the direction of a document that explains what
the various things in the output mean?

-- 
Ross Vandegrift
ross at kallisti.us

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37




More information about the Linux-cluster mailing list