[Linux-cluster] mount hang in kcl_join_service

Wed Feb 16 23:39:37 UTC 2005

I have not been able to get my tests to run for more than
1 day on any of the last several tries.  This time my test hung
during mount in kcl_join_service().  My test does a mount and umount
several times for each test run.  This time it hung on the
22nd test run.  It looks like it was starting a 3-node test
where a gfs file system is mounted on all 3 nodes and then
does a umount/mount 1 node at a time.  So this should have
done a umount on cl031 and then hung on a mount on cl031
with cl030 and cl032 still having the gfs file system mounted.
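
For reference, the sequence of the 3-node test can be sketched roughly
as below.  This is only an illustration, not the actual test script:
the device path and mount point are made-up placeholders, and the
order in which nodes are cycled (cl030 first, then cl031) is an
assumption.  It only builds the list of commands; it does not run them.

```python
# Sketch of the 3-node umount/mount cycle described above.
# DEVICE and MOUNTPOINT are hypothetical placeholders, and the node
# order is an assumption; neither comes from the real test script.
NODES = ["cl030", "cl031", "cl032"]
DEVICE = "/dev/example_vg/gfs_lv"   # placeholder device
MOUNTPOINT = "/mnt/gfs"             # placeholder mount point


def build_schedule(nodes, device, mountpoint):
    """Return (node, action, command) tuples for one 3-node test run."""
    mount_cmd = f"mount -t gfs {device} {mountpoint}"
    umount_cmd = f"umount {mountpoint}"
    steps = []
    # Phase 1: mount the file system on every node.
    for node in nodes:
        steps.append((node, "mount", mount_cmd))
    # Phase 2: umount and remount one node at a time.
    for node in nodes:
        steps.append((node, "umount", umount_cmd))
        steps.append((node, "mount", mount_cmd))
    return steps


def mounted_before(steps, index):
    """Return the set of nodes that hold the mount just before steps[index]."""
    mounted = set()
    for node, action, _cmd in steps[:index]:
        if action == "mount":
            mounted.add(node)
        else:
            mounted.discard(node)
    return mounted


if __name__ == "__main__":
    schedule = build_schedule(NODES, DEVICE, MOUNTPOINT)
    for i, (node, action, cmd) in enumerate(schedule):
        print(i, node, cmd, sorted(mounted_before(schedule, i)))
```

Under these assumptions the remount on cl031 is the step where cl030
and cl032 still have the file system mounted, which matches the state
at the time of the hang.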

The mount stack trace is:
mount         D C170EF9C     0  1557      1         12111  3932 (NOTLB)
f09b9c30 00000086 f1f4c580 c170ef9c 00002ca3 c2c39ae0 00000008 00000000
       e947e548 7b01b78b 00002ca3 f09b9c10 f1f4c580 00000000 c170f8c0 c170ef60
       00000000 00003ba8 7b01b9ea 00002ca3 c2c39ae0 c2c39c4c 00000000 00002ca3
Call Trace:
 [<c03ce814>] wait_for_completion+0xa4/0xe0
 [<f8ab6164>] kcl_join_service+0x154/0x180 [cman]
 [<f8890fff>] init_mountgroup+0x6f/0xc0 [lock_dlm]
 [<f88934b1>] lm_dlm_mount+0xa1/0xf0 [lock_dlm]
 [<f8812300>] lm_mount+0x140/0x230 [lock_harness]
 [<f9017f4d>] gfs_lm_mount+0x1fd/0x390 [gfs]
 [<f9024276>] fill_super+0x596/0x14c0 [gfs]
 [<f902533f>] gfs_get_sb+0x15f/0x1b0 [gfs]
 [<c0166ae8>] do_kern_mount+0x58/0xe0
 [<c017ce08>] do_new_mount+0x98/0xe0
 [<c017d4b5>] do_mount+0x165/0x1b0
 [<c017d8c7>] sys_mount+0x97/0x100
 [<c010323d>] sysenter_past_esp+0x52/0x75
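
To make the path through the modules easier to see, a small script
along these lines can split each frame of the trace into address,
symbol, offset and module.  It is a rough sketch that assumes the
2.6-style "[<addr>] symbol+0xoff/0xsize [module]" frame format shown
above; frames without a module tag are labelled "kernel" here, which
is just a naming choice for this example.

```python
import re

# Call trace copied from the mount process above.
TRACE = """\
 [<c03ce814>] wait_for_completion+0xa4/0xe0
 [<f8ab6164>] kcl_join_service+0x154/0x180 [cman]
 [<f8890fff>] init_mountgroup+0x6f/0xc0 [lock_dlm]
 [<f88934b1>] lm_dlm_mount+0xa1/0xf0 [lock_dlm]
 [<f8812300>] lm_mount+0x140/0x230 [lock_harness]
 [<f9017f4d>] gfs_lm_mount+0x1fd/0x390 [gfs]
 [<f9024276>] fill_super+0x596/0x14c0 [gfs]
 [<f902533f>] gfs_get_sb+0x15f/0x1b0 [gfs]
 [<c0166ae8>] do_kern_mount+0x58/0xe0
 [<c017ce08>] do_new_mount+0x98/0xe0
 [<c017d4b5>] do_mount+0x165/0x1b0
 [<c017d8c7>] sys_mount+0x97/0x100
 [<c010323d>] sysenter_past_esp+0x52/0x75
"""

# One frame: [<address>] symbol+0xoffset/0xsize, optionally followed by [module].
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?"
)


def parse_frames(text):
    """Parse call-trace lines into dicts, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if not match:
            continue  # skip anything that is not a frame line
        addr, symbol, offset, size, module = match.groups()
        frames.append({
            "addr": int(addr, 16),
            "symbol": symbol,
            "offset": int(offset, 16),
            "size": int(size, 16),
            "module": module or "kernel",  # no tag: core kernel symbol
        })
    return frames


def module_chain(frames):
    """Collapse consecutive frames from the same module into one entry."""
    chain = []
    for frame in frames:
        if not chain or chain[-1] != frame["module"]:
            chain.append(frame["module"])
    return chain


if __name__ == "__main__":
    frames = parse_frames(TRACE)
    for frame in frames:
        print(f'{frame["module"]:<13}{frame["symbol"]}+{frame["offset"]:#x}')
    print(" <- ".join(module_chain(frames)))
```

Read innermost first, the mount went from gfs through lock_harness and
lock_dlm into cman, and is sleeping in wait_for_completion() called
from kcl_join_service().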

A bunch of info is available here:
http://developer.osdl.org/daniel/GFS/test.11feb2005/

The bad news is that taking a stack trace to a serial console
causes nodes to be kicked out of the cluster, so some of the
info shows the nodes being kicked out.

Any ideas on how to figure this out?

Daniel