[Linux-cluster] GFS strange behavior and mount hang on 2.6.9 - 3 nodes

Daniel McNeil daniel at osdl.org
Fri Nov 5 22:33:57 UTC 2004


I've been testing a 3-node GFS file system on shared fibre channel
storage and have run into a couple of strange things.  The 3 nodes
are cl030, cl031, and cl032.

1. After running tar tests on the 3 nodes for about a day,
   I wanted to try out the patch that gets rid of the might_sleep()
   warning.  I unmounted the GFS file system on cl031 and then
   tried to rmmod the lock_dlm module, but couldn't because of
   the use count on the module:

   # umount /gfs_stripe5
dlm: connecting to 2
dlm: closing connection to node 1
Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():0
 [<c01062ae>] dump_stack+0x1e/0x30
 [<c011ce47>] __might_sleep+0xb7/0xf0
 [<f8ad2a85>] nodeid2con+0x25/0x1e0 [dlm]
 [<f8ad4102>] lowcomms_close+0x42/0x70 [dlm]
 [<f8ad59cc>] put_node+0x2c/0x70 [dlm]
 [<f8ad5b97>] release_csb+0x17/0x30 [dlm]
 [<f8ad60d3>] nodes_clear+0x33/0x40 [dlm]
 [<f8ad60f7>] ls_nodes_clear+0x17/0x30 [dlm]
 [<f8ad25fd>] release_lockspace+0x1fd/0x2f0 [dlm]
 [<f8a9ff5c>] release_gdlm+0x1c/0x30 [lock_dlm]
 [<f8aa0214>] lm_dlm_unmount+0x24/0x50 [lock_dlm]
 [<f881e496>] lm_unmount+0x46/0xac [lock_harness]
 [<f8ba189f>] gfs_put_super+0x30f/0x3c0 [gfs]
 [<c01654fa>] generic_shutdown_super+0x18a/0x1a0
dlm: connecting to 1
 [<c016608d>] kill_block_super+0x1d/0x40
 [<c01652a1>] deactivate_super+0x81/0xa0
 [<c017c6cc>] sys_umount+0x3c/0xa0

dlm: closing connection to node 2
dlm: closing connection to node 3
dlm: got connection from 2
dlm: got connection from 1
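
For reference, the warning above is the might_sleep() check in down_read()
(include/linux/rwsem.h) firing because a sleeping lock is taken while
in_atomic() is true.  A minimal made-up sketch of that pattern -- generic
names only, not the actual dlm code -- looks like this:

#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/rwsem.h>

static spinlock_t demo_lock = SPIN_LOCK_UNLOCKED;   /* hypothetical */
static DECLARE_RWSEM(demo_sem);                     /* hypothetical */

static int __init demo_init(void)
{
        spin_lock(&demo_lock);   /* in_atomic() is now 1                   */
        down_read(&demo_sem);    /* calls might_sleep() -> prints warning  */
        up_read(&demo_sem);
        spin_unlock(&demo_lock);
        return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
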
# lsmod
Module                  Size  Used by
lock_dlm               39408  2
dlm                   128008  1 lock_dlm
gfs                   296780  0
lock_harness            3868  2 lock_dlm,gfs
qla2200                86432  0
qla2xxx               112064  1 qla2200
cman                  128480  8 lock_dlm,dlm
dm_mod                 53536  0

# rmmod lock_dlm
ERROR: Module lock_dlm is in use

---->
At this point, the lock_dlm module would not unload because 
it still had a use count of 2.
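
The use count rmmod complains about is just the module reference count that
callers bump and drop with try_module_get()/module_put(); if some teardown
path skips a put, the count stays non-zero and rmmod refuses to unload.  A
generic sketch of the mechanism (not lock_dlm code):

#include <linux/module.h>
#include <linux/errno.h>

/* Generic illustration: pin the module that provides some ops structure
 * while it is in use, release it when done.  An unbalanced pin leaves
 * the count stuck, exactly like the lsmod output above. */
static int pin_provider(struct module *owner)
{
        if (!try_module_get(owner))     /* bumps the count lsmod shows */
                return -ENOENT;
        return 0;
}

static void unpin_provider(struct module *owner)
{
        module_put(owner);              /* must balance every successful get */
}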

The "got connection" messages after the umount look strange.
What do those messages mean?

2. After rebooting cl031, I got it to rejoin the cluster,
   but the mount hung when I tried to mount the GFS file system:

# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   cl030a
   2    1    3   M   cl032a
   3    1    3   M   cl031a

# mount -t gfs /dev/sdf1 /gfs_stripe5
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"
dlm: stripefs: recover event 2 (first)
dlm: stripefs: add nodes
dlm: connecting to 1

==> mount HUNG here

cl031 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 3]

DLM Lock Space:  "stripefs"                         18   3 join      S-6,20,3
[2 1 3]
================
cl032 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   4 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                         18  21 update    U-4,1,3
[1 2 3]

GFS Mount Group: "stripefs"                         19  22 run       -
[1 2]
================
cl030 proc]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                         18  23 update    U-4,1,3
[1 2 3]

GFS Mount Group: "stripefs"                         19  24 run       -
[1 2]

It looks like some problem joining the DLM Lock Space: cl031 is stuck
in the "join" state while the other two nodes sit in "update", and
cl031 (node 3) never appears in the GFS Mount Group.  I have stack
traces available from all 3 machines if that provides any useful info
(http://developer.osdl.org/daniel/gfs_hang/).
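
In case it helps with reproducing this, something like the following could
be run on each node to capture the /proc/cluster/services transitions
around the hang (plain user-space C, just a hypothetical helper that
re-reads the file once a second):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        char line[256];

        for (;;) {
                FILE *f = fopen("/proc/cluster/services", "r");

                if (!f) {
                        perror("/proc/cluster/services");
                        return 1;
                }
                printf("--- %ld ---\n", (long)time(NULL));
                while (fgets(line, sizeof(line), f))
                        fputs(line, stdout);
                fclose(f);
                fflush(stdout);
                sleep(1);
        }
        return 0;
}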

I reset cl031 and the other 2 nodes recovered ok:
dlm: stripefs: total nodes 3
dlm: stripefs: nodes_reconfig failed -1
dlm: stripefs: recover event 76 error -1
                                                                                
cl032:
CMAN: no HELLO from cl031a, removing from the cluster
dlm: stripefs: total nodes 3
dlm: stripefs: nodes_reconfig failed 1
dlm: stripefs: recover event 69 error

Anyone seen anything like this?

Daniel



