[Linux-cluster] DLM locks with 1 node on 2 node cluster

Zelikov_Mikhail at emc.com
Mon Aug 28 15:41:55 UTC 2006


I am using the latest cluster code from the RHEL4 branch. I have a 2-node
cluster: nodes A and B. Node A grabs a lock in exclusive (EX) mode while node
B waits for a membership change. I manually reset node A, at which point node
B gets the membership change notification and then tries to acquire the lock
in exclusive mode. At this point the operation blocks forever. Once node A is
back up, the DLM returns with the lock acquired for node B, as expected.
However, if I shut down node A cleanly instead of killing it, everything
works as expected: node B gets the notification and then successfully grabs
the lock without blocking.
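
For context, here is a minimal sketch of what node B is doing, using
libdlm's simple synchronous locking API (lock_resource/unlock_resource).
This is an illustrative stand-in, not the actual dlmtest source:

    /* Blocking EX lock acquisition, as on node B.
     * Link with -ldlm -lpthread. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libdlm.h>

    int main(void)
    {
            int lockid;

            /* Blocks until the EX lock on MY_RES is granted.  In the
             * failure case above, this call never returns after node A
             * is reset, even though node B saw the membership change. */
            if (lock_resource("MY_RES", LKM_EXMODE, 0, &lockid) != 0) {
                    perror("lock_resource");
                    exit(1);
            }
            printf("acquired EX lock on MY_RES (lockid = %x)\n",
                   (unsigned int)lockid);

            unlock_resource(lockid);
            return 0;
    }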
 
This can easily be reproduced with dlmtest: grab the lock on one machine in
EX mode (bof226), block on another machine for the same lock (bof227), then
kill the first machine and observe that the lock is never acquired on the
second:
 
1) *** GRAB LOCK MY_RES (bof226)
[root at bof226 usertest]# ./dlmtest -Q -m EX MY_RES -d 10000
locking MY_RES EX ...done (lkid = 1015e)
lockinfo: status     = 0
lockinfo: resource   = 'MY_RES'
lockinfo: grantcount = 1
lockinfo: convcount  = 0
lockinfo: waitcount  = 0
lockinfo: masternode = 1
lockinfo: lock: lkid        = 1015e
lockinfo: lock: master lkid = 0
lockinfo: lock: parent lkid = 0
lockinfo: lock: node        = 1
lockinfo: lock: pid         = 3771
lockinfo: lock: state       = 2
lockinfo: lock: grmode      = 5
lockinfo: lock: rqmode      = 255
 
2) *** GRAB LOCK MY_RES (bof227)
[root at bof227 usertest]# ./dlmtest -Q -m EX -d 10000 MY_RES
locking MY_RES EX ...
 
3) *** KILL bof226
4) *** WAITING FOREVER 
5) *** BOOTING UP bof226 results in the lock being acquired:
lockinfo: status     = 0
lockinfo: resource   = 'MY_RES'
lockinfo: grantcount = 1
lockinfo: convcount  = 0
lockinfo: waitcount  = 0
lockinfo: masternode = 2
lockinfo: lock: lkid        = 10312
lockinfo: lock: master lkid = 103eb
lockinfo: lock: parent lkid = 0
lockinfo: lock: node        = 2
lockinfo: lock: pid         = 4136
lockinfo: lock: state       = 2
lockinfo: lock: grmode      = 5
lockinfo: lock: rqmode      = 255

Has anybody else seen this? I was wondering whether this is a bug, whether
there is something special about 2-node clusters, or whether I misunderstand
how it is supposed to work.
    Mike



