
[Linux-cluster] lock_gulmd hanging on startup (STABLE, as of 24th running on Debian/Sarge)



Hi

I'm trying to get GFS over GNBD running on Debian/Sarge.
ccsd is running fine (using either IPv4 or IPv6), but lock_gulmd
hangs when it is started. I have enabled IPv6 in my kernel, but didn't
configure any IPv6 addresses. There are, however, link-local IPv6 addresses configured for each interface (Linux seems to add them automatically). I'm running lock_gulmd with the following options:
"-n cluster-ws-sx --use_ccs --name master.ws-sx.cluster.solution-x.com -v ReallyAll".


Any tips & ideas? Any debugging I could do to track this down?

This is what it logs to syslog:
Jun 27 05:15:22 elrond ccsd[795]: Starting ccsd DEVEL.1119711496:
Jun 27 05:15:22 elrond ccsd[795]: Built: Jun 25 2005 16:59:43
Jun 27 05:15:22 elrond ccsd[795]: Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Jun 27 05:15:22 elrond ccsd[795]: IP Protocol:: IPv6 only Multicast (default):: SET
Jun 27 05:15:28 elrond ccsd[795]: cluster.conf (cluster name = cluster-ws-sx, version = 1) found.
Jun 27 05:15:32 elrond lock_gulmd_main[814]: Forked lock_gulmd_core.
Jun 27 05:15:32 elrond lock_gulmd_core[826]: Starting lock_gulmd_core DEVEL.1119711496. (built Jun 25 2005 17:00:28) Copyright (C) 2004 Red Hat, Inc. All rights reserved.
Jun 27 05:15:32 elrond lock_gulmd_core[826]: I am running in Standard mode.
Jun 27 05:15:32 elrond lock_gulmd_core[826]: I am (master.ws-sx.cluster.solution-x.com) with ip (::ffff:10.100.20.1)
Jun 27 05:15:32 elrond lock_gulmd_core[826]: This is cluster cluster-ws-sx
Jun 27 05:15:32 elrond lock_gulmd_core[826]: In state: Pending
Jun 27 05:15:32 elrond lock_gulmd_core[826]: In state: Master
Jun 27 05:15:32 elrond lock_gulmd_core[826]: I see no Masters, So I am becoming the Master.
Jun 27 05:15:32 elrond lock_gulmd_core[826]: Sending Quorum update to slave master.ws-sx.cluster.solution-x.com
Jun 27 05:15:32 elrond lock_gulmd_core[826]: Could not send quorum update to slave master.ws-sx.cluster.solution-x.com
Jun 27 05:15:32 elrond lock_gulmd_core[826]: New generation of server state. (1119842132653336)
Jun 27 05:15:32 elrond lock_gulmd_core[826]: Got heartbeat from master.ws-sx.cluster.solution-x.com at 1119842132653434 (last:1119842132653434 max:0 avg:0)
Jun 27 05:15:33 elrond lock_gulmd_main[814]: Forked lock_gulmd_LT.
Jun 27 05:15:33 elrond lock_gulmd_LT[828]: Starting lock_gulmd_LT DEVEL.1119711496. (built Jun 25 2005 17:00:28) Copyright (C) 2004 Red Hat, Inc. All rights reserved.
Jun 27 05:15:33 elrond lock_gulmd_LT[828]: I am running in Standard mode.
Jun 27 05:15:33 elrond lock_gulmd_LT[828]: I am (master.ws-sx.cluster.solution-x.com) with ip (::ffff:10.100.20.1)
Jun 27 05:15:33 elrond lock_gulmd_LT[828]: This is cluster cluster-ws-sx
Jun 27 05:15:33 elrond lock_gulmd_LT000[828]: Locktable 0 started.
Jun 27 05:15:34 elrond lock_gulmd_main[814]: Forked lock_gulmd_LTPX.
Jun 27 05:15:34 elrond lock_gulmd_LTPX[831]: Starting lock_gulmd_LTPX DEVEL.1119711496. (built Jun 25 2005 17:00:28) Copyright (C) 2004 Red Hat, Inc. All rights reserved.
Jun 27 05:15:34 elrond lock_gulmd_LTPX[831]: I am running in Standard mode.
Jun 27 05:15:34 elrond lock_gulmd_LTPX[831]: I am (master.ws-sx.cluster.solution-x.com) with ip (::ffff:10.100.20.1)
Jun 27 05:15:34 elrond lock_gulmd_LTPX[831]: This is cluster cluster-ws-sx
Jun 27 05:15:34 elrond lock_gulmd_LTPX[831]: ltpx started.


ps auxwww | grep gulm gives:
root 826 0.0 0.1 2008 840 ? S<s 05:15 0:00 lock_gulmd_core --cluster_name cluster-ws-sx --servers ::ffff:10.100.20.1 --name master.ws-sx.cluster.solution-x.com --verbosity ReallyAll
root 828 0.0 0.1 2008 820 ? S<s 05:15 0:00 lock_gulmd_LT --cluster_name cluster-ws-sx --servers ::ffff:10.100.20.1 --name master.ws-sx.cluster.solution-x.com --verbosity ReallyAll
root 831 0.0 0.1 2008 820 ? S<s 05:15 0:00 lock_gulmd_LTPX --cluster_name cluster-ws-sx --servers ::ffff:10.100.20.1 --name master.ws-sx.cluster.solution-x.com --verbosity ReallyAll


And finally, strace shows all three PIDs stuck in a recv() call on fd 6.
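
To figure out what fd 6 actually is, I can do something like this (826 is the lock_gulmd_core PID from the ps output above, same idea for 828 and 831; these are just generic diagnostics, not gulm tooling):

# show what file descriptor 6 points to (it should be a socket inode)
ls -l /proc/826/fd/6
# map lock_gulmd's sockets to addresses, ports and states
netstat -anp | grep lock_gulmd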

Here is my cluster.conf:
<cluster name="cluster-ws-sx" config_version="1">
<gulm>
<lockserver name="master.ws-sx.cluster.solution-x.com"/>
</gulm>
<clusternodes>
<clusternode name="master.ws-sx.cluster.solution-x.com">
<method name="single">
<device name="gnbd" nodename="master.ws-sx.cluster.solution-x.com"/>
</method>
</clusternode>


<clusternode name="s1.ws-sx.cluster.solution-x.com">
<method name="single">
<device name="gnbd" nodename="s1.ws-sx.cluster.solution-x.com"/>
</method>
</clusternode>
</clusternodes>


<fencedevices>
<fencedevice name="gnbd" agent="fence_gnbd" servers="10.100.20.1"/>
</fencedevices>
</cluster>
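
As a basic sanity check I can at least verify the file is well-formed XML (assuming it lives in the usual /etc/cluster/cluster.conf location; xmllint is from libxml2-utils):

# prints nothing if the file is well-formed
xmllint --noout /etc/cluster/cluster.conf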


greetings, Florian Pflug

