[Linux-cluster] RHEL 4.7 fenced fails -- stuck join state: S-2,2,1

Robert Hurst rhurst at bidmc.harvard.edu
Tue Aug 11 14:55:48 UTC 2009


This is a simple 4-node cluster. Two of the nodes have had a shared GFS
home directory mounted for over a month.  Today, I wanted to mount /home
on a third node, so:

# service fenced start                [failed]

Weird.  Checking /var/log/messages shows:

Aug 11 10:19:06 cerberus kernel: Lock_Harness 2.6.9-80.9.el4_7.10 (built Jan 22 2009 18:39:16) installed
Aug 11 10:19:06 cerberus kernel: GFS 2.6.9-80.9.el4_7.10 (built Jan 22 2009 18:39:32) installed
Aug 11 10:19:06 cerberus kernel: GFS: Trying to join cluster "lock_dlm", "ccc_cluster47:home"
Aug 11 10:19:06 cerberus kernel: Lock_DLM (built Jan 22 2009 18:39:18) installed
Aug 11 10:19:06 cerberus kernel: lock_dlm: fence domain not found; check fenced
Aug 11 10:19:06 cerberus kernel: GFS: can't mount proto = lock_dlm, table = ccc_cluster47:home, hostdata =

# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-2,2,1
[]
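
For anyone wanting to check this from a script, here is a minimal sketch
that pulls the State and Code columns of the fence domain out of
`cman_tool services` output. It assumes the column layout shown above
(service label, quoted name, GID, LID, State, Code); the function name
fence_state is hypothetical, not part of any cluster tool.

```shell
#!/bin/sh
# fence_state (hypothetical helper): read `cman_tool services` output on
# stdin and print the State and Code columns of the fence domain line.
# Assumes the layout: Fence Domain: "name" GID LID State Code
fence_state() {
    awk '/^Fence Domain:/ { print $6, $7 }'
}

# Sample run against the output captured on the stuck node.
fence_state <<'EOF'
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-2,2,1
[]
EOF
# prints: join S-2,2,1
```

On a live node this would be run as `cman_tool services | fence_state`.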

So, a fenced process is now hung:

root     28302  0.0  0.0  3668  192 ?        Ss   10:19   0:00 fenced -t 120 -w

Q: Any idea how to "recover" from this state, without rebooting?

The other two servers are unaffected by this (thankfully) and show
normal operations:

$ cman_tool services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 12]

DLM Lock Space:  "home"                              5   5 run       -
[1 12]

GFS Mount Group: "home"                              6   6 run       -
[1 12]
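
A rough way to compare nodes is to flag every service whose State is not
"run". The sketch below assumes each service line carries a quoted name
without spaces, followed by GID, LID and State, as in the output above;
the function name not_running is hypothetical.

```shell
#!/bin/sh
# not_running (hypothetical helper): read `cman_tool services` output on
# stdin and print the name and state of every service not in "run" state.
# Assumes: <label>: "name" GID LID State ..., with no spaces in the name.
not_running() {
    awk '/^[A-Za-z ]+: +"/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^".*"$/) {          # the quoted service name
                if ($(i + 3) != "run")     # State is three fields later
                    print $i, $(i + 3)
                break
            }
        }
    }'
}
```

Fed the output from a healthy node it prints nothing; fed the output
from the stuck node it prints `"default" join`.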
