[Linux-cluster] 4U5 CCS/CMAN/fence quorum confusion

Robert Clark cluster at defuturo.co.uk
Fri Jun 8 11:51:19 UTC 2007


I've upgraded a test cluster to RHEL4U5, but I'm having some problems on
boot. ccsd and cman seem to disagree about whether the cluster is
quorate, and as a result fenced fails to start.

First, here is a normal boot sequence from another cluster running 4U4:

May 31 10:19:31 node04 ccsd[2503]: Starting ccsd 1.0.7: 
May 31 10:19:31 node04 ccsd[2503]:  Built: Aug 25 2006 15:00:06 
May 31 10:19:31 node04 ccsd[2503]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
May 31 10:19:31 node04 ccsd:  succeeded
May 31 10:19:31 node04 kernel: CMAN 2.6.9-45.15 (built Mar 27 2007 22:56:03) installed
May 31 10:19:31 node04 kernel: NET: Registered protocol family 30
May 31 10:19:31 node04 kernel: DLM 2.6.9-44.9 (built Mar 27 2007 23:00:18) installed
May 31 10:19:31 node04 ccsd[2503]: cluster.conf (cluster name = cluster1, version = 2) found. 
May 31 10:19:32 node04 kernel: CMAN: Waiting to join or form a Linux-cluster
May 31 10:19:32 node04 last message repeated 3 times
May 31 10:19:32 node04 ccsd[2503]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.1 
May 31 10:19:32 node04 ccsd[2503]: Initial status:: Inquorate 
May 31 10:19:34 node04 kernel: CMAN: sending membership request
May 31 10:19:34 node04 last message repeated 3 times
May 31 10:19:34 node04 last message repeated 3 times
May 31 10:19:34 node04 last message repeated 5 times
May 31 10:19:34 node04 last message repeated 5 times
May 31 10:19:35 node04 kernel: CMAN: got node node05
May 31 10:19:35 node04 kernel: CMAN: got node node06
May 31 10:19:35 node04 kernel: CMAN: got node node03
May 31 10:19:35 node04 kernel: CMAN: got node node01
May 31 10:19:35 node04 kernel: CMAN: got node node02
May 31 10:19:35 node04 ccsd[2503]: Cluster is quorate.  Allowing connections. 
May 31 10:19:35 node04 kernel: CMAN: quorum regained, resuming activity
May 31 10:19:35 node04 cman: startup succeeded
May 31 10:19:38 node04 defuturo: fenced succeeded


Now, some logs for the 4U5 cluster:

Jun  8 12:40:26 tamarillo ccsd[2448]: Starting ccsd 1.0.10: 
Jun  8 12:40:26 tamarillo ccsd[2448]:  Built: May 31 2007 15:48:09 
Jun  8 12:40:26 tamarillo ccsd[2448]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
Jun  8 12:40:26 tamarillo ccsd:  succeeded
Jun  8 12:40:26 tamarillo kernel: CMAN 2.6.9-50.2 (built May 31 2007 15:39:24) installed
Jun  8 12:40:26 tamarillo kernel: NET: Registered protocol family 30
Jun  8 12:40:26 tamarillo kernel: DLM 2.6.9-46.16 (built May 31 2007 15:45:51) installed
Jun  8 12:40:26 tamarillo ccsd[2448]: cluster.conf (cluster name = defuturo_test, version = 2) found. 
Jun  8 12:40:27 tamarillo kernel: CMAN: Waiting to join or form a Linux-cluster
Jun  8 12:40:27 tamarillo kernel: CMAN: sending membership request
Jun  8 12:40:28 tamarillo kernel: CMAN: got node guava
Jun  8 12:40:28 tamarillo kernel: CMAN: got node kiwano
Jun  8 12:40:28 tamarillo kernel: CMAN: quorum regained, resuming activity
Jun  8 12:40:28 tamarillo cman: startup succeeded
Jun  8 12:40:30 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:30 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:31 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:31 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:32 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:32 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:33 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:33 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:34 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:34 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:35 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:35 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:36 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:36 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:37 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:37 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:38 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:38 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:39 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:39 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:40 tamarillo defuturo: fenced failed
Jun  8 12:40:40 tamarillo ccsd[2448]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4 
Jun  8 12:40:40 tamarillo ccsd[2448]: Initial status:: Quorate 
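
Reading the timestamps in the two logs: on 4U4, ccsd attaches to the
CMAN plugin three seconds before cman regains quorum; on 4U5 it attaches
twelve seconds after, by which point fenced has already failed. To check
this ordering across several boots, something like the following could
be used. It is only a sketch: quorum_gap is a made-up name, the patterns
are taken from the log lines quoted above, and it assumes the input
covers a single boot that does not span midnight.

```shell
# quorum_gap: read syslog lines on stdin and print the number of seconds
# between cman's "quorum regained" message and ccsd's "Connected to
# cluster ..." message. A positive result means ccsd attached to CMAN
# only after cman had already declared quorum.
quorum_gap() {
    awk '
        # Convert an HH:MM:SS timestamp to seconds since midnight.
        function secs(t,   p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        # Field 3 of a standard syslog line is the time of day.
        /CMAN: quorum regained/                { cman = secs($3) }
        /ccsd\[[0-9]+\]: Connected to cluster/ { ccsd = secs($3) }
        END {
            if (cman == "" || ccsd == "") {
                print "missing event"
                exit 1
            }
            print ccsd - cman
        }
    '
}
```

For example, "quorum_gap < boot.log", where boot.log holds the syslog
lines from one boot.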


This error happens on most boots, but not all, so I suspect a race
condition. By the time I can log into the node, it's quorate and I can
start up fenced manually. I've put in some debugging and verified
that /proc/cluster/status lists the cluster as quorate immediately
before and after attempting to start fenced.

There are a couple of things of note about our set-up:

1) We're not using fence from 4U5 because of bz241217, so
fence-1.32.25-1 is installed on both clusters.

2) fenced is being started in a chroot jail by our own script which
runs:

    /usr/sbin/chroot /mnt/fenced /sbin/fence_tool -t 0 join -w

The output from that command is:

fence_tool: cannot connect to ccs -111

fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111


3) /etc/sysconfig/cluster sets CMAN_CLUSTER_TIMEOUT=0 and
CMAN_QUORUM_TIMEOUT=86400.
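
As a stopgap for the start-up script in (2), the join could be retried
until ccsd starts accepting connections. The helper below is a sketch
only: retry is a made-up name, the attempt count and delay are
arbitrary, and it assumes fence_tool exits non-zero when it cannot reach
ccs. It would hide the race rather than explain it.

```shell
# retry MAX DELAY CMD...: run CMD until it exits 0, trying at most MAX
# times and sleeping DELAY seconds between attempts. Returns 0 on the
# first success, 1 if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Hypothetical use with the command from our script:
# retry 6 5 /usr/sbin/chroot /mnt/fenced /sbin/fence_tool -t 0 join -w
```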

Does anyone know what might cause ccsd to keep refusing connections for
lack of quorum after cman has already decided the cluster is quorate?

	Robert



