[Linux-cluster] Severe problems with 64-bit RHCS on RHEL5.1

Lon Hohberger lhh at redhat.com
Thu Apr 17 18:42:33 UTC 2008


On Thu, 2008-04-17 at 09:17 +0100, Gordan Bobic wrote:
> Harri.Paivaniemi at tietoenator.com wrote:

> > 1. 2-node cluster. Can't start only one node to get cluster services up - it hangs in fencing and waits until I start the second node; immediately after that, when both nodes are starting cman, the cluster comes up. So if I have lost one node and have to restart the working node for some reason, I can't get the cluster up. It should work like before (both nodes are down, I start one, it fences the other and comes up). Now it just waits... log says:
> > 
> > ccsd[25272]: Error while processing connect: Connection refused
> > 
> > This is such a common error message that it tells me nothing....
> 
> I have seen similar error messages before, and they have usually been 
> caused either by the node names/interfaces/IPs not being listed 
> correctly in the /etc/hosts file, or by iptables firewall rules 
> blocking communication between the nodes.
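For illustration, what Gordan is describing looks roughly like this
(hypothetical names, addresses, and subnets; verify the port numbers
against the cluster documentation for your release):

   # /etc/hosts: each node name used in cluster.conf must resolve to
   # the interface carrying cluster traffic, not to 127.0.0.1
   10.0.0.1   node1.example.com node1
   10.0.0.2   node2.example.com node2

   # iptables: allow cluster traffic between the nodes; on RHEL5,
   # openais uses UDP 5404/5405 and ccsd uses ports 50006-50009
   iptables -A INPUT -s 10.0.0.0/24 -p udp --dport 5404:5405 -j ACCEPT
   iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 50006:50009 -j ACCEPT
   iptables -A INPUT -s 10.0.0.0/24 -p udp --dport 50007 -j ACCEPT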

It's probably also partly the cluster not being quorate.  ccsd is very
verbose, and it logs errors perhaps when it shouldn't...
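For what it's worth, a two-node cluster only becomes quorate with a
single node up when two-node mode is enabled in cluster.conf, with
something like the line below (this does not apply if a quorum disk is
supplying the extra vote):

   <cman two_node="1" expected_votes="1"/>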

> > 2. qdisk doesn't work. 2-node cluster. Start it (both nodes at the same time) to get it up. Works ok, qdisk works, heuristic works. Everything works. If I stop the cluster daemons on one node, that node can't join the cluster anymore without a complete reboot. It joins, the other node says ok, the node itself says ok, quorum is registered and the heuristic is up, but the node's quorum disk stays offline and the other node says this node is offline. If I reboot this machine, it joins the cluster ok.

> I believe it's supposed to work that way. When a node fails it needs to 
> be fully restarted before it is allowed back into the cluster. I'm sure 
> this has been mentioned on the list recently.

If you cleanly stop the cluster daemons, fencing shouldn't be needed
here.  If the node's not getting allowed into the cluster, there's some
reason for it.  A way to tell if a node's being rejected is:

   cman_tool status

If you see 'DisallowedNodes' (I think?), the current "quorate" partition
thinks that the other node needs to be fenced.  I don't remember the
cases that lead to this situation, though.
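A quick check from the shell (the exact label in cman_tool's output
may differ between versions, so treat this as a sketch):

   # look for disallowed/rejected nodes in the quorate partition's view
   cman_tool status | grep -i disallowed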

Anyway, clean stop of the cluster should never require fencing.


> > 3. Funny thing: heuristic ping didn't work at all in the beginning and support gave me a "ping-script" which made it work... so this describes quite well how experimental this cluster is nowadays...
> > 
> > I have to tell you it is a FACT that the basics are ok: fencing works ok in a normal situation, I don't have typos, configs are in sync, everything is ok, but these problems still exist.
> 
> I've been in similar situations before, but in the end it always turned 
> out to be me doing something silly (see above re: host files and 
> iptables as examples).

The need for the ping-script is definitely a bug.  It's because ping
uses signals to wake itself up, and qdiskd blocked those signals before
forking (and of course, ping doesn't unblock the signals itself).  It's
fixed in current 4.6.z/5.1.z errata (IIRC) and definitely in 5.2 beta.
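To illustrate the mechanism (a minimal C sketch of the bug described
above, not the actual qdiskd code): a blocked signal mask survives both
fork() and exec(), so a child like ping that sleeps waiting for a
signal never wakes up unless the mask is reset before exec.

   /* sketch: blocked signal masks are inherited across fork and exec */
   #include <signal.h>
   #include <sys/types.h>
   #include <sys/wait.h>
   #include <unistd.h>

   int main(void)
   {
       sigset_t all, empty;

       sigfillset(&all);
       sigprocmask(SIG_BLOCK, &all, NULL);  /* daemon blocks signals */

       pid_t pid = fork();
       if (pid == 0) {
           /* the fix: clear the inherited mask before exec, so the
            * child (ping) can receive its wakeup signal again;
            * without this, the exec'd ping hangs */
           sigemptyset(&empty);
           sigprocmask(SIG_SETMASK, &empty, NULL);
           execlp("ping", "ping", "-c", "3", "127.0.0.1", (char *)NULL);
           _exit(127);
       }
       waitpid(pid, NULL, 0);
       return 0;
   }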


> > I have twice sent sosreports etc. to RH support. They have spent 3 weeks and still can't say what's wrong...

> Sadly, that seems to be the quality of commercial support from any 
> vendor. Support nowadays seems to have only one purpose - a managerial 
> back-covering exercise so they can pass the buck.

It's unfortunate that this is the perception.

-- Lon



