[Linux-cluster] Cluster Crashes

Lon Hohberger lhh at redhat.com
Thu Nov 16 22:38:59 UTC 2006


On Thu, 2006-11-16 at 15:58 -0600, isplist at logicore.net wrote:
> First of all, is there a way I can test to see if my Brocade switch is 
> actually doing any fencing or not? I get the sense it's doing nothing.

Yes.  Run 'fence_brocade' directly from the command line to cut off a
node (that is, turn off the switch port that node is attached to).  If
it works correctly, that node should no longer be able to write to
shared storage.
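
As a rough sketch of that test, something like the Python below could
build the agent command line and then check whether a write to shared
storage still gets through.  The flags (-a, -l, -p, -n, -o) are what I
would expect from the agent's usage text, so check its help output or
man page for your version first.  The address, login, password, port
number and function names are all made-up placeholders.

```python
import os
import tempfile


def build_fence_command(switch_ip, login, password, port, action="disable"):
    """Build an argv list for fence_brocade.

    The flag names are an assumption based on the agent's usage text;
    verify them against your installed version before running anything.
    """
    if action not in ("disable", "enable"):
        raise ValueError("action must be 'disable' or 'enable'")
    return ["fence_brocade", "-a", switch_ip, "-l", login,
            "-p", password, "-n", str(port), "-o", action]


def can_write(directory):
    """Return True if a small file can be created, written and fsync'ed.

    fsync matters here: without it the write may only land in the page
    cache and appear to succeed even though the storage is unreachable.
    """
    try:
        fd, path = tempfile.mkstemp(prefix="fence_test_", dir=directory)
    except OSError:
        return False
    try:
        os.write(fd, b"fence test\n")
        os.fsync(fd)
        return True
    except OSError:
        return False
    finally:
        # Best-effort cleanup; ignore errors from a dead device.
        for cleanup in (lambda: os.close(fd), lambda: os.unlink(path)):
            try:
                cleanup()
            except OSError:
                pass


if __name__ == "__main__":
    # Hypothetical values: substitute your own switch and port.
    print(" ".join(build_fence_command("192.0.2.10", "admin", "secret", 3)))
```

Run the printed command from any machine that can reach the switch,
then call can_write() on the fenced node against a directory on the
shared storage.  On a properly fenced node the write may hang rather
than fail cleanly, which tells you the same thing.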

If you can still write correctly, log in to the switch and see what's
going on - if the port is still active, it could be that the fencing
agent is out of date.

It's fairly easy to fix fencing agents.  If you think the agent is the
problem, run 'script foo.txt', then telnet to your switch, turn a port
off, check the status, turn it back on, check the status, disconnect,
and type 'exit'.  That leaves you with a file called 'foo.txt'
containing a complete log of your session.
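
The log that 'script' writes is raw terminal output, so it usually
contains carriage returns and escape sequences.  A small sketch like
this one (the function name is my own, and it only handles the common
cases) can make it easier to read before comparing it against what the
agent expects:

```python
import re

# Matches ANSI CSI escape sequences such as "\x1b[1m" or "\x1b[0m".
CSI_PATTERN = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def clean_typescript(text):
    """Strip CSI escape sequences and carriage returns from a script log."""
    without_escapes = CSI_PATTERN.sub("", text)
    return without_escapes.replace("\r", "")


if __name__ == "__main__":
    # 'foo.txt' is the file name used in the 'script foo.txt' step.
    with open("foo.txt", errors="replace") as log:
        print(clean_typescript(log.read()))
```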


> I think this because my cluster is terribly unstable. If I reboot a node, 
> that's fine, it works, the cluster stays up. However, if one of the nodes 
> crashes in any manner, it takes down everything, to the point of having to shut 
> down every machine and start them all again one at a time.

Yes, it sounds like fencing is not actually happening.  More worrying,
the agent might be returning a false positive: reporting that the node
was fenced when the port was never turned off.

-- Lon




