[Linux-cluster] cluster latest cvs does not fence dead nodes automatically

David Teigland teigland at redhat.com
Tue Feb 15 07:13:59 UTC 2005


On Tue, Feb 15, 2005 at 01:52:21PM +0700, Fajar A. Nugraha wrote:
> Hi,
> 
> I'm building a two-node cluster using today's cvs from 
> sources.redhat.com:/cvs/cluster.
> Shared storage is located on an FC shared disk.
> All works as expected up to the point of using gfs.
> 
> When I simulated a node crash (I ran "ifconfig eth0 down" on node 2),
> node 1 simply logged this to syslog:
> 
> Feb 15 13:33:35 hosting-cl02-01 CMAN: removing node hosting-cl02-02 from 
> the cluster : Missed too many heartbeats
> 
> However, NO fencing occurred. Not even a "fence failed" message. I use 
> fence_ibmblade.
> After that, access to the gfs device was blocked (df -k still works, though),
> and /proc/cluster/nodes shows
> 
> Node  Votes Exp Sts  Name
>   1    1    1   M   node-01
>   2    1    1   X   node-02

It looks like the node names are wrong.  There were some recent changes to
how we deal with node names, but I don't see how this setup could ever
have worked, even with previous code.

The node names you put in cluster.conf must match the name on the network
interface you want cman to use.  It looks like the machine's name is
"hosting-cl02-02".  Is that the hostname on the network interface you want
cman to use?  If so, then that's the name you should enter in
cluster.conf, and that's the name that should appear when you run
"cman_tool status" and "cman_tool nodes".


> there's an "X" on node 2, but /proc/cluster/service shows
> 
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           1   2 run       -
> [1 2]
> 
> DLM Lock Space:  "clvmd"                             2   3 run       -
> [1 2]
> 
> DLM Lock Space:  "data"                              3   4 run       -
> [1 2]
> 
> DLM Lock Space:  "config"                            5   6 run       -
> [1 2]
> 
> GFS Mount Group: "data"                              4   5 run       -
> [1 2]
> 
> GFS Mount Group: "config"                            6   7 run       -
> [1 2]
> 
> which is the same content as before node 2 died.
> AFAIK, the state should be "recover" or "waiting to recover" instead of "run".
> 
> If I reboot node 2 (which is the same thing as executing
> fence_ibmblade manually) and restart cluster services on that node, all 
> is back to normal,
> and these messages show on syslog :
> 
> Feb 15 13:38:40 node-01 CMAN: node node-02 rejoining
> Feb 15 13:38:40 node-01 fenced[25486]: node-02 not a cluster member 
> after 0 sec post_fail_delay

Above, the names were "hosting-cl02-01" and "hosting-cl02-02".  Could you
clear that up, and if there are still problems, send your cluster.conf file?
Thanks
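
(One quick sanity check, just as a suggestion: on each node, compare the
hostname with what cman thinks the node is called and with what is in the
config, e.g.

  uname -n                                       # hostname the node reports
  cman_tool nodes                                # names cman actually joined with
  grep clusternode /etc/cluster/cluster.conf     # names in the config

All three should agree for the interface cman is supposed to use.)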

-- 
Dave Teigland  <teigland at redhat.com>



