[Linux-cluster] Node won't rejoin after reboot

Wed Sep 24 14:34:59 UTC 2008

Hello,
We are currently trying to diagnose a problem in our cluster setup. We
are seeing two problems, which appear to be related:
1) During failover, the surviving node reports "waiting for node to be
fenced" and no failover is done...
2) When the failed node rejoins the cluster, it is killed with the
message: "Killing node node2 because it has rejoined the cluster with existing state"

Both seem to be network related and specific to Cisco infrastructure
(65xx and 35xx): both disappear when we move to non-Cisco infrastructure.

Please let me emphasize that I AM aware of this document:
http://www.openais.org/doku.php?id=faq:cisco_switches
We have configured the Cisco switch according to it (although I
believe it applies only to multi-switch infrastructure, and our nodes
are both connected to a single switch).

We are also aware of:
http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008059a9df.shtml
But this is not our problem either; again, ours is a single-switch
scenario. We have tried to turn IGMP snooping off and our engineers
reported that it didn't work.

Also:
http://www.mail-archive.com/linux-cluster@redhat.com/msg03889.html
didn't help.

I have captured the traffic on the surviving node using tcpdump,
including all layer headers, and it seems that there is no IGMP Join
message from the second node. I suspect this may be the problem. Does
anybody know of any details I can check, or a fix for this bug?
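
For reference, here is a minimal sketch of how a captured packet could be
checked for an IGMP Join. It is only an illustration, not part of any
cluster tooling: the function name classify_igmp is hypothetical, and it
assumes the input is a raw IPv4 packet with the link-layer header already
stripped. The type codes are the standard ones from RFC 2236 (IGMPv1/v2)
and RFC 3376 (IGMPv3).

```python
import socket

# IGMP message types (RFC 2236 for v1/v2, RFC 3376 for v3).
IGMP_TYPES = {
    0x11: "membership query",
    0x12: "v1 membership report (join)",
    0x16: "v2 membership report (join)",
    0x17: "v2 leave group",
    0x22: "v3 membership report",
}


def classify_igmp(ip_packet):
    """Return (source_ip, type_name, group) for an IPv4 IGMP packet.

    Returns None if the bytes are not a well-formed IPv4 IGMP packet.
    `group` is None for v3 reports, whose layout differs from v1/v2.
    Assumption: `ip_packet` starts at the IPv4 header (no Ethernet header).
    """
    if len(ip_packet) < 20:
        return None
    if ip_packet[0] >> 4 != 4:          # not IPv4
        return None
    ihl = (ip_packet[0] & 0x0F) * 4     # header length; IGMP usually carries
                                        # the Router Alert option, so often 24
    if ip_packet[9] != 2:               # IP protocol number 2 = IGMP
        return None
    if ihl < 20 or len(ip_packet) < ihl + 8:
        return None

    src = socket.inet_ntoa(ip_packet[12:16])
    igmp_type = ip_packet[ihl]
    name = IGMP_TYPES.get(igmp_type, "unknown (0x%02x)" % igmp_type)

    group = None
    if igmp_type in (0x11, 0x12, 0x16, 0x17):
        # Queries and v1/v2 messages keep the group address in bytes 4..7.
        group = socket.inet_ntoa(ip_packet[ihl + 4:ihl + 8])
    return src, name, group
```

Fed with the packets from the capture, this would show whether any
membership report with the second node's source address is present at all.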

Thanks,
Jakub