[Linux-cluster] Tiebreaker IP

JACOB_LIBERMAN at Dell.com
Fri Aug 26 18:24:39 UTC 2005


Rob,

Here's a summary of what I have observed with this configuration. You may
want to verify the accuracy of my observations on your own.

Starting with RHEL 3, RHCS verifies node membership via a network
heartbeat rather than (or in addition to) a disk timestamp. The network
heartbeat traffic moves over the same interface that is used to access
the network resources. This means that there is no dedicated heartbeat
interface like you would see in a Microsoft cluster.
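To illustrate the idea, here is a toy heartbeat sketch in Python. This
is NOT cman's actual wire protocol -- the port, interval, and timeout
are made-up assumptions. Each node just announces itself periodically
and times out its peer:

    # Toy heartbeat sketch -- NOT the actual RHCS/cman protocol.
    # Port, interval, and timeout values are made-up assumptions.
    import socket, time

    PEER = ("nodeB", 12345)        # hypothetical peer address/port
    INTERVAL, TIMEOUT = 2, 10      # seconds

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 12345))
    sock.settimeout(INTERVAL)

    last_seen = time.time()
    while True:
        sock.sendto(b"heartbeat", PEER)      # announce we are alive
        try:
            sock.recvfrom(64)                # did the peer announce?
            last_seen = time.time()
        except socket.timeout:
            pass
        if time.time() - last_seen > TIMEOUT:
            print("lost contact with peer -- invoke tiebreaker logic")
            break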

The tiebreaker IP is used to prevent a split-brain situation in a
cluster with an even number of nodes. Let's say you have 2 active cluster
nodes, nodeA and nodeB, and nodeA owns an NFS disk resource and
IP. Then let's say nodeA fails to receive a heartbeat from nodeB over its
primary interface. This could mean several things: nodeA's interface is
down, nodeB's interface is down, or their shared switch is down. So if
nodeA and nodeB stop communicating with each other, they will both try to
ping the tiebreaker IP, which is usually your default gateway IP. If
nodeA gets a response from the tiebreaker IP, it will continue to own
the resource. If it can't, it will assume its external interface is down
and fence/reboot itself. The same holds true for nodeB. Unlike RHEL 2.1,
which used STONITH, RHEL 3 cluster nodes reboot themselves. Therefore,
even if nodeB can reach the tiebreaker and CAN'T reach nodeA, it will not
take over the cluster resource until nodeA releases it. This prevents the
nodes from accessing the shared disk resource concurrently.
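In pseudocode, the decision each node makes when the heartbeat is lost
looks roughly like this. This is a sketch only -- the ping helper, the
gateway address, and the reboot call are placeholders, not the actual
clumanager code:

    # Sketch of the tiebreaker decision -- not the actual clumanager code.
    import os, subprocess

    TIEBREAKER = "192.168.0.1"    # placeholder; usually the default gateway

    def reachable(ip):
        # one ICMP echo, 2 second wait; exit status 0 means it answered
        return subprocess.call(["ping", "-c", "1", "-W", "2", ip]) == 0

    def on_lost_heartbeat(owns_resource):
        if not reachable(TIEBREAKER):
            # can't even see the gateway: assume our own interface is
            # down and fence/reboot ourselves
            os.system("reboot")    # placeholder for the self-fence path
            return "self-fence"
        if owns_resource:
            return "keep resource"    # our uplink works; keep serving
        # we can see the network, but the owner hasn't released the
        # resource -- wait rather than risk concurrent disk access
        return "wait for owner to release"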

This configuration prevents split brain by ensuring the resource owner
doesn't get killed accidentally by its peer. For those who remember,
ping-ponging was a big problem with RHEL 2.1 clusters. If a node couldn't
read its partner's disk timestamp update in a timely manner -- due to
I/O latency or whatever -- it would reboot the partner node. On
reboot, the rebooted node would STONITH the other node, and so on.

Anyway, I hope this answers your questions. It is fairly easy to test.
Set up a 2-node cluster, then reboot the service owner. If the service
starts on the other node, you should be configured correctly. Next,
disconnect the service owner from the network. The service owner should
reboot itself with the watchdog or fail over the resource, depending on
how it's configured. Repeat this test with the non-service owner. (The
resources should not move in this case.) Then take turns disconnecting
the nodes from the shared storage.
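If you want to watch the failover from a third machine while running
those tests, a trivial loop like this shows when the service moves or
goes dark. The address is a placeholder -- substitute your floating
service IP:

    # Watch a cluster service IP from a third machine during the tests.
    # The address is a placeholder for your floating service IP.
    import subprocess, time

    SERVICE_IP = "192.168.0.50"

    while True:
        up = subprocess.call(["ping", "-c", "1", "-W", "2", SERVICE_IP]) == 0
        print(time.strftime("%H:%M:%S"), "service is", "UP" if up else "DOWN")
        time.sleep(2)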

Cheers, Jacob

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Veer, Rob ter
> Sent: Thursday, August 25, 2005 4:45 AM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Tiebreaker IP
> 
> Hello, 
> 
> I'm trying to get a deeper insight into the workings of the 
> tiebreaker with a standard RHCS cluster configuration. It's 
> not clear to me what the role of the tiebreaker within a 2 or 
> 4 node cluster exactly is. 
> 
> Is there any documentation on this subject? I'm particularly 
> interested in the flow of events when the tiebreaker is used. 
> I know the tiebreaker is used to prevent node isolation, but 
> how exactly?
> 
> Regards,
> Rob. 
> 
> 
