[Linux-cluster] Repeated fencing

Doug Tucker tuckerd at lyle.smu.edu
Wed Feb 24 14:54:38 UTC 2010


Thanks to you and Carlos.  I understand a bit better now what you are
referring to; however, I don't believe that is the issue.  The reason we
went to the crossover cable was to avoid exactly this: we had a switch
die once, and both nodes then thought they were master and tried to fence
the other.  In my situation, there is no reason for the missed heartbeat
that I can find.  The interfaces have not gone down.  We ran a test
where I started a ping between the 2 nodes that wrote out to a file until
a "heartbeat" was missed and a reboot occurred.  There was not a single
missed ping between the 2 nodes prior to the event.  Also, in a split
brain, both machines should see the other one as "gone" and try to become
master.  In this case, only 1 of the nodes at a time is seeing a "missed
heartbeat" and then attempting to fence the other.  We have replaced all
of the hardware, including the cables, to make sure it wasn't that.  This
appears to be a software bug of some sort.  Again, we have another 2 node
cluster where this doesn't occur, but it is running a different kernel
and gfs module.
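
For anyone who wants to repeat the ping test, here is a rough sketch of
that kind of logger.  It is not the script we actually ran; the function
names, the log path and the "node2" hostname are made up for
illustration, and it assumes the Linux iputils ping (-c for packet count,
-W for reply timeout).  Each result is flushed and fsync'ed so the last
entries are still on disk if the node gets fenced right afterwards.

```python
import os
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1, runner=subprocess.run):
    """Send a single ICMP echo request; return True if a reply came back."""
    # -c 1: one packet; -W: seconds to wait for the reply (iputils ping).
    result = runner(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def format_entry(when, host, ok):
    """Build one log line: ISO timestamp, peer name, then 'ok' or 'MISSED'."""
    status = "ok" if ok else "MISSED"
    return f"{when.isoformat(timespec='seconds')} {host} {status}"


def monitor(host, log_path, interval_s=1.0, max_probes=None,
            ping=ping_once, sleep=time.sleep):
    """Ping `host` repeatedly and append every result to `log_path`.

    Runs until interrupted, or stops after `max_probes` probes if given.
    Returns the number of missed replies.
    """
    missed = 0
    sent = 0
    with open(log_path, "a") as log:
        while max_probes is None or sent < max_probes:
            ok = ping(host)
            sent += 1
            if not ok:
                missed += 1
            log.write(format_entry(datetime.now(), host, ok) + "\n")
            # Force the entry to disk before a possible fence/reboot.
            log.flush()
            os.fsync(log.fileno())
            sleep(interval_s)
    return missed
```

Run on each node against the other's crossover address, e.g.
monitor("node2", "/var/tmp/heartbeat-ping.log"), and compare the last
timestamps in the file with the time of the fence event.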

On Wed, 2010-02-24 at 03:34 -0600, ESGLinux wrote:
> Hi Doug, 
> 
> 
> the split brain is what is happening to you ;-)
> 
> 
> From wikipedia: http://en.wikipedia.org/wiki/High-availability_cluster
> "HA clusters usually use a heartbeat private network connection which
> is used to monitor the health and status of each node in the cluster.
> One subtle, but serious condition every clustering software must be
> able to handle is split-brain. Split-brain occurs when all of the
> private links go down simultaneously, but the cluster nodes are still
> running. If that happens, each node in the cluster may mistakenly
> decide that every other node has gone down and attempt to start
> services that other nodes are still running. Having duplicate
> instances of services may cause data corruption on the shared
> storage."
> 
> 
> The quorum disk is a good choice to avoid it, as they have told you.
> 
> 
> Good luck, 
> 
> 
> Greetings, 
> 
> 
> ESG
> 
> 
> 
> 
> 2010/2/23 Doug Tucker <tuckerd at lyle.smu.edu>
>         > Hi Doug, maybe you can avoid this kind of problem using a
>         quorumdisk partition. A two node cluster is split-brain prone,
>         and with a quorumdisk partition you can avoid split-brain
>         situations, which are probably causing this behavior.
>         >
>         > So, about using a cross-over (or straight) cable, I don't
>         know of any issue with it, but try to check whether it's using
>         full-duplex mode. Half-duplex mode on cross-over linked
>         machines will probably cause heartbeat problems.
>         >
>         > cya..
>         
>         
>         Can you give me a little info about "split-brain" issues?  I
>         don't understand what you mean by that, and what I'm solving
>         with a quorumdisk.  And it has always worked fine, it just
>         started happening after the install of this newer kernel/gfs
>         module.  The other 2 node cluster is still rock solid.  Also,
>         the network interfaces on the crossover are full duplex.
>         Thanks for writing back, you're the first person who offered
>         anything.
>         
>         
>         
>         
>         
> 
> 
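
On the quorum disk suggestion quoted above, this is how I understand the
vote arithmetic, written out as a small sketch.  It assumes a plain
majority rule with one vote per node and one vote for the quorum disk;
those vote counts and the function names are my own assumptions for
illustration, not taken from our cluster configuration.

```python
def quorum_threshold(expected_votes):
    """Smallest strict majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_seen, expected_votes):
    """True if a partition holding `votes_seen` votes has a majority."""
    return votes_seen >= quorum_threshold(expected_votes)


# Assumed setup: two nodes with one vote each, plus a one-vote quorum disk.
NODE_VOTES = 1
QDISK_VOTES = 1
expected = 2 * NODE_VOTES + QDISK_VOTES  # 3 votes in total

# Heartbeat link lost: each node sees only its own vote, and only one
# of the two can hold the quorum disk at a time.
holder = NODE_VOTES + QDISK_VOTES  # 2 votes
other = NODE_VOTES                 # 1 vote

print(is_quorate(holder, expected))  # True
print(is_quorate(other, expected))   # False
```

Under these assumptions only the node holding the quorum disk keeps a
majority, so both nodes cannot act as master at once.  Without the disk
the total is 2 votes and the threshold is also 2, so under a plain
majority rule neither node would be quorate on its own.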




