[Linux-cluster] Cluster failing after rebooting a standby node

Tue Apr 29 12:39:50 UTC 2008

Ben J wrote:
> Hi Christine,
> 
> Thanks for the reply.
> 
> Today I was able to replicate the cluster failure again by rebooting
> one of the standby nodes.  I captured tcpdump data from 2 of the active
> nodes (store01 and store02) and from the 2 standby nodes (ha01 and
> ha02).  Ha01 is the node that we rebooted, so it will only show cluster
> communication that occurred up until it rebooted.  See attached zip file.
> 
> Note, I've sent this off-list as I didn't want to send this to the list
> for obvious reasons. :)
> 
> Let me know if you need any further information.  I've had the cluster
> running with debug level 7 logging, so I've got that information too
> if you'd like me to send it through.
>

Thanks for the tcpdumps; they were very helpful in eliminating several
possible causes I had considered. Unfortunately I still don't quite know
what IS happening!

It seems that when one node leaves the cluster the others go into
transition MASTER state (because they all saw the node go down at the
same time) and they never resolve this state. What normally happens is
that one node will nominate itself master and take over the transition.
But it seems like this is not happening for some reason.
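
As a rough illustration of the expected behaviour, here is a toy model in
Python. It is only a sketch and not the actual cman code: the state names,
the Node class, and the lowest-node-ID tie-break are all assumptions made
for the example.

```python
from dataclasses import dataclass

# Hypothetical state names for illustration; not taken from the cman source.
MEMBER = "MEMBER"
TRANSITION_MASTER = "TRANSITION_MASTER"
TRANSITION_SLAVE = "TRANSITION_SLAVE"


@dataclass
class Node:
    node_id: int
    state: str = MEMBER


def node_left(survivors):
    """Every survivor sees the departure at once and claims the transition."""
    for node in survivors:
        node.state = TRANSITION_MASTER


def resolve(survivors):
    """Assumed tie-break: the lowest node ID keeps the master role."""
    candidates = [n for n in survivors if n.state == TRANSITION_MASTER]
    if not candidates:
        return None
    master = min(candidates, key=lambda n: n.node_id)
    for node in candidates:
        if node is not master:
            # Everyone else backs off and follows the master.
            node.state = TRANSITION_SLAVE
    return master


nodes = [Node(1), Node(2), Node(3)]
node_left(nodes)         # all three are now TRANSITION_MASTER
master = resolve(nodes)  # exactly one node should remain master
```

In this model, the symptom described above corresponds to the resolve step
never completing, so every surviving node stays in TRANSITION_MASTER.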

I did manage to reproduce it (or something very similar) on a three-node
cluster yesterday. Unfortunately I didn't have debugging enabled in the
modules, so it didn't tell me much more (though it did tell me a little).
I have restarted some tests and I hope they will yield some results
soon(ish).

-- 

Chrissie