Ben J wrote:
Hi Christine,
Thanks for the reply.
I've been able to today replicate the cluster failing again by rebooting
one of the standby nodes. I captured tcpdump data from 2 of the active
nodes (store01 and store02) and from the 2 standby nodes (ha01 and
ha02). Ha01 is the node that we rebooted, so it will only show cluster
communication that occurred up until it rebooted. See attached zip file.
Note, I've sent this off-list as I didn't want to send this to the list
for obvious reasons. :)
Let me know if you need any further information. I've had the cluster
running with debug level 7 logging, so I've got that information as well
if you'd like me to shoot through that as well.
Thanks for the tcpdumps, they were very helpful in eliminating several
possible causes I had considered. Unfortunately I still don't quite know
what IS happening!
It seems that when one node leaves the cluster the others go into
transition MASTER state (because they all saw the node go down at the
same time) and they never resolve this state. What normally happens is
that one node will nominate itself master and take over the transition.
But it seems like this is not happening for some reason.
I did manage to reproduce it (or something very similar) on a three node
cluster yesterday, unfortunately I didn't have debugging enabled in the
modules so it didn't tell me much more (though it did tell me a little
more). I have restarted some tests and I hope they will yield some
results soon (ish).