
Re: [Linux-cluster] Cluster failing after rebooting a standby node



Thanks for the reply.

We've since done some testing within our network and successfully switched the cluster to multicast mode for cluster communication. Red Hat support suggested that the bug we were encountering might be present only in broadcast mode. Do you think the bug is related to the method of cluster communication (i.e. broadcast), or should we expect to see this issue with multicast as well?

As I mentioned in my first post, we can't replicate the issue until the cluster has been running for at least 3-4 days. Had your test cluster been running for about that long when you replicated it, or did you do something else to replicate it sooner?

Thanks,

Ben


Christine Caulfield wrote:
Ben J wrote:
Hi Christine,

Thanks for the reply.

Today I was able to replicate the cluster failure again by rebooting
one of the standby nodes.  I captured tcpdump data from 2 of the active
nodes (store01 and store02) and from the 2 standby nodes (ha01 and
ha02).  ha01 is the node that we rebooted, so its capture only shows the
cluster communication that occurred up until the reboot.  See the attached
zip file.

Note, I've sent this off-list as I didn't want to send this to the list
for obvious reasons. :)

Let me know if you need any further information.  I've had the cluster
running with debug level 7 logging, so I can send those logs through too
if you'd like.


Thanks for the tcpdumps, they were very helpful in eliminating several
possible causes I had considered. Unfortunately I still don't quite know
what IS happening!

It seems that when one node leaves the cluster the others go into
transition MASTER state (because they all saw the node go down at the
same time) and they never resolve this state. What normally happens is
that one node will nominate itself master and take over the transition.
But it seems like this is not happening for some reason.
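
To make the expected behaviour concrete, here is a toy sketch of a
self-nomination tie-break. It is not the actual cluster manager code: the
state names, the node IDs, and the "lowest node ID wins" rule are all
assumptions chosen for illustration. It only shows how several nodes that
enter a transition-master state at the same time could settle on a single
master instead of staying stuck.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int          # hypothetical numeric ID; the real IDs are not in this thread
    name: str
    state: str = "MEMBER"


def resolve_transition(survivors):
    """Pick one master among nodes that all detected the failure together.

    Assumed rule (illustrative only): the lowest node_id nominates itself.
    """
    if not survivors:
        raise ValueError("no surviving nodes")

    # Every survivor saw the node go down at the same time,
    # so every survivor starts out as a would-be master.
    for node in survivors:
        node.state = "TRANSITION_MASTER"

    # Tie-break: exactly one node keeps the master role.
    master = min(survivors, key=lambda n: n.node_id)

    # The others stand down and follow the master's transition.
    for node in survivors:
        if node is not master:
            node.state = "TRANSITION_SLAVE"
    return master


# ha01 has been rebooted; the remaining three nodes must resolve the transition.
survivors = [Node(2, "store02"), Node(1, "store01"), Node(4, "ha02")]
master = resolve_transition(survivors)
print(master.name)
```

The failure described in this thread would correspond to the tie-break step
never completing, so that every node remains in the master state.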

I did manage to reproduce it (or something very similar) on a three-node
cluster yesterday. Unfortunately I didn't have debugging enabled in the
modules, so it didn't tell me much more (though it did tell me a little).
I have restarted some tests and hope they will yield some results
soon(ish).


