[Linux-cluster] Cluster failing after rebooting a standby node

Ben J bjlist at westnet.com.au
Wed Apr 30 04:31:32 UTC 2008


Thanks for the reply.

We've now done some testing within our network and managed to switch
cluster communication to multicast mode successfully.  Red Hat support
suggested that the bug we were encountering might only be present in
broadcast mode.  Do you think the bug is related to the method of
cluster communication (i.e. broadcast), or should we expect to
experience this issue with multicast as well?

As I mentioned in my first post, we can't replicate the issue until the
cluster has been running for at least 3-4 days.  Is this roughly how
long your test cluster had been running when you replicated it, or did
you do something else to replicate it sooner?

Thanks,

Ben


Christine Caulfield wrote:
> Ben J wrote:
>   
>> Hi Christine,
>>
>> Thanks for the reply.
>>
>> Today I was able to replicate the cluster failing again by rebooting
>> one of the standby nodes.  I captured tcpdump data from 2 of the active
>> nodes (store01 and store02) and from the 2 standby nodes (ha01 and
>> ha02).  Ha01 is the node that we rebooted, so it will only show cluster
>> communication that occurred up until it rebooted.  See attached zip file.
>>
>> Note, I've sent this off-list as I didn't want to send this to the list
>> for obvious reasons. :)
>>
>> Let me know if you need any further information.  I've had the cluster
>> running with debug level 7 logging, so I've got that information too
>> if you'd like me to send it through.
>>
>>     
>
> Thanks for the tcpdumps, they were very helpful in eliminating several
> possible causes I had considered. Unfortunately I still don't quite know
> what IS happening!
>
> It seems that when one node leaves the cluster the others go into
> transition MASTER state (because they all saw the node go down at the
> same time) and they never resolve this state. What normally happens is
> that one node will nominate itself master and take over the transition.
> But it seems like this is not happening for some reason.
>
> I did manage to reproduce it (or something very similar) on a three node
> cluster yesterday. Unfortunately I didn't have debugging enabled in the
> modules, so it didn't tell me much more (though it did tell me a
> little). I have restarted some tests and I hope they will yield some
> results soon (ish).
>
>   



