[Linux-cluster] Two-node cluster: Node attempts stateful merge after clean reboot

Thu Sep 12 06:57:22 UTC 2013

On 11/09/13 7:31 PM, Digimer wrote:
> That log message does show the node joining. Can you reliably
> reproduce this? If so, can you please 'tail -f -n 0 /var/log/messages'
> on both nodes, break the cluster and wait for the node to restart,
> 'tail' the rebooted node's /var/log/messages, wait the six minutes and
> then, after the second fence occurs, post both nodes' logs?
>
I was indeed able to reliably reproduce this, and that's where my
confusion came from. I don't understand why the node appears to join
(and, per the log, leave immediately afterwards), all within the
360-second post-join fence delay, and yet still gets fenced.

As this is a semi-production system (we had to move quickly), I have
now gone with a qdisk-based approach, using a small iSCSI disk from a
remote site. This works well and reliably as far as I can tell from the
testing I have done so far. I would still be interested to hear why
the initial approach failed.

How would manually starting the cluster services have made a difference
anyway? Does that mean that one should join the cluster and fence domain
first to ensure a stateless join and only then start rgmanager? Isn't
that something that could be achieved with some delays in the startup
scripts as well?

Either way, thank you all for helping out so quickly!