[Linux-cluster] nodes halted with net lost

Tue Apr 28 15:41:17 UTC 2009

On Tue, 28 Apr 2009 17:21:13 +0200, ESGLinux <esggrupos at gmail.com> wrote:

> The nodes are connected through a single switcher (I know, this is a
> single point of failure...). If I reboot the switcher, the two nodes
> halt. (through fencing it can be done because the go through the same
> switcher)

If they can't fence each other, cluster services will pause until fencing
can be performed and verified. If fencing can't happen (because the only
path between them, which also carries the fencing traffic, is gone), then
the behaviour you are seeing is expected. But when the switch comes back
up, they should resume.

If they don't resume when the switch comes back up, then that sounds like a
fencing configuration issue. Have you verified that fencing works and that
each node can successfully fence the other?
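
One way to check this is to run fence_node against the peer from each node
in turn. Below is a rough Python wrapper around that, as a sketch only: the
helper names are made up for illustration, and it assumes fence_node is on
the PATH and exits with status 0 on success. Be aware that it really will
fence the node you name, so only try it when a reboot or power-off of that
node is acceptable.

```python
import subprocess


def fence_command(peer):
    # fence_node asks the cluster's configured fencing method to fence
    # the named node. The node name must match the one in cluster.conf.
    return ["fence_node", peer]


def can_fence(peer, run=subprocess.run):
    """Return True if fencing `peer` reported success.

    Assumption: an exit status of 0 means the fence operation succeeded.
    `run` is injectable so the logic can be exercised without touching
    real hardware.
    """
    result = run(fence_command(peer))
    return result.returncode == 0
```

For example, calling can_fence("node2") on node1 (node2 being a hypothetical
node name) and then the reverse on node2 tells you whether each node can
fence the other.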

It is normally a good idea to isolate the cluster communication to a
dedicated interface. If you only have 2 nodes, you could just connect them
directly on a dedicated interface, without a switch.

> I don't know if this behaviour is normal and if its possible to control
> it. I want that when this happens the nodes dont do nothing or at least
> they reboot, not halt.

You can configure the action you want the fencing agent to perform. Look
up the man page for the fencing agent you are using. I thought the default
was to reboot (at least it is for the DRAC agent).

Gordan