
Re: [Linux-cluster] Diskless Quorum Disk



Lon, thank you for the response. It appears that what I thought was a fence duel was actually the cluster fencing the proper node, followed by DRBD halting the surviving node after a split-brain scenario. (I obviously have some work to do on my drbd.conf.) After the fenced node came back up, it saw that the other node was unresponsive (it had been halted) and fenced it, which in this case caused it to power on.

Our DRAC shares the NICs with the host. We will probably hack on the DRAC fence script a little to take advantage of other available features besides a power-off followed by a power-on.

Using two_node=1 may be an option again, but then the FAQ indicates the quorum disk might still be beneficial. Using a loop device didn't seem to go so well, but that could be due to configuration error. Having one node not see the qdisk is probably an automatic test failure.

Thanks again,
Chris


Lon Hohberger wrote:
On Wed, Jun 20, 2007 at 05:57:05PM -0500, Chris Harms wrote:
My nodes were set to "quorum=1 two_node=1" and fenced by DRAC cards using telnet over their NICs. These are the same NICs used in my bonded config on the OS, so I assumed the fencing traffic was on the same network path. Perhaps I assumed incorrectly.

That sounds mostly right.  The point is that a node disconnected from
the cluster must not be able to fence a node which is supposedly still
connected.

That is: 'A' must not be able to fence 'B' if 'A' becomes disconnected
from the cluster.  However, 'A' must be able to be fenced if 'A' becomes
disconnected.

Why was DRAC unreachable; was it unplugged too? (Is DRAC like IPMI - in
that it shares a NIC with the host machine?)

The desired effect would be that the survivor claims the service(s) running on the unreachable node and attempts to fence the unreachable node, or brings it back online without fencing should it re-establish contact. The actual result was that the survivor spun its wheels trying to fence the unreachable node and did not take over the services.

Yes, this is an unfortunate limitation of using (most) integrated power
management systems.  Basically, some BMCs share a NIC with the host
(IPMI), and some run off of the machine's power supply (IPMI, iLO,
DRAC).  When the fence device becomes unreachable, we don't know whether
it's a total network outage or a "power disconnected" state.

* If the power to a node has been disconnected, it's safe to recover.

* If the node just lost all of its network connectivity, it's *NOT* safe
to recover.

* In both cases, we cannot confirm the node is dead... which is why we
don't recover.
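
The three cases above can be sketched as a small decision function. This
is only an illustrative model in Python, not code from the cluster
software; the names FenceResult, safe_to_recover and next_action are
hypothetical.

```python
from enum import Enum


class FenceResult(Enum):
    """Hypothetical outcomes of one fencing attempt (illustrative names)."""
    CONFIRMED_OFF = "confirmed_off"          # fence device reported the node is off
    DEVICE_UNREACHABLE = "device_unreachable"  # could not reach the fence device


def safe_to_recover(result: FenceResult) -> bool:
    """Return True only when the failed node is known to be dead.

    An unreachable fence device is ambiguous: the node may have lost power
    (safe to recover) or only its network (not safe), so the survivor must
    not take over.
    """
    return result is FenceResult.CONFIRMED_OFF


def next_action(result: FenceResult) -> str:
    """Map a fencing result to what the surviving node does next."""
    if safe_to_recover(result):
        return "take over services"
    return "retry fencing / wait for node to return"
```

With a DRAC that shares the host's NICs, a pulled network cable would
land in the DEVICE_UNREACHABLE case, which matches the "spinning its
wheels" behavior described above.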

Restoring network connectivity induced the previously unreachable node to reboot and the surviving node experienced some kind of weird power off and then powered back on (???).

That doesn't sound right; the surviving node should have stayed put (not
rebooted).

Ergo, I figured I must need a quorum disk so I can use something like a ping node. My present plan is to use a loop device for the quorum disk and then set up ping heuristics. Will this even work, i.e., do both nodes need to see the same qdisk, or can I fool the service with a loop device?

I don't believe the effects of tricking qdiskd in this way have been
explored; I don't see why it wouldn't work in theory, but... qdiskd with
or without a disk won't fix the behavior you experienced (uncertain
state due to failure to fence -> retry / wait for node to come back).
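
For reference, the ping-heuristic idea asked about above amounts to
scoring a set of checks against a minimum score. The sketch below models
that in Python under stated assumptions: the (score, check) tuple shape
and the function names are hypothetical, and the default threshold
floor((n+1)/2), where n is the sum of the heuristic scores, is taken from
the qdisk documentation as I understand it. It is not qdiskd's actual
code, and it does not address the fencing problem described above.

```python
import math
from typing import Callable, List, Optional, Tuple

# Hypothetical shape: (score awarded on success, check returning True/False).
Heuristic = Tuple[int, Callable[[], bool]]


def default_min_score(heuristics: List[Heuristic]) -> int:
    """Default threshold: floor((n + 1) / 2), n = sum of all heuristic scores."""
    total = sum(score for score, _ in heuristics)
    return math.floor((total + 1) / 2)


def node_is_fit(heuristics: List[Heuristic],
                min_score: Optional[int] = None) -> bool:
    """A node counts as fit if its passing heuristics reach min_score."""
    if not min_score:  # None or 0 means "use the default threshold"
        min_score = default_min_score(heuristics)
    earned = sum(score for score, check in heuristics if check())
    return earned >= min_score
```

In a real setup each check would be something like a ping of a router;
here the checks are plain callables so the scoring logic can be read on
its own.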

I am not deploying GFS or GNBD and I have no SAN. My only option would be to add another DRBD partition for this purpose, which may or may not work.

What is the proper setup option, two_node=1 or qdisk?

In your case, I'd say two_node="1".
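
To illustrate why two_node matters, here is a rough sketch of the quorum
arithmetic in Python. It assumes a simple-majority rule
(expected_votes // 2 + 1) with a special case that lets a single vote
suffice in two-node mode; the function names are hypothetical and this
is a simplified model, not the cluster manager's implementation.

```python
def quorum_threshold(expected_votes: int, two_node: bool = False) -> int:
    """Votes needed for quorum in this simplified model.

    Assumption: simple majority, except in two-node mode, where one vote
    is enough and fencing is relied on to resolve a split.
    """
    if two_node:
        return 1
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int,
               two_node: bool = False) -> bool:
    """True if the votes currently present meet the threshold."""
    return votes_present >= quorum_threshold(expected_votes, two_node)
```

In this model, a two-node cluster with one vote per node and no special
case needs both votes (threshold 2), so a lone survivor is not quorate.
With the two-node special case the threshold is 1, so the survivor stays
quorate and must successfully fence its peer before recovering services.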


