
Re: [Linux-cluster] Understanding Fencing



Side note, then I will answer in-line. When possible, please start a new email to a mailing list instead of hitting "reply" on an existing message and then deleting the content. Many email clients' threading breaks when a message isn't actually new.

On 09/01/2012 11:57 AM, joshi dhaval wrote:
Hello,

I tried to read some documents on fencing, still bit confused with
technology. ( i dont want to buy any extra hardware just for fencing ).

Was this one of the things you read?

https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing

we are using HP DL 380 G6, G7 servers at out environment, only way i can
see fencing possible in my environment is HP ILO.

Yes, you can use fence_ilo with that. I have done so myself and cover how to set it up here:

https://alteeve.ca/w/Configuring_HP_iLO_2_on_EL6

and how to use it as a fence device here:

https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO

what is PDU ? do i need to purchase separate device to enable fencing
using PDU ?

A PDU (Power Distribution Unit) is, by itself, just another name for a power bar, though it generally refers to rack-mounted power bars. In fencing though, we use a version called a "Switched PDU". These are power bars with a network connection. They allow you to connect remotely and turn each outlet on and off independently of the other outlets. They also offer power monitoring and so on, but that's outside fencing.

So in fencing, suppose for example that the server's power supply failed. The server would power down and take the IPMI or iLO interface with it (see below). Without any power at all, the IPMI interface cannot reply. We know in this case that the node is gone, but the other nodes don't. All they know is that they can't talk to the node or its IPMI/iLO interface, which could just as well be caused by a network outage that leaves the node alive.

In this case, the cluster can call the Switched PDUs and ask them to turn off the outlet(s) feeding the server. When the PDUs say "ok, they're off", *then* the cluster can safely say "ok, now I know it has to be off" and can begin recovery.
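To make that rule concrete, here is a minimal Python sketch. It is not how the real fence agents are written; the `Outlet` class and the `fence_via_pdus` function are made-up names for illustration, and the PDU is simulated rather than contacted over the network. The point it shows is that a node fed by more than one outlet only counts as fenced when every one of those outlets has been confirmed off.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outlet:
    """Hypothetical stand-in for one outlet on a switched PDU."""
    pdu: str
    port: int
    reachable: bool = True
    is_on: bool = True

    def power_off(self) -> bool:
        """Ask the PDU to cut power; True only if the PDU confirms it."""
        if not self.reachable:
            return False  # no answer from the PDU, so the state is unknown
        self.is_on = False
        return True


def fence_via_pdus(outlets: List[Outlet]) -> bool:
    """Return True only when every outlet feeding the node is confirmed off."""
    results = [outlet.power_off() for outlet in outlets]
    # An empty outlet list confirms nothing, so it must not count as success.
    return bool(results) and all(results)


# A node with two power supplies, one on each PDU.
both_confirmed = fence_via_pdus([Outlet("pdu1", 2), Outlet("pdu2", 2)])
one_unreachable = fence_via_pdus(
    [Outlet("pdu1", 2), Outlet("pdu2", 2, reachable=False)]
)
```

Here `both_confirmed` is `True` and `one_unreachable` is `False`: if even one PDU does not answer, the node might still have power, so recovery must not begin.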

is that IPMI is same as HP ILO ?

No, but they are similar. I have a short write-up of it here:

https://alteeve.ca/w/IPMI

IPMI is a generic way for a server to offer "Out of Band" management. That is just a fancy way of saying "You can check on the state of the server even when the server is powered off".

The piece of hardware inside your server that provides IPMI is called a "BMC" (Baseboard Management Controller). Think of it like a little, separate computer sitting on your server's motherboard. It draws its power from the host and it can read the host's sensors (power state, fans, temperatures, etc.), but it is still a totally separate device.

In fencing, if one node stops responding (say because the OS crashed), another node in the cluster will call the victim's IPMI interface and say "please power off the host". The BMC then, effectively, "pushes and holds the power button" until the host shuts down. Then the IPMI device tells the caller that the power off was successful. The cluster then knows the state of the victim (it is powered off now) so it can safely recover.
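That request-then-confirm sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SimulatedBMC` and `fence_via_bmc` are hypothetical names, the BMC here is a simulated object rather than a real IPMI session, and the real fence agents do this work for you. What it shows is that success is reported only after the BMC says the host is off, and that a timeout is treated as a failure.

```python
import time


class SimulatedBMC:
    """Hypothetical stand-in for a BMC; a real one is reached over the network."""

    def __init__(self, polls_until_off: int = 2):
        self.host_on = True
        self._polls_left = polls_until_off
        self._off_requested = False

    def request_power_off(self) -> None:
        """Equivalent of asking the BMC to press and hold the power button."""
        self._off_requested = True

    def host_power_status(self) -> str:
        # The simulated host takes a few status polls to actually go down.
        if self._off_requested and self.host_on:
            self._polls_left -= 1
            if self._polls_left <= 0:
                self.host_on = False
        return "on" if self.host_on else "off"


def fence_via_bmc(bmc: SimulatedBMC, max_polls: int = 10, delay: float = 0.0) -> bool:
    """Request power off, then poll until the BMC reports the host is off.

    Returns True only on confirmation; running out of polls counts as failure.
    """
    bmc.request_power_off()
    for _ in range(max_polls):
        if bmc.host_power_status() == "off":
            return True
        time.sleep(delay)
    return False


confirmed = fence_via_bmc(SimulatedBMC())
timed_out = fence_via_bmc(SimulatedBMC(polls_until_off=5), max_polls=3)
```

In this sketch `confirmed` is `True`, while `timed_out` is `False` because the host never reported "off" within the allowed polls, so the caller still does not know the victim's state.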

As for the difference between IPMI and iLO:

Most major hardware vendors took IPMI and added a bunch of features on top of it, then gave the result their own brand name. So HP called theirs "iLO", IBM called theirs "RSA", Dell called theirs "DRAC" and so on. These are all very similar to IPMI (some are similar enough that stock IPMI tools work with them).

for above hardware what you suggest are the most reliable fencing
techniques i should use ?

I would use 'fence_ilo'.

is that cross cable connection is possible just to check hearbeats like
VCS has gab and llt ?

I don't know VCS or llt so I can't comment. In RHCS, we use "corosync" for cluster membership. By default, it uses a multicast group for passing messages around the cluster and for detecting a node's death. It's similar to what I think you mean by "heartbeat". It is advised that you use a proper switch, though I do not believe it is required.

i am panning to configure 2 nodes cluster first once i will have
confidence i will move it to 4 or 5 node cluster.

Then definitely use a proper switch, not back to back.

Regards,
Dhaval

A final comment:

In clustering, a failed fence action will leave the cluster in a state where it does not know the condition of a member. Given the dangers of making an assumption, the cluster would rather block (hang) than proceed in a way that could cause damage. This is why fencing is so critical: it restores the cluster to a known state after a fault.

If you use only iLO for fencing (and many people do use only IPMI, iLO, etc.), then you will be fine most of the time. For me personally, this is not good enough. If for any reason the other node(s) can't reach the IPMI or iLO interface, the fence action will fail and the cluster will hang. With a switched PDU, you have a backup fence device that would protect you against this by providing an alternate method of confirming the node's state. Thus, by adding a switched PDU to your cluster, you remove another single point of failure.
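The idea of a primary and a backup fence device can be sketched like this in Python. Treat it as a rough model only: `fence_node`, `ilo_unreachable` and `pdu_ok` are hypothetical names, the devices are simulated functions, and raising an exception merely stands in for the cluster blocking. It tries each method in order and reports which one confirmed the node's state.

```python
from typing import Callable, List, Tuple

# A fence method is a (name, function) pair; the function returns True
# only when it has confirmed that the node is off.
FenceMethod = Tuple[str, Callable[[str], bool]]


def fence_node(node: str, methods: List[FenceMethod]) -> str:
    """Try each fence method in order; return the name of the one that worked.

    Raises RuntimeError if none succeed, standing in for the cluster blocking.
    """
    for name, method in methods:
        try:
            if method(node):
                return name
        except OSError:
            # Device unreachable: a failed attempt, so try the next method.
            continue
    raise RuntimeError(f"all fence methods failed for {node}; cluster must block")


def ilo_unreachable(node: str) -> bool:
    """Simulated iLO that cannot be reached (for example, the node lost all power)."""
    raise ConnectionError("no route to iLO")


def pdu_ok(node: str) -> bool:
    """Simulated switched PDU that confirms the outlets are off."""
    return True


used = fence_node("node2", [("ilo", ilo_unreachable), ("pdu", pdu_ok)])
```

Here `used` is `"pdu"`: the iLO attempt failed, and the backup method confirmed the node was off. With only the iLO method in the list, the same call would raise `RuntimeError`, which is the hang described above.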

digimer

--
Digimer
Papers and Projects: https://alteeve.ca

