[Linux-cluster] Fencing through iLO and functioning of kdump

Ben Turner bturner at redhat.com
Wed Sep 1 14:48:23 UTC 2010


Here is a kbase article on fence_scsi. It should answer any questions you have:

https://access.redhat.com/kb/docs/DOC-17809

I usually run fence_scsi_test first to be sure my devices are capable. Note:

"To assist with finding and detecting devices which are (or are not) suitable for use with fence_scsi, a tool has been provided. The fence_scsi_test script will find devices visible to the node and report whether or not they are compatible with SCSI persistent reservations."

-Ben
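
P.S. In case it helps to see why fence_scsi leaves the crashed node powered on, here is a toy Python sketch of the idea behind registration-based SCSI reservations. It is only a model under my own assumptions: the class, method, and key names are made up for illustration, and the real agent works by sending SCSI-3 persistent reservation commands to the storage rather than anything like this code.

```python
# Toy model only: real reservations are enforced by the storage device.
class SharedDevice:
    """A shared LUN where only registered keys are allowed to write."""

    def __init__(self):
        self.registered_keys = set()

    def register(self, key):
        # Each node registers its own key with the device.
        self.registered_keys.add(key)

    def preempt(self, own_key, victim_key):
        # A surviving, registered node removes the victim's key (the "fence").
        if own_key not in self.registered_keys:
            raise PermissionError("only a registered node can fence another")
        self.registered_keys.discard(victim_key)

    def write(self, key, data):
        # The device rejects writes from keys that are no longer registered.
        return key in self.registered_keys


dev = SharedDevice()
dev.register("node1")
dev.register("node2")
dev.preempt("node1", "node2")  # node1 fences node2; node2 stays powered on

node1_ok = dev.write("node1", b"journal update")
node2_ok = dev.write("node2", b"stale write")
print(node1_ok, node2_ok)
```

The fenced node is cut off from the shared storage but is never power cycled, so a kdump to local disk or over the network can keep running.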


----- "Chris Jankowski" <Chris.Jankowski at hp.com> wrote:

> Ben,
> 
> Thank you for pointing me at fence_scsi.
> It looks like fence_scsi will fit the bill elegantly. And it should be
> much more reliable than iLO fencing if the cluster uses a properly
> configured, dual-fabric FC SAN for shared storage.
> 
> I read the fence_scsi manual page and have one more question.
> 
> What do I need to do for my cluster to start using SCSI reservations?
> Is this done by default?
> 
> Thanks and regards,
> 
> Chris Jankowski
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ben Turner
> Sent: Saturday, 28 August 2010 03:29
> To: linux clustering
> Subject: Re: [Linux-cluster] Fencing through iLO and functioning of
> kdump
> 
> You have a couple options here:
> 
> 1.  Switch to fence_scsi (which uses SCSI reservations, as you described)
> or another I/O fencing method that does not reboot the system.  This will
> allow your core dump to complete without power fencing interrupting
> it.
> 
> 2.  Put in a post fail delay long enough for the core dump to complete.
> This is suboptimal, as your cluster services/resources will be hung
> for the duration of the post fail delay.  I usually only do this when
> I know I have a node that is crashing and no I/O fencing
> capabilities.
> 
> 3.  If you don't have access to an I/O fence agent and a post fail
> delay won't work for some reason, you can try:
> 
> Best practice I can think of right now would be the following:
> 1. Disable the power fence device on the host you're seeing panics on.
>    I have changed the IP for it in cluster.conf in the past.
> 2. When that node fails, the other nodes will attempt to fence the host
>    and it will fail since the fence device was disabled.
>    (NOTE: between steps 2 and 3, cluster operation is suspended.)
> 3. The administrator can now do things like:
>    - disconnect the FC and network cables from the affected host,
>      ensuring that it is 'manually I/O fenced'
>    - run fence_ack_manual on the other host to override the failed
>      fencing operation to continue cluster operation on the other
>      nodes
> 4. Now the failed host is free to continue kdumping for as long
>    as need be.
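
For option 2 above, the delay is the post_fail_delay attribute of the fence_daemon element in cluster.conf. A minimal Python sketch of setting it might look like the following. The sample configuration is a made-up stub, the helper name is hypothetical, and 4500 seconds (75 minutes) is an arbitrary illustrative value, not a recommendation.

```python
import xml.etree.ElementTree as ET

# Hypothetical stub; a real cluster.conf also lists nodes, fence devices, etc.
SAMPLE_CONF = (
    '<cluster name="example" config_version="1">'
    '<fence_daemon post_fail_delay="0"/>'
    '</cluster>'
)


def set_post_fail_delay(conf_xml, delay_seconds):
    """Return cluster.conf text with post_fail_delay set and the version bumped."""
    root = ET.fromstring(conf_xml)
    daemon = root.find("fence_daemon")
    if daemon is None:
        # Create the element if the configuration does not have one yet.
        daemon = ET.SubElement(root, "fence_daemon")
    daemon.set("post_fail_delay", str(delay_seconds))
    # Increment config_version so the change is seen as a new configuration.
    root.set("config_version", str(int(root.get("config_version", "0")) + 1))
    return ET.tostring(root, encoding="unicode")


updated = set_post_fail_delay(SAMPLE_CONF, 4500)
print(updated)
```

The edited file still has to be distributed to the other nodes with whatever tool your release uses for that.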
> 
> Hope this helps.
> 
> -b
> 
> 
> ----- "Chris Jankowski" <Chris.Jankowski at hp.com> wrote:
> 
> > Hi,
> > 
> > How can I reconcile the need to have Kdump configured and operational
> > on cluster nodes with the need for fencing of a node, most commonly and
> > conveniently implemented through iLO on HP servers?
> > 
> > Customers require Kdump configured and operational to be able to have
> > kernel crashes analysed by Red Hat support. The taking of a crash dump
> > starts immediately after the crash, but it may take a very considerable
> > time on a machine with 512 GB of memory (more than an hour) if done in
> > dumplevel 0 and over a 1 GbE network. However, if I use iLO fencing then
> > the crashed node will be powered off through iLO, which will
> > irrecoverably kill the kernel dump in progress and erase the memory
> > content containing the crashed kernel image.
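
As a back-of-the-envelope check on the "more than an hour" figure, here is a small Python sketch. It assumes decimal gigabytes, an unfiltered dump of all memory, and a link running at full line rate with no protocol or disk overhead, so it is a lower bound rather than a prediction. The function name and the efficiency parameter are my own.

```python
def dump_seconds(memory_gb, link_gbit_per_s, efficiency=1.0):
    """Lower bound on the time to send a full memory image over the network."""
    bits_to_send = memory_gb * 8 * 10**9  # assumes decimal gigabytes
    usable_bits_per_s = link_gbit_per_s * 10**9 * efficiency
    return bits_to_send / usable_bits_per_s


seconds = dump_seconds(512, 1)
print(round(seconds / 60, 1), "minutes")  # prints: 68.3 minutes
```

Any real-world overhead only makes this longer, which is consistent with the estimate in the quoted message.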
> > 
> > Ideally, I would love to have the functionality that is present in
> > several UNIX clusters, where a crashed node completes its kernel crash
> > dump in peace. In UNIX clusters the crashed node can be configured to
> > reboot automatically after a kernel crash and rejoin the cluster. It
> > typically does the kernel dump as a part of the boot.
> > 
> > The UNIX clusters typically use SCSI reservations to protect the
> > integrity of storage. This enables them to keep the failed node isolated
> > whilst it is still able to do the kernel crash dump before rejoining the
> > cluster. I believe this option is not available in Linux Cluster.
> > 
> > So, how can I have a functioning Linux cluster with the ability to take
> > a kernel crash dump of crashed nodes, without blocking access to the
> > shared GFS2 filesystem for the hour or so that it may take to complete
> > a crash dump on a very large system?
> > 
> > Thanks and regards,
> > 
> > Chris Jankowski
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 



