
[Linux-cluster] SCSI Reservations & Red Hat Cluster Suite



Attached is the latest version of the "Using SCSI Persistent Reservations with Red Hat Cluster Suite" document for review.

Feel free to send questions and comments.

-Ryan
Using SCSI Persistent Reservations with Red Hat Cluster Suite

Ryan O'Hara <rohara redhat com>

January 2008

--

1 - Introduction

When cluster nodes share storage devices, it is necessary to control
access to those devices. In the event of a node failure, the
failed node should not have access to the underlying storage
devices. SCSI persistent reservations provide the capability to
control the access of each node to shared storage devices. Red Hat
Cluster Suite employs SCSI persistent reservations as a fencing
method through the use of the fence_scsi agent. The fence_scsi agent
provides a method to revoke access to shared storage devices, provided
that the storage supports SCSI persistent reservations.

Using SCSI reservations as a fencing method is quite different from
traditional power fencing methods. It is very important to understand
the software, hardware, and configuration requirements prior to using
SCSI persistent reservations as a fencing method.

2 - Overview

In order to understand how Red Hat Cluster Suite is able to use SCSI
persistent reservations as a fencing method, it is helpful to have
some basic knowledge of SCSI persistent reservations.

There are two important concepts within SCSI persistent
reservations that should be made clear: registrations and
reservations. 

2.1 - Registrations

A registration occurs when a node registers a unique key with a
device. A device can have many registrations. For our purposes, each
node will create a registration on each device.
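As an illustrative sketch (not the exact commands the cluster scripts
run), a registration can be created with the sg_persist tool from
sg3_utils; the key and device below are placeholders:

```shell
# Register the key 0x1 with /dev/sda (placeholder key and device).
# --param-sark is the "service action reservation key" to register.
sg_persist --out --register --param-sark=0x1 /dev/sda

# List all keys currently registered with the device.
sg_persist --in --read-keys /dev/sda
```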

2.2 - Reservations

A reservation dictates how a device can be accessed. In contrast to
registrations, there can be only one reservation on a device at any
time. The node that holds the reservation is known as the "reservation
holder". The reservation defines how other nodes may access the
device. For example, Red Hat Cluster Suite uses a "Write Exclusive,
Registrants Only" reservation. This type of reservation indicates that
only nodes that have registered with that device may write to the
device.
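For illustration, a "Write Exclusive, Registrants Only" reservation
corresponds to persistent reservation type 5 in sg_persist; the key and
device below are placeholders:

```shell
# Create a "Write Exclusive, Registrants Only" reservation (type 5)
# using a previously registered key 0x1 (placeholder values).
sg_persist --out --reserve --param-rk=0x1 --prout-type=5 /dev/sda

# Show the current reservation on the device.
sg_persist --in --read-reservation /dev/sda
```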

2.3 - Fencing

Red Hat Cluster Suite is able to perform fencing via SCSI persistent
reservations by simply removing a node's registration key from all
devices. When a node failure occurs, the fence_scsi agent will remove
the failed node's key from all devices, thus preventing it from being
able to write to those devices. More information can be found in
Section x.x of this document.
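Conceptually, revoking a failed node's access resembles a SCSI
"preempt" operation: a surviving node uses its own key to remove the
failed node's key. A hedged sketch with placeholder keys:

```shell
# From a surviving node registered with key 0x1, preempt (remove)
# the failed node's registration key 0x2 and abort its pending I/O.
# Keys and device are placeholders for illustration only.
sg_persist --out --preempt-abort --param-rk=0x1 --param-sark=0x2 /dev/sda
```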

3 - Requirements

3.1 - Software Requirements

In order to use SCSI persistent reservations as a fencing method, the
following requirements must be met:

- Red Hat Cluster Suite 4.5 or greater
- Red Hat Cluster Suite 5.0 or greater

The sg3_utils package must also be installed. This package provides
the tools needed by the various scripts to manage SCSI persistent
reservations.
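To confirm that sg3_utils is available on a node, something like the
following can be used:

```shell
# Check whether the sg3_utils package is installed.
rpm -q sg3_utils

# sg_persist is the utility the scripts rely on.
which sg_persist
```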

3.2 - Storage Requirements

In order to use SCSI persistent reservations as a fencing method, all
shared storage must use LVM2 cluster volumes. In addition, all devices
within these volumes must be SPC-3 compliant. If you are unsure if
your cluster and shared storage environment meets these requirements,
a script is available to determine if your shared storage devices are
capable of using SCSI persistent reservations. See section x.x.
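One way to check whether a volume group is an LVM2 cluster volume is to
inspect its attribute string with vgs; the sixth character of vg_attr
is 'c' for a clustered volume group:

```shell
# List volume groups with their attribute strings.
# The sixth character of vg_attr is 'c' for clustered volume groups.
vgs -o vg_name,vg_attr
```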

3.3 - Limitations

In addition to these requirements, fencing by way of SCSI persistent
reservations also has some limitations.

- Multipath devices are not currently supported.

- All nodes in the cluster must have a consistent view of storage. In
  other words, all nodes in the cluster must register with the same
  devices. This limitation exists for the simple reason that each node
  must be able to remove another node's registration key from all the
  devices that it registered with. In order to do this, the node
  performing the fencing operation must be aware of all devices that
  other nodes are registered with. If all cluster nodes have a
  consistent view of storage, this requirement is met.

- Devices used for the cluster volumes should be a complete LUN, not
  partitions. SCSI persistent reservations work on an entire LUN,
  meaning that access is controlled to each LUN, not individual
  partitions.

To assist with detecting devices which are (or are not) suitable for
use with fence_scsi, a tool has been provided. The
fence_scsi_test script will find devices visible to the node and
report whether or not they are compatible with SCSI persistent
reservations. A full description of this tool can be found in Section
x.x of this document.
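Based on the flags described above, the two modes might be invoked as
follows:

```shell
# Test only devices within LVM2 cluster volumes (cluster mode).
fence_scsi_test -c

# Test all SCSI devices found under /sys/block/ (SCSI mode).
fence_scsi_test -s
```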

4 - Components

Red Hat Cluster Suite provides three components (scripts) to be used
in conjunction with SCSI persistent reservations. The fence_scsi_test
script provides a means to discover and test devices and report
whether or not they are capable of supporting SCSI persistent
reservations. The scsi_reserve init script, if enabled, will run at
node startup and discover shared storage devices and create
registrations/reservations on each device using the node's unique
key. The fence_scsi script, if configured as the fencing method, will
remove a failed node's registration key from all known devices.

4.1 - fence_scsi_test

The fence_scsi_test script will find all devices visible to a node and
report whether or not those devices are compatible with SCSI
persistent reservations. There are two modes of operation for this
script, and the user must explicitly state which mode to use via the
appropriate command-line option.

- Cluster Mode

Specified with the '-c' flag on the command-line. This mode is
intended for use with an existing cluster environment. Specifically,
this mode will discover all LVM2 cluster volumes and extract the
devices within those volumes. In other words, only devices that exist
within LVM2 cluster volumes will be tested.

- SCSI Mode

Specified with the '-s' flag on the command-line. This mode is
intended to test all SCSI devices visible to the node, which is useful
when planning the cluster volume configuration. Note that this mode
will test all SCSI devices found in the /sys/block/ directory, which
may include local SCSI devices.

In both modes, the devices found will be tested for compatibility. This
is done by simply attempting to register with each device. Successful
registration indicates that the device is capable of performing SCSI
persistent reservations. If registration is successful, the script
will remove the registration.

Users will want to pay close attention to which devices report
failure. If fence_scsi_test is run in "cluster mode" and reports
devices that have failed the test, you must not use fence_scsi as your
fencing method. If fence_scsi_test was run in "SCSI mode" and reports
failures for devices, those devices must not be used for shared
storage (LVM2 cluster volumes) if you wish to use fence_scsi as a
fencing method.

4.2 - scsi_reserve

Once you have verified that your cluster storage is compatible and
meets the requirements necessary to use fence_scsi, you can enable the
scsi_reserve init script. This can be done with the following command:

	% chkconfig scsi_reserve on

When enabled, the scsi_reserve script handles creation of
registrations and reservations at system startup.

The scsi_reserve init script will first generate the node's unique
key. This key is based on the cluster ID and the node ID, thus it is
guaranteed to be unique. The next step in the scsi_reserve script
depends on which parameter was used. The following options are
allowed: start, stop, and status. Each case requires that the cluster
manager (cman) be running. This is needed to extract information about
the cluster and the individual node.
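The exact key format is internal to the scripts, but as a rough sketch,
a key derived from the cluster ID and node ID could be composed like
this (hypothetical layout, shown only to illustrate why the result is
unique per node):

```shell
# Hypothetical key derivation: cluster ID in the high bits,
# node ID in the low 16 bits. The real scripts may format differently.
cluster_id=1234
node_id=1
key=$(printf "%x%.4x" "$cluster_id" "$node_id")
echo "$key"
```

Because the cluster ID is shared by all nodes and the node ID is unique
within the cluster, the combined value is unique per node.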

4.2.1 - scsi_reserve start

Running the scsi_reserve init script with the 'start' option will
proceed to create registrations on all devices that were previously
discovered. If necessary, it will also create the reservation. The
script will report success or failure. Success indicates that the node
was capable of registering with all devices that were
discovered. Failure indicates that the script was unable to register
with one or more devices. Should a failure occur, the cluster has no way of
completely fencing a node in the event of a node failure.

It is important to note that 'scsi_reserve start' should be run before
mounting the file system. The reason for this is that if you already
have a file system mounted and then create a reservation on any of the
devices used by that file system, any node that is not registered with
those devices will be unable to write to the file system.
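In practice the ordering looks like this; the volume and mount point
names are placeholders:

```shell
# Create registrations/reservations first...
service scsi_reserve start

# ...then mount the shared file system.
mount /dev/my_vg/my_lv /mnt/shared
```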

4.2.2 - scsi_reserve stop

When scsi_reserve is run with the 'stop' command, it will attempt to
remove the node's registration key from all devices that it registered
with at startup. Removing the registration is only a problem if that
node is also the reservation holder and other nodes are still
registered with the device(s). In this case, the node will not be able
to unregister since doing so would also release the reservation. Note
that the script will report failure when attempting to remove a node's
registration if it is the reservation holder and other registrations
exist.

4.2.3 - scsi_reserve status

When the scsi_reserve script is run with the 'status' command, it will
list the devices that the node is registered with.

4.3 - fence_scsi

The fence_scsi script is the actual fence agent that is run when node
failure occurs. Typically this script will not be run manually, but
rather invoked by the fence domain. Using this script manually will remove
a node's registrations from all devices, but will not remove the node
from the cluster.
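If run manually, the invocation presumably names the victim node,
mirroring the "node" attribute used in cluster.conf; the '-n' option
shown here is an assumption, so check the agent's man page for the
exact flag:

```shell
# Remove node-02's registration key from all devices.
# The '-n' option is assumed to mirror the "node" attribute
# in cluster.conf; verify against the agent's documentation.
fence_scsi -n node-02
```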

When a node is fenced using fence_scsi, it simply removes the
specified node's registrations from all devices. This prevents write
access to those devices. In the special case where the node being
fenced is also the reservation holder, the node that is performing the
fence operation will become the new reservation holder.

Note that if the node being fenced has the file system mounted,
removing its registrations prevents the node from accessing the
file system. This sudden inability to access the devices upon which the
file system exists may result in I/O errors and a subsequent withdraw
from the file system. This behavior is expected.

5 - Configuration

Below is a sample configuration (cluster.conf) for a cluster that uses
SCSI persistent reservations as its fence method. Note that each node
defines its fence device and passes its node name to the agent via the
"node" attribute.

Also note that each node explicitly defines its "nodeid". This is
required for all clusters that use fence_scsi as the fence method. The
"nodeid" attribute must be defined so that the various SCSI
reservation scripts can predictably generate the node's unique
registration key.

<?xml version="1.0"?>
<cluster config_version="1" name="my_cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="30"/>
        <clusternodes>
                <clusternode name="node-01" votes="1" nodeid="1">
                <fence>
                        <method name="scsi">
                        <device name="fence_dev" node="node-01"/>
                        </method>
                </fence>
                </clusternode>
                <clusternode name="node-02" votes="1" nodeid="2">
                <fence>
                        <method name="scsi">
                        <device name="fence_dev" node="node-02"/>
                        </method>
                </fence>
                </clusternode>
                <clusternode name="node-03" votes="1" nodeid="3">
                <fence>
                        <method name="scsi">
                        <device name="fence_dev" node="node-03"/>
                        </method>
                </fence>
                </clusternode>
        </clusternodes>
        <cman cluster_id="1234"/>
        <fencedevices>
                <fencedevice agent="fence_scsi" name="fence_dev"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

