[Cluster-devel] cluster/cman/qdisk README
lhh at sourceware.org
Wed Aug 16 14:54:18 UTC 2006
CVSROOT: /cvs/cluster
Module name: cluster
Branch: STABLE
Changes by: lhh at sourceware.org 2006-08-16 14:54:18
Modified files:
cman/qdisk : README
Log message:
Sync readme to RHEL4 branch
Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/README.diff?cvsroot=cluster&only_with_tag=STABLE&r1=1.4.2.1&r2=1.4.2.2
--- cluster/cman/qdisk/README 2006/07/21 18:01:38 1.4.2.1
+++ cluster/cman/qdisk/README 2006/08/16 14:54:18 1.4.2.2
@@ -1,274 +1 @@
-qdisk 1.0 - a disk-based quorum algorithm for Linux-Cluster
-
-(C) 2006 Red Hat, Inc.
-
-1. Overview
-
-1.1. Problem
-
-In some situations, it may be necessary or desirable to sustain
-a majority node failure of a cluster without introducing the need for
-asymmetric cluster configurations (e.g. client-server layouts, or
-heavily-weighted voting nodes).
-
-1.2. Design Requirements
-
-* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
-danger of a simple network partition causing a split brain. That is, we
-need to be able to ensure that the majority failure case is not merely
-the result of a network partition.
-
-* Ability to use external reasons for deciding which partition is
-the quorate partition in a partitioned cluster. For example, a user may
-have a service running on one node, and that node must always be the master
-in the event of a network partition. Or, a node might lose all network
-connectivity except the cluster communication path - in which case, a
-user may wish that node to be evicted from the cluster.
-
-* Integration with CMAN. CMAN must be able to run both with and
-without us. Linux-Cluster does not normally require a quorum disk -
-introducing new requirements on how Linux-Cluster operates
-is not allowed.
-
-* Data integrity. In order to recover from a majority failure, fencing
-is required. The fencing subsystem is already provided by Linux-Cluster.
-
-* Non-reliance on hardware- or protocol-specific methods (e.g. SCSI
-reservations). This ensures the quorum disk algorithm can be used on the
-widest range of hardware configurations possible.
-
-* Little or no memory allocation after initialization. In critical paths
-during failover, we do not want to risk being killed in a memory
-pressure situation because we incur a page fault and the Linux OOM
-killer responds.
-
-
-1.3. Hardware Configuration Considerations
-
-1.3.1. Concurrent, Synchronous, Read/Write Access
-
-This daemon requires a shared block device with concurrent read/write
-access from all nodes in the cluster. The shared block device can be
-a multi-port SCSI RAID array, a Fibre Channel RAID SAN, a RAIDed iSCSI
-target, or even GNBD. The quorum daemon uses O_DIRECT to write to the
-device.
-
-1.3.2. Bargain-basement JBODs need not apply
-
-There is a minimum performance requirement inherent when using disk-based
-cluster quorum algorithms, so design your cluster accordingly. Using a
-cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause
-problems at the first load spike. Plan your loads accordingly; a node's
-inability to write to the quorum disk in a timely manner will cause the
-cluster to evict the node. Using host-RAID or multi-initiator parallel
-SCSI configurations with the qdisk daemon is unlikely to work, and will
-probably cause administrators a lot of frustration. That having been
-said, because the timeouts are configurable, most hardware should work
-if the timeouts are set high enough.
-
-1.3.3. Fencing is Required
-
-In order to maintain data integrity under all failure scenarios, use of
-this quorum daemon requires adequate fencing, preferably power-based
-fencing.
-
-
-1.4. Limitations
-
-* At this time, this daemon only supports a maximum of 16 nodes.
-
-* Cluster node IDs must be statically configured in cluster.conf and
-must be numbered from 1..16 (there can be gaps, of course).
-
-* Cluster node votes should be more or less equal.
-
-* CMAN must be running before the qdisk program can start. This
-limitation will be removed before a production release.
-
-* CMAN's eviction timeout should be at least 2x the quorum daemon's
-to give the quorum daemon adequate time to converge on a master during a
-failure + load spike situation.
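As a back-of-the-envelope check, the quorum daemon's own eviction window
is roughly interval * tko seconds (these attributes are described in
section 3.1); the figures below are hypothetical:

```shell
interval=1                            # seconds between read/write cycles
tko=10                                # missed cycles before a node is dead
qdisk_timeout=$(( interval * tko ))   # ~10s for qdiskd to declare a node dead
cman_timeout=$(( qdisk_timeout * 2 )) # CMAN's timeout: at least double that
echo "$qdisk_timeout $cman_timeout"   # → 10 20
```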
-
-* The total number of votes assigned to the quorum device should be
-equal to or greater than the total number of node-votes in the cluster.
-While it is possible to assign only one (or a few) votes to the quorum
-device, the effects of doing so have not been explored.
-
-* Currently, the quorum disk daemon is difficult to use with CLVM if
-the quorum disk resides on a CLVM logical volume. CLVM requires a
-quorate cluster to correctly operate, which introduces a chicken-and-egg
-problem for starting the cluster: CLVM needs quorum, but the quorum daemon
-needs CLVM (if and only if the quorum device lies on CLVM-managed storage).
-One way to work around this is to *not* set the cluster's expected votes
-to include the quorum daemon's votes. Bring all nodes online, and start
-the quorum daemon *after* the whole cluster is running. This will allow
-the expected votes to increase naturally.
-
-
-2. Algorithms
-
-2.1. Heartbeating & Liveliness Determination
-
-Nodes update individual status blocks on the quorum disk at a user-
-defined rate. Each write of a status block alters the timestamp, which
-is what other nodes use to decide whether a node has hung. After a
-user-defined number of 'misses' (that is, failures to update the
-timestamp), a node is declared offline. After a certain number of 'hits'
-(changed timestamp + "i am alive" state), the node is declared online.
-
-The status block contains additional information, such as a bitmask of
-the nodes that node believes are online. Some of this information is
-used by the master, while some is just for performance recording and
-may be used at a later time. The most important pieces of information
-a node writes to its status block are:
-
- - timestamp
- - internal state (available / not available)
- - score
- - max score
- - vote/bid messages
- - other nodes it thinks are online
-
-
-2.2. Scoring & Heuristics
-
-The administrator can configure up to 10 purely arbitrary heuristics, and
-must exercise caution in doing so. By default, only nodes scoring over
-1/2 of the total maximum score will claim they are available via the
-quorum disk, and a node (master or otherwise) whose score drops too low
-will remove itself (usually, by rebooting).
-
-The heuristics themselves can be any command executable by 'sh -c'. For
-example, in early testing, I used this:
-
- <heuristic program="[ -f /quorum ]" score="10" interval="2"/>
-
-This is a literal sh-ism which tests for the existence of a file called
-"/quorum". Without that file, the node would claim it was unavailable.
-This is an awful example, and should never, ever be used in production,
-but it shows what one could do.
-
-Typically, the heuristics should be snippets of shell code or commands which
-help determine a node's usefulness to the cluster or clients. Ideally, you
-want to add traces for all of your network paths (e.g. check links, or
-ping routers), and methods to detect availability of shared storage.
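As a sketch, such a heuristic might be a small script like the following
(the script name and sentinel path are invented for illustration; any
command runnable by 'sh -c' that exits 0 on success will do):

```shell
#!/bin/sh
# check_path.sh - hypothetical heuristic: pass only while a sentinel
# path (standing in for a shared-storage mount point) is readable.
path="${1:-/var/run/qdisk_sentinel}"
# The exit status of the last command becomes the script's exit status:
# 0 means the heuristic passed, anything else means it failed.
[ -r "$path" ]
```

It would then be referenced from cluster.conf along the lines of
<heuristic program="/check_path.sh" score="1" interval="2"/>.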
-
-
-2.3. Master Election
-
-Only one master is present at any one time in the cluster, regardless of
-how many partitions exist within the cluster itself. The master is
-elected by a simple voting scheme in which the node with the lowest node
-ID that believes it is capable of running (i.e. scores high enough) bids
-for master status.
-If the other nodes agree, it becomes the master. This algorithm is
-run whenever no master is present.
-
-If another node comes online with a lower node ID while a node is still
-bidding for master status, it will rescind its bid and vote for the lower
-node ID. If a master dies or a bidding node dies, the voting algorithm
-is started over. The voting algorithm typically takes two passes to
-complete.
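The bidding rule reduces to "lowest eligible node ID wins". A toy
illustration (node IDs are made up, and this is nothing like the real
on-disk message protocol):

```shell
# Node IDs of the nodes scoring high enough to bid (invented values).
eligible_ids="3 7 2 5"
# The bidding node with the lowest ID wins the master election.
master=$(printf '%s\n' $eligible_ids | sort -n | head -n 1)
echo "$master"   # → 2
```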
-
-Master deaths take marginally longer to recover from than non-master
-deaths, because a new master must be elected before the old master can
-be evicted & fenced.
-
-
-2.4. Master Duties
-
-The master node decides which nodes are and are not in the master
-partition, and handles eviction of dead nodes (both via the quorum disk
-and via the linux-cluster fencing system, using the cman_kill_node() API).
-
-
-2.5. How it All Ties Together
-
-When a master is present, and if the master believes a node to be online,
-that node will advertise to CMAN that the quorum disk is available. The
-master will only grant a node membership if:
-
- (a) CMAN believes the node to be online, and
- (b) that node has made enough consecutive, timely writes to the quorum
- disk.
-
-
-3. Configuration
-
-3.1. The <quorumd> tag
-
-This tag is a child of the top-level <cluster> tag.
-
- <quorumd
- interval="1" This is the frequency of read/write cycles
- tko="10" This is the number of cycles a node must miss
- in order to be declared dead.
- votes="3" This is the number of votes the quorum daemon
- advertises to CMAN when it has a high enough
- score.
- log_level="4" This controls the verbosity of the quorum daemon
- in the system logs. 0 = emergencies; 7 = debug
- log_facility="local4" This controls the syslog facility used by the
- quorum daemon when logging.
- status_file="/foo" Write internal states out to this file
- periodically ("-" = use stdout).
- min_score="3" Absolute minimum score for a node to consider
- itself "alive". If omitted, or set to 0, the
- default function "floor((n+1)/2)" is used.
- device="/dev/sda1" This is the device the quorum daemon will use.
- This device must be the same on all nodes.
- label="mylabel"/> This overrides the device field if present.
- If specified, the quorum daemon will read
- /proc/partitions and check for qdisk signatures
- on every block device found, comparing the label
- against the specified label. This is useful in
- configurations where the block device name
- differs on a per-node basis.
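The min_score default above is easy to check with integer arithmetic
(the value of n below is made up; shell integer division already floors):

```shell
n=10                         # total maximum score across all heuristics
min_score=$(( (n + 1) / 2 )) # default min_score: floor((n+1)/2)
echo "$min_score"   # → 5
```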
-
-
-3.2. The <heuristic> tag
-
-This tag is a child of the <quorumd> tag.
-
- <heuristic
- program="/test.sh" This is the program used to determine if this
- heuristic is alive. This can be anything which
- may be executed by "/bin/sh -c". A return value
- of zero indicates success.
- score="1" This is the weight of this heuristic. Be careful
- when determining scores for heuristics.
- interval="2"/> This is the frequency at which we poll the
- heuristic.
-
-3.3. Example
-
- <quorumd interval="1" tko="10" votes="3" device="/dev/gnbd/qdisk">
- <heuristic program="ping routerA -c1 -t1" score="1" interval="2"/>
- <heuristic program="ping routerB -c1 -t1" score="1" interval="2"/>
- <heuristic program="ping routerC -c1 -t1" score="1" interval="2"/>
- </quorumd>
-
-3.4. Heuristic score considerations
-
-* Heuristic timeouts should be set high enough to allow the previous run
-of a given heuristic to complete.
-
-* Heuristic scripts returning anything except 0 as their return code
-are considered failed.
-
-* The worst-case for improperly configured quorum heuristics is a race
-to fence where two partitions simultaneously try to kill each other.
-
-3.5. Creating a quorum disk partition
-
-3.5.1. The mkqdisk utility.
-
-The mkqdisk utility can create and list currently configured quorum disks
-visible to the local node.
-
- mkqdisk -L List available quorum disks.
-
- mkqdisk -f <label> Find a quorum device by the given label.
-
- mkqdisk -c <device> -l <label>
- Initialize <device> and name it <label>. This
- will destroy all data on the device, so be careful
- when running this command.
+See qdisk(5) for setup and other information