[Cluster-devel] cluster/cman/qdisk README
lhh at sourceware.org
Wed Aug 16 14:54:18 UTC 2006
CVSROOT: /cvs/cluster
Module name: cluster
Branch: STABLE
Changes by: lhh at sourceware.org 2006-08-16 14:54:18
Modified files:
cman/qdisk : README
Log message:
Sync readme to RHEL4 branch
Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/qdisk/README.diff?cvsroot=cluster&only_with_tag=STABLE&r1=1.4.2.1&r2=1.4.2.2
--- cluster/cman/qdisk/README 2006/07/21 18:01:38 1.4.2.1
+++ cluster/cman/qdisk/README 2006/08/16 14:54:18 1.4.2.2
@@ -1,274 +1 @@
-qdisk 1.0 - a disk-based quorum algorithm for Linux-Cluster
-
-(C) 2006 Red Hat, Inc.
-
-1. Overview
-
-1.1. Problem
-
-In some situations, it may be necessary or desirable to sustain
-a majority node failure of a cluster without introducing the need for
-asymmetric cluster configurations (e.g. client-server layouts, or
-heavily-weighted voting nodes).
-
-1.2. Design Requirements
-
-* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
-danger of a simple network partition causing a split brain. That is, we
-need to be able to ensure that the majority failure case is not merely
-the result of a network partition.
-
-* Ability to use external reasons for deciding which partition is
-the quorate partition in a partitioned cluster. For example, a user may
-have a service running on one node, and that node must always be the master
-in the event of a network partition. Or, a node might lose all network
-connectivity except the cluster communication path - in which case, a
-user may wish that node to be evicted from the cluster.
-
-* Integration with CMAN. CMAN must be able to run both with and
-without us. Linux-Cluster does not normally require a quorum disk -
-introducing new requirements on how Linux-Cluster operates
-is not allowed.
-
-* Data integrity. In order to recover from a majority failure, fencing
-is required. The fencing subsystem is already provided by Linux-Cluster.
-
-* Non-reliance on hardware- or protocol-specific methods (e.g. SCSI
-reservations). This ensures the quorum disk algorithm can be used on the
-widest range of hardware configurations possible.
-
-* Little or no memory allocation after initialization. In critical paths
-during failover, we do not want to risk being killed in a memory
-pressure situation because we incur a page fault and the Linux OOM
-killer responds.
-
-
-1.3. Hardware Configuration Considerations
-
-1.3.1. Concurrent, Synchronous, Read/Write Access
-
-This daemon requires a shared block device with concurrent read/write
-access from all nodes in the cluster. The shared block device can be
-a multi-port SCSI RAID array, a Fibre Channel RAID SAN, a RAIDed iSCSI
-target, or even GNBD. The quorum daemon uses O_DIRECT to write to the
-device.
-
-1.3.2. Bargain-basement JBODs need not apply
-
-There is a minimum performance requirement inherent when using disk-based
-cluster quorum algorithms, so design your cluster accordingly. Using a
-cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause
-problems at the first load spike. Plan your loads accordingly; a node's
-inability to write to the quorum disk in a timely manner will cause the
-cluster to evict the node. Using host-RAID or multi-initiator parallel
-SCSI configurations with the qdisk daemon is unlikely to work, and will
-probably cause administrators a lot of frustration. That having been
-said, because the timeouts are configurable, most hardware should work
-if the timeouts are set high enough.
-
-1.3.3. Fencing is Required
-
-In order to maintain data integrity under all failure scenarios, use of
-this quorum daemon requires adequate fencing, preferably power-based
-fencing.
-
-
-1.4. Limitations
-
-* At this time, this daemon only supports a maximum of 16 nodes.
-
-* Cluster node IDs must be statically configured in cluster.conf and
-must be numbered from 1..16 (there can be gaps, of course).
-
-* Cluster node votes should be more or less equal.
-
-* CMAN must be running before the qdisk program can start. This
-limitation will be removed before a production release.
-
-* CMAN's eviction timeout should be at least 2x the quorum daemon's
-to give the quorum daemon adequate time to converge on a master during a
-failure + load spike situation.
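As a back-of-the-envelope check, the quorum daemon's own eviction window
is roughly interval * tko seconds (these attributes are described in
section 3.1); the figures below are hypothetical:

```shell
interval=1                            # seconds between read/write cycles
tko=10                                # missed cycles before a node is dead
qdisk_timeout=$(( interval * tko ))   # ~10s for qdiskd to declare a node dead
cman_timeout=$(( qdisk_timeout * 2 )) # CMAN's timeout: at least double that
echo "$qdisk_timeout $cman_timeout"   # → 10 20
```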
-
-* The total number of votes assigned to the quorum device should be
-equal to or greater than the total number of node-votes in the cluster.
-While it is possible to assign only one (or a few) votes to the quorum
-device, the effects of doing so have not been explored.
-
-* Currently, the quorum disk daemon is difficult to use with CLVM if
-the quorum disk resides on a CLVM logical volume. CLVM requires a
-quorate cluster to correctly operate, which introduces a chicken-and-egg
-problem for starting the cluster: CLVM needs quorum, but the quorum daemon
-needs CLVM (if and only if the quorum device lies on CLVM-managed storage).
-One way to work around this is to *not* set the cluster's expected votes
-to include the quorum daemon's votes. Bring all nodes online, and start
-the quorum daemon *after* the whole cluster is running. This will allow
-the expected votes to increase naturally.
-
-
-2. Algorithms
-
-2.1. Heartbeating & Liveliness Determination
-
-Nodes update individual status blocks on the quorum disk at a user-
-defined rate. Each write of a status block alters the timestamp, which
-is what other nodes use to decide whether a node has hung. After a
-user-defined number of 'misses' (that is, failures to update the
-timestamp), a node is declared offline. After a certain number of 'hits'
-(changed timestamp + "i am alive" state), the node is declared online.
-
-The status block contains additional information, such as a bitmask of
-the nodes that node believes are online. Some of this information is
-used by the master, while some is just for performance recording and
-may be used at a later time. The most important pieces of information
-a node writes to its status block are:
-
- - timestamp
- - internal state (available / not available)
- - score
- - max score
- - vote/bid messages
- - other nodes it thinks are online
-
-
-2.2. Scoring & Heuristics
-
-The administrator can configure up to 10 purely arbitrary heuristics, and
-must exercise caution in doing so. By default, only nodes scoring over
-1/2 of the total maximum score will claim they are available via the
-quorum disk, and a node (master or otherwise) whose score drops too low
-will remove itself (usually, by rebooting).
-
-The heuristics themselves can be any command executable by 'sh -c'. For
-example, in early testing, I used this:
-
- <heuristic program="[ -f /quorum ]" score="10" interval="2"/>
-
-This is a literal sh-ism which tests for the existence of a file called
-"/quorum". Without that file, the node would claim it was unavailable.
-This is an awful example, and should never, ever be used in production,
-but it shows what one could do.
-
-Typically, the heuristics should be snippets of shell code or commands which
-help determine a node's usefulness to the cluster or clients. Ideally, you
-want to add traces for all of your network paths (e.g. check links, or
-ping routers), and methods to detect availability of shared storage.
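As a sketch, such a heuristic might be a small script like the following
(the script name and sentinel path are invented for illustration; any
command runnable by 'sh -c' that exits 0 on success will do):

```shell
#!/bin/sh
# check_path.sh - hypothetical heuristic: pass only while a sentinel
# path (standing in for a shared-storage mount point) is readable.
path="${1:-/var/run/qdisk_sentinel}"
# The exit status of the last command becomes the script's exit status:
# 0 means the heuristic passed, anything else means it failed.
[ -r "$path" ]
```

It would then be referenced from cluster.conf along the lines of
<heuristic program="/check_path.sh" score="1" interval="2"/>.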
-
-
-2.3. Master Election
-
-Only one master is present at any one time in the cluster, regardless of
-how many partitions exist within the cluster itself. The master is
-elected by a simple voting scheme in which the node with the lowest node
-ID that believes it is capable of running (i.e. scores high enough) bids
-for master status.
-If the other nodes agree, it becomes the master. This algorithm is
-run whenever no master is present.
-
-If another node comes online with a lower node ID while a node is still
-bidding for master status, it will rescind its bid and vote for the lower
-node ID. If a master dies or a bidding node dies, the voting algorithm
-is started over. The voting algorithm typically takes two passes to
-complete.
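The bidding rule reduces to "lowest eligible node ID wins". A toy
illustration (node IDs are made up, and this is nothing like the real
on-disk message protocol):

```shell
# Node IDs of the nodes scoring high enough to bid (invented values).
eligible_ids="3 7 2 5"
# The bidding node with the lowest ID wins the master election.
master=$(printf '%s\n' $eligible_ids | sort -n | head -n 1)
echo "$master"   # → 2
```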
-
-Master deaths take marginally longer to recover from than non-master
-deaths, because a new master must be elected before the old master can
-be evicted & fenced.
-
-
-2.4. Master Duties
-
-The master node decides which nodes are and are not in the master
-partition, and handles eviction of dead nodes (both via the quorum disk
-and via the linux-cluster fencing system, using the cman_kill_node() API).
-
-
-2.5. How it All Ties Together
-
-When a master is present, and if the master believes a node to be online,
-that node will advertise to CMAN that the quorum disk is available. The
-master will only grant a node membership if:
-
- (a) CMAN believes the node to be online, and
- (b) that node has made enough consecutive, timely writes to the quorum
- disk.
-
-
-3. Configuration
-
-3.1. The <quorumd> tag
-
-This tag is a child of the top-level <cluster> tag.
-
- <quorumd
- interval="1" This is the frequency of read/write cycles
- tko="10" This is the number of cycles a node must miss
- in order to be declared dead.
- votes="3" This is the number of votes the quorum daemon
- advertises to CMAN when it has a high enough
- score.
- log_level="4" This controls the verbosity of the quorum daemon
- in the system logs. 0 = emergencies; 7 = debug
- log_facility="local4" This controls the syslog facility used by the
- quorum daemon when logging.
- status_file="/foo" Write internal states out to this file
- periodically ("-" = use stdout).
- min_score="3" Absolute minimum score for a node to consider
- itself "alive". If omitted, or set to 0, the
- default function "floor((n+1)/2)" is used.
- device="/dev/sda1" This is the device the quorum daemon will use.
- This device must be the same on all nodes.
- label="mylabel"/> This overrides the device field if present.
- If specified, the quorum daemon will read
- /proc/partitions and check for qdisk signatures
- on every block device found, comparing the label
- against the specified label. This is useful in
- configurations where the block device name
- differs on a per-node basis.
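The min_score default above is easy to check with integer arithmetic
(the value of n below is made up; shell integer division already floors):

```shell
n=10                         # total maximum score across all heuristics
min_score=$(( (n + 1) / 2 )) # default min_score: floor((n+1)/2)
echo "$min_score"   # → 5
```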
-
-
-3.2. The <heuristic> tag
-
-This tag is a child of the <quorumd> tag.
-
- <heuristic
- program="/test.sh" This is the program used to determine if this
- heuristic is alive. This can be anything which
- may be executed by "/bin/sh -c". A return value
- of zero indicates success.
- score="1" This is the weight of this heuristic. Be careful
- when determining scores for heuristics.
- interval="2"/> This is the frequency at which we poll the
- heuristic.
-
-3.3. Example
-
- <quorumd interval="1" tko="10" votes="3" device="/dev/gnbd/qdisk">
- <heuristic program="ping routerA -c1 -t1" score="1" interval="2"/>
- <heuristic program="ping routerB -c1 -t1" score="1" interval="2"/>
- <heuristic program="ping routerC -c1 -t1" score="1" interval="2"/>
- </quorumd>
-
-3.4. Heuristic score considerations
-
-* Heuristic timeouts should be set high enough to allow the previous run
-of a given heuristic to complete.
-
-* Heuristic scripts returning anything except 0 as their return code
-are considered failed.
-
-* The worst-case for improperly configured quorum heuristics is a race
-to fence where two partitions simultaneously try to kill each other.
-
-3.5. Creating a quorum disk partition
-
-3.5.1. The mkqdisk utility.
-
-The mkqdisk utility can create and list currently configured quorum disks
-visible to the local node.
-
- mkqdisk -L List available quorum disks.
-
- mkqdisk -f <label> Find a quorum device by the given label.
-
- mkqdisk -c <device> -l <label>
- Initialize <device> and name it <label>. This
- will destroy all data on the device, so be careful
- when running this command.
+See qdisk(5) for setup and other information