[Linux-cluster] lost quorum, but the cluster services and GFS are still up

David Teigland teigland at redhat.com
Mon Mar 13 22:24:03 UTC 2006


On Mon, Mar 13, 2006 at 10:31:18AM -0700, Dex Chen wrote:
> Hi,
>
> I believe I saw something unusual here.
>
> I have a 3-node cluster (with GFS) using CMAN. After I shut down 2
> nodes within a short time span, the cluster showed it had lost
> quorum, but when I ran clustat on the third node, it reported the
> cluster has 3 nodes (2 offline) and that the other services are up.
> I was still able to access/read the shared storage. cman_tool shows
> the cluster lost quorum and that activity is blocked. What I expected
> was that access to the shared storage and other services would not be
> allowed at all once the cluster lost quorum. Has anyone seen anything
> similar? What/where should I look into?

Quorum is the normal method of preventing an instance of some cluster
subsystem or application (a gfs mount-group, a dlm lock-space, an
rgmanager service/app/resource, etc.) from being enabled on both sides of a
partitioned cluster.  It does this by preventing the creation of new
instances in inquorate clusters and by preventing recovery (re-enabling)
of existing instances in inquorate clusters.
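
To make the arithmetic concrete: with the default of one vote per
node, a cluster is quorate only while the votes it can count reach a
simple majority of the expected votes.  Here's a minimal sketch of
that rule in illustrative C (not CMAN source; the names are made up):

    #include <stdio.h>

    /* majority rule: more than half the expected votes, using
     * integer division; e.g. expected_votes = 3 gives quorum = 2 */
    static int quorum(int expected_votes)
    {
            return expected_votes / 2 + 1;
    }

    static int is_quorate(int votes_present, int expected_votes)
    {
            return votes_present >= quorum(expected_votes);
    }

    int main(void)
    {
            printf("quorum needed: %d\n", quorum(3));               /* 2 */
            printf("1 node up -> quorate? %d\n", is_quorate(1, 3)); /* 0 */
            printf("2 nodes up -> quorate? %d\n", is_quorate(2, 3)); /* 1 */
            return 0;
    }

So in your 3-node cluster quorum is 2, and once two nodes are down the
survivor is inquorate, which matches what cman_tool reported.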

There's one special case where we also rely on fencing to prevent an
instance from being enabled on both sides of a split at once.  It's where
all the nodes using the instance before the failure/partition also exist
on the inquorate side of the split afterward.  If a quorate partition then
forms, the first thing it does is fence all nodes it can't talk with,
which are the nodes on the inquorate side.  The quorate side then enables
instances of dlm/gfs/etc, the fencing having guaranteed there are none
elsewhere.
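
Sketched out, the ordering on the quorate side looks like this
(illustrative C, not cluster-suite source; fence_node() and
enable_instances() are hypothetical stand-ins for the real fencing and
recovery hooks):

    #include <stdbool.h>
    #include <stdio.h>

    #define NODES 3

    static void fence_node(int n)      { printf("fencing node %d\n", n); }
    static void enable_instances(void) { printf("enabling dlm/gfs\n"); }

    static void on_quorum_formed(const bool reachable[NODES])
    {
            /* step 1: fence every node the quorate partition can't
             * talk to, i.e. everything on the inquorate side */
            for (int n = 0; n < NODES; n++)
                    if (!reachable[n])
                            fence_node(n);

            /* step 2: only after fencing completes, enable dlm/gfs
             * instances, so no instance can be active on both sides
             * of the split at once */
            enable_instances();
    }

    int main(void)
    {
            bool reachable[NODES] = { true, true, false }; /* node 2 cut off */
            on_quorum_formed(reachable);
            return 0;
    }

The point is the ordering: fencing finishes before any instance is
re-enabled, which is what guarantees there are none left running on
the inquorate side.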

Apart from this, each service/instance/system responds internally to the
loss of quorum in its own way.  In the special case I described, where all
the nodes using the instance remain after the event, dlm and gfs both
continue to run normally on the inquorate nodes; there's been no reason to
do otherwise.

I suspect what you saw is that nodes A and B failed or were shut down
but weren't
using any of the dlm/gfs instances that C was.  C was then this special
case and dlm/gfs continued to run normally.  If A and B had come back and
formed a partitioned, quorate cluster, they would have fenced C before
enabling any dlm or gfs instances.

Dave



