[linux-lvm] new to cLVM - some principal questions

Fri Nov 25 17:49:48 UTC 2011

Digimer wrote:

> >>
> >> Fencing and STONITH are two names for the same thing; fencing was
> >> traditionally used in Red Hat clusters and STONITH in
> >> heartbeat/pacemaker clusters. It's arguable which is preferable, but I
> >> personally prefer fencing as it more directly describes the goal of
> >> "fencing off" (isolating) a failed node from the rest of the cluster.

Yes, but "STONITH" is a wonderful acronym.

> >>
> >> Now let's talk about how fencing fits;
> >>
> >> Let's assume that Node 1 hangs or dies while it still holds the lock.
> >> The fenced daemon will be triggered and it will notify DLM that there
> >> is a problem, and DLM will block all further requests. Next, fenced
> >> tries to fence the node using one of its configured fence methods. It
> >> will try the first, then the second, then the first again, looping
> >> forever until one of the fence calls succeeds.
> >>
> >> Once a fence call succeeds, fenced notifies DLM that the node is gone
> >> and then DLM will clean up any locks formerly held by Node 1. After
> >> this, Node 2 can get a lock, despite Node 1 never itself releasing it.
> >>
> >> Now, let's imagine that a fence agent returned success but the node
> >> wasn't actually fenced. Let's also assume that Node 1 was hung, not
> >> dead.
> >>
> >> So DLM thinks that Node 1 was fenced, clears its old locks and gives a
> >> new one to Node 2. Node 2 goes about recovering the filesystem and
> >> then proceeds to write new data. At some point later, Node 1
> >> unfreezes, thinks it still has an exclusive lock on the LV and
> >> finishes writing to the disk.
> >
> > But you said "So DLM thinks that Node 1 was fenced, clears its old
> > locks and gives a new one to Node 2". How can Node 1 get access after
> > unfreezing, when the lock is cleared?
>
> DLM clears the lock, but it has no way of telling Node 1 that the lock
> is no longer valid (remember, it thinks the node has been ejected from
> the cluster, removing any communication). Meanwhile, Node 1 has no
> reason to think that the lock it holds is no longer valid, so it just
> goes ahead and accesses the storage figuring it has exclusive access
> still.
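
To make the scenario quoted above concrete, here is a minimal Python sketch of it. It is a toy model only, not how fenced or DLM are implemented: the classes LockManager and Node, the functions fence() and run_scenario(), and the method names "first_method" and "second_method" are all hypothetical. The quoted text says fenced loops forever; the sketch caps the number of attempts so that it always terminates.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, List, Optional, Tuple


@dataclass
class LockManager:
    """Toy stand-in for DLM (hypothetical): tracks the exclusive lock owner."""
    owner: Optional[str] = None
    blocked: bool = False  # True while a fence operation is pending

    def acquire(self, node: str) -> bool:
        # Requests are refused while blocked or while someone else holds the lock.
        if self.blocked or self.owner is not None:
            return False
        self.owner = node
        return True

    def node_fenced(self, node: str) -> None:
        # Called only after a fence agent has REPORTED success.
        if self.owner == node:
            self.owner = None
        self.blocked = False


@dataclass
class Node:
    name: str
    believes_locked: bool = False  # the node's own, possibly stale, view

    def write(self, disk: List[str], data: str) -> None:
        # The shared disk never consults the lock manager; the node
        # only checks its own belief about the lock.
        if self.believes_locked:
            disk.append(f"{self.name}:{data}")


def fence(methods: List[Tuple[str, Callable[[], bool]]],
          max_attempts: int = 10) -> Optional[str]:
    """Try each method in order, cycling, until one reports success.

    Assumption: capped at max_attempts here; the quoted text describes
    an endless loop.
    """
    for attempt, (name, agent) in enumerate(cycle(methods)):
        if attempt >= max_attempts:
            return None
        if agent():
            return name
    return None


def run_scenario() -> List[str]:
    disk: List[str] = []
    dlm = LockManager()
    node1, node2 = Node("node1"), Node("node2")

    # Node 1 takes the lock, then hangs while still holding it.
    node1.believes_locked = dlm.acquire(node1.name)
    dlm.blocked = True  # the lock manager blocks all further requests

    # The first method fails; the second reports success although the
    # node was never actually fenced (the faulty-agent case).
    methods = [("first_method", lambda: False),
               ("second_method", lambda: True)]
    if fence(methods) is not None:
        dlm.node_fenced(node1.name)  # lock cleared, Node 1 is not told

    # Node 2 now gets the lock and writes.
    node2.believes_locked = dlm.acquire(node2.name)
    node2.write(disk, "recovery")

    # Node 1 unfreezes and finishes its write using its stale belief.
    node1.write(disk, "stale")
    return disk


if __name__ == "__main__":
    print(run_scenario())
```

In this model both writes reach the disk, because clearing the lock only changes the lock manager's state. Nothing stands between Node 1 and the storage except Node 1's own belief that it holds the lock.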

But doesn't DLM prevent Node 1 from accessing the filesystem in this situation?
DLM "knows" that Node 1's lock has been cleared. Can't DLM "say" to Node 1:
"You think you have a valid lock, but you don't. Sorry, no access!"

Bernd
