
Re: [linux-lvm] new to cLVM - some principal questions

On 11/25/2011 12:49 PM, Lentes, Bernd wrote:

Digimer wrote:

Fencing and STONITH are two names for the same thing; "fencing" was traditionally used in Red Hat clusters and "STONITH" in heartbeat/pacemaker clusters. It's arguable which is preferable, but I personally prefer "fencing" as it more directly describes the goal of "fencing off" (isolating) a failed node from the rest of the cluster.

Yes, but "STONITH" is a wonderful acronym.

Now let's talk about how fencing fits:

Let's assume that Node 1 hangs or dies while it still holds the lock. The fenced daemon will be triggered and will notify DLM that there is a problem, and DLM will block all further lock requests. Next, fenced tries to fence the node using one of its configured fence methods. It will try the first, then the second, then the first again, looping forever until one of the fence calls succeeds.
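That retry behaviour can be sketched roughly like this (a toy illustration only, not the actual fenced daemon; the fence agent names are invented):

```python
import time

def fence_node(node, methods, delay=0):
    """Try each configured fence method in order, looping forever
    until one reports success (mirroring fenced's behaviour)."""
    while True:
        for method in methods:
            if method(node):          # a fence agent returns True on success
                return method.__name__
        time.sleep(delay)             # pause, then retry the whole list

# Hypothetical fence agents for illustration:
def ipmi_reset(node):
    return node.get("ipmi_ok", False)

def switched_pdu(node):
    return node.get("pdu_ok", False)

# A node whose IPMI interface is dead but whose switched PDU responds:
winner = fence_node({"ipmi_ok": False, "pdu_ok": True},
                    [ipmi_reset, switched_pdu])
print(winner)   # prints: switched_pdu
```

Note that if no method ever succeeds, the loop never exits; that is the point of fencing, since the cluster must stay blocked rather than risk the failed node touching shared storage.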

Once a fence call succeeds, fenced notifies DLM that the node is gone, and DLM then cleans up any locks formerly held by Node 1. After this, Node 2 can get a lock, despite Node 1 never itself releasing it.

Now, let's imagine that a fence agent returned success but the node wasn't actually fenced. Let's also assume that Node 1 was hung, not dead.

So DLM thinks that Node 1 was fenced, clears its old locks and gives a new one to Node 2. Node 2 goes about recovering the filesystem and then proceeds to write new data. At some point later, Node 1 unfreezes, thinks it still has an exclusive lock on the LV and finishes writing to the disk.

But you said "So DLM thinks that Node 1 was fenced, clears its old locks and gives a new one to Node 2." How can Node 1 get access after unfreezing, when the lock is cleared?

DLM clears the lock, but it has no way of telling Node 1 that the lock is no longer valid (remember, it thinks the node has been ejected from the cluster, removing any communication). Meanwhile, Node 1 has no reason to think that the lock it holds is no longer valid, so it just goes ahead and accesses the storage, figuring it still has exclusive access.

But does DLM not prevent Node 1 from accessing the filesystem in this situation? DLM "knows" that the lock from Node 1 has been cleared. Can't DLM "say" to Node 1: "You think you have a valid lock, but you don't. Sorry, no access!"


Nope, it doesn't work that way. There is no way for DLM to tell the server to discard its locks. First of all, DLM thinks the node is gone anyway. Secondly, Node 1 could have hung in the middle of a write; when it recovers, it can quite literally finish that in-flight write. DLM doesn't act as a barrier to the raw data... it's merely a lock manager.
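A toy model of the failure mode may make this clearer. The lock manager clears the hung node's lock and grants it to Node 2, but Node 1 still holds its own cached, now-stale belief that it owns the lock, and writes anyway. All names here are invented for illustration; real DLM is far more involved:

```python
class ToyLockManager:
    def __init__(self):
        self.holder = None

    def grant(self, node):
        if self.holder is None:
            self.holder = node
            node.lock_held = True     # the node caches its own view of the lock
            return True
        return False

    def node_fenced(self, node):
        # Manager-side cleanup only; it cannot reach into the "dead"
        # node to clear node.lock_held -- that node is presumed gone.
        if self.holder is node:
            self.holder = None

class Node:
    def __init__(self, name):
        self.name = name
        self.lock_held = False

    def write(self, disk):
        # The node checks only its *local* belief about the lock.
        if self.lock_held:
            disk.append(self.name)

dlm = ToyLockManager()
node1, node2 = Node("node1"), Node("node2")
disk = []

dlm.grant(node1)          # Node 1 takes the lock, then hangs
dlm.node_fenced(node1)    # fence agent wrongly reports success
dlm.grant(node2)          # Node 2 gets a "clean" lock
node2.write(disk)
node1.write(disk)         # Node 1 unfreezes and writes too -- corruption
print(disk)               # prints: ['node2', 'node1']
```

Both nodes end up writing, which is exactly why a fence agent that falsely reports success is so dangerous: nothing downstream of the lock manager stands between the unfenced node and the raw storage.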

E-Mail:              digimer alteeve com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron
