[Linux-cluster] DLM behavior after lockspace recovery

Jeff jeff at intersystems.com
Sun Oct 17 01:58:33 UTC 2004


Saturday, October 16, 2004, 8:50:04 PM, Daniel Phillips wrote:

> On Saturday 16 October 2004 17:14, Jeff wrote:
>> In your example of a counter which tracks the # of operations
>> in progress, regenerating the LVB value during failover from
>> the last known good value among the surviving nodes doesn't
>> do any good. There is no way to avoid recalculating the correct
>> value during the failover process.
>>
>> OTOH, in my example where the value in the lock value block
>> is used as a block version # it makes perfect sense to use
>> the last known value from the surviving nodes.

> Going back over the early part of the thread, you weren't originally
> advocating this, you just thought that might be the way vaxcluster did
> it, and you thought RSB$L_VALSEQNUM might have something to do with it.

You are correct, this is a guess on my part. I suspect it is
accurate although I haven't tested it. Given that the VALNOTVALID
state only exists (on VMS systems) until the LVB is written, I don't
see how it could work any other way. Also, I don't see any reason for
the VALSEQNUM field if it isn't going to be used in this fashion.
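
To make the model concrete, here's a rough C sketch of the
version-number pattern I have in mind. None of these names come from
a real DLM interface; the lock status block, flag, and LVB layout are
all made up for illustration:

    #include <stdint.h>
    #include <string.h>

    #define LVB_LEN 16                /* VMS-style 16-byte value block */

    struct lksb {                     /* stand-in for a lock status block */
        uint8_t lvb[LVB_LEN];         /* value block returned with a grant */
        int     valnotvalid;          /* set if recovery lost the LVB */
    };

    struct cached_block {
        uint32_t version;             /* version we last read the data at */
        uint8_t  data[512];           /* locally cached copy of the block */
    };

    /* After a lock grant/convert returns: decide whether the cached
     * copy of the block is still current. */
    static int cache_is_current(const struct lksb *lksb,
                                const struct cached_block *cache)
    {
        uint32_t lvb_version;

        if (lksb->valnotvalid)
            return 0;                 /* can't trust the LVB: reread */

        memcpy(&lvb_version, lksb->lvb, sizeof lvb_version);
        return lvb_version == cache->version;
    }

    /* While holding the lock exclusively: bump the version so other
     * nodes see their cached copies as stale. */
    static void mark_block_modified(struct lksb *lksb,
                                    struct cached_block *cache)
    {
        uint32_t lvb_version;

        memcpy(&lvb_version, lksb->lvb, sizeof lvb_version);
        lvb_version++;
        memcpy(lksb->lvb, &lvb_version, sizeof lvb_version);
        cache->version = lvb_version; /* our copy is now current; the new
                                       * LVB goes back on release/demote */
    }

The point is that a stale version number is harmless: the reader just
rereads the block. That's why handing out the last known good value
from a surviving node is safe for this usage.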

> Let's keep hunting around for a way of handling this with flags alone,
> ok?  Sequencing the lvbs is a rather, ahem, heavyweight approach that
> consumes memory in every lvb user (more probably, every dlm user 
> whether they use lvbs or not) and benefits only a small subset of lvb
> applications.  This extra sequence number has to travel over the net,
> through sockets and through various other interface bits.  More bloat
> in this department has to be seen as a bad thing.

That's fine. I don't see this as all that much bloat. We're talking
about one 32-bit counter per resource, not per lock. The sequence #
only has to travel with lock requests which request value blocks, so
we're talking about 33 32-bit #'s instead of 32 (discounting
everything else in the packet).
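
In other words, something like this (a hypothetical message layout
with made-up names, just to show the size of the addition):

    #include <stdint.h>

    #define LVB_LEN 32                /* whatever the LVB length is */

    struct lvb_payload {
        uint8_t  lvb[LVB_LEN];        /* the value block itself */
        uint32_t lvb_seq;             /* proposed per-resource sequence # */
    };

One extra 32-bit word, carried only on requests that ask for the
value block.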

As for benefiting a small subset of lvb applications, how many
do you think exist today? Of those that exist, what model do you
think they are organized around?

> It seems to me that the VALNOTVALID flag by itself isn't enough for you
> because one of your nodes might update the lvb, and consequently some
> other node may not ever see the VALNOTVALID flag, and therefore not 
> know that it should reset its cached counter.  So how about an 
> additional flag, say, INVALIDATED, that the lock master hands out to
> any lvb reader the first time it reads an lvb for which recovery was
> not possible, whether the lvb was subsequently written or not.  Your
> application looks at INVALIDATED to know that it has to reset its 
> counter and ignores VALNOTVALID.  Does this work for you?

This works for me assuming that the INVALIDATED flag is only passed
out to nodes which held locks at the time of the failure. A node
which requests a new lock should only see the VALNOTVALID flag
if the LVB is still invalid. With this scheme it would be best if
any one of the current LVBs was picked to be the LVB after failover
rather than zeroing it.

This also avoids the nasty problem of how the cluster deals with
rollover in the VALSEQNUM field.
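
Roughly, the application side would look like this (a C sketch;
INVALIDATED is your proposed flag, and everything here, flag values
included, is made up):

    #include <stdint.h>

    #define SBF_VALNOTVALID 0x01   /* LVB lost and not yet rewritten */
    #define SBF_INVALIDATED 0x02   /* proposed: first read after a loss */

    struct lksb {
        uint8_t  flags;            /* status flags returned with a grant */
        uint32_t counter;          /* counter carried in the value block */
    };

    /* application-specific recount; stub for the sketch */
    static uint32_t recount_operations_in_progress(void)
    {
        return 0;
    }

    static void on_lock_granted(struct lksb *lksb, uint32_t *local)
    {
        if (lksb->flags & SBF_INVALIDATED) {
            /* recovery lost the authoritative count: rebuild it
             * from scratch rather than trusting the LVB contents */
            *local = recount_operations_in_progress();
            lksb->counter = *local;   /* written back on release */
            return;
        }
        /* VALNOTVALID alone can be ignored for this application */
        *local = lksb->counter;
    }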

> I agree that VALNOTVALID is a useful flag that we should have, but isn't
> useful here.  I also agree with your discomfort about setting the lvb
> arbitrarily to zero, but having a flag to detect that reduces the 
> annoyance considerably, don't you think?

>> Another example is a lock whose LVB doesn't change once it has
>> been initialized. In this case it doesn't matter whether the
>> value block is marked invalid or not. The contents are still
>> useful.

> But this value could still disappear if all the readers drop the lock,
> so presumably you have a way of recovering it, in which case the above
> flags would work for this problem as well.

The value which matters can't disappear. If any surviving node
holds a lock, then the latest value among the surviving nodes
is all that matters. If no one holds the lock, the contents of the
value block don't matter.
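
For the write-once case the sketch is even simpler (again, all names
are made up, and I'm assuming zero means "not yet initialized"):

    #include <stdint.h>

    struct lksb {
        uint8_t  flags;
        uint32_t value;            /* write-once value kept in the LVB */
    };

    /* application-specific computation; stub for the sketch */
    static uint32_t compute_initial_value(void)
    {
        return 42;
    }

    /* Call while holding the lock.  If recovery zeroed the LVB, then
     * no node held a lock at the time, so recomputing is always
     * correct; if any node did, its copy of the value survives. */
    static uint32_t get_write_once_value(struct lksb *lksb)
    {
        if (lksb->value != 0)
            return lksb->value;    /* initialized: flags don't matter */

        lksb->value = compute_initial_value();
        return lksb->value;
    }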

> Also, in this case, randomly picking a value from one of the surviving
> nodes (and setting VALNOTVALID) will do as well as a sequence number
> scheme.  Such an api change would require changes to the gfs harness
> plugin, which isn't such a big deal at this point since gfs+gdlm is 
> still pre-alpha.
