[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover



Neil Brown wrote:

One thing that has been bothering me is that sometimes the
"filesystem" (in the guise of an fsid) is used to talk to the kernel
about failover issues (when flushing locks or restarting the grace
period) and sometimes the local network address is used (when talking
with statd).

This is a perception issue - it depends on how the design is described. More on this later.

I would rather use a single identifier.  In my previous email I was
leaning towards using the filesystem as the single identifier.  Today
I'm leaning the other way - to using the local network address.
I guess you're juggling too many things and have forgotten why we came down this route? We started the discussion using the network interface (to drop the locks) but found it wouldn't work well on local filesystems such as ext3. There is really no control over which local (server-side) interface NFS clients will use (though it shouldn't be hard to implement one). When the fail-over server starts to remove the locks, it needs a way to find *all* of the locks associated with the will-be-moved partition; this is what allows the umount to succeed. The server IP address alone can't guarantee that. That was the reason we switched to fsid. Also remember this is NFS v2/v3 - clients have no knowledge of server migration.
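To make the fsid argument concrete, the cleanup has to be able to walk every lock and match on the export, not on an address. A simplified sketch (the structures and names here are hypothetical, not the actual lockd code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of lockd's per-lock state: each
 * held lock remembers the fsid of the export it lives on. */
struct nlm_lock_entry {
    unsigned int fsid;              /* export's fsid= option */
    int          held;              /* 1 while the lock is active */
    struct nlm_lock_entry *next;
};

/* Drop every lock on the export identified by fsid, so the
 * filesystem can be unmounted and moved to the backup server.
 * Returns the number of locks released. */
static int nlm_unlock_by_fsid(struct nlm_lock_entry *locks,
                              unsigned int fsid)
{
    int released = 0;
    struct nlm_lock_entry *l;

    for (l = locks; l != NULL; l = l->next) {
        if (l->held && l->fsid == fsid) {
            l->held = 0;            /* release this lock */
            released++;
        }
    }
    return released;
}
```

Keying on the destination address instead would miss locks taken through another interface, which is exactly the umount failure described above.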

Now, let's move back to the first paragraph. An active-active failover can be described as a five-step process:

Step 1. Quiesce the floating network address.
Step 2. Move the exported filesystem directories from Server A to Server B.
Step 3. Re-enable the network interface.
Step 4. Inform clients about the changes via NSM (Network Status Monitor) Protocol.
Step 5. Grace period.

I was told last week that, independent of lockd, some cluster filesystems have their own implementation of a grace period. It is on the wish list that this feature be taken into consideration. IMHO, the overall process should be viewed as a collaboration between the filesystem, the network interface, and the NFS protocol itself. Mixing filesystem and network operations is unavoidable.

On the other hand, the currently proposed interface is extensible - say, prefix a non-numerical string "DEV" or "UUID" to ask for dropping locks, as in:

shell> echo "DEV12390" > /proc/fs/nfsd/nlm_unlock

or allow an individual grace period of 10 seconds, as in:

shell> echo "1234 10" > nlm_set_grace_for_fsid
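The write handlers behind those two proc files would only need trivial parsing. A hedged sketch (function names and exact syntax are my assumptions, not the patch's):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum nlm_id_type { NLM_ID_FSID, NLM_ID_DEV, NLM_ID_UUID };

/* Classify an nlm_unlock request: a "DEV" or "UUID" prefix selects
 * the identifier type; a bare number is treated as an fsid. */
static enum nlm_id_type classify_unlock_request(const char *buf)
{
    if (strncmp(buf, "UUID", 4) == 0)
        return NLM_ID_UUID;
    if (strncmp(buf, "DEV", 3) == 0)
        return NLM_ID_DEV;
    return NLM_ID_FSID;
}

/* Parse an nlm_set_grace_for_fsid request of the form
 * "<fsid> <seconds>", e.g. "1234 10".  Returns 0 on success,
 * -1 if the input does not match. */
static int parse_grace_request(const char *buf, unsigned int *fsid,
                               unsigned int *seconds)
{
    return sscanf(buf, "%u %u", fsid, seconds) == 2 ? 0 : -1;
}
```

The point is only that the string interface leaves room to grow without breaking existing users.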

With the above said, some of the following flow confuses me; comments are inlined below.

It works like this:

 We have a module parameter for lockd something like
 "virtual_server".
 If that is set to 0, none of the following changes are effective.
 If it is set to 1:
ok with me ...

  The destination address for any lockd request becomes part of the
  key to find the nsm_handle.

As explained above, the address alone can't guarantee that the associated locks get cleaned up for one particular filesystem.
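For reference, the key change Neil describes might look like this (hypothetical structure and names; the real nsm_handle lookup differs). The server-side destination address joins the client address in the key, so the same client reached via two floating addresses gets two distinct handles:

```c
#include <assert.h>
#include <netinet/in.h>

/* Hypothetical nsm_handle lookup key under "virtual_server" mode. */
struct nsm_key {
    struct in_addr client;  /* peer (NFS client) address */
    struct in_addr server;  /* local destination address the request hit */
};

/* Two keys match only if both the client and the local destination
 * address agree. */
static int nsm_key_match(const struct nsm_key *a, const struct nsm_key *b)
{
    return a->client.s_addr == b->client.s_addr &&
           a->server.s_addr == b->server.s_addr;
}
```

This distinguishes state per floating address, but it still says nothing about which filesystem a given lock belongs to, which is the objection above.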

  The my_name field in SM_MON requests and SM_UNMON requests is set
  to a textual representation of that destination address.

That's what the current patch does.
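In the current patch, producing that my_name string amounts to rendering the destination address as text instead of the usual fixed "127.0.0.1". A minimal sketch (the helper name is mine):

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: render the destination (server-side) address
 * as the my_name string carried in SM_MON/SM_UNMON requests.
 * Returns buf on success, NULL on error. */
static const char *nsm_my_name(struct in_addr dst, char *buf, size_t len)
{
    return inet_ntop(AF_INET, &dst, buf, len);
}
```

statd can then hand that string through to the HA callout unchanged, which is exactly the small statd change discussed further down.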

  The reply to SM_MON (currently completely ignored by all versions
  of Linux) has an extra value which indicates how many more seconds
  of grace period there is to go.  This can be stuffed into res_stat
  maybe.
  Places where we currently check 'nlmsvc_grace_period', get moved to
  *after* the nlmsvc_retrieve_args call, and the grace_period value
  is extracted from host->nsm.
OK with me, but I don't see the advantage, though?
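As I read the proposal, the effect is to turn the single global nlmsvc_grace_period into a per-host value stored on the nsm handle, filled in from the SM_MON reply. A hedged sketch of that bookkeeping (names are assumptions):

```c
#include <assert.h>
#include <time.h>

/* Hypothetical per-host grace state kept on the nsm handle.
 * The SM_MON reply carries "seconds of grace remaining", which is
 * converted to an absolute expiry time. */
struct nsm_grace {
    time_t grace_expires;   /* 0 = no grace period in effect */
};

/* Record an SM_MON reply of 'seconds' remaining, received at 'now'. */
static void nsm_set_grace(struct nsm_grace *n, time_t now, long seconds)
{
    n->grace_expires = seconds > 0 ? now + seconds : 0;
}

/* Per-host replacement for the global nlmsvc_grace_period check,
 * made after nlmsvc_retrieve_args() has located the host: should a
 * new (non-reclaim) lock request be refused right now? */
static int nsm_in_grace(const struct nsm_grace *n, time_t now)
{
    return n->grace_expires != 0 && now < n->grace_expires;
}
```

The advantage, if I understand it, would be that each floating address can run its own grace period independently, rather than one server-wide timer.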

 This is the full extent of the kernel changes.

 To remove old locks, we arrange for the callbacks registered with
 statd for the relevant clients to be called.
 To set the grace period, we make sure statd knows about it and it
 will return the relevant information to lockd.
 To notify clients of the need to reclaim locks, we simply use the
 information stored by statd, which contains the local network
 address.

I'm lost here... help?

The only aspect of this that gives me any cause for concern is
overloading the return value for SM_MON.  Possibly it might be cleaner
to define an SM_MON2 with different args or whatever.
As this interface is entirely local to the one machine, and as it can
quite easily be kept back-compatible, I think the concept is fine.
Agreed!

Statd would need to pass the my_name field to the ha callout rather
than replacing it with "127.0.0.1", but other than that I don't think
any changes are needed to statd (though I haven't thought through that
fully yet).

That's what the current patch does.

Comments?


I feel we're going around in a loop again... If there is any way I can shorten this discussion, please do let me know.

-- Wendy

