[Cluster-devel] Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields bfields at fieldses.org
Thu Jan 24 20:19:10 UTC 2008


On Thu, Jan 24, 2008 at 02:45:37PM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> In practice, it seems that both the unlock_ip and unlock_pathname
>> methods that revoke locks are going to be called together.  The two
>> separate calls therefore seem a little redundant.  The reason we *need*
>> both is that it's possible that a misconfigured client could grab locks
>> for a (server ip, export) combination that it isn't supposed to.
>>   
>
> That is not a correct assumption. The two commands (unlock_ip and
> unlock_pathname) are not necessarily called together. Calling them
> together is fine for a local filesystem (ext3) but not for a cluster
> filesystem, where the very same filesystem (or subtree) can be exported
> from multiple servers using different subtrees.

Ouch.  Are people really doing that, and why?  What happens if the
subtrees share files (because of hard links) that are locked from both
nodes?

> Also, as we discussed before, it is "unlock_filesystem", *not*
> "unlock_pathname" (the latter would imply sub-tree exports), due to
> implementation difficulties (see the "Implementation Notes" in
> http://people.redhat.com/wcheng/Patches/NFS/NLM/004.txt).

Unless I misread the latest patch, it's actually matching on the
vfsmount, right?

I guess that means we *could* handle the above situation by doing a

	mount --bind /path/to/export/point /path/to/export/point

on each export, at which point there will be a separate vfsmount for
each export point?

But I don't think that's what we really want.  The goal is to ensure
that the nfs server holds no locks on a disk filesystem so we can
unmount it completely from this machine and mount it elsewhere.  So we
should really be removing all locks for the superblock, not just for a
particular mount of that superblock.  Otherwise we'll have odd problems
if someone happens to do the unlock_filesystem downcall from a different
namespace or something.
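
Concretely, I think that means matching on the superblock when lockd
walks its list of files; just a sketch of the idea (the helper name is
made up, it's not from the posted patch):

	#include <linux/fs.h>

	/* Does this open file live on the filesystem being dropped? */
	static int nlm_file_on_sb(const struct file *filp,
				  const struct super_block *sb)
	{
		return filp->f_path.dentry->d_inode->i_sb == sb;
	}

That way a bind mount, or a mount of the same device in another
namespace, is still covered, since every vfsmount of the filesystem
shares the one superblock.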

>> So it makes sense to me to restrict locking from the beginning to
>> prevent that from happening.  Therefore I'd like to add a call at the
>> beginning like:
>>
>> 	echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace
>>   
>
> My second patch set is about to be sent out (I'm writing up the text
> description at this moment ... sorry for the delay).

Good, thanks.

>> before any exports are set up, which both starts a grace period, and
>> tells nfs to allow locks on the filesystem /exports/example only if
>> they're addressed to the server ip 192.168.1.1.  Then on shutdown,
>>
>> 	echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip
>>
>> should be sufficient to guarantee that nfsd/lockd no longer holds locks
>> on /exports/example.
>>
>> (I think Wendy's pretty close to that api already after adding the
>> second method to start grace?)
>>
>> The other advantage to having the server-ip from the start is that at
>> the time we make lock requests to the cluster filesystem, we can tell it
>> that the locks associated with 192.168.1.1 are special: they may migrate
>> as a group to another node, and on node failure they should (if
>> possible) be held to give a chance for another node to take them over.
>>
>> Internally I'd like to have an object like
>>
>> 	struct lock_manager {
>> 		char *lm_name;
>> 		...
>> 	}
>>
>> for each server ip address.  A pointer to this structure would be passed
>> with each lock request, allowing the filesystem to associate locks with
>> lock_managers.  The name would be a string derived from the server ip
>> address that the cluster can compare to match reclaim requests with the
>> locks that they're reclaiming from another node.
>>   
>
> I still don't have a warm feeling about adding this (at this moment) -
> somehow I feel we're over-engineering the lock failover issue.

I agree that that's a risk.
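
For what it's worth, though, what I have in mind really isn't much more
than the following.  This is only a sketch: the lock_manager/lm_name
names come from the proposal quoted above, and the rest is invented for
illustration:

	#include <linux/fs.h>

	/*
	 * One lock_manager per floating server ip.  A pointer to it would
	 * travel with each lock request, so the cluster filesystem can
	 * group locks by server ip and match them against reclaims after
	 * that ip (and its lm_name) has moved to another node.
	 */
	struct lock_manager {
		char	*lm_name;	/* e.g. derived from "192.168.1.1" */
	};

	/* Hypothetical wrapper for a lock request as lockd might pass it down. */
	struct nlm_lock_request {
		struct lock_manager	*lr_manager;	/* which server ip owns the lock */
		struct file_lock	*lr_fl;		/* the ordinary posix lock */
	};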

> Remember lock failover is just a small portion of the general NFS
> server failover issue (for both HA and load-balancing purposes).
> Maybe we should have something simple and usable for the 2.6 kernel
> before adding this type of complication?

Yeah.  We should aim to include basic failover functionality in 2.6.26,
one way or another--I think that dealing with the other issues I'm
worried about won't actually be a great deal more complicated, but if
that doesn't pan out then fine.  I would like to at least make sure this
is all working for nfsv4 as well, though.  (Currently only locks held by
v2/v3 clients are dropped.)

--b.
