[Linux-cluster] How to set up NFS HA service

birger birger at birger.sh
Tue Apr 19 18:47:49 UTC 2005


I think my first attempt at an answer ended up in the bit bucket because of a 
WLAN problem while I was saving it to the drafts folder. Sigh...

Lon Hohberger wrote:

> On Tue, 2005-04-19 at 15:08 +0200, birger wrote:
> 
> Known bug/feature:
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151669
> 
> You can change this behavior if you wanted to by adding <child type=...>
> to service.sh's "special" element in the XML meta-data.

I thought about trying just that, but believed it couldn't be that simple... :-D
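
For the archives, what I had in mind was something along these lines in 
service.sh's metadata (an untested sketch on my part; "nfsexport" and the 
start/stop levels are just illustrative, so check the bugzilla entry for the 
exact syntax):

    <special tag="rgmanager">
        <!-- declare which resource types may sit directly under a service
             and in what order they get started/stopped -->
        <child type="nfsexport" start="1" stop="2"/>
    </special>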


>>I'm also a bit puzzled about why the file systems don't get unmounted 
>>when I disable all services.
> 
> 
> They're GFS.  Add force_unmount="1" to the <fs> elements if you want
> them to be umounted.  GFS is nice because you *don't* have to umount
> it.

That was exactly why I wanted to mount the GFS file systems outside the 
service. I am very happy with this unexpected behaviour; I want the file 
systems to be there. :-)

I was afraid they didn't unmount because of some problem.
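
For completeness, if someone does want a non-GFS file system torn down when 
the service stops, I assume the <fs> entry in cluster.conf just needs the 
extra attribute Lon mentions, roughly like this (device and mountpoint are 
made up):

    <fs name="nfsdata" device="/dev/vg_nfs/lv_data"
        mountpoint="/export/data" fstype="ext3" force_unmount="1"/>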

> FYI, NFS services on traditional file systems don't cleanly stop right
> now due to an EBUSY during umount from the kernel.  Someone's looking in
> to it on the NFS side (apparently, not all the refs are getting cleared
> if a node has an NFS mount ref and we unexport the FS, or something).

I saw a very similar problem some years ago on Solaris with Veritas 
FirstWatch. fuser and lsof came up empty, but the file system was still busy 
when I tried to umount it. I found a workaround: restarting statd and lockd 
and then umounting. It seems they had their paws in the file system somehow.
Since FirstWatch was mostly a bunch of sh scripts, it was easy to modify the 
NFS umount code to do this.
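
From memory the hack boiled down to something like this (a Linux-flavoured 
sketch rather than the original Solaris script; the init script name will 
vary by distribution):

    # try a normal umount first; if the kernel reports the file system
    # busy, bounce statd/lockd and try once more
    umount /export/data || {
        /etc/init.d/nfslock restart
        umount /export/data
    }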

Regarding lockd, I think my solution is valid given these two constraints:
- The cluster nodes must not be NFS clients themselves (and thanks to GFS I 
don't need that).
- There must only be one NFS service running on any cluster node, and I only 
have one NFS service.

When I set the statd name to the hostname of the service IP address and 
relocate the status directory to a cluster disk, a takeover should behave 
just like a server reboot, shouldn't it?
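
Concretely, I start statd from the service with its name pinned to the 
floating address and its state directory on shared storage, roughly like 
this (the hostname and path are just examples from my setup):

    # nfs-ha is the hostname of the floating service IP address;
    # /gfs/nfs/statd lives on the shared GFS file system
    rpc.statd -n nfs-ha -P /gfs/nfs/statd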

>>Apr 19 14:42:58 server1 clurgmgrd[7498]: <notice> Service nfssvc started
>>Apr 19 14:43:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts-ro" returned 1 (generic error)
>>Apr 19 14:43:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts" returned 1 (generic error)
>>Apr 19 14:44:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts-ro" returned 1 (generic error)
>>Apr 19 14:44:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts" returned 1 (generic error)
> 
> 
> Hmm, that's odd, it could be a bug in the status phase which is related
> to NIS exports.  Does this only happen after a failover, or does it
> happen all the time?

My cluster only has one node (even though I have defined two nodes). I have 
to get the first node production-ready and migrate everything over first, 
then make the old file server a second cluster node.

I'll have a look around and see if I can find a solution.

-- 
birger



