[Linux-cluster] gfs_grow

Andrew A. Neuschwander andrew at ntsg.umt.edu
Wed Oct 8 21:12:39 UTC 2008


James,

I have a CentOS 5.2 cluster where I would see the same nfs errors under 
certain conditions. If I did anything that introduced latency to my gfs 
operations on the node that served nfs, the nfs threads couldn't service 
requests as fast as they came in from clients. Eventually my nfs 
threads would all be busy and start dropping nfs requests. I kept an eye 
on my nfsd thread utilization (/proc/net/rpc/nfsd) and kept bumping up 
the number of threads until they could handle all the requests even 
when gfs latency was high.
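
For reference, the "th" line in /proc/net/rpc/nfsd shows the current 
thread count, a counter of how many times all threads have been busy at 
once, and ten busy-time buckets. A quick perl sketch for reading it 
(assuming a kernel of this vintage, where those buckets are still 
populated) looks something like:

#!/usr/bin/perl
# Report nfsd thread utilization from the "th" line of /proc/net/rpc/nfsd.
use strict;
use warnings;

open my $fh, '<', '/proc/net/rpc/nfsd'
    or die "can't open /proc/net/rpc/nfsd: $!";
while (my $line = <$fh>) {
    next unless $line =~ /^th\s+(.*)/;
    my ($threads, $fullcnt, @buckets) = split ' ', $1;
    print "nfsd threads:      $threads\n";
    print "all threads busy:  $fullcnt times\n";
    print "busy-time buckets: @buckets\n";
}
close $fh;

If the all-busy counter keeps climbing, the thread count can be raised 
on the fly with "rpc.nfsd <N>", or persistently via RPCNFSDCOUNT in 
/etc/sysconfig/nfs.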

In my case, I had EMC Networker streaming data from my gfs filesystems 
to a local scsi tape device on the same node that served nfs. I 
eventually separated them onto different nodes.

I'm sure gfs_grow would slow down your gfs enough that your nfs threads 
couldn't keep up. NFS on gfs seems to be very latency-sensitive. I have 
a quick and dirty perl script to generate a histogram image from nfs 
thread stats if you are interested.
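
The gist of it is something like the rough sketch below; the real thing 
renders the same data as an image, but this version just prints a text 
histogram (the ten numbers on the "th" line are the cumulative seconds 
spent at each tenth of thread utilization, at least on kernels of this 
era):

#!/usr/bin/perl
# Rough sketch, not the actual script: print a text histogram of nfsd
# thread utilization from the "th" line of /proc/net/rpc/nfsd.
use strict;
use warnings;

open my $fh, '<', '/proc/net/rpc/nfsd' or die "open: $!";
my ($line) = grep { /^th\s/ } <$fh>;
close $fh;
die "no th line found\n" unless $line;

my (undef, $threads, $fullcnt, @buckets) = split ' ', $line;
printf "%d nfsd threads, all busy %s times\n", $threads, $fullcnt;

my ($max) = sort { $b <=> $a } @buckets;
$max ||= 1;    # avoid dividing by zero when all buckets are empty
for my $i (0 .. $#buckets) {
    my $bar = '#' x int(40 * $buckets[$i] / $max);
    printf "%3d-%3d%% busy %10.1fs %s\n",
        $i * 10, ($i + 1) * 10, $buckets[$i], $bar;
}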

-Andrew
--
Andrew A. Neuschwander, RHCE
Linux Systems/Software Engineer
College of Forestry and Conservation
The University of Montana
http://www.ntsg.umt.edu
andrew at ntsg.umt.edu - 406.243.6310


James Chamberlain wrote:
> Hi all,
> 
> I'd like to thank Bob Peterson for helping me solve the last problem I 
> was seeing with my storage cluster.  I've got a new one now.  A couple 
> days ago, site ops plugged in a new storage shelf and this triggered 
> some sort of error in the storage chassis.  I was able to sort that out 
> with gfs_fsck, and have since gotten the new storage recognized by the 
> cluster.  I'd like to make use of this new storage, and it's here that 
> we run into trouble.
> 
> lvextend completed with no trouble, so I ran gfs_grow.  gfs_grow has 
> been running for over an hour now and has not progressed past:
> 
> [root at s12n01 ~]# gfs_grow /dev/s12/scratch13
> FS: Mount Point: /scratch13
> FS: Device: /dev/s12/scratch13
> FS: Options: rw,noatime,nodiratime
> FS: Size: 4392290302
> DEV: Size: 5466032128
> Preparing to write new FS information...
> 
> The load average on this node has risen from its normal ~30-40 to 513 
> (the number of nfsd threads, plus one), and the file system has become 
> slow-to-inaccessible on client nodes.  I am seeing messages in my log 
> files that indicate things like:
> 
> Oct  8 16:26:00 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when 
> sending 140 bytes - shutting down socket
> Oct  8 16:26:00 s12n01 last message repeated 4 times
> Oct  8 16:26:00 s12n01 kernel: nfsd: peername failed (err 107)!
> Oct  8 16:26:58 s12n01 kernel: nfsd: peername failed (err 107)!
> Oct  8 16:27:56 s12n01 last message repeated 2 times
> Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when 
> sending 140 bytes - shutting down socket
> Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when 
> sending 140 bytes - shutting down socket
> Oct  8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
> Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when 
> sending 140 bytes - shutting down socket
> Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when 
> sending 140 bytes - shutting down socket
> Oct  8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
> Oct  8 16:28:34 s12n01 last message repeated 2 times
> Oct  8 16:30:29 s12n01 last message repeated 2 times
> 
> I was seeing similar messages this morning, but those went away when I 
> mounted this file system on another node in the cluster, turned on 
> statfs_fast, and then moved the service to that node.  I'm not sure what 
> to do about it given that gfs_grow is running.  Is this something anyone 
> else has seen?  Does anyone know what to do about this?  Do I have any 
> option other than to wait until gfs_grow is done?  Given my recent 
> experiences (see "lm_dlm_cancel" in the list archives), I'm very 
> hesitant to hit ^C on this gfs_grow.  I'm running CentOS 4 for x86-64, 
> kernel 2.6.9-67.0.20.ELsmp.
> 
> Thanks,
> 
> James
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



