[Linux-cluster] OOM issues with GFS, NFS, Samba on RHEL3-AS cluster

Sun Jan 23 23:12:18 UTC 2005

Sorry about the duplicate message--I had sent this when I had a mistake 
in my email address.  When I fixed it, this message apparently went 
through to the list.

jonathan

Jonathan Woytek wrote:

> Hello.  I've tried to read up on the lists here to see what I can find 
> about these sorts of issues, but the information appears to be somewhat 
> sparse.
> 
> Here's my situation:  I have a two-member cluster built on RHEL 3 AS 
> (with all current updates installed).  That means kernel 
> 2.4.21-27.0.2.EL with GFS (6.0.2-25) and cluster services (1.2.16-1) 
> built from SRPMS distributed by Red Hat.  My storage is iSCSI-based over 
> gigabit Ethernet.  The hardware is Dell PowerEdge 1860s with 4GB of RAM 
> and dual 2.4GHz processors.
> 
> My problem is that the node serving disk via NFS and Samba gets into a 
> strange mode where it starts to get kernel-based out-of-memory errors, 
> which start to kill things off.  The machine reboots itself and comes 
> back up with no issues.  In the process, of course, it wreaks havoc with 
> lock_gulmd and a host of other things, and makes a bunch of users upset 
> (it probably didn't help that we've been dealing with unstable storage 
> here for a while, and I put this system together with the idea that it 
> would be more reliable).
> 
> I plan on trying to add a third node, which would fix the lock_gulmd 
> craziness.  That's not my big problem, though.  I NEED to figure out why 
> this is happening.  My analysis so far seems to indicate that the 
> crashes happen mostly when there are a lot of files open (or at 
> least a lot of disk activity).  The failures seem to occur most often 
> when people are accessing data (on GFS) from the server over an NFS 
> mount to another machine, but they also seem to occur if the machine has 
> seen a day's worth of that sort of usage and the backup system tries to 
> get its nightly backup between 11PM and 2AM.  When memory starts to get 
> low, kswapd shows up and starts eating serious cycles, along with the 
> nfsd threads.  I've tried increasing the number of nfsd threads, but 
> that didn't seem to have an effect.
> 
> Any ideas on things I should be checking?  Interestingly enough, no swap 
> seems to be used when this happens.  The load average normally creeps up 
> right before death, and the machine gets down to less than 18MB free 
> (though a lot of the 4GB is tied up in cache).
> 
> jonathan
> -- 
> Jonathan Woytek                 w: 412-681-3463         woytek+ at cmu.edu
> NREC Computing Manager          c: 412-401-1627         KB3HOZ
> PGP Key available upon request