[Linux-cluster] OOM issues with GFS, NFS, Samba on RHEL3-AS cluster

Thu Jan 20 21:56:03 UTC 2005

Hello.  I've tried to read up on the lists here to see what I can find 
about these sorts of issues, but the information appears to be somewhat 
sparse.

Here's my situation:  I have a two-member cluster built on RHEL 3 AS 
(with all current updates installed).  That means kernel 
2.4.21-27.0.2.EL with GFS (6.0.2-25) and cluster services (1.2.16-1) 
built from SRPMS distributed by Red Hat.  My storage is iSCSI-based over 
gigabit ethernet.  The hardware is Dell PowerEdge 1860s with 4GB of RAM 
and dual 2.4GHz processors.

My problem is that the node serving disk via NFS and Samba gets into a 
strange mode where it starts to get kernel-based out-of-memory errors, 
which start to kill things off.  The machine reboots itself and comes 
back up with no issues.  In the process, of course, it wreaks havoc with 
lock_gulmd and a host of other things, and makes a bunch of users upset 
(it probably didn't help that we've been dealing with unstable storage 
here for a while, and I put this system together with the idea that it 
would be more reliable).

I plan on trying to add a third node, which would fix the lock_gulmd 
craziness.  That's not my big problem, though.  I NEED to figure out why 
this is happening.  My analysis so far seems to indicate that the 
crashes occur mostly when there are a lot of files open (or at 
least a lot of disk activity).  The failures seem to occur most often 
when people are accessing data (on GFS) from the server over an NFS 
mount to another machine, but they also seem to occur if the machine has 
seen a day's worth of that sort of usage and the backup system tries to 
get its nightly backup between 11PM and 2AM.  When memory starts to get 
low, kswapd shows up and starts eating serious cycles, along with the 
nfsd threads.  I've tried increasing the number of nfsd threads, but 
that didn't seem to have an effect.

Any ideas on things I should be checking?  Interestingly enough, no swap 
seems to be used when this happens.  The load average normally creeps up 
right before death, and the machine gets down to less than 18MB free 
(though a lot of the 4GB is tied up in cache).
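
To make the numbers above easier to correlate with a crash, here is a 
minimal sketch of a logger that samples /proc/meminfo at a fixed interval.  
It is only an illustration, not a diagnosis.  The function names 
(parse_meminfo, format_sample, log_samples) are hypothetical, it assumes a 
reasonably recent Python 3 on the host, and it assumes the kernel reports 
LowFree and HighFree (which normally requires highmem support); fields that 
are absent are printed as n/a.

```python
import time

# Fields worth watching here.  LowFree/HighFree are an assumption: they are
# only reported by kernels built with highmem support, so they may be absent.
FIELDS = ("MemFree", "Cached", "LowFree", "HighFree", "SwapFree")


def parse_meminfo(text):
    """Return {field: value_in_kB} for lines shaped like 'Name:  12345 kB'."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip header/summary lines that are not a single 'N kB' value.
        if sep and len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            values[name.strip()] = int(parts[0])
    return values


def format_sample(values, fields=FIELDS):
    """Build one log line; fields missing from this kernel show as n/a."""
    out = []
    for field in fields:
        if field in values:
            out.append("%s=%dkB" % (field, values[field]))
        else:
            out.append("%s=n/a" % field)
    return " ".join(out)


def log_samples(count, interval=60, path="/proc/meminfo"):
    """Print `count` timestamped samples, sleeping `interval` seconds between."""
    for i in range(count):
        with open(path) as fh:
            sample = parse_meminfo(fh.read())
        print(time.strftime("%Y-%m-%d %H:%M:%S"), format_sample(sample), flush=True)
        if i < count - 1:
            time.sleep(interval)
```

Calling log_samples(1440, 60) with output redirected to a file on local 
(non-GFS) disk would cover a full day at one sample per minute, so the last 
few lines before a reboot show which of the figures was dropping.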

jonathan
--
Jonathan Woytek                 w: 412-681-3463         woytek+ at cmu.edu
NREC Computing Manager          c: 412-401-1627         KB3HOZ
PGP Key available upon request