[Linux-cluster] OOM issues with GFS, NFS, Samba on RHEL3-AS cluster

Hello. I've tried to read up on the lists here to see what I can find about these sorts of issues, but the information appears to be somewhat sparse.

Here's my situation: I have a two-member cluster built on RHEL 3 AS (with all current updates installed). That means kernel 2.4.21-27.0.2.EL with GFS (6.0.2-25) and cluster services (1.2.16-1) built from SRPMS distributed by Red Hat. My storage is iSCSI-based over gigabit ethernet. The hardware is Dell PowerEdge 1860s, each with 4GB of RAM and dual 2.4GHz processors.

My problem is that the node serving disk via NFS and Samba gets into a strange mode where it starts to get kernel-based out-of-memory errors, which start to kill things off. The machine reboots itself and comes back up with no issues. In the process, of course, it wreaks havoc with lock_gulmd and a host of other things, and makes a bunch of users upset (it probably didn't help that we've been dealing with unstable storage here for a while, and I put this system together with the idea that it would be more reliable).

I plan on trying to add a third node, which would fix the lock_gulmd craziness. That's not my big problem, though. I NEED to figure out why this is happening. My analysis so far seems to indicate that the crashes are caused mostly when there are a lot of files open (or at least a lot of disk activity). The failures seem to occur most often when people are accessing data (on GFS) from the server over an NFS mount to another machine, but they also seem to occur if the machine has seen a day's worth of that sort of usage and the backup system tries to get its nightly backup between 11PM and 2AM. When memory starts to get low, kswapd shows up and starts eating serious cycles, along with the nfsd's. I've tried increasing the number of nfsd's, but that didn't seem to have an effect.

Any ideas on things I should be checking? Interestingly enough, no swap seems to be used when this happens. The load average normally creeps up right before death, and the machine gets down to less than 18MB free (though a lot of the 4GB is tied up in cache).
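
For anyone wanting to record the same numbers (free memory, cache, swap) in the run-up to a failure, here is a minimal Python sketch that reads the "Name: value kB" lines from /proc/meminfo. The function names are hypothetical, and the choice of fields is an assumption about what is worth logging, not a diagnosis. It has not been run on the 2.4.21 kernel described above; it simply skips any line that is not in the "kB" form.

```python
def parse_meminfo(text):
    """Return {field: kB} for lines shaped like 'MemFree:   18000 kB'.

    Lines in any other shape (for example a byte-count summary table)
    are skipped rather than guessed at.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or len(parts) < 2 or parts[1] != "kB":
            continue
        if parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields, keys=("MemFree", "Cached", "SwapTotal", "SwapFree")):
    """Format the chosen fields (an assumed selection) on one log line."""
    return " ".join("%s=%dkB" % (k, fields[k]) for k in keys if k in fields)


def sample(path="/proc/meminfo"):
    """Read the file once and return a one-line summary."""
    with open(path) as f:
        return summarize(parse_meminfo(f.read()))
```

Calling sample() once a minute from cron or a small loop, with a timestamp in front of each line, would give a record of how free memory and cache move before the out-of-memory kills begin.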

Jonathan Woytek                 w: 412-681-3463         woytek+ cmu edu
NREC Computing Manager          c: 412-401-1627         KB3HOZ
PGP Key available upon request