Good morning, y'all:
A bit of background on our current setup before I get into the issues we've been having, and my questions for the list :)
We've been running a 32-node compute cluster with two head nodes attached to a 4TB EMC CX300 for ~9 months now. Attached to the storage device are two additional four-way systems. The two heads (call them f0 and f1) and one of the four-ways (call it e0) are currently configured as GFS servers; the second four-way (call it c0) is strictly a client to the GFS system. The 4TB are divided into four RAID5 arrays: 2x 800GB, 1x 730GB, 1x 1.4TB (roughly), each formatted and included in the GFS side of things. f0 and f1 are dual Opteron 248s w/4GB RAM, e0 is a quad Opteron 848 w/16GB RAM, and c0 is a . An additional system (t0, a dual Opteron 248 w/6GB RAM) is planned to replace e0 as the third GFS server, relegating e0 to client-only status, as soon as I can finish building it.
First, the GFS side of things is currently sharing the cluster's internal network for its communications, mostly because we didn't have a second switch to dedicate to the task. While the cluster is currently lightly used, how sub-optimal is this? I'm currently searching for another switch that a partnering department has (or had), but I don't know whether they even know where it is at this point.
Second: GFS likes to fence e0 off on a fairly regular basis (once every other week or so, if not more often). This is really rather bad for us from an operational standpoint - e0 is vital to the operation of our Biostatistics Department (Samba/NFS, user authentication, etc.). There is also some pretty nasty latency on occasion, with logins taking upwards of 30 seconds to return to a prompt, provided the attempt doesn't time out to begin with.
In trying to figure out -why- it's constantly being fenced off, and in trying to solve the latency/performance issues, I've noticed a -very- large number of "notices" from GFS like the following:
Feb 15 10:56:10 front-1 lock_gulmd_LT000: Lock count is at 1124832 which is more than the max 1048576. Sending Drop all req to clients
Easy enough to gather that we're blowing away the current lock highwater mark.
Is upping the highwater mark a feasible thing to do, would it have an effect on performance, and if so, what would that effect be?
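For reference, here's roughly the change I have in mind. This is a sketch against our cluster.ccs as I remember the GULM syntax from the GFS docs - the highwater parameter name (lt_high_locks), the doubled value, and the cluster name are my guesses, not something I've tested:

```
cluster {
    name = "biostat"                  # made-up name, for illustration only
    lock_gulm {
        servers = ["f0", "f1", "e0"]
        # Raise the lock highwater mark above the 1048576 default we
        # keep blowing past. Parameter name from memory -- may be wrong.
        lt_high_locks = 2097152
    }
}
```

My worry is that caching more locks means more memory used on the lock servers, and a bigger "Drop all req" sweep whenever we do hit the ceiling.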
This weekend we also noticed another weirdness (for us, anyway): e0 was fenced off on Saturday morning at 05:04:09. Almost precisely 24 hours later, e0 decided that the problem was the previous GFS master (f0), arbitrated itself to be master, took over, fenced off f0, and had proceeded to hose the entire thing by the time I heard about it and was able to get on-site to bring it all back up (at 1am Monday morning). What is this apparent 24-hour timer, and is this expected behaviour?
Finally - would increasing the heartbeat timer and the number of acceptable misses be an appropriate and acceptable way to help decrease the frequency of e0 being fenced off?
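In case it clarifies the question, the tweak would look something like this in the lock_gulm section of cluster.ccs - again from memory of the GULM docs, so the parameter names (heartbeat_rate, allowed_misses) and the exact defaults are best guesses:

```
lock_gulm {
    servers = ["f0", "f1", "e0"]
    # Default is (I believe) a 15-second heartbeat with 2 allowed
    # misses. Stretching both should make fencing less trigger-happy,
    # at the cost of slower detection of a genuinely dead node.
    heartbeat_rate = 20.0
    allowed_misses = 3
}
```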
Sorry for the long post and the rambling. I've been fighting with this mess off and on for 6 months now. It's only now coming to a head because people are no longer able to get any "real" work done, computationally, on the system. And with researchers, not being able to get your computations done in time for your presentation next week is a really -=bad=- thing.
Jerry Gilyeat, RHCE
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health