
Re: [Linux-cluster] GFS 6.0 Questions

Gerald G. Gilyeat wrote:

First, the GFS side of things is currently sharing the cluster's internal network for its communications, mostly because we didn't have a second switch to dedicate to the task. While the cluster is currently lightly used, how sub-optimal is this? I'm currently searching for another switch that a partnering department has/had, but I don't know if they even know where it is at this point.

It really depends on how much the actual link is used. The more data the other apps push over the ethernet, the less of it gulm can use. It is also, unfortunately, rather difficult to tell gulm to use a different network device in the current releases. A fix is pending for this, but it's not out yet.

Second: GFS likes to fence "e0" off on a fairly regular/common basis (once every other week or so, if not more often). This is really rather bad for us, from an operational standpoint - e0 is vital to the operation of our Biostatistics Department (Samba/NFS, user authentication, etc...). There is also some pretty nasty latency on occasion, with logins taking upwards of 30seconds to return to a prompt, providing it doesn't time out to begin with.

If the machine is getting this kind of delay, it is completely possible that the delay is also causing heartbeats to be missed.

In trying to figure out -why- it's constantly being fenced off, and in trying to solve the latency/performance issues, I've noticed a -very- large number of "notices" from GFS like the following:
Feb 15 10:56:10 front-1 lock_gulmd_LT000[4073]: Lock count is at 1124832 which is more than the max 1048576. Sending Drop all req to clients

Easy enough to gather that we're blowing through the current lock highwater mark.
Is upping the highwater point a feasible thing to do, would it have an effect on performance, and what would that effect be?

cluster.ccs: cluster { lock_gulm { .... lt_high_locks = <int> } }

The highwater mark is an attempt to keep down the amount of memory lock_gulmd uses. When the highwater mark is hit, the lock server tells all gfs mounts to try to release locks. It does this every 10 seconds until the lock count falls below the highwater mark. This costs cycles, so not doing it means fewer cycles used. The higher the highwater mark, the more memory the gulm lock servers and gfs will use to store locks. The number is just a count of locks (in <=6.0), not an actual representation of the ram used.
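A quick way to gauge whether raising the highwater mark is worth it is to count how often the drop-all cycle actually fires. A minimal sketch, using the notice text from the example above; it demonstrates against a temp file, but in practice you would point LOG at your syslog file (often /var/log/messages, though that path depends on your syslog setup):

```shell
# Sketch: tally lock_gulmd highwater notices to see how often the
# drop-all cycle fires. LOG here is a sample file for illustration;
# substitute your real syslog file (path is an assumption).
LOG=$(mktemp)
printf '%s\n' \
  'Feb 15 10:56:10 front-1 lock_gulmd_LT000[4073]: Lock count is at 1124832 which is more than the max 1048576. Sending Drop all req to clients' \
  > "$LOG"
# Count the drop-all notices; a steadily climbing count suggests the
# highwater mark is being hit routinely rather than in rare bursts.
grep -c 'Sending Drop all req' "$LOG"
rm -f "$LOG"
```

If the count climbs every few minutes under normal load, the 10-second drop cycle is burning cycles constantly, and a higher lt_high_locks is more likely to pay off.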

In short summary: in your case, a higher highwater mark may give some performance gain, at the cost of some memory available to other programs.

This weekend, we also noticed another weirdness (for us, anyways...) - e0 was fenced off on Saturday morning at 05:04.09am. Almost precisely 24 hours later, e0 decided that the problem was the previous GFS master (f0), arbitrated itself to be Master, took over, and fenced off f0, and by the time I heard about it and was able to get on-site to bring it all back up (at 1am Monday morning), the entire thing was hosed. What is this apparent 24-hour timer, and is this expected behaviour?

No, it sounds like some kind of freak chance. A very icky thing indeed. Very much sounds like a higher heartbeat_rate is needed.

Finally - would increasing the heartbeat timer and the number of acceptable misses be an appropriate and acceptable way to help decrease the frequency of e0 being fenced off?

Certainly. The default values for the heartbeat_rate and allowed_misses are just suggestions. Certain setups may require different values, and as far as I know the only way to figure this out is to try it. Sounds very much like you could use larger values.
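Both knobs sit alongside lt_high_locks in the lock_gulm section of cluster.ccs. A sketch of the shape (the numbers below are illustrative assumptions to tune for your site, not recommendations):

```
cluster {
    lock_gulm {
        heartbeat_rate = 30     # seconds between heartbeats; raise to ride out latency spikes
        allowed_misses = 3      # heartbeats a node may miss before it is fenced
        lt_high_locks = 2097152 # lock-count highwater mark discussed above
    }
}
```

The effective detection time is roughly heartbeat_rate times allowed_misses, so with these example values a node would have to stay silent on the order of 90 seconds before being fenced - long enough to survive a 30-second login-style stall, at the price of slower failure detection.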

michael conrad tadpol tilstra
<my wit is my doom>

