[Linux-cluster] dlm and IO speed problem <er, might wanna get a coffee first ; )>

Wendy Cheng s.wendy.cheng at gmail.com
Tue Apr 8 09:13:52 UTC 2008


On Mon, Apr 7, 2008 at 9:36 PM, christopher barry <
Christopher.Barry at qlogic.com> wrote:

> Hi everyone,
>
> I have a couple of questions about the tuning the dlm and gfs that
> hopefully someone can help me with.



There is a lot to say about this configuration. It is not a simple tuning
issue.


>
> my setup:
> 6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
> not new stuff, but corporate standards dictated the rev of rhat.



Putting a load balancer in front of a cluster filesystem is tricky to get
right (to say the least). This is particularly true of GFS and LVS,
mostly because LVS is a general-purpose load balancer that is difficult to
tune to work with the existing GFS locking overhead.


> The cluster is a developer build cluster, where developers login, and
> are balanced across nodes and edit and compile code. They can access via
> vnc, XDMCP, ssh and telnet, and nodes external to the cluster can mount
> the gfs home via nfs, balanced through the director. Their homes are on
> the gfs, and accessible on all nodes.



Direct login to the GFS nodes (via vnc, ssh, telnet, etc.) is OK, but NFS client
access in this setup will have locking issues. It is *not* only a
performance issue. It is *also* a functional issue: before the 2.6.19
Linux kernel, NLM locks (used by NFS clients) are not propagated across
clustered NFS servers. You will get file corruption if different NFS clients
lock files and expect those locks to be honored across different
clustered NFS servers. In general, people need to think *very* carefully before
putting a load balancer in front of a group of Linux NFS servers running any
pre-2.6.19 kernel. It is not going to work if multiple clients
take posix locks and/or flocks on files that are expected to
be accessed through different Linux NFS servers on top of *any* cluster
filesystem (not only GFS).
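
For readers unsure which calls are meant by posix locks and flocks, here is a minimal Python sketch of both. It only touches a local temporary file, and the helper name take_both_locks is made up for illustration. It does not reproduce the cross-server problem; it just shows the kind of lock request an application on an NFS client would issue.

```python
import fcntl
import os
import tempfile


def take_both_locks(path):
    """Take and release a POSIX lock, then a flock, on one file."""
    taken = []
    with open(path, "r+b") as f:
        # POSIX (fcntl-style) exclusive lock, non-blocking.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        taken.append("posix")
        fcntl.lockf(f, fcntl.LOCK_UN)

        # BSD-style flock on the same open file, taken after the
        # POSIX lock has been released.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        taken.append("flock")
        fcntl.flock(f, fcntl.LOCK_UN)
    return taken


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        print(take_both_locks(path))
    finally:
        os.unlink(path)
```

If two processes like this one run on different NFS clients and the balancer sends them to different servers, each server has to know about the other's lock for the exclusion to hold.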


>
>
> I'm noticing huge differences in compile times - or any home file access
> really - when doing stuff in the same home directory on the gfs on
> different nodes. For instance, the same compile on one node is ~12
> minutes - on another it's 18 minutes or more (not running concurrently).
> I'm also seeing weird random pauses in writes, like saving a file in vi,
> what would normally take less than a second, may take up to 10 seconds.
>
> * From reading, I see that the first node to access a directory will be
> the lock master for that directory. How long is that node the master? If
> the user is no longer 'on' that node, is it still the master? If
> continued accesses are remote, will the master state migrate to the node
> that is primarily accessing it?



Cluster locking is expensive. As a result, GFS caches its glocks, and there
is a one-to-one correspondence between GFS glocks and DLM locks. Even if a
user is no longer "on" that node, the lock stays on that node unless:

1. some other node requests exclusive access to this lock (file write);
or
2. the node has memory pressure that kicks off the Linux virtual memory manager
to reclaim idle filesystem structures (inodes, dentries, etc.); or
3. abnormal events occur, such as a crash, umount, etc.

Check out:
http://open-sharedroot.org/Members/marc/blog/blog-on-gfs/glock-trimming-patch/?searchterm=gfs
for details.
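
The caching behaviour described above can be pictured with a toy model. This is an illustration only, not GFS or DLM code: the class ToyLockCache, its methods, and the node and lock names are all hypothetical, and it treats every access as an exclusive request. It simply counts how often a cached lock has to move between nodes.

```python
class ToyLockCache:
    """Toy model: each lock stays cached on the last node that held it."""

    def __init__(self):
        self.holder = {}     # lock name -> node currently caching it
        self.transfers = 0   # times a lock had to move between nodes

    def acquire_exclusive(self, node, lock):
        current = self.holder.get(lock)
        if current is not None and current != node:
            # Another node caches the lock, so it must give it up first.
            self.transfers += 1
        self.holder[lock] = node

    def reclaim(self, node):
        # Models memory pressure or umount: the node drops what it caches.
        self.holder = {k: v for k, v in self.holder.items() if v != node}


def count_transfers(accesses):
    """Replay (node, lock) accesses and return the number of transfers."""
    cache = ToyLockCache()
    for node, lock in accesses:
        cache.acquire_exclusive(node, lock)
    return cache.transfers


sticky = [("node1", "home/alice")] * 4
pingpong = [("node1", "home/alice"), ("node2", "home/alice")] * 2
print(count_transfers(sticky), count_transfers(pingpong))  # prints "0 3"
```

The same four accesses cost no transfers when they stay on one node and three when they alternate between two nodes.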


> I've set LVS persistence for ssh and
> telnet for 5 minutes, to allow multiple xterms fired up in a script to
> land on the same node, but new ones later will land on a different node
> - by design really. Do I need to make this persistence way longer to
> keep people only on the first node they hit? That kind of horks my load
> balancing design if so. How can I see which node is master for which
> directories? Is there a table I can read somehow?



You did the right thing here (by making the connections persistent). There
is a gfs glock dump command that can print out all the lock info (name,
owner, etc.) but I really don't want to recommend it, since automating this
process is not trivial and there is no way to do it by hand.


>
> * I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
> mount noatime,noquota,nodiratime, and David Teigland recommended I set
> dlm_dropcount to '0' today on irc, which I did, and I see an improvement
> in speed on the node that appears to be master for say 'find' command
> runs on the second and subsequent runs of the command if I restart them
> immediately, but on the other nodes the speed is awful - worse than nfs
> would be. On the first run of a find, or If I wait >10 seconds to start
> another run after the last run completes, the time to run is
> unbelievably slower than the same command on a standalone box with ext3.
> e.g. <9 secs on the standalone, compared to 46 secs on the cluster - on
> a different node it can take over 2 minutes! Yet an immediate re-run on
> the cluster, on what I think must be the master is sub-second. How can I
> speed up the first access time, and how can I keep the speed up similar
> to immediate subsequent runs. I've got a ton of memory - I just do not
> know which knobs to turn.


The more memory you have, the more gfs locks (and their associated gfs file
structures) will be cached on the node. That, in turn, makes both dlm and
gfs lock queries take longer. The glock_purge tunable (on RHEL 4.6, not on RHEL 4.5)
should help, but its effect will be limited if you ping-pong the
locks quickly between different GFS nodes. Try playing with this
tunable (start with 20%) to see how it goes, but please reset gfs_scand and
gfs_inoded back to their defaults while you are experimenting with glock_purge.
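
To see how a setting goes, it helps to measure the same tree walk twice, with a pause in between, on each node. The following is a rough Python sketch of such a measurement, roughly what a 'find' run does. The function names timed_walk and compare_runs are made up for this example, and the path and pause in the usage note are placeholders.

```python
import os
import time


def timed_walk(root):
    """Stat every entry under root; return (entries, seconds)."""
    start = time.perf_counter()
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                # Entry vanished or is unreadable; skip it.
                continue
            count += 1
    return count, time.perf_counter() - start


def compare_runs(root, pause=0.0):
    """Walk root twice, sleeping `pause` seconds between the runs."""
    first = timed_walk(root)
    time.sleep(pause)
    second = timed_walk(root)
    return first, second
```

For example, compare_runs("/home/someuser", pause=15) on two different nodes would show whether the second run is still fast after the pause. The path is hypothetical.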

So, assuming this is a build-compile cluster, which implies a large number of small
files that come and go, the tricks I can think of are:

1. glock_purge ~ 20%
2. glock_inode shorter than default (not longer)
3. persistent LVS sessions if at all possible
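
As a sketch of how the first item might be applied, the Python helper below builds the command line without running it. It assumes the usual `gfs_tool settune <mountpoint> <tunable> <value>` form and a mount point of /mnt/gfs, both of which are assumptions to check against the gfs_tool man page on the installed release. The helper names are made up for this example.

```python
def settune_command(mountpoint, tunable, value):
    """Build (but do not run) a gfs_tool settune command line."""
    # Assumed syntax: gfs_tool settune <mountpoint> <tunable> <value>
    return ["gfs_tool", "settune", mountpoint, tunable, str(value)]


def glock_purge_command(mountpoint, percent=20):
    """Command that sets glock_purge to the given percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("glock_purge is a percentage, expected 0..100")
    return settune_command(mountpoint, "glock_purge", percent)


# /mnt/gfs is a hypothetical mount point.
print(" ".join(glock_purge_command("/mnt/gfs")))
```

Keeping the command as a list makes it easy to review before running it on each node.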

>
>
> Am I expecting too much from gfs? Did I oversell it when I literally
> fought to use it rather than nfs off the NetApp filer, insisting that
> the performance of gfs smoked nfs? Or, more likely, do I just not
> understand how to optimize it fully for my application?



GFS1 is very good at large sequential IO (such as video-on-demand) but works
poorly in the environment you are trying to set up. However, I'm in an awkward
position to comment further, so I'll stop here.

-- Wendy
