
Re: [Linux-cluster] dlm and IO speed problem <er, might wanna get a coffee first ; )>

Kadlecsik Jozsef wrote:
On Thu, 10 Apr 2008, Kadlecsik Jozsef wrote:

But this is a good clue to what might bite us most! Our GFS cluster is an almost mail-only cluster for users with Maildir. When the users experience temporary hangups for several seconds (even when writing a new mail), it might be due to the concurrent scanning for new mail on one node by the MUA and the delivery to the Maildir on another node by the MTA.

I personally don't know much about mail servers. But if anyone can explain more about what these two processes (?) do, say, how that "MTA" delivers its mail (by the "rename" system call?) and/or how mails are moved from which node to where, we may have a better chance of figuring this puzzle out.
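As a rough illustration of the delivery pattern being asked about: a Maildir delivery agent typically writes the message under tmp/ and then moves it into new/, while the reader (MUA or IMAP server) lists new/ to find fresh mail. The sketch below assumes the move is done with rename(); some agents use link() plus unlink() instead. The function names and the file naming scheme are illustrative only and are not taken from any particular MTA.

```python
import itertools
import os
import socket
import time

# Per-process counter so two deliveries in the same second get distinct names.
_delivery_counter = itertools.count()


def deliver(maildir, message):
    """Write message (bytes) into tmp/, then rename it into new/."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)

    # Hypothetical naming scheme: timestamp, pid, counter and host name.
    name = "%d.P%dQ%d.%s" % (
        time.time(), os.getpid(), next(_delivery_counter), socket.gethostname()
    )
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    with open(tmp_path, "wb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())

    # This is the step that touches two directories and the file itself.
    os.rename(tmp_path, new_path)
    return new_path


def scan_new(maildir):
    """What a reader does when polling: list the new/ directory."""
    return sorted(os.listdir(os.path.join(maildir, "new")))
```

If delivery runs on one node and scan_new() on another, both keep touching the same new/ directory, which is the concurrent access pattern suspected above.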

Note that the "rename" system call is normally very expensive. A minimum of 4 exclusive locks is required (two directory locks, one file lock for unlink, one file lock for link), plus a resource group lock if block allocation is required. There are numerous chances for deadlocks if this is not handled carefully. The issue is further worsened by the way GFS1 does its lock ordering: it obtains multiple locks in lock name order. Most of the lock names are taken from inode numbers, so their sequence is always quite random. As soon as lock contention occurs, lock requests are serialized to avoid deadlocks. So this may be a cause of these spikes, where "rename"(s) are struggling to get the lock order straight. But I don't know for sure unless someone explains how the email server does its things. BTW, GFS2 has relaxed this lock order issue, so it should work better.
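To make the lock-ordering idea concrete, here is a minimal in-process sketch using Python threads. It is not GFS code: the LockTable class, the rename() helper and the inode numbers are all hypothetical. It only shows the general technique of taking every lock an operation needs in sorted name order, so that two renames going in opposite directions cannot deadlock against each other.

```python
import threading
from contextlib import ExitStack, contextmanager


def lock_order(names):
    """Order in which locks are taken: sorted by name, duplicates dropped."""
    return sorted(set(names))


class LockTable:
    """Hypothetical table of named exclusive locks (names stand in for inode numbers)."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _get(self, name):
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    @contextmanager
    def hold(self, names):
        # Acquire in one global order; release happens when the stack unwinds.
        with ExitStack() as stack:
            for name in lock_order(names):
                stack.enter_context(self._get(name))
            yield


def rename(table, src_dir, dst_dir, src_file, dst_file, log):
    """Stand-in for a rename: two directory locks plus two file locks."""
    with table.hold([src_dir, dst_dir, src_file, dst_file]):
        log.append((src_file, dst_file))
```

Because the order depends only on the lock names, it has nothing to do with which directory is the source and which is the destination. That is also why the sequence looks random when the names come from inode numbers.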

I'm going on a trip (away from the internet) but I'm interested to know this story... Maybe by the time I get back on my laptop, someone has figured this out. But please do share the story :) ...

-- Wendy

What is really strange (and disturbing) is that such "hangups" can take 10-20 seconds, which is just too much for the users.

Yesterday we started to monitor the number of locks/held locks on two of the machines. The results from the first day can be found at http://www.kfki.hu/~kadlec/gfs/.

It looks as if Maildir is definitely the wrong choice for GFS and we should consider converting to mailbox format: at least I cannot explain the spikes in any other way.
In order to look at the possible tuning options and their side effects, I list what I have learned so far:

- Increasing glock_purge (percent, default 0) helps to trim back the unused
  glocks by gfs_scand itself. Otherwise glocks can accumulate and gfs_scand
  eats more and more time scanning the larger and larger table of glocks.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,
  looking for work to do. By increasing scand_secs one can lessen the load
  produced by gfs_scand, but it'll hurt because flushing data can be delayed.
- Decreasing demote_secs (seconds, default 300) helps to flush cached data
  more often by moving write locks into less restricted states. Flushing
  often helps to avoid burstiness *and* to avoid prolonging other nodes'
  lock access. The question is, what are the side effects of small
  demote_secs values? (Probably there is not much point in choosing a
  smaller demote_secs value than scand_secs.)

Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
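For anyone who wants to try the same values, here is a small sketch that builds the corresponding "gfs_tool settune" command lines. The mount point /mnt/gfs is only a placeholder, and the helper names are made up for illustration. By default it only prints the commands; pass dry_run=False on a node that really has the filesystem mounted.

```python
import subprocess

# Values quoted in this mail; adjust to taste.
TUNABLES = {"glock_purge": 20, "demote_secs": 30}


def settune_commands(mountpoint, tunables):
    """Build one 'gfs_tool settune <mountpoint> <name> <value>' argv per tunable."""
    return [
        ["gfs_tool", "settune", mountpoint, name, str(value)]
        for name, value in sorted(tunables.items())
    ]


def apply_tunables(mountpoint, tunables, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    commands = settune_commands(mountpoint, tunables)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "/mnt/gfs" is a hypothetical mount point.
    apply_tunables("/mnt/gfs", TUNABLES)
```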

Best regards,
E-mail : kadlec mail kfki hu, kadlec blackhole kfki hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

