What is really strange (and disturbing) is that such "hangups" can last
10-20 seconds, which is just too much for the users.
Yesterday we started to monitor the number of locks/held locks on two of
the machines. The results from the first day can be found at
http://www.kfki.hu/~kadlec/gfs/.
It looks as if Maildir is definitely the wrong choice for GFS and we
should consider converting to mailbox format: at least I cannot explain
the spikes in any other way.
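
For anyone who wants to collect similar numbers, here is a minimal sketch of pulling the "locks" and "locks held" values out of counter output. It assumes the output of "gfs_tool counters <mountpoint>" consists of lines with a counter name followed by an integer; the sample text and its values are made up for illustration, and parse_counters is a hypothetical helper name.

```python
import re

def parse_counters(text):
    """Parse 'name value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        # Name may contain spaces ("locks held"); the value is the last integer.
        match = re.match(r"^\s*(.+?)\s+(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters

# Hypothetical sample; the layout is an assumption and the values are invented.
SAMPLE = """\
                                  locks 165
                             locks held 133
                          incore inodes 34
"""

counters = parse_counters(SAMPLE)
print(counters["locks"], counters["locks held"])
```

Run periodically (for example from cron) against the real command output, this would give a time series of locks and held locks that could be plotted to look for the spikes.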
In order to look at the possible tuning options and the side effects, I
list what I have learned so far:
- Increasing glock_purge (percent, default 0) helps gfs_scand itself to
trim back the unused glocks. Otherwise glocks can accumulate and
gfs_scand spends more and more time scanning the ever larger
table of glocks.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,
looking for work to do. By increasing scand_secs one can lessen the load
produced by gfs_scand, but it'll hurt because flushing data can be
delayed.
- Decreasing demote_secs (seconds, default 300) helps to flush cached data
more often by moving write locks into less restricted states. Flushing
often helps to avoid burstiness *and* to avoid prolonging other nodes'
lock access. The question is: what are the side effects of small
demote_secs values? (There is probably not much point in choosing
a demote_secs value smaller than scand_secs.)
Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
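
As a minimal sketch of how settings like these could be applied, the snippet below only builds the command lines and does not run them. It assumes the tunables are set with "gfs_tool settune <mountpoint> <name> <value>"; the mount point /mnt/gfs and both helper names are hypothetical. It also includes the demote_secs versus scand_secs check mentioned above.

```python
def settune_commands(mountpoint, tunables):
    """Build one 'gfs_tool settune' argument list per tunable."""
    return [["gfs_tool", "settune", mountpoint, name, str(value)]
            for name, value in tunables.items()]

def demote_below_scan(tunables, scand_default=5, demote_default=300):
    """True if demote_secs is smaller than scand_secs (probably pointless)."""
    scand = tunables.get("scand_secs", scand_default)
    demote = tunables.get("demote_secs", demote_default)
    return demote < scand

# The values we currently run with; /mnt/gfs is a made-up mount point.
current = {"glock_purge": 20, "demote_secs": 30}

for cmd in settune_commands("/mnt/gfs", current):
    print(" ".join(cmd))

if demote_below_scan(current):
    print("warning: demote_secs is smaller than scand_secs")
```

Each argument list could be passed to subprocess.run() on a node where the filesystem is mounted. As far as I know, values set this way do not survive a remount, so they would have to be reapplied after mounting.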
Best regards,
Jozsef
--
E-mail : kadlec mail kfki hu, kadlec blackhole kfki hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary