[Linux-cluster] 100% CPU load of dlm_controld

Thu Feb 14 10:03:04 UTC 2013

Hello,

I am currently investigating an issue with dlm_controld.

After we made some performance improvements, the CPU load of dlm_controld
rises to nearly 100% on all 3 nodes and the lock rate drops from 45,000/s to
3/s.

I have a feeling this has something to do with plock_rate_limit, which we
disabled in cluster.conf with:

        <dlm plock_ownership="1" plock_rate_limit="0"/>
        <gfs_controld plock_rate_limit="0" />

We are still on RHEL 6.2, and I'm not sure whether there are major
improvements in dlm_controld for RHEL 6.3 (looking at the GitHub repo of dlm,
there seem to be quite a few improvements in general, e.g. in fencing).

Does anybody have a suggestion for what we could test?

All in all, here are some specs about the systems:

- 3 nodes running RHEL 6.2
- 128 GB RAM
- 64 cores
- FCoE SAN
- 3 NICs: 1x SAN, 1x LAN, 1x Cluster LAN
- mainly running SAS and related jobs
- fencing enabled with fence_ipmilan

Other performance related settings:
- tuned-adm profile enterprise-storage
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- blockdev --setra 1024 (for each FC block device)
- vm.dirty_background_ratio = 0
- vm.vfs_cache_pressure = 0
- vm.swappiness = 45
- vm.min_free_kbytes = 1976531
- echo 16384 > /sys/kernel/config/dlm/cluster/lkbtbl_size (set before GFS2 mount)
- echo 16384 > /sys/kernel/config/dlm/cluster/rsbtbl_size (set before GFS2 mount)
- echo 16384 > /sys/kernel/config/dlm/cluster/dirtbl_size (set before GFS2 mount)
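
For reference, the three DLM table sizes above could be applied with a small
helper like the one below. This is only a sketch: the function name
set_dlm_table_sizes and its optional second argument (an alternative
directory, useful for a dry run) are made up for illustration. It assumes the
configfs entries already exist and that it runs before the GFS2 mount.

```shell
#!/bin/sh
# Hypothetical helper (not part of any dlm package):
#   set_dlm_table_sizes SIZE [CONFIG_DIR]
# Writes SIZE into the three DLM hash table size entries listed above.
set_dlm_table_sizes() {
    size=$1
    # Default is the configfs path from the list; override it for a dry run.
    dir=${2:-/sys/kernel/config/dlm/cluster}
    for name in lkbtbl_size rsbtbl_size dirtbl_size; do
        if [ ! -e "$dir/$name" ]; then
            # Refuse to create files: a missing entry means dlm is not set up yet.
            echo "missing $dir/$name" >&2
            return 1
        fi
        echo "$size" > "$dir/$name" || return 1
    done
}
```

Calling set_dlm_table_sizes 16384 before mounting would correspond to the
three echo lines above.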

With these settings we get quite good performance at the beginning but
dlm_controld gets stuck after half an hour or so.
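
To pin down when the load starts, something like the following could be used
to sample the CPU time of dlm_controld over an interval. It is a rough sketch
that only reads /proc/PID/stat on Linux; the function names cpu_ticks and
cpu_ticks_delta are made up for illustration.

```shell
#!/bin/sh
# Print the cumulative CPU ticks (utime + stime) of a process,
# read from /proc/PID/stat.
cpu_ticks() {
    pid=$1
    stat=$(cat "/proc/$pid/stat") || return 1
    # Drop "pid (comm) " so a command name containing spaces cannot shift fields.
    rest=${stat##*") "}
    # rest starts at field 3 (state), so utime (field 14) and stime (field 15)
    # are the 12th and 13th words.
    set -- $rest
    echo $(( ${12} + ${13} ))
}

# Print the ticks a process consumed during an interval of SECONDS.
cpu_ticks_delta() {
    pid=$1
    seconds=$2
    before=$(cpu_ticks "$pid") || return 1
    sleep "$seconds"
    after=$(cpu_ticks "$pid") || return 1
    echo $(( after - before ))
}
```

For example, cpu_ticks_delta "$(pidof dlm_controld)" 10 prints the ticks used
in 10 seconds; dividing by 10 times the value of getconf CLK_TCK gives the
fraction of one core in use.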

I thought about setting plock_rate_limit=500 or something similar. Do you
think a finite limit like this would be better than leaving it unlimited?

Cheers,
Julian