I am currently investigating an issue with dlm_controld.
After we made some performance improvements, the CPU load of dlm_controld rises to nearly 100% on all 3 nodes and the locking rate drops from 45,000/s to 3/s.
I have a feeling this has something to do with plock_rate_limit, which we disabled in cluster.conf with:
<dlm plock_ownership="1" plock_rate_limit="0"/>
<gfs_controld plock_rate_limit="0" />
We are still on RHEL 6.2, and I'm not sure whether there are major improvements in dlm_controld for RHEL 6.3 (looking at the GitHub repo of dlm, there seem to be quite a few improvements in general, e.g. fencing).
Does anybody have a suggestion for what we could test?
Here are some specs for the systems:
- 3 nodes running RHEL 6.2
- 128 GB RAM
- 64 cores
- FCoE SAN
- 3 NICs: 1x SAN, 1x LAN, 1x Cluster LAN
- mainly running SAS and related jobs
- fencing enabled with fence_ipmilan
Other performance-related settings:
- tuned-adm profile enterprise-storage
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- blockdev --setra 1024 (for each FC block device)
- vm.dirty_background_ratio = 0
- vm.vfs_cache_pressure = 0
- vm.swappiness = 45
- vm.min_free_kbytes = 1976531
- echo 16384 > /sys/kernel/config/dlm/cluster/lkbtbl_size (set before GFS2 mount)
- echo 16384 > /sys/kernel/config/dlm/cluster/rsbtbl_size (set before GFS2 mount)
- echo 16384 > /sys/kernel/config/dlm/cluster/dirtbl_size (set before GFS2 mount)
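Since the three dlm table sizes have to be set before the GFS2 mount, it may be worth confirming on each node that they actually took effect. The following is only a rough sketch of such a check: the function name `read_table_sizes` is made up for illustration, and it assumes the configfs files listed above can be read back as plain integers.

```python
import os

# Directory and file names are taken from the settings listed above.
DLM_CLUSTER_DIR = "/sys/kernel/config/dlm/cluster"
TABLE_NAMES = ("lkbtbl_size", "rsbtbl_size", "dirtbl_size")


def read_table_sizes(base_dir=DLM_CLUSTER_DIR, names=TABLE_NAMES):
    """Return {name: int or None} for each dlm hash table size file.

    Hypothetical helper: None means the file was missing, unreadable,
    or did not contain a plain integer.
    """
    sizes = {}
    for name in names:
        try:
            with open(os.path.join(base_dir, name)) as f:
                sizes[name] = int(f.read().strip())
        except (OSError, ValueError):
            sizes[name] = None
    return sizes
```

Running this on all 3 nodes and comparing the results would show whether one node mounted GFS2 with different table sizes than the others.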
With these settings we get quite good performance at the beginning, but dlm_controld gets stuck after about half an hour.
I thought about setting plock_rate_limit=500 or something similar. Do you think this would be a better setting than leaving it unlimited (plock_rate_limit="0")?
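To compare different plock_rate_limit values, something like the following could be used to measure the POSIX lock rate on a file. This is only a minimal sketch and not the tool that produced the numbers above: the function name `measure_plock_rate` is made up, and it assumes Python is available on the nodes. It uses `fcntl.lockf`, which takes fcntl()-style locks, i.e. the kind of lock that goes through dlm_controld on GFS2.

```python
import fcntl
import time


def measure_plock_rate(path, duration=1.0):
    """Return POSIX lock/unlock cycles per second on `path`.

    Hypothetical helper for illustration. Each cycle takes an exclusive
    fcntl()-style lock on the whole file and releases it again.
    """
    cycles = 0
    with open(path, "a+b") as f:
        start = time.monotonic()
        deadline = start + duration
        while time.monotonic() < deadline:
            fcntl.lockf(f, fcntl.LOCK_EX)  # acquire exclusive POSIX lock
            fcntl.lockf(f, fcntl.LOCK_UN)  # release it again
            cycles += 1
        elapsed = time.monotonic() - start
    return cycles / elapsed
```

A single process on one node only measures uncontended locking, so for a meaningful comparison it would probably have to be run against the same file on the GFS2 mount from all 3 nodes at the same time, once with plock_rate_limit="0" and once with the limited value.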