[Linux-cluster] optimising DLM speed?

Steven Whitehouse swhiteho at redhat.com
Wed Feb 16 11:13:38 UTC 2011


Hi,

On Tue, 2011-02-15 at 21:07 +0100, Marc Grimme wrote:
> Hi Steve,
> I think lately I observed a very similar behavior with RHEL5 and gfs2.
> It was a gfs2 filesystem with about 2 million files, totalling about 2GB, in one directory. Running du -shx . in that directory took about 5 minutes (with the noatime mount option set), regardless of how many nodes took part in the cluster (in the end I only tested with one node). This was only the case for the first run; all later du commands were much faster.
> When I mounted the exact same filesystem with lockproto=lock_nolock, the same command took about 10-20 seconds.
> 
> Next I started to analyze this with oprofile and observed the following result:
> 
> opreport --long-file-names:
> CPU: AMD64 family10, speed 2900.11 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
> samples  %        symbol name
> 200569   46.7639  search_rsb_list
The resource table is 256 entries in size by default. Assuming that
you have enough RAM that all 4m locks (for 2m files) are in memory at
the same time, that is approx 15625 resources per hash chain, so it
would make sense that this would start to slow things down a bit.

There is a config option to increase the resource table size though, so
perhaps you could try that?
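For reference, a rough sketch of how you might check the current value
and estimate the chain length (the configfs path and attribute name
below are from memory, so please double check them on your kernel):

    # Current DLM resource table size (assumes configfs is mounted and
    # the dlm "cluster" directory exists; default is 256)
    cat /sys/kernel/config/dlm/cluster/rsbtbl_size

    # Back-of-the-envelope chain length: ~2 locks per file for 2 million
    # files, spread over 256 hash buckets
    echo $(( 2 * 2000000 / 256 ))    # => 15625 resources per chain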

> 118905   27.7234  create_lkb
This reads down a hash chain in the lkb table. That table is larger by
default (1024 entries), which is probably why less CPU time is burned
here. On the other hand, a hash chain might be read more than once if
there is a collision on the lock ids. Again, this is a config option, so
it should be possible to increase the size of this table too.
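As a sketch only (untested, and the exact attribute names may differ
between versions), both table sizes can be written through configfs
before the filesystem is mounted, since the tables are allocated when
the lockspace is created:

    # Increase both DLM hash table sizes before mounting gfs2
    # (example values only; the device and mountpoint below are made up)
    echo 4096 > /sys/kernel/config/dlm/cluster/rsbtbl_size
    echo 4096 > /sys/kernel/config/dlm/cluster/lkbtbl_size
    mount -t gfs2 -o noatime /dev/myvg/gfs2lv /mnt/gfs2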

> 32499     7.5773  search_bucket
> 4125      0.9618  find_lkb
> 3641      0.8489  process_send_sockets
> 3420      0.7974  dlm_scan_rsbs
> 3184      0.7424  _request_lock
> 3012      0.7023  find_rsb
> 2735      0.6377  receive_from_sock
> 2610      0.6085  _receive_message
> 2543      0.5929  dlm_allocate_rsb
> 2299      0.5360  dlm_hash2nodeid
> 2228      0.5195  _create_message
> 2180      0.5083  dlm_astd
> 2163      0.5043  dlm_find_lockspace_global
> 2109      0.4917  dlm_find_lockspace_local
> 2074      0.4836  dlm_lowcomms_get_buffer
> 2060      0.4803  dlm_lock
> 1982      0.4621  put_rsb
> ..
> 
> opreport --image /gfs2
> CPU: AMD64 family10, speed 2900.11 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
> samples  %        symbol name
> 9310     15.5600  search_bucket
This should get better in RHEL6.1 and above, due to the new design of
the glock hash table. The patch is already in upstream. The glock hash
table is much larger than the dlm hash table, though there are still
scalability issues due to the locking and the fact that we cannot
currently grow the hash table.

> 6268     10.4758  do_promote
The do_promote result is interesting, as I wouldn't really have expected
it to show up here, so I'll look into it when I have a moment and try to
figure out what is going on.

> 2704      4.5192  gfs2_glock_put
> 2289      3.8256  gfs2_glock_hold
> 2286      3.8206  gfs2_glock_schedule_for_reclaim
> 2204      3.6836  gfs2_glock_nq
> 2204      3.6836  run_queue
> 2001      3.3443  gfs2_holder_wake
> ..
> 
> opreport --image /dlm
> CPU: AMD64 family10, speed 2900.11 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
> samples  %        symbol name
> 200569   46.7639  search_rsb_list
> 118905   27.7234  create_lkb
> 32499     7.5773  search_bucket
> 4125      0.9618  find_lkb
> 3641      0.8489  process_send_sockets
> 3420      0.7974  dlm_scan_rsbs
> 3184      0.7424  _request_lock
> 3012      0.7023  find_rsb
> 2735      0.6377  receive_from_sock
> 2610      0.6085  _receive_message
> 2543      0.5929  dlm_allocate_rsb
> 2299      0.5360  dlm_hash2nodeid
> 2228      0.5195  _create_message
> ..
> 
> This very much reminded me of a similar test we did years ago with gfs (see http://www.open-sharedroot.org/Members/marc/blog/blog-on-dlm/red-hat-dlm-__find_lock_by_id/profile-data-with-diffrent-table-sizes).
> 
> Doesn't this show that during the du command the kernel spends 46% of its time in the dlm:search_rsb_list function while looking up locks? It still looks like the lock hash table in dlm is much too small, so that searching it is no longer constant time?
> 
> It would be really interesting to know how long the described backup takes when the gfs2 filesystem is mounted exclusively on one node without locking.
> To me it looks like you're facing a similar problem with gfs2 to the one that was worked around in gfs by introducing the glock_purge functionality, which leads to far fewer entries in the glock and dlm hash tables and makes backups and the like much faster.
> 
> I hope this helps.
> 
> Thanks and regards
> Marc.
> 
Many thanks for this information. It is really helpful to get feedback
like this, which helps to identify issues in the code,

Steve.




