[Linux-cluster] GFS tuning for combined batch / interactive use

Steven Whitehouse swhiteho at redhat.com
Fri Dec 17 19:43:53 UTC 2010


Hi,

On Fri, 2010-12-17 at 20:06 +0100, Kevin Maguire wrote:
> Hi
> 
> > You can get a glock dump via debugfs which may show up contention; look
> > for type 2 glocks which have lots of lock requests queued but not
> > granted. The lock requests (holders) are tagged with the relevant 
> > process.
> 
> Note I am currently using GFS, not GFS2. Before going further I ran the
> ping_pong test on my cluster and saw only about 100 locks/second even on
> just 1 node. So maybe I should look at the plock_rate_limit parameter,
> though I'm not sure that is our core problem.
> 
The same thing applies to GFS as to GFS2; it's just that the format of the
dump is different. GFS2 uses a rather more compact format, which makes a
big difference to the size of the dump on larger machines.
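
(For reference, on GFS2 the equivalent glock dump can be read straight out
of debugfs, something along the lines of:

# cat /sys/kernel/debug/gfs2/dfoxen-cluster:mygfs/glocks

assuming the usual <clustername>:<fsname> naming. Since you are on GFS,
though, gfs_tool lockdump is the one to use, as below.)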

The plock_rate_limit can usually be turned off quite safely, but unless
your application uses plocks (fcntl/POSIX locks), turning it off won't make
any difference to performance.
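
If you do want to turn it off, it is set in cluster.conf - depending on the
release it is either the gfs_controld or the dlm_controld setting, so
something along the lines of (this is just a sketch, check the docs for
your version):

<gfs_controld plock_rate_limit="0"/>

where 0 means no limit.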

> Anyway, as I write this my test cluster is being heavily used with batch
> jobs, and thus I have a window of opportunity to study it under load (but
> not change it). I have debugfs mounted. There are 10 nodes in this test
> cluster. My filesystem is called mygfs, and it was created via
> 
> mkfs.gfs -O -t dfoxen-cluster:mygfs -p lock_dlm -j 10 -r 2048 /dev/mapper/vggfs-lvgfs
> 
> This is what I have in debugfs:
> 
> # find /sys/kernel/debug/ -type f -exec wc -l {} \;
> 2309 /sys/kernel/debug/dlm/mygfs_locks
> 0 /sys/kernel/debug/dlm/mygfs_waiters
> 16258 /sys/kernel/debug/dlm/mygfs
> 2 /sys/kernel/debug/dlm/clvmd_locks
> 0 /sys/kernel/debug/dlm/clvmd_waiters
> 7 /sys/kernel/debug/dlm/clvmd
> 
> The lock dump file has content like:
> 
> # cat /sys/kernel/debug/dlm/mygfs_locks
> id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
> 14f19eb 0 0 1038 0 0 0 2 3 -1 0 0 24 "       5         cec3e6d"
> 3da1a67 0 0 31861 0 0 0 2 3 -1 0 0 24 "       5         a0fafc2"
> 1120003 1 16f0019 3552 0 408 0 2 0 -1 0 1 24 "       3        2d8b9091"
> af0002 1 10024 3552 0 408 0 2 0 -1 0 1 24 "       3        2053fbf8"
> ...
> 
> But I don't really see how to work out which type of lock is which from
> this file - sorry. Given $2 is the nodeid I can work out who has locks, and
> that leads to a minor strangeness:
> 
> node1 # awk 'NR>1{print $2}' /sys/kernel/debug/dlm/mygfs_locks | sort | uniq -c | sort -k +2n
>     2142 0
>     1619 2
>     2001 3
>     1586 4
>     1566 5
>     1624 6
>     1610 7
>     1733 8
>     1592 9
>     1612 10
> 
> These numbers are much bigger than the counts on the 9 other nodes, e.g.
> 
> node2 # awk 'NR>1{print $2}' /sys/kernel/debug/dlm/mygfs_locks | sort | uniq -c | sort -k +2n
>      441 0
>     1630 1
>       75 3
>        2 4
>       10 5
>       25 7
>       15 8
>       38 10
> 
> Is that normal?
> 
> Using gfs_tool's lockdump I see
> 
> node1 # gfs_tool lockdump /newcache | egrep '^Glock' | sed 's?(\([0-9]*\).*)?\1?g' | sort | uniq -c
>        3 Glock 1
>      308 Glock 2
>     1538 Glock 3
>        2 Glock 4
>      233 Glock 5
>        2 Glock 8
> 
> Only type 2 and type 5 counts seem to change. Across the cluster there is 
> one node with a lot more (10x more) Glock type 2 and Glock type 5 locks.
> 
This lock dump is the one to look at first. The dlm dumps are really only
useful when something has gone wrong and you need to check whether the dlm
has a different idea of what is going on than gfs does. Type 2 glocks
relate to inodes (as do type 5, but those have no bearing on performance in
this case). Type 3 glocks relate to resource groups.
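
For example, to pull out just the type 2 entries (assuming the dump headers
look like "Glock (2, 12345)", which is what your sed output above suggests):

# gfs_tool lockdump /newcache | grep '^Glock (2,'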

It is the gfs_tool lockdump output that contains the information you need.
The interesting glocks are those with a number of "Holders" attached to
them which are sitting on the "Waiters" queues (i.e. not granted); the more
of those waiting holders a lock has, the more interesting it is from a
performance point of view.
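
As a rough sketch (assuming waiting holders show up as "Waiter..." lines in
the dump - adjust the pattern to whatever your version prints), something
like this lists the glocks with the most waiters:

# gfs_tool lockdump /newcache | awk '
    /^Glock/ { if (w) print w, g; g = $0; w = 0 }
    /Waiter/ { w++ }
    END      { if (w) print w, g }' | sort -rn | head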

In the case of type 2 glocks, the other part of the glock number is also
the inode number, so when the system is otherwise idle, a find -inum will
tell you which inode was causing the problems - provided it wasn't a
temporary file, of course :-)
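
Something along these lines (the hex number here is just taken from your dlm
dump above as an illustration - use the number of the glock you have
identified, and only convert from hex if your dump prints it that way):

# find /newcache -xdev -inum $(printf '%d' 0xcec3e6d)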

If you have access to the Red Hat kbase system, then this is all
described in the docs on that site.

> # gfs_tool counters /newcache
> 
>                                    locks 2313
>                               locks held 781
>                             freeze count 0
>                            incore inodes 230
>                         metadata buffers 1061
>                          unlinked inodes 28
>                                quota IDs 2
>                       incore log buffers 28
>                           log space used 1.46%
>                meta header cache entries 1304
>                       glock dependencies 185
>                   glocks on reclaim list 0
>                                log wraps 91
>                     outstanding LM calls 0
>                    outstanding BIO calls 0
>                         fh2dentry misses 0
>                         glocks reclaimed 2125924
>                           glock nq calls 801437507
>                           glock dq calls 796261692
>                     glock prefetch calls 319835
>                            lm_lock calls 6396763
>                          lm_unlock calls 1031709
>                             lm callbacks 7669741
>                       address operations 1267096416
>                        dentry operations 35815146
>                        export operations 0
>                          file operations 233333825
>                         inode operations 61818196
>                         super operations 148712313
>                            vm operations 87114
>                          block I/O reads 0
>                         block I/O writes 0
> 
> Not sure if anyone can make anything from all these numbers ...
> 
There is nothing that stands out as a problem there, but the counters are
generally not very useful.

Steve.

> Thanks,
> Kevin
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




