[Linux-cluster] GFS2 and backups (performance tuning)

Steven Whitehouse swhiteho at redhat.com
Fri Dec 4 09:44:30 UTC 2009


Hi,

I'd suggest filing a bug in the first instance. I can't see anything
obviously wrong with what you are doing. The fcntl() locks go via the
dlm and dlm_controld, not via the glock_workqueues, so I don't think that
is likely to be the issue.

Steve.

On Thu, 2009-12-03 at 12:42 -0800, Ray Van Dolson wrote:
> We have a two-node cluster primarily acting as an NFS serving
> environment.  Our backup infrastructure here uses NetBackup and,
> unfortunately, NetBackup has no PPC client (we're running on IBM JS20
> blades), so we're approaching the backup strategy in two different ways:
> 
>   - Run netbackup client from another machine and point it to NFS share
>     on one of our two cluster nodes
>   - Run rsyncd on our cluster nodes and rsync from a remote machine.
>     NetBackup then backs up that machine.
> 
> The GFS2 filesystem in our cluster is only storing about 90GB of data,
> but holds about one million files (inodes used, as reported via df -i).
> 
> (For the curious, this is a home directory server and we do break
> things up under a top-level hierarchy with a folder for each first
> letter of a username.)
> 
> The NetBackup over NFS route is extremely slow and spikes the load up
> on whichever server is being backed up from.  We made the following
> adjustments to try to improve performance:
> 
>   - Set the following in our cluster.conf file:
> 
>     <dlm plock_ownership="1" plock_rate_limit="0"/>
>     <gfs_controld plock_rate_limit="0"/>
> 
>     ping_pong will give me about 3-5k locks/sec now (see the example
>     after this list).
>   
>   - Mounted filesystem with noatime,nodiratime,quota=off
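> 
> (For anyone wanting to reproduce the lock-rate numbers above: ping_pong
> comes from the ctdb test suite and takes a file on the filesystem under
> test plus a contending process count.  The path below is just an
> example:
> 
>     ping_pong /domus1/tmp/test.dat 3
> 
> It prints the fcntl() locks/sec it achieves, which is where the 3-5k
> figure above comes from.)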
> 
> This seems to have helped a bit, but things are still taking a long
> time.  I should note here that I tried running ping_pong against one of
> our cluster nodes via one of its NFS exports of the GFS2 filesystem.
> While I can get 3000-5000 locks/sec locally, over NFS it was about... 2
> or 3 (not thousand, literally 2 or 3).  tcpdump of the NLM port shows
> the NFS lock manager on the node responding with NLM_BLOCK most of the
> time.  I'm not sure whether GFS2 or our NFS daemon is to blame... in
> any case...
> 
> ... I've set up rsyncd on the cluster nodes and am syncing from a
> remote server now (all of this via Gigabit Ethernet).  I'm over an hour
> in and the client is still generating the file list.  strace confirms
> that rsync --daemon is still trawling through the filesystem,
> generating the list of files...
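> 
> For completeness, the rsyncd side is nothing exotic; the module
> definition is roughly the following (module name and path are
> illustrative):
> 
>     # /etc/rsyncd.conf -- export the GFS2 home directories read-only
>     [homes]
>         path = /domus1
>         read only = yes
>         uid = root
>         gid = root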
> 
> I've done a blktrace dump on my GFS2 filesystem's block device and can
> clearly see glock_workqueue showing up the most by far.  However, I
> don't know what else I can glean from these results.
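> 
> (For reference, I captured the trace roughly as follows, with the
> device path illustrative:
> 
>     blktrace -d /dev/mapper/vg_domus-lv_domus1 -o - | blkparse -i -
> 
> and then looked at which process names dominate the output.)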
> 
> Anyone have any tips or suggestions on improving either our NFS locking
> or rsync --daemon performance beyond what I've already tried?  It might
> almost be quicker for us to do a full backup each time than to spend
> hours building file lists for differential backups :)
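> 
> For reference, the pull from the remote side is essentially the
> following (host and module names are illustrative):
> 
>     rsync -a --delete rsync://node1/homes/ /backup/homes/
> 
> It's the file-list generation on the daemon side, before any data
> moves, that eats all the time.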
> 
> Details of our setup:
> 
>   - IBM DS4300 Storage (12 drive RAID5 + 2 spares)
>     - Exposed as two LUNs (one per controller)
>     - Don't believe this array does hardware snapshots :(
>   - Two (2) IBM JS20 Blades (PPC)
>     - QLogic ISP2312 2Gb HBAs
>     - RHEL 5.4 Advanced Platform PPC
>     - multipathd
>     - clvm aggregates two LUNs
>     - GFS2 on top of clvm
>       - Configured with quotas originally, but disabled later by
>         mounting quota=off
>       - Mounted with noatime,nodiratime,quota=off
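> 
>     In fstab terms the mount looks roughly like this (the device path
>     is illustrative):
> 
>       /dev/mapper/vg_domus-lv_domus1 /domus1 gfs2 noatime,nodiratime,quota=off 0 0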
> 
>   # gfs2_tool gettune /domus1
>   new_files_directio = 0
>   new_files_jdata = 0
>   quota_scale = 1.0000   (1, 1)
>   logd_secs = 1
>   recoverd_secs = 60
>   statfs_quantum = 30
>   stall_secs = 600
>   quota_cache_secs = 300
>   quota_simul_sync = 64
>   statfs_slow = 0
>   complain_secs = 10
>   max_readahead = 262144
>   quota_quantum = 60
>   quota_warn_period = 10
>   jindex_refresh_secs = 60
>   log_flush_secs = 60
>   incore_log_blocks = 1024
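> 
>   Individual tunables can be changed at runtime with settune, so
>   suggested changes are easy to test without remounting.  For example,
>   to bump readahead (the value here is illustrative):
> 
>     gfs2_tool settune /domus1 max_readahead 1048576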
> 
>   # gfs2_tool getargs /domus1
>   data 2
>   suiddir 0
>   quota 0
>   posix_acl 1
>   upgrade 0
>   debug 0
>   localflocks 0
>   localcaching 0
>   ignore_local_fs 0
>   spectator 0
>   hostdata jid=1:id=196610:first=0
>   locktable 
>   lockproto 
> 
> Thanks in advance for any advice.
> 
> Ray
> 
