[Linux-cluster] GFS2 and backups (performance tuning)

Ray Van Dolson rvandolson at esri.com
Thu Dec 3 20:42:57 UTC 2009


We have a two-node cluster primarily acting as an NFS serving
environment.  Our backup infrastructure here uses NetBackup and,
unfortunately, NetBackup has no PPC client (we're running on IBM JS20
blades), so we're approaching the backup strategy in two different ways:

  - Run the NetBackup client from another machine and point it at an
    NFS share on one of our two cluster nodes (roughly as sketched below)
  - Run rsyncd on our cluster nodes and rsync from a remote machine.
    NetBackup then backs up that machine.
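
For the first approach, the backup host just mounts the export and the
NetBackup client is pointed at the mount point; roughly like this
(hostname and paths are placeholders, not our real names):

  # On the backup host: mount the export read-only, then back up /mnt/domus1
  mount -t nfs -o ro,nfsvers=3,tcp node1:/domus1 /mnt/domus1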

The GFS2 filesystem in our cluster is only storing about 90GB of data,
but has about one million files on it (inode usage reported via df -i).

(For the curious, this is a home directory server and we do break
things up under a top-level hierarchy with a folder for each first
letter of a username.)

The NetBackup-over-NFS route is extremely slow and spikes the load on
whichever node is being backed up from.  We made the following
adjustments to try to improve performance:

  - Set the following in our cluster.conf file:

    <dlm plock_ownership="1" plock_rate_limit="0"/>
    <gfs_controld plock_rate_limit="0"/>

    ping_pong now gives me about 3-5k locks/sec (example invocation
    below).

  - Mounted the filesystem with noatime,nodiratime,quota=off
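
For reference, here is roughly what the lock test and the mount look
like (file path, device name, and fstab entry are illustrative;
ping_pong is the ctdb/Samba lock tester, run with "number of nodes + 1"
as its second argument):

  # Lock rate test: ping_pong <file on GFS2> <nodes + 1>
  ping_pong /domus1/ping_pong.dat 3

  # /etc/fstab entry (device path stands in for our clvm LV)
  /dev/vg_domus1/lv_domus1  /domus1  gfs2  noatime,nodiratime,quota=off  0 0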

This seems to have helped a bit, but things are still taking a long
time.  I should note here that I tried running ping_pong against one of
our cluster nodes via one of its NFS exports of the GFS2 filesystem.
While I can get 3000-5000 locks/sec locally, over NFS it was about...
2 or 3 (not thousand, literally 2 or 3).  A tcpdump of the NLM port
shows the NFS lock manager on the node responding NLM_BLOCK most of the
time.  I'm not sure whether GFS2 or our NFS daemon is to blame... in
any case...
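
In case anyone wants to reproduce the capture, this is roughly how I
grabbed the NLM traffic (interface and port are placeholders; lockd
registers its port with the portmapper):

  # Find the port nlockmgr (the NFS lock manager) is registered on
  rpcinfo -p node1 | grep nlockmgr

  # Capture traffic on that port from the NFS client side
  tcpdump -i eth0 -nn -s 0 -w /tmp/nlm.pcap port <nlm-port>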

.. I've set up rsyncd on the cluster nodes and am syncing from a remote
server now (all of this over Gigabit Ethernet).  I'm over an hour in
and the client is still generating the file list.  strace confirms that
rsync --daemon is still trawling through the filesystem, generating the
list of files...
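
For context, the rsyncd setup looks roughly like this (module name and
destination path are made up for the example):

  # /etc/rsyncd.conf on each cluster node
  [homes]
      path = /domus1
      read only = yes
      uid = root
      gid = root

  # On the remote server: pull the module; NetBackup then backs up
  # /backup/homes from that machine
  rsync -a --delete node1::homes/ /backup/homes/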

I've done a blktrace dump on my GFS2 filesystem's block device, and
glock_workqueue clearly shows up far more than anything else.  However,
I don't know what else I can glean from these results.
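
For the record, the trace was gathered along these lines (the device
path is a stand-in for whatever device-mapper name clvm gives our LV):

  # Trace the LV backing GFS2 and pipe straight into blkparse; the
  # per-process summary at the end is where glock_workqueue dominates
  blktrace -d /dev/mapper/vg_domus1-lv_domus1 -o - | blkparse -i - > /tmp/gfs2-trace.txt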

Anyone have any tips or suggestions on improving either our NFS locking
or rsync --daemon performance beyond what I've already tried?  It might
almost be quicker for us to do a full backup each time than to spend
hours building file lists for differential backups :)

Details of our setup:

  - IBM DS4300 Storage (12-drive RAID5 + 2 spares)
    - Exposed as two LUNs (one per controller)
    - Don't believe this array does hardware snapshots :(
  - Two (2) IBM JS20 Blades (PPC)
    - QLogic ISP2312 2Gb HBAs
    - RHEL 5.4 Advanced Platform PPC
    - multipathd
    - clvm aggregates the two LUNs
    - GFS2 on top of clvm (rough sketch of the stack below)
      - Configured with quotas originally, but disabled later by
        mounting with quota=off
      - Mounted with noatime,nodiratime,quota=off
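
For completeness, the storage stack was put together more or less like
the following (volume group, LV, and lock table names are illustrative,
not necessarily our exact ones):

  # Clustered VG spanning both multipathed LUNs, one LV across them
  pvcreate /dev/mapper/mpath0 /dev/mapper/mpath1
  vgcreate -c y vg_domus1 /dev/mapper/mpath0 /dev/mapper/mpath1
  lvcreate -l 100%FREE -n lv_domus1 vg_domus1

  # GFS2 with DLM locking and one journal per node
  mkfs.gfs2 -p lock_dlm -t ourcluster:domus1 -j 2 /dev/vg_domus1/lv_domus1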

  # gfs2_tool gettune /domus1
  new_files_directio = 0
  new_files_jdata = 0
  quota_scale = 1.0000   (1, 1)
  logd_secs = 1
  recoverd_secs = 60
  statfs_quantum = 30
  stall_secs = 600
  quota_cache_secs = 300
  quota_simul_sync = 64
  statfs_slow = 0
  complain_secs = 10
  max_readahead = 262144
  quota_quantum = 60
  quota_warn_period = 10
  jindex_refresh_secs = 60
  log_flush_secs = 60
  incore_log_blocks = 1024

  # gfs2_tool getargs /domus1
  data 2
  suiddir 0
  quota 0
  posix_acl 1
  upgrade 0
  debug 0
  localflocks 0
  localcaching 0
  ignore_local_fs 0
  spectator 0
  hostdata jid=1:id=196610:first=0
  locktable 
  lockproto 

Thanks in advance for any advice.

Ray



