
Re: [Linux-cluster] gfs tuning



Wendy Cheng wrote:
Terry wrote:
On Tue, Jun 17, 2008 at 5:22 PM, Terry <td3201 gmail com> wrote:
On Tue, Jun 17, 2008 at 3:09 PM, Wendy Cheng <s wendy cheng gmail com> wrote:
Hi, Terry,
I am still seeing some high load averages. Here is an example of a
GFS configuration. I left statfs_fast off because it would not apply to
one of my volumes for an unknown reason. I'm not sure it would have
helped anyway. I do, however, feel that reducing scand_secs helped a
little:

Sorry, I missed scand_secs (I was absent-minded, as my brain was mostly occupied by
daytime work).

To simplify the picture, glock states include exclusive (write), shared (read),
and unlocked (in reality, there are more). An exclusive lock has to be
demoted to shared (after demote_secs), then to unlocked (after another demote_secs),
before it is picked up by the scan (every scand_secs) and added to the reclaim list,
where it can be purged. During the exclusive-to-shared transition, the file
contents must be flushed to disk to keep them coherent across the cluster.
All of the above assumes that the file protected by this glock is idle
(not being accessed).
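Under the simplified model above, an idle glock that starts out exclusive needs two demote periods plus, at most, one scan interval before it reaches the reclaim list. The Python sketch below encodes only that simplified model (real GFS1 glocks have more states); `reclaim_eligible_time` and its `scan_offset` parameter are hypothetical names introduced here, not part of GFS or gfs_tool.

```python
import math


def reclaim_eligible_time(idle_start: float, demote_secs: float,
                          scand_secs: float, scan_offset: float = 0.0) -> float:
    """Estimate when an idle exclusive glock first lands on the reclaim list.

    Hypothetical helper for the simplified model only:
      exclusive -> shared   after demote_secs (file data is flushed here)
      shared    -> unlocked after another demote_secs
      the next scand pass (at scan_offset + k * scand_secs, k >= 0)
      then moves it to the reclaim list.
    """
    if demote_secs < 0 or scand_secs <= 0:
        raise ValueError("demote_secs must be >= 0 and scand_secs > 0")
    unlocked_at = idle_start + 2 * demote_secs
    # Index of the first scan tick at or after the moment the glock is unlocked.
    k = max(math.ceil((unlocked_at - scan_offset) / scand_secs), 0)
    return scan_offset + k * scand_secs


if __name__ == "__main__":
    # Example values only; read your actual settings with gfs_tool gettune <mountpoint>.
    for demote, scand in [(300, 5), (300, 1), (60, 5)]:
        t = reclaim_eligible_time(0, demote, scand)
        print(f"demote_secs={demote:>3} scand_secs={scand}: reclaimable after ~{t:.0f}s")
```

In this model demote_secs is paid twice while scand_secs adds at most one interval, which may explain why lowering scand_secs alone helped only a little.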

You have hit an area where GFS normally doesn't perform well. With GFS1 in
maintenance mode and GFS2 still seemingly far away, ext3 could be a better answer.
However, before switching, make sure to test it thoroughly, since ext3 could have
the very same issue (see:
http://marc.info/?l=linux-nfs&m=121362947909974&w=2 ).

Have you looked at (and tested) the GFS "nolock" protocol (for single-node GFS)? It bypasses some locking overhead and can be switched to DLM later. Just make sure you reserve enough journal space: the rule of thumb is one journal per node, so know how many nodes you plan to have in the future.
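A rough way to size that reservation, following the one-journal-per-node rule of thumb: the journal count is what you would pass to gfs_mkfs with its -j option when creating the filesystem. In this Python sketch, `plan_journals` and `JournalPlan` are hypothetical names, and the 128 MB per-journal size is an assumed illustrative value, not a recommendation; substitute whatever size you actually choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalPlan:
    journals: int              # one journal per node that may ever mount the fs
    reserved_mb: int           # total space set aside for journals
    fraction_of_volume: float  # reserved_mb / volume_mb


def plan_journals(planned_nodes: int, volume_mb: int,
                  journal_mb: int = 128) -> JournalPlan:
    """Size journals with the 'one journal per node' rule of thumb.

    journal_mb=128 is an assumed per-journal size used for illustration;
    substitute the size you actually choose at mkfs time.
    """
    if planned_nodes < 1:
        raise ValueError("plan for at least one node (one journal)")
    if volume_mb <= 0 or journal_mb <= 0:
        raise ValueError("sizes must be positive")
    reserved = planned_nodes * journal_mb
    if reserved >= volume_mb:
        raise ValueError("journals would not fit on the volume")
    return JournalPlan(planned_nodes, reserved, reserved / volume_mb)


if __name__ == "__main__":
    # Hypothetical example: start single-node (lock_nolock) on a 1 TB volume,
    # but reserve journals for a possible 4-node cluster later.
    plan = plan_journals(planned_nodes=4, volume_mb=1024 * 1024)
    print(plan)
```

The point of the exercise is that the journal space is small compared to the volume, so over-provisioning for future nodes costs little up front.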

-- Wendy
Good points. I could try the nolock feature, I suppose. I'm not quite
clear on how to reserve journal space. I forgot to post the CPU time;
check this out:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
 4822 root      10  -5     0    0    0 S    1  0.0   2159:15  dlm_recv
 4820 root      10  -5     0    0    0 S    1  0.0 368:09.34  dlm_astd
 4821 root      10  -5     0    0    0 S    0  0.0 153:06.80  dlm_scand
 3659 root      10  -5     0    0    0 S    0  0.0 134:40.14  scsi_wq_4
 4823 root      11  -5     0    0    0 S    1  0.0 109:33.33  dlm_send
  367 root      10  -5     0    0    0 S    0  0.0 103:33.74  kswapd0

gfs_glockd is further down the list, so I'm not so concerned with it right now.
It appears that turning on nolock would do the trick. The times aren't
very accurate because I have failed this cluster over between nodes
while testing.


Here is some more testing information:

I created a new 1 TB volume on my iSCSI SAN and formatted it as
ext3. I then used dd to create a 100G file. This yielded roughly 900
Mb/sec. I then stopped my application and did the same thing on an
existing GFS volume. This gave me about 850 Kb/sec. This isn't an
iSCSI issue; it appears to be a load issue, related to the amount of I/O
occurring on these volumes. That said, I would have expected the
changes I made to result in a major performance improvement.
Since they didn't, what other options could I consider? If it's a
GFS issue, ext3 is the way to go, maybe even switching to
active-active on my NFS cluster. If it's a back-end disk issue, I
would expect the throughput on my iSCSI link (bond1) to be fully
utilized. It's not. Could I be thrashing the disks? This is an iSCSI
SAN with 30 SATA disks. Just bouncing some thoughts around to see if
anyone has any more ideas.

I really need to focus on my daytime job, whose workload has been climbing, but I can't help adding a quick comment here:

The 900 MB/s vs. 850 KB/s difference looks like a caching issue: in the 900 MB/s case, the data was probably still sitting in the system cache, while in the 850 KB/s case the data may already have hit disk. A cluster filesystem normally syncs more by its nature. In general, ext3 does perform better in a single-node environment, but the difference should not be as big as this. There are certainly more tuning knobs available (such as journal size and/or network buffer size) to make a GFS-over-iSCSI "dd" run better, but that is pointless: when deploying a cluster filesystem for production use, the tuning should not be driven by such a simple-minded command.

You also have to consider support issues when deploying a filesystem. GFS1 is a bit out of date, and any new development and/or significant performance improvements are likely to land in GFS2, not GFS1. Research GFS2 (search to see what other people say about it) to understand whether its direction fits your needs, so that you can migrate from GFS1 to GFS2 if you bump into a show-stopper in the future. If not, ext3 (with ext4 under active development) is a fine choice, if I read your configuration from previous posts correctly.
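One way to separate cache effects from real disk throughput is to time the same write with and without a final flush. This Python sketch (a hypothetical helper, `write_throughput`, writing zeros to a test path you choose) stops the clock either when the last write() returns or after os.fsync(); GNU dd can make a similar distinction with its conv=fdatasync option. It is a sanity check of caching only, not a benchmark that should drive production tuning.

```python
import os
import time


def write_throughput(path: str, total_mb: int, block_kb: int = 1024,
                     sync: bool = True) -> float:
    """Write roughly total_mb of zeros to path and return the observed MB/s.

    With sync=False the timer stops once the last write() returns, so much of
    the data may still be in the page cache. With sync=True, os.fsync() runs
    before the timer stops, so the figure includes flushing to storage.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    written_mb = blocks * block_kb / 1024
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(blocks):
            view = memoryview(block)
            while view:  # os.write may write fewer bytes than requested
                n = os.write(fd, view)
                view = view[n:]
        if sync:
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return written_mb / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        target = sys.argv[1]  # e.g. a test file on the GFS or ext3 mount
        for sync in (False, True):
            rate = write_throughput(target, total_mb=256, sync=sync)
            print(f"sync={sync}: {rate:.1f} MB/s")
        os.remove(target)
    else:
        print("usage: python write_throughput.py /path/on/the/filesystem/testfile")
```

Running it with a test file on each mount and comparing the sync=False and sync=True figures on the same filesystem gives a rough idea of how much of a fast number came from the page cache.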

Or... there is a known GFS1 writepage issue if most of your files are very big. The problem is fixed in RHEL kernels, though. What is your kernel version?

-- Wendy

