[Linux-cluster] High I/O Wait Rates - RHEL 6.1 + GFS2 + NFS

anderson souza andersonlira at gmail.com
Tue Jun 28 05:55:09 UTC 2011


Hi everyone,

I have an Active/Passive RHCS 6.1 cluster running 8TB of GFS2 with NFS on
top, exporting 26 mount points to 250 NFS clients. The GFS2 file systems are
mounted with the noatime, nodiratime, data=writeback and localflocks options,
and the SAN and servers are fast (4Gbps and 8Gbps links, dual controllers
working in load balancing and H.A., quad-core CPUs, 48GB of memory...). The
cluster has been doing its job (failover works fine), but unfortunately I am
seeing high I/O wait rates, sometimes around 60-70% (which is very bad),
along with a number of glock_workqueue jobs, and I get a bunch of gfs2_quotad
hung-task warnings, nfsd errors and qdisk latency messages. The debugfs glock
dump didn't show me any "W" entries, only "G" and "H" lines (see the example
below for how I'm checking).

Have you seen this before?
Does it look like glock contention?
What does it mean, and how can I fix it?

Thank you very much


Jun 27 18:48:05  kernel: INFO: task gfs2_quotad:19066 blocked for more than
120 seconds.
Jun 27 18:48:05  kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
Jun 27 18:48:05  kernel: gfs2_quotad   D 0000000000000004     0 19066      2
0x00000080
Jun 27 18:48:05  kernel: ffff880bb01e1c20 0000000000000046 0000000000000000
ffffffffa045ec6d
Jun 27 18:48:05  kernel: 0000000000000000 ffff880be6e2b000 ffff880bb01e1c50
00000001051d8b46
Jun 27 18:48:05  kernel: ffff880be4865af8 ffff880bb01e1fd8 000000000000f598
ffff880be4865af8
Jun 27 18:48:05  kernel: Call Trace:
Jun 27 18:48:05  kernel: [<ffffffffa045ec6d>] ? dlm_put_lockspace+0x1d/0x40
[dlm]
Jun 27 18:48:05  kernel: [<ffffffffa0525c50>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffffa0525c5e>]
gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff814db87f>] __wait_on_bit+0x5f/0x90
Jun 27 18:48:05  kernel: [<ffffffffa0525c50>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff814db928>]
out_of_line_wait_on_bit+0x78/0x90
Jun 27 18:48:05  kernel: [<ffffffff8108e140>] ? wake_bit_function+0x0/0x50
Jun 27 18:48:05  kernel: [<ffffffffa0526816>] gfs2_glock_wait+0x36/0x40
[gfs2]
Jun 27 18:48:05  kernel: [<ffffffffa0529011>] gfs2_glock_nq+0x191/0x370
[gfs2]
Jun 27 18:48:05  kernel: [<ffffffff8107a11b>] ?
try_to_del_timer_sync+0x7b/0xe0
Jun 27 18:48:05  kernel: [<ffffffffa05427f8>] gfs2_statfs_sync+0x58/0x1b0
[gfs2]
Jun 27 18:48:05  kernel: [<ffffffff814db52a>] ? schedule_timeout+0x19a/0x2e0
Jun 27 18:48:05  kernel: [<ffffffffa05427f0>] ? gfs2_statfs_sync+0x50/0x1b0
[gfs2]
Jun 27 18:48:05  kernel: [<ffffffffa053a787>] quotad_check_timeo+0x57/0xb0
[gfs2]
Jun 27 18:48:05  kernel: [<ffffffffa053aa14>] gfs2_quotad+0x234/0x2b0 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff8108e100>] ?
autoremove_wake_function+0x0/0x40
Jun 27 18:48:05  kernel: [<ffffffffa053a7e0>] ? gfs2_quotad+0x0/0x2b0 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff8108dd96>] kthread+0x96/0xa0
Jun 27 18:48:05  kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Jun 27 18:48:05  kernel: [<ffffffff8108dd00>] ? kthread+0x0/0xa0
Jun 27 18:48:05  kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

Jun 27 19:49:07  kernel: __ratelimit: 57 callbacks suppressed
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 20:00:58  kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140
bytes - shutting down socket
Jun 27 20:00:58  kernel: __ratelimit: 40 callbacks suppressed
qdiskd[10078]: qdisk cycle took more than 1 second to complete (1.170000)
qdisk cycle took more than 1 second to complete (1.120000)

Thanks
James S.

