
[Linux-cluster] gfs_grow experience



Yesterday evening we grew our 6-node 2.4 TB GFS 6.1 filesystem to 4.5 TB.

Here is our experience, which I hope others can benefit from.

Having grown the underlying LUN (on an EMC CX500) a couple of weeks ago, we were bitten by this parted bug: "Parted segfaults because of extended devices with GPT partition table"
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=194238
(for which we just got a hotfix from RH after extensive testing of test packages)

The partitioning and LVM all went smoothly. gfs_grow -T (test) showed nothing funny. When we started the real gfs_grow, things started out smoothly. About 7 to 10 minutes in, the GFS was withdrawn on 5 of the nodes (the only node that did not withdraw was the one on which gfs_grow was running):

Aug 23 00:58:10 host kernel: GFS: fsid=webmail:gfs_mail.4: jid=5: Trying to acquire journal lock...
Aug 23 00:58:10 host kernel: GFS: fsid=webmail:gfs_mail.4: jid=5: Busy
Aug 23 00:58:29 host kernel: attempt to access beyond end of device
...
Aug 23 00:58:29 host kernel: attempt to access beyond end of device
Aug 23 00:58:29 host kernel: dm-0: rw=0, want=7803155736, limit=5044961280
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: fatal: I/O error
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: block = 975394466
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: function = gfs_dreread
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: file = /usr/src/build/765788-x86_64/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/dio.c, line = 576
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: time = 1156287509
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: about to withdraw from the cluster
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: waiting for outstanding I/O
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: telling LM to withdraw
Aug 23 00:58:32 host kernel: lock_dlm: withdraw abandoned memory
Aug 23 00:58:32 host kernel: GFS: fsid=webmail:gfs_mail.4: withdrawn


Definitely not a nice message to see during something as suspenseful as a gfs_grow, which you cannot roll back and which, it seems, should not be interrupted and resumed. Even more so since the fs has to be mounted in order to be grown. While the GFS is being grown, I/O to the fs is blocked. The only way I could tell that gfs_grow was still busy doing something was to run strace on its PID.

After 16 very long minutes, the grow completed. The GFS on 2 of the nodes could be brought back with a simple 'service gfs restart'. The others had to be bounced. After everything had been up for 30 minutes, those 2 nodes also lost the FS with the same error message as above and had to be bounced.

When I disabled quotas (we were still in our maintenance window), I mistakenly ran the command 'gfs_tool settune /mnt/san quota_account 0' on more than one node, because the quota value was not updated quickly enough on the other nodes after I ran it on the first node. The FS was withdrawn again on 2 nodes, with this error:

Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: fatal: filesystem consistency error
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0:   function = trans_go_xmote_bh
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0:   file = /usr/src/build/765787-i686/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/glops.c, line = 542
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0:   time = 1156290932
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: about to withdraw from the cluster
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: waiting for outstanding I/O
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: telling LM to withdraw
Aug 23 01:55:37 host kernel: lock_dlm: withdraw abandoned memory
Aug 23 01:55:37 host kernel: GFS: fsid=webmail:gfs_mail.0: withdrawn
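
Since the withdraws happened on different nodes at different times, something like the following could help spot them quickly across collected syslogs. This is only a minimal sketch: the function name summarize_withdrawals and the regular expressions are my own, written against the message format shown in this post, and other GFS versions may log differently.

```python
import re

# Patterns based on the log lines quoted in this post (an assumption
# about the format; not taken from any GFS documentation).
FATAL_RE = re.compile(r"GFS: fsid=(\S+): fatal: (.+)$")
WITHDRAWN_RE = re.compile(r"GFS: fsid=(\S+): withdrawn\s*$")


def summarize_withdrawals(lines):
    """Map each fsid to its first fatal reason and whether it withdrew."""
    summary = {}
    for line in lines:
        line = line.rstrip()
        fatal = FATAL_RE.search(line)
        if fatal:
            entry = summary.setdefault(
                fatal.group(1), {"fatal": None, "withdrawn": False})
            if entry["fatal"] is None:
                entry["fatal"] = fatal.group(2)
            continue
        done = WITHDRAWN_RE.search(line)
        if done:
            entry = summary.setdefault(
                done.group(1), {"fatal": None, "withdrawn": False})
            entry["withdrawn"] = True
    return summary


# Sample input: lines taken from the two log excerpts in this post
sample = [
    "Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: fatal: I/O error",
    "Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: telling LM to withdraw",
    "Aug 23 00:58:32 host kernel: GFS: fsid=webmail:gfs_mail.4: withdrawn",
    "Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: fatal: filesystem consistency error",
    "Aug 23 01:55:37 host kernel: GFS: fsid=webmail:gfs_mail.0: withdrawn",
]

result = summarize_withdrawals(sample)
for fsid, info in sorted(result.items()):
    print(fsid, info)
```

In practice you would feed it the lines of /var/log/messages from each node rather than the sample list.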

After bouncing them, all seemed well. To Red Hat: would it make sense to log bugzillas for these withdraw scenarios? They look like bugs in gfs_grow and in gfs_tool settune/quota, unless the withdraw during gfs_grow works as intended, and even though the latter was probably PEBCAK / incorrect usage. I will not be able to replicate these easily, and we are fine now (hopefully) despite the hiccups, so I have no reason to open Service Requests. I am sure others might run into these as well.

greetings
Riaan
--
Riaan van Niekerk
Systems Architect
Obsidian Systems; Obsidian Red Hat Consulting
http://www.obsidian.co.za

