Re: [Linux-cluster] managing GFS corruption on large FS
- From: Robert Peterson <rpeterso redhat com>
- To: linux clustering <linux-cluster redhat com>
- Subject: Re: [Linux-cluster] managing GFS corruption on large FS
- Date: Wed, 29 Nov 2006 10:04:47 -0600
Riaan van Niekerk wrote:
We have a large GFS consisting of 4 TB of maildir data. We have a
corruption on this GFS which causes nodes to be withdrawn intermittently.
The fs corruption was caused by user error and a lack of
documentation (initially not having the clustered flag enabled on the
VG when growing the LV/GFS). We now know better, and will avoid this
particular cause of corruption. However, management wants to know from
us how we can prevent corruption, or minimize the downtime incurred if
this should happen again.
For the current problem, since a gfs_fsck will take too long (we
cannot afford the 1 - 3 days of downtime it will take to complete the
fsck), we are planning to migrate the data to a new GFS, and at the
same time set up the new environment optimally to cause the minimum of
downtime, if a corruption were to happen again.
One option is to split the one big GFS into a number of smaller GFS's.
Unfortunately, our environment does not lend itself to being split up
in (for example) a number of 200GB GFS's. Also, this negates a lot of
the advantages of GFS (e.g. having your storage consolidated onto one
big GFS, and scaling it out by growing the GFS and adding nodes).
I would really like to know how others on this list manage the
threat/risk of FS corruption, and the corruption itself, if it does
happen. Also, w.r.t. data protection, if you do snapshots, SAN-based
mirroring, backup to disk/tape, I would really appreciate it if you
could give me detailed information such as:
a) mechanism (e.g snaps, backup, etc)
b) type of data (e.g. many small files)
c) size of GFS
d) the time it takes to perform the action
You've raised a good question, and I thought I'd address some of your
concerns. I'm just throwing these out in no particular order.
Running gfs_fsck is understandably slow, but there are a few things to bear
in mind:
1. A 4TB file system is not excessive by any means. As I stated in the
FAQ, a customer reported running gfs_fsck on a 45TB file system and it only
took 48 hours, and that was slower than it should have been because it ran
out of memory and started swapping to disk. Your 4TB file system should
take a lot less time since it's about a tenth of the size. That depends,
of course, on hardware issues as well. See:
2. I've recently figured out a couple of ways to improve the speed
of gfs_fsck. For example, for a recent bugzilla, I patched a memory leak
and combined passes through the file system inside the duplicate-block
code, pass1b. For a list of improvements, see this bugzilla, especially
I think this should be available in RHEL4 U5.
3. gfs_fsck takes a lot of memory to run, and when it runs out of memory,
it will start swapping to disk, and that will slow it down considerably.
So be sure to run it on a system with lots of memory.
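One way to see how much RAM and swap a node has before starting a long gfs_fsck run is to read /proc/meminfo. The following is a minimal Python sketch, not part of gfs_fsck: the helper names are made up for this illustration, and the threshold is something you would have to choose yourself, since no specific memory requirement is stated here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kilobytes}.

    Only lines that carry a 'kB' unit are kept; plain counters such as
    HugePages_Total are skipped because they are not sizes.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            info[name.strip()] = int(parts[0])
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the memory summary of the local node (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


def has_enough_memory(info, wanted_kb):
    """True if MemTotal meets a caller-chosen threshold (hypothetical policy)."""
    return info.get("MemTotal", 0) >= wanted_kb
```

On a Linux node, `read_meminfo()["MemTotal"] // 1024` gives the installed RAM in MB, and comparing the `SwapTotal` entry against it gives a rough idea of how much of the run could end up in swap.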
4. We're continuing to improve the gfs_fsck code all the time.
Jon Brassow and I have done some brainstorming and hope to keep
making it faster. I've come up with some more memory saving ideas
that might make it faster, but I have yet to try them out. Maybe soon.
5. Another thing that slows down gfs_fsck is running it in verbose mode.
Sometimes it's useful to have the verbose mode, but it will slow you
down considerably. Don't use -v or -vv unless you have to.
If you're only using -v to figure out where fsck is in the process,
I've made a couple of improvements: in the most recent version of gfs_fsck
(for the bugzilla above) I've added more "% complete" messages. Also, if
you interrupt that version by hitting <ctrl-c>, it will tell you what
it's currently working on and allow you to continue. Again, I think this
should be in RHEL4 U5.
6. I recently discovered an issue that impacts GFS performance for large
file systems, not only for gfs_fsck but for general performance as well.
The issue has to do with the size of the GFS resource groups (RGs), an
internal GFS structure for managing the data, not to be confused with
rgmanager's Resource Groups.
Some file system slowdown can be blamed on having a large number
of RGs. The bigger your file system, the more RGs you need. By default,
gfs_mkfs carves your file system into 256MB RGs, but it allows you to
specify a preferred RG size. The default, 256MB, is good for average
size file systems, but you can increase performance on a bigger file
system by using a bigger RG size. For example, my 40TB file system
requires approximately 156438 RGs of 256MB each. Whenever GFS
has to walk its linked list of RGs, it takes a long time. The same 40TB
can be created with bigger RGs--2048MB--requiring only 19555 of them.
The time savings is dramatic: It took nearly 23 minutes for my system
to read in all 156438 RG Structures (with 256MB RGs), but only 4
minutes to read in the 19555 RG Structures for my 2048MB RGs.
The time to do an operation like df on an empty file system dropped from
24 seconds with 256MB RGs, to under a second with 2048MB RGs.
I'm sure that increasing the size of the RGs would help gfs_fsck's
performance as well. I can't make any performance promises; I can only
tell you what I observed in this one case. The issue is documented
I'm going to try to see if I can get a KnowledgeBase article written up
about this, by the way, and I'll try to put something into the FAQ too.
For RHEL5, I'm changing gfs_mkfs so that it picks a more intelligent
RG size based on the file system size, to let users take advantage of the
performance benefit without ever knowing or caring about the RG size.
Unfortunately, there's no way to change the RG size once a file system
has been made. The RG size can only be set at gfs_mkfs time.
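To make the RG arithmetic above concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the RG count is approximated as file system size divided by RG size (ignoring journals and other overhead), the function names are invented for this example, and the device and lock table names in the usage note are hypothetical. The -r option (preferred RG size in megabytes) is taken from the gfs_mkfs man page; check the man page on your release before relying on it.

```python
def rg_count(fs_size_mb, rg_size_mb):
    """Approximate number of resource groups, rounded up.

    Assumption: the whole file system is divided into equal RGs;
    journal space and other overhead are ignored.
    """
    return (fs_size_mb + rg_size_mb - 1) // rg_size_mb


def seconds_per_rg(total_seconds, count):
    """Average time spent per RG for a measured total."""
    return total_seconds / count


def gfs_mkfs_command(device, lock_table, journals, rg_size_mb,
                     lock_proto="lock_dlm"):
    """Build (but do not run) a gfs_mkfs command line with an RG size."""
    return ["gfs_mkfs",
            "-p", lock_proto,       # locking protocol
            "-t", lock_table,       # cluster_name:fs_name
            "-j", str(journals),    # one journal per node
            "-r", str(rg_size_mb),  # preferred RG size in MB
            device]


# Figures quoted above: 156438 RGs of 256MB each.
fs_size_mb = 156438 * 256
small = rg_count(fs_size_mb, 256)    # 156438
large = rg_count(fs_size_mb, 2048)   # 19555

# Measured read-in times from above: about 23 minutes vs. 4 minutes.
per_rg_small = seconds_per_rg(23 * 60, small)
per_rg_large = seconds_per_rg(4 * 60, large)
```

With the figures quoted above, the same file system comes out at 19555 RGs of 2048MB, matching the observed count. A call such as `gfs_mkfs_command("/dev/vg_mail/lv_mail", "mailcluster:mailfs", 4, 2048)` only returns the argument list; the device, cluster, and file system names there are placeholders.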
7. As for file system corruption, that's a tough issue. First of all,
it's rare. In virtually all the cases I've seen, it was caused by
influences outside of GFS itself, like the case you mentioned: (1) someone
swapping a
hard drive that resided in the middle of a GFS logical volume, (2)
running gfs_fsck while the volume was still mounted by a node, or (3)
someone messing with the SAN from a machine outside of GFS.
If there are other ways to cause GFS file corruption, we need the users
to open bugzillas up so we can work on the problem, and even so, it's
nearly impossible to tell how corruption occurs unless it can be
recreated here in our lab.
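Cause (2) above, running gfs_fsck on a volume that is still mounted, can at least be guarded against on the local node. The following Python sketch is an illustration only, with made-up helper names: it parses /proc/mounts-style text and reports whether a device path appears there. It says nothing about other nodes in the cluster, which have to be checked separately, and it compares device paths literally, so an alias such as /dev/mapper/vg-lv versus /dev/vg/lv would not be recognized.

```python
def mounted_devices(mounts_text):
    """Return {device: (mount_point, fs_type)} from /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            result[fields[0]] = (fields[1], fields[2])
    return result


def is_mounted_locally(device, mounts_text):
    """True if the exact device path is listed as mounted on this node.

    Limitation: other cluster nodes and device aliases are not checked.
    """
    return device in mounted_devices(mounts_text)


def read_mounts(path="/proc/mounts"):
    """Read the local mount table (Linux only)."""
    with open(path) as f:
        return f.read()
```

A wrapper script could call `is_mounted_locally(device, read_mounts())` on every node and refuse to start gfs_fsck if any of them returns True.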
I'm going to continue to search for ways to improve the performance of
GFS and gfs_fsck because you're right: the needs of our users are growing,
and people are using bigger and bigger file systems all the time.
Red Hat Cluster Suite