[Linux-cluster] Cluster Project FAQ - GFS tuning section

Wendell Dingus wendell at BISonline.com
Tue Jan 23 13:39:32 UTC 2007


I don't know where that breaking point is, but I believe _we've_ stepped
over it.

A 4-node RHEL3 / GFS 6.0 cluster with two 2TB filesystems (GULM, no LVM)
versus a 3-node RHEL4 (x86_64) / GFS 6.1 cluster with one 8TB+ filesystem
(DLM, LVM, and much faster hardware/disks).

This is a migration from the former to the latter, so quantity/size of
files/dirs is mostly identical. Files being transferred from customer
sites to the old servers never cause more than about 20% CPU load, and
that usually (quickly) falls to 1% or less after the initial transfer
begins. The new servers run up to 100% CPU, where they usually remain
until the transfer completes. The current thinking on the cause is the
same issue being discussed here.

I was pointed to this mailing list posting by a RH tech on an open
ticket I have for this problem.

The older server has 11 million and 13 million files respectively across
its two filesystems. The newer server has about 27 million files on its
one larger filesystem. It appears I'm going to be forced to blow this
away and create multiple smaller filesystems (thank goodness there is a
redundant backup drive with all the files). Unless I hear otherwise I'm
inclined to limit the new RAID to 2TB filesystems max. So as one
filesystem fills up faster than the others, I suppose we'll forever be
moving things from one to the other and symlinking it to death. Oh well...
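
In case anyone is curious, the re-carve I have in mind is roughly the
following sketch (the volume group, cluster, and filesystem names are
made up, and the journal count would match our actual node count):

    # carve the array into several ~2TB logical volumes instead of one big LV
    for i in 1 2 3 4; do
        lvcreate -L 2T -n gfs_lv$i vg_gfs
        gfs_mkfs -p lock_dlm -t mycluster:gfs$i -j 3 /dev/vg_gfs/gfs_lv$i
    done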


>Robert,
>
>> What version of the gfs_mkfs code were you running to get this?
>gfs_mkfs -V produced the following results:
>
>gfs_mkfs 6.1.6 (built May 9 2006 17:48:45)
>Copyright (C) Red Hat, Inc. 2004-2005 All rights reserved
>
>Thanks,
>Jon
>
>On 1/11/07, Robert Peterson <rpeterso@???> wrote:
>> Jon Erickson wrote:
>> > I have a couple of questions regarding the Cluster Project FAQ - GFS
>> > tuning section (http://sources.redhat.com/cluster/faq.html#gfs_tuning).
>> >
>> > First:
>> > - Use -r 2048 on gfs_mkfs and mkfs.gfs2 for large file systems.
>> > I noticed that when I used the -r 2048 switch while creating my file
>> > system it ended up creating the file system with a 256MB resource
>> > group size. When I omitted the -r flag the file system was created
>> > with a 2048MB resource group size. Is there a problem with the -r flag,
>> > and does gfs_mkfs dynamically come up with the best resource group
>> > size based on your file system size? Another thing I did which ended
>> > up causing a problem was executing the gfs_mkfs command while my
>> > current GFS file system was mounted. The command completed successfully
>> > but when I went into the mount point all the old files and directories
>> > still showed up. When I attempted to remove files, bad things
>> > happened... I believe I received an invalid metadata block error and
>> > the cluster went into an infinite loop trying to restart the service. I
>> > ended up fixing this problem by un-mounting my file system, re-creating
>> > the GFS file system, and re-mounting. This problem was caused by user
>> > error, but maybe there should be some sort of check that
>> > determines whether the file system is currently mounted.
>> >
>> > Second:
>> > - Break file systems up when huge numbers of files are involved.
>> > This FAQ states that there is a certain amount of overhead when dealing
>> > with lots (millions) of files. What is a recommended limit on the
>> > number of files in a file system? The theoretical limit of 8 exabytes
>> > for a file system does not seem at all realistic if you can't have
>> > millions of files in a file system.
>> >
>> > I'm just curious to see what everyone thinks about this. Thanks.
>> >
>> >
>> Hi Jon,
>>
>> The newer gfs_mkfs (gfs1) and mkfs.gfs2 (gfs2) in the CVS HEAD will
>> create the RG size based on the size of the file system if "-r" is not
>> specified,
>> so that would explain why it used 2048 in the case where you didn't
>> specify -r.
>> The previous versions just always assumed 256MB unless -r was specified.
>>
>> If you specified -r 2048 and it used 256 for its rg size, that would be
>> a bug.
>> What version of the gfs_mkfs code were you running to get this?
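
(Interjecting here: in case it helps anyone reproduce the -r behaviour,
this is roughly how I would check it on a scratch volume. The device,
cluster, and mount point names below are placeholders, and I'm going from
memory on exactly what gfs_tool df reports, so treat it as a sketch.)

    gfs_mkfs -V                                  # note which build is in use
    gfs_mkfs -p lock_dlm -t mycluster:rgtest -j 3 -r 2048 /dev/vg_gfs/rgtest_lv
    mount -t gfs /dev/vg_gfs/rgtest_lv /mnt/rgtest
    gfs_tool df /mnt/rgtest   # resource group count vs. fs size shows the RG size actually used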
>>
>> I agree that it would be very nice if all the userspace GFS-related
>> tools could make sure the file system is not mounted anywhere before
>> running. We even have a bugzilla from long ago about this regarding
>> gfs_fsck:
>>
>> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=156012
>>
>> It's easy enough to check if the local node (the one running mkfs or
>> fsck) has it mounted, but it's harder to figure out whether other nodes
>> do, because the userland tools can't assume access to the cluster
>> infrastructure like the kernel code can. So I guess we haven't thought
>> of an elegant solution to this yet; we almost need to query every node
>> and check its cman_tool services output to see if it is using resources
>> pertaining to the file system. But that would require some kind of
>> socket or connection (e.g. ssh), and what should it do when it can't
>> contact a node that's powered off, etc.?
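
(For what it's worth, the poor man's version of that check is what I now
do by hand before any mkfs or fsck: grep /proc/mounts locally and over
ssh on each node, and eyeball cman_tool services on each. The node names
below are placeholders, and it falls apart exactly where Bob says it
would, since a node that is down or unreachable can't answer.)

    # local check: any GFS filesystems still mounted on this node?
    grep gfs /proc/mounts

    # remote check: assumes working ssh to every cluster node
    for node in node1 node2 node3; do
        echo "=== $node ==="
        ssh $node "grep gfs /proc/mounts"
        ssh $node "cman_tool services"
    done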
>>
>> Regarding the number of files in a GFS file system: I don't have any
>> kind of recommendation, because I haven't studied the exact performance
>> impact based on the number of inodes. It would be cool if someone could
>> do some tests and see where the performance starts to degrade.
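
(If I can free up some scratch hardware I may try something crude along
these lines on a throwaway GFS mount: nothing scientific, just timing a
full tree walk as the inode count grows. The path and batch sizes are
arbitrary, and GNU time's -f flag is assumed to be available.)

    # create files in batches of 100,000 and time a full find after each batch
    cd /mnt/gfs_scratch
    for batch in $(seq 1 100); do
        mkdir dir$batch
        for i in $(seq 1 100000); do : > dir$batch/f$i; done
        echo -n "after $((batch * 100000)) files: "
        /usr/bin/time -f "%e sec" find /mnt/gfs_scratch -type f > /dev/null
    done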
>>
>> The cluster team at Red Hat can work toward improving the performance
>> of GFS (in fact, we are; hence the change to gfs_mkfs for the rg size),
>> but many of the performance issues are already addressed with GFS2,
>> and since GFS2 was accepted into the upstream Linux kernel, in a way
>> I think it makes more sense to focus more of our efforts there.
>>
>> One thing I thought about doing was trying to use btrees instead of
>> linked lists for some of our more critical resources, like the RGs and
>> the glocks. We'd have to figure out the impact of doing that; the
>> overhead of doing so might also affect performance. Just my $0.02.
>>
>> Regards,
>>
>> Bob Peterson
>> Red Hat Cluster Suite
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster@???
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>

>-- 
>Jon



