[Linux-cluster] GFS2 performance on large files

Gordan Bobic gordan at bobich.net
Thu Apr 23 09:41:20 UTC 2009


On Thu, 23 Apr 2009 06:45:49 +0100, Andy Wallace <andy at linxit.eclipse.co.uk> wrote:
> On Wed, 2009-04-22 at 20:37 -0300, Flavio Junior wrote:
>> On Wed, Apr 22, 2009 at 8:11 PM, Andy Wallace <andy at linxit.eclipse.co.uk> wrote:
>> 
>> > Although it's not as quick as I'd like, I'm getting about 150MB/s on
>> > average when reading/writing files in the 100MB - 1GB range. However,
>> > if I try to write a 10GB file, this goes down to about 50MB/s. That's
>> > just doing dd to the mounted gfs2 on an individual node. If I do a get
>> > from an ftp client, I'm seeing half that; cp from an NFS mount is more
>> > like 1/5.
>> >
>> 
>> Have you tried the same thing with another filesystem? Ext3, maybe?
>> You are using RAID, right? Did you check the RAID and LVM/partition
>> alignment?
>> 
>> If you try ext3, look at the -E stride and -E stripe_width values in
>> the mkfs.ext3 manpage.
>> This calculator should help: http://busybox.net/~aldot/mkfs_stride.html
>> 
> 
> Yes, I have (by the way, do you know how long ext3 takes to create a 6TB
> filesystem???).

It depends on what you consider to be "long". The last box I built had
8x 1TB 7200rpm disks in software md RAID6 = 6TB usable, DAS, a consumer
grade motherboard, and 2 of the 8 ports were massively bottlenecked by
a 32-bit 33MHz PCI SATA controller, but all 8 ports had NCQ. It took
about 10-20 minutes (I didn't time it exactly; about a cup of coffee
long ;)) to mkfs ext3 when the parameters were properly set up. With
default settings it was taking around 10x longer, maybe even more.

My finding is that the default settings, and the old wisdom often taken
as axiomatically true, are actually completely off the mark.

Here is the line of reasoning that I have found leads to the best results.

A RAID block (chunk) size of 64-256KB is way, way too big. It will kill
the performance of small IOs without yielding a noticeable increase in
performance for large IOs, and sometimes it in fact hurts large IOs, too.

To pick the optimum RAID block size, look at the disks. What is the
multi-sector transfer size they can handle? I have not seen any disks
to date with this figure at anything other than 16, and
16 sectors * 512 bytes/sector = 8KB.
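
If you want to check what your drives report, something like this
should show it (a quick sketch; /dev/sda is just a placeholder for
one of your array members):

  # Report the R/W multiple sector transfer size the drive advertises
  hdparm -I /dev/sda | grep -i "multiple sector"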

So set the RAID block size to 8KB.
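
With Linux software RAID that means creating the array with an 8KB
chunk, roughly along these lines (a sketch only - the md device and
member disks are placeholders for the 8-disk RAID6 example):

  # 8-disk RAID6 with an 8KB chunk (mdadm takes --chunk in KB)
  mdadm --create /dev/md0 --level=6 --raid-devices=8 --chunk=8 /dev/sd[a-h]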

Make sure your stride parameter is set so that
ext3 block size (usually 4KB) * stride = RAID block size;
in this case ext3 block size = 4KB, stride = 2, RAID block = 8KB.
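
That part of the mkfs.ext3 invocation would look something like this
(sketch only; /dev/md0 is a placeholder):

  # stride = RAID block (chunk) / ext3 block size = 8KB / 4KB = 2
  mkfs.ext3 -b 4096 -E stride=2 /dev/md0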

So far so good, but we're not done yet. The last thing to consider
is the extent / block group size. The beginning of each block group
contains a superblock for that group. It is the top of that group's
inode tree, and needs to be consulted to find any file/block in that group.
That means the beginning block of a block group is a major hot-spot
for I/O, as it has to be read for every read and written for every
write to that group. This, in turn, means that for anything like
reasonable performance you need to have the block group beginnings
distributed evenly across all the disks in your RAID array, or else
you'll hammer one disk while the others are sitting idle.

For example, the default for ext3 is 32768 blocks in a block group.
IIRC, on ext3, the adjustment can only be made in increments of 8
blocks (32KB assuming 4KB blocks). In GFS, IIRC, the minimum
adjustment increment is 1MB(!) (not a FS limitation but a mkfs.gfs
limitation, I'm told).

The optimum number of blocks in a group will depend on the RAID
level and the number of disks in the array, but you can simplify
it into a RAID0 equivalent for the purpose of this exercise,
e.g. 8 disks in RAID6 can be considered to be 6 disks in RAID0.
Ideally, you want the block group size to align to the stripe
width +/- 1 stride width, so that the block group beginnings
rotate among the disks (upward for +1 stride, downward for
-1 stride; both achieve the same purpose).

The stripe width in the case described is 8KB * 6 disks = 48KB.
So you want the block group size to align to a multiple of
8KB * 7 disks = 56KB (i.e. stripe width + 1 chunk). But be careful
here - you should aim for a number that is a multiple of 56KB but
not a multiple of 48KB, because if they line up you haven't
achieved anything and you're back where you started!

56KB is fourteen 4KB blocks. Without getting involved in a major
factoring exercise, 28,000 blocks sounds good (the ext3 default is
32,768, which is in a reasonable ballpark). 28,000 * 4KB is a
multiple of 56KB but not of 48KB, so it looks like a good choice
in this example.
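
A quick way to sanity-check a candidate group size is plain shell
arithmetic (nothing filesystem-specific here):

  # 28,000 blocks * 4KB = 112,000KB; we want a multiple of 56KB but not 48KB
  echo $(( 28000 * 4 % 56 ))   # 0  -> a multiple of 56KB, good
  echo $(( 28000 * 4 % 48 ))   # 16 -> not a multiple of 48KB, good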

Obviously, you'll have to work out the optimal numbers for your
specific RAID configuration; the above example (with the resulting
mkfs invocation sketched after the list) is for:
disk multi-sector = 16
ext3 block size = 4KB
RAID block size = 8KB
ext3 stride = 2
RAID = 6
disks = 8
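
Putting it all together, the mkfs invocation for this particular
example would be roughly the following (a sketch, not gospel -
/dev/md0 is a placeholder, and -g sets the blocks-per-group figure
worked out above):

  # 4KB blocks, stride 2 (8KB chunk), stripe_width 12 (2 * 6 data disks),
  # 28,000 blocks per block group (multiple of 56KB, not of 48KB)
  mkfs.ext3 -b 4096 -E stride=2,stripe_width=12 -g 28000 /dev/md0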

This is one of the key reasons why I think LVM is evil. It abstracts
things away and encourages no forward thinking. Adding a new volume
is the same as adding a new disk to a software RAID to stretch it:
it'll upset the block group size calculation and in one fell swoop
take away the advantage of load balancing across all the disks you
have. By doing this you can cripple performance on some operations
from scaling linearly with the number of disks to being bogged down
by the performance of just one disk.

This can make a massive difference to the IOPS figures you get out
of a storage system, but I suspect that enterprise storage vendors
are much happier being able to sell more (and more expensive)
equipment when the performance gets crippled through misconfiguration,
or even just lack of consideration of parameters such as the above.
This is also why, quite frequently, a cheap box made of COTS components
can completely blow away a similar enterprise-grade box with 10-100x
the price tag.

> I've aligned the RAID and LVM stripes using various different values,
> and found slight improvements in performance as a result. My main
> problem is that when the file size hits a certain point, performance
> degrades alarmingly. For example, on NFS moving a 100MB file is about 20%
> slower than direct access; with a 5GB file it's 80% slower (and the
> direct access itself is 50% slower).
> 
> As I said before, I'll be working with 20G-170G files, so I really have
> to find a way around this!

Have you tried increasing the resource group sizes? IIRC the default is
256MB (-r parameter to mkfs.gfs), which may well have a considerable
impact when you are shifting huge files around. Try upping it,
potentially by a large factor, and see how that affects your large file
performance.
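
Something along these lines, as a rough sketch (the cluster name,
filesystem name, journal count and device are all placeholders, and
bear in mind that re-running mkfs wipes the filesystem):

  # 2048MB resource groups instead of the 256MB default
  mkfs.gfs2 -p lock_dlm -t mycluster:bigfs -j 3 -r 2048 /dev/vg_san/lv_gfs

  # Then re-test a large sequential write on one node, e.g. a 20GB file:
  dd if=/dev/zero of=/mnt/gfs/bigtest bs=1M count=20480 oflag=direct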

Note - "resource group" in GFS is the same concept as the "block group"
described above for ext3 in the performance optimization example. However,
while ext3 is adjustable in multiples of 8 blocks (32KB), GFS is only
adjustable in increments of 1MB.

HTH

Gordan



