
Re: [Linux-cluster] GFS2 performance on large files / mkfs tuning

On Thu, Apr 23, 2009 at 10:41:20AM +0100, Gordan Bobic wrote:
> On Thu, 23 Apr 2009 06:45:49 +0100, Andy Wallace <andy linxit eclipse co uk> wrote:
> > On Wed, 2009-04-22 at 20:37 -0300, Flavio Junior wrote:
> >> On Wed, Apr 22, 2009 at 8:11 PM, Andy Wallace <andy linxit eclipse co uk> wrote:
> >> 
> >> > Although it's not as quick as I'd like, I'm getting about 150MB/s on
> >> > average when reading/writing files in the 100MB - 1GB range. However, if
> >> > I try to write a 10GB file, this goes down to about 50MB/s. That's just
> >> > doing dd to the mounted gfs2 on an individual node. If I do a get from
> >> > an ftp client, I'm seeing half that; cp from an NFS mount is more like
> >> > 1/5.
> >> >
> >> 
> >> Have you tried the same thing with another filesystem? Ext3, maybe?
> >> You are using RAID, right? Did you check the RAID and LVM/partition
> >> alignment?
> >> 
> >> If you try ext3, see the -E stride and -E stripe_width options
> >> in the mkfs.ext3 manpage.
> >> This calculator should help: http://busybox.net/~aldot/mkfs_stride.html
> >> 

Oh, nice tool. I've always done the calculation by hand. Then again, the math
is pretty simple.
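
For anyone who wants the arithmetic spelled out, here is a minimal sketch in
Python. It assumes that stride is the RAID chunk size divided by the
filesystem block size, and that stripe width is the stride times the number
of data-bearing disks, which is how the mkfs.ext3 manpage describes the two
values. The function name and the example array (64KB chunks, 4 data disks)
are made up for illustration.

```python
def ext3_raid_params(chunk_kb, block_kb, data_disks):
    """Return (stride, stripe_width), both counted in filesystem blocks."""
    if chunk_kb % block_kb != 0:
        raise ValueError("RAID chunk size must be a multiple of the block size")
    stride = chunk_kb // block_kb        # blocks written to one disk before moving on
    stripe_width = stride * data_disks   # blocks in one full stripe over the data disks
    return stride, stripe_width


# Hypothetical array: 64KB chunks, 4KB blocks, 4 data-bearing disks.
stride, stripe_width = ext3_raid_params(64, 4, 4)
print(f"stride={stride} stripe_width={stripe_width}")
```

The two numbers are what you would then pass as the stride and stripe_width
extended options.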

> > 
> > Yes, I have (by the way, do you know how long ext3 takes to create a 6TB
> > filesystem???).
> It depends on what you consider to be "long". The last box I built had
> 8x 1TB 7200rpm disks in software md RAID6 = 6 TB usable, DAS, consumer
> grade motherboard, and 2 of the 8 ports were massively bottlenecked by
> a 32-bit 33MHz PCI SATA controller, but all 8 ports had NCQ. This took
> about 10-20 minutes (I haven't timed it exactly, about a cup of coffee
> long ;)) to mkfs ext3 when the parameters were properly set up. With
> default settings, it was taking around 10x longer, maybe even more.

Interesting information. Which settings did you tweak for a faster mkfs? Just
the things mentioned below?

> My findings are that the default settings and old wisdom often taken
> as axiomatically true are actually completely off the mark.
> Here is the line of reasoning that I have found to lead to best results.
> RAID block size of 64-256KB is way, way too big. It will kill
> the performance of small IOs without yielding a noticeable increase
> in performance for large IOs, and sometimes in fact hurting large
> IOs, too.
> To pick the optimum RAID block size, look at the disks. What is the
> multi-sector transfer size they can handle? I have not seen any disks
> to date that have this figure at anything other than 16, and
> 16sectors * 512 bytes/sector = 8KB.

Hmm, how can you determine the multi-sector transfer size of a disk?

> So set the RAID block size to 8KB.
> Make sure your stride parameter is set so that
> ext3 block size (usually 4KB) * stride = RAID block size;
> in this case ext3 block size = 4KB, stride = 2, RAID block size = 8KB.
> So far so good, but we're not done yet. The last thing to consider
> is the extent / block group size. The beginning of each block group
> contains a superblock for that group. It is the top of that inode
> tree, and needs to be checked to find any file/block in that group.
> That means the beginning block of a block group is a major hot-spot
> for I/O, as it has to be read for every read and written for every
> write to that group. This, in turn, means that for anything like
> reasonable performance you need to have the block group beginnings
> distributed evenly across all the disks in your RAID array, or else
> you'll hammer one disk while the others are sitting idle.
> For example, the default for ext3 is 32768 blocks in a block group.
> IIRC, on ext3, the adjustment can only be made in increments of 8
> blocks (32KB assuming 4KB blocks). In GFS, IIRC, the minimum
> adjustment increment is 1MB(!) (not a FS limitation but a mkfs.gfs
> limitation, I'm told).
> The optimum number of blocks in a group will depend on the RAID
> level and the number of disks in the array, but you can simplify
> it into a RAID0 equivalent for the purpose of this exercise
> e.g. 8 disks in RAID6 can be considered to be 6 disks in RAID0.
> Ideally you want the block group size to align to the stripe
> width +/- 1 stride width so that the block group beginnings
> rotate among the disks (upward for +1 stride, downward for
> -1 stride, both will achieve the same purpose).
> The stripe width in the case described is 8KB * 6 disks = 48KB.
> So, you want the block group size to be a multiple of
> 8KB * 7 disks = 56KB. But be careful here: the number must
> not also be a multiple of 48KB, because if the two line up,
> you haven't achieved anything and you're back where you
> started!
> 56KB is 14 4KB blocks. Without getting involved in a major
> factoring exercise, 28,000 blocks sounds good (default is
> 32768 for ext3, which is in a reasonable ball park).
> 28,000*4KB is a multiple of 56KB but not 48KB, so it looks
> like a good choice in this example.
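
The check described above is easy to script. A minimal sketch, assuming the
figures from this example (4KB blocks, 8KB RAID block, 6 data disks,
adjustment in steps of 8 blocks). The helper name fits_rule is hypothetical,
not anything from e2fsprogs.

```python
def fits_rule(blocks, block_kb=4, chunk_kb=8, data_disks=6, step=8):
    """True if a block group of `blocks` blocks satisfies the rule above."""
    size_kb = blocks * block_kb
    stripe_kb = chunk_kb * data_disks         # 48KB: one full stripe
    target_kb = chunk_kb * (data_disks + 1)   # 56KB: stripe width + 1 stride
    return (blocks % step == 0                # mkfs granularity of 8 blocks
            and size_kb % target_kb == 0      # a multiple of 56KB ...
            and size_kb % stripe_kb != 0)     # ... but not of 48KB


print(fits_rule(28000))   # the value picked above
print(fits_rule(32768))   # the ext3 default

# Largest three candidates at or below the default, for comparison.
print([b for b in range(32768, 0, -8) if fits_rule(b)][:3])
```

Changing the keyword arguments lets you rerun the same search for a
different RAID layout.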
> Obviously, you'll have to work out the optimal numbers for
> your specific RAID configuration. The above example is for:
> disk multi-sector transfer = 16 sectors
> ext3 block size = 4KB
> RAID block size = 8KB
> ext3 stride = 2
> RAID level = 6
> disks = 8
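
It can also be worth checking which disk each block group start actually
lands on. The sketch below uses the simplified RAID0 view from above: 6 data
disks, chunks handed out round-robin, filesystem starting at offset 0. It
ignores RAID6 parity rotation and any partition or LVM offset, so treat it
as a rough model only. The function name and the 28,008-block comparison
value are made up for illustration.

```python
def start_disks(blocks_per_group, groups, block_kb=4, chunk_kb=8, data_disks=6):
    """Index of the data disk holding the first block of each block group."""
    disks = []
    for g in range(groups):
        start_kb = g * blocks_per_group * block_kb   # offset of the group start
        chunk = start_kb // chunk_kb                 # RAID chunk containing it
        disks.append(chunk % data_disks)             # chunks go round-robin over disks
    return disks


print(start_disks(28000, 6))   # the example value from above
print(start_disks(28008, 3))   # a size that is an exact multiple of the 48KB stripe
```

In this model the 28,000-block groups start on disks 0, 2 and 4 in turn,
while the stripe-aligned 28,008 puts every group start on disk 0. Run the
same check on whatever number you pick for your own array.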
> This is one of the key reasons why I think LVM is evil. It abstracts
> things away and discourages forward thinking. Adding a new volume is
> the same as adding a new disk to a software RAID to stretch it.
> It'll upset the block group size calculation and in one fell swoop
> take away the advantage of load balancing across all the disks you
> have. By doing this you can cripple the performance on some
> operations from scaling linearly with the number of disks to being
> bogged down to the performance of just one disk.
> This can make a massive difference to IOPS figures you get out
> of a storage system, but I suspect that enterprise storage vendors
> are much happier being able to sell more (and more expensive)
> equipment when the performance gets crippled through misconfiguration,
> or even just lack of consideration of parameters such as the above.
> This is also why quite frequently a cheap box made of COTS components
> can completely blow away a similar enterprise grade box with 10-100x
> the price tag.
> > I've aligned the RAID and LVM stripes using various different values,
> > and found slight improvements in performance as a result. My main
> > problem is that when the file size hits a certain point, performance
> > degrades alarmingly. For example, on NFS moving a 100M file is about 20%
> > slower than direct access, with a 5GB file it's 80% slower (and the
> > direct access itself is 50% slower).
> > 
> > As I said before, I'll be working with 20G-170G files, so I really have
> > to find a way around this!
> Have you tried increasing the resource group sizes? IIRC the default is
> 256MB (-r parameter to mkfs.gfs), which may well have a considerable
> impact when you are shifting huge files around. Try upping it,
> potentially by a large factor, and see how that affects your large file
> performance.
> Note - "resource group" in GFS is the same as the "block group" described
> above for ext3 in the performance optimization example. However, while
> ext3 is adjustable in multiples of 8 blocks (32KB), gfs is only adjustable
> in increments of 1MB.
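
If the same 56KB-but-not-48KB rule is carried over to GFS resource groups
(an assumption based on the note above that they play the same role as ext3
block groups), the 1MB granularity narrows the choice considerably. A minimal
sketch that lists whole-MB sizes satisfying the rule for the example array;
the function name and the 256-300MB search range are illustrative only.

```python
def rg_sizes_mb(lo_mb, hi_mb, chunk_kb=8, data_disks=6):
    """Whole-MB resource group sizes that are a multiple of
    (data_disks + 1) chunks but not of a full stripe."""
    stripe_kb = chunk_kb * data_disks         # 48KB
    target_kb = chunk_kb * (data_disks + 1)   # 56KB
    return [mb for mb in range(lo_mb, hi_mb + 1)
            if (mb * 1024) % target_kb == 0 and (mb * 1024) % stripe_kb != 0]


# Sizes just above the 256MB default mentioned above.
print(rg_sizes_mb(256, 300))
```

Any value from the list would be passed via the -r parameter. Whether it
actually helps with large files is something to measure, not assume.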

Very good information you have here.

Thanks for posting it.

-- Pasi
