[Linux-cluster] Ext3/ext4 in a clustered environment
Alan Brown
ajb2 at mssl.ucl.ac.uk
Wed Nov 9 16:00:39 UTC 2011
Steven Whitehouse wrote:
>> We see appreciable knee points in GFS directory performance at 512, 4096
>> and 16384 files/directory, with progressively worse performance
>> deterioration between each knee pair. (It's a 2^n type problem)
>>
> That is a bit strange. The GFS2 directory entries are sized according to
> (length of file name + length of fixed size info) which means that
> generally the number of blocks required to store a specific number of
> files is not constant unless the file names are all the same length.
Generally the file names are all the same length, as are the file sizes.
> Also, once a directory has been unstuffed, the hash table will grow
> until it is 128k in size, which is 16k pointers. So with 16384 directory
> entries, you should be a long way from having a full hash table, since
> each leaf block should contain around 80 entries (again depending on
> filename length), so that's not too far off 1m entries.
Should be, but performance becomes unusable long before that happens.
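
As a rough check of the figures quoted above, here is a minimal sketch in
Python that uses only the numbers from the quoted text (128k table, 16k
pointers, around 80 entries per leaf). It assumes the best case, in which
every hash table pointer refers to its own completely full leaf block; the
variable and function names are illustrative only.

```python
# Back-of-the-envelope capacity of a GFS2 directory hash table,
# using only the figures quoted in this thread.
HASH_TABLE_BYTES = 128 * 1024  # "128k in size"
POINTERS = 16 * 1024           # "16k pointers"
ENTRIES_PER_LEAF = 80          # "around 80 entries" (depends on filename length)

# Pointer size implied by the two quoted figures.
bytes_per_pointer = HASH_TABLE_BYTES // POINTERS

# Assumption: one distinct, completely full leaf block per pointer.
# This is an upper bound, not a prediction of real-world behaviour.
max_entries = POINTERS * ENTRIES_PER_LEAF


def fill_fraction(n_files):
    """Fraction of the best-case capacity used by n_files entries."""
    return n_files / max_entries


if __name__ == "__main__":
    print(f"bytes per pointer: {bytes_per_pointer}")
    print(f"best-case entries: {max_entries}")
    # The knee points reported earlier in the thread.
    for n in (512, 4096, 16384):
        print(f"{n:6d} files -> {fill_fraction(n):.2%} of best-case capacity")
```

On these assumptions the best-case capacity is 1,310,720 entries, so the
largest reported knee point (16384 files) sits at a little over 1% of it.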
> So for all unstuffed directories with fewer than about 1m entries, I'd
> expect to see all accesses resulting in the following I/O pattern:
> 1. Look up hash table block
> 2. Look up dir leaf block
> 3. Look up inode (if this is a ->lookup rather than readdir)
>
> What test are you using to generate the performance figures in this
> case?
"ls -l", which is what the clients use as they import data for
number-crunching work. Rsync uses a raw directory read, but its stat()
calls on individual files are pretty similar.

Once the information is cached, accessing the directory is fast until
the cache expires (3-10 minutes).

There is definite and very measurable performance degradation as more
files are added to a directory. Even for something as simple as an
incremental backup, the number of files opened per second falls away
rapidly as directories get larger.
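
For anyone wanting to reproduce this kind of measurement, the following is
a minimal sketch in Python, not the test actually used here. It fills a
scratch directory with files of equal name length and equal size, then
approximates "ls -l" by reading the directory and calling lstat() on every
entry. The function names, the 12-character file names, the 64-byte file
size and the directory sizes in the example run are all assumptions chosen
for illustration. Note that the listing runs straight after the files are
created, so it mostly measures the cached case; to see uncached behaviour,
the listing would have to run after the cache has expired or from another
node.

```python
import os
import tempfile
import time


def populate(directory, count, name_width=12, size=64):
    """Create `count` files with equal-length names and equal sizes."""
    payload = b"\0" * size
    for i in range(count):
        path = os.path.join(directory, f"{i:0{name_width}d}")
        with open(path, "wb") as fh:
            fh.write(payload)


def list_long(directory):
    """Approximate `ls -l`: read the directory, then lstat() each entry."""
    seen = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            os.lstat(entry.path)
            seen += 1
    return seen


def measure(counts, base=None):
    """Return (files created, entries seen, entries/second) per count.

    `base` is the parent directory for the scratch directories, for
    example a mount point on the filesystem under test (hypothetical).
    """
    results = []
    for count in counts:
        with tempfile.TemporaryDirectory(dir=base) as directory:
            populate(directory, count)
            start = time.perf_counter()
            seen = list_long(directory)
            elapsed = time.perf_counter() - start
            rate = seen / elapsed if elapsed > 0 else float("inf")
            results.append((count, seen, rate))
    return results


if __name__ == "__main__":
    # Directory sizes at the knee points mentioned earlier in the thread.
    for count, seen, rate in measure([512, 4096, 16384]):
        print(f"{count:6d} files: {rate:12.0f} entries/second")
```

Passing a directory on the filesystem under test as `base` keeps the
scratch files off the local temporary directory.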