optimising filesystem for many small files

Eric Sandeen sandeen at redhat.com
Sun Oct 18 15:34:19 UTC 2009


Viji V Nair wrote:
> On Sun, Oct 18, 2009 at 3:56 AM, Theodore Tso <tytso at mit.edu> wrote:
>> On Sat, Oct 17, 2009 at 11:26:04PM +0530, Viji V Nair wrote:
>>> These files are not in a single directory; this is a pyramid
>>> structure. There are 15 pyramids in total, and going from top to
>>> bottom the subdirectories and files are multiplied by a factor of 4.
>>>
>>> The I/O is scattered all over, and this is a single-disk file system.
>>>
>>> Since the Python application is creating the files, it is writing
>>> multiple files to multiple subdirectories at a time.
>> What is the application trying to do, at a high level?  Sometimes it's
>> not possible to optimize a filesystem against a badly designed
>> application.  :-(
> 
> The application is reading GIS data from a data source and
> plotting the map tiles (256x256 PNG images) for different zoom
> levels. The tree output of the first zoom level is as follows:
> 
> /tiles/00
> `-- 000
>     `-- 000
>         |-- 000
>         |   `-- 000
>         |       `-- 000
>         |           |-- 000.png
>         |           `-- 001.png
>         |-- 001
>         |   `-- 000
>         |       `-- 000
>         |           |-- 000.png
>         |           `-- 001.png
>         `-- 002
>             `-- 000
>                 `-- 000
>                     |-- 000.png
>                     `-- 001.png
> 
> In each zoom level the fourth-level directories are multiplied by a
> factor of four, and the number of PNG images is multiplied by the
> same factor.
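That factor-of-four growth compounds quickly. A quick sketch of the tile
counts it implies (the starting count and number of levels below are
illustrative assumptions; only the x4 multiplier per level comes from the
thread):

```python
# Tiles per zoom level under the factor-of-four growth described above.
# base_tiles and levels are illustrative assumptions; only the x4
# multiplier per zoom level comes from the thread.
def tiles_per_level(base_tiles, levels, factor=4):
    return [base_tiles * factor ** z for z in range(levels)]

counts = tiles_per_level(base_tiles=6, levels=5)
print(counts, sum(counts))
```

The total is dominated by the deepest zoom level, which is where the
small-file allocation behaviour matters most.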
>> It sounds like it is generating files distributed in subdirectories in
>> a completely random order.  How are the files going to be read
>> afterwards?  In the order they were created, or some other order
>> different from the order in which they were written?
> 
> The applications we are using are modified versions of mapnik and
> tilecache. These are single-threaded, so we are running 4 processes at
> a time; we can say only four images are being created at any point in
> time. Sometimes a single image takes around 20 seconds to create. I
> can see lots of system resources are free: memory, processors, etc.
> (these machines have 4G of RAM and 2 x Xeon 5420 CPUs).
> 
> I have checked for delay in the backend data source; it is on a 12Gbps
> LAN and there is no delay at all.

The delays are almost certainly due to the drive heads seeking like mad 
as they attempt to write data all over the disk; most filesystems are 
designed so that files in subdirectories are kept together, and new 
subdirectories are placed at relatively distant locations to make room 
for the files they will contain.

In the past I've seen similar applications also slow down due to new 
inode searching heuristics in the inode allocator, but that was on ext3 
and ext4 is significantly different in that regard...

> These images are also read in the same manner.
> 
>> With a sufficiently bad access patterns, there may not be a lot you
>> can do, other than (a) throw hardware at the problem, or (b) fix or
>> redesign the application to be more intelligent (if possible).
>>
>>                                                    - Ted
>>
> 
> The file system is created with "-i 1024 -b 1024" for a larger number
> of inodes; 50% of the images are less than 10KB. I have disabled
> access-time updates and set a large commit interval as well. Do you
> have any other recommendations for file system creation?
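Whether "-i 1024" actually left enough inode headroom can be verified at
runtime. A minimal Python sketch using os.statvfs (the function name and
the /tiles mount point are assumptions, not from the thread):

```python
import os

def inode_usage(mountpoint):
    """Return (used, total) inode counts for the filesystem at mountpoint."""
    st = os.statvfs(mountpoint)
    return st.f_files - st.f_ffree, st.f_files

# e.g. inode_usage("/tiles") on the tile filesystem; if used approaches
# total, file creation will fail with ENOSPC even with free blocks left.
```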

I think you'd do better to change, if possible, how the application behaves.

I probably don't know enough about the app, but rather than:

/tiles/00
`-- 000
     `-- 000
         |-- 000
         |   `-- 000
         |       `-- 000
         |           |-- 000.png
         |           `-- 001.png

could it do:

/tiles/00/000000000000000000.png
/tiles/00/000000000000000001.png

...

for example?  (or something similar)
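If the directory levels below the zoom level carry no meaning of their
own, that flattening could be sketched as below (the function name and
the choice to fuse everything under the zoom directory are assumptions
about the app, not something from the thread):

```python
def flatten_tile_path(nested):
    """Map a deeply nested tile path to a flat per-zoom-level name,
    e.g. /tiles/00/000/000/000/000/000/000.png
      -> /tiles/00/000000000000000000.png
    """
    parts = nested.strip("/").split("/")
    # keep /tiles/<zoom>/ as real directories, fuse the rest into one name
    return "/" + "/".join(parts[:2]) + "/" + "".join(parts[2:])
```

Keeping all tiles of one zoom level as siblings lets the filesystem
allocate them near each other, at the cost of one large directory, which
ext3/ext4 handle reasonably well with hashed directory indexes
(dir_index).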

-Eric

> Viji




More information about the Ext3-users mailing list