
Re: The maximum number of files under a folder



The OS will have to search the directory to see if the file already exists before creating it.

Well, if you hash it so that it splits up into something like:
jobid(upper part)/jobid(lower part)[/-]timestamp-process,
you'll find that your access times will be much faster (especially if you don't use H-Trees). This also applies if you're just creating a file, because otherwise the entire directory has to be searched to see whether that filename already exists.
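To make the split concrete, here is a minimal sketch in Python. The function names, the 8-bit split, the hex directory names, and the example root are all my own assumptions for illustration; the file name keeps the full {job}-{timestamp}-{process} form so that job ids which share their low 16 bits can't collide.

```python
import os

def hashed_path(root, job_id, timestamp, process, bits=8):
    """Return root/<upper>/<lower>/<job>-<timestamp>-<process>.

    Hypothetical layout: the low 2*bits bits of job_id select two
    directory levels, so no single directory holds every file.
    """
    mask = (1 << bits) - 1
    upper = (job_id >> bits) & mask   # upper part of the job id
    lower = job_id & mask             # lower part of the job id
    name = f"{job_id}-{timestamp}-{process}"
    return os.path.join(root, f"{upper:02x}", f"{lower:02x}", name)

def create_output(root, job_id, timestamp, process):
    """Create the file, making the two hash directories on demand."""
    path = hashed_path(root, job_id, timestamp, process)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Mode "x" raises FileExistsError if the name is already taken;
    # the existence check only has to look at one small directory.
    with open(path, "x"):
        pass
    return path
```

The job can still look its file up directly, since the path is computed from the same three values it already knows.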

With regular directories, the time it takes to search them to see if a file already exists increases linearly with the number of entries. If you hash on 3 levels with 8 bits per level, you'll have to open 2 or 3 extra inodes, but you'll cut your directory search times down by a factor of roughly 20000 to 1 (= 2^24/256/3). You'll also skip having to deal with any sort of directory-size limit.
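The arithmetic behind that factor, as a small Python sketch. It assumes a worst-case linear scan and counts directory entries examined; the function name is just for illustration.

```python
def search_speedup(total_entries, levels, bits_per_level):
    """Ratio of entries scanned: one flat directory vs. a hashed tree."""
    entries_per_dir = 2 ** bits_per_level      # 256 with 8 bits
    flat_cost = total_entries                  # scan the whole flat directory
    hashed_cost = levels * entries_per_dir     # scan one small directory per level
    return flat_cost / hashed_cost

# 2^24 files, 3 levels of 8 bits each: 16777216 / (3 * 256)
print(round(search_speedup(2 ** 24, 3, 8)))    # 21845
```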

I did something similar on a Solaris box which had 200000 emails in the /var/spool/mqueue directory. That many messages were slowing the system to a crawl. I hashed it into 100 directories with 2000 entries each, and it sped things up enormously.
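For a generic flat directory, the same idea might look like the sketch below. This is not the procedure I used on the mail queue (a live queue also needs the mailer configured to know about its subdirectories); the CRC32 bucket choice and the function names are assumptions for illustration only.

```python
import os
import zlib

def bucket_for(name, buckets=100):
    """Pick a two-digit bucket directory name from a hash of the file name."""
    return f"{zlib.crc32(name.encode()) % buckets:02d}"

def spread(flat_dir, buckets=100):
    """Move every regular file in flat_dir into its bucket subdirectory."""
    for entry in os.listdir(flat_dir):
        src = os.path.join(flat_dir, entry)
        if not os.path.isfile(src):
            continue                       # leave existing subdirectories alone
        dst_dir = os.path.join(flat_dir, bucket_for(entry, buckets))
        os.makedirs(dst_dir, exist_ok=True)
        os.rename(src, os.path.join(dst_dir, entry))
```

With 200000 files and 100 buckets, each subdirectory ends up with about 2000 entries.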

On Tue, Mar 18, 2008 at 3:56 PM, Andreas Dilger <adilger sun com> wrote:
On Mar 17, 2008  09:32 -0400, Theodore Ts'o wrote:
> On Mon, Mar 17, 2008 at 03:40:36PM +0800, liuyue wrote:
> > Theodore Tso,
> >
> >     In 64bit system, directory size can not be bigger than 2GB?
>
> No, because the high 32-bits for i_size are overloaded to store the
> directory creation acl.

I think we should change the code (kernel and e2fsprogs) to allow
i_size_high for directories also.

> In practice, you really don't want to have a directory that huge
> anyway.  Iterating through it all with readdir() gets horribly slow,
> and applications that try to do anything with really huge directories
> would be well advised to use a database, because they will get *much*
> better performance that way....

Actually, for many HPC applications they never do readdir at all.
The job creates 1 file/process and always uses a predefined filename
like {job}-{timestamp}-{process} that it will directly look up.

Cheers, Andreas



--
Stephen Samuel http://www.bcgreen.com
778-861-7641