Re: Many small files, best practise.
- From: pg_ext3 ext3 for sabi co UK (Peter Grandi)
- To: Andreas Dilger <adilger sun com>
- Cc: Linux ext3 users <ext3-users redhat com>
- Subject: Re: Many small files, best practise.
- Date: Mon, 21 Sep 2009 16:37:25 +0100
[ ... whether datasets like 1G records for a total of 7TB should be
stored as one-record-per-file in a filesystem or as a database ... ]
>>> When you deal with systems that store millions of files,
>> Millions of files may work; but 1 billion is an utter absurdity.
>> A filesystem that can store reasonably 1 billion small files in
>> 7TB is an unsolved research issue...
> I'd disagree. We have Lustre filesystems with 500M files on
> the ext4(ish) metadata server, and these are only 4TB. Note
> there is NO DATA in the metadata files, so it isn't quite like
> a normal filesystem.
That is possible, but to me seems quite unreasonable. How long
does that take to RSYNC, for example? To just back up? What about
doing a 'find'? These are mad things.
This is the special case of an MDS as you mention, but it is still
Just like many other similar choices (e.g. 19+1 RAID5 arrays), it
works (not so awesomely) as long as it works, and when it breaks it
is very bad.
I like the Lustre idea, and to me it is currently the best of a not
very enthusing lot, but the MDT is by far the weakest bit, and the
``lots of tiny files'' idea is one of the big deals.
In particular the size of MDTs is a significant scalability issue
with Lustre, which was designed in older gentler times for
purposes to which metadata scalability might not have been so
essential. Like most good ideas it has been scaled up beyond
expectations (UNIX-style), and perhaps it is reaching the end
of its useful range.
Fortunately sensible Lustre people keep frequent and wholesome
MDS backups, and restoring a backup, and even a 500M 800B file
backup/restore is hopefully much faster than an 'fsck' if there
> It also depends on what you mean by "small files". We've
> previously discussed storing small file data in an extended
> attribute, and if you are tuning for this and the file size is
> small enough (3kB or less) the file data could be stored
> inside the inode (i.e. zero seek data IO).
If I were to use a filesystem as a makeshift database I would
indeed use one of those filesystems that store small files or file
tails in the metadata, as I wrote:
>> And for cases where a filesystem still makes sense I would
>> rather use, instead of the inane manylevel directory
>> structure above, a file system design with proper tree
>> indexes and perhaps even one with the ability to store
>> small files into inodes.
You might consider storing Lustre MDTs on Reiser3 instead of
But this is backwards; the database guys have spent the past
several decades working on the ``lots of small records reliably''
problem (and with "bushy" indices), and the main work by the file
system guys has been solving the ``massive massively parallel
files'' one. To the point that people like Reiser who did work
(with database like techniques) on the small files problems for
filesystems have been at best ignored.
[ ... ]
> I think you aren't backing your comments with any facts.
You may think that -- but that's only because you think wrong,
as you haven't read my comments or you want to misrepresent
I made at the very start a clear example of a case with 1M small
files engendering a difference between more than 15 hours vs. 6
minutes for just creation.
For amusement I just reran it in a nicer form on a somewhat faster
  base$ rm /fs/jugen/tmp/manysmall.db
  base$ time perl manymake.pl /fs/jugen/tmp/manysmall.db 1000000 50 100
  1 percent done, 990000 to go
  2 percent done, 980000 to go
  3 percent done, 970000 to go
  98 percent done, 20000 to go
  99 percent done, 10000 to go
  100 percent done, 0 to go
  base$ ls -ld /fs/jugen/tmp/manysmall.db
  -rw------- 1 pcg pcg 98197504 Sep 21 16:19 /fs/jugen/tmp/manysmall.db
That's 1M records in about 100MB in less than a minute, or 20K records/s,
for around 1.5MB/s, which is fairly typical for random access to a
fairly standard 1TB consumer drive in its latter half.
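The scripts themselves are not shown here, so what follows is only a
minimal sketch of the same idea: many small random-length records
loaded into one database file instead of one file each. It is written
in Python with SQLite as a stand-in; the function name, the table
layout, and the reading of the three arguments as "count, minimum
length, maximum length" are all assumptions, not 'manymake.pl'.

```python
import os
import random
import sqlite3
import tempfile


def make_records(db_path, count, min_len, max_len, seed=0):
    """Store `count` random-length records in one SQLite file.

    Hypothetical stand-in for the load step; returns total payload bytes.
    """
    rng = random.Random(seed)  # seeds only the record lengths
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE records (key INTEGER PRIMARY KEY, value BLOB)")
    total = 0
    with con:  # a single transaction for the whole load
        for key in range(count):
            value = os.urandom(rng.randint(min_len, max_len))
            con.execute("INSERT INTO records VALUES (?, ?)", (key, value))
            total += len(value)
    con.close()
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "manysmall.db")
        payload = make_records(path, 10000, 50, 100)
        print(payload, "payload bytes in a file of", os.path.getsize(path))
```

Doing the whole load in one transaction matters: committing per record
would force a sync per record and hide the point being made.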
  base$ sudo sysctl vm.drop_caches=1
  vm.drop_caches = 1
  base$ time perl manyseek.pl /fs/jugen/tmp/manysmall.db 1000000 10000
  1 percent done, 9900 to go
  2 percent done, 9800 to go
  3 percent done, 9700 to go
  98 percent done, 200 to go
  99 percent done, 10000 to go
  100 percent done, 0 to go
  average length: 69.3816
Seeking of course is not awesome, and we get 10K records in 2m, or
around 80 records/s. Ah well. I need an SSD :-).
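For those who want to try the random-read side themselves, here is a
rough sketch in Python, not 'manyseek.pl' itself: records are written
length-prefixed into one flat file, their offsets are kept in memory,
and a number of randomly chosen records are read back with seek().
The file layout and both function names are made up for illustration.

```python
import os
import random
import struct
import tempfile


def write_flat(path, records):
    """Write length-prefixed records to one file; return each record's offset."""
    offsets = []
    with open(path, "wb") as f:
        for rec in records:
            offsets.append(f.tell())
            f.write(struct.pack("<I", len(rec)))  # 4-byte little-endian length
            f.write(rec)
    return offsets


def random_reads(path, offsets, lookups, seed=0):
    """Read `lookups` randomly chosen records; return their average length."""
    rng = random.Random(seed)
    total = 0
    with open(path, "rb") as f:
        for _ in range(lookups):
            f.seek(rng.choice(offsets))
            (length,) = struct.unpack("<I", f.read(4))
            total += len(f.read(length))
    return total / lookups


if __name__ == "__main__":
    rng = random.Random(1)
    recs = [os.urandom(rng.randint(50, 100)) for _ in range(10000)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "flat.dat")
        offs = write_flat(path, recs)
        print("average length:", random_reads(path, offs, 1000))
```

On a freshly written file the page cache will serve every read, so the
timing only means something after dropping caches as in the transcript.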
And as to the 'fsck', I confess that I had a list of cases in
mind but was waiting for the usual worn out dodgy technique of
quoting undamaged filesystem times:
> The e2fsck time on our MDS filesystems with 500M IN USE inodes
> is on the order of 4 hours (disk-based RAID-1+0 array). If
> this was on a RAID-1+0 SSD it could be noticeably faster. Ric
> also commented previously about single-digit hours for e2fsck
> on a test 1B file ext4 filesystem.
That is a classic "benchmark" -- undamaged filesystem 'fsck'
tests, like the other favourite, freshly loaded filesystem
benchmarks, are just dodgy marketing tools.
And even so! 1 hour per TB, or 1h per 100M files. To me keeping
what may be a production filesystem with 500M files unavailable
for 4 hours because one occasionally has to run 'fsck' (even if
in fact there is no damage) with an upside risk of weeks or
months sounds not such a good idea. But who knows.
There have been reports, which are sadly familiar to those who
work as sysadms, of single digit TB filesystems taking weeks to
months to repair, if damaged. The difference of course is
between scanning the metadata and crawling it.
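A back-of-the-envelope check of that difference, using only numbers
already quoted in this message: 500M inodes checked in about 4 hours,
and the roughly 80 records/s from the seek test above. The crawl
figure assumes, pessimistically, one random IO per inode at the
single-drive rate; a RAID-1+0 array would do better, so read it as an
order of magnitude, not a prediction. The function names are made up.

```python
def scan_rate(files, hours):
    """Inodes checked per second when the check takes `hours` in total."""
    return files / (hours * 3600)


def crawl_days(files, ios_per_second):
    """Days needed if every inode costs one random IO (an assumption)."""
    return files / ios_per_second / 86400


FILES = 500_000_000        # in-use inodes on the quoted MDS
FSCK_HOURS = 4             # quoted undamaged e2fsck time
RANDOM_IOS_PER_SECOND = 80 # measured in the seek test, single drive

if __name__ == "__main__":
    print("scan :", round(scan_rate(FILES, FSCK_HOURS)), "inodes/s")
    print("crawl:", round(crawl_days(FILES, RANDOM_IOS_PER_SECOND), 1), "days")
```

That is about 35K inodes/s for the scan against about 72 days for the
crawl, a factor of several hundred between the two rates.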
Which is of course perfectly obvious, as RAIDs allow for
parallelizing of read/write but not easily for scanning and less so
for crawls. Scaling 'fsck' is not easy, is an unsolved research
problem, even if things like Lustre help somewhat (minus the MDTs
Now that I feel a bit preachy, I'll mention some wider concepts (mostly
from the database guys) that should fit well in this discussion:
* A "database" is defined as something including a dataset whose
working set does not fit in memory (it thrashes -- every access
involves at least one IO). There are several types of databases,
structured/unstructured, factual/textual/...; a filesystem is a
kind of database, as that definition applies. But to me and
several decades of practice and theory it is a database of
record _containers_ (as suggested by the very word "file"), not
of records. It is exceptionally hard to do a DBMS that handles
equally well records and record containers.
* A "very large database" is a database that cannot be practically
backed up (or checked) offline, as backup (or check) takes too
long wrt requirements. Many filesystems are moving into the
"very large database" category (can your customers accept that it
might take 4 hours or 4 weeks to check, and 4 days to restore,
their filesystem?). Storing small records (or small containers
even) in a filesystem makes it much more likely that it becomes a
"very large database", and while the technology for "very large
databases" DBMSes is mature, that for "very large database" file
system designs is not there or at least not as mature, even if
the fun guys at Sun have been trying lately with ZFS.
* These are not novel or little known concepts and experiences. 'ar'
files have been around for a long time, for some good reason.
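To make the container idea concrete, here is a small sketch that packs
many tiny records into one archive file and reads a single one back.
Python's standard library has no 'ar' module, so 'tarfile' is swapped
in as the container format; the function names and record names are
made up for illustration.

```python
import io
import os
import tarfile
import tempfile


def pack(archive_path, records):
    """Write each (name, bytes) record as a member of one tar archive."""
    with tarfile.open(archive_path, "w") as tar:
        for name, data in records.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def read_member(archive_path, name):
    """Return the bytes of one named record from the archive."""
    with tarfile.open(archive_path, "r") as tar:
        return tar.extractfile(name).read()


if __name__ == "__main__":
    records = {"rec%05d" % i: b"x" * (50 + i % 51) for i in range(1000)}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.tar")
        pack(path, records)
        print(len(read_member(path, "rec00042")), "bytes read back")
        print(os.path.getsize(path), "bytes in one container file")
```

The filesystem then sees one file to back up, check and copy instead
of a thousand. Note that tar pads every member to 512-byte blocks, so
for records this small the container format itself is worth choosing
with care.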