
Re: [Cluster-devel] [GFS2 Patch] [Try 4] GFS2: Reduce file fragmentation



Hi,

On Fri, 2012-07-13 at 08:55 -0400, Bob Peterson wrote:
> ----- Original Message -----
> | > allows future block allocations to follow in line. This continuity
> | > may allow it to be somewhat faster than the previous version.
> | > Thanks,
> | > Steve!
> | > 
> | Yes, that would be interesting to know. I'll have a look in more
> | detail a bit later, but some comments follow....
> 
> Preliminary results are in: In the hour-long test I've been using,
> this improvement shaves about 4 minutes off (just under 7 percent).
> 
Sounds good.

> | > +	/* Tricky: The newly created inode needs a reservation so it can
> | > +	   allocate xattrs. At the same time, we don't want the directory
> | > +	   to retain its reservation, and here's why: With directories,
> | > +	   items are often created and deleted in the directory in the
> | > +	   same breath, which can create "holes" in the reservation. By
> | > +	   holes I mean that your next "claim" may not be the next free
> | > +	   block in the reservation. In other words, we could get into
> | > +	   situations where two or more blocks are reserved, then used,
> | > +	   then one or more of the earlier blocks is freed. When we delete
> | > +	   the reservation, the rs_free will be off due to the hole, so
> | > +	   the rgrp's rg_free count can get off. The solution is that we
> | > +	   transfer ownership of the reservation from the directory to
> | > +	   the new inode. */
> | 
> | This comment still doesn't make sense to me. What are these
> | operations that are freeing up blocks in the directory? There should
> | be no blocks freed in a directory unless we deallocate the entire
> | directory at the moment.
> 
> Again, the "holes" I'm talking about are in the reservation, not in the
> directory. Suppose you have an empty directory and "touch" files
> a,b,c,d and e. Suppose the directory gets a multi-block reservation of
> 8 blocks for those allocations. After the 5 files are created, the
> directory's reservation has rs_start=S, rs_len=8, and rs_free=3.
> The bitmap representing those dinodes, which corresponds to the
> reservation, looks something like this: 11 11 11 11 11 00 00 00.
> 
> Now suppose you delete file "b". The directory's blocks won't change,
> nor will its hash table. However, the dinode for "b" will be deleted
> and the corresponding bitmap for the dinodes will then look something like:
> 11 00 11 11 11 00 00 00. The corresponding reservation will have:
> rs_start=S, rs_len=8 and rs_free=4.
> 
This is where the problem lies.... If the reservation for the directory
is of length 8, then the dinode "b" has no business being anywhere
within those 8 blocks, since only directory blocks should be there. It
is a different inode and should have its own reservation and its own
allocation state. The directory's reservation should only be dealing
with blocks belonging to the directory.

The fact that the child inode is currently allocated using the parent
directory's allocation state is the bug I referred to in the last email.

Another problem is that we don't have a good way to predict how many
blocks a directory might grow to ahead of time, so it is a very tricky
task to create a sensibly sized reservation at directory creation time.
We don't want big gaps between the directory blocks and its contents if
the contents are just a few small files. If, on the other hand, we have a
large directory, then we want to keep the directory blocks together in
large chunks as much as possible, with the child inodes, together with
their indirect and data blocks, following (or at least not too far from)
the directory.

As the directory grows, we get more information about the likelihood of
further entries being added, so we can make larger reservations more
confidently. In fact, making a reservation based upon the current size
of the directory might not be a bad plan, since we have little else to
go on.
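
As a very rough sketch of that idea (the function name and the clamp
constants below are invented for illustration; nothing here is from the
patch):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical sizing heuristic: reserve in proportion to the number
 * of blocks the directory already occupies, clamped to sane bounds.
 * The name and the constants are invented for illustration. */
#define DIR_RSRV_MINBLKS 8U
#define DIR_RSRV_MAXBLKS 256U

static uint32_t dir_resv_size(uint32_t dir_blocks)
{
	uint32_t want = dir_blocks * 2;  /* grow with the directory */

	if (want < DIR_RSRV_MINBLKS)
		want = DIR_RSRV_MINBLKS;
	if (want > DIR_RSRV_MAXBLKS)
		want = DIR_RSRV_MAXBLKS;
	return want;
}

int main(void)
{
	printf("1 blk -> %u, 40 blks -> %u, 1000 blks -> %u\n",
	       dir_resv_size(1), dir_resv_size(40), dir_resv_size(1000));
	return 0;
}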

Also, I don't mind leaving directory reservations until later - just so
long as we don't regress in the meantime. Whatever is easiest, really, I
guess.

For the Orlov allocator, we need to be able to select a random rgrp when
creating a new directory if the parent has the T flag set. So we must
have a separate allocation state for the child directory in order to be
able to do that.
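
Something along these lines is all that is needed (purely illustrative:
the struct and names are made up, and a real implementation would walk
the filesystem's own rgrp list rather than an array):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustration of the Orlov-style spreading described above: when the
 * parent directory carries the T flag, a new subdirectory starts from
 * a randomly chosen rgrp instead of inheriting the parent's position. */
struct rgrp { unsigned long long start; };

static const struct rgrp *pick_rgrp(const struct rgrp *rgrps, int nr,
				    int parent_has_tflag, int parent_idx)
{
	if (parent_has_tflag)
		return &rgrps[rand() % nr];  /* spread directories out */
	return &rgrps[parent_idx];           /* stay near the parent */
}

int main(void)
{
	struct rgrp rgrps[4] = { {0}, {65536}, {131072}, {196608} };

	srand((unsigned)time(NULL));
	printf("child dir goes in rgrp starting at block %llu\n",
	       pick_rgrp(rgrps, 4, 1, 0)->start);
	return 0;
}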

> The problem is if you now create file f in that directory,
> it essentially "claims" the blocks at rs_start + rs_len - rs_free,
> but S + 8 - 4 = S + 4, and that block is already claimed by file "e".
> 
> The alternative, as I stated in an earlier email, is to make the
> starting block, S, a moving target, adjusting it with each allocation.
> In that case, block S + 1, which was freed when file b was deleted,
> will be "left behind" and add to the fragmentation.
> 
> We can't just keep marching rs_free forward because then we get into
> rgrp accounting problems if/when the reservation is freed and has
> unclaimed blocks that we need to return to the pool of blocks in the rgrp.
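
(To make the arithmetic concrete for anyone following the thread, here
is a minimal userspace model of the scenario as described: files a-e
are created, "b" is deleted, then "f" is created. The rs_start, rs_len
and rs_free names mirror the discussion; the bitmap and block numbers
are simplified stand-ins, not actual GFS2 code.)

#include <stdio.h>

#define S      1000ULL  /* arbitrary first block of the reservation */
#define RS_LEN 8U

struct resv {
	unsigned long long rs_start;
	unsigned int rs_len;
	unsigned int rs_free;
};

static int used[RS_LEN];  /* 1 = block claimed, 0 = free */

/* Claim the next block with the "rs_start + rs_len - rs_free"
 * formula quoted above. */
static unsigned long long claim(struct resv *rs)
{
	unsigned long long blk = rs->rs_start + rs->rs_len - rs->rs_free;

	used[blk - rs->rs_start] = 1;
	rs->rs_free--;
	return blk;
}

int main(void)
{
	struct resv rs = { S, RS_LEN, RS_LEN };
	unsigned long long b, e, f;

	claim(&rs);              /* a */
	b = claim(&rs);          /* b */
	claim(&rs);              /* c */
	claim(&rs);              /* d */
	e = claim(&rs);          /* e: rs_free is now 3 */

	used[b - rs.rs_start] = 0;  /* delete "b": punches a hole */
	rs.rs_free++;               /* rs_free is now 4, but the free
				       blocks are no longer contiguous */

	f = claim(&rs);          /* S + 8 - 4 = S + 4, which is "e"! */
	printf("e = block %llu, f = block %llu -> collision\n", e, f);
	return 0;
}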

The reason that this happened is because the search started from the
wrong place. When looking for blocks to make a reservation, the starting
point always needs to be based on the existing location of the inode's
blocks (if we are extending the file, then the last block in the file).

Using an algorithm which marches forward in the rgrp will fail as soon
as we have a filesystem which has been in use for some period of time,
since by then that pointer could be pointing almost anywhere.
When the next request for allocation is made, we ought to be trying to
do it as close as possible to the existing inode's blocks, and not using
this point in the rgrp.

Your algorithm will work very well while you have a filesystem that is
being filled up from new, but it is not optimal on a filesystem that has
lots of obstacles already in place, due to previous allocations, and
where the rgrp's current position pointer may be fairly randomly placed
within the rgrp.

That was why, in my previous comments, I mentioned the need to use
i_goal as the starting point rather than a pointer in the rgrp, since it
tends to keep the blocks for each inode together where possible. I know
that i_goal is not ideal (since it will not reflect whether the write is
extending the file or is at some random point inside it) but it is at
least in the right ballpark. It would be better still to calculate the
goal block depending on the requested file offset, but that kind of
thing can come later.
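
In other words, something like this (the struct and function names are
illustrative only, not the actual GFS2 API):

#include <stdint.h>
#include <stdio.h>

/* Sketch of the starting-point policy described above: begin the
 * search for a reservation at the inode's goal block, near its
 * existing blocks, rather than at a per-rgrp pointer that may be
 * almost anywhere after earlier allocations. */
struct my_inode { uint64_t i_goal; };
struct my_rgrp  { uint64_t last_alloc; };

static uint64_t resv_search_start(const struct my_inode *ip,
				  const struct my_rgrp *rgd)
{
	(void)rgd;          /* deliberately ignored: it may point anywhere */
	return ip->i_goal;  /* keeps each inode's blocks together */
}

int main(void)
{
	struct my_inode ip = { 5000 };
	struct my_rgrp rgd = { 123456 };

	printf("search from block %llu\n",
	       (unsigned long long)resv_search_start(&ip, &rgd));
	return 0;
}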

I suspect that you are worried about the situation of filling up a
directory with files and then having to search through the reservations
and/or allocations of previous files for each new file to find some
empty space. That can be a problem - there are various ways to resolve
it. We could take a leaf from OCFS2's book and keep an extent cache
(your rbtree is already halfway there) which would make it much quicker
to skip over the unusable space. Another alternative is to keep the
rgrp pointer which moves forward, but to be prepared to reset it to the
goal block of the current inode if there is some event which might have
created a hole in the rgrp since its last use (e.g. the glock has been
dropped on the rgrp or some deallocation has taken place). There may be
other solutions too.
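
The second alternative might look roughly like this (all names here are
hypothetical, not from the patch or the kernel):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Rough sketch: keep the forward-moving per-rgrp pointer for the
 * common fill-from-new case, but fall back to the inode's goal block
 * whenever something may have punched holes behind the pointer. */
struct rgrp_state {
	uint64_t cur;            /* forward-moving allocation pointer */
	bool glock_was_dropped;  /* glock released since last use? */
	bool freed_since;        /* any deallocation in this rgrp? */
};

static uint64_t choose_start(struct rgrp_state *rgd, uint64_t i_goal)
{
	if (rgd->glock_was_dropped || rgd->freed_since) {
		rgd->glock_was_dropped = false;
		rgd->freed_since = false;
		return i_goal;   /* holes may exist: restart near the inode */
	}
	return rgd->cur;         /* fast path: keep marching forward */
}

int main(void)
{
	struct rgrp_state rgd = { 9000, false, true };

	printf("start at block %llu\n",
	       (unsigned long long)choose_start(&rgd, 5000));
	return 0;
}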

Does that sound ok? I'd just like to try to ensure, as much as we
possibly can, that we maintain good performance even after the initial
fill of the filesystem.

Steve.


