[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Cluster-devel] [GFS2 Patch] GFS2: Reduce file fragmentation


On Thu, 2012-07-05 at 17:52 -0400, Bob Peterson wrote:
> ----- Original Message -----
> | Hi,
> | 
> | Some comments...
> (snip)
> | > +	/* Make sure the reservation rgrp is in the rlist too, because
> | > +	   we need to free it. */
> | > +	if (gfs2_rs_active(ip))
> | > +		gfs2_rlist_add(ip, &rlist, ip->i_res->rs_start);
> | > +
> | I don't understand this... why do we need to add the reservation to
> | the
> | rlist here? It would be better to just drop it if we start to
> | deallocate
> | blocks since we'll probably want to start again in a different place
> | if
> | we then being allocating again.
> You're absolutely right. I had added this on an earlier prototype that
> had a slightly different take on claiming reserved blocks: Instead of
> claiming them one by one, it claimed them ALL, and if some blocks weren't
> used when the reservation was freed, the unused blocks were refunded.
> This favors apps that use all their reserved blocks, but it requires
> one more glock on the rgrp for the refund. I forgot to remove the glock,
> in the current implementation. The replacement patch removes this.
> | Why do we need to take an exclusive cluster lock here? Reservations
> | are
> | local things, so we should be able to drop them without taking
> | cluster
> | locks.
> You're right. I've fixed this in the latest patch.
> | I think it ought to protect more than just the reservation tree. We
> | probably should be using it to protect the rest of the rgrp fields
> | too,
> | so lets just call it rd_mutex.
> Renamed as per your suggestion.
> | > +	struct gfs2_inode *rs_ip;     /* pointer to the inode */
> | This doesn't hold a reference on the inode, so what prevents the
> | inode
> | from vanishing? It would be better to arrange to drop the
> | reservations
> | by scanning the inodes, rather than the rgrps, so that we don't need
> | to
> | store this pointer. Either that or find a way to allow the
> | reservations
> | to be dropped later when the inode is disposed of, and just mark them
> | invalid when disposing of the rgrps.
> Good idea. I replaced the inode pointer with the ip->i_no_addr.
> Also, it makes more sense to pass the reservation pointer around
> rather than the ip pointer in most cases anyway. I've adjusted the
> latest patch so that the i_no_addr is only saved in the reservation
> for reference: it helps a great deal in reading the new reservation
> trace points. The structure is a bit big, however, so we could just
> eliminate the field.
> | I'm not sure I understand where this goal block has come from... the
> | goal block should be ip->i_goal in the normal case.
> What I'm doing here is keeping track of where start looking for the
> next reservation, which is immediately after the most recent reservation.
> My theory is that it's more efficient to start after the last reservation
> than it is to start where the ip left off, because if there are lots
> of files open, and writes being done to many of them, there's no
> need to search the areas again that already been reserved by other writes.
Well I think there are two cases we need to consider here. The first one
is when the fs is being filled up, to start with. In that case, you are
right that it makes sense to do this. The other case though is when the
disk has been in use for some time. There maybe gaps in various places,
and we want to ensure that the closest gap to the existing end of the
inode is used (and even that assumes that we are extending an existing
inode and not write to random areas of a sparse file). So then we really
do want to start from the inode's goal block, assuming that has been set
to the last allocated block.

> | > +do_search:
> | > +	/* We multiply the size hint by the number of active
> | > reservations.
> | > +	   That makes heavily contended resource groups grab bigger
> | > +	   reservations so they don't bump into one another. */
> | This doesn't tell you whether the rgrp is contended or not. It is
> | possible to have multiple reservations in an rgrp which are all being
> | used locally by inodes which are filling up very slowly and not to
> | have
> | any contention at all on the rgrp.
> Yep. This comment was vestigial; left over from a different patch that
> attempted to do ext3-like reservation size doubling. It turned out that
> the scheme did more harm than good. We spent more time doing calculations
> and such than just doing the straight searches. I pulled the patch (at
> least temporarily) but I forgot to pull this comment. I removed it now.
> | > +		/* We have to keep the reservations aligned on u64 boundaries
> | > +		   otherwise we could get situations where a byte can't be
> | > +		   used because it's after a reservation, but a free bit still
> | > +		   is within the reservation's area. */
> | I'm not sure I understand. If there are free blocks after an existing
> | inode, then the reservation must start at that point and continue
> | until
> | it runs into something preventing it from extending further. Finding
> | aligned blocks is ok, but only if we've already exhausted the
> | possibility of adding on to an existing extent first.
> The comment is a bit misleading. In fact, it does what you're saying.
> The reservations are aligned to u64 boundaries and reservations are
> entirely used up (their blocks are "claimed"). When there are no more
> to claim, a new reservation is acquired, and it searches for an appropriate
> place for it immediately after the last reservation for that rgrp.
Again, if I'm understanding the issue correctly, this is related to the
new vs. old filesystem issue. With a new fs, and reasonably large sized
inodes, there is no problem to align the start of reservations. If we
have existing inodes though, we want to always try and start a
reservation immediately adjoining the existing blocks for that inode,
assuming no blockages preventing that. Once those contiguous blocks are
exhausted, then looking for aligned blocks is ok, but we must not do
that as a general rule.

So if we have a an inode, with some allocated blocks, ending on block X
which is part way though a u64 boundary (I assume in bitmap terms) but
for which there is no existing reservation, then we should create a
reservation which starts at block X+1 and continues until the required
extent has been reserved, or until some blockage prevents further
progress. Hopefully that makes it a bit clearer.

> | > +			nonzero = memchr_inv(ptr, 0, search_bytes);
> | > +			/* If the lot is all zeroes, reserve the whole size. If
> | > +			   there's enough zeroes to satisfy the request, use
> | > +			   what we can. If there's not enough, keep looking. */
> | Why not use gfs2_bitfit to search for the first free block and then
> | see
> | how long the extent is? That seems to make more sense than writing
> | something new from scratch, bearing in mind that we've already spent
> | a
> | lot of time optimising it.
> The reason is efficiency. Yes, gfs2_bitfit is well optimized,
> but "seeing how long the extent is" and looping if it's too small,
> is inefficient. memchr_inv is optimized and works pretty well afaict.
We need to loop over the extents anyway. I think bitfit should be used
to find the first block of a certain state, the looping over the
individual blocks would be a separate loop - so could be made more
efficient. I wonder if Linus "word at a time" code would be useful
here... it is actually quite similar to what we did in bitfit a little
while back.

The reason for doing this, is that if there are no extents of free
blocks large enough, we need to return the largest of the extents which
do exist, so we have to loop over them to discover which is the largest.

Obviously if we make that "largest extent" search approximate, then
thats ok if we can gain a significant increase in speed as a result.

I've not looked at the new patch yet, thats next on my list,


> | > +	gfs2_print_dbg(seq, "All reservations:\n");
> | This doesn't match the format for this file
> I hadn't thought about making the output abbreviated, but it makes
> good sense. Great idea. In my latest patch, I've reformatted the
> output to be like the existing file format.
> | > rr%lu",
> | Missing : after rr
> Good catch. Fixed in my latest. 
> | > +	TP_printk("%u,%u rg:%llu rf:%lu rr:%lu bmap %llu rs:%llu blks:%lu
> | > %s "
> | > +		  "c:%llu f:%lu",
> | This format is incorrect for a bmap tracepoint - it needs to match
> | the
> | common fields for the other bmap tracepoints.
> Again, good suggestion; I've implemented your suggestion in the latest.
> I'll post the replacement patch in an separate email.
> Regards,
> Bob Peterson
> Red Hat File Systems

[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]