
Re: [Cluster-devel] [GFS2 Patch] [Try 4] GFS2: Reduce file fragmentation



Hi,

On Thu, 2012-07-12 at 15:28 -0400, Bob Peterson wrote:
> ----- Original Message -----
> | I've found a workload that seems to do something a bit odd...
> | 
> | 
> | [root chywoon /]# for i in `seq 0 9`; do dd if=/dev/zero of=file.${i} bs=4096 count=1; done
> | 
> | (i.e. create 10 single block files, sequentially)
> | 
> | Then I looked at the allocation pattern using stat & filefrag. We get
> | 10 inodes, at positions: 33150, 33151, 33152, 33153, 33154, 33155,
> | 33156, 33157, 33158, 33159 (so sequential with each other)
> | 
> | Then let's look at where the data lands up: 33302, 33462, 33622,
> | 33782, 33942, 34102, 34262, 34422, 34582, 34742 (respectively for
> | each inode)
> | 
> | What I'd hope to see is that the data blocks are being allocated
> | contiguously with the inodes, at least in most cases. The data blocks
> | appear to be spaced out with gaps of 160 blocks between them. This was
> | on a newly created filesystem, btw, so there should be nothing to get
> | in the way of the allocations.
> | 
> | Given that directories don't have reservations, and the reservations
> | relating to the individual inodes that are being created should be
> | cleared on close (i.e. before the subsequent creation), this should
> | not have changed in behaviour from the current kernels,
> | 
> | Steve.
> 
> Hi,
> 
> This was not unexpected behavior, but I see the merit in making the
> data allocations follow the dinode when it's created.
> 
I have to say it worries me when you say that results like that are not
unexpected from a patch which is designed to reduce fragmentation. We've
got to be very careful here not to make things worse in such cases,
otherwise we are not making progress.

> This fourth version of the patch addresses the problem by allowing
> directories to have block reservations, like files, but when a file is
> created, the reservation is moved to the new gfs2_inode. That avoids
> "holes" in the directory's reservation, and allows future block
> allocations to follow in line. This continuity may allow it to be
> somewhat faster than the previous version. Thanks, Steve!
> 
Yes, that would be interesting to know. I'll have a look in more detail
a bit later, but some comments follow....

[snip]

> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 2b035e0..d4e6882 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -521,6 +521,9 @@ static int make_dinode(struct gfs2_inode *dip, struct gfs2_glock *gl,
>  	int error;
>  
>  	munge_mode_uid_gid(dip, &mode, &uid, &gid);
> +	error = gfs2_rindex_update(sdp);
> +	if (error)
> +		return error;
>  
>  	error = gfs2_quota_lock(dip, uid, gid);
>  	if (error)
> @@ -551,6 +554,10 @@ static int link_dinode(struct gfs2_inode *dip, const struct qstr *name,
>  	struct buffer_head *dibh;
>  	int error;
>  
> +	error = gfs2_rindex_update(sdp);
> +	if (error)
> +		return error;
> +
>  	error = gfs2_quota_lock(dip, NO_QUOTA_CHANGE, NO_QUOTA_CHANGE);
>  	if (error)
>  		goto fail;
> @@ -596,7 +603,8 @@ fail_end_trans:
>  	gfs2_trans_end(sdp);
>  
>  fail_ipreserv:
> -	gfs2_inplace_release(dip);
> +	if (alloc_required)
> +		gfs2_inplace_release(dip);
>  
>  fail_quota_locks:
>  	gfs2_quota_unlock(dip);
> @@ -647,7 +655,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
>  	const struct qstr *name = &dentry->d_name;
>  	struct gfs2_holder ghs[2];
>  	struct inode *inode = NULL;
> -	struct gfs2_inode *dip = GFS2_I(dir);
> +	struct gfs2_inode *dip = GFS2_I(dir), *ip;
>  	struct gfs2_sbd *sdp = GFS2_SB(&dip->i_inode);
>  	struct gfs2_inum_host inum = { .no_addr = 0, .no_formal_ino = 0 };
>  	int error;
> @@ -694,24 +702,35 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
>  	if (IS_ERR(inode))
>  		goto fail_gunlock2;
>  
> -	error = gfs2_inode_refresh(GFS2_I(inode));
> +	ip = GFS2_I(inode);
> +	error = gfs2_inode_refresh(ip);
>  	if (error)
>  		goto fail_gunlock2;
>  
> -	/* the new inode needs a reservation so it can allocate xattrs. */
> -	error = gfs2_rs_alloc(GFS2_I(inode));
> -	if (error)
> -		goto fail_gunlock2;
> +	/* Tricky: The newly created inode needs a reservation so it can
> +	   allocate xattrs. At the same time, we don't want the directory
> +	   to retain its reservation, and here's why: With directories, items
> +	   are often created and deleted in the directory in the same breath,
> +	   which can create "holes" in the reservation. By holes I mean that
> +	   your next "claim" may not be the next free block in the reservation.
> +	   In other words, we could get into situations where two or more
> +	   blocks are reserved, then used, then one or more of the earlier
> +	   blocks is freed. When we delete the reservation, the rs_free
> +	   will be off due to the hole, so the rgrp's rg_free count can get
> +	   off. The solution is that we transfer ownership of the reservation
> +	   from the directory to the new inode. */
> +	ip->i_res = dip->i_res;
> +	dip->i_res = NULL;

This comment still doesn't make sense to me. What are these operations
that are freeing up blocks in the directory? At the moment there should
be no blocks freed in a directory unless we deallocate the entire
directory.

This is really just a bug. We need to create the new inodes earlier so
that we can use their reservations when allocating them "on disk". Then
the directory and new inode reservations can be entirely separate.
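
To make that concrete, here is a minimal sketch (not actual GFS2 code)
of the ordering being suggested. Only gfs2_rs_alloc() is taken from the
patch above; the other helpers are hypothetical names invented for
illustration. The point is that the new inode obtains its own
reservation before any of its blocks are allocated, so the directory's
reservation never has to be shared or handed over:

/* Sketch only: build the inode in memory first, give it its own
   reservation, then allocate its blocks on disk. The directory's
   reservation (dip->i_res) is used solely for directory blocks. */
static int create_inode_sketch(struct gfs2_inode *dip,
			       struct gfs2_inode *ip)
{
	int error;

	/* The new inode gets its own, separate reservation */
	error = gfs2_rs_alloc(ip);
	if (error)
		return error;

	/* Hypothetical helper: allocates the dinode and any data or
	   xattr blocks on disk using ip's own reservation */
	error = alloc_inode_on_disk(ip);
	if (error)
		return error;

	/* Hypothetical helper: adds the directory entry, drawing any
	   blocks it needs from dip's reservation only */
	return link_inode_into_dir(dip, ip);
}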

The current problem is that we are using the directory's allocation
state to allocate new inodes, which is wrong and causes other problems
too. It has been on my list to fix, but it is complicated - we will have
to address it at some stage in order to make the Orlov allocator work
though,

Steve.


