[Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end

Benjamin Marzinski bmarzins at redhat.com
Thu Sep 5 03:48:35 UTC 2013


On Wed, Sep 04, 2013 at 10:36:16PM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Wed, 2013-09-04 at 11:59 -0500, Benjamin Marzinski wrote:
> > On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> > > > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> > > > it actually increased the file size.  If gfs2_fsync was called without
> > > > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> > > > returning, so any metadata or journaled data changes were not getting
> > > > fsynced. This meant that writes to the middle of files were not always
> > > > getting fsynced properly.
> > > > 
> > > > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> > > > updated during a write. It also make gfs2_sync flush the incore log
> > > > if I_DIRTY_PAGES is set, and the file is using data journalling. This
> > > > will make sure that all incore logged data gets written to disk before
> > > > returning from a fsync.
> > > > 
> > > > Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> > > > ---
> > > >  fs/gfs2/aops.c |    9 +++++++--
> > > >  fs/gfs2/file.c |    4 +++-
> > > >  2 files changed, 10 insertions(+), 3 deletions(-)
> > > > 
> > > > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > ===================================================================
> > > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> > > > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
> > > >  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
> > > >  	unsigned int to = from + len;
> > > >  	int ret;
> > > > +	struct gfs2_trans *tr = current->journal_info;
> > > > +	BUG_ON(!tr);
> > > >  
> > > >  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
> > > >  
> > > > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
> > > >  		goto failed;
> > > >  	}
> > > >  
> > > > -	gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > -
> > > >  	if (gfs2_is_stuffed(ip))
> > > >  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
> > > >  
> > > > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
> > > >  		gfs2_page_add_databufs(ip, page, from, to);
> > > >  
> > > >  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > +	if (tr->tr_num_buf_new)
> > > > +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > > > +	else
> > > > +		gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > +
> > > I'm not sure I follow this bit... generic_write_end() will call
> > > mark_inode_dirty() if the file size has changed. So that case should be
> > > covered. That leaves only timestamps which I think should be
> > > I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
> > > conditional here on tr_num_buf_new ? Does the "else" actually occur in
> > > practice?
> > 
> > The idea is to only set I_DIRTY_DATASYNC if we've actually updated
> > metadata because of the write.
> 
> I think we ought to be updating the metadata for every write (although
> see case #3 below), since the timestamp should change each time. If the
> size of the file doesn't change then we don't need to flush the metadata
> on fdatasync, even though the timestamp has changed.
> 
> I think we have three cases:
> 
> 1. i_size changes: metadata must be flushed on fsync/fdatasync, so
> I_DIRTY_DATASYNC should be set (done in generic_write_end())
> 2. only timestamps change: metatdata must be flushed on fsync but not on
> fdatasync, so I_DIRTY_SYNC should be set (I assume perhaps the
> problematic case)
> 3. write is of 0 bytes: I think we are allowed to skip all metadata
> updates (need to check that we do this)

I wasn't counting timestamp changes, since like you mention, those use
I_DIRTY_SYNC, and are ignored on fdatasync. What I care about is writes
that cause block allocations.  The specific test in the bugzilla uses a
holey file.  The writes that aren't getting synced ARE causing an
allocation, but ARE NOT increasing i_size, since they are not at the end
of the file.  This is a fourth case, and it's the case where we were
falling down. I probably should have clarified that earlier.

> 
> >  Otherwise we flush the log for every
> > write, even if we aren't data journalling and haven't changed any
> > metadata.  If the write happened without adding any buffers to the
> > transaction (which happens if we are just writing over already allocated
> > space), then tr->tr_num_buf_new will be 0. This means we haven't
> > actually dirtied any metadata, and we don't need to flush it to the log
> > during the fsync.
> > 
> Does that happen for zero sized writes? Otherwise I'd be expecting us to
> be updating time stamps at least on each write.

In rewriting already allocated data, we will update the timestamp and
the inode will be added to the incore log, but we don't need to flush
that change in gfs2_fsync, since we are asking for fdatasync. I thought
the whole point of the seperation between I_DIRTY_SYNC and
I_DIRTY_DATASYNC was so that you could avoid syncing inodes where only
timestamps had changed. The change will still get to disk, but if we
happen to crash before that, the mtime on the file might be wrong. Isn't
that just the price you pay for doing a datasync instead of a full sync?

> > With instrumentation in the code, I can definitely see that with the
> > check for tr->tr_num_buf_new, I_DIRTY_DATASYNC isn't set if we are just
> > rewriting over already allocated space, and gfs2_fsync doesn't need to
> > do all the extra I_DIRTY_DATASYNC work, cutting down the performance
> > impact of this patch.
> > 
> > If you are doing data journalling, you obviously need to flush the
> > changed datablocks to the journal in gfs2_fsync.  Otherwise it won't
> > write them back to their inplace locations. This is true even if the
> > write didn't update metadata, and so I_DIRTY_DATASYNC isn't set.  That's
> > why I have gfs2_fsync check for I_DIRTY_PAGES if an inode is using data
> > journalling.
> > 
> > Does that make sense, or am I missing something?
> > 
> Yes, I certainly agree with the changes at the fsync end wrt jdata. That
> all makes sense to me.
> 
> > Now the part of my patch that I don't understand is the bit after the
> > "else".  See my questions at
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=991596#c19
> > 
> > I added the "else" to keep gfs2_write_end always adding the inode to the
> > transaction, but I don't know why that is necessary. If we aren't
> > changing any metadata, why do we need to add the inode to the
> > transaction at all.
> > 
> > -Ben
> > 
> Ah, I see now. I'd not spotted those questions before. I think the
> answer to question #1 most likely relates to time stamps, and I hope
> that I've more or less answered it above.

I thought that mtime didn't get updated if the last modification was
less than a second ago. And doesn't the code that sets mtime do it's own
mark_inode_dirty(), which should call gfs2_dirty_inode() and add the
inode there.  It seems redundant to do it here as well, especially if
there are times when it's not actually necessary.

> I suspect that the answer to question #2 is no. We call
> filemap_fdatawrite/wait to writeback the pages of the two address
> spaces, and flush the log, etc in order to ensure that all the metadata
> is in a suitable state to be written back. So the issue is just to make
> sure that everything gets written back and waited for, and I think we
> are doing that correctly,
> 
> Steve.
> 
> > > 
> > > Otherwise I think the patch looks good,
> > > 
> > > Steve.
> > > 
> > > >  
> > > >  	if (inode == sdp->sd_rindex) {
> > > >  		adjust_fs_space(inode);
> > > > Index: gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > > > ===================================================================
> > > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/file.c
> > > > +++ gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > > > @@ -650,7 +650,7 @@ static int gfs2_fsync(struct file *file,
> > > >  {
> > > >  	struct address_space *mapping = file->f_mapping;
> > > >  	struct inode *inode = mapping->host;
> > > > -	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > > > +	int sync_state = inode->i_state & I_DIRTY;
> > > >  	struct gfs2_inode *ip = GFS2_I(inode);
> > > >  	int ret = 0, ret1 = 0;
> > > >  
> > > > @@ -660,6 +660,8 @@ static int gfs2_fsync(struct file *file,
> > > >  			return ret1;
> > > >  	}
> > > >  
> > > > +	if (!gfs2_is_jdata(ip))
> > > > +		sync_state &= ~I_DIRTY_PAGES;
> > > >  	if (datasync)
> > > >  		sync_state &= ~I_DIRTY_SYNC;
> > > >  
> > > > 
> > > 
> 




More information about the Cluster-devel mailing list