[Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
Steven Whitehouse
swhiteho at redhat.com
Thu Sep 5 08:28:09 UTC 2013
Hi,
On Wed, 2013-09-04 at 22:48 -0500, Benjamin Marzinski wrote:
> On Wed, Sep 04, 2013 at 10:36:16PM +0100, Steven Whitehouse wrote:
> > Hi,
> >
> > On Wed, 2013-09-04 at 11:59 -0500, Benjamin Marzinski wrote:
> > > On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote:
> > > > Hi,
> > > >
> > > > On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> > > > > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> > > > > it actually increased the file size. If gfs2_fsync was called without
> > > > > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> > > > > returning, so any metadata or journaled data changes were not getting
> > > > > fsynced. This meant that writes to the middle of files were not always
> > > > > getting fsynced properly.
> > > > >
> > > > > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> > > > > updated during a write. It also make gfs2_sync flush the incore log
> > > > > if I_DIRTY_PAGES is set, and the file is using data journalling. This
> > > > > will make sure that all incore logged data gets written to disk before
> > > > > returning from a fsync.
> > > > >
> > > > > Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> > > > > ---
> > > > > fs/gfs2/aops.c | 9 +++++++--
> > > > > fs/gfs2/file.c | 4 +++-
> > > > > 2 files changed, 10 insertions(+), 3 deletions(-)
> > > > >
> > > > > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > > ===================================================================
> > > > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> > > > > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
> > > > > unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
> > > > > unsigned int to = from + len;
> > > > > int ret;
> > > > > + struct gfs2_trans *tr = current->journal_info;
> > > > > + BUG_ON(!tr);
> > > > >
> > > > > BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
> > > > >
> > > > > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
> > > > > goto failed;
> > > > > }
> > > > >
> > > > > - gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > > -
> > > > > if (gfs2_is_stuffed(ip))
> > > > > return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
> > > > >
> > > > > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
> > > > > gfs2_page_add_databufs(ip, page, from, to);
> > > > >
> > > > > ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > > + if (tr->tr_num_buf_new)
> > > > > + __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > > > > + else
> > > > > + gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > > +
> > > > I'm not sure I follow this bit... generic_write_end() will call
> > > > mark_inode_dirty() if the file size has changed. So that case should be
> > > > covered. That leaves only timestamps which I think should be
> > > > I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
> > > > conditional here on tr_num_buf_new ? Does the "else" actually occur in
> > > > practice?
> > >
> > > The idea is to only set I_DIRTY_DATASYNC if we've actually updated
> > > metadata because of the write.
> >
> > I think we ought to be updating the metadata for every write (although
> > see case #3 below), since the timestamp should change each time. If the
> > size of the file doesn't change then we don't need to flush the metadata
> > on fdatasync, even though the timestamp has changed.
> >
> > I think we have three cases:
> >
> > 1. i_size changes: metadata must be flushed on fsync/fdatasync, so
> > I_DIRTY_DATASYNC should be set (done in generic_write_end())
> > 2. only timestamps change: metatdata must be flushed on fsync but not on
> > fdatasync, so I_DIRTY_SYNC should be set (I assume perhaps the
> > problematic case)
> > 3. write is of 0 bytes: I think we are allowed to skip all metadata
> > updates (need to check that we do this)
>
> I wasn't counting timestamp changes, since like you mention, those use
> I_DIRTY_SYNC, and are ignored on fdatasync. What I care about is writes
> that cause block allocations. The specific test in the bugzilla uses a
> holey file. The writes that aren't getting synced ARE causing an
> allocation, but ARE NOT increasing i_size, since they are not at the end
> of the file. This is a fourth case, and it's the case where we were
> falling down. I probably should have clarified that earlier.
>
Ok, that makes perfect sense to me. Lets go with the patch "as is".
Sorry for not spotting that earlier.
> >
> > > Otherwise we flush the log for every
> > > write, even if we aren't data journalling and haven't changed any
> > > metadata. If the write happened without adding any buffers to the
> > > transaction (which happens if we are just writing over already allocated
> > > space), then tr->tr_num_buf_new will be 0. This means we haven't
> > > actually dirtied any metadata, and we don't need to flush it to the log
> > > during the fsync.
> > >
> > Does that happen for zero sized writes? Otherwise I'd be expecting us to
> > be updating time stamps at least on each write.
>
> In rewriting already allocated data, we will update the timestamp and
> the inode will be added to the incore log, but we don't need to flush
> that change in gfs2_fsync, since we are asking for fdatasync. I thought
> the whole point of the seperation between I_DIRTY_SYNC and
> I_DIRTY_DATASYNC was so that you could avoid syncing inodes where only
> timestamps had changed. The change will still get to disk, but if we
> happen to crash before that, the mtime on the file might be wrong. Isn't
> that just the price you pay for doing a datasync instead of a full sync?
>
Yes, that all sounds correct to me,
Steve.
More information about the Cluster-devel
mailing list