[Linux-cluster] About GFS1 and I/O barriers.

Wendy Cheng s.wendy.cheng at gmail.com
Wed Apr 2 14:26:58 UTC 2008


On Wed, Apr 2, 2008 at 5:53 AM, Steven Whitehouse <swhiteho at redhat.com>
wrote:

> Hi,
>
> On Mon, 2008-03-31 at 15:16 +0200, Mathieu Avila wrote:
> > On Mon, 31 Mar 2008 11:54:20 +0100,
> > Steven Whitehouse <swhiteho at redhat.com> wrote:
> >
> > > Hi,
> > >
> >
> > Hi,
> >
> > > Both GFS1 and GFS2 are safe from this problem since neither of them
> > > uses barriers. Instead we do a flush at the critical points to ensure
> > > that all data is on disk before proceeding with the next stage.
> > >
> >
> > I don't think this solves the problem.
> >
> > Consider a cheap iSCSI disk (no NVRAM, no UPS) accessed by all my GFS
> > nodes; this disk has its write cache enabled, which means it will reply
> > that write requests have completed even if they have not really been
> > written to the platters. The disk (like most disks nowadays) has some
> > logic that lets it optimize writes by re-scheduling them. It is possible
> > that all writes are ACK'd before the power failure, but only a fraction
> > of them were really performed: some from before the flush, some from
> > after the flush. Not all block writes issued before the flush were
> > performed, yet some blocks issued after the flush were written -> the
> > FS is corrupted.
> > So, after the power failure all data in the disk's write cache is
> > forgotten. If the journal data was in the disk cache, the journal was
> > not written to disk, but other metadata have been written, so there are
> > metadata inconsistencies.
> >
> I don't agree that write caching implies that I/O must be acked before
> it has hit disk. It might well be reordered (which is ok), but if we
> wait for all outstanding I/O completions, then we ought to be sure
> that all I/O is actually on disk, or at the very least that further
> I/O will not be reordered with already ACKed data. If devices are
> sending ACKs in advance of the I/O hitting disk, then I think that's
> broken behaviour.


You seem to assume that when the disk subsystem acks back, the data is
surely on disk. That is not correct. You may consider it broken behaviour,
mostly caused by firmware bugs, but it occurs more often than you would
expect. The problem is extremely difficult to debug from the host side. So I
think the proposal here is about how the filesystem should protect itself
from this situation (though I'm fuzzy about what the actual proposal is
without looking into the other subsystems, particularly the volume manager,
that are involved). You cannot say "oh, then I don't have the
responsibility. Please go talk to the disk vendors". Serious implementations
have been trying to find good ways to solve this issue.
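
To make the distinction concrete, here is a rough, hypothetical sketch (not
the actual GFS or ext3 commit path) of the two approaches being debated,
written against the 2.6-era buffer and block layer helpers as I remember
them; treat the exact calls and signatures as assumptions to be checked
against your tree:

/*
 * Hypothetical sketch only, not GFS or JBD code.  Assumes the 2.6.2x-era
 * helpers ll_rw_block(), wait_on_buffer() and blkdev_issue_flush().
 */
#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/blkdev.h>

/*
 * (1) "Flush at critical points": submit the dirty journal buffers and
 * wait for every completion.  With a volatile write cache enabled, a
 * completion may only mean the blocks reached the drive's cache, not
 * the media.
 */
static void commit_wait_only(struct buffer_head *bhs[], int n)
{
	int i;

	ll_rw_block(WRITE, n, bhs);
	for (i = 0; i < n; i++)
		wait_on_buffer(bhs[i]);
}

/*
 * (2) The same wait, followed by an explicit cache flush before the
 * commit is considered durable.  This is roughly the extra guarantee a
 * barrier (or an explicit flush command) adds on top of waiting for
 * completions.
 */
static int commit_with_flush(struct block_device *bdev,
                             struct buffer_head *bhs[], int n)
{
	commit_wait_only(bhs, n);
	return blkdev_issue_flush(bdev, NULL);
}

In case (1) a completion only tells you the device has accepted and ACKed
the request; case (2) additionally asks the device to empty its volatile
cache, which is essentially what a barrier implies when the device honours
it.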

-- Wendy

> Consider what happens if a device was to send an ACK for a write and
> then it discovers an uncorrectable error during the write - how would it
> then be able to report it since it had already sent an "ok"? So far as I
> can see the only reason for having the drive send an I/O completion back
> is to report the success or otherwise of the operation, and if that
> operation hasn't been completed, then we might just as well not wait for
> ACKs.
>
> > This is the problem that I/O barriers try to solve, by really forcing
> > the block device (and the block layer) to write all blocks issued
> > before the barrier before any block issued after the barrier starts
> > being written.
> >
> > The other solution is to completely disable the write cache of the
> > disks, but this leads to dramatically bad performance.
> >
> If it's a choice between poor performance that's correct and good
> performance that might lose data, then I know which I would choose :-)
> Not all devices support barriers, so it always has to be an option; ext3
> uses the barrier=1 mount option for this reason, and if a barrier fails
> (e.g. because the underlying device doesn't support barriers) it falls
> back to the same technique which we are using in gfs1/2.
>
> The other thing to bear in mind is that barriers, as currently
> implemented, are not really that great either. It would be nice to
> replace them with something that allows better performance with (for
> example) mirrors, where the only current method of implementing the
> barrier is to wait for all the I/O completions from all the disks in the
> mirror set (and thus we are back to waiting for outstanding I/O again).
>
> Steve.
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>