[Linux-cluster] About GFS1 and I/O barriers.

Steven Whitehouse swhiteho at redhat.com
Wed Apr 2 15:17:08 UTC 2008


Hi,

On Wed, 2008-04-02 at 10:26 -0400, Wendy Cheng wrote:
> 
> 
> On Wed, Apr 2, 2008 at 5:53 AM, Steven Whitehouse
> <swhiteho at redhat.com> wrote:
>         Hi,
>         
>         On Mon, 2008-03-31 at 15:16 +0200, Mathieu Avila wrote:
>         > On Mon, 31 Mar 2008 11:54:20 +0100,
>         > Steven Whitehouse <swhiteho at redhat.com> wrote:
>         >
>         > > Hi,
>         > >
>         >
>         > Hi,
>         >
>         > > Both GFS1 and GFS2 are safe from this problem since
>         > > neither of them use barriers. Instead we do a flush at the
>         > > critical points to ensure that all data is on disk before
>         > > proceeding with the next stage.
>         > >
>         >
>         > I don't think this solves the problem.
>         >
>         > Consider a cheap iSCSI disk (no NVRAM, no UPS) accessed by
>         > all my GFS nodes; this disk has a write cache enabled, which
>         > means it will reply that write requests are performed even
>         > if they are not really written on the platters. The disk
>         > (like most disks nowadays) has some logic that allows it to
>         > optimize writes by re-scheduling them. It is possible that
>         > all writes are ACK'd before the power failure, but only a
>         > fraction of them were really performed: some are before the
>         > flush, some are after the flush.
>         > --Not all block writes before the flush were performed, but
>         > other blocks after the flush were written -> the FS is
>         > corrupted.--
>         > So, after the power failure all data in the disk's write
>         > cache is forgotten. If the journal data was in the disk
>         > cache, the journal was not written to disk, but other
>         > metadata have been written, so there are metadata
>         > inconsistencies.
>         >
>         
>         I don't agree that write caching implies that I/O must be
>         acked before it has hit disk. It might well be reordered
>         (which is ok), but if we wait for all outstanding I/O
>         completions, then we ought to be able to be sure that all I/O
>         is actually on disk, or at the very least that further I/O
>         will not be reordered with already ACKed data. If devices are
>         sending ACKs in advance of the I/O hitting disk then I think
>         that's broken behaviour.
> 
> You seem to assume that when the disk subsystem acks back, the data is
> surely on disk. That is not correct. You may consider it broken
> behavior, mostly from firmware bugs, but it occurs more often than you
> would expect. The problem is extremely difficult to debug from the host
> side. So I think the proposal here is about how the filesystem should
> protect itself from this situation (though I'm fuzzy about what the
> actual proposal is without looking into the other subsystems involved,
> particularly the volume manager). You cannot say "oh, then I don't have
> the responsibility, please go talk to the disk vendors". Serious
> implementations have been trying to find good ways to solve this issue.
> 
> -- Wendy
> 
If the data is not physically on disk when the ACK is sent back, then
there is no way for the fs to know whether the data has (at a later
date) failed to be written due to some error or other. Even ignoring
that for the moment and assuming that such errors never occur, I don't
think it's too unreasonable to expect, at a minimum, that all
acknowledged I/O will never be reordered with unacknowledged I/O. That
is all that is required for correct operation of gfs1/2, provided that
no media errors occur on write.

The message on lkml which Mathieu referred to suggested that there were
three kinds of devices, but it seems to me that type 2 (flushable)
doesn't exist so far as the fs is concerned, since blkdev_issue_flush()
just issues a BIO with only a barrier in it. A device driver might
support the barrier request either by waiting for all outstanding I/O
and issuing a flush command (if required), or by passing the barrier
down to the device, assuming that it supports such a thing directly.
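
For reference, this is roughly the shape of such an "empty barrier",
assuming a 2.6.25-era block layer (the function names here are made up,
but the pattern follows what blkdev_issue_flush() itself does): a BIO
with no data and only the barrier flag set, so whether that becomes a
cache flush, a queue drain or an ordered tag is entirely up to the
driver underneath.

/*
 * Sketch of an "empty barrier": no data, just the BIO_RW_BARRIER flag.
 * The driver decides how (or whether) to honour it.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

static void end_empty_barrier(struct bio *bio, int error)
{
	complete(bio->bi_private);
}

static int issue_empty_barrier(struct block_device *bdev)
{
	DECLARE_COMPLETION_ONSTACK(wait);
	struct bio *bio;
	int ret = 0;

	bio = bio_alloc(GFP_KERNEL, 0);	/* zero data pages: barrier only */
	bio->bi_bdev = bdev;
	bio->bi_end_io = end_empty_barrier;
	bio->bi_private = &wait;

	submit_bio(1 << BIO_RW_BARRIER, bio);
	wait_for_completion(&wait);

	/*
	 * Simplified error handling: a device/driver with no barrier
	 * support will typically fail this BIO rather than honour it.
	 */
	if (!bio_flagged(bio, BIO_UPTODATE))
		ret = -EIO;
	bio_put(bio);
	return ret;
}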

Further down the message (the URL is http://lkml.org/lkml/2007/5/25/71
btw) there is a list of dm/md implementation status, and it seems that
for a good number of the common targets there is little or no barrier
support at the moment anyway.

Now I agree that it would be nice to support barriers in GFS2, but it
won't solve any problems relating to ordering of I/O unless all of the
underlying devices support them too. See also Alasdair's response to
the thread: http://lkml.org/lkml/2007/5/28/81

So although I'd like to see barrier support in GFS2, it won't solve any
problems for most people, and really it's a device/block layer issue at
the moment.

Steve.




