[dm-devel] Barriers still not passing on simple dm devices...

Sat Apr 4 15:20:35 UTC 2009

On 04/03/2009 04:11 AM, Jens Axboe wrote:
> On Thu, Apr 02 2009, Mikulas Patocka wrote:
>    
>> On Tue, 31 Mar 2009, Jens Axboe wrote:
>>
>>      
>>> On Mon, Mar 30 2009, Mikulas Patocka wrote:
>>>        
>>>> On Thu, 26 Mar 2009, Jens Axboe wrote:
>>>>
>>>>          
>>>>> On Wed, Mar 25 2009, Mikulas Patocka wrote:
>>>>>
>>>>>            
>>>>>>>> So I think there should be a flag (this device does/doesn't support data
>>>>>>>> consistency) that the journaled filesystems can use to mark the disk dirty
>>>>>>>> for fsck. And if you implement this flag, you can accept barriers always
>>>>>>>> to all kinds of devices regardless of whether they support consistency. You
>>>>>>>> can then get rid of that -EOPNOTSUPP and simplify filesystem code because
>>>>>>>> they'd no longer need two commit paths and a clumsy way to restart
>>>>>>>> -EOPNOTSUPPed requests.
>>>>>>>>                  
>>>>>>> And my point is that this case isn't interesting, because most setups
>>>>>>> don't guarantee proper ordering.
>>>>>>>                
>>>>>> If the ordering isn't guaranteed, the filesystem should know about it, and
>>>>>> mark the partition for fsck. That's why I'm suggesting to use a flag for
>>>>>> that. That flag could be also propagated up through md and dm.
>>>>>>              
>>>>> We can do that, not a problem. The problem is that ordering is almost
>>>>> never preserved, SCSI does not use ordered tags because it hasn't
>>>>> verified that its error path doesn't reorder by mistake. So right now
>>>>> you can basically use 'false' as that flag.
>>>>>            
>>>> There are three ordering guarantees:
>>>>
>>>> 1. - nothing (for devices with write cache without cache control)
>>>>
>>>> 2. - non-cached ordering: the sequence [submit req a, end req a, submit
>>>> req b, end req b] will make the ordering. It is guaranteed that when the
>>>> request ends successfully, it is on medium. This is what all the
>>>> filesystems, md and dm assume about disks. This consistency model was used
>>>> long before barriers came in.
>>>>
>>>> 3. - barrier ordering: ordering is done with barriers, [submit req a, end
>>>> req a, submit req b, end req b] won't guarantee ordering of a and b, a
>>>> barrier must be inserted.
>>>>          
>>> Plus the barrier also allows [submit req a, submit req b] and still
>>> count on ordering if either one of them is a barrier. It doesn't have to
>>> be sync, like the (2).
>>>
>>>        
>>>> --- so you can make two bitflags that differentiate these models. In
>>>> current kernel, model (1) and (2) cannot be differentiated in any way. (3)
>>>> can be differentiated only after a trial write and it won't guarantee that
>>>> (3) will be valid further.
>>>>          
>>> But what's the point? Basically no devices are naturally ordered by
>>> default. Either you need cache flushes, or you need to tell the device
>>> not to reorder on a per-command basis.
>>>
>>>        
>>>>>> The reasoning: "write barriers aren't supported =>  the device doesn't
>>>>>> guarantee consistency" isn't valid.
>>>>>>              
>>>>> It's valid in the sense that it's the only RELIABLE primitive we have.
>>>>> Are you really suggesting that we just assume any device is fully
>>>>> ordered, unless proven otherwise?
>>>>>            
>>>> If someone implements "write barriers aren't supported =>  run fsck", then
>>>> a lot of systems start fscking needlessly (for example those using md or
>>>> dm without write cache) and become inoperable for a long time because of
>>>> that. So no one can really implement this logic and filesystems don't run
>>>> fsck at all when operated over a device that doesn't support ordering. So
>>>> you get data corruption if you get a crash on those devices.
>>>>          
>>> Nobody is suggesting that, it's just not a feasible approach. But you
>>>        
>> I am saying that the filesystem should run fsck if a journaled filesystem is
>> mounted on an unsafe device and a crash happens.
>>
>>      
>>> have to warn if you don't know whether it provides the ordering
>>> guarantee you expect to provide consistency and integrity.
>>>        
>> The warning of missing barriers (or other actions) should be printed only
>> if write cache is enabled. But there's no way a filesystem on top
>> of several dm or md layers can find out if the disk is running with hdparm
>> -W 0 or hdparm -W 1.
>>      
>
> Right, you can't possibly know that. Hence we have to print the warning.
>
>    
>>>> The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen
>>>> submitted a patch that implements failing barriers for device mapper and
>>>> he says that md-raid1 does the same thing.
>>>>          
>>> You are right, if a device is reconfigured beneath you it may very well
>>> begin to return -EOPNOTSUPP much later. I didn't take that into account,
>>> I was considering only "plain" devices.
>>>
>>>        
>>>> Filesystems handle these randomly failed barriers but the downside is that
>>>> they must not submit any request concurrently with the barrier. Also, that
>>>> -EOPNOTSUPP restarting code is really crap, the request cannot be
>>>> restarted from bi_end_io, so bi_end_io needs to hand off to another thread
>>>> for retry without barrier.
>>>>          
>>> It can, but it requires you to operate at the request level. So for file
>>> systems that is problematic, it won't work of course. It would not be
>>> THAT hard to provide a helper to reissue the request. Not that pretty,
>>> but...
>>>        
>> And it makes barriers useless for ordering.
>>
>> The filesystem can't do [submit req a], [submit barrier req b], [submit
>> req c] and assume that the requests will be ordered. If [b] fails with
>> -EOPNOTSUPP, [a] and [c] could be already reordered and data corruption
>> has already happened. Even if you catch [b]'s error and resubmit it as
>> non-barrier request, it's too late.
>>
>> So, as a result of this complication, all the existing filesystems send
>> just one barrier request and do not try to overlay it with any other write
>> requests.
>>
>> So I'm wondering why Linux developers designed a barrier interface with
>> complex specification, complex implementation and the interface is useless
>> to provide any request ordering and it's no better than q->issue_flush_fn
>> method or whatever was there before. Obviously, the whole barrier thing
>> was designed by a person who never used it in a filesystem.
>>      
>
> That's not quite true, it was done in conjunction with file system
> people. At a certain level, we are restricted by what the hardware can
> actually do. It's certainly possible to make sure your storage stack can
> support barriers and be safe in that regard, but it's certainly also
> true that reconfiguring devices may void that guarantee. So it's not
> perfect, but it's the best we can do. The worst part is that it's
> virtually impossible to inform of such limitations.
>
> If we get rid of -EOPNOTSUPP and just warn in such cases, then you
> should never see -EOPNOTSUPP in the above sequence. You may not actually
> be safe, hence we print a warning. It'll also make the whole thing a lot
> less complex.
>
> And to wrap up with the history of barriers, there was NOTHING before.
> ->issue_flush_fn is a later addition to just force a flush for fsync()
> and friends, the original implementation was just a data bio/bh with
> barrier semantics, providing no reordering before and after the data
> passed.
>
> Nobody was interested in barriers when they were done. Nobody. The fact
> that it's taken 6 years or so to actually emerge as a hot topic for data
> consistency should make that quite obvious. So the original
> implementation was basically a joint effort with Chris on the reiser
> side and EMC as the hw vendor and me doing the block implementation.
>    

And I will restate that back at EMC we tested the original barriers 
(with reiserfs mostly, a bit on ext3 and ext2) and saw significant 
reduction in file system integrity issues after power loss.

The vantage point I had at EMC while testing and deploying the original 
barrier work done by Jens and Chris was pretty unique - full ability to 
do root cause analysis of any component failure when really needed, a huge 
installed base which could send information home on a regular basis 
about crashes/fsck instances/etc and the ability (with customer 
permission) to dial into any box and diagnose issues remotely. Not to 
mention access to drive vendors to pressure them to make the flushes 
more robust. The application was also able to validate that all 
acknowledged writes were consistent.

Barriers do work as we have them, but as others have mentioned, it is 
not a "free" win - fsync will actually move your data safely out to 
persistent storage for a huge percentage of real users (including every 
ATA/S-ATA and SAS drive I was able to test).  The file systems I 
monitored in production use without barriers were much less reliable.
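
For reference, the application-side pattern this depends on looks roughly 
like the sketch below. It is only an illustration: the helper name 
durable_write and the file name are made up, and whether the data really 
reaches the medium still depends on every layer of the storage stack 
honoring the flush, which is the whole point of this thread.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data, then ask the kernel to push it to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain the userspace buffer into the page cache
        os.fsync(f.fileno())  # request a flush of the file down to the device


if __name__ == "__main__":
    # Hypothetical file name, used only for this demonstration.
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "journal.bin")
        durable_write(p, b"commit record")
        print(os.path.getsize(p))
```

When fsync() returns, the kernel has been asked to flush; a drive with a 
volatile write cache and no working flush can still lose the data.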

As others have noted, some storage does not need barriers or flushes 
(high end arrays, drives with no volatile write cache) and some need it 
but stink (low cost USB flash sticks?) so warning is a good thing to do...
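
To make the flag idea from the quoted discussion concrete, here is a toy 
model of the three ordering guarantees as two bitflags, propagated up 
through a stacked device, with a warning instead of -EOPNOTSUPP when 
nothing is guaranteed. None of these names (Ordering, stacked, 
mount_policy) exist in the kernel; they are assumptions for illustration, 
and the "AND over all components" rule is my simplification of how md/dm 
might combine the flags.

```python
from enum import Flag
from functools import reduce


class Ordering(Flag):
    NONE = 0                # model 1: write cache, no cache control
    COMPLETION_DURABLE = 1  # model 2: a completed request is on the medium
    BARRIER = 2             # model 3: ordering is provided by barriers


ALL = Ordering.COMPLETION_DURABLE | Ordering.BARRIER


def stacked(components):
    """Assumed rule: a stacked device keeps only what every component provides.

    An empty list yields ALL, which is just the identity of the AND.
    """
    return reduce(lambda a, b: a & b, components, ALL)


def mount_policy(flags):
    """What a journaled filesystem might do with the combined flags."""
    if not flags:
        # Accept the I/O anyway, but warn and do not trust the journal alone.
        return "warn; mark for fsck after a crash"
    if Ordering.BARRIER in flags:
        return "order commits with barriers"
    return "order commits by waiting for completion"


if __name__ == "__main__":
    array = ALL                          # e.g. storage with no volatile cache
    usb_stick = Ordering.NONE            # cache with no working flush
    print(mount_policy(stacked([array, array])))
    print(mount_policy(stacked([array, usb_stick])))
```

With flags like these the filesystem would learn at mount time what it can 
rely on, instead of discovering it from a failed trial barrier.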

ric