[dm-devel] [PATCH 7/7] Hold all write bios when errors are handled

Wed Nov 25 15:43:26 UTC 2009

On 11/25/09 08:19, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Nov 2009, malahal at us.ibm.com wrote:
> 
>> I need to look at the code again, but I thought any new writes to a
>> failed region go to a surviving leg. In that case, we end up returning
>> I/O's to the application after writing to a single leg.
> 
> Writes always go to all the legs, see do_write(). Anyway, dmeventd removes 
> the failed leg soon.

Is that correct? When a region is out of sync (NOSYNC), its I/Os are not
processed by do_write(); they are submitted directly with
generic_make_request() in do_writes().
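
To illustrate what I mean, here is a rough userspace sketch of the dispatch
as I read it. It is not the kernel code: the enum names and classify_write()
are made up for illustration, and only the NOSYNC behaviour is taken from my
reading of do_writes().

```c
#include <assert.h>

/* Hypothetical stand-ins for region states; not the kernel's definitions. */
enum region_state {
    REGION_IN_SYNC,   /* all legs hold the same data for this region */
    REGION_NOSYNC     /* region is out of sync */
};

/* Which path a write bio takes (illustrative names only). */
enum write_path {
    PATH_ALL_LEGS,    /* mirrored write to every leg, as do_write() does */
    PATH_SINGLE_LEG   /* submitted to one leg via generic_make_request() */
};

/* Pick the write path for a bio based on the state of its region. */
static enum write_path classify_write(enum region_state state)
{
    switch (state) {
    case REGION_IN_SYNC:
        return PATH_ALL_LEGS;
    case REGION_NOSYNC:
        return PATH_SINGLE_LEG;
    }
    assert(0 && "unknown region state");
    return PATH_SINGLE_LEG;
}
```

If this reading is right, a write to a NOSYNC region can complete after
reaching only one leg, which is the case malahal described.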

>>>> Also, we do need to do the above work only if "primary" leg fails. We
>>>> can continue to work just like the old code if "secondary" legs fail,
>>>> right? Not sure if this is worth optimizing though, but I would like to
>>>> see it implemented as it is just a few extra checks. We can have
>>>> primary_failure field like log_failure field.
>>  
>>> I thought about it too, but concluded that we need to hold bios even if 
>>> the secondary leg fails.
>>>
>>> Imagine this scenario:
>>> * secondary leg fails
>>> * write fails on the secondary leg and succeeds on the primary leg 
>>> and is successfully completed
>>> * the computer crashes
>>> * after a reboot, the primary leg is inaccessible and the secondary leg is 
>>> back online --- now raid1 would be returning stale data.
>>
>> The software can detect this case. We can fail this completely or use
>> the data from the secondary that could be "stale" with help from admin. 
>> Let us call this method 1.
> 
> You can't detect it because the computer crashed *before* you write the 
> information that the secondary leg failed to the metadata.
> 
> So, after a reboot, you can't tell if any mirror leg failed some requests 
> before the crash.
> 
>>> If we hold the bios if the secondary leg fails (as the patch does), one of 
>>> these two scenarios happens:
>>>
>>> * secondary leg fails
>>> * write succeeds on the primary leg and is held
>>> * the computer crashes
>>> * after a reboot, the primary leg is inaccessible and the secondary leg is
>>> back online --- but we haven't completed the write, so the transaction 
>>> wasn't reported as committed
>>>
>>> or
>>>
>>> * secondary leg fails
>>> * write succeeds on the primary leg and is held
>>> * dmeventd removes the secondary leg and the write succeeds
>>> * the computer crashes
>>> * after a reboot, the primary leg is inaccessible, the secondary leg was 
>>> already removed by dmeventd, so the array is considered inaccessible. So 
>>> it doesn't work but at least it doesn't revert an already committed 
>>> transaction.
>>
>> How is this latter case (it doesn't need a crash anyway)
>> different/better from the case where we detect that 'primary' is missing
>> and ask admin if he wants to use the data on the secondary or not. At
>> least, the admin has a choice with "method 1" and this doesn't have that
>> choice.
> 
> If you always ask the admin when the primary leg fails and wait for his 
> action, you lose fault-tolerance --- the computer would wait until the 
> admin acts.
> 
> The requirements are:
> * if one of the legs fails or the log fails, you must automatically continue 
> without human intervention
> * if both legs fail, you must shut it down and not pretend that something 
> was written when it wasn't (this would break the durability requirement of 
> transactions).

I agree with this point. lvm mirror can be used under filesystems such as
ext3. Otherwise each filesystem and application would have to handle these
situations itself to prevent data corruption. I don't think that is
realistic, so the underlying layer should prevent data corruption. I now
understand that writes need to be blocked for failures of both primary and
secondary disks.
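
To check my understanding of the requirements, here is a small userspace
sketch of the completion rule. It is only a model under my own assumptions:
struct write_result, failure_recorded and decide_write() are hypothetical
names, not anything from the patch or from dm-raid1.c.

```c
#include <assert.h>
#include <stdbool.h>

/* What to do with a write bio once all legs have answered. */
enum write_verdict {
    WRITE_COMPLETE,   /* safe to report success to the caller */
    WRITE_HOLD,       /* keep the bio until the failure is handled */
    WRITE_ERROR       /* no leg took the data; report the error */
};

/* Hypothetical summary of one mirrored write. */
struct write_result {
    unsigned legs_total;     /* legs the write was sent to */
    unsigned legs_ok;        /* legs on which it succeeded */
    bool failure_recorded;   /* failed legs already removed/recorded,
                                e.g. by dmeventd */
};

/*
 * The rule does not look at which leg failed: a partial failure is held
 * whether it hit the primary or a secondary leg.
 */
static enum write_verdict decide_write(const struct write_result *r)
{
    assert(r->legs_ok <= r->legs_total);

    if (r->legs_ok == 0)
        return WRITE_ERROR;
    if (r->legs_ok == r->legs_total)
        return WRITE_COMPLETE;
    return r->failure_recorded ? WRITE_COMPLETE : WRITE_HOLD;
}
```

With two legs and one failure, the write stays held until the failure is
recorded, so a crash in between never leaves a write that was reported as
committed on only one leg.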

Thanks,
Taka