[dm-devel] [PATCH 7/7] Hold all write bios when errors are handled
Takahiro Yasui
tyasui at redhat.com
Wed Nov 25 15:43:26 UTC 2009
On 11/25/09 08:19, Mikulas Patocka wrote:
>
>
> On Tue, 24 Nov 2009, malahal at us.ibm.com wrote:
>
>> I need to look at the code again, but I thought any new writes to a
>> failed region go to a surviving leg. In that case, we end up returning
>> I/O's to the application after writing to a single leg.
>
> Writes always go to all the legs, see do_write(). Anyway, dmeventd removes
> the failed leg soon.
Is that correct? When a region is out of sync (NOSYNC), its I/Os are not
processed by do_write(); they are issued directly with generic_make_request()
in do_writes().
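
To make the distinction concrete, here is a toy model in Python (not kernel
code) of the routing I am describing. RegionState, legs_for_write() and the
default_leg argument are names made up for this illustration; the only thing
taken from the driver is that in-sync regions go through do_write() while
NOSYNC regions are sent with generic_make_request(). Which single leg receives
a NOSYNC write is an assumption here.

```python
from enum import Enum


class RegionState(Enum):
    IN_SYNC = "in_sync"   # region handled by do_write()
    NOSYNC = "nosync"     # region bypasses do_write()


def legs_for_write(state, legs, default_leg):
    """Return the legs a write reaches in this simplified model.

    Hypothetical helper: `default_leg` stands in for whichever single
    leg a NOSYNC write is mapped to (an assumption of this sketch).
    """
    if state is RegionState.NOSYNC:
        # Models the generic_make_request() path: one leg only.
        return [default_leg]
    # Models the do_write() path: the write is sent to every leg.
    return list(legs)


legs = ["primary", "secondary"]
in_sync_targets = legs_for_write(RegionState.IN_SYNC, legs, "primary")
nosync_targets = legs_for_write(RegionState.NOSYNC, legs, "primary")
```

In this model a NOSYNC write completes after reaching a single leg, which is
the case malahal was asking about.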
>>>> Also, we do need to do the above work only if "primary" leg fails. We
>>>> can continue to work just like the old code if "secondary" legs fail,
>>>> right? Not sure if this is worth optimizing though, but I would like to
>>>> see it implemented as it is just a few extra checks. We can have
>>>> primary_failure field like log_failure field.
>>
>>> I thought about it too, but concluded that we need to hold bios even if
>>> the secondary leg fails.
>>>
>>> Imagine this scenario:
>>> * secondary leg fails
>>> * write fails on the secondary leg and succeeds on the primary leg
>>> and is successfully completed
>>> * the computer crashes
>>> * after a reboot, the primary leg is inaccessible and the secondary leg is
>>> back online --- now raid1 would be returning stale data.
>>
>> The software can detect this case. We can fail this completely or use
>> the data from the secondary that could be "stale" with help from admin.
>> Let us call this method 1.
>
> You can't detect it because the computer crashed *before* you write the
> information that the secondary leg failed to the metadata.
>
> So, after a reboot, you can't tell if any mirror leg failed some requests
> before the crash.
>
>>> If we hold the bios when the secondary leg fails (as the patch does), one of
>>> these two scenarios happens:
>>>
>>> * secondary leg fails
>>> * write succeeds on the primary leg and is held
>>> * the computer crashes
>>> * after a reboot, the primary leg is inaccessible and the secondary leg is
>>> back online --- but we haven't completed the write, so the transaction
>>> wasn't reported as committed
>>>
>>> or
>>>
>>> * secondary leg fails
>>> * write succeeds on the primary leg and is held
>>> * dmeventd removes the secondary leg and the write succeeds
>>> * the computer crashes
>>> * after a reboot, the primary leg is inaccessible, the secondary leg was
>>> already removed by dmeventd, so the array is considered inaccessible. So
>>> it doesn't work but at least it doesn't revert already committed
>>> transaction.
>>
>> How is this latter case (it doesn't need a crash anyway)
>> different/better from the case where we detect that 'primary' is missing
>> and ask admin if he wants to use the data on the secondary or not. At
>> least, the admin has a choice with "method 1" and this doesn't have that
>> choice.
>
> If you always ask the admin whenever the primary leg fails and wait for his
> action, you lose fault-tolerance --- the computer would wait until the admin
> acts.
>
> The requirements are:
> * if one of the legs fails or the log fails, you must automatically continue
> without human intervention
> * if both legs fail, you must shut it down and not pretend that something
> was written when it wasn't (this would break durability requirement of
> transactions).
I agree with this point. An lvm mirror can sit underneath filesystems such as
ext3, and without this guarantee each filesystem and application would need to
handle these situations itself to prevent data corruption. I don't think that
is realistic; the underlying layer should prevent data corruption. I now
understand that both primary and secondary disks need to be blocked.
Thanks,
Taka