[dm-devel] DM-RAID1 data corruption
Mikulas Patocka
mpatocka at redhat.com
Tue Apr 14 20:46:20 UTC 2009
Hi
This is the scenario of data corruption that I was talking about:
The mirror has two legs, 0 and 1, and a log. Disk 0 is the default.
A write is propagated to both legs. The write fails on leg 0 and succeeds
on leg 1.
The function "write_callback" puts the bio on the "failure" list (if
errors_handled is true). It also wakes userspace.
do_failures pops the bios from ms->log_failure and calls dm_rh_mark_nosync
on them to mark the region nosync. dm_rh_mark_nosync completes the bio
with success.
*the computer crashes* (before the userspace daemon had a chance to run)
On the next reboot, disk 0 is revived (suppose that it temporarily failed
because of a loose cable, overheating, insufficient power or so, and the
condition is repaired), raid1 sees the set bit in the dirty bitmap and starts
copying data from disk 0 to disk 1.
The result: the write bio was ended with success, but the data was lost. For
databases, this might have bad consequences - committed transactions being
forgotten.
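
To make the sequence concrete, here is a toy model of the scenario in Python. It is not the kernel code: ToyMirror, its fields and its methods are names made up for illustration, and the only assumptions are the ones described above (the log records which region is nosync but not which leg holds the good data, and recovery copies from the default leg).

```python
class ToyMirror:
    """Toy two-leg mirror; hypothetical model, not dm-raid1 itself."""

    def __init__(self):
        self.legs = [{}, {}]   # per-leg contents: region -> data
        self.nosync = set()    # persistent log: regions marked nosync
        self.default_leg = 0   # disk 0 is the default

    def write(self, region, data, failing=()):
        """Write to every leg not listed in `failing`.

        Returns True (bio ended with success) if at least one leg took the
        write, which is the behaviour assumed in the scenario.
        """
        good = [i for i in range(len(self.legs)) if i not in failing]
        for i in good:
            self.legs[i][region] = data
        if len(good) < len(self.legs):
            # Only the region is logged, not which leg is still valid.
            self.nosync.add(region)
        return bool(good)

    def resync_after_reboot(self):
        """After a crash, copy every nosync region from the default leg."""
        src = self.legs[self.default_leg]
        for region in sorted(self.nosync):
            for i, leg in enumerate(self.legs):
                if i != self.default_leg:
                    leg[region] = src.get(region)
        self.nosync.clear()

    def read(self, region):
        return self.legs[self.default_leg].get(region)


m = ToyMirror()
m.write(7, "old")                        # both legs hold "old"
acked = m.write(7, "new", failing={0})   # leg 0 fails, leg 1 succeeds
# crash here, before userspace removes leg 0; leg 0 comes back on reboot
m.resync_after_reboot()
print(acked, m.read(7), m.legs[1][7])
```

In this model the second write is acknowledged, yet after the resync both legs hold the old data again.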
-
If the above scenario can't happen, pls. describe why.
What would be a possible way to fix this?
Delay all bios until the userspace code removes the failed mirror?
Or store the number of the default mirror in the log?
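
A rough sketch of how the second idea might behave, again as a toy Python model and not as a patch. LoggedMirror and its "default" log field are hypothetical; the assumption is that the log update reaches stable storage before the bio is completed.

```python
class LoggedMirror:
    """Toy mirror whose persistent log also stores the default leg.

    Hypothetical model for discussion, not dm-raid1 code.
    """

    def __init__(self):
        self.legs = [{}, {}]
        # Everything in `log` is assumed to survive a crash.
        self.log = {"nosync": set(), "default": 0}

    def write(self, region, data, failing=()):
        good = [i for i in range(len(self.legs)) if i not in failing]
        for i in good:
            self.legs[i][region] = data
        if not good:
            return False                 # no leg took the write: fail the bio
        if len(good) < len(self.legs):
            self.log["nosync"].add(region)
            if self.log["default"] not in good:
                # Assumed to be on disk before the bio is ended with success.
                self.log["default"] = good[0]
        return True

    def resync_after_reboot(self):
        default = self.log["default"]
        src = self.legs[default]
        for region in sorted(self.log["nosync"]):
            for i, leg in enumerate(self.legs):
                if i != default:
                    leg[region] = src.get(region)
        self.log["nosync"].clear()


m = LoggedMirror()
m.write(7, "old")
m.write(7, "new", failing={0})   # leg 0 fails; the log now names leg 1
m.resync_after_reboot()          # recovery copies from leg 1 to leg 0
print(m.legs[0][7], m.legs[1][7])
```

The toy ignores the case where different regions fail on different legs; a real fix would presumably also have to stop using the failed leg until it is resynchronized.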
Mikulas