See below [HM].
On Wed, Apr 02, 2008 at 04:23:41PM -0400, Mikulas Patocka wrote:
Hi
Unfortunately, the bug with desynchronizing raid1 that someone pointed out
on Monday is real. The bug happens when you modify a page while it's
being written to a raid1 device --- the old version can be written to one
mirror leg, the new version to the other mirror leg. The raid1 code does not
notice this, marks the region clean after the writes finish, and the volume
stays desynchronized.
The ways data can be modified while they are being written:
1. an application does O_DIRECT IO and modifies the memory while the IO is
in flight.
--- this is a problem of the application and we don't have to care about
it.
2. an application maps a file for writing. The pdflush or kswapd daemon
writes the page in the background while the application is modifying it.
3. an application writes to a page with the write() syscall. This syscall can
race with pdflush or kswapd as well.
4. a filesystem modifies the buffer while it's being written by the pdflush
or kswapd daemons.
The pdflush and kswapd daemons run in the background and do periodic writes
of the modified data. pdflush is triggered regularly and writes data at a
specified interval (about 30 seconds), so that in case of a crash, the image
on disk is not too old. kswapd is triggered when free memory runs low
--- it writes file pages and filesystem buffers too.
In cases 2, 3 and 4 the data may be modified while they are being written,
but the kernel writes them again later. The sequence is something like:
clear dirty bit
submit IO
--- if the data are modified while the IO is in progress, the dirty bit is
turned on again, the data will be written again later, and any possible
corruption is corrected. --- so as long as the system does not crash, there
can't be a desynchronized mirror.
But if the system crashes before the data are written a second time, the
blocks may stay desynchronized.
An example of data corruption on ext2:
We have a dirty bitmap buffer
Pdflush clears the dirty flag and starts writing the buffer
The write is submitted to dm-raid1, it makes two requests and submits them
to two mirror devices
This operation races with another thread allocating a block on ext2 and
doing:
[HM] And taking an unlocked copy in the RAID driver won't help
application data integrity, because the application could still change the
data while the RAID driver is copying, leading to a coherent RAID set that
holds incorrect application data.
One can argue that this is ok in case of a crash, because the application
failed to flush any page changes