[dm-devel] Snapshot To-Do List

Christophe Saout christophe at saout.de
Wed Dec 31 11:55:01 UTC 2003


On Wed, 31.12.2003 at 18:25, Kevin Corry wrote:

> As Joe mentioned last week, we've been tossing around some ideas for changes 
> and bug-fixes to the DM snapshot code. So here's a first crack at a to-do 
> list. Perhaps Joe can put a copy of this on his web site. Anyone else with 
> comments or ideas, feel free to add to this list.

Thank you. I'm currently going through the code myself because I had
a server crash when the backup script tried to take snapshots
(unfortunately I couldn't see the oops, and I've stopped the backup
script for now). This apparently didn't happen with an older version of
the snapshot code, but perhaps that was just luck. ;)

I can reproduce massive data corruption here when taking snapshots with
reiserfs (on the origin device!), so reiserfs probably caused the
oops.

I wanted to further investigate this.

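while true; do
        # put some data on the origin (presumably /data is /dev/vg/data)
        cp -r /usr/src/linux-2.6.0/drivers/net /data/
        lvcreate -s -L 300M -n snap-data /dev/vg/data
        sync
        mount /dev/vg/snap-data /mnt/tmp
        # modify the origin while the snapshot is mounted
        rm -Rf /data/net
        umount /mnt/tmp
        lvremove -f /dev/vg/snap-data
done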

Makes reiserfs go crazy.

> 1. Reads to the snapshot
> 
> Currently, a read for the snapshot is only submitted to the cow device when 
> there's a completed-exception. If there's a pending-exception, the request is 
> still sent to the origin device. Instead, the request should be queued on the 
> pending-exception, just like for the write requests.

I also noted this when looking through the code. Perhaps this is causing
the trouble I'm seeing. I wanted to experiment a bit and try to see if
changing this fixes the problem.
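
To make the difference concrete, here is a minimal sketch in C of the
read-mapping decision described in item 1. All names (chunk_state,
read_target, map_snapshot_read_current, map_snapshot_read_proposed) are
made up for illustration and are not the real dm-snapshot code; it only
models which way a read goes, not the actual queueing.

```c
#include <assert.h>

/* Hypothetical per-chunk state; not the real dm-snapshot types. */
enum chunk_state { CHUNK_UNCOPIED, CHUNK_PENDING, CHUNK_COMPLETE };

/* Where a read addressed to the snapshot ends up. */
enum read_target { READ_ORIGIN, READ_COW, READ_QUEUE_ON_PENDING };

/* Behaviour as described in item 1: only a completed exception
 * redirects the read to the cow device; a pending one still goes
 * to the origin. */
static enum read_target map_snapshot_read_current(enum chunk_state s)
{
        return s == CHUNK_COMPLETE ? READ_COW : READ_ORIGIN;
}

/* Proposed behaviour: while the copy is pending, the read is held
 * back on the pending exception, like the write requests. */
static enum read_target map_snapshot_read_proposed(enum chunk_state s)
{
        switch (s) {
        case CHUNK_COMPLETE:
                return READ_COW;
        case CHUNK_PENDING:
                return READ_QUEUE_ON_PENDING;
        case CHUNK_UNCOPIED:
                return READ_ORIGIN;
        }
        assert(!"unknown chunk state");
        return READ_ORIGIN;
}
```

The two functions differ only for CHUNK_PENDING, which is exactly the
window in which a read could race with the copy-out.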

(Until now I was busy tracking down a bug in dm-crypt that someone was
seeing; I think I found it. It's a nasty race condition which I
can't reproduce here, but it is definitely a big bug...)

I also noticed that the snapshot code reorders the BIOs: when queueing
single bios it uses something like a stack instead of a FIFO.
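
A small sketch in C of why this reorders requests. The types and
helpers here (fake_bio, push_head, bio_fifo, fifo_add, flush_order) are
stand-ins invented for this example, not the kernel's structures:
pushing each new request onto the head of a singly linked list means
the list is walked newest-first, whereas appending at a tail pointer
preserves submission order.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a queued request; not the kernel's struct bio. */
struct fake_bio {
        int id;
        struct fake_bio *next;
};

/* Stack-style queueing: push at the head, so the newest request
 * is the first one seen when the list is walked. */
static void push_head(struct fake_bio **head, struct fake_bio *b)
{
        assert(b != NULL);
        b->next = *head;
        *head = b;
}

/* FIFO queueing: remember the tail and append there. */
struct bio_fifo {
        struct fake_bio *head;
        struct fake_bio *tail;
};

static void fifo_add(struct bio_fifo *q, struct fake_bio *b)
{
        assert(b != NULL);
        b->next = NULL;
        if (q->tail)
                q->tail->next = b;
        else
                q->head = b;
        q->tail = b;
}

/* Record the ids in the order the list would be flushed;
 * returns how many ids were written to out. */
static size_t flush_order(const struct fake_bio *b, int *out, size_t max)
{
        size_t n = 0;

        for (; b && n < max; b = b->next)
                out[n++] = b->id;
        return n;
}
```

Queueing requests 1, 2, 3 with push_head() and then walking the list
yields 3, 2, 1; the same requests added with fifo_add() come back as
1, 2, 3.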

And while blocks are being flushed, even the generic code allows new bios
to be submitted in parallel instead of delaying them as well.

Jens Axboe confirmed that this will cause trouble once there are BIO
users that submit barriers.





