[libvirt] RFCv2: virDomainSnapshotCreateXML enhancements

Kevin Wolf kwolf at redhat.com
Thu Aug 11 10:00:46 UTC 2011


On 11.08.2011 00:08, Eric Blake wrote:
> [BCC'ing those who have responded to earlier RFC's]
> 
> I've posted previous RFCs for improving snapshot support:
> 
> ideas on managing a subset of disks:
> https://www.redhat.com/archives/libvir-list/2011-May/msg00042.html
> 
> ideas on managing snapshots of storage volumes not tied to a domain
> https://www.redhat.com/archives/libvir-list/2011-June/msg00761.html
> 
> After re-reading the feedback received on those threads, I think I've 
> settled on a pretty robust design for my first round of adding 
> improvements to the management of snapshots tied to a domain, while 
> leaving the door open for future extensions.
> 
> Sorry this email is so long (I've had it open in my editor for more than 
> 48 hours now as I keep improving it), but hopefully it is worth the 
> effort to read.  See the bottom if you want the shorter summary on the 
> proposed changes.

It was definitely a good read, thanks for writing it up.

Of course, I'm not really familiar with libvirt (now a bit more than
before :-)), so all my comments are from a qemu developer perspective.
Some of them may look like stupid questions or turn out to be
misunderstandings, but I hope it's still helpful for you to see how qemu
people understand things.

> 
> First, some definitions:
> ========================
> 
> disk snapshot: the state of a virtual disk used at a given time; once a 
> snapshot exists, then it is possible to track a delta of changes that 
> have happened since that time.
> 
> internal disk snapshot: a disk snapshot where both the saved state and 
> delta reside in the same file (possible with qcow2 and qed).  If a disk 
> image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.

QED doesn't support internal snapshots.

> external disk snapshot: a disk snapshot where the saved state is one 
> file, and the delta is tracked in another file.  For a disk image not in 
> use by qemu, this can be done with qemu-img to create a new qcow2 file 
> wrapping any type of existing file.  Recent qemu has also learned the 
> 'snapshot_blkdev' monitor command for creating external snapshots while 
> qemu is using a disk, and the goal of this RFC is to expose that 
> functionality from within existing libvirt APIs.
> 
> saved state: all non-disk information used to resume a guest at the same 
> state, assuming the disks did not change.  With qemu, this is possible 
> via migration to a file.

Is this terminology already used in libvirt? In qemu we tend to call it
the VM state.

> checkpoint: a combination of saved state and a disk snapshot.  With 
> qemu, the 'savevm' monitor command creates a checkpoint using internal 
> snapshots.  It may also be possible to combine saved state and disk 
> snapshots created while the guest is offline for a form of 
> checkpointing, although this RFC focuses on disk snapshots created while 
> the guest is running.
> 
> snapshot: can be either 'disk snapshot' or 'checkpoint'; the rest of 
> this email will attempt to use 'snapshot' where either form works, and a 
> qualified term where no ambiguity is intended.
> 
> Existing libvirt functionality
> ==============================
> 
> The virDomainSnapshotCreateXML currently manages a hierarchy of 
> "snapshots", although it is currently only used for "checkpoints", where 
> every snapshot has a name and a possibly empty parent.  The idea is that 
> once a domain has a snapshot, there is always a current snapshot, and 
> all new snapshots are created with a parent of a previously existing 
> snapshot (although there are still some bugs to be fixed in managing the 
> current snapshot over a libvirtd restart).  It is possible to have 
> disjoint hierarchies, if you delete a root snapshot that had more than 
> one child (making both children become independent roots).  The snapshot 
> hierarchy is maintained by libvirt (in a typical installation, the files 
> in /var/lib/libvirt/qemu/snapshot/<dom>/<name> track each named 
> snapshot, using <domainsnapshot> XML); this uses additional metadata not 
> present in the qcow2 internal snapshot format (that is, while qcow2 can 
> maintain multiple snapshots, it does not maintain relations between 
> them).  Remember, the "current" snapshot is not the current machine 
> state, but the snapshot that would become the parent if you create a new 
> snapshot; perhaps we could have named it the "loaded" snapshot, but the 
> API names are set in stone now.
> 
> Libvirt also has APIs for listing all snapshots, querying the current 
> snapshot, reverting back to the state of another snapshot, and deleting 
> a snapshot.  Deletion comes with a choice of deleting just that named 
> version (removing one node in the hierarchy and re-parenting all 
> children) or that tree of the hierarchy (that named version and all 
> children).
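
For my own understanding, the existing API surface described above seems
to boil down to something like the following sketch (listing snapshots,
querying the current one, reverting, deleting); it assumes an already
opened virDomainPtr and omits all error handling:

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

static void walk_snapshots(virDomainPtr dom)
{
    /* enumerate all snapshot names known for this domain */
    int n = virDomainSnapshotNum(dom, 0);
    if (n <= 0)
        return;
    char **names = calloc(n, sizeof(*names));
    n = virDomainSnapshotListNames(dom, names, n, 0);

    for (int i = 0; i < n; i++)
        printf("snapshot: %s\n", names[i]);

    /* the "current" snapshot: the parent of the next snapshot created */
    virDomainSnapshotPtr cur = virDomainSnapshotCurrent(dom, 0);
    if (cur) {
        virDomainRevertToSnapshot(cur, 0);   /* revert to it ... */
        /* ... or drop it together with its descendants:
         * virDomainSnapshotDelete(cur, VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN); */
        virDomainSnapshotFree(cur);
    }

    for (int i = 0; i < n; i++)
        free(names[i]);
    free(names);
}
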
> 
> Since qemu checkpoints can currently only be created via internal disk 
> snapshots, libvirt has not had to track any file name relationships - a 
> single "snapshot" corresponds to a qcow2 snapshot name within all qcow2 
> disks associated to a domain; furthermore, snapshot creation was limited 
> to domains where all modifiable disks were already in qcow2 format. 
> However, these "checkpoints" could be created on both running domains 
> (qemu savevm) or inactive domains (qemu-img snapshot -c), with the 
> latter technically being a case of just internal disk snapshots.
> 
> Libvirt currently has a bug in that it only saves <domain>/<uuid> rather 
> than the full domain xml along with a checkpoint - if any devices are 
> hot-plugged (or in the case of offline snapshots, if the domain 
> configuration is changed) after a snapshot but before the revert, then 
> things will most likely blow up due to the differences in devices in use 
> by qemu vs. the devices expected by the snapshot.

Offline snapshot means that it's only a disk snapshot, so I don't think
there is any problem with changing the hardware configuration before
restoring it.

Or does libvirt try to provide something like offline checkpoints, where
restoring would not only restore the disk but also roll back the libvirt
configuration?

I guess this paragraph could use some clarification.

> Reverting to a snapshot can also be considered as a form of data loss - 
> you are discarding the disk changes and ram state that have happened 
> since the last snapshot.  To some degree, this is by design - the very 
> nature of reverting to a snapshot implies throwing away changes; 
> however, it may be nice to add a safety valve so that by default, 
> reverting to a live checkpoint from an offline state works, but 
> reverting from a running domain should require some confirmation that it 
> is okay to throw away accumulated running state.
> 
> Libvirt also currently has a limitation where snapshots are local to one 
> host - the moment you migrate a snapshot to another host, you have lost 
> access to all snapshot metadata.
> 
> Proposed enhancements
> =====================
> 
> Note that these proposals merely add xml attribute and subelement 
> extensions, as well as API flags, rather than creating any new API, 
> which makes it a nice candidate for backporting the patch series based 
> on this RFC into older releases as appropriate.
> 
> Creation
> ++++++++
> 
> I propose reusing the virDomainSnapshotCreateXML API and 
> <domainsnapshot> xml for both "checkpoints" and "disk snapshots", all 
> maintained within a single hierarchy.  That is, the parent of a disk 
> snapshot can be a checkpoint or another disk snapshot, and the parent of 
> a checkpoint can be another checkpoint or a disk snapshot.  And, since I 
> defined "snapshot" to mean either "checkpoint" or "disk snapshot", this 
> single hierarchy of "snapshots" will still be valid once it is expanded 
> to include more than just "checkpoints".  Since libvirt already has to 
> maintain additional metadata to track parent-child relationships between 
> snapshots, it should not be hard to augment that XML to store additional 
> information needed to track external disk snapshots.
> 
> The default is that virDomainSnapshotCreateXML(,0) creates a checkpoint, 
> while leaving qemu running; I propose two new flags to fine-tune things: 
> virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_HALT) will 
> create the checkpoint then halt the qemu process, and 
> virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY) will 
> create a disk snapshot rather than a checkpoint (on qemu, by using a 
> sequence including the new 'snapshot_blkdev' monitor command). 
> Specifying both flags at once is a form of data loss (you are losing the 
> ram state), and I suspect it to be rarely used, but since it may be 
> worthwhile in testing whether a disk snapshot is truly crash-consistent, 
> I won't refuse the combination.
> 
> Other flags may be added in the future; I know of at least two features 
> in qemu that may warrant some flags once they are stable: 1. a guest 
> agent fsfreeze/fsthaw command will allow the guest to get the file 
> system into a stable state prior to the snapshot, meaning that reverting 
> to that snapshot can skip out on any fsck or journal replay actions.  Of 
> course, this is a best effort attempt since guest agent interaction is 
> untrustworthy (comparable to memory ballooning - the guest may not 
> support the agent or may intentionally send falsified responses over the 
> agent), so the agent should only be used when explicitly requested - 
> this would be done with a new flag 
> VIR_DOMAIN_SNAPSHOT_CREATE_GUEST_FREEZE.  2. there is thought of adding 
> a qemu monitor command to freeze just I/O to a particular subset of 
> disks, rather than the current approach of having to pause all vcpus 
> before doing a snapshot of multiple disks.  Once that is added, libvirt 
> should use the new monitor command by default, but for compatibility 
> testing, it may be worth adding VIR_DOMAIN_SNAPSHOT_CREATE_VCPU_PAUSE to 
> require a full vcpu pause instead of the faster iopause mechanism.

How do you decide whether to use internal or external snapshots? Should
this be another flag? In fact we have multiple dimensions:

* Disk snapshot or checkpoint? (you have a flag for this)
* Disk snapshot stored internally or externally (missing)
* VM state stored internally or externally (missing)

qemu currently only supports (disk, ext), (disk, int), (checkpoint, int,
int). But other combinations could be made possible in the future, and I
think especially the combination (checkpoint, int, ext) could be
interesting.

[ Okay, some of it is handled later in this document, but I think it's
still useful to leave this summary in my mail. External VM state is
something that you don't seem to have covered yet - can't we do this
already with live migration to a file? ]
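
To make the flag proposal concrete, I'd expect a client call to look
roughly like this, using the flag names proposed in this RFC (so nothing
of it exists in a released libvirt yet; the snapshot name is just an
example, and error handling is omitted):

#include <libvirt/libvirt.h>

/* dom is an open virDomainPtr; "before-upgrade" is an arbitrary name */
static virDomainSnapshotPtr disk_snapshot_and_halt(virDomainPtr dom)
{
    const char *xml =
        "<domainsnapshot>\n"
        "  <name>before-upgrade</name>\n"
        "</domainsnapshot>";

    /* flag names as proposed in this RFC, not yet in any release:
     * DISK_ONLY selects a disk snapshot instead of a checkpoint,
     * HALT stops the qemu process once the snapshot has been taken */
    return virDomainSnapshotCreateXML(dom, xml,
                                      VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
                                      VIR_DOMAIN_SNAPSHOT_CREATE_HALT);
}
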

> My first xml change is that <domainsnapshot> will now always track the 
> full <domain> xml (prior to any file modifications), normally as an 
> output-only part of the snapshot (that is, a <domain> subelement of 
> <domainsnapshot> will always be present in virDomainGetXMLDesc, but is 
> generally ignored in virDomainSnapshotCreateXML - more on this below). 
> This gives us the capability to use XML ABI compatibility checks 
> (similar to those used in virDomainMigrate2, virDomainRestoreFlags, and 
> virDomainSaveImageDefineXML).  And, given that the full <domain> xml is 
> now present in the snapshot metadata, this means that we need to add 
> virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE), so that any 
> security-sensitive data doesn't leak out to read-only connections. 
> Right now, domain ABI compatibility is only checked for 
> VIR_DOMAIN_XML_INACTIVE contents of xml; I'm thinking that the snapshot 
> <domain> will always be the inactive version (sufficient for starting a 
> new qemu), although I may end up changing my mind and storing the active 
> version (when attempting to revert from live qemu to another live 
> checkpoint, all while using a single qemu process, the ABI compatibility 
> checking may need enhancements to discover differences that are not 
> visible in the inactive xml but are fatal in the active xml when using 
> 'loadvm', yet do not matter to virsh save/restore, where a new qemu 
> process is created every time).
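
So if I read this right, fetching the stored <domain> back out would look
something like the sketch below, where accepting VIR_DOMAIN_XML_SECURE on
the snapshot API is the new part proposed above:

#include <stdlib.h>
#include <libvirt/libvirt.h>

/* snap is an existing virDomainSnapshotPtr */
static void dump_full_snapshot_xml(virDomainSnapshotPtr snap)
{
    /* VIR_DOMAIN_XML_SECURE keeps security-sensitive data in the output,
     * so (as proposed above) it must be rejected on read-only connections */
    char *xml = virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE);
    if (xml) {
        /* ... hand the embedded <domain> to the ABI compatibility check ... */
        free(xml);
    }
}
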
> 
> Next, we need a way to control which subset of disks is involved in a 
> snapshot command.  Previous mail has documented that for ESX, the 
> decision can only be made at boot time - a disk can be persistent 
> (involved in snapshots, and saves changes across domain boots); 
> independent-persistent (is not involved in snapshots, but saves changes 
> across domain boots); or independent-nonpersistent (is not involved in 
> snapshots, and all changes during a domain run are discarded when the 
> domain quits).  In <domain> xml, I will represent this by two new 
> optional attributes:
> 
> <disk snapshot='no|external|internal' persistent='yes|no'>...</disk>
> 
> For now, qemu will reject snapshot=internal (the snapshot_blkdev monitor 
> command does not yet support it, although it was documented as a 
> possible extension); I'm not sure whether ESX supports external, 
> internal, or both.  Likewise, both ESX and qemu will reject 
> persistent=no unless snapshot=no is also specified or implied (it makes 
> no sense to create a snapshot if you know the disk will be thrown away 
> on next boot), but keeping the options orthogonal may prove useful for 
> some future extension.  If either option is omitted, the default for 
> snapshot is 'no' if the disk is <shared> or <readonly> or persistent=no, 
> and 'external' otherwise; and the default for persistent is 'yes' for 
> all disks (domain_conf.h will have to represent nonpersistent=0 for 
> easier coding with sane 0-initialized defaults, but no need to expose 
> that ugly name in the xml).  I'm not sure whether to reject an explicit 
> persistent=no coupled with <readonly>, or just ignore it (if the disk is 
> readonly, it can't change, so there is nothing to throw away after the 
> domain quits).  Creation of an external snapshot requires rewriting the 
> active domain XML to reflect the new filename.
> 
> While ESX can only select the subset of disks to snapshot at boot time, 
> qemu can alter the selection at runtime.  Therefore, I propose also 
> modifying the <domainsnapshot> xml to take a new subelement <disks> to 
> fine-tune which disks are involved in a snapshot.  For now, a checkpoint 
> must omit <disks> on virDomainSnapshotCreateXML input (that is, <disks> 
> may only be present if the VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY flag is 
> used; checkpoints always cover full system state, and on qemu such a 
> checkpoint uses internal snapshots).  Meanwhile, for disk snapshots, if 
> the <disks> element is omitted, then one is automatically created using 
> the attributes in the <domain> xml.  For ESX, if the <disks> element is 
> present, it must select the same disks as the <domain> xml.  Offline 
> checkpoints will continue to use <state>shutoff</state> in the xml 
> output, while new disk snapshots will use <state>disk-snapshot</state> 
> to indicate that the disk state was obtained from a running VM and might 
> be only crash-consistent rather than stable.
> 
> The <disks> element has an optional number of <disk> subelements; at 
> most one per <disk> in the <devices> section of <domain>.  Each <disk> 
> element has a mandatory attribute name='name', which must match the 
> <target dev='name'/> of the <domain> xml, as a way of getting 1:1 
> correspondence between domainsnapshot/disks/disk and domain/devices/disk 
> while using names that should already be unique.  Each <disk> also has 
> an optional snapshot='no|internal|external' attribute, similar to the 
> proposal for <domain>/<devices>/<disk>; if not provided, the attribute 
> defaults to the one from the <domain>.  If snapshot=external, then there 
> may be an optional subelement <source file='path'/>, which gives the 
> desired new file name.  If external is requested, but the <source> 
> subelement is not present, then libvirt will generate a suitable 
> filename, probably by concatenating the existing name with the snapshot 
> name, and remembering that the snapshot name is generated as a timestamp 
> if not specified.  Also, for external snapshots, the <disk> element may 
> have an optional sub-element specifying the driver (useful for selecting 
> qcow2 vs. qed in the qemu 'snapshot_blkdev' monitor command); again, 
> this can normally be generated by default.
> 
> Future extensions may include teaching qemu to allow coupling 
> checkpoints with external snapshots by allowing a <disks> element even 
> for checkpoints.  (That is, the initial implementation will always 
> output <disks> for <state>disk-snapshot</state> and never output <disks> 
> for <state>shutoff</state>, but this may not always hold in the future.) 
>   Likewise, we may discover when implementing lvm or btrfs snapshots 
> that additional subelements to each <disk> would be useful for 
> specifying additional aspects for creating snapshots using that 
> technology, where the omission of those subelements has a sane default 
> state.
> 
> libvirt can be taught to honor persistent=no for qemu by creating a 
> qcow2 wrapper file prior to starting qemu, then tearing down that 
> wrapper after the fact, although I'll probably leave that for later in 
> my patch series.

qemu can already do this with -drive snapshot=on. It must be allowed to
create a temporary file for this to work, of course. Is that a problem?
If not, I would just forward the option to qemu.

> As an example, a valid input <domainsnapshot> for creation of a qemu 
> disk snapshot would be:
> 
> <domainsnapshot>
>    <name>snapshot</name>
>    <disks>
>      <disk name='vda'/>
>      <disk name='vdb' snapshot='no'/>
>      <disk name='vdc' snapshot='external'>
>        <source file='/path/to/new'/>
>      </disk>
>    </disks>
> </domainsnapshot>
> 
> which requests that the <disk> matching the target dev=vda defer to the 
> <domain> default for whether to snapshot (and if the domain default 
> requires creating an external snapshot, then libvirt will create the new 
> file name; this could also be specified by omitting the <disk 
> name='vda'/> subelement altogether); the <disk> matching vdb is not 
> snapshotted, and the <disk> matching vdc is involved in an external 
> snapshot where the user specifies the new filename of /path/to/new.  On 
> dumpxml output, the output will be fully populated with the items 
> generated by libvirt, and be displayed as:
> 
> <domainsnapshot>
>    <name>snapshot</name>
>    <state>disk-snapshot</state>
>    <parent>
>      <name>prior</name>
>    </parent>
>    <creationTime>1312945292</creationTime>
>    <domain>
>      <!-- previously just uuid, but now the full domain XML, 
> including... -->
>      ...
>      <devices>
>        <disk type='file' device='disk' snapshot='external'>
>          <driver name='qemu' type='raw'/>
>          <source file='/path/to/original'/>
>          <target dev='vda' bus='virtio'/>
>        </disk>
>      ...
>      </devices>
>    </domain>
>    <disks>
>      <disk name='vda' snapshot='external'>
>        <driver name='qemu' type='qcow2'/>
>        <source file='/path/to/original.snapshot'/>
>      </disk>
>      <disk name='vdb' snapshot='no'/>
>      <disk name='vdc' snapshot='external'>
>        <driver name='qemu' type='qcow2'/>
>        <source file='/path/to/new'/>
>      </disk>
>    </disks>
> </domainsnapshot>
> 
> And, if the user were to do 'virsh dumpxml' of the domain, they would 
> now see the updated <disk> contents:
> 
> <domain>
>    ...
>    <devices>
>      <disk type='file' device='disk' snapshot='external'>
>        <driver name='qemu' type='qcow2'/>
>        <source file='/path/to/original.snapshot'/>
>        <target dev='vda' bus='virtio'/>
>      </disk>
>      ...
>    </devices>
> </domain>
> 
> Reverting
> +++++++++
> 
> When it comes to reverting to a snapshot, the only time it is possible 
> to revert to a live image is if the snapshot is a "checkpoint" of a 
> running or paused domain, because qemu must be able to restore the ram 
> state.  Reverting to any other snapshot (both the existing "checkpoint" 
> of an offline image, which uses internal disk snapshots, and my new 
> "disk snapshot" which uses external disk snapshots even though it was 
> created against a running image), will revert the disks back to the 
> named state, but default to leaving the guest in an offline state.  Two 
> new mutually exclusive flags will make it possible both to revert to the 
> snapshot disk state and to control the resulting qemu state; 
> virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START) to run 
> from the snapshot, and virDomainRevertToSnapshot(snap, 
> VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE) to create a new qemu process but leave 
> it paused.  If neither of these two flags is specified, then the default 
> will be determined by the snapshot itself.  These flags also allow 
> overriding the running/paused aspect recorded in live checkpoints.  Note 
> that I am not proposing a flag for reverting to just the disk state of a 
> live checkpoint; this is considered an uncommon operation, and can be 
> accomplished in two steps by reverting to paused state to restore disk 
> state followed by destroying the domain (but I can add a third 
> mutually-exclusive flag VIR_DOMAIN_SNAPSHOT_REVERT_STOP if we decide 
> that we really want this uncommon operation via a single API).
> Reverting from a stopped state is always allowed, even if the XML is 
> incompatible, by basically rewriting the domain's xml definition. 
> Meanwhile, reverting from an online VM to a live checkpoint has two 
> flavors - if the XML is compatible, then the 'loadvm' monitor command 
> can be used, and the qemu process remains alive.  But if the XML has 
> changed incompatibly since the checkpoint was created, then libvirt will 
> refuse to do the revert unless it has permission to start a new qemu 
> process, via another new flag: virDomainRevertToSnapshot(snap, 
> VIR_DOMAIN_SNAPSHOT_REVERT_FORCE).  The new REVERT_FORCE flag also 
> provides a safety valve - reverting to a stopped state (whether an 
> existing offline checkpoint, or a new disk snapshot) from a running VM 
> will be rejected unless REVERT_FORCE is specified.  For now, this 
> includes the case of using the REVERT_START flag to revert to a disk 
> snapshot and then start qemu - this is because qemu does not yet expose 
> a way to safely revert to a disk snapshot from within the same qemu 
> process.  If, in the future, qemu gains support for undoing the effects 
> of 'snapshot_blkdev' via monitor commands, then it may be possible to 
> use REVERT_START without REVERT_FORCE and end up reusing the same qemu 
> process while still reverting to the disk snapshot state, by using some 
> of the same tricks as virDomainReboot to force the existing qemu process 
> to boot from the new disk state.
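
The two-step "disk state only" revert mentioned above would then look
roughly like this (again, the flag names are the ones proposed in this
RFC and don't exist yet):

#include <libvirt/libvirt.h>

/* Revert only the disk state of a snapshot: revert paused, then destroy. */
static void revert_disks_only(virDomainPtr dom, virDomainSnapshotPtr snap)
{
    /* REVERT_FORCE because a new qemu process has to be created */
    if (virDomainRevertToSnapshot(snap,
                                  VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE |
                                  VIR_DOMAIN_SNAPSHOT_REVERT_FORCE) == 0)
        virDomainDestroy(dom);
}
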
> 
> Of course, the new safety valve is a slight change in behavior - scripts 
> that used to use 'virsh snapshot-revert' may now have to use 'virsh 
> snapshot-revert --force' to do the same actions; for backwards 
> compatibility, the virsh implementation should first try without the 
> flag, and a new VIR_ERR_* code should be introduced in order to let virsh 
> distinguish between a new implementation that rejected the revert 
> because _REVERT_FORCE was missing, and an old one that does not support 
> _REVERT_FORCE in the first place.  But this is not the first time that 
> added safety valves have caused existing scripts to have to adapt - 
> consider the case of 'virsh undefine' which could previously pass in a 
> scenario where it now requires 'virsh undefine --managed-save'.
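
If I understand the fallback correctly, virsh would do something like the
sketch below; VIR_ERR_SNAPSHOT_REVERT_RISKY is only a stand-in name for
the new error code, since none has been picked yet:

#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

static int revert_with_fallback(virDomainSnapshotPtr snap)
{
    if (virDomainRevertToSnapshot(snap, 0) == 0)
        return 0;

    virErrorPtr err = virGetLastError();
    if (err && err->code == VIR_ERR_SNAPSHOT_REVERT_RISKY)
        /* a new libvirtd refused only because _REVERT_FORCE was missing */
        return virDomainRevertToSnapshot(snap,
                                         VIR_DOMAIN_SNAPSHOT_REVERT_FORCE);

    return -1;   /* an old libvirtd, or a genuine failure */
}
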
> 
> For transient domains, it is not possible to make an offline checkpoint 
> (since transient domains don't exist if they are not running or paused); 
> transient domains must use REVERT_START or REVERT_PAUSE to revert to a 
> disk snapshot.  And given the above limitations about qemu, reverting to 
> a disk snapshot will currently require REVERT_FORCE, since a new qemu 
> process will necessarily be created.
> 
> Just as creating an external disk snapshot rewrote the domain xml to 
> match, reverting to an older snapshot will update the domain xml (it 
> should be a bit more obvious now why the 
> <domainsnapshot>/<domain>/<devices>/<disk> lists the old name, while 
> <domainsnapshot>/<disks>/<disk> lists the new name).
> 
> The other thing to be aware of is that with internal snapshots, qcow2 
> maintains a distinction between current state and a snapshot - that is, 
> qcow2 is _always_ tracking a delta, and never modifies a named snapshot, 
> even when you use 'qemu-img snapshot -a' to revert to different snapshot 
> names.  But with named files, the original file now becomes a read-only 
> backing file to a new active file; if we revert to the original file, 
> and make any modifications to it, the active file that was using it as 
> backing will be corrupted.  Therefore, the safest thing is to reject any 
> attempt to revert to any snapshot (whether checkpoint or disk snapshot) 
> that has an existing child snapshot consisting of an external disk 
> snapshot.  The metadata for each of these children can be deleted 
> manually, but that requires quite a few API calls (learn how many 
> snapshots exist, get the list of snapshot names, and for each snapshot, 
> get its xml to see if it has the target snapshot as a parent, and if so 
> delete it).  So as shorthand, virDomainRevertToSnapshot will 
> be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which 
> first deletes any children of the target snapshot prior to reverting 
> to that snapshot.

I think the API should make it possible to revert to a given external
snapshot without deleting all children, but by creating another qcow2
file that uses the same backing file. Basically this new qcow2 file
would be the equivalent of the "current state" concept qcow2 uses for
internal snapshots.

It should be possible to make both look the same to users if we think
this is a good idea.
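
For comparison, the manual clean-up that the proposed _DELETE_CHILDREN
shorthand would replace looks roughly like this with the existing API
(a sketch only; a real client would parse the XML instead of using
strstr()):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libvirt/libvirt.h>

/* Delete every snapshot whose <parent> names "target". */
static void delete_children(virDomainPtr dom, const char *target)
{
    int n = virDomainSnapshotNum(dom, 0);
    if (n <= 0)
        return;
    char **names = calloc(n, sizeof(*names));
    n = virDomainSnapshotListNames(dom, names, n, 0);

    char needle[256];
    snprintf(needle, sizeof(needle), "<name>%s</name>", target);

    for (int i = 0; i < n; i++) {
        virDomainSnapshotPtr s = virDomainSnapshotLookupByName(dom, names[i], 0);
        char *xml = s ? virDomainSnapshotGetXMLDesc(s, 0) : NULL;
        /* crude check: does the <parent> element name the target snapshot? */
        char *p = xml ? strstr(xml, "<parent>") : NULL;
        char *q = p ? strstr(p, "</parent>") : NULL;
        char *hit = p ? strstr(p, needle) : NULL;
        if (hit && q && hit < q)
            virDomainSnapshotDelete(s, VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN);
        free(xml);
        if (s)
            virDomainSnapshotFree(s);
        free(names[i]);
    }
    free(names);
}
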

> And as long as reversion is learning how to do some snapshot deletion, 
> it becomes possible to decide what to do with the qcow2 file that was 
> created at the time of the disk snapshot.  The default behavior for qemu 
> will be to use qemu-img to recreate the qcow2 wrapper file as a 0-delta 
> change against the original file, keeping the domain xml tied to the 
> wrapper name, but a new flag VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD can be 
> used to instead completely delete the qcow2 wrapper file, and update the 
> domain xml back to the original filename.
> 
> Deleting
> ++++++++
> 
> Deleting snapshots also needs some improvements.  With checkpoints, the 
> disk snapshot contents were internal snapshots, so no files had to be 
> deleted.  But with external disk snapshots, there are some choices to be 
> made - when deleting a snapshot, should the two files be consolidated 
> back into one or left separate, and if consolidation occurs, what should 
> be the name of the new file.
> 
> Right now, qemu supports consolidation only in one direction - the 
> backing file can be consolidated into the new file by using the new 
> blockpull API. 

This is only true for live snapshot deletion. If the VM is shut down,
qemu-img commit/rebase can be used for the two directions.

> In fact, the combination of disk snapshot and block pull 
> can be used to implement local storage migration - create a disk 
> snapshot with a local file as the new file around the remote file used 
> as the snapshot, then use block pull to break the ties to the remote 
> snapshot.  But there is currently no way to make qemu save the contents 
> of a new file back into its backing file and then swap back to the 
> backing file as the live disk; also, while you can use block pull to 
> break the relation between the snapshot and the live file, and then 
> rename the live file back over the backing file name, there is no way to 
> make qemu revert back to that file name short of doing the 
> snapshot/blockpull algorithm twice; and the end result will be qcow2 
> even if the original file was raw.  Also, if qemu ever adds support for 
> merging back into a backing file, as well as a means to determine how 
> dirty a qcow2 file is in relation to its backing file, there are some 
> possible efficiency gains - if most blocks of a snapshot differ from the 
> backing file, it is faster to use blockpull to pull in the remaining 
> blocks from the backing file to the active file; whereas if most blocks 
> of a snapshot are inherited from the backing file, it is more efficient 
> to pull just the dirty blocks from the active file back into the backing 
> file.  Knowing whether the original file was qcow2 or some other format 
> may also impact how to merge deltas from the new qcow2 file back into 
> the original file.

You also need to consider that it's possible to have multiple qcow2
files using the same backing file. If this is the case, you can't pull
the deltas into the backing file.

> Additionally, having fine-tuned control over which of the two names to 
> keep when consolidating a snapshot would require passing that 
> information through xml, but the existing virDomainSnapshotDelete does 
> not take an XML argument.  For now, I propose that deleting an external 
> disk snapshot will be required to leave both the snapshot and live disk 
> image files intact (except for the special case of REVERT_DISCARD 
> mentioned above that combines revert and delete into a single API); but 
> I could see the feasibility of a future extension which adds a new XML 
> <on_delete> subelement to <domainsnapshot>/<disks>/<disk> that 
> specifies which of the two files to consolidate into, as well as a flag 
> VIR_DOMAIN_SNAPSHOT_DELETE_CONSOLIDATE which triggers libvirt to do the 
> consolidation for any <on_delete> subelements in the snapshot being 
> deleted (if the flag is omitted, the <on_delete> subelement is ignored 
> and both files remain).
> 
> The notion of deleting all children of a snapshot while keeping the 
> snapshot itself (mentioned above under the revert use case) seems common 
> enough that I will add a flag VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY; 
> this flag implies VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN, but leaves the 
> target snapshot intact.
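
So, if I got the semantics right, the difference between the two delete
flags is simply this (where _DELETE_CHILDREN_ONLY is the new name
proposed here and _DELETE_CHILDREN already exists):

#include <libvirt/libvirt.h>

/* snap is an existing virDomainSnapshotPtr */
static void prune_subtree(virDomainSnapshotPtr snap)
{
    /* remove all descendants but keep the named snapshot itself */
    virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY);
    /* versus removing the snapshot together with its descendants:
     * virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN); */
}
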

Kevin



