
[libvirt] RFCv2: virDomainSnapshotCreateXML enhancements



[BCC'ing those who have responded to earlier RFCs]

I've posted previous RFCs for improving snapshot support:

ideas on managing a subset of disks:
https://www.redhat.com/archives/libvir-list/2011-May/msg00042.html

ideas on managing snapshots of storage volumes not tied to a domain:
https://www.redhat.com/archives/libvir-list/2011-June/msg00761.html

After re-reading the feedback received on those threads, I think I've settled on a pretty robust design for my first round of adding improvements to the management of snapshots tied to a domain, while leaving the door open for future extensions.

Sorry this email is so long (I've had it open in my editor for more than 48 hours now as I keep improving it), but hopefully it is worth the effort to read. See the bottom if you want the shorter summary on the proposed changes.

First, some definitions:
========================

disk snapshot: the state of a virtual disk used at a given time; once a snapshot exists, then it is possible to track a delta of changes that have happened since that time.

internal disk snapshot: a disk snapshot where both the saved state and delta reside in the same file (possible with qcow2 and qed). If a disk image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.

external disk snapshot: a disk snapshot where the saved state is one file, and the delta is tracked in another file. For a disk image not in use by qemu, this can be done with qemu-img to create a new qcow2 file wrapping any type of existing file. Recent qemu has also learned the 'snapshot_blkdev' monitor command for creating external snapshots while qemu is using a disk, and the goal of this RFC is to expose that functionality from within existing libvirt APIs.

saved state: all non-disk information used to resume a guest at the same state, assuming the disks did not change. With qemu, this is possible via migration to a file.

checkpoint: a combination of saved state and a disk snapshot. With qemu, the 'savevm' monitor command creates a checkpoint using internal snapshots. It may also be possible to combine saved state and disk snapshots created while the guest is offline for a form of checkpointing, although this RFC focuses on disk snapshots created while the guest is running.

snapshot: can be either a 'disk snapshot' or a 'checkpoint'; the rest of this email will attempt to use 'snapshot' where either form works, and the qualified term where only one form is meant.

Existing libvirt functionality
==============================

The virDomainSnapshotCreateXML API currently manages a hierarchy of "snapshots", although it is currently only used for "checkpoints", where every snapshot has a name and a possibly empty parent. The idea is that once a domain has a snapshot, there is always a current snapshot, and all new snapshots are created with a parent of a previously existing snapshot (although there are still some bugs to be fixed in managing the current snapshot over a libvirtd restart). It is possible to have disjoint hierarchies if you delete a root snapshot that had more than one child (making each child an independent root). The snapshot hierarchy is maintained by libvirt (in a typical installation, the files in /var/lib/libvirt/qemu/snapshot/<dom>/<name> track each named snapshot, using <domainsnapshot> XML), using additional metadata not present in the qcow2 internal snapshot format (that is, while qcow2 can maintain multiple snapshots, it does not maintain relations between them). Remember, the "current" snapshot is not the current machine state, but the snapshot that would become the parent if you create a new snapshot; perhaps we could have named it the "loaded" snapshot, but the API names are set in stone now.

Libvirt also has APIs for listing all snapshots, querying the current snapshot, reverting back to the state of another snapshot, and deleting a snapshot. Deletion comes with a choice of deleting just that named version (removing one node in the hierarchy and re-parenting all children) or that tree of the hierarchy (that named version and all children).
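
To make the two deletion modes concrete, here is a minimal sketch of the hierarchy as an in-memory parent-index table. The struct and function names are hypothetical and are not libvirt's internal data structures (libvirt keeps one <domainsnapshot> XML file per snapshot); the sketch only shows how deleting one node re-parents its children, which is also how deleting a root with two children yields two independent roots.

```c
#include <assert.h>

/* Hypothetical in-memory model of the snapshot hierarchy (not libvirt code). */
#define MAX_SNAPS 16

struct snap {
    const char *name;
    int parent;   /* index of the parent snapshot, or -1 for a root */
    int alive;    /* cleared once the snapshot is deleted */
};

struct snap_tree {
    struct snap snaps[MAX_SNAPS];
    int count;
};

/* Record a new snapshot under 'parent' (-1 for a root); returns its index. */
static int snap_add(struct snap_tree *t, const char *name, int parent)
{
    assert(t->count < MAX_SNAPS);
    t->snaps[t->count] = (struct snap){ name, parent, 1 };
    return t->count++;
}

/* Delete one node: its children are re-parented to the node's own parent. */
static void snap_delete_one(struct snap_tree *t, int idx)
{
    for (int i = 0; i < t->count; i++)
        if (t->snaps[i].alive && t->snaps[i].parent == idx)
            t->snaps[i].parent = t->snaps[idx].parent;
    t->snaps[idx].alive = 0;
}

/* Nonzero if node idx is anc itself or one of its descendants. */
static int snap_in_subtree(const struct snap_tree *t, int idx, int anc)
{
    for (; idx != -1; idx = t->snaps[idx].parent)
        if (idx == anc)
            return 1;
    return 0;
}

/* Delete a node together with every descendant. */
static void snap_delete_tree(struct snap_tree *t, int idx)
{
    int doomed[MAX_SNAPS] = { 0 };
    for (int i = 0; i < t->count; i++)
        if (t->snaps[i].alive && snap_in_subtree(t, i, idx))
            doomed[i] = 1;
    for (int i = 0; i < t->count; i++)
        if (doomed[i])
            t->snaps[i].alive = 0;
}

/* Count the roots of the (possibly disjoint) hierarchy. */
static int snap_count_roots(const struct snap_tree *t)
{
    int n = 0;
    for (int i = 0; i < t->count; i++)
        if (t->snaps[i].alive && t->snaps[i].parent == -1)
            n++;
    return n;
}
```

For example, deleting just the root of root->{a, b} leaves a and b as two independent roots, while deleting the tree at a also removes a's children.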

Since qemu checkpoints can currently only be created via internal disk snapshots, libvirt has not had to track any file name relationships - a single "snapshot" corresponds to a qcow2 snapshot name within all qcow2 disks associated with a domain; furthermore, snapshot creation was limited to domains where all modifiable disks were already in qcow2 format. However, these "checkpoints" could be created on both running domains (qemu savevm) or inactive domains (qemu-img snapshot -c), with the latter technically being a case of just internal disk snapshots.

Libvirt currently has a bug in that it only saves <domain>/<uuid> rather than the full domain xml along with a checkpoint - if any devices are hot-plugged (or in the case of offline snapshots, if the domain configuration is changed) after a snapshot but before the revert, then things will most likely blow up due to the differences in devices in use by qemu vs. the devices expected by the snapshot.

Reverting to a snapshot can also be considered as a form of data loss - you are discarding the disk changes and ram state that have happened since the last snapshot. To some degree, this is by design - the very nature of reverting to a snapshot implies throwing away changes; however, it may be nice to add a safety valve so that by default, reverting to a live checkpoint from an offline state works, but reverting from a running domain should require some confirmation that it is okay to throw away accumulated running state.

Libvirt also currently has a limitation where snapshots are local to one host - the moment you migrate a domain to another host, you lose access to all of its snapshot metadata.

Proposed enhancements
=====================

Note that these proposals merely add xml attribute and subelement extensions, as well as API flags, rather than creating any new API, which makes it a nice candidate for backporting the patch series based on this RFC into older releases as appropriate.

Creation
++++++++

I propose reusing the virDomainSnapshotCreateXML API and <domainsnapshot> xml for both "checkpoints" and "disk snapshots", all maintained within a single hierarchy. That is, the parent of a disk snapshot can be a checkpoint or another disk snapshot, and the parent of a checkpoint can be another checkpoint or a disk snapshot. And, since I defined "snapshot" to mean either "checkpoint" or "disk snapshot", this single hierarchy of "snapshots" will still be valid once it is expanded to include more than just "checkpoints". Since libvirt already has to maintain additional metadata to track parent-child relationships between snapshots, it should not be hard to augment that XML to store additional information needed to track external disk snapshots.

The default is that virDomainSnapshotCreateXML(,0) creates a checkpoint, while leaving qemu running; I propose two new flags to fine-tune things: virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_HALT) will create the checkpoint then halt the qemu process, and virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY) will create a disk snapshot rather than a checkpoint (on qemu, by using a sequence including the new 'snapshot_blkdev' monitor command). Specifying both flags at once is a form of data loss (you are losing the ram state), and I suspect it to be rarely used, but since it may be worthwhile in testing whether a disk snapshot is truly crash-consistent, I won't refuse the combination.

Other flags may be added in the future; I know of at least two features in qemu that may warrant some flags once they are stable: 1. a guest agent fsfreeze/fsthaw command will allow the guest to get the file system into a stable state prior to the snapshot, meaning that reverting to that snapshot can skip any fsck or journal replay actions. Of course, this is a best effort attempt since guest agent interaction is untrustworthy (comparable to memory ballooning - the guest may not support the agent or may intentionally send falsified responses over the agent), so the agent should only be used when explicitly requested - this would be done with a new flag VIR_DOMAIN_SNAPSHOT_CREATE_GUEST_FREEZE. 2. there has been talk of adding a qemu monitor command to freeze just I/O to a particular subset of disks, rather than the current approach of having to pause all vcpus before doing a snapshot of multiple disks. Once that is added, libvirt should use the new monitor command by default, but for compatibility testing, it may be worth adding VIR_DOMAIN_SNAPSHOT_CREATE_VCPU_PAUSE to require a full vcpu pause instead of the faster iopause mechanism.

My first xml change is that <domainsnapshot> will now always track the full <domain> xml (prior to any file modifications), normally as an output-only part of the snapshot (that is, a <domain> subelement of <domainsnapshot> will always be present in virDomainSnapshotGetXMLDesc output, but is generally ignored in virDomainSnapshotCreateXML - more on this below). This gives us the capability to use XML ABI compatibility checks (similar to those used in virDomainMigrate2, virDomainRestoreFlags, and virDomainSaveImageDefineXML). And, given that the full <domain> xml is now present in the snapshot metadata, we need to add virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE), so that any security-sensitive data doesn't leak out to read-only connections. Right now, domain ABI compatibility is only checked for VIR_DOMAIN_XML_INACTIVE contents of xml; I'm thinking that the snapshot <domain> will always be the inactive version (sufficient for starting a new qemu), although I may end up changing my mind and storing the active version (when attempting to revert from live qemu to another live checkpoint, all while using a single qemu process, the ABI compatibility checking may need enhancements to discover differences that are not visible in inactive xml but are fatal between active xml when using 'loadvm', yet which do not matter to virsh save/restore, where a new qemu process is created every time).

Next, we need a way to control which subset of disks is involved in a snapshot command. Previous mail has documented that for ESX, the decision can only be made at boot time - a disk can be persistent (involved in snapshots, and saves changes across domain boots); independent-persistent (is not involved in snapshots, but saves changes across domain boots); or independent-nonpersistent (is not involved in snapshots, and all changes during a domain run are discarded when the domain quits). In <domain> xml, I will represent this by two new optional attributes:

<disk snapshot='no|external|internal' persistent='yes|no'>...</disk>

For now, qemu will reject snapshot=internal (the snapshot_blkdev monitor command does not yet support it, although it was documented as a possible extension); I'm not sure whether ESX supports external, internal, or both. Likewise, both ESX and qemu will reject persistent=no unless snapshot=no is also specified or implied (it makes no sense to create a snapshot if you know the disk will be thrown away on next boot), but keeping the options orthogonal may prove useful for some future extension. If either option is omitted, the default for snapshot is 'no' if the disk is <shared> or <readonly> or persistent=no, and 'external' otherwise; and the default for persistent is 'yes' for all disks (domain_conf.h will have to represent nonpersistent=0 for easier coding with sane 0-initialized defaults, but no need to expose that ugly name in the xml). I'm not sure whether to reject an explicit persistent=no coupled with <readonly>, or just ignore it (if the disk is readonly, it can't change, so there is nothing to throw away after the domain quits). Creation of an external snapshot requires rewriting the active domain XML to reflect the new filename.

While ESX can only select the subset of disks to snapshot at boot time, qemu can alter the selection at runtime. Therefore, I propose also modifying the <domainsnapshot> xml to take a new subelement <disks> to fine-tune which disks are involved in a snapshot. For now, a checkpoint must omit <disks> on virDomainSnapshotCreateXML input (that is, <disks> must only be present if the VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY flag is used, and checkpoints always cover full system state, and on qemu this checkpoint uses internal snapshots). Meanwhile, for disk snapshots, if the <disks> element is omitted, then one is automatically created using the attributes in the <domain> xml. For ESX, if the <disks> element is present, it must select the same disks as the <domain> xml. Offline checkpoints will continue to use <state>shutoff</state> in the xml output, while new disk snapshots will use <state>disk-snapshot</state> to indicate that the disk state was obtained from a running VM and might be only crash-consistent rather than stable.
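
A sketch of the input rule for <disks> and of one reading of which <state> a new snapshot would record. The flag bit and all identifiers are placeholders for illustration; how a DISK_ONLY request against an already-offline domain is reported is an assumption here (it is treated like an offline snapshot).

```c
/* Placeholder bit for the proposed DISK_ONLY flag (not libvirt's value). */
#define SKETCH_CREATE_DISK_ONLY (1u << 1)

enum dom_state  { DOM_RUNNING, DOM_PAUSED, DOM_SHUTOFF };
enum snap_state { SS_RUNNING, SS_PAUSED, SS_SHUTOFF, SS_DISK_SNAPSHOT };

/* Names written into <state> of the snapshot XML. */
static const char *const snap_state_names[] = {
    "running", "paused", "shutoff", "disk-snapshot"
};

/* Input rule: <disks> only with DISK_ONLY. 0 if accepted, -1 if rejected. */
static int validate_disks_element(unsigned int flags, int has_disks)
{
    if (has_disks && !(flags & SKETCH_CREATE_DISK_ONLY))
        return -1;  /* checkpoints always cover the full system state */
    return 0;
}

/* Output rule: which <state> the new snapshot records. */
static enum snap_state snapshot_state(unsigned int flags, enum dom_state dom)
{
    if (dom == DOM_SHUTOFF)
        return SS_SHUTOFF;        /* offline snapshot, as today */
    if (flags & SKETCH_CREATE_DISK_ONLY)
        return SS_DISK_SNAPSHOT;  /* taken live: possibly crash-consistent only */
    return dom == DOM_RUNNING ? SS_RUNNING : SS_PAUSED;
}
```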

The <disks> element contains zero or more <disk> subelements, at most one per <disk> in the <devices> section of <domain>. Each <disk> element has a mandatory attribute name='name', which must match the <target dev='name'/> of the <domain> xml, as a way of getting 1:1 correspondence between domainsnapshot/disks/disk and domain/devices/disk while using names that should already be unique. Each <disk> also has an optional snapshot='no|internal|external' attribute, similar to the proposal for <domain>/<devices>/<disk>; if not provided, the attribute defaults to the one from the <domain>. If snapshot=external, then there may be an optional subelement <source file='path'/>, which gives the desired new file name. If external is requested, but the <source> subelement is not present, then libvirt will generate a suitable filename, probably by concatenating the existing name with the snapshot name (keeping in mind that the snapshot name is itself generated as a timestamp if not specified). Also, for external snapshots, the <disk> element may have an optional sub-element specifying the driver (useful for selecting qcow2 vs. qed in the qemu 'snapshot_blkdev' monitor command); again, this can normally be generated by default.
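
One possible naming rule for the generated file is "<original>.<snapshot name>", which matches the /path/to/original.snapshot name in the example output below. The function name and the "." separator are assumptions for illustration; the rule libvirt finally adopts may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: build "<orig_path>.<snap_name>" into buf.
 * A NULL snap_name is replaced by a timestamp, mirroring how an unnamed
 * snapshot gets a timestamp name. Returns 0 on success, -1 on error or
 * truncation. */
static int default_external_name(char *buf, size_t len, const char *orig_path,
                                 const char *snap_name, time_t now)
{
    char stamp[32];
    int n;

    assert(buf != NULL && orig_path != NULL);
    if (strlen(orig_path) == 0)
        return -1;
    if (snap_name == NULL) {
        snprintf(stamp, sizeof(stamp), "%lld", (long long)now);
        snap_name = stamp;
    }
    n = snprintf(buf, len, "%s.%s", orig_path, snap_name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With orig_path "/path/to/original" and snapshot name "snapshot", this yields "/path/to/original.snapshot".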

Future extensions may include teaching qemu to allow coupling checkpoints with external snapshots by allowing a <disks> element even for checkpoints. (That is, the initial implementation will always output <disks> for <state>disk-snapshot</state> and never output <disks> for <state>shutoff</state>, but this may not always hold in the future.) Likewise, we may discover when implementing lvm or btrfs snapshots that additional subelements to each <disk> would be useful for specifying additional aspects for creating snapshots using that technology, where the omission of those subelements has a sane default state.

libvirt can be taught to honor persistent=no for qemu by creating a qcow2 wrapper file prior to starting qemu, then tearing down that wrapper after the fact, although I'll probably leave that for later in my patch series.

As an example, a valid input <domainsnapshot> for creation of a qemu disk snapshot would be:

<domainsnapshot>
  <name>snapshot</name>
  <disks>
    <disk name='vda'/>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>

which requests that the <disk> matching the target dev=vda defer to the <domain> default for whether to snapshot (and if the domain default requires creating an external snapshot, then libvirt will create the new file name; this could also be specified by omitting the <disk name='vda'/> subelement altogether); the <disk> matching vdb is not snapshotted, and the <disk> matching vdc is involved in an external snapshot where the user specifies the new filename of /path/to/new. On dumpxml output, the output will be fully populated with the items generated by libvirt, and be displayed as:

<domainsnapshot>
  <name>snapshot</name>
  <state>disk-snapshot</state>
  <parent>
    <name>prior</name>
  </parent>
  <creationTime>1312945292</creationTime>
  <domain>
<!-- previously just uuid, but now the full domain XML, including... -->
    ...
    <devices>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/original'/>
        <target dev='vda' bus='virtio'/>
      </disk>
    ...
    </devices>
  </domain>
  <disks>
    <disk name='vda' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
    </disk>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>

And, if the user were to do 'virsh dumpxml' of the domain, they would now see the updated <disk> contents:

<domain>
  ...
  <devices>
    <disk type='file' device='disk' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    ...
  </devices>
</domain>

Reverting
+++++++++

When it comes to reverting to a snapshot, the only time it is possible to revert to a live image is if the snapshot is a "checkpoint" of a running or paused domain, because qemu must be able to restore the ram state. Reverting to any other snapshot (both the existing "checkpoint" of an offline image, which uses internal disk snapshots, and my new "disk snapshot", which uses external disk snapshots even though it was created against a running image) will revert the disks back to the named state, but default to leaving the guest in an offline state. Two new mutually exclusive flags will allow reverting to the snapshot's disk state while also choosing the resulting qemu state: virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START) to run from the snapshot, and virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE) to create a new qemu process but leave it paused. If neither of these two flags is specified, then the default will be determined by the snapshot itself. These flags also allow overriding the running/paused aspect recorded in live checkpoints. Note that I am not proposing a flag for reverting to just the disk state of a live checkpoint; this is considered an uncommon operation, and can be accomplished in two steps by reverting to paused state to restore disk state followed by destroying the domain (but I can add a third mutually exclusive flag VIR_DOMAIN_SNAPSHOT_REVERT_STOP if we decide that we really want this uncommon operation via a single API).
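
The resulting domain state after a revert can be sketched as a small decision function. The flag bit values and all type names here are placeholders for illustration, not libvirt definitions.

```c
#include <assert.h>

/* Placeholder values for the proposed revert flags (not from libvirt.h). */
enum {
    VIR_DOMAIN_SNAPSHOT_REVERT_START = 1 << 0,
    VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE = 1 << 1,
};

/* State recorded in the snapshot, and the state after reverting. */
enum snap_state { SS_RUNNING, SS_PAUSED, SS_SHUTOFF, SS_DISK_SNAPSHOT };
enum result_state { RES_ERROR = -1, RES_RUNNING, RES_PAUSED, RES_SHUTOFF };

static enum result_state revert_target(enum snap_state snap, unsigned int flags)
{
    const unsigned int both = VIR_DOMAIN_SNAPSHOT_REVERT_START |
                              VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE;

    assert(snap >= SS_RUNNING && snap <= SS_DISK_SNAPSHOT);
    if ((flags & both) == both)
        return RES_ERROR;                /* the flags are mutually exclusive */
    if (flags & VIR_DOMAIN_SNAPSHOT_REVERT_START)
        return RES_RUNNING;              /* overrides the recorded state */
    if (flags & VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE)
        return RES_PAUSED;
    /* No flag: the snapshot decides; only live checkpoints restore RAM. */
    switch (snap) {
    case SS_RUNNING: return RES_RUNNING;
    case SS_PAUSED:  return RES_PAUSED;
    default:         return RES_SHUTOFF; /* offline checkpoint or disk snapshot */
    }
}
```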

Reverting from a stopped state is always allowed, even if the XML is incompatible, by basically rewriting the domain's xml definition. Meanwhile, reverting from an online VM to a live checkpoint has two flavors - if the XML is compatible, then the 'loadvm' monitor command can be used, and the qemu process remains alive. But if the XML has changed incompatibly since the checkpoint was created, then libvirt will refuse to do the revert unless it has permission to start a new qemu process, via another new flag: virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_FORCE). The new REVERT_FORCE flag also provides a safety valve - reverting to a stopped state (whether an existing offline checkpoint, or a new disk snapshot) from a running VM will be rejected unless REVERT_FORCE is specified. For now, this includes the case of using the REVERT_START flag to revert to a disk snapshot and then start qemu - this is because qemu does not yet expose a way to safely revert to a disk snapshot from within the same qemu process. If, in the future, qemu gains support for undoing the effects of 'snapshot_blkdev' via monitor commands, then it may be possible to use REVERT_START without REVERT_FORCE and end up reusing the same qemu process while still reverting to the disk snapshot state, by using some of the same tricks as virDomainReboot to force the existing qemu process to boot from the new disk state.

Of course, the new safety valve is a slight change in behavior - scripts that used to use 'virsh snapshot-revert' may now have to use 'virsh snapshot-revert --force' to do the same actions; for backwards compatibility, the virsh implementation should first try without the flag, and a new VIR_ERR_* code should be introduced to let virsh distinguish between a new implementation that rejected the revert because _REVERT_FORCE was missing, and an old one that does not support _REVERT_FORCE in the first place. But this is not the first time that added safety valves have caused existing scripts to have to adapt - consider the case of 'virsh undefine', which previously succeeded in scenarios where it now requires 'virsh undefine --managed-save'.

For transient domains, it is not possible to make an offline checkpoint (since transient domains don't exist if they are not running or paused); transient domains must use REVERT_START or REVERT_PAUSE to revert to a disk snapshot. And given the above limitations about qemu, reverting to a disk snapshot will currently require REVERT_FORCE, since a new qemu process will necessarily be created.

Just as creating an external disk snapshot rewrote the domain xml to match, reverting to an older snapshot will update the domain xml (it should be a bit more obvious now why the <domainsnapshot>/<domain>/<devices>/<disk> lists the old name, while <domainsnapshot>/<disks>/<disk> lists the new name).

The other thing to be aware of is that with internal snapshots, qcow2 maintains a distinction between current state and a snapshot - that is, qcow2 is _always_ tracking a delta, and never modifies a named snapshot, even when you use 'qemu-img snapshot -a' to revert to different snapshot names. But with named files, the original file now becomes a read-only backing file to a new active file; if we revert to the original file, and make any modifications to it, the active file that was using it as backing will be corrupted. Therefore, the safest thing is to reject any attempt to revert to any snapshot (whether checkpoint or disk snapshot) that has an existing child snapshot consisting of an external disk snapshot. The metadata for each of these children can be deleted manually, but that requires quite a few API calls (learn how many children exist, get the list of children, and for each child, get its xml to see if that child has the target snapshot as a parent, and if so delete the snapshot). So as shorthand, virDomainRevertToSnapshot will be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which first deletes any children of the target snapshot before reverting to that snapshot.

And as long as reversion is learning how to do some snapshot deletion, it becomes possible to decide what to do with the qcow2 file that was created at the time of the disk snapshot. The default behavior for qemu will be to use qemu-img to recreate the qcow2 wrapper file as a 0-delta change against the original file, keeping the domain xml tied to the wrapper name, but a new flag VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD can be used to instead completely delete the qcow2 wrapper file and update the domain xml back to the original filename.
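
For the default (non-DISCARD) case, recreating the wrapper amounts to running 'qemu-img create' with the original file as backing. The sketch below only builds the argument vector; build_wrapper_argv is a hypothetical name, and actually executing the command (fork/exec, error handling) is left out. It passes -F to name the backing format because recent qemu-img versions expect it; whether the patch series would do so is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Fill argv with a NULL-terminated command line that recreates a zero-delta
 * qcow2 overlay 'wrapper' on top of 'backing'. Returns the argument count. */
static size_t build_wrapper_argv(const char *argv[], size_t max,
                                 const char *backing, const char *backing_fmt,
                                 const char *wrapper)
{
    const char *args[] = {
        "qemu-img", "create", "-f", "qcow2",
        "-b", backing, "-F", backing_fmt,
        wrapper, NULL
    };
    size_t n = sizeof(args) / sizeof(args[0]);

    assert(max >= n);
    for (size_t i = 0; i < n; i++)
        argv[i] = args[i];
    return n - 1;  /* excludes the NULL terminator */
}
```

With REVERT_DISCARD the wrapper would instead be removed and the <disk> <source> pointed back at the original file, so no qemu-img call is needed in that branch.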

Deleting
++++++++

Deleting snapshots also needs some improvements. With checkpoints, the disk snapshot contents were internal snapshots, so no files had to be deleted. But with external disk snapshots, there are some choices to be made - when deleting a snapshot, should the two files be consolidated back into one or left separate, and if consolidation occurs, what should be the name of the new file.

Right now, qemu supports consolidation only in one direction - the backing file can be consolidated into the new file by using the new blockpull API. In fact, the combination of disk snapshot and block pull can be used to implement local storage migration - create a disk snapshot with a local file as the new file around the remote file used as the snapshot, then use block pull to break the ties to the remote snapshot. But there is currently no way to make qemu save the contents of a new file back into its backing file and then swap back to the backing file as the live disk; also, while you can use block pull to break the relation between the snapshot and the live file, and then rename the live file back over the backing file name, there is no way to make qemu revert back to that file name short of doing the snapshot/blockpull algorithm twice; and the end result will be qcow2 even if the original file was raw. Also, if qemu ever adds support for merging back into a backing file, as well as a means to determine how dirty a qcow2 file is in relation to its backing file, there are some possible efficiency gains - if most blocks of a snapshot differ from the backing file, it is faster to use blockpull to pull in the remaining blocks from the backing file to the active file; whereas if most blocks of a snapshot are inherited from the backing file, it is more efficient to pull just the dirty blocks from the active file back into the backing file. Knowing whether the original file was qcow2 or some other format may also impact how to merge deltas from the new qcow2 file back into the original file.
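
The efficiency argument can be written as a one-line heuristic. This sketch assumes two qemu features that do not exist yet (a way to report how many clusters of the overlay are dirty relative to its backing file, and a way to merge back into the backing file); choose_merge and the enum are hypothetical.

```c
#include <assert.h>

enum merge_dir { PULL_INTO_ACTIVE, COMMIT_INTO_BACKING };

/* Copy whichever set of clusters is smaller: if most are dirty, pull the
 * few remaining clean ones into the active file; otherwise push the dirty
 * ones back into the backing file. */
static enum merge_dir choose_merge(unsigned long dirty, unsigned long total)
{
    assert(total > 0 && dirty <= total);
    return (dirty * 2 > total) ? PULL_INTO_ACTIVE : COMMIT_INTO_BACKING;
}
```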

Additionally, having fine-tuned control over which of the two names to keep when consolidating a snapshot would require passing that information through xml, but the existing virDomainSnapshotDelete does not take an XML argument. For now, I propose that deleting an external disk snapshot will be required to leave both the snapshot and live disk image files intact (except for the special case of REVERT_DISCARD mentioned above that combines revert and delete into a single API); but I could see the feasibility of a future extension which adds a new XML <on_delete> subelement to <domainsnapshot>/<disks>/<disk> that specifies which of the two files to consolidate into, as well as a flag VIR_DOMAIN_SNAPSHOT_DELETE_CONSOLIDATE which triggers libvirt to do the consolidation for any <on_delete> subelements in the snapshot being deleted (if the flag is omitted, the <on_delete> subelement is ignored and both files remain).

The notion of deleting all children of a snapshot while keeping the snapshot itself (mentioned above under the revert use case) seems common enough that I will add a flag VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY; this flag implies VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN, but leaves the target snapshot intact.
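
The implication between the two deletion flags can be sketched as a normalization step. The bit values are placeholders for this sketch and are not libvirt's; plan_delete is a hypothetical name.

```c
/* Placeholder bit values (not taken from libvirt.h). */
enum {
    VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN      = 1 << 0,
    VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY = 1 << 1,
};

struct delete_plan {
    int delete_descendants;  /* remove every child, grandchild, ... */
    int delete_target;       /* remove the named snapshot itself */
};

static struct delete_plan plan_delete(unsigned int flags)
{
    struct delete_plan p;

    /* CHILDREN_ONLY implies CHILDREN, but spares the target itself. */
    if (flags & VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY)
        flags |= VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN;
    p.delete_descendants = !!(flags & VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN);
    p.delete_target = !(flags & VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY);
    return p;
}
```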

Undefining
++++++++++

In one regard, undefining a domain that has snapshots is just as bad as undefining a domain with managed save state - since libvirt is maintaining metadata about snapshot hierarchies, leaving this metadata behind _will_ interfere with creation of a new domain by the same name. However, since both checkpoints and snapshots are stored in user-accessible disk images, and only the metadata is stored by libvirt, it should eventually be possible for the user to decide whether to discard the metadata but keep the snapshot contents intact in the disk images, or to discard both the metadata and the disk image snapshots.

Meanwhile, I propose changing the default behavior of virDomainUndefine[Flags] to reject attempts to undefine a domain with any defined snapshots, and to add a new flag for virDomainUndefineFlags, virDomainUndefineFlags(,VIR_DOMAIN_UNDEFINE_SNAPSHOTS), to act as shorthand for calling virDomainSnapshotDelete for all snapshots tied to the domain. Note that this deletes the metadata, but not the underlying storage volumes.
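
The proposed undefine policy in sketch form; SKETCH_UNDEFINE_SNAPSHOTS stands in for VIR_DOMAIN_UNDEFINE_SNAPSHOTS with a placeholder value, and plan_undefine is a hypothetical name.

```c
#define SKETCH_UNDEFINE_SNAPSHOTS (1u << 1)  /* placeholder flag value */

/* Returns how many snapshot metadata entries to drop, or -1 to refuse. */
static int plan_undefine(int n_snapshots, unsigned int flags)
{
    if (n_snapshots == 0)
        return 0;
    if (!(flags & SKETCH_UNDEFINE_SNAPSHOTS))
        return -1;       /* refuse: metadata would otherwise be left behind */
    return n_snapshots;  /* metadata only; the disk images are untouched */
}
```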

Migration
+++++++++

The simplest solution to the fact that snapshot metadata is host-local is to make migration attempts fail if a domain has any associated snapshots. For a first cut patch, that is probably what I'll go with - it reduces libvirt functionality, but instantly plugs all the bugs that you can currently trigger by migrating a domain with snapshots.

But we can do better. Right now, there is no way to inject the metadata associated with an already-existing snapshot, whether that snapshot is internal or external, and deleting internal snapshots always deletes the data as well as the metadata. But I already documented that external snapshots will keep both the new file and its read-only original, in most cases, which means the data is preserved even when the snapshot is deleted. With a couple of new flags, we can have virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY) which removes libvirt's metadata, but still leaves all the data of the snapshot present (visible to qemu-img snapshot -l or via multiple file names); as well as virDomainSnapshotCreateXML(dom, xml, VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE), which says to add libvirt snapshot metadata corresponding to existing snapshots without doing anything to the current guest (no 'savevm' or 'snapshot_blkdev', although it may still make sense to do some sanity checks to see that the metadata being defined actually corresponds to an existing snapshot in 'qemu-img snapshot -l' or that an external snapshot file exists and has the correct backing file to the original name).
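
The manual metadata hand-off can be modeled with two in-memory tables so the sketch stays self-contained. On a real system, copying one entry would correspond to virDomainSnapshotGetXMLDesc on the source plus virDomainSnapshotCreateXML with the proposed REDEFINE flag on the destination, and clearing the source would correspond to virDomainSnapshotDelete with the proposed METADATA_ONLY flag. The struct and function names are hypothetical, and the list is assumed to be ordered parents-before-children so every redefine can find its parent.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_META 32

struct host_meta {
    const char *xml[MAX_META];  /* one <domainsnapshot> document per snapshot */
    size_t count;
};

/* First-cut rule: refuse migration while any snapshot metadata remains. */
static int migrate_allowed(const struct host_meta *src)
{
    assert(src != NULL);
    return src->count == 0;
}

/* Redefine everything on dst first, then drop metadata on src. */
static int transfer_snapshot_metadata(struct host_meta *src,
                                      struct host_meta *dst)
{
    assert(src != NULL && dst != NULL);
    if (dst->count + src->count > MAX_META)
        return -1;                              /* nothing changed */
    for (size_t i = 0; i < src->count; i++)
        dst->xml[dst->count++] = src->xml[i];   /* REDEFINE on destination */
    src->count = 0;                             /* METADATA_ONLY on source */
    return 0;
}
```

Deleting the source metadata only after every redefine has succeeded means a failure part-way never leaves a snapshot without metadata on either host.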

Additionally, with these two tools in place, you can now make ABI-compatible tweaks to the <domain> xml stored in a snapshot metadata (similar to how 'virsh save-image-edit' can tweak a save image, such as changing the host name of a <disk>'s image to match what was done externally with qemu-img or other external tool). You can also make an extended protocol that first dumps all snapshot xml on the source, redefines those snapshots on the destination, then deletes the metadata on the source, all before migrating the domain itself (unfortunately, I don't think it can be wired into the cookies of migration protocol v3, as each <domainsnapshot> xml for each snapshot will be larger than the <domain> itself, and an arbitrary number of snapshots with lots of xml don't fit into a finite-sized cookie over rpc; ultimately, this may mean a migration protocol v4 that has an arbitrary number of handshakes between Begin on the source and Prepare on the dest in order to properly handle all the interchange - having a feature negotiation between client and host should be part of that interchange).

Future proposals
================

I still want to add APIs to manage storage volume snapshots for storage volumes not associated with a current domain, as well as enhancing disk snapshots to operate on more than just qcow2 file formats (for example, lvm snapshots or btrfs copy-on-write clones). But I've already signed up for quite a bit of code changes in just this email, so that will have to come later. I hope that what I have designed here does not preclude extensibility to future additions - for example, <storagevolsnapshot> would be able to use a single <disk> subelement similar to the above <domainsnapshot>/<disks>/<disk> subelement for describing the relation between a disk and its backing file snapshot.

Quick Summary
=============

These are the changes I plan on making soon; I mentioned other possible future changes above that would depend on these being complete first, or which involve creation of new API.

The following API patterns currently "succeed", but risk data loss or other bugs that can get libvirt into an inconsistent state; they will now fail by default:

virDomainRevertToSnapshot to go from a running VM to a stopped checkpoint will now fail by default. Justification: stopping a running domain is a form of data loss. Mitigation: use VIR_DOMAIN_SNAPSHOT_REVERT_FORCE for old behavior.

virDomainRevertToSnapshot to go from a running VM to a live checkpoint with an ABI-incompatible <domain> will now fail by default. Justification: qemu does not handle ABI incompatibilities, and even when 'loadvm' succeeded, the result was generally full-scale guest corruption. Mitigation: use VIR_DOMAIN_SNAPSHOT_REVERT_FORCE to start a new qemu process that properly conforms to the snapshot's ABI.

virDomainUndefine will now fail to undefine a domain with any snapshots. Justification: leaving behind libvirt metadata can corrupt future defines, comparable to recent managed save changes, plus it is a form of data loss. Mitigation: use virDomainUndefineFlags.

virDomainUndefineFlags will now default to failing an undefine of a domain with any snapshots. Justification: leaving behind libvirt metadata can corrupt future defines, comparable to recent managed save changes, plus it is a form of data loss. Mitigation: separately delete all snapshots (or at least all snapshot metadata) first, or use VIR_DOMAIN_UNDEFINE_SNAPSHOTS.

virDomainMigrate/virDomainMigrate2 will now default to fail if the source has any snapshots. Justification: metadata must be transferred along with the domain for the migration to be complete. Mitigation: until an improved migration protocol can automatically do the handshaking necessary to migrate all the snapshot metadata, a user can manually loop over each snapshot prior to migration, using virDomainSnapshotCreateXML with VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE on the destination, then virDomainSnapshotDelete with VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY on the source.

Add the following XML:
in <domain>/<devices>/<disk>:
  add optional attribute snapshot='no|internal|external'
  add optional attribute persistent='yes|no'
in <domainsnapshot>:
  expand <domainsnapshot>/<domain> to be full domain, not just uuid
  add <state>disk-snapshot</state>
  add optional <disks>/<disk>, where each <disk> maps back to <domain>/<devices>/<disk> and controls how to do external disk snapshots

Add the following flags to existing API:

virDomainSnapshotCreateXML:
  VIR_DOMAIN_SNAPSHOT_CREATE_HALT
  VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY
  VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE

virDomainSnapshotGetXMLDesc:
  VIR_DOMAIN_XML_SECURE

virDomainRevertToSnapshot:
  VIR_DOMAIN_SNAPSHOT_REVERT_START
  VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE
  VIR_DOMAIN_SNAPSHOT_REVERT_FORCE
  VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN
  VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD

virDomainSnapshotDelete:
  VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY
  VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY

virDomainUndefineFlags:
  VIR_DOMAIN_UNDEFINE_SNAPSHOTS

--
Eric Blake   eblake redhat com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

