
Re: [libvirt] RFC: mirrored live block migration in libvirt 0.9.11



On 13/03/2012 23:20, Eric Blake wrote:
> virDomainSnapshotCreateXML will learn a new flag:
> VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC.  If this flag is present, then
> libvirt guarantees that the snapshot operation will either succeed, or
> that failure will be reported without changing domain XML or qemu
> runtime state.  If present, the creation API will fail if qemu lacks the
> 'transaction' command and more than one disk snapshot was requested in
> the <domainsnapshot> XML.  If this flag is not present, then libvirt
> will use 'transaction' if available, but fall back to
> 'blockdev-snapshot-sync', so that it works with older qemu, but where
> the caller then has to check virDomainGetXMLDesc on failure to see if a
> partial snapshot occurred.  This flag will be implied by any other part
> of the API that requires the use of 'transaction'.

Fine.
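
As a rough illustration of the flag semantics quoted above, the choice of qemu command could be modelled as below. This is only a sketch, not libvirt code: `SKETCH_CREATE_ATOMIC`, `sketch_cmd` and `sketch_pick_command` are hypothetical names, the flag value is a stand-in rather than the real enum value, and the single-disk case without 'transaction' is assumed to fall back to 'blockdev-snapshot-sync'.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC
 * (not the real libvirt enum value). */
enum { SKETCH_CREATE_ATOMIC = 1 << 0 };

/* Which qemu monitor command the sketch would use. */
typedef enum {
    SKETCH_CMD_FAIL,          /* refuse before touching domain XML or qemu state */
    SKETCH_CMD_TRANSACTION,   /* all-or-nothing 'transaction' */
    SKETCH_CMD_SNAPSHOT_SYNC  /* per-disk 'blockdev-snapshot-sync' fallback */
} sketch_cmd;

static sketch_cmd
sketch_pick_command(unsigned int flags, bool has_transaction, int ndisks)
{
    assert(ndisks >= 1);

    /* 'transaction' is preferred whenever qemu offers it, flag or not. */
    if (has_transaction)
        return SKETCH_CMD_TRANSACTION;

    /* Without 'transaction', several disks cannot be snapshotted atomically. */
    if ((flags & SKETCH_CREATE_ATOMIC) && ndisks > 1)
        return SKETCH_CMD_FAIL;

    /* Older qemu: the caller must check virDomainGetXMLDesc on failure
     * when more than one disk was requested. */
    return SKETCH_CMD_SNAPSHOT_SYNC;
}
```

For example, `sketch_pick_command(SKETCH_CREATE_ATOMIC, false, 2)` yields `SKETCH_CMD_FAIL`, while the same request without the flag yields `SKETCH_CMD_SNAPSHOT_SYNC`.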

> The VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT flag was added to
> virDomainSnapshotCreateXML in 0.9.10, with semantics that it would stop
> libvirt from complaining if a regular file already existed as the
> snapshot destination, but without interacting with qemu, which would
> blindly overwrite the contents of that file.  Since this flag is
> relatively new, and has not had much use, I propose to slightly alter
> its documented semantics to now interact with the qemu 1.1 feature being
> added as part of 'transaction'.  If qemu supports 'transaction', then
> presence of this flag implies that libvirt will explicitly request
> 'mode':'existing' for each snapshot, which tells qemu to open the
> existing file without writing any new metadata, and that the caller is
> responsible to ensure that the file has identical guest contents
> (generally by creating a qcow2 file with the current file as backing
> image and no additional contents).  Additionally, libvirt will now
> require the file to already exist (in 0.9.10, libvirt silently ignored
> the fact if the flag was requested but the file did not exist).
> Presence of the flag without qemu support for 'transaction' will now
> fail (that is, VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT will now imply
> VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC).

Also looks ok.
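
To make the REUSE_EXT rules concrete, including the no-flag default that the quoted proposal goes on to describe, here is a small sketch of the resulting qemu 'mode'. It is not libvirt code: `SKETCH_CREATE_REUSE_EXT`, `sketch_mode` and `sketch_snapshot_mode` are hypothetical names, and the flag value is a stand-in.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT
 * (not the real libvirt enum value). */
enum { SKETCH_CREATE_REUSE_EXT = 1 << 1 };

/* Effective qemu 'mode' for one disk snapshot, or a refusal. */
typedef enum {
    SKETCH_MODE_REFUSED,        /* libvirt reports an error */
    SKETCH_MODE_EXISTING,       /* 'mode':'existing', no new metadata written */
    SKETCH_MODE_ABSOLUTE_PATHS  /* qemu default, new qcow2 header written */
} sketch_mode;

static sketch_mode
sketch_snapshot_mode(unsigned int flags, bool has_transaction,
                     bool dest_is_regular_file)
{
    /* The sketch models this one flag only. */
    assert((flags & ~(unsigned int)SKETCH_CREATE_REUSE_EXT) == 0);

    if (flags & SKETCH_CREATE_REUSE_EXT) {
        /* REUSE_EXT implies ATOMIC, so 'transaction' is mandatory. */
        if (!has_transaction)
            return SKETCH_MODE_REFUSED;
        /* The caller must have prepared the destination file already. */
        if (!dest_is_regular_file)
            return SKETCH_MODE_REFUSED;
        return SKETCH_MODE_EXISTING;
    }

    /* Without the flag, an existing regular file must not be overwritten. */
    if (dest_is_regular_file)
        return SKETCH_MODE_REFUSED;
    return SKETCH_MODE_ABSOLUTE_PATHS;
}
```

So `sketch_snapshot_mode(SKETCH_CREATE_REUSE_EXT, true, true)` gives `SKETCH_MODE_EXISTING`, and the same call with the file missing is refused.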

> Absence of the flag means that
> libvirt will rely on qemu's default to 'mode':'absolute-paths', and will
> require that the file does not exist as a regular file; this maps to
> qemu 1.0 always writing a new qcow2 header with absolute backing file
> name.  If we want to later expose additional modes, like
> 'no-backing-file', it would be done via per-<disk> annotations in the
> <domainsnapshot> XML rather than via new flags, but for this proposal, I
> think oVirt is okay using the flag to set a single policy for all disks
> mentioned in a given snapshot request.
>
> virDomainSnapshotCreateXML's xml argument, <domainsnapshot>, will learn
> an optional <mirror> sub-element to each <disk>.  While the
> 'transaction' command supports multiple mirrors in one transaction, for
> now, libvirt will enforce at most one mirror, which should be sufficient
> for oVirt's needs.  (Adding more support for the rest of the power of
> 'transaction' is probably best left for new libvirt API, but that's
> outside the scope of this proposal).  As an example,
>  <domainsnapshot>
>    <disks>
>      <disk name='/src/base.img' snapshot='external'>
>        <source file='/src/snap.img'/>
>        <mirror file='/dest/snap.img'/>
>      </disk>
>    </disks>
>  </domainsnapshot>
> would create a new libvirt snapshot object with /src/snap.img as the
> read-write new image, and /dest/snap.img as the new write-only mirror.
> On success, this rewrites the domain's live XML to point to
> /src/snap.img as its current file.

This is an awfully low-level API; you're designing for oVirt rather than
for everything else.  The problem here is twofold:

1) you're defining a snapshot that cannot be started without losing the
mirrors.

2) in case the snapshotting is aborted early for any reason, oVirt has
to do a rebase operation manually.  This is currently O(size-of-disk),
not O(changes-in-the-last-image), so it wastes both disk space and time.

If it works, I cannot really say "don't do it", but I think the oVirt
mirrored snapshots idea is a dead-end and a workaround for lack of block
device streaming (which is now supported).  You could have a simpler,
high-level API based on streaming rather than snapshotting.  So, if you
have /src/disk.img as your image, you would have a new API:

  virDomainBlockCopy(dom, "disk",
                     "/dst/disk.img", "/src/base.img",
                     bandwidth, flags)

which would do all that is needed:

- start mirroring writes to /dst/disk.img; no snapshotting needed.  A
flag VIR_DOMAIN_BLOCK_COPY_REUSE_EXT would let you specify the
"existing" mode.  Another flag VIR_DOMAIN_BLOCK_COPY_CREATE_RAW would
use the raw format on the destination and specify the no-backing-file
mode (of course only valid if base == NULL).

- call virDomainBlockRebase(dom, "disk", "/src/base.img", bandwidth, 0)
to start the streaming job.

If something doesn't work here, it's a QEMU bug.
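
The two steps could be sketched roughly as follows. This is a stand-alone model, not an implementation: virDomainBlockCopy is only the API proposed in this message, so every `sketch_`/`SKETCH_` name below is hypothetical, the hypervisor is replaced by a log structure, and rejecting REUSE_EXT together with CREATE_RAW is an assumption the proposal does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the proposed VIR_DOMAIN_BLOCK_COPY_* flags. */
enum {
    SKETCH_COPY_REUSE_EXT  = 1 << 0,
    SKETCH_COPY_CREATE_RAW = 1 << 1
};

typedef enum {
    SKETCH_COPY_MODE_DEFAULT,         /* leave qemu's default mode alone */
    SKETCH_COPY_MODE_EXISTING,        /* "existing" mode */
    SKETCH_COPY_MODE_NO_BACKING_FILE  /* "no-backing-file" mode */
} sketch_copy_mode;

/* Records what would have been requested from the hypervisor. */
struct sketch_log {
    int mirror_started;
    sketch_copy_mode mirror_mode;
    int mirror_raw;            /* 1 if the destination uses the raw format */
    int rebase_started;
    const char *rebase_base;
};

static int
sketch_block_copy(struct sketch_log *log, const char *disk, const char *dest,
                  const char *base, unsigned long bandwidth, unsigned int flags)
{
    sketch_copy_mode mode = SKETCH_COPY_MODE_DEFAULT;
    int raw = 0;

    assert(log != NULL);
    (void)disk; (void)dest; (void)bandwidth;  /* unused in this model */

    if (flags & SKETCH_COPY_CREATE_RAW) {
        /* Raw images cannot have a backing file, so base must be NULL. */
        if (base != NULL)
            return -1;
        /* Assumption: the two flags are treated as mutually exclusive. */
        if (flags & SKETCH_COPY_REUSE_EXT)
            return -1;
        mode = SKETCH_COPY_MODE_NO_BACKING_FILE;
        raw = 1;
    } else if (flags & SKETCH_COPY_REUSE_EXT) {
        mode = SKETCH_COPY_MODE_EXISTING;
    }

    /* Step 1: start mirroring writes to the destination; no snapshot. */
    log->mirror_started = 1;
    log->mirror_mode = mode;
    log->mirror_raw = raw;

    /* Step 2: start the streaming job; stands in for
     * virDomainBlockRebase(dom, disk, base, bandwidth, 0). */
    log->rebase_started = 1;
    log->rebase_base = base;
    return 0;
}
```

Calling `sketch_block_copy(&log, "disk", "/dst/disk.img", "/src/base.img", 0, SKETCH_COPY_REUSE_EXT)` on a zeroed log records both steps, whereas passing `SKETCH_COPY_CREATE_RAW` with a non-NULL base returns -1 before anything is started.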

> Finally, virDomainSnapshotDelete will learn a new flag,
> VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR, which says that the libvirt
> snapshot object will be deleted, but only after first calling the qemu
> 'drive-reopen' monitor command for all disks that had a <mirror> in the
> associated snapshot object.  That is, for the above example, this would
> reopen the disk from its current read-write of /src/snap.img over to
> the second storage domain's /dest/snap.img with its accompanying
> mirrored backing chain.  On success, this rewrites the domain's live XML
> to point to the just-opened mirror location.  This flag will fail if the
> libvirt snapshot being deleted is not the current image, or if the
> snapshot being deleted does not have any mirrored disks.

I think you also need VIR_DOMAIN_SNAPSHOT_DELETE_REMOVE_MIRROR, to be
used in case of abort so that the domain can actually be started.  Or it
could be an event MIRROR_DROPPED or something like that.

Paolo

