[libvirt] [RFC v3] external (pull) backup API

Mon May 21 21:54:29 UTC 2018

On 05/18/2018 02:56 AM, Daniel P. Berrangé wrote:
> On Thu, May 17, 2018 at 05:43:37PM -0500, Eric Blake wrote:
>> Here's my updated counterproposal for a backup API.
>>
>> In comparison to v2 posted by Nikolay:
>> https://www.redhat.com/archives/libvir-list/2018-April/msg00115.html
>> - changed terminology a bit: Nikolay's "BlockSnapshot" is now called a
>> "Checkpoint", and "BlockExportStart/Stop" is now "BackupBegin/End"
>> - flesh out more API descriptions
>> - better documentation of proposed XML, for both checkpoints and backup
>>
>> Barring any major issues turned up during review, I've already started to
>> code this into libvirt with a goal of getting an implementation ready for
>> review this month.
> 
> I think the key thing missing from the docs is some kind of explanation
> about the difference between a backup, a checkpoint and a snapshot.
> I'll admit I've not read the mail in detail, but at a high level it is
> not immediately obvious what the difference is & thus which APIs I would
> want to be using for a given scenario.

Indeed, and that's a fair complaint.  Here's a first draft, that I'll 
have to polish into a formal html document that both the snapshot and 
checkpoint/backup pages refer to (or maybe I merge snapshots and 
checkpoint descriptions into a single html page, although I'm not quite 
sure what to name the page then).

One of the features made possible with virtual machines is live
migration, or transferring all state related to the guest from one
host to another, with minimal interruption to the guest's activity.  A
clever observer will then note that if all state is available for live
migration, there is nothing stopping a user from saving that state at
a given point of time, to be able to later rewind guest execution back
to the state it previously had.  There are several different libvirt
APIs associated with capturing the state of a guest, such that the
captured state can later be used to rewind that guest to the
conditions it was in earlier.  But since there are multiple APIs, it
is best to understand the tradeoffs and differences between them, in
order to choose the best API for a given task.

Timing: Capturing state can be a lengthy process, so while the
captured state ideally represents an atomic point in time
corresponding to something the guest was actually executing, some
interfaces require up-front preparation (the state captured is not
complete until the API ends, which may be some time after the command
was first started), while other interfaces track the state when the
command was first issued even if it takes some time to finish
capturing the state.  While it is possible to freeze guest I/O around
either point in time (so that the captured state is fully consistent,
rather than just crash-consistent), knowing whether the state is
captured at the start or end of the command may determine which
approach to use.  A related concept is the amount of downtime the
guest will experience during the capture, particularly since freezing
guest I/O has time constraints.
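
To make the two timing behaviors concrete, here is a toy model in
Python.  It is only a sketch: the disk is a plain list of blocks, the
function names are made up for illustration, and nothing here is how
libvirt or qemu actually implements either behavior.

```python
def capture_at_start(disk, guest_writes):
    """Return the disk contents as of the moment the capture began."""
    saved = {}
    for block, data in guest_writes:
        # Copy-before-write: keep the old contents the first time a block is hit.
        saved.setdefault(block, disk[block])
        disk[block] = data
    # Untouched blocks come straight from the disk; touched ones from `saved`.
    return [saved.get(i, disk[i]) for i in range(len(disk))]


def capture_at_end(disk, guest_writes):
    """Return the disk contents as of the moment the capture finished."""
    copy = list(disk)  # bulk copy taken when the command is issued
    for block, data in guest_writes:
        disk[block] = data
        copy[block] = data  # every later guest write is mirrored into the copy
    return copy


# The same guest writes arrive while each capture is in progress.
writes = [(1, "B"), (3, "D"), (1, "X")]
start_view = capture_at_start(["a", "b", "c", "d"], writes)
end_view = capture_at_end(["a", "b", "c", "d"], writes)
print(start_view)  # ['a', 'b', 'c', 'd']
print(end_view)    # ['a', 'X', 'c', 'D']
```

The same guest activity yields two different captures, which is why
it matters whether guest I/O is frozen around the start or the end of
the command.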

Amount of state: For an offline guest, only the contents of the guest
disks need to be captured; restoring that state is merely a fresh
boot with the disks restored to that state.  But for an online guest,
there is a choice between storing the guest's memory (all that is
needed during live migration where the storage is shared between
source and destination), the guest's disk state (all that is needed if
there are no pending guest I/O transactions that would be lost without
the corresponding memory state), or both together.  Unless guest I/O
is quiesced prior to capturing state, reverting to captured disk
state of a live guest without the corresponding memory state is
comparable to booting a machine that previously lost power without a
clean shutdown; but for a guest that uses appropriate journaling
methods, this crash-consistent state may be sufficient to avoid the
additional storage and time needed to capture memory state.

Quantity of files: When capturing state, some approaches store all
state within the same file (internal), while others expand a chain of
related files that must be used together (external), leaving more files
for a management application to track.  There are also differences
depending on whether the state is captured in the same file in use by
a running guest, or whether the state is captured to a distinct file
without impacting the files used to run the guest.

Third-party integration: When capturing state, particularly for a
running guest, there are tradeoffs to how much of the process must be done
directly by the hypervisor, and how much can be off-loaded to
third-party software.  Since capturing state is not instantaneous, it
is essential that any third-party integration see consistent data even
if the running guest continues to modify that data after the point in
time of the capture.

Full vs. partial: When capturing state, it is useful to minimize the
amount of state that must be captured in relation to a previous
capture, by focusing only on the portions of the disk that the guest
has modified since the previous capture.  Some approaches are able to
take advantage of checkpoints to provide an incremental backup, while
others are only capable of a full backup including portions of the
disk that have not changed since the previous state capture.
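
As a rough illustration of full versus incremental capture, the
following Python sketch treats a disk as a list of blocks and a backup
as a mapping from block number to data.  The helper names are
hypothetical, and the set of modified blocks is simply handed in here;
how that set gets tracked is a separate question.

```python
def full_backup(disk):
    """Copy every block, whether or not it changed."""
    return dict(enumerate(disk))


def incremental_backup(disk, dirty_blocks):
    """Copy only the blocks recorded as modified since the previous capture."""
    return {i: disk[i] for i in sorted(dirty_blocks)}


def restore(full, incrementals):
    """Rebuild a disk by applying incrementals, oldest first, over the full backup."""
    blocks = dict(full)
    for inc in incrementals:
        blocks.update(inc)
    return [blocks[i] for i in range(len(blocks))]


disk = ["a", "b", "c", "d"]
base = full_backup(disk)

disk[2] = "C"
inc1 = incremental_backup(disk, {2})   # one block instead of four

disk[0] = "A"
inc2 = incremental_backup(disk, {0})

restored = restore(base, [inc1, inc2])
print(restored)  # ['A', 'b', 'C', 'd']
```

Each incremental is only as large as the guest's changes, but a
restore needs the full backup plus every incremental taken since, in
order.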

With those definitions, the following libvirt APIs have these
properties:

virDomainSnapshotCreateXML: This API wraps several approaches for
capturing guest state, with a general premise of creating a snapshot
(where the current guest resources are frozen in time and a new
wrapper layer is opened for tracking subsequent guest changes).  It
can operate on both offline and running guests, can choose whether to
capture the state of memory, disk, or both when used on a running
guest, and can choose between internal and external storage for
captured state.  However, it is geared towards post-event captures
(when capturing both memory and disk state, the disk state is not
captured until all memory state has been collected first).  For qemu
as the hypervisor, internal snapshots currently have lengthy downtime
that is incompatible with freezing guest I/O, but external snapshots
are quick.  Since creating an external snapshot changes which disk
image resource is in use by the guest, this API can be coupled with
virDomainBlockCommit to restore things back to the guest using its
original disk image, where a third-party tool can read the backing
file prior to the live commit.
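
The external snapshot and live commit cycle can be pictured with a toy
backing chain.  This is only a sketch in Python under simplifying
assumptions: the Image class, the helper names and the file names are
all invented for illustration and do not correspond to any libvirt or
qemu interface.

```python
class Image:
    """A toy disk image: sparse block map plus an optional backing image."""

    def __init__(self, name, backing=None):
        self.name = name
        self.backing = backing
        self.blocks = {}

    def read(self, block):
        # Walk down the chain until some layer has the block allocated.
        image = self
        while image is not None:
            if block in image.blocks:
                return image.blocks[block]
            image = image.backing
        return None  # unallocated everywhere in the chain

    def write(self, block, data):
        self.blocks[block] = data


def external_snapshot(active, overlay_name):
    """Freeze `active` and return a new wrapper layer for later guest writes."""
    return Image(overlay_name, backing=active)


def commit_and_pivot(overlay):
    """Merge the overlay into its backing image, which becomes active again."""
    base = overlay.backing
    base.blocks.update(overlay.blocks)
    return base


base = Image("base.qcow2")
base.write(0, "old")

active = external_snapshot(base, "overlay.qcow2")
active.write(0, "new")

frozen_view = base.read(0)    # what a third-party tool copying base sees
guest_view = active.read(0)   # what the running guest sees

active = commit_and_pivot(active)
print(frozen_view, guest_view, active.name, active.read(0))
```

While the overlay exists, the backing file is stable and safe to copy;
after the commit, the guest is back on its original image with all of
its writes intact.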

virDomainBlockCopy: This API wraps approaches for capturing the state
of disks of a running guest, but does not track accompanying guest
memory state.  The capture is consistent only at the end of the
operation, with a choice to either pivot to the new file that contains
the copy (leaving the old file as the backup), or to return to the
original file (leaving the new file as the backup).

virDomainBackupBegin: This API wraps approaches for capturing the
state of disks of a running guest, but does not track accompanying
guest memory state.  The capture is consistent to the start of the
operation, where the captured state is stored independently from the
disk image in use with the guest, and where it can be easily
integrated with a third-party tool for capturing the disk state.  Since
the backup is stored externally from the guest resources, there
is no need to commit data back in at the completion of the operation.
When coupled with checkpoints, this can be used to capture incremental
backups instead of full.

virDomainCheckpointCreateXML: This API does not actually capture guest
state, so much as make it possible to track which portions of guest
disks have changed between checkpoints or between a current checkpoint
and the live execution of the guest.  When performing incremental
backups, it is easier to create a new checkpoint at the same time as a
new backup, so that the next incremental backup can refer to the
incremental state since the checkpoint created during the current
backup.
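
One way to picture checkpoints is as a list of dirty-block sets, where
only the newest set records guest writes.  The Python sketch below is
a toy model under that assumption; the class and method names are
hypothetical and are not part of the proposed API.

```python
class TrackedDisk:
    """A toy disk that records which blocks change after each checkpoint."""

    def __init__(self, nblocks):
        self.blocks = [""] * nblocks
        self.checkpoints = []  # (name, dirty block set) pairs, oldest first

    def create_checkpoint(self, name):
        self.checkpoints.append((name, set()))

    def write(self, block, data):
        self.blocks[block] = data
        if self.checkpoints:
            # Only the newest checkpoint's set is still recording writes.
            self.checkpoints[-1][1].add(block)

    def changed_since(self, name):
        """Blocks modified since `name`: its set merged with all newer ones."""
        names = [n for n, _ in self.checkpoints]
        dirty = set()
        for _, bitmap in self.checkpoints[names.index(name):]:
            dirty |= bitmap
        return dirty

    def incremental_backup(self, since, new_checkpoint):
        """Copy blocks changed since `since`, then start a new checkpoint."""
        data = {i: self.blocks[i] for i in sorted(self.changed_since(since))}
        self.create_checkpoint(new_checkpoint)
        return data


disk = TrackedDisk(8)
disk.create_checkpoint("c1")
disk.write(1, "x")
disk.write(5, "y")
inc1 = disk.incremental_backup(since="c1", new_checkpoint="c2")

disk.write(5, "z")
disk.write(7, "w")
inc2 = disk.incremental_backup(since="c2", new_checkpoint="c3")

print(inc1)                             # {1: 'x', 5: 'y'}
print(inc2)                             # {5: 'z', 7: 'w'}
print(sorted(disk.changed_since("c1")))  # [1, 5, 7]
```

Creating the new checkpoint in the same step as the backup means no
guest write can slip in between the two, so the next incremental
starts exactly where this one ended.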

Putting it together: the following two sequences both capture the disk
state of a running guest, then complete with the guest running on its
original disk image; but with a difference that an unexpected
interruption during the first mode leaves a temporary wrapper file
that must be accounted for, while interruption of the second mode has
no impact to the guest.

1. Backup via temporary snapshot
virDomainFSFreeze()
virDomainSnapshotCreateXML(VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
virDomainFSThaw()
third-party copy the backing file to backup storage # most time spent here
virDomainBlockCommit(VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)
wait for commit ready event
virDomainBlockJobAbort(VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

2. Direct backup
virDomainFSFreeze()
virDomainBackupBegin()
virDomainFSThaw()
wait for push mode event, or pull data over NBD # most time spent here
virDomainBackupEnd()
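
For the first sequence, a minimal sketch using the libvirt-python
bindings might look like the following.  It assumes the caller passes
in a virDomain object and the imported libvirt module (for the flag
constants); copy_backing_file and wait_until_ready are hypothetical
callbacks standing in for the third-party copy and for waiting on the
commit ready event.  The pivot flag on the final call is what returns
the guest to its original image.  Error handling beyond the thaw is
left out.

```python
def backup_via_temporary_snapshot(dom, lv, disk, snapshot_xml,
                                  copy_backing_file, wait_until_ready):
    """Back up one disk of a running guest through a temporary external snapshot.

    dom  -- a libvirt.virDomain for the running guest
    lv   -- the imported libvirt module, used only for flag constants
    disk -- target name of the disk, for example "vda"
    """
    dom.fsFreeze()
    try:
        dom.snapshotCreateXML(snapshot_xml,
                              lv.VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
    finally:
        # Thaw even if the snapshot failed, so the guest is never left frozen.
        dom.fsThaw()

    # The backing file is now stable; most of the time is spent here.
    copy_backing_file()

    # Merge the temporary overlay back into the original image.
    dom.blockCommit(disk, None, None, 0, lv.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)
    wait_until_ready()

    # Pivot so the guest ends up running on its original disk image.
    dom.blockJobAbort(disk, lv.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
```

A caller would do "import libvirt", open a connection, look up the
domain, and pass both the domain and the libvirt module in.  If the
process dies between the snapshot and the pivot, the temporary overlay
is still in the guest's backing chain and has to be cleaned up, which
is the interruption cost mentioned above.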

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org