[libvirt] Checkpoint VMs

Daniel P. Berrange berrange at redhat.com
Mon Jun 8 15:12:00 UTC 2009


On Mon, Jun 08, 2009 at 04:53:53PM +0200, Maximilian Wilhelm wrote:
> Hi!
> 
> Some months ago there were some mails about ideas of adding new API
> functions for checkpointing of domains.
> 
> For the project group Virtualized SuperComputer (the crazy guys with
> the ESX driver) we would like to have such a feature and would be
> willing to propose API patches and anything the like.
> 
> As we have to deal with many VMs at the same time which belong to one
> computation job (think: virtualized HPC cluster) we are facing the
> problem to checkpoint a set of VMs at the same time.
> 
> So besides a usual checkpoint we would be interested in something like
> 
>   checkpointAt (host, vm, timestamp)
>   restoreFromCheckpointAt (host, vm, timestamp)
> 
> and let the hypervisor / libvirtd / whatever piece of software on the
> host face the problem to execute the checkpoint/restore command at
> <timestamp>.
> 
> The main idea behind this is to checkpoint or restore the set of VMs
> as simultaneously as possible.

For this to work I assume you would have to issue the checkpointAt()
command to all the virDomainPtr objects you cared about, using a
timestamp that is in the future, and hope that all the commands
get dealt with in time.  I'm not really convinced this is the best
approach from the API point of view.
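
To make the concern concrete, here is a minimal client-side sketch of that
pattern in Python. Both checkpoint_at() and the checkpoint callable are
hypothetical names used only for illustration; libvirt has no checkpointAt()
call. Each domain gets its own timer armed for the same future timestamp,
and nothing guarantees the timers fire together or that the work behind
them starts on time.

```python
import threading
import time


def checkpoint_at(domains, timestamp, checkpoint):
    """Arm one timer per domain so checkpoint(dom) fires at `timestamp`.

    `checkpoint` is a hypothetical callable supplied by the caller;
    it stands in for whatever checkpoint operation would exist.
    `timestamp` is seconds since the epoch, as returned by time.time().
    """
    if timestamp <= time.time():
        raise ValueError("timestamp must be in the future")
    timers = []
    for dom in domains:
        # Recompute the delay per domain: arming earlier timers takes time.
        delay = max(0.0, timestamp - time.time())
        timer = threading.Timer(delay, checkpoint, args=(dom,))
        timer.start()
        timers.append(timer)
    return timers
```

The caller can join() the returned timers to wait for completion, but it
only learns after the fact whether every command was dealt with in time.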

That said, I agree that we need a more advanced & clever API
for save/restore/checkpoint. To start with I'd like an API that
follows along these kinds of lines:

  http://www.redhat.com/archives/libvir-list/2009-March/msg00205.html

> As far as I got it ESX for example is able to execute tasks at a
> given point of time once or recurring as far as you have a Virtual
> Center with all your hosts belong^Wconnected to it.
> 
> Any opinions on such new functions?

I think a first step would be producing a managed save/restore/checkpoint
capability. Once that's working, consider the issue of batched/synchronized
operations. How accurate does a 'synchronized' checkpoint have to be?
Milliseconds, seconds, or even more latitude?

Currently each save operation in libvirt blocks until completion. That is
clearly not good enough for a synchronized save of several VMs. If accuracy
is on the order of 'seconds', then making a non-blocking option for the save
method would allow an app to save/checkpoint multiple VMs in very close
succession. If you need millisecond accuracy, then we'd need more
explicit API support for a delayed operation, or the ability to define
an object representing a group of VMs, and then issue an operation on
that group, letting libvirt then dispatch it to each VM internally.
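
As a rough illustration of the 'seconds' case, an app can already
approximate a group operation by running the blocking saves in parallel
threads. The sketch below is Python and is only an assumption about how
an app might do it, not a libvirt API: save_group() and the save callable
are hypothetical names. A barrier holds every worker until all of them
are ready, so the saves start in very close succession.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def save_group(domains, save):
    """Run the blocking save(dom) for each domain, starting them together.

    `save` is a hypothetical caller-supplied callable; `domains` must be
    hashable (e.g. domain names). Returns {domain: result of save(domain)}.
    """
    domains = list(domains)
    if not domains:
        return {}
    barrier = threading.Barrier(len(domains))

    def worker(dom):
        barrier.wait()  # block until every worker thread has reached here
        return save(dom)

    # One thread per domain, otherwise the barrier would never be released.
    with ThreadPoolExecutor(max_workers=len(domains)) as pool:
        futures = {dom: pool.submit(worker, dom) for dom in domains}
        return {dom: fut.result() for dom, fut in futures.items()}
```

In practice save would wrap whatever blocking save call the app uses.
Thread scheduling alone limits how closely the saves start, so this does
not address the millisecond case; that still needs support inside libvirt.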

Regards,
Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|



