[libvirt] [PATCH 00/11] Post-Copy Live Migration Support

Laine Stump laine at laine.org
Thu Dec 4 09:40:00 UTC 2014


On 12/01/2014 10:59 AM, Cristian Klein wrote:
> Qemu currently implements pre-copy live migration. VM memory pages are
> first copied from the source hypervisor to the destination, potentially
> multiple times as pages get dirtied during transfer, then VCPU state
> is migrated. Unfortunately, if the VM dirties memory faster than the
> network can transfer it, pre-copy never finishes. `virsh` currently
> includes an option to suspend a VM after a timeout so that migration
> can finish, but at the expense of downtime.
>
> A future version of qemu will implement post-copy live migration. The
> VCPU state is first migrated to the destination hypervisor, then
> memory pages are pulled from the source hypervisor. Post-copy has the
> potential to complete migration with zero downtime and minimal
> performance impact, even if the VM dirties pages quickly. On the
> other hand, while
> post-copy is in progress, any network failure would render the VM
> unusable, as its memory is partitioned between the source and
> destination hypervisor. Therefore, post-copy should only be used when
> necessary.
>
> Post-copy migration in qemu will work as follows:
> (1) The `x-postcopy-ram` migration capability needs to be set.
> (2) Migration is started.
> (3) When the user decides so, post-copy migration is activated by
> sending the `migrate-start-postcopy` command.
> (4) Qemu acknowledges by setting migration status to `postcopy-active`.
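
For concreteness, from a management application's point of view I'd
expect the sequence above to map onto the libvirt API roughly like the
sketch below. The VIR_MIGRATE_POSTCOPY flag and the
virDomainMigrateStartPostCopy() call are my guesses at what this series
ends up exposing, not confirmed names:

  /* Sketch only: VIR_MIGRATE_POSTCOPY and virDomainMigrateStartPostCopy()
   * are assumed names, not a confirmed API. */
  #include <stdio.h>
  #include <libvirt/libvirt.h>

  static int
  migrateWithPostcopy(virDomainPtr dom)
  {
      /* (1)+(2): start the migration with post-copy allowed; this call
       * blocks until the migration finishes, so the switch-over below
       * has to be issued from a different thread. */
      if (virDomainMigrateToURI(dom, "qemu+ssh://dst.example.com/system",
                                VIR_MIGRATE_LIVE | VIR_MIGRATE_POSTCOPY,
                                NULL, 0) < 0) {
          fprintf(stderr, "migration failed\n");
          return -1;
      }
      return 0;
  }

  /* (3): from another thread, once it's clear pre-copy isn't converging:
   *     virDomainMigrateStartPostCopy(dom, 0);
   * (4): qemu's acknowledgement shows up in the migration job status. */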

(there are probably inaccuracies and misstatements in the following, but
the topic does need consideration, and this seemed like a good place to
bring it up while it's fresh in my mind...)

I happened to be thinking about post-copy migration vs. guest networking
over the weekend, and realized a potential problem related to starting
the destination domain so quickly after it is created - if the guest is
connected to the network via a host bridge that has STP enabled and a
non-zero forwarding delay, the guest's network traffic could be
interrupted until the delay timer has counted down. This points out a
couple of things:

1) the "migrate-start-postcopy" needs to be either sent, or acknowledged
(I'm not sure which coincides more closely with the stopping of the
source domain and starting of the destination domain) after the
destination domain's tap devices have existed and been connected to the
bridge long enough to be able to forward traffic.

2) libvirt needs to have a more formal separation of the following tasks:

    * allocate resources for a network device
      (i.e. networkAllocateActualDevice())
    * create a network device (create and ifup the tap device, which
      would start STP timers counting down; in the case of macvtap, the
      device should be created, but not ifup'ed)
    * activate a network device (for a tap device, send a gratuitous ARP
      request and update the bridge's FDB with the guest's MAC address;
      for macvtap, ifup the device)

It should also have the reverse of all these operations:

    * deactivate (remove FDB entries for tap, ifdown for macvtap)
    * destroy (delete the tap/macvtap device)
    * free (networkReleaseActualDevice())

Additionally, for completeness we need "notify", which is done for each
guest interface any time libvirtd is restarted (this already exists as
networkNotifyActualDevice()); it just recreates libvirtd's tables of
which host interfaces are in use by guests.
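
To make the split concrete, the set of entry points might end up
looking something like the sketch below. The names and argument lists
are purely hypothetical, just to show the shape of the separation;
today only the allocate/notify/release functions exist:

  /* Hypothetical prototypes, argument lists simplified for illustration */

  /* reserve a host device/port for the guest interface */
  int networkAllocateActualDevice(virDomainNetDefPtr iface);

  /* create the tap/macvtap device; ifup tap devices (starting any STP
   * timers), leave macvtap down */
  int networkCreateActualDevice(virDomainNetDefPtr iface);

  /* make the device usable by the guest: update the bridge FDB for tap,
   * ifup for macvtap */
  int networkActivateActualDevice(virDomainNetDefPtr iface);

  /* reverse of activate: remove FDB entries for tap, ifdown macvtap */
  int networkDeactivateActualDevice(virDomainNetDefPtr iface);

  /* delete the tap/macvtap device */
  int networkDestroyActualDevice(virDomainNetDefPtr iface);

  /* undo the allocation */
  int networkReleaseActualDevice(virDomainNetDefPtr iface);

  /* rebuild libvirtd's usage tables after a restart */
  int networkNotifyActualDevice(virDomainNetDefPtr iface);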

Currently, libvirt performs create and activate at the same time (qemu
also sends a gratuitous ARP request at some point, although I haven't
checked whether that happens when qemu starts or when the guest CPUs
are started), and deactivate, destroy, and free all happen at pretty
much the same time as well. The former leads to problems like this one
reported by dgilbert:

  https://bugzilla.redhat.com/show_bug.cgi?id=1081461

This is just one of several possible variations of "some parts of the
network have incorrect information about where MAC X is currently
located"; when you mix in post-copy migration and manual handling of
the bridge FDB
(https://www.redhat.com/archives/libvir-list/2014-December/msg00173.html),
there are many opportunities for failure!

Back to my list of operations: to make migration work smoothly,
allocate and create should be done prior to starting the qemu process,
but activate shouldn't be done until just before the CPUs are turned on
(and ideally, *that* shouldn't happen until the connection to the device
is ready to forward traffic). Likewise, deactivate should be called as
soon as the CPUs are paused, while destroy/free should be done after
qemu is terminated. This way, the guest's MAC will be in only one
bridge's FDB at any given time, and it will be the FDB of the bridge
attached to the currently running instance.
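
Spelled out against those hypothetical entry points, the ordering I
have in mind looks roughly like this (the network_* and qemu_* names
are stand-ins, not real libvirt functions):

  /* stand-in declarations so the sketch reads as C */
  void network_allocate(void);   void network_create(void);
  void network_activate(void);   void network_deactivate(void);
  void network_destroy(void);    void network_release(void);
  void qemu_start_paused(void);  void qemu_resume_cpus(void);
  void qemu_pause_cpus(void);    void qemu_terminate(void);

  void
  migration_destination(void)
  {
      network_allocate();      /* reserve the host port                 */
      network_create();        /* tap exists and is ifup'ed; the STP    */
                               /* forward delay starts counting here    */
      qemu_start_paused();     /* qemu runs with CPUs stopped           */
      /* ... memory/state transfer, possibly switching to post-copy ... */
      network_activate();      /* only now does the MAC enter this      */
                               /* bridge's FDB                          */
      qemu_resume_cpus();      /* the guest runs here from now on       */
  }

  void
  migration_source(void)
  {
      qemu_pause_cpus();       /* CPUs stop on the source               */
      network_deactivate();    /* MAC leaves this bridge's FDB          */
      /* ... remaining pages pulled by the destination ...              */
      qemu_terminate();
      network_destroy();       /* delete the tap/macvtap device         */
      network_release();       /* undo the allocation                   */
  }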

Does anybody else have any thoughts/ideas on this subject? Cleaning up
the hypervisor drivers' use of network devices has been on my mind for a
long time, and it may be time to finally take action.



