[libvirt] [PATCH] RFC: Support QEMU live upgrade

Zheng Sheng ZS Zhou zhshzhou at cn.ibm.com
Tue Nov 19 07:43:55 UTC 2013


on 2013/11/13 21:10, Daniel P. Berrange wrote:
> On Wed, Nov 13, 2013 at 12:15:30PM +0800, Zheng Sheng ZS Zhou wrote:
>> Hi Daniel,
>>
>> on 2013/11/12 20:23, Daniel P. Berrange wrote:
>>>> On Tue, Nov 12, 2013 at 08:14:11PM +0800, Zheng Sheng ZS Zhou wrote:
>>>> Hi all,
>>>>
>>>> Recently QEMU developers are working on a feature to allow upgrading
>>>> a live QEMU instance to a new version without restarting the VM. This
>>>> is implemented as live migration between the old and new QEMU process
>>>> on the same host [1]. Here is the use case:
>>>>
>>>> 1) Guests are running QEMU release 1.6.1.
>>>> 2) Admin installs QEMU release 1.6.2 via RPM or deb.
>>>> 3) Admin starts a new VM using the updated QEMU binary, and asks the old
>>>> QEMU process to migrate the VM to the newly started VM.
>>>>
>>>> I think it will be very useful to support QEMU live upgrade in libvirt.
>>>> After some investigation, I found that migrating to the same host breaks
>>>> the current migration code. I'd like to propose a new workflow for
>>>> QEMU live migration to implement step 3) above.
>>>
>>> How does it break migration code ? Your patch below is effectively
>>> re-implementing the multistep migration workflow, leaving out many
>>> important features (seamless reconnect to SPICE clients for example)
>>> which is really bad for our ongoing code support burden, so not
>>> something I want to see.
>>>
>>> Daniel
>>>
>>
>> Actually I wrote another hacking patch to investigate how we
>> can re-use existing framework to do local migration. I found
>> the following problems.
>>
>> (1) When migrating to a different host, the destination domain
>> uses the same UUID and name as the source, and this is OK. When
>> migrating to localhost, the destination domain's UUID and name
>> conflict with the source's. The QEMU driver maintains a hash
>> table of domain objects keyed by the UUID of the virtual
>> machine. The closeCallbacks table is also a hash table with the
>> domain UUID as key, and there may be other data structures using
>> the UUID as key. This forces us to use a different name and UUID
>> for the destination domain. In the migration framework, during
>> the Begin and Prepare stages, virDomainDefCheckABIStability is
>> called, which prevents us from using a different UUID, and the
>> framework also checks that the source and destination hostname
>> and host UUID differ. If we want to enable local migration, we
>> have to skip these checks and generate a new UUID and name for
>> the destination domain. Of course we restore the original UUID
>> after migration. The UUID is used by higher-level management
>> software to identify virtual machines, so it should stay the
>> same after a QEMU live upgrade.
> 
> This point is something that needs to be solved regardless of
> whether using migration framework, or re-inventing the migration
> framework. The QEMU driver fundamentally assumes that there is
> only ever one single VM with a given UUID, and a VM has only
> 1 process. IMHO name + uuid must be preserved during any live
> upgrade process, otherwise mgmt will get confused. This has
> more problems because 'name' is used for various resources
> created by QEMU on disk - eg the monitor command path. We can't
> have 2 QEMUs using the same name, but at the same time that's
> exactly what we'd need here.
> 

Thanks Daniel. I agree with you that we should not change the QEMU
UUID. I also think refactoring and re-using the existing migration
code is the right approach, so I did some investigation in this
direction. I found that the assumption of one process and one UUID
per VM exists not only in the QEMU driver, but also in libvirt's
higher-level data structures and functions, namely the virDomainObj
structure and the virCloseCallbacksXXX functions. The hypervisor
process ID is stored directly in virDomainObj.pid, and virDomainObj
contains only one pid field. The virCloseCallbacksXXX functions
maintain an invariant that only one callback and connection can be
registered for each VM UUID. For example, in a non-p2p migration,
the client opens two connections to libvirtd, one for the source
domain and one for the destination domain. When it tries to register
a close callback for the destination connection and domain, libvirt
reports an error that another connection has already registered a
callback for this UUID (the source connection registered it for the
source domain).

Would it be acceptable to start the new QEMU process with the same
UUID while referring to it in libvirt's virDomainObj using a
different UUID? That is, we generate a new UUID for the destination
VM, which avoids all the conflicts in libvirt, but we also store the
original UUID in the VM definition and pass the original UUID to the
new QEMU process when we start it. After migration, we drop the new
virDomainObj and let the original virDomainObj attach to the new
QEMU process. This way the guest never notices any change in the
UUID and we avoid conflicts in libvirt.

>> (2) If I understand the code correctly, libvirt uses a thread
>> pool to handle RPC requests. This means local migration may
>> cause a deadlock in P2P migration mode. Suppose there are
>> several concurrent local migration requests and all the worker
>> threads are occupied by them. When the source libvirtd connects
>> to the destination libvirtd on the same host to negotiate the
>> migration, the negotiation request is queued, but it will never
>> be handled: the original migration request from the client is
>> waiting for the negotiation to finish before it can progress,
>> while the negotiation request is queued behind the original
>> request. This is one of the deadlock risks I can think of.
>> I guess the traditional migration mode, in which the client
>> opens two connections to the source and destination libvirtd,
>> also carries a risk of deadlock.
> 
> Yes, it sounds like you could get deadlock even with 2 separate
> libvirtds, if both them were migrating to the other concurrently.
> 

We will try to locate and fix deadlock problems when implementing
local migration. This seems like the right way to go.

> 
>> (3) Libvirt supports Unix domain socket transport, but only
>> for tunnelled migration. For native migration, it only
>> supports TCP. We need to enable Unix domain socket transport
>> in native migration. We already have a hypervisor migration
>> URI argument in the migration API, but there is no support for
>> parsing and verifying a "unix:/full/path" URI and passing it
>> transparently to QEMU. We could add this to the current
>> migration framework, but direct Unix socket transport looks
>> meaningless for normal migration.
> 
> Actually as far as QEMU is concerned libvirt uses fd: migration
> only. Again though this points seems pretty much unrelated to
> the question of how we design the APIs & structure the code.
> 

Yes. I just want to point out that native Unix socket transport is
what the QEMU developers decided to use for local migration with
page-flipping. You may have already noticed that the vmsplice()
system call needs a pipe. The old QEMU process and the new one are
not parent and child, so QEMU uses the ancillary (out-of-band) data
API of Unix domain sockets to transfer the pipe fd from one QEMU
process to the other. This is not possible over TCP. That's why I
need to enable direct Unix domain socket transport for QEMU live
upgrade.

>> (4) When migration fails, the source domain is resumed, and
>> this may not work if we enable page-flipping in QEMU. With
>> page-flipping enabled, QEMU transfers memory page ownership
>> to the destination QEMU, so the source virtual machine
>> should be restarted rather than resumed when migration fails.
> 
> IMHO that is not an acceptable approach. The whole point of doing
> live upgrades in place, is that you consider the VMs to be
> "precious". If you were OK with VMs being killed & restarted then
> we'd not bother doing any of this live upgrade pain at all.
> 
> So if we're going to support live upgrades, we *must* be able to
> guarantee that they will either succeed, or the existing QEMU is
> left intact.  Killing the VM and restarting is not an option on
> failure.
> 

Yes. I'll check with the QEMU developers whether a page-flipped
guest can resume its vCPUs or not.

>> So I propose a new and compact workflow dedicated to QEMU
>> live upgrade. After all, it's an upgrade operation based on a
>> tricky migration. When developing the previous RFC patch for
>> the new API, I focused on the correctness of the workflow, so
>> many other things are missing. I think I can add things like
>> SPICE seamless migration when submitting new versions.
> 
> This way lies madness. We do not want 2 impls of the internal
> migration framework.
> 
>> I would also be really happy if you could give me some advice
>> on re-using the migration framework. Re-using the current
>> framework can save a lot of effort.
> 
> I consider using the internal migration framework a mandatory
> requirement here, even if the public API is different.
> 
> Daniel
> 

I also think re-using the migration code is good. I'm trying to find
ways to avoid the UUID and name conflict problems. If there is no
simple way, I need to investigate how much refactoring is required,
and propose solutions to enable the QEMU driver to manage multiple
processes per VM.

Thanks and best regards!

-- 
Zhou Zheng Sheng / 周征晟
