[libvirt] Race between monitor startup and incoming migration impacting libvirt

Charles Duffy charles at dyfis.net
Sat Sep 26 01:43:54 UTC 2009


There appears to be a race condition wherein a 'cont' command sent 
immediately on qemu startup can prevent an inbound migration specified 
via -incoming from occurring. libvirt's process for starting up qemu 
domains with an incoming migration includes a 'cont' command at the 
end of qemudInitCpus, shortly after a successful connection with the 
monitor is made. While the monitor is generally unresponsive 
while an inbound migration is ongoing, forcing the 'cont' to occur only 
after the migration has completed, this isn't always true (as will be 
demonstrated below).

I suspect strongly that this is responsible for an occasional failure 
I'm seeing when loading libvirt domains from file.

This is highly reproducible using qemu-kvm-0.11.0-rc2, and 
straightforward to demonstrate by the following means:


     [ONE-TIME SETUP]
     - Build an appropriate ramsave file by migrating a stopped guest 
to disk.
     - Mark any backing store used by this guest read-only.

     [COMMON STEPS]
     - Create an empty qcow2 file backed by the read-only store, if your 
guest has any disks.
     - Invoke qemu with arguments appropriate to the VM being resumed, 
and also the following: -S -monitor stdio -incoming 'exec:echo 
START_DELAY >&2 && sleep 5 && echo END_DELAY >&2 && cat <ramsave.raw && 
echo LOAD_DONE >&2'.
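
The -incoming argument above is easy to mistype, so here is a minimal sketch that assembles it from two variables. The names RAMSAVE, DELAY and incoming_arg are illustrative only and do not come from libvirt or qemu; the script only prints the argument and does not start qemu.

```shell
#!/bin/sh
# Sketch only: RAMSAVE, DELAY and incoming_arg are hypothetical names.
RAMSAVE=ramsave.raw   # saved migration stream from the one-time setup
DELAY=5               # artificial delay, in seconds, before the stream starts

# Print the value to pass to -incoming. The marker lines go to stderr so
# they stay out of the migration stream that qemu reads from stdout.
incoming_arg() {
    printf 'exec:echo START_DELAY >&2 && sleep %s && echo END_DELAY >&2 && cat <%s && echo LOAD_DONE >&2' \
        "$DELAY" "$RAMSAVE"
}

# Show the assembled argument.
incoming_arg
echo

# Intended use (not run here; the rest of the command line depends on the VM):
#   <qemu binary> <VM arguments> -S -monitor stdio -incoming "$(incoming_arg)"
```

Keeping the delay in a variable makes it simple to vary the width of the window in which 'cont' is sent.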

     [VALIDATING CORRECT OPERATION]
     - Wait until 'LOAD_DONE' is displayed, and run 'cont'
     - The VM will correctly resume.
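
The "wait until LOAD_DONE, then 'cont'" ordering can also be scripted. The following is a rough sketch, assuming qemu's stderr has been redirected to a log file; wait_for_marker is a hypothetical helper, not part of libvirt or qemu, and the demo at the end simulates the marker with a background subshell rather than running qemu.

```shell
#!/bin/sh
# Sketch only: wait_for_marker is a hypothetical helper.
# Usage: wait_for_marker LOGFILE MARKER TIMEOUT_SECONDS
# Polls LOGFILE about once per second until a line exactly equal to MARKER
# appears. Returns 0 if the marker was seen, 1 if the timeout expired first.
wait_for_marker() {
    log=$1
    marker=$2
    remaining=$3
    while :; do
        if grep -qxF -- "$marker" "$log" 2>/dev/null; then
            return 0
        fi
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Demo without qemu: a background subshell stands in for the -incoming
# command and writes the marker after one second.
log=$(mktemp)
( sleep 1; echo LOAD_DONE >>"$log" ) &
if wait_for_marker "$log" LOAD_DONE 10; then
    echo "safe to send cont"
else
    echo "timed out waiting for LOAD_DONE"
fi
wait
rm -f "$log"
```

In a real run the caller would send 'cont' to the monitor only after wait_for_marker returns 0. This depends on the LOAD_DONE marker from the exec: command above, so it illustrates the ordering rather than offering a general fix.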

     [REPRODUCING THE BUG]
     - Run 'cont' after START_DELAY is displayed, but before END_DELAY.
     - 'cat: write error: Broken pipe' will be displayed.
     - The guest VM will reboot, enter a catatonic state, or otherwise 
fail to load correctly.

     [REPRODUCING WITHOUT ARTIFICIAL DELAY]
     As the 'sleep 5' used in the above may be considered cheating, this 
issue may also be reproduced without any delay by removing the 'sleep' 
and appending <<<$'cont\n' to the shell command used to invoke qemu, so 
that 'cont' reaches the monitor on stdin immediately.

     [REPRODUCING OVER A UNIX SOCKET]
     Included for completeness, as libvirt 0.7.x uses UNIX sockets here.
     - Invoke the following in a separate window:
       socat - UNIX-LISTEN:/tmp/test.monitor <<<$'cont\n'
     - Invoke qemu as above, but with -monitor unix:/tmp/test.monitor
       in place of -monitor stdio.

I have a work-in-progress patch which modifies libvirt to use -daemonize 
for startup; waiting for the guest to detach before attempting to 
interact with the monitor may avoid this issue. However, as this patch 
is against libvirt master, and the master branch has other issues which 
expose themselves on virDomainRestore, I am unable to test it here.


Thoughts (and workarounds) welcome.



