
[libvirt] Race between monitor startup and incoming migration impacting libvirt



There appears to be a race condition wherein a 'cont' command sent immediately on qemu startup can prevent an inbound migration specified via -incoming from occurring. libvirt's process for starting up qemu domains with an incoming migration ends with a 'cont' command at the end of qemudInitCpus, shortly after a successful connection with the monitor is made. The monitor is generally unresponsive while an inbound migration is ongoing, which forces the 'cont' to take effect only after the migration has completed. However, this isn't always true (as will be demonstrated below).

I suspect strongly that this is responsible for an occasional failure I'm seeing when loading libvirt domains from file.

This is highly reproducible using qemu-kvm-0.11.0-rc2, and straightforward to demonstrate by the following means:


    [ONE-TIME SETUP]
    - Build an appropriate ramsave file by migrating a stopped guest to disk.
    - Mark any backing store used by this guest read-only.

    [COMMON STEPS]
    - Create an empty qcow2 file backed by the read-only store, if your guest has any disks.
    - Invoke qemu with arguments appropriate to the VM being resumed, and also the following:
      -S -monitor stdio -incoming 'exec:echo START_DELAY >&2 && sleep 5 && echo END_DELAY >&2 && cat <ramsave.raw && echo LOAD_DONE >&2'

    [VALIDATING CORRECT OPERATION]
    - Wait until 'LOAD_DONE' is displayed, and run 'cont'
    - The VM will correctly resume.
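
    The "wait for LOAD_DONE, then cont" ordering above can be sketched as a
    small shell filter. This is only an illustration, not what libvirt does:
    cont_after_load is a hypothetical helper name, and it assumes qemu's
    stderr (which carries the LOAD_DONE marker from the -incoming command)
    arrives on its stdin and that its stdout is connected to the monitor.

```shell
# Hypothetical helper: read qemu's stderr on stdin and emit 'cont' on stdout
# only once the LOAD_DONE marker from the -incoming command has been seen.
# Returns 0 if 'cont' was emitted, 1 if input ended without the marker.
cont_after_load() {
    while IFS= read -r line; do
        case $line in
            *LOAD_DONE*)
                printf 'cont\n'
                return 0
                ;;
        esac
    done
    return 1
}
```

    How the two ends are wired to qemu (a FIFO, or a monitor on a UNIX
    socket) is left open here; the point is only that 'cont' is withheld
    until the load has finished.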

    [REPRODUCING THE BUG]
    - Run 'cont' after START_DELAY is displayed, but before END_DELAY.
    - 'cat: write error: Broken pipe' will be displayed.
    - The guest VM will reboot, enter a catatonic state, or otherwise fail to load correctly.

    [REPRODUCING WITHOUT ARTIFICIAL DELAY]
    As the 'sleep 5' used above may be considered cheating, the issue can also be reproduced without any delay: remove the 'sleep', and append <<<$'cont\n' to the shell command used to invoke qemu.
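
    For convenience, the -incoming argument used in the steps above, with
    and without the delay, could be generated by a helper along these lines.
    The name incoming_arg and its parameters are made up for illustration,
    and the ramsave path is assumed to contain no spaces or shell
    metacharacters.

```shell
# Hypothetical helper: print the -incoming argument for a ramsave file.
#   $1 = ramsave file (assumed free of spaces and shell metacharacters)
#   $2 = artificial delay in seconds; 0 or omitted drops the 'sleep'
incoming_arg() {
    ramsave=$1
    delay=${2:-0}
    if [ "$delay" -gt 0 ]; then
        printf 'exec:echo START_DELAY >&2 && sleep %s && echo END_DELAY >&2 && cat <%s && echo LOAD_DONE >&2' \
            "$delay" "$ramsave"
    else
        printf 'exec:echo START_DELAY >&2 && echo END_DELAY >&2 && cat <%s && echo LOAD_DONE >&2' \
            "$ramsave"
    fi
}

# Example use (the qemu binary name and guest arguments are placeholders):
#   qemu-kvm <guest args> -S -monitor stdio -incoming "$(incoming_arg ramsave.raw 5)"
```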

    [REPRODUCING OVER A UNIX SOCKET]
    Included for completeness, as libvirt 0.7.x uses UNIX sockets here.
    - Invoke the following in a separate window:
      socat - UNIX-LISTEN:/tmp/test.monitor <<<$'cont\n'
    - Invoke qemu as above, but with -monitor unix:/tmp/test.monitor
      in place of -monitor stdio.

I have a work-in-progress patch which modifies libvirt to use -daemonize for startup; waiting for the guest to detach before attempting to interact with the monitor may avoid this issue. However, as this patch is against libvirt master, and the master branch has other issues which expose themselves on virDomainRestore, I am unable to test it here.


Thoughts (and workarounds) welcome.

