
Re: [libvirt] cont command failing via JSON monitor on restore



On 01/12/2011 05:13 PM, Jim Fehlig wrote:
> libvirt 0.8.7
> qemu 0.13
>
> I'm looking into a problem with qemu save/restore via JSON monitor.  On
> restore, the vm is left in a paused state with following error returned
> for 'cont' command
>
>   An incoming migration is expected before this command can be executed
>
> I was trying to debug the issue in gdb, but stepping through the code
> introduces enough delay between qemudStartVMDaemon() and doStartCPUs()
> that the latter succeeds.  Any suggestions on how to determine when it
> is safe to call doStartCPUs() to prevent the above error?  I don't see
> this issue with the text monitor btw.

I'm pretty sure this is related to a bug I reported on qemu-devel last April:

   http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg00635.html

(Be sure to read my own follow-up if you want a correct description of the circumstances.) In that case libvirt was using the text monitor, and there was a race condition between qemudStartVMDaemon() (which executes qemu with '-S -incoming') and doStartCPUs() (which issues a 'cont' command to the qemu monitor). Sometimes qemu would receive and process the 'cont' before the incoming migration had started, so it would execute garbage memory instead of the saved/restored image of the guest.

The solution to this was posted to upstream qemu in July:

  http://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01574.html

and I believe it is in qemu 0.13. That patch adds a check to the 'cont' command: if '-incoming' was specified on the command line, 'cont' executes only after a migration has successfully completed, and returns an error otherwise.
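To make the behaviour of that check concrete, here is a toy model in Python. It is not qemu code: the class and method names are made up for illustration, and the error class name "MigrationExpected" is my assumption (only the message text comes from Jim's report).

```python
class ToyGuest:
    """Toy stand-in for a qemu process started with or without '-incoming'."""

    def __init__(self, incoming: bool) -> None:
        # True while an incoming migration is still expected.
        self.incoming_expected = incoming
        self.running = False

    def migration_completed(self) -> None:
        # Called once the saved image has been fully loaded.
        self.incoming_expected = False

    def cont(self) -> dict:
        # The guard: refuse to start the CPUs before the migration is done.
        if self.incoming_expected:
            return {
                "error": {
                    "class": "MigrationExpected",  # assumed class name
                    "desc": "An incoming migration is expected "
                            "before this command can be executed",
                }
            }
        self.running = True
        return {"return": {}}
```

A 'cont' sent too early gets the error and leaves the guest paused; the same command sent after migration_completed() succeeds.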

Actually, thinking about this "fix", it seems that it isn't really a solution: the race is still there, and instead of the guest starting up in an indeterminate state, doStartCPUs() will just fail (as you've seen), making the entire guest startup fail.

You can almost surely make it work properly by putting a 250 ms delay between those two function calls in libvirt. It would be nice if it could be totally fixed in qemu, though, so that libvirt didn't need such a hack :-(
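A variation on the fixed delay would be to retry 'cont' for as long as qemu answers with the "migration expected" error, and give up after a timeout. Below is a rough Python sketch of that idea, not libvirt code. send_cont is a hypothetical callable that issues 'cont' on the JSON monitor and returns the decoded reply, and the error class name "MigrationExpected" is an assumption on my part.

```python
import time
from typing import Callable

MIGRATION_EXPECTED = "MigrationExpected"  # assumed error class name


def cont_when_ready(
    send_cont: Callable[[], dict],
    timeout: float = 5.0,
    interval: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Retry 'cont' until it succeeds or the timeout expires.

    Returns True if the guest was started, False on timeout.
    Any error other than "migration expected" is raised immediately.
    """
    deadline = clock() + timeout
    while True:
        reply = send_cont()
        if "return" in reply:
            return True
        error = reply.get("error", {})
        if error.get("class") != MIGRATION_EXPECTED:
            raise RuntimeError(f"cont failed: {error}")
        if clock() >= deadline:
            return False
        sleep(interval)
```

This still polls, so it is a hack of the same kind, but it does not depend on guessing a delay that is long enough for every machine.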

(I had unfortunately lost track of the bug by the time the patch was posted. It had been there for so long that I'd just gotten used to manually pausing/unpausing any guest I wanted to save on the one machine that displays the problem. Too bad I got so used to living with it, as I'd otherwise have been forced to try the patch out; this machine is running F13, which is still at qemu-0.12.5, which doesn't have the patch.)

