
[libvirt] Long QEMU main loop pauses during migration (to file) under heavy load



Libvirt recently introduced a change to the way it does 'save to file'
with QEMU. Historically QEMU has a 32MB/s I/O limit on migration by
default. When saving to file, we didn't want any artificial limit,
but rather to max out the underlying storage. So when doing save to
file, we set a large bandwidth limit (INT64_MAX / (1024 * 1024)) so
it is effectively unlimited.

After doing this, we discovered that the QEMU monitor was becoming
entirely blocked. It did not even return from the 'migrate' command
until migration was complete despite the 'detach' flag being set.
This was a bug in libvirt: we passed QEMU a plain file descriptor
for a regular file, and O_NONBLOCK has no effect on regular files,
so writes never return EAGAIN. Thank you POSIX.

Libvirt has another mode where it uses an I/O helper command to get
O_DIRECT, and in this mode we pass a pipe() FD to QEMU. After ensuring
that this pipe FD really does have O_NONBLOCK set, we still saw some
odd behaviour.

I'm not sure whether what I describe can necessarily be called a QEMU
bug, but I wanted to raise it for discussion anyway....

The sequence of steps is

  - libvirt sets qemu migration bandwidth to "unlimited"
  - libvirt opens a pipe() and sets O_NONBLOCK on the write end
  - libvirt spawns  libvirt-iohelper giving it the target file
    on disk, and the read end of the pipe
  - libvirt does 'getfd migfile' monitor command to give QEMU
    the write end of the pipe
  - libvirt does 'migrate fd:migfile -d' to run migration
  - In parallel
       - QEMU is writing to the pipe (which is non-blocking)
       - libvirt-iohelper is reading the pipe & writing to disk with O_DIRECT

The initial 'migrate' command detaches into the background OK, and
libvirt can enter its loop doing 'query-migrate' frequently to
monitor progress. Initially this works fine, but at some points
during the migration, QEMU will get "stuck" for a very long time
and not respond to the monitor (or indeed the mainloop at all).
These blackouts are anywhere from 10 to 20 seconds long.

Using a combination of systemtap, gdb and strace, I managed to
determine the following:

 - Most of the qemu_savevm_state_iterate() calls complete in 10-20 ms

 - Reasonably often a qemu_savevm_state_iterate() call takes 300-400 ms

 - Fairly rarely a qemu_savevm_state_iterate() call takes 10-20 *seconds*

 - I can see EAGAIN from the FD QEMU is migrating from - hence most
   of the iterations are quite short.

 - In the 10-20 second long calls, no EAGAIN is seen for the entire
   period.

 - The host OS in general is fairly "laggy", presumably due to the high
   rate of direct I/O being performed by the I/O helper, and bad scheduler
   tuning


IIUC, there are two things which will cause a qemu_savevm_state_iterate()
call to return

   - Getting EAGAIN on the migration FD
   - Hitting the max bandwidth limit

We have set effectively unlimited bandwidth, so everything relies on
the EAGAIN behaviour.

If the OS is doing a good job at scheduling processes & I/O, then this
seems to work reasonably well. If the OS is highly loaded and becoming
less responsive to scheduling apps, then QEMU gets itself into a bad
way.

What I think is happening is that the OS is giving too much time to
the I/O helper process that is reading the other end of the pipe
given to QEMU, and then doing the O_DIRECT to disk.

Thus in the shorter-than-normal windows of time when QEMU itself is
scheduled by the OS, the pipe is fairly empty, so QEMU does not see
EAGAIN for a very long period of wallclock time.

So we get into a case where QEMU sees 10-20 second gaps between
iterations of the main loop. Having a non-infinite max-bandwidth
for the migration would likely mitigate this to some extent, but
I still think it'd be possible to get QEMU into these pathological
conditions under high load for a host.

Is this a scenario we need to worry about for QEMU ? On the one
hand it seems like it is a rare edge case in OS behaviour overall.
On the other hand, times when a host is highly loaded and non-responsive
are exactly the times when a mgmt app might want to save a guest to
disk, or migrate it elsewhere.  Which would mean we need QEMU to
behave as well as possible in these adverse conditions.

Thus should we consider having an absolute bound on the execution time
of qemu_savevm_state_iterate(), independent of EAGAIN & bandwidth
limits, to ensure the main loop doesn't get starved ?

Or perhaps moving migration to a separate thread, out of the mainloop
is what we need to strive for ?

Regards,
Daniel
--  
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

