[libvirt] Fwd: There seems a deadlock in libvirt

Jiri Denemark jdenemar at redhat.com
Mon Feb 18 22:57:58 UTC 2013


On Mon, Feb 18, 2013 at 10:45:50 +0800, Chun-Hung Chen wrote:
> Hi, all,
> 
> We were running OpenStack with Ubuntu and libvirt 0.9.10. We found that
> libvirt monitor commands were not working well.
> There were a lot of errors in libvirtd.log like this:
> 2013-02-07 06:07:39.000+0000: 18112: error :
> qemuDomainObjBeginJobInternal:773 : Timed out during operation: cannot
> acquire state change lock
> 
> We dug into libvirtd with strace and found that one of the threads was
> stuck in the following call:
> futex(0x7f69ac0ec0ec, FUTEX_WAIT_PRIVATE, 2717, NULL
> 
> It seems this thread is waiting for a reply that never comes back, so the
> other threads wait for it. We also saw there is a function called
> virCondWaitUntil(). Is it safe for us to modify the code from virCondWait()
> to virCondWaitUntil() to prevent such a deadlock scenario? Thanks.

No, replacing virCondWait with virCondWaitUntil is not safe. It would
get you out of the situation you are fighting with, but it could also
introduce nasty inconsistencies between qemu and libvirt, since libvirt
would think a monitor command failed while it just took longer than
expected.
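
To illustrate the inconsistency, here is a toy sketch in Python. It is not
libvirt code: FakeMonitor, qemu_side and send are made-up names, and the
delays are arbitrary. The caller uses a timed wait and gives up, while the
simulated qemu still completes the command afterwards.

```python
import threading
import time


class FakeMonitor:
    """Toy stand-in for a monitor connection (hypothetical, not libvirt)."""

    def __init__(self):
        self.cond = threading.Condition()
        self.reply = None      # set by the simulated qemu when it answers
        self.applied = False   # whether the simulated qemu ran the command

    def qemu_side(self, delay):
        # Simulate qemu needing `delay` seconds to execute and answer.
        time.sleep(delay)
        with self.cond:
            self.applied = True
            self.reply = "ok"
            self.cond.notify_all()

    def send(self, timeout=None):
        # timeout=None models an untimed wait; a number models a timed wait.
        with self.cond:
            got = self.cond.wait_for(lambda: self.reply is not None, timeout)
            return "ok" if got else "timed out"


mon = FakeMonitor()
worker = threading.Thread(target=mon.qemu_side, args=(0.3,))
worker.start()
caller_view = mon.send(timeout=0.05)  # caller gives up before the reply
worker.join()

# The caller believes the command failed, yet the other side applied it.
print(caller_view, mon.applied)
```

With timeout=None the caller would block until the reply arrives, which is
slow but keeps both sides in agreement about what happened.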

> The following is the output of gdb -p 'libvirt.pid', 'thread id' and 'bt full':

It's generally better to provide the output of 'thread apply all backtrace'
(or 't a a bt full' if you wish), since knowing what the other threads are
doing is useful and sometimes crucial when solving issues.
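
If you want to script that, a minimal sketch in Python could build the gdb
command line as below. The helper name and the pid value are made up for
illustration; -p, -batch and -ex are standard gdb options, and the snippet
only builds the argument list without running anything.

```python
def build_gdb_argv(pid, full=True):
    """Return an argv that dumps backtraces of all threads of `pid`."""
    command = "thread apply all bt full" if full else "thread apply all bt"
    # -batch makes gdb exit after running the -ex command
    return ["gdb", "-p", str(pid), "-batch", "-ex", command]


# 1234 is a placeholder pid; use the real libvirtd pid instead.
argv = build_gdb_argv(1234)
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run(argv) as a user allowed
to attach to the process.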

> #0  0x00007f69c8c1dd84 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> No symbol table info available.
> #1  0x00007f69c9ee884a in virCondWait (c=<optimized out>, m=<optimized
> #2  0x000000000049c749 in qemuMonitorSend (mon=0x7f69ac0ec0c0,
> #3  0x00000000004ac8ed in qemuMonitorJSONCommandWithFd (mon=0x7f69ac0ec0c0,
> #4  0x00000000004ae794 in qemuMonitorJSONGetBalloonInfo
> #5  0x0000000000457451 in qemudDomainGetInfo (dom=<optimized out>,
> #6  0x00007f69c9f63eda in virDomainGetInfo (domain=0x7f69980e3650,

Anyway, this does not look like a deadlock of any kind. Libvirt is simply
waiting for qemu to reply to the query-balloon monitor command. The problem
with this command is that it may require cooperation from the guest OS:
qemu actually sends a request to the guest's balloon driver and waits for
the reply. Thus, if the guest OS is not responding, you will end up waiting
forever.

The good thing is that libvirt does not really need to send the
query-balloon command to qemu from the virDomainGetInfo API, and current
libvirt no longer sends it. However, that requires both qemu and libvirt
to support the BALLOON_CHANGE event. In other words, with new enough libvirt
(git commit v0.9.13-64-g1d9d510, which was first released in libvirt 1.0.0)
and qemu (not sure what version), the issue just disappears.

If memory ballooning is not needed (i.e., there's no need to shrink the
amount of memory given to a domain while the domain is running), you can
work around this issue even with old libvirt/qemu by disabling the balloon
driver; just replace the existing memballoon element in the domain XML with

    <memballoon model='none'/>
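
If there are many domains to change, the memballoon replacement above could
be scripted. The following is only a sketch in Python using the standard
xml.etree.ElementTree module; disable_balloon and the sample XML are made
up for illustration, and the result would still have to be fed back to
libvirt (for example via virsh edit or virsh define).

```python
import xml.etree.ElementTree as ET


def disable_balloon(domain_xml):
    """Return domain XML with any memballoon device replaced by model='none'."""
    root = ET.fromstring(domain_xml)
    devices = root.find("devices")
    if devices is None:
        devices = ET.SubElement(root, "devices")
    # Drop whatever memballoon element is currently configured
    for old in devices.findall("memballoon"):
        devices.remove(old)
    ET.SubElement(devices, "memballoon", {"model": "none"})
    return ET.tostring(root, encoding="unicode")


# Made-up minimal domain definition, just for demonstration
sample = """<domain type='kvm'>
  <name>demo</name>
  <devices>
    <memballoon model='virtio'/>
  </devices>
</domain>"""

patched = disable_balloon(sample)
print(patched)
```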

Jirka



