[libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

Wed Oct 13 14:21:28 UTC 2010

  On 10/13/2010 09:11 AM, Daniel P. Berrange wrote:
> On Thu, Oct 07, 2010 at 09:58:28AM -0400, Stefan Berger wrote:
>>   On 10/07/2010 09:06 AM, Soren Hansen wrote:
>>> I had trouble applying the patch (I think maybe Thunderbird may have
>>> fiddled with the formatting :( ), but after doing it manually, it works
>>> excellently. Thanks!
>>>
>> Great. I will prepare a V3.
>>
>> I am also shooting a kill -SIGHUP at libvirt once in a while to see what
>> happens (while creating / destroying 2 VMs and modifying their filters).
>> Most of the time all goes well, but occasionally things do get stuck. I
>> get the following debugging output from libvirt and attaching gdb to
>> libvirt I see the following stack traces. Maybe Daniel can interpret
>> this... To me it looks like some of the conditions need to be 'tickled'...
>>
>> (gdb) thr ap all bt
>>
>> Thread 9 (Thread 0x7f49bf592710 (LWP 17464)):
>> #0  0x000000327680b729 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>     from /lib64/libpthread.so.0
>> #1  0x0000000000435312 in virCondWaitUntil (c=<value optimized out>,
>>      m=<value optimized out>, whenms=<value optimized out>)
>>      at util/threads-pthread.c:115
>> #2  0x000000000043d0ab in qemuDomainObjBeginJobWithDriver (
>>      driver=0x1f9c010, obj=0x7f49a00011b0) at qemu/qemu_driver.c:409
>> #3  0x0000000000458abf in qemuAutostartDomain (payload=<value optimized out>,
>>      name=<value optimized out>, opaque=0x7f49bf591320)
>>      at qemu/qemu_driver.c:818
>> #4  0x00007f49c040ab6a in virHashForEach (table=0x1f9be20,
>>      iter=0x458a90 <qemuAutostartDomain>, data=0x7f49bf591320)
>>      at util/hash.c:495
>> #5  0x000000000043cdac in qemudAutostartConfigs (driver=0x1f9c010)
>>      at qemu/qemu_driver.c:855
>> #6  0x000000000043ce2a in qemudReload () at qemu/qemu_driver.c:2003
>> #7  0x00007f49c0450a3e in virStateReload () at libvirt.c:1017
>> #8  0x00000000004189e1 in qemudDispatchSignalEvent (
>>      watch=<value optimized out>, fd=<value optimized out>,
>>      events=<value optimized out>, opaque=0x1f6f830) at libvirtd.c:388
>> #9  0x00000000004186a9 in virEventDispatchHandles () at event.c:479
>> #10 virEventRunOnce () at event.c:608
>> #11 0x000000000041a346 in qemudOneLoop () at libvirtd.c:2217
>> #12 0x000000000041a613 in qemudRunLoop (opaque=0x1f6f830) at libvirtd.c:2326
>> #13 0x0000003276807761 in start_thread () from /lib64/libpthread.so.0
>> #14 0x00000032760e14ed in clone () from /lib64/libc.so.6
>
> This thread shows the problem. Guests must not be run directly
> from the event loop thread, because startup requires waiting
> for I/O events. So this thread is sitting on the condition
> variable waiting for an I/O event to complete, but because
> it's doing this from the event loop thread, the event loop
> isn't running. So the condition will never be signalled.
> This is completely unrelated to the other problems discussed
> in this thread & I'm surprised we've not seen it before now!
>
Yes, it's unrelated and came up through my testing of the code paths I 
touched with the deadlock-prevention patch...
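
For illustration, here is a toy Python model of the situation Daniel describes. It is not libvirt code: EventLoop, start_guest and both start_from_* helpers are made-up names, and the only assumption carried over from the trace is that guest startup waits, with a timeout, for an event that only the event loop can dispatch.

```python
import queue
import threading


class EventLoop:
    """Toy single-threaded event loop that runs queued callbacks in order."""

    def __init__(self):
        self._callbacks = queue.Queue()

    def post(self, callback):
        self._callbacks.put(callback)

    def run_until_idle(self, idle_timeout=0.5):
        # Dispatch callbacks until none arrives for idle_timeout seconds.
        while True:
            try:
                callback = self._callbacks.get(timeout=idle_timeout)
            except queue.Empty:
                return
            callback()


def start_guest(loop, wait_timeout):
    # Hypothetical guest startup: it needs an I/O completion that only
    # the event loop can dispatch, and waits for it with a timeout.
    io_done = threading.Event()
    loop.post(io_done.set)
    return io_done.wait(timeout=wait_timeout)


def start_from_loop_thread():
    # Startup runs as a loop callback, so the loop cannot dispatch
    # io_done.set until start_guest has already given up waiting.
    loop = EventLoop()
    result = []
    loop.post(lambda: result.append(start_guest(loop, wait_timeout=0.2)))
    loop.run_until_idle()
    return result[0]


def start_from_worker_thread():
    # The loop only spawns a worker; it stays free to dispatch io_done.set
    # while the worker waits.
    loop = EventLoop()
    result = []
    worker = threading.Thread(
        target=lambda: result.append(start_guest(loop, wait_timeout=2.0)))
    loop.post(worker.start)
    loop.run_until_idle()
    worker.join()
    return result[0]
```

In this model start_from_loop_thread() reports that the wait timed out, while start_from_worker_thread() reports that the event arrived, which is the difference between starting a guest from the event loop thread and from any other thread.
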
> When you send SIGHUP to libvirt this triggers a reload of the
> guest domain configs. For some reason we also have this SIGHUP
> re-triggering autostart. IMHO this is a very big mistake. If
> a guest is marked as autostart, I don't think an admin would
> expect it to be started when just sending SIGHUP. I think we
> should fix it so that autostart is only ever done at daemon
> startup, not SIGHUP. This would avoid the entire problem code
> path here.
>
FWIW, I don't have any VM marked as 'autostart', but the code seems to
be doing something for every VM, whether or not it is marked as
autostart, i.e., it runs 'qemuDomainObjBeginJobWithDriver'.
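
To make the two points above concrete, here is a small Python sketch of the behaviour being proposed. It is an assumption-laden model, not the qemu driver: Domain, Driver, startup, reload and jobs_begun are hypothetical names. Autostart runs only from startup, and the autostart flag is checked before any per-domain job is begun.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    autostart: bool = False
    running: bool = False


@dataclass
class Driver:
    domains: list = field(default_factory=list)
    jobs_begun: list = field(default_factory=list)

    def load_configs(self):
        # Placeholder for re-reading the guest domain configs.
        pass

    def _autostart_all(self):
        for dom in self.domains:
            # Check the flag first, so domains that are not marked
            # autostart never have a job begun on their behalf.
            if not dom.autostart or dom.running:
                continue
            self.jobs_begun.append(dom.name)
            dom.running = True

    def startup(self):
        # Daemon startup: load configs, then autostart flagged domains.
        self.load_configs()
        self._autostart_all()

    def reload(self):
        # SIGHUP: refresh configs only, never autostart.
        self.load_configs()
```

With this shape, a reload never reaches the job-begin step at all, and at startup only domains marked autostart do.
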

    Stefan