[libvirt] [Question]Libvirt doesn't care about qemu monitor event if fail to destroy qemu process

Mon Mar 5 09:36:47 UTC 2018

On 03/05/2018 10:26 AM, Michal Privoznik wrote:
> On 03/05/2018 03:20 AM, Wuzongyong (Euler Dept) wrote:
>> Hi,
>>
>> We unregister qemu monitor after sending QEMU_PROCESS_EVENT_MONITOR_EOF to workerPool:
>>
>> static void
>> qemuProcessHandleMonitorEOF(qemuMonitorPtr mon,
>>                             virDomainObjPtr vm,
>>                             void *opaque)
>> {
>>     virQEMUDriverPtr driver = opaque;
>>     qemuDomainObjPrivatePtr priv;
>> struct qemuProcessEvent *processEvent;
>> ...
>> processEvent->eventType = QEMU_PROCESS_EVENT_MONITOR_EOF;
>>     processEvent->vm = vm;
>>
>>     virObjectRef(vm);
>>     if (virThreadPoolSendJob(driver->workerPool, 0, processEvent) < 0) {
>>         ignore_value(virObjectUnref(vm));
>>         VIR_FREE(processEvent);
>>         goto cleanup;
>>     }
>>
>>     /* We don't want this EOF handler to be called over and over while the
>>      * thread is waiting for a job.
>>      */
>> qemuMonitorUnregister(mon);
>> ...
>> }
>>
>> Then we handle QEMU_PROCESS_EVENT_MONITOR_EOF in processMonitorEOFEvent function:
>>
>> static void
>> processMonitorEOFEvent(virQEMUDriverPtr driver,
>>                        virDomainObjPtr vm)
>> {
>>       ...
>>       if (qemuProcessBeginStopJob(driver, vm, QEMU_JOB_DESTROY, true) < 0)
>>         return;
>>       ...
>> }
>>
>> Here,  libvirt will show that the vm state is running all the time if qemuProcessBeginStopJob return -1
>> even though qemu may terminate or be killed later.
>>
>> So, may be we should re-register the monitor when qemuProcessBeginStopJob failed?
> 
> The fact that processMonitorEOFEvent() failed to grab DESTROY job means
> that we screwed up earlier and now you're just seeing effects of it.
> Threads should be albe to acquire DESTROY job at any point, regardless
> of other jobs set on the domain object.
> 
> Can you please:
> a) try to turn on debug logs [1] and tell us why acquiring DESTROY job
> failed? You should see an error message like this:
> 
>   error: cannot acquire state change lock ..
> 
> b) tell us what is your libvirt version and if you're able to reproduce
> this with the latest git HEAD?
> 
> 
> Ha! Looking at the code I think I've found something that might be
> causing this issue. Do you have max_queued set in qemu.conf? Because if
> you do, then qemuDomainObjBeginJobInternal() might fail to set job
> because it's above the set limit. If I'm right, this should be the fix:
> 
> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index 8b4efc82d..7eb631e06 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -5401,7 +5401,8 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver,
>      then = now + QEMU_JOB_WAIT_TIME;
> 
>   retry:
> -    if (cfg->maxQueuedJobs &&
> +    if ((!async && job == QEMU_JOB_DESTROY) &&

Oh, this should be reversed (note to myself: don't write any patches
until morning coffee kicks in):

if ((!async && job != QEMU_JOB_DESTROY) &&
    cfg->maxQueuedJobs &&

Michal