[libvirt] libvirtd deadlock on shutdown

Thu Jun 21 09:01:53 UTC 2012

On 21.06.2012 01:37, Jim Fehlig wrote:
> Eric Blake wrote:
>> On 06/20/2012 03:07 PM, Jim Fehlig wrote:
>>   
>>> I'm looking into a libvirtd deadlock on daemon shutdown.  The deadlock
>>> occurs when shutting down virNetServer.  If the handling of a job is in
>>> flight in virNetServerHandleJob(), the virNetServer lock is acquired
>>> when freeing job->prog (src/rpc/virnetserver.c:167).  But the lock is
>>> already held in virNetServerFree(), which is blocked in
>>> virThreadPoolFree() waiting for all the workers to finish.  No progress
>>> can be made.
>>>
>>> The attached hack fixes the problem, but I'm not convinced this is an
>>> appropriate fix.  Is it necessary to hold the virNetServer lock when
>>> calling virNetServerProgramFree(job->prog)?  I notice the lock is not
>>> held in the error path of virNetServerHandleJob().
>>>
>>>     
>>
>>   
>>> +++ b/src/rpc/virnetserver.c
>>> @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv)
>>>      for (i = 0 ; i < srv->nservices ; i++)
>>>          virNetServerServiceToggle(srv->services[i], false);
>>>  
>>> +    virNetServerUnlock(srv);
>>>      virThreadPoolFree(srv->workers);
>>> +    virNetServerLock(srv);
>>>     
>>
>> As written, this can't be right, because it reads a field of srv outside
>> the locks.  But maybe you meant:
>>
>> <type> tmp = srv->workers;
>> srv->workers = NULL;
>> virNetServerUnlock(srv);
>> virThreadPoolFree(tmp);
>> virNetServerLock(srv);
>>
>> as a possible fix that doesn't violate the safety rules of reading
>> fields from srv outside a lock.
>>   
> 
> Perhaps, but it _currently_ appears srv->workers is only set in
> virNetServerNew(), which is called when libvirtd starts.  I suppose that
> could change in the future, causing a bug as I wrote it.
> 
>> I hope danpb chimes in on this one, as he is more of an expert when it
>> comes to the locking rules in virnet*.
>>   
> 
> Agreed.  I think there is a better fix here.
> 
> Thanks,
> Jim

I've met this (or is it just similar) issue thousand times. When we are
processing an API and we are just in monitor when SIGINT is received.

I've started to work on patch but never had enough will and time to
finish it. Basically, our _virStateDriver struct misses notifyCleanup
which would prepare driver for cleanup (in case of qemu driver close all
qemu monitors making those APIs hanging around go away). This cannot be
done in cleanup because all virNetServerProgram's are freed already,
client sockets are closed, when entering cleanup. In fact, it's a bit
more complicated than that - if there's an API which was canceled by
this machinery we need to run one iteration of event loop to write
errors to users. But if there's no such API then we must not run event
loop as we would hang indefinitely (nothing to read and nothing to write).

The other approach I was playing with is pthread_cancel() but I guess we
are not ready yet.

Michal