[libvirt] [PATCH 0/9] Resolve libvirtd hang on termination with connected long running client

Daniel P. Berrangé berrange at redhat.com
Fri Jan 26 10:29:49 UTC 2018


On Fri, Jan 26, 2018 at 11:47:04AM +0300, Nikolay Shirokovskiy wrote:
> 
> 
> On 19.01.2018 20:23, John Ferlan wrote:
> > RFC:
> > https://www.redhat.com/archives/libvir-list/2018-January/msg00318.html
> > 
> > Adjustments since RFC...
> > 
> > Patches 1&2: No change, were already R-B'd
> > Patch 3: Removed code as noted in code review, update commit message
> > Patch 4: From old series removed, see below for more details
> > Patch 9: no change
> > NB: Patches 5-8 and 10 from Nikolay Shirokovskiy <nshirokovskiy at virtuozzo.com>
> >     are removed as they seemed to not be necessary
> > 
> > Replaced the former patch 4 with series of patches to (slowly) provide
> > support to disable new connections, handle removing waiting jobs, causing
> > the waiting workers to quit, and allow any running jobs to complete.
> > 
> > As it turns out, waiting for running jobs to complete cannot be done
> > from the virNetServerClose callbacks because that means the event loop
> > processing done during virNetServerRun will not allow any currently
> > long running worker job thread a means to complete.
> 
> Hi, John.
> 
> So you suggest a different appoarch. Instead of introducing means to 
> help rpc threads to finish after event loop is finished (stateShutdown hook in my series)
> you suggest to block futher new rpc processing and then waiting running
> rpc calls to finish keeping event loop running for that purpose.
> 
> This could lead to libvirtd never finish its termination if one of
> qemu processes do not respond for example. In case of long running
> operation such as migration finishing can take much time. On the 
> over hand if we finish rpc threads abruptly as in case of stateShutdown hook
> we will deal with all possible error paths on daemon finishing. May
> be good approach is to abort long running operation, then give rpc threads
> some time to finish as you suggest and only after that finish them abruptly
> to handle hanging qemu etc.

We should keep in mind why we are bothering todo any of this "graceful"
cleanup. 99% of the time it would be fine for libvirtd to just immediately
immediately exit and not run any cleanup code, since the OS cleans everything
up regardless. Really the only benefit of running cleanup is so that developers
can check for memory leaks across the process execution.

On balance it is *way* more important that libvirtd is guranteed to exit in
a short, finite amount of time, even if that means skipping cleanup, because
that is what is relevant to production deployment.

So this means while it is nice to wait for worker threads to complete their
current RPC option, we should not wait too long. eg Give them 15 seconds to
finish their work, and if they're not done then just unconditionally exit
libvirtd, possibly skipping remaining cleanup if that's neccessary to avoid
SEGV.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




More information about the libvir-list mailing list