[libvirt] deadlock in remoteDispatchDomainUndefine vs daemonStreamHandleAbort

I happened to analyze a bug report [1] I got from a friend, and for
quite a while it was rather elusive. But I have now finally got it
reproducible enough [2] to share it with the community.

The TL;DR of what I see is:
- an automation using python-libvirt gets a SIGINT
- its cleanup runs destroy and then undefine
- the guest closes FDs due to the SIGINT and/or the destroy, which
  triggers daemonStreamHandleAbort
- the undefine and the stream abort then fight over the lock

That gets libvirtd into a deadlock, which ends up with all threads
dead [4] and two of them in particular fighting over the lock [3] (details).

The two related stacks, summarized, look like this:

daemonStreamHandleWrite (failing to write)
 -> daemonStreamHandleAbort (closing things and cleaning up)
    -> ... virChrdevFDStreamCloseCb

# there is code meant to avoid such issues, emitting "Unable to close"
# if a lock is held
# but the log doesn't show this triggering with debug enabled

#10 seems triggered via an "undefine" call
  ... -> virChrdevFree
     ... -> virFDStreamSetInternalCloseCb
        -> virObjectLock(virFDStreamDataPtr fdst)
          -> virMutexLock(&obj->lock);
  # closing all streams of a guest (requiring the same locks)
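
To illustrate the kind of pattern these two stacks suggest, here is a
minimal sketch in plain Python. It is not libvirt code. It assumes (the
traces above do not prove this) that the abort path holds the stream lock
and then wants a chardev lock, while the undefine path holds the chardev
lock and then wants the stream lock. The names chrdev_lock, stream_lock
and run_abba are hypothetical stand-ins, and timeouts are used only so
that the demo terminates instead of hanging.

```python
import threading


def run_abba(timeout=0.5):
    """Return the names of the threads that could not get their second lock."""
    # Hypothetical stand-ins for the two locks, not libvirt's real objects.
    chrdev_lock = threading.Lock()
    stream_lock = threading.Lock()
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    blocked = []

    def worker(name, first, second):
        with first:
            # Wait until each thread holds its first lock.
            both_hold_first.wait()
            # With a plain blocking acquire() this would hang forever.
            if second.acquire(timeout=timeout):
                second.release()
            else:
                blocked.append(name)
            # Keep holding the first lock until both attempts are over.
            both_tried_second.wait()

    threads = [
        # Assumed order on the abort path: stream lock, then chardev lock.
        threading.Thread(target=worker, args=("abort", stream_lock, chrdev_lock)),
        # Assumed order on the undefine path: chardev lock, then stream lock.
        threading.Thread(target=worker, args=("undefine", chrdev_lock, stream_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(blocked)


if __name__ == "__main__":
    print(run_abba())
```

If that assumption holds, neither side can make progress, which would
match two threads waiting on each other in the backtraces.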

While that already feels quite close, I struggle to see where exactly
we'd want to fix it.
But now that I finally have a repro script [2], I hope that someone else
here might be able to help me with that.

After all it is a race: on my s390x system it usually triggers in fewer
than 5 tries, while on x86 I have needed up to 18 runs of the test to get
a hang. Depending on your system config it might be better or worse for you.

FYI we hit this with libvirt 4.0 initially but libvirt 5.0 was just the same.
I haven't built 5.1 or a recent master, but the commits since 5.0
didn't mention any issue that seems related. OTOH I'm willing and able
to build and try suggestions if anyone comes up with ideas.

[1]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096
[2]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/+attachment/5251655/+files/test4.py
[3]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/3
[4]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/17

Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd
