[libvirt] [PATCH] [Changelog]:If there is a process with a client which registers event callbacks, and it calls libvirt's API which uses the same virConnectPtr in that callback function. When this process exit abnormally lead to client disconnect, there is a possibility that the libvirtd's main thread is use virServerClient to send event just after the virServerClient been freed by job thread. Following is backtrace:

John Ferlan jferlan at redhat.com
Mon Mar 13 22:12:45 UTC 2017


Not sure you followed the exact meaning from the hacking page - that's
an awfully long commit message string. It should be a single line of up
to about 70 chars with "details" in the commit message itself somewhat
like the original patch, e.g.:

"remote: Fix use after free reference for connection pointer"

followed by a "solid summary" for what's changing in the commit message.
There's many examples...

Those details should contain enough information so that someone
reviewing won't need to attempt to entirely replicate the research you
did to discover/diagnose the problem.

For these types of "edge conditions" involving multiple references to
objects and the order of free or something really complicated I'm always
in favor of adding as much detail as you can recall to provide enough of
a frame of reference for the reviewer to be able to paint the same
picture in their mind as you have (now) in yours. Not only is it really
helpful, but it makes review/triage a whole lot easier. In particular
some details how you discovered the bug, how to reproduce, and why you
feel the fix is where you put it. Such as, what setup and path do you
feel closed the connection before and what shreds of evidence are there.

Also your patch and any attached traces should be done against the top
of the git HEAD... I don't believe that's the case here since the line
numbers in your trace don't necessarily match up to the current HEAD.

This isn't an area of the code that I know well, so providing some
specifics would have helped this reader understand the scope of the
problem. I'm still not sure what's being fixed. I can chase the "normal"
close paths, but it's not clear what abnormal path you're attempt to fix.

I did do a bit of research, at first I focused too much on the
virConnectPtr adjustments, but eventually I started digging into the
'client' and callback/event/state queues and I think I made a discovery
described in the stack trace.


On 03/08/2017 01:39 AM, xinhua.Cao wrote:
>     #0  0x00007fda223d66d8 in virClassIsDerivedFrom (klass=0xdeadbeef, parent=0x7fda24c81b40) at util/virobject.c:169
>     #1  0x00007fda223d6a1e in virObjectIsClass (anyobj=anyobj at entry=0x7fd9e575b400, klass=<optimized out>) at util/virobject.c:365
>     #2  0x00007fda223d6a44 in virObjectLock (anyobj=0x7fd9e575b400) at util/virobject.c:317
>     #3  0x00007fda22507f71 in virNetServerClientSendMessage (client=client at entry=0x7fd9e575b400, msg=msg at entry=0x7fd9ec30de90)
>     at rpc/virnetserverclient.c:1422

This is where things fall apart - virNetServerClientSendMessage takes a
'client' pointer of virNetServerClientPtr. It seems by the time you get
to that point the 'client' is already free'd (or virObjectUnref()'d)

However, the code you change in remoteClientFreeFunc (and
remoteClientCloseFunc) takes a daemonClientPrivate, so I'm missing the
connection with the changes you made.

It would seem you need to find the needle in the haystack where the
virNetServerClientPtr is disposed perhaps improperly or too early.
Perhaps some code that doesn't do a virObjectRef() when it assigns the
client to something else.

Code such as in remote.c where the following occurs (multiple times):

    if (VIR_ALLOC(callback) < 0)
        goto cleanup;
    callback->client = client;

Which could be considered similar to how the virConnectPtr is handled in
virObjectEventCallbackListAddID where there's a :

    if (VIR_ALLOC(cb) < 0)
        goto cleanup;
    cb->conn = virObjectRef(conn);

which corresponds to the virObjectUnref()'s done later for conn.

If this is the 'real' fix, then

 1. Ensure to virObjectUnref(callback->client) if we fail to append or
register the callback

 2. Modify remoteEventCallbackFree to virObjectUnref(calback->client)
the client (of course you'll need to define variable such as
"daemonClientEventCallbackPtr callback = opaque;" too.


In any case, since I'd already done a review pass on the code before the
type A engineer in me kicked in to try and find out what the problem was
- I'll leave that in the following sections. Perhaps it'll help you out
in future libvirt patch submissions....

>     #4  0x00007fda230d714d in remoteDispatchObjectEventSend (client=0x7fd9e575b400, program=0x7fda24c844e0, procnr=348,
>     proc=0x7fda2310e5e0 <xdr_remote_domain_event_callback_tunable_msg>, data=0x7ffc3857fdb0) at remote.c:3803
>     #5  0x00007fda230dd71b in remoteRelayDomainEventTunable (conn=<optimized out>, dom=0x7fda27cd7660, params=0x7fda27f3aae0, nparams=1,opaque=0x7fd9e6c99e00) at remote.c:1033
>     #6  0x00007fda224484cb in virDomainEventDispatchDefaultFunc (conn=0x7fda27cd0120, event=0x7fda2736ea00,
>     cb=0x7fda230dd610 <remoteRelayDomainEventTunable>, cbopaque=0x7fd9e6c99e00) at conf/domain_event.c:1910
>     #7  0x00007fda22446871 in virObjectEventStateDispatchCallbacks (callbacks=<optimized out>, callbacks=<optimized out>, event=0x7fda2736ea00,state=0x7fda24ca3960) at conf/object_event.c:722
>     #8  virObjectEventStateQueueDispatch (callbacks=0x7fda24c65800, queue=0x7ffc3857fe90, state=0x7fda24ca3960) at conf/object_event.c:736
>     #9  virObjectEventStateFlush (state=0x7fda24ca3960) at conf/object_event.c:814
>     #10 virObjectEventTimer (timer=<optimized out>, opaque=0x7fda24ca3960) at conf/object_event.c:560
>     #11 0x00007fda223ae8b9 in virEventPollDispatchTimeouts () at util/vireventpoll.c:458
>     #12 virEventPollRunOnce () at util/vireventpoll.c:654
>     #13 0x00007fda223ad1d2 in virEventRunDefaultImpl () at util/virevent.c:314
>     #14 0x00007fda225046cd in virNetDaemonRun (dmn=0x7fda24c775c0) at rpc/virnetdaemon.c:818
>     #15 0x00007fda230d6351 in main (argc=<optimized out>, argv=<optimized out>) at libvirtd.c:1623
> 
>     Let's clean all event callback when client close.
> 
>     Signed-off-by: xinhua.Cao <caoxinhua at huawei.com>
> 
>     this patch resolved the bug which reported by Wang King. and the prior patch link is  http://www.redhat.com/archives/libvir-list/2017-February/msg00452.html
> ---
>  daemon/remote.c | 60 +++++++++++++++++++++++++++------------------------------
>  1 file changed, 28 insertions(+), 32 deletions(-)
> 

While this does apply with git am -3, it didn't with just 'git am' which
says to me you're not sending a patch based on git HEAD. It's not like
daemon/remote.c changes that much - although Jan 5 there was a commit
which certain alters the line numbers above by a bit...


> diff --git a/daemon/remote.c b/daemon/remote.c
> index 4aa43c223..22330307f 100644
> --- a/daemon/remote.c
> +++ b/daemon/remote.c
> @@ -100,6 +100,7 @@ static void make_nonnull_node_device(remote_nonnull_node_device *dev_dst, virNod
>  static void make_nonnull_secret(remote_nonnull_secret *secret_dst, virSecretPtr secret_src);
>  static void make_nonnull_nwfilter(remote_nonnull_nwfilter *net_dst, virNWFilterPtr nwfilter_src);
>  static void make_nonnull_domain_snapshot(remote_nonnull_domain_snapshot *snapshot_dst, virDomainSnapshotPtr snapshot_src);
> +static void remoteClientCleanEventCallbacks(struct daemonClientPrivate *priv);

No need for a forward declaration in this case since the static function
would be declared first.

>  
>  static int
>  remoteSerializeDomainDiskErrors(virDomainDiskErrorPtr errors,
> @@ -1389,24 +1390,15 @@ void remoteRelayConnectionClosedEvent(virConnectPtr conn ATTRIBUTE_UNUSED, int r
>                                    &msg);
>  }
>  
> -/*
> - * You must hold lock for at least the client
> - * We don't free stuff here, merely disconnect the client's
> - * network socket & resources.
> - * We keep the libvirt connection open until any async
> - * jobs have finished, then clean it up elsewhere
> - */
> -void remoteClientFreeFunc(void *data)
> +static void remoteClientCleanEventCallbacks(struct daemonClientPrivate *priv)

The "newer" style of function would be:

static void
remoteClientCleanEventCallbacks(struct daemonClientPrivate *priv)
{
}

If you find this important, then doing something like this should have
been a separate patch prior to the one that you fixes the bug seen...
That is one that does nothing more than create a separate function or
perhaps in this case a C macro do obfuscate those repeated lines for
each event list.

We call these code motion or refactoring patches. They need a very short
summary just indicating that you're refactoring code.

>  {
> -    struct daemonClientPrivate *priv = data;
> -
>      /* Deregister event delivery callback */
> -    if (priv->conn) {
> -        virIdentityPtr sysident = virIdentityGetSystem();
> +    if (priv && priv->conn) {

This type of check could be in the caller(s)... That would alter your
indention too...

>          size_t i;
> -
> +        virIdentityPtr sysident = virIdentityGetSystem();
>          virIdentitySetCurrent(sysident);
>  
> +        virMutexLock(&priv->lock);

I can certainly see why it's felt this might be important, but since
each loop calls virConnectDomainEventDeregisterAny which ends up locking
in the remote* functions, I'm not sure they're necessary. I would be
concerned over deadlock once the vir* function is called.

>          for (i = 0; i < priv->ndomainEventCallbacks; i++) {
>              int callbackID = priv->domainEventCallbacks[i]->callbackID;
>              if (callbackID < 0) {
> @@ -1420,6 +1412,7 @@ void remoteClientFreeFunc(void *data)
>                  VIR_WARN("unexpected domain event deregister failure");
>          }
>          VIR_FREE(priv->domainEventCallbacks);
> +        priv->ndomainEventCallbacks = 0;

You've added these "= 0" for domain, network, and qemu events, but not
the others. Any special reason why the others were not included?

Regardless as I look through the code - register and deregister
functions seem to properly incr/decr the value - so I'm not sure this is
necessary especially since the 'priv' gets VIR_FREE()'d shortly.

FWIW: If there's a path you find that doesn't decr properly, then that
would need to be a separate patch.

>  
>          for (i = 0; i < priv->nnetworkEventCallbacks; i++) {
>              int callbackID = priv->networkEventCallbacks[i]->callbackID;
> @@ -1435,21 +1428,7 @@ void remoteClientFreeFunc(void *data)
>                  VIR_WARN("unexpected network event deregister failure");
>          }
>          VIR_FREE(priv->networkEventCallbacks);
> -
> -        for (i = 0; i < priv->nstorageEventCallbacks; i++) {
> -            int callbackID = priv->storageEventCallbacks[i]->callbackID;
> -            if (callbackID < 0) {
> -                VIR_WARN("unexpected incomplete storage pool callback %zu", i);
> -                continue;
> -            }
> -            VIR_DEBUG("Deregistering remote storage pool event relay %d",
> -                      callbackID);
> -            priv->storageEventCallbacks[i]->callbackID = -1;
> -            if (virConnectStoragePoolEventDeregisterAny(priv->conn,
> -                                                        callbackID) < 0)
> -                VIR_WARN("unexpected storage pool event deregister failure");
> -        }
> -        VIR_FREE(priv->storageEventCallbacks);

?? What happened to the storage event callbacks?  They cannot just
disappear.

> +        priv->nnetworkEventCallbacks = 0;
>  
>          for (i = 0; i < priv->nqemuEventCallbacks; i++) {
>              int callbackID = priv->qemuEventCallbacks[i]->callbackID;
> @@ -1465,28 +1444,45 @@ void remoteClientFreeFunc(void *data)
>                  VIR_WARN("unexpected qemu monitor event deregister failure");
>          }
>          VIR_FREE(priv->qemuEventCallbacks);
> +        priv->nqemuEventCallbacks = 0;
>  
>          if (priv->closeRegistered) {
>              if (virConnectUnregisterCloseCallback(priv->conn,
>                                                    remoteRelayConnectionClosedEvent) < 0)
>                  VIR_WARN("unexpected close callback event deregister failure");
>          }
> -
> -        virConnectClose(priv->conn);
> -
> +        virMutexUnlock(&priv->lock);
>          virIdentitySetCurrent(NULL);
>          virObjectUnref(sysident);
>      }
> +}
> +
> +
> +/*
> + * You must hold lock for at least the client
> + * We don't free stuff here, merely disconnect the client's
> + * network socket & resources.
> + * We keep the libvirt connection open until any async
> + * jobs have finished, then clean it up elsewhere
> + */
> +void remoteClientFreeFunc(void *data)

And this would be:

void
remoteClientFreeFunc(void *data)

> +{
> +    struct daemonClientPrivate *priv = data;
> +    if (!priv || !priv->conn)
> +        return;

If I check the caller from a "normal" path:

   virEventPollRunOnce =>
    virEventPollCleanupHandles =>
     virNetSocketEventFree =>
      virObjectUnref =>
       virNetServerClientDispose

I see that virNetSocketEventFree would set sock->opaque to NULL, so what
other path(s) would call into here where perhaps the *data argument
wouldn't be reset to NULL.

IOW: How would data == NULL at this point?


> +
> +    remoteClientCleanEventCallbacks(priv);
> +    virConnectClose(priv->conn);
>  
>      VIR_FREE(priv);
>  }
>  
> -

Policy is two blank lines between functions, so don't remove this one.

>  static void remoteClientCloseFunc(virNetServerClientPtr client)
>  {
>      struct daemonClientPrivate *priv = virNetServerClientGetPrivateData(client);
>  
>      daemonRemoveAllClientStreams(priv->streams);
> +    remoteClientCleanEventCallbacks(priv);

Putting this here is what led me down the path of why would moving this
make a difference. Since remoteClientFreeFunc is essentially called
after this (in the normal path at least) - sure moving the cleanup here
could be a solution, but it could also be masking the "root cause".
Besides, this code has been this way for *ages* - changing something
like this always requires extra scrutiny as I have found out through
much trial and error that there's a reason for "current" format.

In this case, there's processing in virNetDaemonRun between the
virHashForEach for the daemonServerProcessClients (which could call this
function) and the virEventRunDefaultImpl/virEventPollRunOnce code first
call to virEventPollCleanupHandles which ends up calling
virNetServerClientDispose. Something tells me that processing is
important; otherwise, when first developed the event cleanup would have
been put here (see commit id 'df0b57a95').

In any case, in light of the virObjectRef(client) - I don't think this
is necessary.

Looking forward to the next series of patches...

John

>  }
>  
>  
> 




More information about the libvir-list mailing list