[libvirt] [RFC] Add API to change qemu agent response timeout

Daniel P. Berrangé berrange at redhat.com
Fri Oct 4 09:13:11 UTC 2019


On Fri, Oct 04, 2019 at 08:14:39AM +0200, Peter Krempa wrote:
> On Thu, Oct 03, 2019 at 16:52:12 -0500, Jonathon Jongsma wrote:
> > Some layered products such as oVirt have requested a way to avoid being
> > blocked by guest agent commands when querying a loaded vm. For example,
> > many guest agent commands are polled periodically to monitor changes,
> > and rather than blocking the calling process, they'd prefer to simply
> > time out when an agent query is taking too long.
> > 
> > This patch adds a way for the user to specify a custom agent timeout
> > that is applied to all agent commands.
> > 
> > One special case to note here is the 'guest-sync' command. 'guest-sync'
> > is issued internally prior to calling any other command. (For example,
> > when libvirt wants to call 'guest-get-fsinfo', we first call
> > 'guest-sync' and then call 'guest-get-fsinfo').
> > 
> > Previously, the 'guest-sync' command used a 5-second timeout
> > (VIR_DOMAIN_QEMU_AGENT_COMMAND_DEFAULT), whereas the actual command that
> > followed always blocked indefinitely
> > (VIR_DOMAIN_QEMU_AGENT_COMMAND_BLOCK). As part of this patch, if a
> > custom timeout is specified that is shorter than
> > 5 seconds,  this new timeout also used for 'guest-sync'. If there is no
> > custom timeout or if the custom timeout is longer than 5 seconds, we
> > will continue to use the 5-second timeout.
> > 
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1705426 for additional details.
> > 
> > Signed-off-by: Jonathon Jongsma <jjongsma at redhat.com>


> > @@ -2737,3 +2730,19 @@ qemuAgentGetTimezone(qemuAgentPtr mon,
> >  
> >      return 0;
> >  }
> > +
> > +int
> > +qemuAgentSetTimeout(qemuAgentPtr mon,
> > +                    int timeout)
> > +{
> > +    if (timeout < VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN) {
> > +        virReportError(VIR_ERR_INVALID_ARG,
> > +                       _("guest agent timeout '%d' is "
> > +                         "less than the minimum '%d'"),
> > +                       timeout, VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN);
> 
> This error is misleading as -1 and -2 are special values and not actual
> timeout.
> 
> > +        return -1;
> > +    }
> > +
> > +    mon->timeout = timeout;
> > +    return 0;
> > +}
> 
> [...]
> 
> > diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> > index 1e041a8bac..09251cc9e2 100644
> > --- a/src/qemu/qemu_driver.c
> > +++ b/src/qemu/qemu_driver.c
> > @@ -23434,6 +23434,29 @@ qemuDomainGetGuestInfo(virDomainPtr dom,
> >      return ret;
> >  }
> >  
> > +static int
> > +qemuDomainQemuAgentSetTimeout(virDomainPtr dom,
> > +                              int timeout)
> > +{
> > +    virDomainObjPtr vm = NULL;
> > +    qemuAgentPtr agent;
> > +    int ret = -1;
> > +
> > +    if (!(vm = qemuDomObjFromDomain(dom)))
> > +        goto cleanup;
> > +
> > +    if (virDomainQemuAgentSetTimeoutEnsureACL(dom->conn, vm->def) < 0)
> > +        goto cleanup;
> > +
> > +    agent = qemuDomainObjEnterAgent(vm);
> 
> You must acquire a job on @vm if you want to call this:
> 
> /*
>  * obj must be locked before calling
>  *
>  * To be called immediately before any QEMU agent API call.
>  * Must have already called qemuDomainObjBeginAgentJob() or
>  * qemuDomainObjBeginJobWithAgent() and checked that the VM is
>  * still active.
>  *
>  * To be followed with qemuDomainObjExitAgent() once complete
>  */
> qemuAgentPtr
> qemuDomainObjEnterAgent(virDomainObjPtr obj)
> 
> 
> > +    ret = qemuAgentSetTimeout(agent, timeout);
> > +    qemuDomainObjExitAgent(vm, agent);
> 
> Also this API is inherently racy if you have two clients setting the
> timeout and it will influence calls of a different client.
> 
> IMO the only reasonable approach is to add new APIs which have a
> 'timeout' parameter for any agent API which requires tunable timeout to
> prevent any races and unexpected behaviour.
> 
> Other possibility may be to add a qemu config file option for this but
> that is not really dynamic.

I guess the key question is whether we actually have a compelling
use case to vary the timeout on a per-command basis.

If not, then we could do fine with a global config that is either
recorded in the domain XML, or in the global QEMU config.

The possible causes of slowness are

 - Host is overloaded so the guest is not being scheduled
 - Guest is overloaded so the agent is not being scheduled
 - The agent has crash/deadlocked
 - The agent is maliciously not responding
 - The command genuinely takes a long time to perform its action

The first 4 of those are fine with a global timeout either on guest or
in the driver.

Only the last one really pushes for having this per-public API.

Looking commands, ones I think can take a long time are

 - FS trim - this can run for many minutes if there's alot to trim
 - FS freeze - this might need to wait for apps to quiesce their I/O IIUC
 - Guest exec - it can run any command

The latter isn't used in libvirt, can be be run via the QGA passthrough
api in libvirt-qemu.so

So sadly I think we genuinely do have a need for per-commad timeouts,
for at least some of the API calls. I don't think all of them need it
though.

I though we could likely add a qemu.conf setting for the globak timeout,
but then add a timeout to individual APIs in specific cases where we
know they can genuinely take a very long time to execute.

We must also ensure we NEVER block any regular libvirt APIs when a
guest agent comamnd is running.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




More information about the libvir-list mailing list