[libvirt] [RFC] Add API to change qemu agent response timeout
Daniel P. Berrangé
berrange at redhat.com
Fri Oct 4 09:13:11 UTC 2019
On Fri, Oct 04, 2019 at 08:14:39AM +0200, Peter Krempa wrote:
> On Thu, Oct 03, 2019 at 16:52:12 -0500, Jonathon Jongsma wrote:
> > Some layered products such as oVirt have requested a way to avoid being
> > blocked by guest agent commands when querying a loaded vm. For example,
> > many guest agent commands are polled periodically to monitor changes,
> > and rather than blocking the calling process, they'd prefer to simply
> > time out when an agent query is taking too long.
> >
> > This patch adds a way for the user to specify a custom agent timeout
> > that is applied to all agent commands.
> >
> > One special case to note here is the 'guest-sync' command. 'guest-sync'
> > is issued internally prior to calling any other command. (For example,
> > when libvirt wants to call 'guest-get-fsinfo', we first call
> > 'guest-sync' and then call 'guest-get-fsinfo').
> >
> > Previously, the 'guest-sync' command used a 5-second timeout
> > (VIR_DOMAIN_QEMU_AGENT_COMMAND_DEFAULT), whereas the actual command that
> > followed always blocked indefinitely
> > (VIR_DOMAIN_QEMU_AGENT_COMMAND_BLOCK). As part of this patch, if a
> > custom timeout is specified that is shorter than
> > 5 seconds, this new timeout is also used for 'guest-sync'. If there is no
> > custom timeout or if the custom timeout is longer than 5 seconds, we
> > will continue to use the 5-second timeout.
> >
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1705426 for additional details.
> >
> > Signed-off-by: Jonathon Jongsma <jjongsma at redhat.com>
> > @@ -2737,3 +2730,19 @@ qemuAgentGetTimezone(qemuAgentPtr mon,
> >
> > return 0;
> > }
> > +
> > +int
> > +qemuAgentSetTimeout(qemuAgentPtr mon,
> > + int timeout)
> > +{
> > + if (timeout < VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN) {
> > + virReportError(VIR_ERR_INVALID_ARG,
> > + _("guest agent timeout '%d' is "
> > + "less than the minimum '%d'"),
> > + timeout, VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN);
>
> This error is misleading as -1 and -2 are special values and not actual
> timeouts.
>
> > + return -1;
> > + }
> > +
> > + mon->timeout = timeout;
> > + return 0;
> > +}
>
> [...]
>
> > diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> > index 1e041a8bac..09251cc9e2 100644
> > --- a/src/qemu/qemu_driver.c
> > +++ b/src/qemu/qemu_driver.c
> > @@ -23434,6 +23434,29 @@ qemuDomainGetGuestInfo(virDomainPtr dom,
> > return ret;
> > }
> >
> > +static int
> > +qemuDomainQemuAgentSetTimeout(virDomainPtr dom,
> > + int timeout)
> > +{
> > + virDomainObjPtr vm = NULL;
> > + qemuAgentPtr agent;
> > + int ret = -1;
> > +
> > + if (!(vm = qemuDomObjFromDomain(dom)))
> > + goto cleanup;
> > +
> > + if (virDomainQemuAgentSetTimeoutEnsureACL(dom->conn, vm->def) < 0)
> > + goto cleanup;
> > +
> > + agent = qemuDomainObjEnterAgent(vm);
>
> You must acquire a job on @vm if you want to call this:
>
> /*
> * obj must be locked before calling
> *
> * To be called immediately before any QEMU agent API call.
> * Must have already called qemuDomainObjBeginAgentJob() or
> * qemuDomainObjBeginJobWithAgent() and checked that the VM is
> * still active.
> *
> * To be followed with qemuDomainObjExitAgent() once complete
> */
> qemuAgentPtr
> qemuDomainObjEnterAgent(virDomainObjPtr obj)
>
>
> > + ret = qemuAgentSetTimeout(agent, timeout);
> > + qemuDomainObjExitAgent(vm, agent);
>
> Also this API is inherently racy if you have two clients setting the
> timeout and it will influence calls of a different client.
>
> IMO the only reasonable approach is to add new APIs which have a
> 'timeout' parameter for any agent API which requires tunable timeout to
> prevent any races and unexpected behaviour.
>
> Other possibility may be to add a qemu config file option for this but
> that is not really dynamic.

I guess the key question is whether we actually have a compelling
use case to vary the timeout on a per-command basis.

If not, then we could do fine with a global config that is either
recorded in the domain XML, or in the global QEMU config.

The possible causes of slowness are

 - Host is overloaded so the guest is not being scheduled
 - Guest is overloaded so the agent is not being scheduled
 - The agent has crashed/deadlocked
 - The agent is maliciously not responding
 - The command genuinely takes a long time to perform its action

The first 4 of those are fine with a global timeout either on the guest
or in the driver.

Only the last one really pushes for having this per-public API.

Looking at the commands, the ones I think can take a long time are

 - FS trim - this can run for many minutes if there's a lot to trim
 - FS freeze - this might need to wait for apps to quiesce their I/O IIUC
 - Guest exec - it can run any command

The latter isn't used in libvirt, but it can be run via the QGA
passthrough API in libvirt-qemu.so

So sadly I think we genuinely do have a need for per-command timeouts,
for at least some of the API calls. I don't think all of them need it
though.

I think we could likely add a qemu.conf setting for the global timeout,
but then add a timeout to individual APIs in specific cases where we
know they can genuinely take a very long time to execute.

We must also ensure we NEVER block any regular libvirt APIs when a
guest agent command is running.
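
To make the "global timeout plus per-API override" idea concrete, here is
a minimal sketch of how the effective timeout for one agent command might
be picked. This is not code from the patch: the helper name and the two
constants are hypothetical, and the numeric meaning of the sentinels is an
assumption (the thread only says that -1 and -2 are special values).

```c
#include <assert.h>

/* Hypothetical sentinels. The assignment of -1 and -2 below is an
 * assumption made for illustration only. */
enum {
    AGENT_TIMEOUT_BLOCK = -2,   /* wait indefinitely for the agent */
    AGENT_TIMEOUT_DEFAULT = -1  /* caller gave no value: use global setting */
};

/*
 * Hypothetical helper: choose the timeout (in seconds) for one command.
 *
 * per_call:    value passed to an API that takes its own timeout
 * global_conf: value from a qemu.conf-style global setting
 *
 * A per-call value always wins, unless it is the "default" sentinel,
 * in which case the global setting is used.
 */
static int
agent_effective_timeout(int per_call, int global_conf)
{
    /* Anything below the lowest sentinel is a caller bug. */
    assert(per_call >= AGENT_TIMEOUT_BLOCK);

    if (per_call == AGENT_TIMEOUT_DEFAULT)
        return global_conf;

    return per_call;
}
```

With this shape, a long-running call such as FS trim could pass its own
large value, while APIs that have no timeout parameter would pass the
default sentinel and inherit the global setting.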
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|