[libvirt] [RFC] Introduce API for retrieving bulk domain stats

Tue Aug 19 14:24:51 UTC 2014

----- Original Message -----
> From: "Peter Krempa" <pkrempa at redhat.com>
> To: libvir-list at redhat.com
> Cc: "Peter Krempa" <pkrempa at redhat.com>
> Sent: Tuesday, August 19, 2014 3:14:19 PM
> Subject: [libvirt] [RFC] Introduce API for retrieving bulk domain stats
> 
> I'd like to propose a (hopefully) fairly future-proof API to retrieve
> various statistics for domains.

Hi,

Speaking for VDSM/oVirt, the proposal looks really nice and serves our needs well.
Some specific points:

> The motivation is that management layers that use libvirt usually poll
> libvirt for statistics using various split up APIs we currently provide.
> To get all the necessary stuff, the mgmt app need to issue Ndomains *
> Napis calls and cope with the various returned formats. The APIs I'm
> wanting to introduce here will:
> 
> 1) Return data in a format that we can expand in the future and is
> hierarchical. For starters I'll use XML, with possible expansion to
> something like JSON if it will be favourable for a consumer (switchable
> by a flag)

awesome

> 2) Stats for multiple (all) domains can be queried at once and are
> returned in one call. This will allow to decrease the overhead necessary
> to issue multiple calls per domain multiplied by the count of domains.

We had (and still have) a lot of pain from a specific scenario in which
a VM becomes unresponsive, 99% of the time because QEMU gets stuck, likely
on I/O (please remember that oVirt supports more storage types than just NFS,
iSCSI to name just one, so a soft mount is not always the solution...).

We then need a timeout or a way to signal that some VMs are not responding.

Moreover, if we have N VMs and M of them are not responding (with M <= N, of
course), it would be cool to have a timeout *not* proportional to M... We'd like
to avoid waiting M * timeout seconds before learning that some of them are failing :)

Most importantly, the call should somehow report *all* the failed VMs.

Let me try to summarize. Let's say we have 10 VMs (0-9), of which VMs 3, 4, 7 and 9
are failing (N=10, M=4). We'd like to wait less than M * timeout = 4 * timeout
seconds and, maybe most importantly, we need to know that all of the above have
failed, not just one of them (maybe the first).
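
To make the arithmetic concrete, here is a minimal sketch in Python (not
libvirt code) of the behaviour described above: all VMs are queried
concurrently under one shared deadline, so the total wait is roughly one
timeout rather than M * timeout, and every VM that misses the deadline is
reported. The names query_vm and collect_stats, and the sleep that stands in
for a stuck QEMU, are made up for illustration.

```python
import concurrent.futures
import time


def query_vm(vm_id, stuck, hang_seconds):
    """Hypothetical stand-in for a per-VM stats call; stuck VMs just sleep."""
    if vm_id in stuck:
        time.sleep(hang_seconds)
    return {"vm": vm_id, "state": "ok"}


def collect_stats(vm_ids, stuck, timeout, hang_seconds):
    """Query all VMs concurrently under one shared deadline.

    Returns (stats, failed): stats maps vm_id -> result for the VMs that
    answered in time, failed is the sorted list of VMs that did not.
    """
    vm_ids = list(vm_ids)
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(vm_ids)))
    futures = {pool.submit(query_vm, vm, stuck, hang_seconds): vm for vm in vm_ids}
    # One wait for the whole batch: the deadline does not grow with M.
    done, not_done = concurrent.futures.wait(futures, timeout=timeout)
    # Do not block on the stuck workers; they finish on their own later.
    pool.shutdown(wait=False)
    stats = {futures[f]: f.result() for f in done}
    failed = sorted(futures[f] for f in not_done)
    return stats, failed


if __name__ == "__main__":
    start = time.monotonic()
    stats, failed = collect_stats(range(10), {3, 4, 7, 9},
                                  timeout=0.5, hang_seconds=1.5)
    print("failed:", failed)
    print("elapsed: %.2fs" % (time.monotonic() - start))
```

With the N=10, M=4 example, the failed list should contain all four stuck VMs
while the elapsed time stays close to a single timeout.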

The reason is that our management app, VDSM, needs to report all the unresponsive VMs.

Maybe an entry in the XML data for each unresponsive VM would be OK.
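
For illustration only, such an entry might look like the sketch below. The
format does not exist yet: the element and attribute names (domainstats,
domain, state="unresponsive") are invented here and are not part of the
proposal. The point is only that a consumer could pick out all failed VMs
from a single reply.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply: element and attribute names are invented for illustration.
REPLY = """
<domainstats>
  <domain name="vm0" state="ok"><cpu time="1200"/></domain>
  <domain name="vm3" state="unresponsive"/>
  <domain name="vm4" state="unresponsive"/>
  <domain name="vm5" state="ok"><cpu time="800"/></domain>
</domainstats>
"""


def unresponsive_domains(xml_text):
    """Return the names of all domains flagged as unresponsive in the reply."""
    root = ET.fromstring(xml_text.strip())
    return [d.get("name") for d in root.iter("domain")
            if d.get("state") == "unresponsive"]


if __name__ == "__main__":
    print(unresponsive_domains(REPLY))
```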

> 3) Selectable (bit mask) fields in the returned format. This will allow
> to retrieve only specific stats according to the APPs need.

awesome as well

[...]
> Initially the implementation will introduce the option to retrieve
> block, interface  and cpu stats with the possibility to add more in the
> future.

I filed a list of APIs relevant for VDSM here:
https://bugzilla.redhat.com/show_bug.cgi?id=1113116#c1

Turns out that the list could be narrowed down to

virDomainGetBlockInfo <- for the highest sector of a block
virDomainGetInfo <- for balloon stats
virDomainGetCPUStats
virDomainBlockStatsFlags
virDomainInterfaceStats
virDomainGetVcpusFlags

(I will update the BZ soon)
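
As a rough sketch of what polling with these APIs costs today, the Python
below issues one call per API (and per disk or NIC) for each domain. The
method names follow the libvirt Python bindings as I understand them, which
is an assumption; FakeDomain is a made-up stand-in that only counts calls, so
the sketch runs without libvirt.

```python
def poll_domain(dom, disks, nics):
    """Collect stats for one domain, one call per API per device."""
    stats = {
        "info": dom.info(),            # virDomainGetInfo
        "cpu": dom.getCPUStats(True),  # virDomainGetCPUStats (total)
        "vcpus": dom.vcpusFlags(0),    # virDomainGetVcpusFlags
        "block": {},
        "net": {},
    }
    for disk in disks:
        stats["block"][disk] = {
            "info": dom.blockInfo(disk),         # virDomainGetBlockInfo
            "stats": dom.blockStatsFlags(disk),  # virDomainBlockStatsFlags
        }
    for nic in nics:
        stats["net"][nic] = dom.interfaceStats(nic)  # virDomainInterfaceStats
    return stats


class FakeDomain:
    """Hypothetical stand-in: every method call is counted and returns {}."""

    def __init__(self):
        self.calls = 0

    def __getattr__(self, name):
        def method(*args, **kwargs):
            self.calls += 1
            return {}
        return method


if __name__ == "__main__":
    domains = [FakeDomain() for _ in range(10)]
    for dom in domains:
        poll_domain(dom, disks=["vda"], nics=["vnet0"])
    print("total calls:", sum(dom.calls for dom in domains))
```

With one disk and one NIC per VM this is 6 calls per domain, so 60 calls for
10 domains on every polling cycle; a bulk API would collapse that into one.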

> As this is a first draft and dump of my mind on this subject it may be
> a bit rough, so suggestions are welcome.
> 
> Thanks for looking.

Thanks for the proposal :) I think it is a great step forward.

Thanks and best regards,

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani