[libvirt] RFC: exposing qemu's block-set-write-threshold

Tue May 19 12:42:16 UTC 2015

On 05/19/2015 05:52 AM, Peter Krempa wrote:
> On Mon, May 18, 2015 at 14:28:09 -0600, Eric Blake wrote:
>> I'm trying to wire up libvirt to do event reporting based on qemu 2.3's
>> BLOCK_WRITE_THRESHOLD event. Doing this will allow management
>> applications to do event-based notification on when to enlarge LVM (or
>> other) storage underlying a qcow2 volume, rather than their current
>> requirement to frequently poll block statistics.  But I'm stuck on the
>> best way to expose the new parameter:
>>
>> One idea is to treat it as part of the domain XML, and have
>> virDomainSetBlockIoTune add one more typed parameter for a disk's
>> current write threshold.  Doing this could allow setting a threshold
> 
> Since virDomainSetBlockIoTune operates on disk-level and the event will
> need to be registered on a backing-chain element level, using
> virDomainSetBlockIoTune won't be a good choice, IMO.

Oh, I wasn't even thinking about backing-chain levels.  You are right -
it is entirely possible for someone to want to set a threshold reporting
for "vda[1]" (the backing element of "vda") - of course, it would only
kick in during a blockcommit to vda[1], but it is still important to
allow.  But the BlockIoTune XML is not (yet) geared for backing images.

On the other hand, it might make sense to allow BlockIoTune on backing
chains - for a difference in throttling between the main image and its
backing image.  That is, I could possibly see a case where a local image
is based on top of a network backing file, and where we want to read the
local image with no throttling, but read the backing file with rate
limiting in effect to avoid saturating the network; in such a setup, the
user is likely going to do a blockpull to move data off the network onto
the local copy, but doesn't want the pull to affect performance.  Or
conversely, someone could have a setup where the backing file has no
rate limit, but the active file is rate-limited (and thus the guest
performs faster the closer it is to the original backing file, as a way
of measuring how much the guest differs from the golden image).  Of
course, we're still waiting for per-node throttling to land in qemu:
https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg01196.html

> 
>> even for an offline domain (although the threshold is only meaningful
>> for a running domain), but might get weird because qemu's event is
>> one-shot (you have to re-arm a new threshold every time an existing
>> threshold fires - so every time it fires, the domain XML is rewritten,
>> even though it is not guest-visible ABI that was changing).  At least
> 
> Having the configuration exposed in the XML might make sense. Although
> currently it won't allow to specify them for anything but the top image,
> which will require users of libvirt to register the event additionally
> to the backing chain elements. Since the allocation of the backing chain
> elements can change only once a block job is started it should be safe
> enough to allow that only using the API and once libvirt will track the
> full backing chain we can use the XML config too.
> 
>> with this approach, it is also easy for a client to poll the current
>> setting of the threshold, via virDomainGetBlockIoTune.  But the
>> threshold isn't quite a tuning parameter (it isn't throttling how fast
>> the guest can write to the block device, only how full the host side can
>> get in order to allow transparent resizing of the host storage prior to
>> running out of space).
> 
> Again, virDomainGetBlockIoTune won't work on individual elements and
> using it that way would also be impractical.
> 

Okay, I'm fairly convinced that reusing virDomainGetBlockIoTune is not
the right approach, and therefore adding a new API (and requiring the
.so bump) is going to be required.

>>
>> Another idea is to add a completely new API, maybe named
>> virDomainBlockSetWriteThreshold(virDomainPtr dom, const char *disk, long
>> long int threshold, unsigned int flags) (with threshold in bytes).
> 
> Why is the treshold a signed value? I can't imagine a use case where
> negative values could be used.

I've swapped to unsigned long long in my current patches, but see below [1].

> 
> As for the @disk parameter it will need to take the target with the
> index argument since I know that oVirt is using the same approach also
> for backing chain sub-elements hosted on LVM when doing snapshot
> merging via the block job APIs.

Good thing we already have code for resolving backing chain index in
disk names.

> 
> This also implies another required thing for this to be actually usable.
> Since the block jobs happening on the backing chain can trigger the
> event on a member of the backing chain, the returned event will need to
> contain the disk identification in a way that is unique across backing
> chain alterations.

Right now, I'm planning on the event looking like:

typedef void (*virConnectDomainEventWriteThresholdCallback)
  (virConnectPtr conn,
   virDomainPtr dom,
   const char *devAlias,
   unsigned long long threshold,
   unsigned long long length,
   void *opaque);

Remember, the event callback can only be registered once per domain, so
it HAS to include disk information (whether "vda" or "vdb" at the top
level), and it is not that much harder to make it include indexed disk
information "vda[1]" if the event triggered to due a commit to a backing
file.

> 
> While for local files we could again opt to use the path this won't be
> scalable to non-local devices. Thus I think the bes way will be to
> include also the disk target with index.

You are right that the event must be tied to the disk alias and not the
path name, as not all disks have a unique local path name.

> 
> This though will require using node-names for tracking and/or generating
> the indexes in the backing store in a deterministic way.
> 
>> Here, virDomainBlockStatsFlags() could be a way to query the current
>> threshold.  And if desired, we could add a flag value to treat thresholda
> 
> virDomainBlockStatsFlags does not operate on backing chain subelements,
> so it would need to be instrumented to do so.

Hmm, more work.  At least that work is not going to affect .so
versioning, like the new API for actually setting the threshold will
have to do.

> 
>> as a percentage instead of a byte value (but is 1% too large of
>> granularity, and how would you scale the percentage to anything finer
>> while still keeping the parameter as long long int rather than double?)
> 
> You can use a proportional unit with a larger fractional part: promile,
> parts per million, parts per billion etc.

[1] Indeed, we can add more and more flags, but we'll see if it makes
sense.  It might also be nice to allow a negative threshold, as in
setting a threshold of -1*1024*1024*1024 to trigger when the disk
reaches within 1 gigabyte of space, regardless of how many gigabytes it
currently contains (easier than calling virDomainBlockInfo and doing the
computation myself).  That could be done by allowing a signed threshold,
or by keeping threshold unsigned but adding a flag that says the
threshold is relative to the tail of the file rather than the beginning.

But adding flags can be done later; my first implementation will not
define any flags (bytes only, no percentage or relative-to-end values).

> 
>>
>> Of course, I'd want virConnectGetAllDomainStats() to list the current
>> threshold setting (0 if no threshold or if the event has already fired,
>> non-zero if the threshold is still set waiting to fire), so that clients
>> can query thresholds for multiple domains and multiple disks per domain
>> in one API call.  But I don't know if we have any good way to set
> 
> Not only disks but for separate backing chain elements too.
> 

Thankfully GetAllDomainStats is already wired to report backing chain
element details.

>> multiple thresholds in one call (at least virDomainSetBlockIoTune must
>> be called once per disk; it might be possible for my proposed
>> virDomainBlockStatsFlags() to set a threshold for multiple disks if the
>> disk name is passed as NULL - but then we're back to the question of
>> what happens if the guest has multiple disks of different sizes; it's
>> better to set per-disk thresholds than to assume all disks must be at
>> the same byte or percentage threshold).
> 
> That is just usage-sugar for the users. I'd rather avoid doing this on
> multiple disks simultaneously.
> 

Good - then I won't worry about it; the new API will make disk name
mandatory.  (Setting to a percentage or to a relative-to-tail might make
more sense across multiple disks, but on the other hand, setting a
threshold will be a rare thing; and while first starting the domain has
to set a threshold on all disks, later re-arming of the trigger will be
on one disk at a time as events happen; making the startup case more
efficient is not going to be the bottleneck in management).

>>
>> I'm also worried about what happens across libvirtd restarts - if the
>> qemu event fires while libvirtd is unconnected, should libvirt be
>> tracking that a threshold was registered in the XML, and upon
>> reconnection check if qemu still has the threshold?  If qemu no longer
>> has a threshold, then libvirt can assume it missed the event, and
>> generate one as part of reconnecting to the domain.
> 
> Libvirt should have enough information to actually check if the event
> happened and should be able to decide that it in fact missed the event
> and it should be emitted by libvirt.
> 
> The new block copy API should also add a new typed parameter config that
> will allow to set the write treshold once you are using it in a similar
> way with a LV as backing.

Ah, as in arm a threshold in the same API that starts a block job.
Makes sense.  But won't require a .so bump, so doesn't have to be done
in my first posting of the series.

> 
> Peter
> 

Thanks for the ideas; now for me to crank out my proof-of-concept code.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 604 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20150519/63399949/attachment-0001.sig>