libvirt-devaddr: a new library for device address assignment

Laine Stump laine at laine.org
Thu Mar 19 19:00:09 UTC 2020


TL;DR - I'm not as anti-XML as the proposal seems to be, but also not 
pro-XML. I also (after thinking about it) understand the advantage of 
putting this in a separate library. So yeah, let's go for it!

On 3/13/20 6:47 AM, Daniel P. Berrangé wrote:
> On Fri, Mar 13, 2020 at 11:23:44AM +0200, Dan Kenigsberg wrote:
>> On Wed, 4 Mar 2020, 14:51 Daniel P. Berrangé, <berrange at redhat.com> wrote:
>>> We've been doing a lot of refactoring of code in recent times, and also
>>> have plans for significant infrastructure changes. We still need to
>>> spend time delivering interesting features to users / applications.
>>> This mail is to introduce an idea for a solution to a specific
>>> area applications have had long term pain with libvirt's current
>>> "mechanism, not policy" approach - device addressing. This is a way
>>> for us to show brand new ideas & approaches for what the libvirt
>>> project can deliver in terms of management APIs.
>>>
>>> To set expectations straight: I have written no code for this yet,
>>> merely identified the gap & conceptual solution.
>>>
>>>
>>> The device addressing problem
>>> =============================
>>>
>>> One of the key jobs libvirt does when processing a new domain XML
>>> configuration is to assign addresses to all devices that are present.
>>> This involves adding various device controllers (PCI bridges, PCI root
>>> ports, IDE/SCSI buses, USB controllers, etc) if they are not already
>>> present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each
>>> device so they are associated with controllers. When libvirt spawns a
>>> QEMU guest, it will pass full address information to QEMU.
>>>
>>> Libvirt, as a general rule, aims to avoid defining and implementing
>>> policy around expansion of guest configuration / defaults, however, it
>>> is inescapable in the case of device addressing due to the need to
>>> guarantee a stable hardware ABI to make live migration and save/restore
>>> to disk work.  The policy that libvirt has implemented for device
>>> addressing is, as much as possible, the same as the addressing scheme
>>> QEMU would apply itself.
>>>
>>> While libvirt succeeds in its goal of providing a stable hardware ABI,
>>> the addressing scheme used is not well suited to all deployment
>>> scenarios of QEMU. This is an inevitable result of having a specific
>>> assignment policy implemented in libvirt which has to trade off mutually
>>> incompatible use cases/goals.
>>>
>>> When the libvirt addressing policy is not sufficient, management
>>> applications are forced to take on address assignment themselves,
>>> which is a massive non-trivial job with many subtle problems to
>>> consider.
>>>
>>> Places where libvirt's addressing is insufficient for PCI include
>>>
>>>   * Setting up multiple guest NUMA nodes and associating devices to
>>>     specific nodes
>>>   * Pre-emptive creation of extra PCIe root ports, to allow for later
>>>     device hotplug on PCIe topologies
>>>   * Determining whether to place a device on a PCI or PCIe bridge
>>>   * Controlling whether a device is placed into a hotpluggable slot
>>>   * Controlling whether a PCIe root port supports hotplug or not
>>>   * Determining whether to place all devices on distinct slots or
>>>     buses, vs grouping them all into functions on the same slot
>>>   * Ability to expand the device addressing without being on the
>>>     hypervisor host
>> (I don't understand the last bullet point)
> I'm not sure if this is still the case, but at some point in time
> there was a desire from KubeVirt to be able to expand the users'
> configuration when loaded in KubeVirt, filling in various defaults
> for devices. This would run when the end user YAML/JSON config
> was first posted to the k8s API for storage, some arbitrary amount
> of time later the config gets chosen to run on a virtualization
> host at which point it is turned into libvirt domain XML.


If I recall the discussion properly, the context was that we wanted 
KubeVirt to remember all the stuff like PCI addresses, MAC addresses, and 
the exact machinetype that get "backfilled" from libvirt into the 
KubeVirt config, but for them that's a one-way street. So having all 
these things set by a separate API (even in a separate library) would 
definitely be an advantage for them, as long as all the same info was 
available at that time (e.g. you really need to know the machinetypes 
supported by the specific qemu that is going to be used in order to set 
the exact machinetype).



>
>>> Libvirt wishes to avoid implementing many different address assignment
>>> policies. It also wishes to keep the domain XML as a representation
>>> of the virtual hardware, not add a bunch of properties to it which
>>> merely serve as tunable input parameters for device addressing
>>> algorithms.
>>>
>>> There is thus a dilemma here. Management applications increasingly
>>> need fine grained control over device addressing, while libvirt
>>> doesn't want to expose fine grained policy controls via the XML.
>>>
>>>
>>> The new libvirt-devaddr API
>>> ===========================
>>>
>>> The way out of this is to define a brand new virt management API
>>> which tackles this specific problem in a way that addresses all the
>>> problems mgmt apps have with device addressing and explicitly
>>> provides a variety of policy impls with tunable behaviour.
>>>
>>> By "new API", I actually mean an entirely new library, completely
>>> distinct from libvirt.so, or anything else we've delivered so
>>> far.


I was at first against the idea of a completely separate library, since 
each new library means a new package to be maintained and installed. 
However, I do see the advantage of being completely disconnected from 
libvirt, since there may be scenarios where libvirt isn't needed (maybe 
libvirt is on a different host, or maybe something else (libvirt-ng? 
:-P) is being used). Keeping this separate means it can be used in 
other scenarios. So now I agree with this.


>>> The closest we've come to delivering something at this kind
>>> of conceptual level, would be the abortive attempt we made with
>>> "libvirt-builder" to deliver a policy-driven API instead of mechanism
>>> based. This proposal is still quite different from that attempt.
>>>
>>> At a high level
>>>
>>>   * The new API is "libvirt-devaddr" - short for "libvirt device addressing"


It's more than just device addresses though. (On the other hand, a name 
is just a name, so...)


>>>
>>>   * As input it will take
>>>
>>>     1. The guest CPU architecture and machine type


To repeat the point above - do we expect libvirt-devaddr to provide the 
exact machinetype? If so, what will be the mechanism for telling it 
exactly which machinetypes are supported? Will it need to replicate all 
of libvirt's qemu capabilities code? (and would that really work if, 
say, libvirt-devaddr is being used on a machine different from the 
machine where the virtual machine will eventually be run?)


>>>     2. A list of global tunables specifying desired behaviour of the
>>>        address assignment policy
>>>     3. A minimal list of devices needed in the virtual machine, with
>>>        optional addresses and optional per-device tunables to override
>>>        the global tunables
>>>
>>>   * As output it will emit
>>>
>>>     1. fully expanded list of devices needed in the virtual machine,
>>>        with addressing information sufficient to ensure stable hardware ABI


I know you already know it and it's implied in what you say, but just to 
make sure it's clear to anybody else, the "expanded list of devices" 
will also include all PCI (and SCSI and SATA and whatever) controllers 
needed for the entire hierarchy. (Or maybe you said that and I missed 
it. Wouldn't surprise me)

This means that the library will need to know which types of which 
controllers are supported for the machinetype being requested (and of 
course what is supported by each controller). Is it going to query qemu? 
Which qemu - the one on the host where libvirt-devaddr is being called I 
suppose, but that won't necessarily be the same as the host where the 
guest will eventually run.


Will libvirt-devaddr care about things all the way to the level of which 
type of pcie-root-port to use (for example)?


And what about all the odd attributes of various controllers that 
libvirt sets to a default value and then stores in the XML (chassis id, 
etc)? I guess we need to take care of all those as well.
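
Just to make the shape of all this concrete (at least for my own 
thinking), here's a very rough sketch of how the input and output could 
look as Go structs. Every name below is invented - no such code exists 
yet - and the real thing would need bus-specific address types, richer 
tunables, and all of those controller attributes:

// Purely hypothetical sketch - none of these types exist anywhere yet.
package devaddr

// DeviceRequest is the caller's minimal description of one device,
// with an optional fixed address and optional per-device tunables
// that override the global ones.
type DeviceRequest struct {
	Type     string            // e.g. "virtio-net", "virtio-blk"
	Address  *Address          // nil means "assign one for me"
	Tunables map[string]string // e.g. "numa-node": "1", "hotpluggable": "yes"
}

// Address is deliberately vague here; the real thing would need
// bus-specific variants (PCI, USB, SCSI, SATA, ...).
type Address struct {
	Bus      string // "pci", "usb", "scsi", ...
	Location string // e.g. "0000:00:02.0" for PCI
}

// AssignedDevice is one entry in the fully expanded output list,
// which also contains every controller (pcie-root-port, SCSI
// controller, ...) that had to be added to make the addresses valid.
type AssignedDevice struct {
	Type       string
	Address    Address
	Controller bool // true for controllers the library added itself
}

// AssignAddresses takes the architecture and machine type, the global
// tunables, and the minimal device list, and returns the fully
// expanded, fully addressed list.
func AssignAddresses(arch, machine string, tunables map[string]string,
	devices []DeviceRequest) ([]AssignedDevice, error) {
	// ... the interesting part, which doesn't exist yet ...
	return nil, nil
}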


>>>
>>> Initially the API would implement something that behaves the same
>>> way as libvirt's current address assignment API.
>>>
>>> The intended usage would be
>>>
>>>   * Mgmt application makes a minimal list of devices they want in
>>>     their guest
>>>   * List of devices is fed into libvirt-devaddr API
>>>   * Mgmt application gets back a full list of devices & addresses
>>>   * Mgmt application writes a libvirt XML doc using this full list &
>>>     addresses
>>>   * Mgmt application creates the guest in libvirt
>>>
>>> IOW, this new "libvirt-devaddr" API is intended to be used prior to
>>> creating the XML that is used by libvirt. The API could also be used
>>> prior to needing to hotplug a new device to an existing guest.


So everything returned from the original call would need to be kept 
around in that form (or the application would need to be able to 
reproduce it on demand), and that's then fed into the API. I guess this 
could just be the same API - similar to how libvirt acts now, it would 
accept any address info provided, and then assign it wherever it was 
omitted.
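
In terms of the hypothetical sketch above, the hotplug case could then 
just be the same call: feed the previously returned devices back in 
with their addresses filled, plus the new device with no address, and 
only the missing bits get assigned (again, everything here is made up, 
purely to illustrate the idea):

// Hypothetical fragment, assuming the sketched devaddr types above.
existing := []devaddr.DeviceRequest{
	{Type: "virtio-net", Address: &devaddr.Address{Bus: "pci", Location: "0000:00:02.0"}},
	{Type: "virtio-blk", Address: &devaddr.Address{Bus: "pci", Location: "0000:00:03.0"}},
}
// One new device with no address - the only one that should get one assigned.
wanted := append(existing, devaddr.DeviceRequest{Type: "virtio-scsi"})

expanded, err := devaddr.AssignAddresses("x86_64", "pc-q35-4.2", nil, wanted)
if err != nil {
	// handle the error
}
// "expanded" should contain the original devices at their original
// addresses, the new device at a freshly assigned address, plus any
// extra controller that had to be added for it.
_ = expanded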


>>> This API is intended to be a deliverable of the libvirt project, but
>>> it would be completely independent of the current libvirt API. Most
>>> especially note that it would NOT use the domain XML in any way.
>>> This gives applications maximum flexibility in how they consume this
>>> functionality, not trying to force a way to build domain XML.


I was originally going to argue in favor of using the same XML, since we 
otherwise have to convert back and forth. But during the extra long time 
I've taken to think about it, I think I agree that this isn't important, 
especially if the chosen format is as simple as possible.


>> This procedure forces Mgmt to learn a new language to describe device
>> placement. Mgmt (or should I just say "we"?) currently expresses the
>> "minimal list of devices" in XML form and pass it to libvirt. Here we
>> are asked to pass it once to libvirt-devaddr, parse its output, and
>> feed it as XML to libvirt.
> I'm not necessarily suggesting we even need a document format at the
> core API level. I could easily see the API working in terms of a
> list of Go structs, with tunables being normal method parameters.
> A JSON format could be an optional way to serialize the Go structs,
> but if the app were written in Go the JSON may not be needed at all.


"Using JSON when we eventually need XML is just using XML with extra 
steps". Or something like that. Is JSON really that much simpler than XML?


Anyway, since we aren't saddled with the precondition that "everything 
must be stable and backward compatible", there's freedom to experiment, 
so I guess it's not really necessary to spend too much time debating and 
trying to make the "definite 100% sure best decision". We can just pick 
something and try it. If it works out, great; if it doesn't then we pick 
something else :-)


>
>> I believe it would be easier to use the domxml as the base language
>> for the new library, too. libvirt-devaddr would accept it with various
>> hints (expressed as its own extension to the XML?) such as "place all
>> of these devices in the same NUMA node", "keep on root bus" or
>> "separate these two chattering devices to their own bus". The output
>> of libvirt-devaddr would be a domxml with <devices> filled with
>> controllers and addresses, readily available for consumption by
>> libvirt.
> I don't believe that using the libvirt domain XML is a good idea for
> this as it unnecessarily constrains the usage scenarios. Most management
> applications do not use the domain XML as their canonical internal storage
> format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just
> store metadata in their DB, others vary again. Some of these applications
> benefit from being able to expand device topology/addressing, a long time
> before they get anywhere near use of domain XML - the latter only matters
> when you come to instantiate a VM on a particular host.


This explains why it's not necessary to use XML. But I don't see use of 
XML as "unnecessarily constraining" the usage scenarios. Does it make 
the code (on either side) unnecessarily inefficient? Does it require 
pulling in libraries that applications otherwise wouldn't need? Is the 
required code too complex?


>
> We could of course have a convenience method which optionally generates
> a domain XML template from the output list of devices, if someone believes
> that's useful to standardize on, but I don't think the domain XML should
> be the core format.



>
> I would also like this library to be usable for scenarios in which libvirt
> is not involved at all. One of the strange things about the QEMU driver
> in libvirt compared to the other hypervisor drivers is that it is missing
> an intermediate API layer. In other drivers the hypervisor platform itself
> provides a full management API layer, and libvirt merely maps the libvirt
> APIs to the underlying mgmt API or data formats. IOW, libvirt is just a
> mapping layer.


When you're just a "mapping layer", and you're expected to transparently 
map in both directions, it gets problematic. Especially when there are 
multiple ways of describing the same setup, or options supported at one 
end that are ignored/not supported at the other. Not sure why I'm 
replying to this point; it's just that when I hear "mapping layer" I 
think about the fact that netcf was never able to deal with the many 
different ways that Debian interfaces files could be written, or to 
ignore (but leave in place) extra ifcfg options it didn't support 
(those are just a couple of examples that come to mind, and we 
shouldn't derail this conversation to talk about them :-/)


>
> QEMU though only really provides a few low level building blocks, alongside
> other building blocks you have to pull in from Linux. It doesn't even provide
> a configuration file. Libvirt pulls all these pieces together to form the
> complete management QEMU API, as well as mapping everything onto the libvirt
> domain XML & APIs. I think there is scope & interest/demand to look at
> creating an intermediate layer that provides a full management layer for
> QEMU, such that libvirt can eventually become just a mapping layer for
> QEMU. In such a scenario the libvirt-devaddr library is still very useful
> but you don't want it using the libvirt domain XML, as that's not likely
> to be the format in use.


My opinion would be that it's not necessary for libvirt domain XML (or a 
subset) to be the format, but that it also shouldn't necessarily be avoided 
(unless the alternative is better in some quantifiable way).


Anyway, in the end I think my opinion is we should push ahead and think 
about consequences of the specifics later, after some experimenting. I'd 
love to help if there's a place for it. I'm just not sure where/how I 
could contribute, especially since I have only about 4 hours' worth of 
golang knowledge :-) (certainly not against getting more though!)




