[libvirt] [GSoC] Abstracting device address allocation

Laine Stump laine at laine.org
Sun Jun 12 21:26:00 UTC 2016


On 06/08/2016 02:04 PM, Tomasz Flendrich wrote:
> Hello everyone,
>
> Let me introduce myself - I'm Tomasz Flendrich and I'm a Google Summer
> of Code student from University of Wrocław, Poland.
>
> My goal is to create a generic address allocator (for PCI, virtio,
> SCSI, etc.), because the current handling of addresses has its flaws:
> sometimes addresses aren't properly assigned or checked for duplicates.

I know you took this sentence straight from the wiki page of GSoC ideas, 
but I'd be interested to see a few specific examples of the problems 
that you intend to solve (maybe Martin can help here, since I think he's 
the one who wrote that).

I know of a few problems with PCI address allocation, but abstracting 
the data structure and functions used for allocation isn't going to 
solve them. For example, if you attach a device that uses PCI with 
"virsh attach-device ... --live", then add another with "virsh 
attach-device --live --config", the new device will be assigned a 
different address in the persistent config than in the live domain (the 
result will be that the next time you shut down and restart, the device 
will appear at a different address, which could confuse the guest OS). 
Another (which I need to fix *very* soon) is that even when a device is 
classified as connecting to PCIe slots rather than PCI (and there aren't 
enough of those right now, which is *yet another* problem I need to 
fix), if there aren't any empty PCIe slots available to assign, we'll 
fail rather than adding a new controller of the appropriate type, 
attached to the appropriate existing controller. But, as I said, a new 
data structure isn't going to solve any of those problems (and I'm not 
sure it will make the solution any easier to get to either).


> Additionally, a generic solution will be very useful when more hypervisors
> add the feature of explicitly stating the addresses of devices.
>
> The key goal I want to achieve at this moment is defining
> a minimum viable product (one thing that can be implemented quickly,
> is reasonably complete on its own, and has potential for further
> development in a direction that makes sense).
>
>
> I came up with an initial design. The internal allocator's data
> will be kept in struct Allocator. There will be one structure
> for one address type and all of the allocators would be in an array.
>
> We will have the possibility of adding an address pool to the allocator.
> For example, when we create a root PCI bus, we tell the allocator that
> the following PCI addresses are possible: {0..0}:{0..31}.{0..7},
> where {a..b} means a range from a to b, inclusive.

PCI addresses are likely much more complicated than you've planned for 
this. For starters, each PCI bus can have a different number of slots 
available: the 2nd dimension, which you list as {0..31}, has different 
start & end indexes depending on the type of controller that provides 
the bus (e.g. pci-root, pcie-root, and pci-bridge have 31 slots from 
1-31, dmi-to-pci-bridge and pcie-switch-upstream-port have 32 slots from 
0-31, and pcie-root-port and pcie-switch-downstream-port each have a 
single slot at 0). Each bus also has different rules about what devices 
can be plugged into it (e.g. a pcie-root-port can only be plugged into 
pcie-root or a pcie-switch-downstream-port, some devices can only be 
plugged into pci-root, pci-bridge, or dmi-to-pci-bridge, and some can be 
plugged into pcie-root directly if specifically requested but it's 
otherwise avoided, etc.). So it's not just a case of a 4-dimensional 
vector with the same size at each index on the same axis, and each 
address is not the same as any other address.

Are you intending for this allocator to have any sort of "give me the 
next free address" function? If so, doing this for PCI is likely to 
require you to add lots of extra complexity that will be completely 
unnecessary for the other address types (although... there are several 
different kinds of SCSI controller (including virtio-scsi as well as 
several emulations of real hardware) that all share the same address 
space, and we need to know what model each controller is so that the 
user can be sure devices are placed on the desired controller). Of 
course maybe your intent is that your allocator will only implement an 
"allocate this specific address" function, and leave auto-assignment up 
to a higher layer; but then the higher layer would still need a way to 
efficiently scan the entire list of free addresses, and each address 
would need a bit of data describing the controllers at that address so 
they could be matched.
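
So at minimum, an auto-assignment entry point for PCI would need to take 
the device's connection requirements, something like this (hypothetical 
signature):

    int allocatorReserveAny(Allocator *allocator,
                            unsigned int connectFlags, /* what the device needs */
                            AllocatorAddress *outputAddress);

and the allocator would have to match connectFlags against the per-range 
metadata while scanning for a free address.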

(Another thing - at any time a new bus of just about any type can be 
added, sometimes in response to a request to allocate an address for 
which no matching slot is available (e.g. adding a new pcie-root-port, 
pcie-switch-upstream-port, and a set of pcie-switch-downstream-ports 
when there are no more hotpluggable PCIe slots available)).


>
> This function call could look like this:
> allocatorAddPool(
>     allocator[PCI_ADDRESS_TYPE],
>     &outputAddress,
>     AllocatorRange(0, 0),
>     AllocatorRange(0, 31),
>     AllocatorRange(0, 7));

Again, this doesn't allow for some of the indexes on a particular 
dimension having a different start/end (or for other differences for a 
particular index, which possibly could be stored in some opaque pointer).
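
For example (sticking with your proposed API, but with a made-up variant 
of the call), the PCI pools would probably have to be registered one bus 
at a time, each with its own slot range and its own attributes:

    /* hypothetical variant of allocatorAddPool: one pool per bus */
    allocatorAddPoolWithFlags(
        allocator[PCI_ADDRESS_TYPE],
        &outputAddress,
        CONNECT_TYPE_PCIE_DEVICE,          /* made-up connect flag */
        AllocatorRange(bus, bus),          /* a single bus... */
        AllocatorRange(minSlot, maxSlot),  /* ...with its own slot range */
        AllocatorRange(0, 7));

rather than one call describing a uniform hypercube of addresses.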


>
> The outputAddress would be an array owned by the caller
> (only filled in by the allocator).
>
>
> If we were to allocate IP addresses for computers from the range
> 192.168.1.2 to 192.168.1.240, where the router has to have
> the address 192.168.1.254:
>
> First, we reserve the address for the router to let the Allocator know
> that it's in use, so that we can track collisions with manually
> assigned addresses:
>
> allocatorReserveAddress(
>     allocator[IP_ADDRESS_TYPE],
>     &outputAddress,
>     AllocatorValue(192),
>     AllocatorValue(168),
>     AllocatorValue(1),
>     AllocatorValue(254));
>
> Then, we assign the addresses for the computers:
>
> allocatorReserveAddress(
>     allocator[IP_ADDRESS_TYPE],
>     &outputAddress,
>     AllocatorValue(192),
>     AllocatorValue(168),
>     AllocatorValue(1),
>     AllocatorRange(2, 240));
> Please note that AllocatorValue() and AllocatorRange() can be mixed
> in one call.
>
> There could be a different function call to simply assign any address:
> allocatorReserveAny(Allocator* allocator, &outputAddress);

Ah, there it is! This won't work for PCI (or for scsi) without much more 
information available for the ranges that are being allocated from.

>
> Let's say that we want an "sda" disk. We could create a wrapper:
> allocatorReserveSCSIAddress(allocator, "sda");
> All this wrapper does is represent the string "sda" as an integer
> mapping to the letter (e.g. 'a' = 0, 'z' = 25):
> allocatorReserveAddress(allocator[SCSI_ADDRESS_TYPE], &outputAddress, 0);

Disk device names are incredibly ugly beasts. Apparently once you get to 
sdz, you start over at sdaa, sdab, etc. And the whole thing is pointless 
anyway, because that name isn't passed directly to the guest (there 
isn't a way to do that - the guest OS determines device names on its own 
based on the bus and the order in which the disks appear).
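
(libvirt already has a helper for this parsing - virDiskNameToIndex() - 
and the math is essentially base 26 with no zero digit; a simplified 
sketch:)

    #include <string.h>

    /* "sda" -> 0, "sdz" -> 25, "sdaa" -> 26, "sdab" -> 27, ... */
    static int diskNameToIndex(const char *name)
    {
        int idx = 0;
        const char *p;

        if (strncmp(name, "sd", 2) != 0)
            return -1;
        for (p = name + 2; *p; p++) {
            if (*p < 'a' || *p > 'z')
                return -1;
            idx = idx * 26 + (*p - 'a' + 1);
        }
        return idx - 1;
    }

So a wrapper that only handles 'a' through 'z' would break on the 27th 
disk.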

>
> If an address is already determined, because it was specified in the XML
> or it's some specific device that has to be at some specific address,
> we still reserve it to let the Allocator know that it's in use.
>
>
> How would this work internally?
> One of the possible solutions is keeping a set of ranges of addresses.
> For example, suppose that we have two PCI busses to use. Our address pool
> is stored as one range:
> {0..1} {0..31} {0..7}
>
> Now someone reserves address 0:13.2
>
> Our new pool of free addresses is now stored as this set of ranges:
> {0..0} {0..12} {0..7},
> {0..0} {13..13} {0..1},
> {0..0} {13..13} {3..7},
> {0..0} {14..31} {0..7},
> {1..1} {0..31} {0..7}
>
> If we kept every address separately, it would require 2*32*8=512 addresses.

...which isn't all that much, since we aren't talking about 
microcontrollers with tiny amounts of RAM.

>
> The set data structure from gnulib's gl_oset.h is a candidate for keeping
> the ranges in a sorted fashion. Another candidate is simply a sorted list
> from gl_list.h.

Without looking at the implementation, I'm concerned that this may be 
adding lots of extra code complexity in order to save just a few bytes 
of RAM, which realistically is pointless - we only have one pool of PCI 
addresses per active domain, so the amount of memory used to store a 
bitmap of the availability of addresses is inconsequential in relation 
to the total memory requirement for a domain.
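
Rough numbers: even tracking every possible function in an entire PCI 
domain as one bit each is trivial:

    /* one bit per PCI function: 256 buses * 32 slots * 8 functions */
    unsigned char funcmap[256 * 32 * 8 / 8];   /* = 8 KiB, worst case */

and a domain with only a handful of buses needs 32 bytes per bus.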

>
> This structure would be able to handle all types of addresses that are
> convertible to a fixed-length list of integers. We don't mind how many
> of these integers there are, because we can use variadic arguments.
> It won't allow duplicates if we stick to using it in every place where
> we have some addresses.
> It will also be very generic, with the possibility of writing wrappers on top
> of it that might be more convenient for some particular address type.
>
> This way we would keep the qemu_* files as they are for now, and just
> replace the existing allocators (and manual assignment of addresses)
> with calls to our new allocator.
>
> My ideas:
> - A possibility of reserving a range of addresses might be used
>    to reserve a whole PCI slot at once.

I do see the usefulness of this in general, but in this particular case, 
rather than reserving an entire slot at once, we may want to allocate a 
single function where the device is going to be, but mark the rest of 
that slot available for further functions to be allocated, then later 
mark it unavailable for further allocations *without actually allocating 
the rest of the functions in the slot*. The reason for this is that it 
would make it simpler to use the same code for detaching a "lone" device 
on a slot as we do for detaching the functions of a multifunction PCI 
device. (You're likely thinking "what's the difference between 
allocating a single function and marking the other functions on the slot 
as unavailable vs. just marking the entire slot as allocated?" It all 
has to do with being able to easily notice when the slot is once again 
completely available - if you mark all the functions in the slot as 
used, then when you free the device you have to free all the functions; 
and if you're doing that, then you have to figure out whether the device 
you're freeing was one function of a multifunction device, or a lone 
device plugged into a slot by itself. Of course that may not fit in with 
your abstracted view of generic address allocation.) Right now we 
"ham-fist" this issue by just marking the entire slot as allocated when 
someone asks for a single function, but really we should do allocations 
(and frees) one function at a time, just with a knowledge of when it's 
appropriate to allocate a 2nd (and further) function on a slot that 
already has some functions allocated, and when it isn't.
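
A sketch of what I mean (made-up names again) - per-slot state would be 
a function bitmask plus a flag saying whether new functions may still be 
added:

    #include <stdbool.h>

    typedef struct {
        unsigned char functions; /* bit N set = function N allocated */
        bool open;               /* may further functions be added here? */
    } AllocatorPCISlot;

Freeing a function just clears its bit; when 'functions' drops back to 0 
the slot is known to be completely available again, whether it held a 
lone device or eight functions of one multifunction device.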


> - Flags could be added, changing the behavior of particular
>    addresses or the whole allocator.
>
> My questions:
> - Is this approach good long-term?

Again, my concern would be about making something more complex than 
necessary for the individual uses and their requirements. Eliminating 
nearly duplicate code is good, but you need to be careful that what you 
create doesn't require more "glue" to make each of the cases fit into 
the abstract model than would have been used for a separate simpler 
"from-scratch" implementation for each individual case.

> - Is address releasing expected to be required at any point?
>    (think device hotplug/unplug, etc.)

Yes. And we need to be aware that sometimes devices are added to the 
live domain but not the persistent config, sometimes only to the 
persistent config, and sometimes to both (and when they're added to 
both, they should use the same address in both live domain and 
persistent config, which isn't always the case today).
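
In pseudocode, the invariant we want when a device is added to both 
(hypothetical names):

    if (flags & AFFECT_LIVE && flags & AFFECT_CONFIG) {
        /* pick one address that is free in *both* pools,
         * then reserve it in each */
        if (allocatorReserveAnyInBoth(liveAlloc, configAlloc, &addr) < 0)
            return -1;
    }

rather than running auto-assignment independently against each pool, 
which is how you end up with two different addresses.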

> - What other features are needed?

You had said you wanted the qemu_* files to remain the same for now. 
Exactly what are you figuring on replacing? For example, for PCI 
addresses (since these are obviously the only ones I care about :-)) are 
you just planning to replace virDomainPCIAddressBus with your new 
virAllocator and a set of accessor functions? Even if you do nothing 
higher level than this, each bus will need to have a "connectFlags" that 
describes what type of devices can be connected to that bus (or range, 
or whatever you call it).
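
As a strawman, even a minimal replacement would need something like:

    /* hypothetical accessors replacing direct struct access */
    unsigned int allocatorPoolGetConnectFlags(Allocator *a, int poolIdx);
    int allocatorPoolSetConnectFlags(Allocator *a, int poolIdx,
                                     unsigned int connectFlags);

so that the qemu code can keep making the connection-type decisions 
while your allocator owns the bookkeeping.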

>
> Please speak up your mind if you have remarks or thoughts about the idea.
> I'd really appreciate it.

Hopefully my comments haven't just made the situation more muddy :-)



