[libvirt] [PATCH v7 0/4] Add Mediated device support

Jike Song jike.song at intel.com
Wed Aug 31 07:04:13 UTC 2016


On 08/31/2016 02:12 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson at redhat.com]
>> Sent: Wednesday, August 31, 2016 12:17 AM
>>
>> Hi folks,
>>
>> At KVM Forum we had a BoF session primarily around the mediated device
>> sysfs interface.  I'd like to share what I think we agreed on and the
>> "problem areas" that still need some work so we can get the thoughts
>> and ideas from those who weren't able to attend.
>>
>> DanPB expressed some concern about the mdev_supported_types sysfs
>> interface, which exposes a flat csv file with fields like "type",
>> "number of instance", "vendor string", and then a bunch of type
>> specific fields like "framebuffer size", "resolution", "frame rate
>> limit", etc.  This is not entirely machine parsing friendly and sort of
>> abuses the sysfs concept of one value per file.  Example output taken
>> from Neo's libvirt RFC:
>>
>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>
>> The create/destroy then looks like this:
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" >
>> 	/sys/bus/pci/devices/.../mdev_create
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" >
>> 	/sys/bus/pci/devices/.../mdev_destroy
>>
>> "vendor_specific_argument_list" is nebulous.
>>
>> So the idea to fix this is to explode this into a directory structure,
>> something like:
>>
>> ├── mdev_destroy
>> └── mdev_supported_types
>>     ├── 11
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     ├── 12
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     └── 13
>>         ├── create
>>         ├── description
>>         └── max_instances
>>
>> Note that I'm only exposing the minimal attributes here for simplicity,
>> the other attributes would be included in separate files and we would
>> require vendors to create standard attributes for common device classes.
> 
> I like this idea. All standard attributes are reflected into this hierarchy.
> In the meantime, can we still allow optional vendor string in create 
> interface? libvirt doesn't need to know the meaning, but allows upper
> layer to do some vendor specific tweak if necessary.
> 

Not sure whether this can be done within the MDEV framework (with the attrs
provided by the vendor driver, of course), or whether it has to live entirely
in the vendor driver.
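
If it can, I guess the syntax would be something like below (the argument
names are made up, just borrowing from the old vendor_specific_argument_list
idea):

    UUID=$(uuidgen)
    echo "$UUID:frl=45,framebuffer=512M" > \
        /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create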

>>
>> For vGPUs like NVIDIA where we don't support multiple types
>> concurrently, this directory structure would update as mdev devices are
>> created, removing no longer available types.  I carried forward
> 
> or keep the type with max_instances cleared to ZERO.
>

+1 :)
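
Either way, with the per-type directories a management tool can poll what is
still available, e.g. (paths assumed from the layout proposed above):

    for type in /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/*; do
        printf '%s: %s instance(s) left (%s)\n' \
            "$(basename "$type")" \
            "$(cat "$type/max_instances")" \
            "$(cat "$type/description")"
    done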

>> max_instances here, but perhaps we really want to copy SR-IOV and
>> report a max and current allocation.  Creation and deletion is
> 
> right, cur/max_instances look reasonable.
> 
>> simplified as we can simply "echo $UUID > create" per type.  I don't
>> understand why destroy had a parameter list, so here I imagine we can
>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
>> entry under each mdev device, so we remove it at the device rather than
>> in some central location (any objections?).
> 
> OK to me. 

IIUC, "destroy" has a parameter list is only because the previous
$VM_UUID + instnace implementation. It should be safe to move the "destroy"
file under mdev now.
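
So the whole lifecycle would reduce to something like below (the path of the
mdev node under the parent and the "echo 1 > remove" convention are only my
guesses):

    UUID=$(uuidgen)
    # create an instance of type 11 under its own type directory
    echo "$UUID" > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
    # ... the device is then consumed through vfio ...
    # tear it down at the device itself rather than at a central node
    echo 1 > /sys/bus/pci/devices/0000:86:00.0/mdev/$UUID/remove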

>> We discussed how this might look with Intel devices which do allow
>> mixed vGPU types concurrently.  We believe, but need confirmation, that
>> the vendor driver could still make a finite set of supported types,
>> perhaps with additional module options to the vendor driver to enable
>> more "exotic" types.  So for instance if IGD vGPUs are based on
>> power-of-2 portions of the framebuffer size, then the vendor driver
>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
>> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.
>
> Yes, Intel can do this type of definition. One thing I'm not sure about is
> the impact across listed types, i.e. when creating a new instance under a
> given type, max_instances under other types would be dynamically
> decremented based on the available resources. Would that be a problem for
> libvirt or the upper-level stack, since the natural interpretation of
> max_instances is a static number?
>
> An alternative is to make max_instances configurable, so libvirt has a
> chance to define a pool of available instances with different types
> before creating any instance. For example, initially the IGD driver may 
> report max_instances only for the minimal sharing granularity:
> 	128MB:
> 		max_instances (8)
> 	256MB:
> 		max_instances (0)
> 	512MB:
> 		max_instances (0)
> 
> Then libvirt can configure more types as:
> 	128MB:
> 		max_instances (2)
> 	256MB:
> 		max_instances (1)
> 	512MB:
> 		max_instances (1)
> 
> Starting from this point, max_instances would be static and mdev
> instances could then be created under each type. But I'm not sure whether
> such an additional configuration role is reasonable for libvirt...
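
If libvirt were willing to take that configuration role, the flow could look
like below (a writable max_instances is purely hypothetical, and the IGD
parent address is just an example):

    types=/sys/bus/pci/devices/0000:00:02.0/mdev_supported_types
    # carve the budget that initially allows 8 x 128MB into
    # 2 x 128MB + 1 x 256MB + 1 x 512MB
    echo 2 > $types/128MB/max_instances
    echo 1 > $types/256MB/max_instances
    echo 1 > $types/512MB/max_instances
    # from here on the numbers are static; instances are created per type
    echo "$(uuidgen)" > $types/256MB/create
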
>>
>> We still don't have any way for the admin to learn in advance how the
>> available supported types will change once mdev devices start to be
>> created.  I'm not sure how we can create a specification for this, so
>> probing by creating devices may be the most flexible model.
>>
>> The other issue is the start/stop requirement, which turns out to exist to
>> set up peer-to-peer resources between vGPUs, a limited hardware resource.
>> We'd really like to have these happen automatically on the first open of
>> first open of a vfio mdev device file and final release.  So we
>> brainstormed how the open/release callbacks could know the other mdev
>> devices for a given user.  This is where the instance number came into
>> play previously.  This is an area that needs work.
> 
> IGD doesn't have such a peer-to-peer resource setup requirement, so it's
> sufficient to create/destroy an mdev instance in a single action on IGD.
> However, I'd expect we still keep the "start/stop" interface (maybe not
> exposed as a sysfs node, but as a VFIO API instead), as it is required to
> support future live migration usage. We have a working prototype of this
> for KVMGT today.

It's good for the framework to define start/stop interfaces, but as Alex
said below, it should be MDEV oriented, not VM oriented.

I don't know a lot about the peer-to-peer resources, but to me, although
VM_UUID + instance is not applicable, userspace can always achieve the
same purpose by providing the VM UUID under every mdev, assuming an mdev
hierarchy like:

	/sys/bus/pci/devices/<sbdf>/mdev/
	|-- mdev01/
	|   `-- vm_uuid
	`-- mdev02/
	    `-- vm_uuid

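The vendor driver could then group every mdev carrying the same vm_uuid on
the first open / final release, and userspace would only need something like
(the vm_uuid attribute and the paths are of course hypothetical):

    VM_UUID=$(uuidgen)
    for mdev in mdev01 mdev02; do
        echo "$VM_UUID" > /sys/bus/pci/devices/0000:86:00.0/mdev/$mdev/vm_uuid
    done
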
Did I miss something?

>>
>> There was a thought that perhaps on open() the vendor driver could look
>> at the user pid and use that to associate with other devices, but the
>> problem here is that we open and begin access to each device, so
>> devices do this discovery serially rather than in parallel as desired.
>> (we might not fault in mmio space yet though, so I wonder if open()
>> could set the association of mdev to pid, then the first mmio fault
>> would trigger the resource allocation?  Then all the "magic" would live
>> in the vendor driver.  open() could fail if the pid already has running
>> mdev devices and the vendor driver chooses not to support hotplug)
>>
>> One comment was that for a GPU that only supports homogeneous vGPUs,
>> libvirt may choose to create all the vGPUs in advance and handle them
>> as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
>> case.
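
FWIW, with the per-type create node that pre-creation becomes a trivial loop,
e.g. (assuming the proposed layout):

    type=/sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11
    for _ in $(seq "$(cat "$type/max_instances")"); do
        echo "$(uuidgen)" > "$type/create"
    done
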
>>
>> We also considered whether iommu groups could be (ab)used for this use
>> case, peer-to-peer would in fact be an iommu grouping constraint
>> afterall.  This would have the same UUID+instance constraint as above
>> though and would require some sort of sysfs interface for the user to
>> be able to create multiple mdevs within a group.
>>
>> Everyone was given homework to think about this on their flights home,
>> so I expect plenty of ideas by now ;)
>>
>> Overall I think mediated devices were well received by the community,
>> so let's keep up the development and discussion to bring it to
>> fruition.  Thanks,
> 
> Thanks a lot Alex for your help on driving this discussion. The mediated
> device technique has the potential to be used for other types of I/O
> virtualization in the future, not just GPU virtualization, so getting the
> core framework ready early would be highly welcome. :-)
> 
--
Thanks,
Jike



