
Re: [libvirt] RFC: managing "pci passthrough" usage of sriov VFs via a new network forward type

On 08/23/2011 06:50 AM, Daniel P. Berrange wrote:
On Mon, Aug 22, 2011 at 05:17:25AM -0400, Laine Stump wrote:
For some reason beyond my comprehension, the designers of SRIOV
ethernet cards decided that the virtual functions (VF) of the card
(each VF corresponds to an ethernet device, e.g. "eth10") should
each be given a new+different+random MAC address each time the
hardware is rebooted.

This makes using SRIOV VFs via PCI passthrough very unpalatable. The
problem can be solved by setting the MAC address of the ethernet
device prior to assigning it to the guest, but of course the
<hostdev> element used to assign PCI devices to guests has no place
to specify a MAC address (and I'm not sure it would be appropriate
to add something that function-specific to <hostdev>).
In discussions at the KVM forum, other related problems were
noted too. Specifically when using an SRIOV VF with VEPA/VNLink
we need to be able to set the port profile on the VF before
assigning it to the guest, to lock down what the guest can
do. We also likely need to specify a VLAN tag on the NIC.
The VLAN tag is actually something we need to be able to do
for normal non-PCI passthrough usage of SRIOV networks too.
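
As a rough illustration of the per-VF setup described above, the sketch below builds the iproute2 command lines that would pin a MAC address (and optionally a VLAN tag) on a VF through its physical function, using the `ip link set dev PF vf N mac ...` and `... vlan ...` forms. The helper name, interface name and VF index are hypothetical, the commands are only constructed and never executed, and none of this is libvirt code.

```python
import re

# Lowercase colon-separated MAC address, e.g. 52:54:00:aa:bb:cc
MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def vf_setup_commands(pf, vf_index, mac, vlan=None):
    """Return argv lists for `ip link` that configure one VF (hypothetical helper).

    pf       -- name of the physical function's netdev (assumed, e.g. "eth2")
    vf_index -- index of the VF under that physical function
    mac      -- MAC address to pin on the VF
    vlan     -- optional VLAN id to tag the VF's traffic with
    """
    mac = mac.lower()
    if not MAC_RE.match(mac):
        raise ValueError(f"invalid MAC address: {mac!r}")
    if vf_index < 0:
        raise ValueError("VF index must be non-negative")

    base = ["ip", "link", "set", "dev", pf, "vf", str(vf_index)]
    commands = [base + ["mac", mac]]
    if vlan is not None:
        if not 0 <= vlan <= 4095:
            raise ValueError("VLAN id must be in 0..4095")
        commands.append(base + ["vlan", str(vlan)])
    return commands
```

Something would have to run these on the host before the VF is handed to the guest; the port profile association mentioned above is a separate step that this sketch does not cover.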

Dave Allan
and I have discussed a different possible method of eliminating this
problem (using a new forward type for libvirt networks) that I've
outlined below. Please let me know what you think - is this
reasonable in general? If so, what about the details? If not, any
counter-proposals to solve the problem?
The issue I see is that if an application wants to know what
PCI devices have been assigned to a guest, they can no longer
just look at <hostdev> elements.

Actually, I was thinking that the proper <hostdev> *would* be added to the live XML as non-persistent. This way all PCI devices currently assigned to the guest could still be retrieved by looking at the <hostdev> elements, but the specific PCI device used for this particular instance wouldn't need to be hardcoded into the config XML. (I think the ability to grab a free ethernet device from a pool at runtime, rather than having hardcoded devices, is an important feature of this proposed method of dealing with pci passthrough ethernet devices. I suppose a management app could be written to handle that allocation, and rewrite the domain config, but it seems like something that libvirt should be able to handle).

They also need to look at
<interface> elements. If we follow this proposed model in other
areas, we could end up with PCI devices appearing as <disks>,
<controllers> and who knows what else. I think this is not
very desirable for applications, and it is also not good for
our internal code that manages PCI devices. ie the security
drivers now have to look at many different places to find
what PCI devices need labelling.

I agree that we don't want to make management applications look for PCI devices scattered all over the config. Likewise I think it would be nice if applications don't have to go looking all over the place for MAC addresses. And now that I've heard port profiles need to be associated with these devices too, I'm wondering what will be next... having that type of high level information in a <hostdev> doesn't seem very appealing to me. I think it would be much cleaner if it could remain in <interface> (or in a <portgroup> of a network definition).

I think with non-persistent <hostdev> elements auto-generated based on <interface>/<network> definitions, we can get the best of both worlds - a complete list of all PCI devices allocated to the guest is still available in one place, but we can leverage a lot of code already in the network interface management stuff - interface pools, portgroups, etc. (unfortunately, we'll never be able to take advantage of bandwidth management or nwfilters, but there's really no solution to that short of installing an agent in the guest - by the time you get to that point, I think it's probably time to acknowledge that PCI passthrough of network devices just isn't a great general purpose solution, and use an actual QEMU network device instead)
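
To make the idea concrete, here is a minimal sketch of what "reserve a device from the network's pool at startup and generate a transient <hostdev> for it" might look like. The `VfPool` class and `hostdev_xml` helper are hypothetical names invented for this illustration, the PCI addresses are made up, and the generated element simply follows the existing `<hostdev mode='subsystem' type='pci'>` form; it is not a proposal for the actual implementation.

```python
import xml.etree.ElementTree as ET


class VfPool:
    """Hypothetical pool of VF PCI addresses backing one libvirt network."""

    def __init__(self, addresses):
        # Each address is a (domain, bus, slot, function) tuple of ints.
        self._free = list(addresses)
        self._in_use = {}  # guest name -> reserved address

    def reserve(self, guest):
        """Hand the first free VF to `guest`, or fail if the pool is exhausted."""
        if guest in self._in_use:
            raise RuntimeError(f"{guest} already holds a VF from this pool")
        if not self._free:
            raise RuntimeError("no free VF in pool")
        addr = self._free.pop(0)
        self._in_use[guest] = addr
        return addr

    def release(self, guest):
        """Return the guest's VF to the pool (e.g. on shutdown or before migration)."""
        self._free.append(self._in_use.pop(guest))


def hostdev_xml(addr):
    """Build the transient <hostdev> element for one reserved PCI address."""
    domain, bus, slot, function = addr
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain=f"0x{domain:04x}",
        bus=f"0x{bus:02x}",
        slot=f"0x{slot:02x}",
        function=f"0x{function:x}",
    )
    return ET.tostring(hostdev, encoding="unicode")
```

The point of the split is that the persistent config only names the network, while the concrete PCI address appears solely in the element generated at startup.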

One problem this doesn't solve is that when a guest is migrated, the
PCI info for the allocated ethernet device on the destination host
will almost surely be different. Is there any provision for dealing
with this in the device passthrough code? If not, then migration
will still not be possible.
Migration is irrelevant with PCI passthrough, since we reject any
attempt to migrate a guest with assigned PCI devices. A management
app must explicitly hot-unplug all PCI devices before doing any
migration, and plug back in new ones after migration finishes.

Nice. I didn't realize that. The description of how a management app handles the situation actually fits quite well with my proposal - the non-persistent hostdev would be unplugged, and after migration is completed, the normal codepath for initializing network device plumbing for the qemu process on the destination host would automatically reserve and plug in a new pci device.
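
For reference, the "find everything that has to be unplugged before migrating" step could be sketched as below, assuming the transient devices show up in the live XML as ordinary PCI <hostdev> elements under <devices>. The function name is hypothetical, and the sketch only parses an XML string; the actual detach and re-attach calls are deliberately left out.

```python
import xml.etree.ElementTree as ET


def assigned_pci_devices(domain_xml):
    """Return the source address attributes of every PCI <hostdev> in a domain XML string."""
    root = ET.fromstring(domain_xml)
    devices = []
    for hostdev in root.findall("./devices/hostdev"):
        # Skip non-PCI host devices such as USB.
        if hostdev.get("type") != "pci":
            continue
        addr = hostdev.find("./source/address")
        if addr is not None:
            devices.append(dict(addr.attrib))
    return devices
```

A management app would feed this the live XML of the guest and act on each returned address before starting the migration.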

Although I realize that many people are predisposed to not like the
idea of PCI passthrough of ethernet devices (including me), it seems
that it's going to be used, so we may as well provide the management
tools to do it in a sane manner.
Reluctantly I think we need to provide the necessary information
underneath the <hostdev> element. Fortunately we already have an
XML schema for port profile and such things, that we share between
the <interface> device element and the <network> schema.

I had actually been considering from the beginning that a <hostdev> element would end up in the live XML (after being created based on the <interface> (and the <network> it references) while the guest is starting up). This keeps network device config out of hostdev space, and hostdev config out of network device space (and fits in with the idea of eliminating host-specific config info from the domain config, since the actual PCI device to be used isn't in the domain XML, but is instead determined at domain startup).

If it's acceptable to add non-persistent <hostdev>s to the live XML, the main open item I see is that the management apps trying to migrate a guest containing them will need to understand that these transient <hostdev> devices will have replacements automatically plugged in on the destination by the networking code. For that matter, the management app shouldn't be unplugging them either (and neither should "virsh detach-device", for example), because they will require extra code not normally run during a PCI hot-unplug (to disassociate the port profile, and return the ethernet device to the network's pool). (So maybe the hostdev does need some reference back to the higher level device definition (in this case <interface>) after all. Bah.)

(Another potential problem area I see is with the relative sequencing of unplugging/disassociating/plugging/associating these devices during a migration - for standard network devices I think the unplugging on the source host doesn't happen until after the migration is complete, but for PCI passthrough devices it must happen before the migration starts. But I may again be trying to think up a solution to a problem that is irrelevant).
