[vfio-users] Lost link when pass through rtl8168 to guest

Alex Williamson alex.williamson at redhat.com
Fri Sep 23 16:23:01 UTC 2016


On Fri, 23 Sep 2016 14:52:46 +0800
Wei Xu <wexu at redhat.com> wrote:

> On 2016-09-21 22:50, Alex Williamson wrote:
> > On Wed, 21 Sep 2016 14:04:20 +0800
> > Wei Xu <wexu at redhat.com> wrote:
> >  
> >> On 2016-09-21 13:41, Wei Xu wrote:  
> >>   > On 2016-09-21 12:31, Alex Williamson wrote:  
> >>   >> On Wed, 21 Sep 2016 11:52:31 +0800
> >>   >> Wei Xu <wexu at redhat.com> wrote:
> >>   >>  
> >>   >>> On 2016-09-21 02:59, Nick Sarnie wrote:  
> >>   >>>> Hi Wei,
> >>   >>>>
> >>   >>>> My system is a desktop, so it must just be a general Gigabyte BIOS
> >>   >>>> bug.
> >>   >>>> I submitted a help ticket about this issue and just gave a brief
> >>   >>>> explanation and then sent Alex's explanation. Hopefully it will be
> >>   >>>> escalated correctly.  
> >>   >>>
> >>   >>> Thanks for your feedback. I'm also using a Gigabyte board; I checked
> >>   >>> the firmware update history and updated my firmware to the latest
> >>   >>> release, from March. It looks like it will be a long wait to get
> >>   >>> feedback on this issue from them.
> >>   >>>
> >>   >>> Alex,
> >>   >>> It's hard for us to do nothing but wait. The reason I use my desktop
> >>   >>> is that it has a COM console, which makes it quite convenient to debug
> >>   >>> the kernel via kgdb, and I want to keep my Realtek NIC for ssh access
> >>   >>> from my notebook. Is there any way to work around this and pass
> >>   >>> through only the wireless NIC as a temporary experiment?
> >>   >>>
> >>   >>> I've recently been trying the VirtIO DMAR patch with a vIOMMU in the
> >>   >>> guest, which needs a PCIe device passed through from the host plus
> >>   >>> one more virtio NIC for the guest, per the feedback. Maybe I can
> >>   >>> pass through a device in another group instead of a NIC?  
> >>   >>
> >>   >> Sure, but skylake platforms are notoriously bad for their lack of
> >>   >> device isolation, even things like USB controllers and audio devices
> >>   >> are now part of multifunction packages that do not expose isolation
> >>   >> through ACS.  If you can't resolve the IOMMU grouping otherwise, your
> >>   >> choices are as I told Nick in the other thread:
> >>   >>
> >>   >>    "Your choices are to run an unsupported (and unsupportable)
> >>   >>    configuration using the ACS override patch, get your hardware vendor
> >>   >>    to fix their platform, or upgrade to better hardware with better
> >>   >>    isolation characteristics."
> >>   >>
> >>   >> It's unfortunate that Intel provides VT-d on consumer platforms without
> >>   >> sufficient device isolation to really make it usable, but that's often
> >>   >> the state of things.  The workstation and server class platforms,
> >>   >> supporting Xeon E5 or High End Desktop Processors provide the necessary
> >>   >> isolation.  Thanks,  
> >>   >
> >>   > Yes, fortunately I finally got it solved. First I tried adding the
> >>   > 'r8169' driver to the kernel group whitelist after 'pci-stub', then
> >>   > recompiled and updated the kernel. The VM booted successfully, but a
> >>   > map-page-to-IOVA error for the Realtek NIC during DMA crashed the
> >>   > system later. It looks like this was caused by the group dependency; I
> >>   > remember the vfio doc says the group is the minimum isolation unit.  
> >
> > This approach is just a bad idea.
> >  
> >>   >
> >>   > Then I found there are 3 PCI bridges on my board: 2 of them share a
> >>   > group and the other is in a separate group. After plugging the iwl
> >>   > WLAN NIC into that one, everything works well.  
> >>
> >> I just noticed a topology change in my system: the PCI bridges look
> >> different than before after I changed the slot for my WLAN NIC. I
> >> thought I had plugged it into 00:1d.0, but it was connected to the Sky
> >> Lake PCIe controller. Does this mean there are hidden PCI bridges for
> >> PCI enumeration in the system? Is this allowable?
> >>
> >> Before:
> >> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> >> Port #5 (rev f1) (prog-if 00 [Normal decode])
> >> 00:1c.7 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> >> Port #8 (rev f1) (prog-if 00 [Normal decode]) ------------ wlan nic
> >> 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> >> Port #9 (rev f1) (prog-if 00 [Normal decode])
> >>
> >> Now:
> >> 00:01.0 PCI bridge: Intel Corporation Sky Lake PCIe Controller (x16)
> >> (rev 07) (prog-if 00 [Normal decode]) ------------ wlan nic
> >> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> >> Port #5 (rev f1) (prog-if 00 [Normal decode])
> >> 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> >> Port #9 (rev f1) (prog-if 00 [Normal decode])  
> >
> > There are generally two sources of PCIe root ports on Intel systems,
> > the processor itself and the PCH (Platform Controller Hub).  Look at a
> > block diagram for a modern system and you'll see this.  Typically for a
> > client processor (i3/i5/i7) there is no isolation between or
> > downstream of the individual processor root ports and isolation between
> > the individual PCH root ports is via quirks, because Intel didn't
> > include ACS or broke ACS.  You've found these processor root ports.
> > Why don't they show up in lspci when nothing is plugged into them?  Why
> > should they?  Chances are almost certain that your system does not
> > support PCI hotplug, so there's no requirement to expose empty
> > bridges.  I'm glad you've found a working setup, desktop class systems
> > often have poor isolation characteristics which make device assignment
> > difficult.  Thanks,  
> 
> Thanks for your explanation; I googled a few docs and some info about it,
> really helpful.
> 
> Still a few questions.
> 
> Q1:
>  > there is no isolation between or downstream of the individual
>  > processor root ports  
> Normally the processor root ports offer higher speed than the PCH root
> ports. Does 'no isolation' for the processor root ports mean no ACS on
> the root port? AFAIK the physical address has been translated to an
> IOVA before being programmed into the device, so how does the root port
> forward TLPs between devices directly? An IOTLB cache?

ACS is what tells the root port that it must forward transactions
upstream, through the IOMMU.  Without this we cannot guarantee that a
DMA target is translated through the IOMMU; a physical address range
that overlaps with an IOVA range could induce an unintended
peer-to-peer transaction.
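
As a rough illustration of the overlap problem described above, here is a
minimal Python sketch of a deliberately simplified routing model.  The
names (Bar, p2p_target, route) and the example addresses are hypothetical
and only illustrate the idea; real hardware routing is more involved.

```python
from typing import List, NamedTuple, Optional


class Bar(NamedTuple):
    """A peer device's MMIO window below the same root port (hypothetical)."""
    owner: str
    start: int
    size: int


def p2p_target(dma_addr: int, peer_bars: List[Bar]) -> Optional[str]:
    """Return the peer whose MMIO window contains dma_addr, if any."""
    for bar in peer_bars:
        if bar.start <= dma_addr < bar.start + bar.size:
            return bar.owner
    return None


def route(dma_addr: int, peer_bars: List[Bar], acs_redirect: bool) -> str:
    """Simplified model: with ACS redirect, everything goes upstream."""
    if acs_redirect:
        return "upstream to IOMMU"
    hit = p2p_target(dma_addr, peer_bars)
    return f"peer-to-peer to {hit}" if hit else "upstream to IOMMU"


# Example: the guest programs an IOVA that happens to equal a peer's
# physical MMIO address (addresses are made up for illustration).
PEER_BARS = [Bar("wlan", 0xF7000000, 0x2000)]
print(route(0xF7000010, PEER_BARS, acs_redirect=False))
print(route(0xF7000010, PEER_BARS, acs_redirect=True))
```

In this model the same address is claimed by the peer device when redirect
is off and reaches the IOMMU only when redirect is on, which is why the
devices cannot be treated as isolated without ACS.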
 
> Q2:
>  > isolation between the individual PCH root ports is via quirks,
>  > because Intel didn't include ACS or broke ACS.  
> A little confused: 'Intel didn't include ACS'? Does 'Intel' here mean the
> processor root ports, the root complex, or the PCH?

The PCH root ports.

> For my case, to support isolation between the individual PCH root
> ports, 2 conditions should be satisfied at the same time.
> 
> - BIOS reports ACS capability for the devices.
> - Quirks as you listed before to correct the awareness of the
>    capability.

No, in general, if a device properly reports ACS no quirks are
necessary.  Quirks are only necessary when ACS is not implemented or
implemented incorrectly.  In the case of the Z170 PCH, ACS is
implemented, when enabled by the BIOS, but it's broken: the control
register in the Intel version is at the wrong offset.  The quirk to
handle this is based on the ACS capability existing; we modify the ACS
usage to use the different offset.  See the errata for the chipset.
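
To make the offset problem concrete, here is a small Python sketch that
reads the ACS control register from a raw config-space buffer.  It assumes
the standard layout (16-bit control register at capability offset 0x06)
and, for the quirked case, a 32-bit control register at offset 0x08, which
is my understanding of how the Linux quirk treats these root ports; check
the chipset errata and the kernel source before relying on it.  The
capability position 0x140 and the helper names are hypothetical.

```python
import struct

PCI_ACS_CTRL = 0x06        # standard: 16-bit control register
INTEL_SPT_ACS_CTRL = 0x08  # assumption: dword control register on the Z170 PCH

# Standard ACS control bits: source validation, request redirect,
# completion redirect, upstream forwarding.
ACS_SV, ACS_RR, ACS_CR, ACS_UF = 0x01, 0x04, 0x08, 0x10
ACS_ISOLATION = ACS_SV | ACS_RR | ACS_CR | ACS_UF


def read_acs_ctrl(config: bytes, acs_pos: int, intel_spt_quirk: bool) -> int:
    """Read the ACS control value at the standard or the quirked offset."""
    if intel_spt_quirk:
        (ctrl,) = struct.unpack_from("<I", config, acs_pos + INTEL_SPT_ACS_CTRL)
    else:
        (ctrl,) = struct.unpack_from("<H", config, acs_pos + PCI_ACS_CTRL)
    return ctrl


def isolates(ctrl: int) -> bool:
    """True if all the isolation-related control bits are set."""
    return (ctrl & ACS_ISOLATION) == ACS_ISOLATION


def example_config(acs_pos: int = 0x140) -> bytes:
    """Build a fake config space using the dword layout (illustration only)."""
    config = bytearray(4096)
    struct.pack_into("<I", config, acs_pos + 0x04, ACS_ISOLATION)  # capability
    struct.pack_into("<I", config, acs_pos + 0x08, ACS_ISOLATION)  # control
    return bytes(config)


cfg = example_config()
print(isolates(read_acs_ctrl(cfg, 0x140, intel_spt_quirk=False)))
print(isolates(read_acs_ctrl(cfg, 0x140, intel_spt_quirk=True)))
```

Reading at the standard offset lands in the upper half of the capability
dword in this fake layout, so generic code would conclude that isolation
is not enabled even though the control bits are set at the quirked offset.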

> Are there any other quirks needed?

Only in the case of Z170 is the quirk based on the existence of an ACS
capability to start with.  Prior chipsets had no native ACS support and
the quirks manipulate various device specific registers to enforce PCH
root port isolation.
 
> Q3: This question is not related to vfio, but it has confused me for a
> while. Could you please try to answer it, or point me to a helpful
> spec?
> 
> For the multi-socket case, e.g. 2 CPU sockets, there should be 2 root
> complexes. How are the root complexes connected to the system PCIe bus?
> Normally there should be only one PCH in the system, right? Then how is
> the PCH connected to the root complex? Does only one of them have an
> upstream port, or something else?
> 
> Considering device A is connected to a root port of processor 0, if the
> target DMA address is in memory attached to processor 1, what does the
> DMA/TLP look like in flight?

Each processor has a host bridge, which is the source of the root
complex.  Various root ports and integrated endpoint devices can be
embedded into that root complex.  Therefore multi-socket systems
generally have multiple host bridges and therefore root complexes.
The devices hosted on those root complexes can be configured via the
BIOS; I assume it's possible to entirely disable the host bridge on
some processors.

The PCH is essentially a proprietary extension of the root complex over
a DMI link.  The PCH provides not only additional root ports, but often
legacy features to the system.  Therefore there's generally only one
PCH in a system, but I suppose more could be offered.  Generally on a
two-socket system, the combination of the root ports available through
the processors and a single PCH is more than sufficient to provide a
reasonable number of slots.  Thanks,

Alex



