[vfio-users] Bus reset trouble with Titan-X

Tue Oct 18 22:48:59 UTC 2016

Alex,

I think I was able to run the test successfully, and I was also able to
make the device fail. It went from (rev a1) to (rev ff), with the header
error as the response.

Instead of doing all devices at once, I did one at a time.

This was the output of

# lspci -tv

+-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+--00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
                                            |                 \-00.1  NVIDIA Corporation Device efb0
                                            +-04.0-[05]--+--00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
                                            |                 \-00.1  NVIDIA Corporation Device efb0
                                            +-08.0-[06]--+--00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
                                            |                 \-00.1  NVIDIA Corporation Device efb0
                                            +-0c.0-[07]--+--00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
                                            |                 \-00.1  NVIDIA Corporation Device efb0
                                            +-14.0-[08]----00.0   Mellanox Technologies MT27600 Family [ConnectX-3]
+-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+--00.0-[0d]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
          |                  \-00.1  NVIDIA Corporation Device 0fb0
          +--04.0-[0e]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
          |                  \-00.1  NVIDIA Corporation Device 0fb0
          +--08.0-[0f]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
          |                  \-00.1  NVIDIA Corporation Device 0fb0
          +--0c.0-[10]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
          |                  \-00.1  NVIDIA Corporation Device 0fb0

I tried the first device
# virsh nodedev-detach --driver=kvm pci_0000_04_00_0
Device pci_0000_04_00_0 detached

# virsh nodedev-detach --driver=kvm pci_0000_04_00_1
Device pci_0000_04_00_1 detached
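
Since each GPU has two functions to detach (VGA at .0 and audio at .1),
the virsh node device names can be generated from the PCI address rather
than typed by hand. A minimal sketch, assuming the pci_DDDD_BB_SS_F naming
seen in the virsh output above; nodedev_name is a hypothetical helper, not
part of virsh, and a short address is assumed to be in PCI domain 0000:

```shell
#!/bin/sh
# nodedev_name: hypothetical helper, not a virsh command.
# Converts a PCI address (04:00.0 or 0000:04:00.0) into the libvirt
# node device name form shown above (pci_0000_04_00_0).
nodedev_name() {
    case $1 in
        ????:??:??.?) addr=$1 ;;
        ??:??.?)      addr=0000:$1 ;;   # assumption: PCI domain 0000
        *) echo "bad PCI address: $1" >&2; return 1 ;;
    esac
    # Replace both ':' and '.' with '_'.
    printf 'pci_%s\n' "$addr" | tr ':.' '__'
}

# Example use (commented out: needs libvirt and root):
# for fn in 04:00.0 04:00.1; do
#     virsh nodedev-detach --driver=kvm "$(nodedev_name "$fn")"
# done
```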

In the script I put

DEVS=(
            03:00.0
            04
)

Ran it 100 times and got no error.

Then I ran it for a different device, 05:

# virsh nodedev-detach --driver=kvm pci_0000_05_00_0
Device pci_0000_05_00_0 detached

# virsh nodedev-detach --driver=kvm pci_0000_05_00_1
Device pci_0000_05_00_1 detached

DEVS=(
            03:04.0
            05:
)

I saw this.

# for i in $(seq 1 100); do ./reset.sh; done
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev ff)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)

I repeated this with another device on the system.

I assume this indicates that the device is not resetting properly? The
question is, where do I go from here? Would this indicate a problem with
the PCI reset code or problematic hardware?

-Kevin

On Tue, Oct 18, 2016 at 11:49 AM, Alex Williamson <alex.williamson at redhat.com> wrote:

> On Tue, 18 Oct 2016 11:04:14 -0500
> Kevin Vasko <kvasko at gmail.com> wrote:
>
> > Alex,
> >
> > (crossing fingers this goes into the correct thread).
> >
> > I upgraded this machine to 4.4.0-42-generic.
> >
> > I spawned a single VM with 1 GPU immediately after the kernel upgrade. It
> > works. It attached properly and in the VM when I ran lspci, it showed up
> > properly.
> >
> > I deleted that VM and started up the system with 4x GPUs, and then it
> > started exhibiting the same issue. Three of the GPUs attached properly.
> >
> > It appears that upgrading the kernel did not resolve this. If you don't
> > mind providing instructions on resetting the bus so I can narrow this
> > down further (what you were talking about yesterday), that would be
> > appreciated. Any other suggestions would be greatly appreciated as well.
>
> Ok, you're going to need to identify the parent bridge for the GPUs.
> You can do this with 'lspci -tv'.  If you need help, send the output of
> that command.  Here's an example:
>
> # lspci -tv
> -[0000:00]-+-00.0  Intel Corporation 5520/5500/X58 I/O Hub to ESI Port
>            +-01.0-[01]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
>            |            \-00.1  Intel Corporation 82576 Gigabit Network Connection
>            +-03.0-[02]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
>            +-07.0-[03]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
>            |            \-00.1  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
>            ...
>
> Say I want to do a bus reset on the X710 ethernet devices at 03:00.0
> and 03:00.1.  This should be similar to a GPU and companion audio
> device.  The parent bridge is device 00:07.0.  I can double check this
> by running lspci on this device:
>
> # lspci -vs 00:07.0
> 00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22) (prog-if 00 [Normal decode])
>         Flags: bus master, fast devsel, latency 0, IRQ 27
>         Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
>                          ^^^^^^^^^^^^
>
> The secondary bus is 03, thus it's the parent device of 03:00.[01].
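
The same parent can also be read from sysfs, where each entry under
/sys/bus/pci/devices is a symlink whose resolved path nests the device
directory inside its parent bridge's directory. A minimal sketch;
parent_bridge is a hypothetical helper, and its optional second argument
exists only so the lookup can be pointed at a different directory:

```shell
#!/bin/sh
# parent_bridge DEV [ROOT]: hypothetical helper.
# Prints the name of the directory that encloses DEV's resolved sysfs
# path, which for a device behind a bridge is the bridge's address.
# For a device directly on a root bus this is a name like pci0000:00.
parent_bridge() {
    dev=$1
    root=${2:-/sys/bus/pci/devices}
    # Fail early if the device entry does not exist.
    [ -e "$root/$dev" ] || { echo "no such device: $dev" >&2; return 1; }
    real=$(readlink -f "$root/$dev") || return 1
    basename "$(dirname "$real")"
}

# Example: parent_bridge 0000:03:00.0
```

On the example system above, this would be expected to name 0000:00:07.0
for device 0000:03:00.0, matching the secondary bus check.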
>
> Prior to performing a bus reset, attach all the affected devices to a
> driver that isn't going to be making use of the devices, for instance
> pci-stub.  We can do this with virsh using:
>
> # virsh nodedev-detach --driver=kvm pci_0000_03_00_0
> Device pci_0000_03_00_0 detached
>
> # virsh nodedev-detach --driver=kvm pci_0000_03_00_1
> Device pci_0000_03_00_1 detached
>
> The "--driver=kvm" simply selects pci-stub rather than vfio-pci, which
> would otherwise be the default.
>
> Also note that after a bus reset, the downstream devices are not going
> to be usable until after a system reboot.  Our goal is to see how
> reliably we can perform a bus reset and have the devices re-appear; we
> cannot make use of them beyond running lspci on them without a system
> reboot.
>
> Ok, so for each GPU you should know the parent bridge, the address of
> the GPUs themselves, and each GPU and companion audio device should be
> bound to pci-stub.
>
> Using the dual port NICs as stand-ins for your GPUs, we need a script
> like this:
>
> # cat reset.sh
> #!/bin/bash
>
> DEVS=(
>         00:01.0 # parent of 01:
>         01:     # affected devices of 01.0
>         00:07.0 # parent of 03:
>         03:     # affected devices of 07.0
>         # change the entries above for your system
>         # you will have more devices here, a parent bridge
>         # followed by the bus of the affected GPU, 0f:, 10:, 0e:, 0d:
> )
>
> i=0
>
> while [ $i -lt ${#DEVS[@]} ]; do
>         setpci -s ${DEVS[$i]} 3e.w=40:40 # Set 2ndary bus reset bit
>         sleep 0.2
>         setpci -s ${DEVS[$i]} 3e.w=0:40 # Clear 2ndary bus reset bit
>         sleep 1
>         # when this reports abnormally, we've failed
>         lspci -s ${DEVS[$(( $i + 1 ))]}
>         i=$(( $i + 2 ))
> done
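
For reference, 3e.w=40:40 uses setpci's value:mask form, so only the bits
set in the mask are written. Offset 0x3e is the Bridge Control register
and bit 6 (0x40) is the Secondary Bus Reset bit. The arithmetic can be
sketched as below; apply_mask is a hypothetical helper that only
illustrates the bit manipulation and does not touch hardware, and the
starting value 0x0003 is just an example, not a value read from a real
bridge:

```shell
#!/bin/sh
# apply_mask OLD VALUE MASK: hypothetical helper, illustration only.
# Prints the 16-bit register value that results from writing VALUE
# under MASK on top of OLD: bits set in MASK are taken from VALUE,
# all other bits keep their OLD state.
apply_mask() {
    printf '%04x\n' $(( (($1 & ~$3) | ($2 & $3)) & 0xffff ))
}

apply_mask 0x0003 0x40 0x40   # set the reset bit:   prints 0043
apply_mask 0x0043 0x00 0x40   # clear it again:      prints 0003
```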
>
> Don't forget to chmod 755 the script.  Run it once and it should
> produce something like this (of course with your GPUs instead of my
> NICs):
>
> # ./reset.sh
> 01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
> 01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
> 03:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
> 03:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
>
> If that works, then run it 100 times:
>
> # for i in $(seq 1 100); do ./reset.sh; done
>
> If you start seeing "(rev ff) (prog-if ff)" then the device has
> failed.  (left as an exercise to the reader to automatically stop on
> this condition ;)  Please report what you find and remember that it's
> expected that you will need to reboot the system after performing this
> test to get the devices back into a workable state.  We're not saving
> and restoring the state of the devices around reset.  Thanks,
>
> Alex
>
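
The "exercise to the reader" above, stopping as soon as a device reads
back as (rev ff), could be handled along these lines. This is a minimal
sketch: has_failed and run_until_failure are hypothetical helper names,
and the check relies only on the "(rev ff)" string that lspci prints once
a device stops responding to config reads.

```shell
#!/bin/sh
# has_failed TEXT: hypothetical helper.
# Succeeds if the given lspci output contains a "(rev ff)" entry.
has_failed() {
    case $1 in
        *"(rev ff)"*) return 0 ;;
        *)            return 1 ;;
    esac
}

# run_until_failure MAX CMD [ARGS...]: hypothetical helper.
# Runs CMD up to MAX times, echoing its output each time, and returns
# non-zero at the first iteration whose output shows a failed device.
run_until_failure() {
    max=$1
    shift
    n=1
    while [ "$n" -le "$max" ]; do
        out=$("$@")
        printf '%s\n' "$out"
        if has_failed "$out"; then
            echo "reset failed on iteration $n" >&2
            return 1
        fi
        n=$((n + 1))
    done
    return 0
}

# Example use (commented out: needs root and the reset.sh above):
# run_until_failure 100 ./reset.sh
```

Stopping at the first failure also records which iteration it happened
on, which may help when comparing devices.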