[vfio-users] Bus reset trouble with Titan-X (was Re: Welcome to the "vfio-users" mailing list (Digest mode))

Kevin Vasko kvasko at gmail.com
Tue Oct 18 16:07:21 UTC 2016


On Tue, Oct 18, 2016 at 11:04 AM, Kevin Vasko <kvasko at gmail.com> wrote:

> Alex,
>
> (crossing fingers this goes into the correct thread).
>
> I upgraded this machine to 4.4.0-42-generic.
>
> I spawned a single VM with 1 GPU immediately after the kernel upgrade. It
> worked: the GPU attached properly, and when I ran lspci in the VM it showed
> up properly.
>
> I deleted that VM and started up the system with 4x GPUs, and then it
> started exhibiting the same issue. Three of the GPUs attached properly.
>
> So it appears that upgrading the kernel did not resolve the problem. If
> you don't mind providing instructions on resetting the bus to see if I can
> narrow this down further (what you were talking about yesterday) that would
> be appreciated. Any other suggestions would be greatly appreciated as well.
>
> Here are the logs of the 4 GPU attachment that failed.
>
> On the host.
>
> /etc/var/log/libvirt/qemu/instance-00000185.log
>
> This shows the /usr/bin/kvm command line attaching the following
> devices:
>
> -device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5
> -device vfio-pci,host=10:00.0,id=hostdev1,bus=pci.0,addr=0x6
> -device vfio-pci,host=0e:00.0,id=hostdev2,bus=pci.0,addr=0x7
> -device vfio-pci,host=0d:00.0,id=hostdev3,bus=pci.0,addr=0x8
>
>
> lspci -vnnn -d 10de:17c2 (on the host, I omitted the other 4 GPUs)
>
>
> 0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
>      subsystem: NVIDIA Corporation Device [10de:1132]
>      Flags: fast devsel, IRQ 28
>      Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
>      Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
>      Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
>      I/O ports at 3000 [size=128]
>      Expansion ROM at ba000000 [disabled] [size=512k]
>      Capabilities: [60] Power Management version 3
>      Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
>      Capabilities: [78] Express Legacy Endpoint, MSI 00
>      Capabilities: [100] Express Legacy Endpoint, MSI 00
>      Capabilities: [250] Latency Tolerance Reporting
>      Capabilities: [258] L1 PM Substates
>      Capabilities: [128] Power Budgeting <?>
>      Capabilities: [420] Advanced Error Reporting
>      Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>      Capabilities: [900] #19
>      Kernel driver in use: vfio-pci
>
> 0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
>      subsystem: NVIDIA Corporation Device [10de:1132]
>      Flags: fast devsel, IRQ 28
>      Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
>      Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
>      Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
>      I/O ports at 3000 [size=128]
>      Expansion ROM at ba000000 [disabled] [size=512k]
>      Capabilities: [60] Power Management version 3
>      Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
>      Capabilities: [78] Express Legacy Endpoint, MSI 00
>      Capabilities: [100] Express Legacy Endpoint, MSI 00
>      Capabilities: [250] Latency Tolerance Reporting
>      Capabilities: [258] L1 PM Substates
>      Capabilities: [128] Power Budgeting <?>
>      Capabilities: [420] Advanced Error Reporting
>      Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>      Capabilities: [900] #19
>      Kernel driver in use: vfio-pci
>
>
> 0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
>      !!! Unknown header type 7f
>      Kernel driver in use: vfio-pci
>
>
> 10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
>      subsystem: NVIDIA Corporation Device [10de:1132]
>      Flags: fast devsel, IRQ 28
>      Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
>      Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
>      Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
>      I/O ports at 3000 [size=128]
>      Expansion ROM at ba000000 [disabled] [size=512k]
>      Capabilities: [60] Power Management version 3
>      Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
>      Capabilities: [78] Express Legacy Endpoint, MSI 00
>      Capabilities: [100] Express Legacy Endpoint, MSI 00
>      Capabilities: [250] Latency Tolerance Reporting
>      Capabilities: [258] L1 PM Substates
>      Capabilities: [128] Power Budgeting <?>
>      Capabilities: [420] Advanced Error Reporting
>      Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>      Capabilities: [900] #19
>      Kernel driver in use: vfio-pci
>
>
> On the VM guest:
>
>
> lspci
>
>
> 00:06.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> 00:07.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> 00:08.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
>
> dmesg
>
>
> [    0.787786] pci 0000:00:05.0: [10de:17c2] type 7f class 0xffffff
> [    0.788970] pci 0000:00:06.0: [10de:17c2] type 00 class 0x030000
> [    0.855192] pci 0000:00:07.0: [10de:17c2] type 00 class 0x030000
> [    0.925003] pci 0000:00:08.0: [10de:17c2] type 00 class 0x030000
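
The guest slot that reports "type 7f" can be traced back to a host GPU by matching the addr= values on the -device lines against the addresses in the guest dmesg. The following is a minimal Python sketch of that cross-check, assuming the log excerpts are pasted in as strings and that every passed-through device sits on guest bus 00, function 0, as in the logs above. The function names are illustrative and not part of any tool.

```python
import re

# Excerpts copied from the logs quoted above.
QEMU_ARGS = """
-device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5
-device vfio-pci,host=10:00.0,id=hostdev1,bus=pci.0,addr=0x6
-device vfio-pci,host=0e:00.0,id=hostdev2,bus=pci.0,addr=0x7
-device vfio-pci,host=0d:00.0,id=hostdev3,bus=pci.0,addr=0x8
"""

GUEST_DMESG = """
[    0.787786] pci 0000:00:05.0: [10de:17c2] type 7f class 0xffffff
[    0.788970] pci 0000:00:06.0: [10de:17c2] type 00 class 0x030000
[    0.855192] pci 0000:00:07.0: [10de:17c2] type 00 class 0x030000
[    0.925003] pci 0000:00:08.0: [10de:17c2] type 00 class 0x030000
"""


def guest_slot_to_host(qemu_args):
    """Map guest slot number -> host address from vfio-pci -device arguments."""
    pattern = r"vfio-pci,host=([0-9a-f:.]+),.*?addr=0x([0-9a-f]+)"
    return {int(m.group(2), 16): m.group(1)
            for m in re.finditer(pattern, qemu_args)}


def failed_guest_slots(dmesg):
    """Return guest slot numbers whose header type was reported as 7f."""
    pattern = r"pci 0000:00:([0-9a-f]{2})\.0: \[[0-9a-f:]+\] type 7f"
    return [int(m.group(1), 16) for m in re.finditer(pattern, dmesg)]


def failed_host_devices(qemu_args, dmesg):
    """Translate failing guest slots into the host devices behind them."""
    mapping = guest_slot_to_host(qemu_args)
    return [mapping[slot] for slot in failed_guest_slots(dmesg) if slot in mapping]


print(failed_host_devices(QEMU_ARGS, GUEST_DMESG))  # prints ['0f:00.0']
```

The result agrees with the host-side lspci output, where 0f:00.0 is the one device showing "Unknown header type 7f".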
>
>
>
>
> On Mon, Oct 17, 2016 at 11:10 PM, Kevin Vasko <kvasko at gmail.com> wrote:
>
>> Thanks. I'm an idiot. I just replied to the email directly after the
>> subscription and wasn't paying attention. Thank you for correcting it.
>>
>> I was originally running 3.13.0-86-generic and upgraded to 3.19 to try
>> it before I posted this, but got the same results. I'll try a newer
>> version of the kernel and see what happens.
>>
>> Sorry to be dense but what do you mean by "retrain properly"? I assume
>> you mean that once it fails to reset it just never recovers?
>>
>> We have 2 other machines that I've never seen this problem with, so
>> what you are saying makes sense. This system does have a slightly more
>> specialized PCI bus to be able to stick 8 cards on a single bus (at least
>> that is my understanding), so at this point, either I'm hitting a bug that
>> is fixed in the kernel, or this PCI bus is not doing something that
>> vfio-pci is expecting (would be my speculation).
>>
>> I'll report back my findings tomorrow.
>>
>> Thanks for the help.
>>
>> -Kevin
>>
>>
>>
>>
>>
>>
>> On Mon, Oct 17, 2016 at 5:53 PM, Alex Williamson <
>> alex.williamson at redhat.com> wrote:
>>
>>> (generally a good idea to have a useful subject line)
>>>
>>> On Mon, 17 Oct 2016 16:26:15 -0500
>>> Kevin Vasko <kvasko at gmail.com> wrote:
>>> >
>>> > Any suggestions on debugging a !!! Unknown header type 7f?
>>> >
>>>
>>> This usually means that the device didn't come back from bus reset and
>>> re-reading the PCI config space where the device was just gives a -1
>>> response.  lspci tries to interpret that bogus data and gives results
>>> like you see.  You might try a newer kernel, we've probably fixed some
>>> things in the bus reset path since v3.19.  It looks like you continue
>>> to see the bogus data once it gets into this state, so it's probably
>>> not a "simple" device coming out of reset too slowly problem.  Possibly
>>> the PCIe link doesn't retrain properly sometimes after a bus reset.  If
>>> a new kernel doesn't help, I could give you instructions for performing
>>> a bus reset with setpci and you could test how reliably you can reset
>>> the device and read config space after.  Thanks,
>>>
>>> Alex
>>>
>>
>>
>
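
The "-1 response" described above means every config space read returns all bits set. The following Python sketch illustrates why that data is printed as "rev ff", "prog-if ff" and "Unknown header type 7f": the header type byte lives at offset 0x0E, and its top bit is the multi-function flag, so 0xff masked with 0x7f leaves 7f. The function name and the 16-byte sample buffers are illustrative assumptions, and the healthy sample is built from the IDs shown in the lspci output rather than dumped from the hardware.

```python
def decode_header(cfg):
    """Decode a few fields from the first 16 bytes of PCI config space."""
    if len(cfg) < 16:
        raise ValueError("need at least 16 bytes of config space")
    return {
        "vendor": int.from_bytes(cfg[0x00:0x02], "little"),
        "device": int.from_bytes(cfg[0x02:0x04], "little"),
        "revision": cfg[0x08],
        "prog_if": cfg[0x09],
        "class": int.from_bytes(cfg[0x0A:0x0C], "little"),
        # Bit 7 of the header type byte is the multi-function flag,
        # so only the low 7 bits give the header layout type.
        "header_type": cfg[0x0E] & 0x7F,
        # A device that has dropped off the bus reads back as all ones.
        "all_ones": all(b == 0xFF for b in cfg[:16]),
    }


# What a device that did not come back from reset looks like.
dead = decode_header(b"\xff" * 16)
print(f"rev {dead['revision']:02x}, prog-if {dead['prog_if']:02x}, "
      f"header type {dead['header_type']:02x}")
# prints: rev ff, prog-if ff, header type 7f

# Hypothetical healthy header using the IDs from the lspci output above.
healthy = decode_header(bytes.fromhex("de10c217 00000000 a1000003 00000000"))
print(f"{healthy['vendor']:04x}:{healthy['device']:04x} "
      f"class {healthy['class']:04x} all_ones={healthy['all_ones']}")
# prints: 10de:17c2 class 0300 all_ones=False
```

On a Linux host the same bytes can be read from the device's sysfs config file, for example /sys/bus/pci/devices/0000:0f:00.0/config, and passed to the function to check whether a GPU has responded again after a reset.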