[vfio-users] Need help with GPU Passthrough on Ryzen C6H + GTX 980 Ti + GTX 1060 6G

Thiago Ramon thiagoramon at gmail.com
Sat Jul 8 01:42:25 UTC 2017


On Thu, Jul 6, 2017 at 1:46 PM, Thiago Ramon <thiagoramon at gmail.com> wrote:

>
> On Thu, Jul 6, 2017 at 2:20 AM, Alex Williamson <
> alex.l.williamson at gmail.com> wrote:
>
>> On Wed, Jul 5, 2017 at 10:23 PM, Thiago Ramon <thiagoramon at gmail.com>
>> wrote:
>>>
>>>
>>> Here, dropped the raw message in pastebin: https://pastebin.com/hfJ6ryJg
>>>
>>> That particular run was trying to pass the 980 Ti, which is the boot
>>> device and which probably had something else prodding at it (I'll give it
>>> another try and check what else was attaching to it). I've mostly focused
>>> on passing the 1060 though, which doesn't get touched by anything but
>>> vfio-pci and also doesn't show any mmap issues. Here's the last QEMU run
>>> with SeaBIOS:
>>>
>>> https://pastebin.com/DEPpewCH
>>>
>>> And the last one from OVMF:
>>>
>>> https://pastebin.com/L7gkrm36
>>>
>>> In the kernel log, I only get the vfio_bar_restore messages. One
>>> interesting and consistent pattern is that SeaBIOS always generates 2
>>> pairs of warnings (one for the GPU, one for audio), while OVMF generates
>>> quite a few (a dozen or more; I don't have a log handy). Probably not
>>> relevant, as the failure apparently happens before the first message
>>> anyway.
>>>
>>> Another detail that may be relevant: Whenever I try a passthrough (and
>>> fail), the kernel fails to soft restart. It gets to the last stage where it
>>> would do a soft reset but the console just sits there. Could this just be
>>> vfio_pci trying to do something with the unresponsive card, or something
>>> else that may be a clue to what's going on?
>>>
>>
>> Yep, here's what I suspected about the D3 warning:
>>
>> >PCI state after passthrough attempt:
>> > 29:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200
>> [GeForce GTX 980 Ti] [10de:17c8] (rev ff) (prog-if ff)
>> >   !!! Unknown header type 7f
>> >   Kernel driver in use: vfio-pci
>> >   Kernel modules: nouveau, nvidia_drm, nvidia
>> >
>> > 29:00.1 Audio device [0403]: NVIDIA Corporation GM200 High Definition
>> Audio [10de:0fb0] (rev ff) (prog-if ff)
>> >   !!! Unknown header type 7f
>> >   Kernel driver in use: vfio-pci
>> >   Kernel modules: snd_hda_intel
>>
>> The card isn't actually stuck in D3, it's basically disappeared from the
>> bus and all reads from config space are returning -1, which is
>> indistinguishable from the D3 power state for the bits that tell us the
>> power state.  This is probably the result of doing a bus reset, but that's
>> also our only way of putting the device back to a known state before
>> starting it in the VM.  You might try to see if you can reproduce this
>> result manually with setpci.  We do a bus reset by finding the bridge
>> upstream of the device, lspci -t is handy for this with a tree view of the
>> PCI topology.  As an example:
>>
>> https://pastebin.com/c3URT6vx
>>
>> Bus numbers are shown in brackets, so if I want the parent bridge of
>> device 01:00.0, look to the left of [01]--00.0 to find 01.0.  This is
>> attached to the root bus at [0000:00], so the full address of the parent
>> bridge is 0000:00:01.0.
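>>
>> As a purely illustrative, abbreviated tree (hypothetical output; the
>> pastebin above has the real thing), the shape to look for is:
>>
>>   # lspci -t
>>   -[0000:00]-+-00.0
>>              +-01.0-[01]--+-00.0
>>              |            \-00.1
>>              \-02.0-[02]----00.0
>>
>> Here a GPU at 01:00.0 and its audio function at 01:00.1 hang off bus
>> [01], and the bridge to their left, 0000:00:01.0, is the parent whose
>> BRIDGE_CONTROL register we poke below.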
>>
>> We can access the bridge control register using
>>
>> # setpci -s 0000:00:01.0 BRIDGE_CONTROL
>>
>> The secondary bus reset bit is 0x40.  We want to set this bit:
>>
>> # setpci -s 0000:00:01.0 BRIDGE_CONTROL=40:40
>>
>> Then clear it:
>>
>> # setpci -s 0000:00:01.0 BRIDGE_CONTROL=00:40
>>
>> Then run lspci on the bus to see if the device is still present.  In your
>> case it would be bus 29, so you'd run
>>
>> # lspci -vvv -s 0000:29:
>>
>> Do you get output like above with the 'Unknown header type 7f' or a
>> complete listing of the device?  Be sure to reboot the system after running
>> this test; regardless of the result, the device will need to be
>> re-initialized, and clearly nothing should be using the device while doing
>> this.  If the
>> graphics card doesn't recover from a bus reset, then something about this
>> system setup is not compatible with this use case.  Thanks,
>>
>> Alex
>>
>
> Ok, did some more testing. First, starting from both cards bound to the
> NVidia driver, I shut down X, removed the nvidia module, bound my
> secondary card to vfio-pci and tried to reset the bus. It indeed failed to
> reset properly and got stuck.
> Then I switched over to my primary passthrough setup to see what was
> grabbing the card's memory, which turned out to be vesafb, even though I
> had disabled it.
> After adding a bunch more options to the boot command line, I managed to
> properly block the card from anything else, and proceeded to test the bus
> reset, which worked this time (a rough sketch of the commands is below).
> Then I tried running the VM (without an external BIOS), which failed,
> complaining that it couldn't access the BIOS. I rebooted and tried again
> with a pre-dumped BIOS, and it still failed in the same way as before.
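>
> For anyone trying to reproduce this, here's a rough, illustrative sketch
> of the kind of commands involved -- the kernel options and PCI/device IDs
> are assumptions on my part and will vary by system, so check lspci -nn
> and lspci -t first:
>
>   # kernel command line: keep framebuffer drivers off the card
>   # (these are the usual suspects; the exact set may differ)
>   video=vesafb:off video=efifb:off
>
>   # with X stopped, remove the NVidia modules and hand the 1060 to
>   # vfio-pci (10de 1c03 / 10de 10f1 should be the GTX 1060 6GB and its
>   # HDMI audio function)
>   rmmod nvidia_drm nvidia_modeset nvidia
>   modprobe vfio-pci
>   echo 10de 1c03 > /sys/bus/pci/drivers/vfio-pci/new_id
>   echo 10de 10f1 > /sys/bus/pci/drivers/vfio-pci/new_id
>
>   # secondary bus reset through the card's parent bridge, as Alex described
>   setpci -s <parent-bridge> BRIDGE_CONTROL=40:40
>   setpci -s <parent-bridge> BRIDGE_CONTROL=00:40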
>
> Returning to my secondary card, I tried to reset the bus again, this time
> from a fresh boot, and it seems to have worked fine. Here are the logs:
>
> https://pastebin.com/94F5wURY
>
> I then proceeded to reset the bus a few more times, to see if repetition
> was a problem, but at least half a dozen resets don't seem to have caused
> any issues.
> Any other ideas?
>

Progress!(?)

I decided to try PCI passthrough with VirtualBox to see if I could get
anything out of it, since it handles passthrough quite differently from
QEMU. To my surprise, it actually managed to pass through my GTX 1060,
though due to the nature of NVidia's drivers I got stuck with a Code 43.
I'm not sure whether the virtualization showing through is the only issue,
and I couldn't get the card to actually start in a Linux guest either.
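
On the QEMU side, the usual cause of Code 43 in a VM is the driver
detecting the hypervisor, and the commonly documented workaround is to hide
it. Roughly (this is the standard recipe, not something I've confirmed
helps in my particular case):

  -cpu host,kvm=off,hv_vendor_id=whatever

or the libvirt equivalent:

  <features>
    <kvm><hidden state='on'/></kvm>
    <hyperv><vendor_id state='on' value='whatever'/></hyperv>
  </features>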

Anyway, the interesting news is that, at least to a cursory lspci in the
guest, the card looks fine, and it doesn't get corrupted (at least until I
try to reset it manually afterwards, though that also happens if I've used
the card on the host beforehand).

Here's a copy of the dmesg output from a whole run, which might help
clarify what's going on here: https://pastebin.com/U6Qvu0Wh

So far I've failed to find anyone running PCI passthrough of a modern
NVidia GPU with VirtualBox, so I don't think that route is going to be
viable (though I'll keep trying a bit more). But maybe by comparing the two
approaches we can figure out which part of the process is going wrong with
the vfio/QEMU option.

Thanks for the help so far.