[vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia driver crashes often, seemingly under load
Brian Yglesias
brian at atlanticdigitalsolutions.com
Sat Jul 16 10:47:35 UTC 2016
I spent some time troubleshooting, and every use case now works except
multi-seat. I’m not sure how long it would stay stable with both users in
MS Office, but if one user does anything remotely 3D-intensive, the nvidia
driver crashes in /both/ VMs within a minute or two, and within seconds of
each other. The VMs must then be hard shut down, even though in some cases
a notification states that the driver has recovered.
There is some variability to it. Sometimes they will crash at the desktop,
before anything “3D intensive”, but this is rare. Usually one of them must
be using its GPU to some significant extent.
One VM at a time works fine, irrespective of what I do with it. Very well,
actually. In addition, I can pass both GPUs to one VM and run one
benchmark on the 660 and two on the 970 (1080p and 4K/1080p, respectively),
etc., and it will not crash for at least 8 hours at a time. That’s as long
as I’ve run the tests for. The system is actually fairly usable
throughout, despite the CPU being at redline as well.
I’ve made sure the RAM is good, and swapped the PSU for a bigger one. I
give each VM 8 GB, and leave about 6 GB for the host. I have another
motherboard I can repurpose as a test in the next few days, but in any case
I’m pretty sure at this point I don’t have a hardware problem per se.
There could still be some conflict with my host GPU/driver. I notice that
on the boot display I see the vfio module get loaded, followed by “vga arb
device changed decodes”… while still in the initrd, and then everything
stops on that display. This is more or less what I expect to happen, but
maybe I’m wrong to. I notice that I cannot pass the boot GPU to a VM. If I
try, the screen goes from being frozen with the vfio output mentioned above,
to idle, and stays that way.
Obviously “something is wrong” there. Even though I don’t use that GPU with
VMs, and despite having blacklisted the nouveau/nvidia drivers, maybe it’s
somehow related. Seems doubtful.
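One way to double-check which GPU the host treats as its boot display, and which driver each card is actually bound to, is to read sysfs. The following is only a minimal sketch in Python, assuming the standard Linux sysfs layout (the `class` and `boot_vga` files and the `driver` symlink under /sys/bus/pci/devices). The function name `list_display_devices` and its `sysfs_root` parameter are made up for illustration and are not part of any tool.

```python
import os


def list_display_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return one dict per PCI display controller found under sysfs_root.

    Hypothetical helper: sysfs_root is a parameter only so the function
    can be pointed at a different directory tree for testing.
    """
    results = []
    if not os.path.isdir(sysfs_root):
        return results
    for addr in sorted(os.listdir(sysfs_root)):
        dev = os.path.join(sysfs_root, addr)
        try:
            with open(os.path.join(dev, "class")) as f:
                pci_class = int(f.read().strip(), 16)
        except (OSError, ValueError):
            continue
        # PCI base class 0x03 is "display controller".
        if (pci_class >> 16) != 0x03:
            continue
        # boot_vga is only exposed for VGA-class devices; treat absence as False.
        boot_vga = False
        try:
            with open(os.path.join(dev, "boot_vga")) as f:
                boot_vga = f.read().strip() == "1"
        except OSError:
            pass
        # The driver symlink is missing when no driver is bound.
        driver_link = os.path.join(dev, "driver")
        driver = None
        if os.path.islink(driver_link):
            driver = os.path.basename(os.readlink(driver_link))
        results.append({"address": addr, "boot_vga": boot_vga, "driver": driver})
    return results


if __name__ == "__main__":
    for d in list_display_devices():
        print(d["address"], "boot_vga=%s" % d["boot_vga"], "driver=%s" % d["driver"])
```

If the passthrough cards show a driver other than vfio-pci, or if a card meant for a guest is the one flagged as boot_vga, that would be worth ruling out before swapping hardware.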
On the software side, I can try another distro.
Changing the motherboard or the distro is not really a good solution for me,
however. I’m at something of an impasse, and I could use a suggestion.
Thanks in advance,
Brian
From: Alex Williamson [mailto:alex.l.williamson at gmail.com]
Sent: Thursday, July 7, 2016 3:28 PM
To: Torbjorn Jansson
Cc: Brian Yglesias; vfio-users
Subject: Re: [vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia
driver crashes often, seemingly under load
On Thu, Jul 7, 2016 at 1:20 PM, Torbjorn Jansson
<torbjorn.jansson at mbox200.swipnet.se> wrote:
On 2016-07-07 20:01, Brian Yglesias wrote:
I've been trying to get GPU passthrough to work more reliably for a few
days.
I have an Asus Rampage III Formula (X58 chipset, LGA1366) with the latest BIOS,
Xeon X5670, kernel 4.4.13, QEMU 2.5.1.1. I'm passing through a GTX 660 and
a GTX 970, sometimes to two different VMs, and sometimes to the same one.
i have a gtx970 and it works pretty well for gpu passthru.
but i'm not so sure a 660 will work and i suspect you will have reset
issues.
Seems to be some growing FUD with nvidia and reset issues. AFAIK, there are
no reset issues for Kepler and newer cards, including the 660. Fermi cards
always seem to cause problems, but I don't necessarily think it's reset
related. Reset problems on nvidia are more likely a result of trying to
assign the primary host graphics or getting the card into a bad state with
host graphics drivers. I have a GTX 660; it doesn't get used often for this
purpose, but IIRC it works just fine.