
Re: [vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia driver crashes often, seemingly under load



I want to correct what I said below.  ZFS was not the problem.  It seems that having my VMs on any sort of storage other than (so far) a single disk with the OS precludes multi-seat.

I'm about to post a much more detailed message about it, but I wanted to correct what I said earlier.

----- Original Message -----
From: "Brian Yglesias" <brian atlanticdigitalsolutions com>
To: "Alex Williamson" <alex l williamson gmail com>, "Torbjorn Jansson" <torbjorn jansson mbox200 swipnet se>
Cc: "vfio-users" <vfio-users redhat com>
Sent: Friday, July 22, 2016 11:28:56 AM
Subject: Re: [vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia driver crashes often, seemingly under load

The problem seems to have been between ZFS and vfio.

If I move my VMs to any other type of partition/volume, the problem goes away.

I've verified that it can be reproduced consistently with two X58 chipset MBs, at least.

Should I file a bug report with someone?

Thanks,
Brian

----- Original Message -----
From: "Brian Yglesias" <brian atlanticdigitalsolutions com>
To: "Alex Williamson" <alex l williamson gmail com>, "Torbjorn Jansson" <torbjorn jansson mbox200 swipnet se>
Cc: "vfio-users" <vfio-users redhat com>
Sent: Saturday, July 16, 2016 6:47:15 AM
Subject: RE: [vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia driver crashes often, seemingly under load

I spent some time troubleshooting, and I have things so that every use-case 
works except multi-seat.  I’m not sure how long it would be stable for with 
both users in MS Office, but if one user does anything remotely 3D 
intensive, then the nvidia driver will crash in /both/ VMs within a minute 
or two, and within seconds of each other.  The VMs must be hard shut down, 
despite a notification stating that the driver has recovered in some cases.



There is some variability to it.  Sometimes they will crash at the desktop, 
before anything “3D intensive”, but this is rare.  Usually one of them must 
be using its GPU to some significant extent.



One VM at a time works fine, irrespective of what I do with it.  Very well, 
actually.  In addition, I can pass two GPUs to one VM, and I can run one 
benchmark on the 660 and two on the 970 (1080p and 4K/1080p, respectively), 
etc, and it will not crash for at least 8 hours at a time.  That’s as long 
as I’ve run the tests for.  The system is actually fairly useable 
throughout, despite CPU being at redline as well.



I’ve made sure the RAM is good, and swapped the PSU for a bigger one.  I 
give each VM 8 GB, and leave about 6 GB for the host.  I have another 
motherboard I can repurpose as a test in the next few days, but in any case 
I’m pretty sure at this point I don’t have a hardware problem per se.



There could still be some conflict with my host GPU/driver.  I notice that 
on the boot display I see the vfio module get loaded, followed by “vga arb 
device changed decodes”… while still in the initrd, and then everything 
stops on that display.  This is more or less what I expect to happen, but 
maybe I’m wrong to.  I notice that I cannot pass the boot GPU to a VM.  If I 
try, the screen goes from being frozen with the vfio output mentioned above, 
to idle, and stays that way.



Obviously “something is wrong” there.  Even though I don’t use that GPU with 
VMs, and despite having blacklisted the nouveau/nvidia drivers, maybe it’s 
somehow related.  Seems doubtful.
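
One way to check the point above, that no host graphics driver has claimed any of the cards, is to read the driver binding from sysfs.  The following is a minimal Python sketch and not something from the original thread.  The function name gpu_driver_bindings and its sysfs_root parameter are made up for illustration, and it assumes the standard Linux sysfs layout under /sys/bus/pci/devices, where each device has a "class" file and, when a driver is bound, a "driver" symlink.

```python
import os
from pathlib import Path


def gpu_driver_bindings(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: driver_name_or_None} for display-class PCI devices.

    Hypothetical helper for illustration; sysfs_root is a parameter only so
    the function can be pointed at a test directory.
    """
    root = Path(sysfs_root)
    bindings = {}
    if not root.is_dir():
        # Not a Linux host, or sysfs is not mounted: nothing to report.
        return bindings
    for dev in sorted(root.iterdir()):
        class_file = dev / "class"
        if not class_file.is_file():
            continue
        # PCI base class 0x03 is "display controller" (VGA, 3D, and so on).
        if not class_file.read_text().strip().startswith("0x03"):
            continue
        driver_link = dev / "driver"
        # The "driver" symlink is absent when no driver has claimed the device.
        if driver_link.is_symlink():
            bindings[dev.name] = os.path.basename(os.readlink(driver_link))
        else:
            bindings[dev.name] = None
    return bindings
```

Calling gpu_driver_bindings() on the host and printing the result should show vfio-pci next to each card intended for a VM.  A card listed with nouveau or nvidia would suggest the blacklist is not taking effect for that device.  This only inspects display-class functions; a card's HDMI audio function has a different class code and is not covered by this sketch.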



On the software side, I can try another distro.



Changing the motherboard or the distro are not really good solutions for me, 
however.  I’m at something of an impasse, and I could use a suggestion.



Thanks in advance,

Brian



From: Alex Williamson [mailto:alex l williamson gmail com]
Sent: Thursday, July 7, 2016 3:28 PM
To: Torbjorn Jansson
Cc: Brian Yglesias; vfio-users
Subject: Re: [vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia 
driver crashes often, seemingly under load



On Thu, Jul 7, 2016 at 1:20 PM, Torbjorn Jansson 
<torbjorn jansson mbox200 swipnet se> wrote:

On 2016-07-07 20:01, Brian Yglesias wrote:

I've been trying to get GPU passthrough to work more reliably for a few 
days.

I have an Asus Rampage III Formula (X58 chipset, LGA1366) with the latest BIOS, 
Xeon X5670, kernel 4.4.13, QEMU 2.5.1.1.  I'm passing through a GTX 660 and 
a GTX 970, sometimes to two different VMs, and sometimes to the same one.


i have a gtx970 and it works pretty well for gpu passthru.
but i'm not so sure a 660 will work and i suspect you will have reset 
issues.



Seems to be some growing FUD with nvidia and reset issues.  AFAIK, there are 
no reset issues for Kepler and newer cards, including the 660.  Fermi cards 
always seem to cause problems, but I don't necessarily think it's reset 
related.  Reset problems on nvidia are more likely a result of trying to 
assign the primary host graphics or getting the card into a bad state with 
host graphics drivers.  I have a GTX660, it doesn't get used often for this 
purpose but IIRC, it works just fine.

