
Re: [vfio-users] Radeon 5770 Passthrough works in Chipset 440FX but not in Q35



All right, I have a lot more info, since I spent several hours gathering logs. By the end, my logging became ridiculously messy, because I couldn't keep track of all the variables and settings in each case plus some edge cases. So it isn't 100% scientific, but I got a general idea of what happens and when.

TL;DR: I got a whole bunch of specific MSR errors in BOTH 440FX and Q35, which I tried my best to log to see how reproducible they are. Some happen even with no Passthrough, just the plain qxl-vga, so I can't blame the Radeon Drivers there. They should be related to Windows 10 itself, since I even used the AMD tool for a clean uninstall of their Drivers. Since Passthrough works in 440FX but not in Q35 (I even tried with ignore_msrs, with the same behavior), the MSR errors should be unrelated to my issue. Also, most of the reports where ignoring MSRs fixed something are about specific games or applications, usually not the Drivers themselves.
I also tried with Gparted, which is based on Debian. It has A LOT more MSR errors; some are the same as in W10, some are totally new. However, rather surprisingly, Passthrough WORKS in Q35. Since I have little experience configuring the GPU in Linux, I don't know if it really works (lspci -vvv in a Gparted terminal shows the radeon Driver in use for the Radeon 5770). But as far as I know, when using UEFI GOP, the OS gets the resolution that the Firmware started the GPU at, and can't change it unless Drivers are present. Since after booting it maxed out at 1920*1080 and I could also change it, I suppose it works.
I also tried upgrading to QEMU 2.6. W10 still doesn't work with Q35 and Passthrough, but 440FX still works, so I'm not affected by the 2.6 VGA Passthrough regression. Since I tested Gparted only with 2.6 already installed, I don't know whether it also works in 2.5.1 with Q35.




I started by monitoring dmesg -w in a Terminal on the host, to see the log live.

dmesg -w

This way I could get the exact moment when the MSR errors appeared.
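
To keep track of which MSR messages show up in each test run, the live log can be parsed instead of read by eye. This is a minimal sketch, assuming the message layout seen in the lines quoted in this mail; the names parse_msr_line and tally are made up for illustration, and the exact kernel message format may differ between kernel versions.

```python
import re
from collections import Counter

# Matches lines such as:
#   kvm [658]: vcpu0, guest rIP: 0xfffff800705b6067 unhandled rdmsr: 0x641
# The "kvm [PID]:" prefix is not required, since some pasted lines lack it.
MSR_LINE = re.compile(
    r"vcpu(?P<vcpu>\d+), guest rIP: (?P<rip>0x[0-9a-fA-F]+) "
    r"(?P<kind>unhandled|ignored) (?P<op>rdmsr|wrmsr): (?P<msr>0x[0-9a-fA-F]+)"
)


def parse_msr_line(line):
    """Return a dict describing one KVM MSR message, or None if it is not one."""
    m = MSR_LINE.search(line)
    if m is None:
        return None
    return {
        "vcpu": int(m.group("vcpu")),
        "rip": int(m.group("rip"), 16),
        "kind": m.group("kind"),
        "op": m.group("op"),
        "msr": int(m.group("msr"), 16),
    }


def tally(lines):
    """Count messages per (kind, op, msr) over any iterable of log lines."""
    counts = Counter()
    for line in lines:
        event = parse_msr_line(line)
        if event is not None:
            counts[(event["kind"], event["op"], event["msr"])] += 1
    return counts
```

One way to use it would be to pipe dmesg -w into a small script that calls tally(sys.stdin) per VM session, so that each run produces a short summary instead of a wall of lines.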


When I tried ignore_msrs, I created the file:

/etc/modprobe.d/kvm.conf
options kvm ignore_msrs=y


cat /sys/module/kvm/parameters/ignore_msrs
Reported a Y, so it worked as intended

I also manually enabled or disabled it for a few tests using:

echo 1 > /sys/module/kvm/parameters/ignore_msrs
echo 0 > /sys/module/kvm/parameters/ignore_msrs

I tried this mostly for Q35 in W10, but after seeing that it didn't work anyway, I removed ignore_msrs.
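
Since the setting was toggled by hand between tests, it is easy to lose track of whether it was on for a given run. A minimal sketch of a helper that reads the current state so it can be noted next to each log; the function name ignore_msrs_enabled is hypothetical, and it assumes the parameter file reports Y or N as seen above.

```python
from pathlib import Path

# Same sysfs file that was checked with cat above.
IGNORE_MSRS = Path("/sys/module/kvm/parameters/ignore_msrs")


def ignore_msrs_enabled(path=IGNORE_MSRS):
    """Return True or False for the current setting.

    Returns None if the file does not exist (for example, kvm not loaded).
    """
    try:
        value = Path(path).read_text().strip()
    except FileNotFoundError:
        return None
    # Accept "1" as well as "Y", in case the value is reported differently.
    return value in ("Y", "y", "1")
```

Reading the file does not need root; only writing to it (the echo commands above) does.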


MSR ERRORS

In both 440FX and Q35, there were two types of MSR events. Both happen when Windows 10 boots, right before the splash screen with the Windows Logo switches to the login screen. I think I saw them with both the Catalyst 15.7 Drivers (15.200.1046.0) and Crimson 16.2.1 Beta (15.301.1901.0), and even when not doing Passthrough at all, including after wiping the Radeon Drivers with an AMD uninstall utility that should fully remove them.

Sometimes, it was a single error:
kvm [658]: vcpu0, guest rIP: 0xfffff800705b6067 unhandled rdmsr: 0x641

Addresses vary; here are three variants:
kvm [607]: vcpu0, guest rIP: 0xfffff800ea856067 unhandled rdmsr: 0x641

kvm [703]: vcpu0, guest rIP: 0xfffff8009f4e6067 unhandled rdmsr: 0x641

kvm [629]: vcpu6, guest rIP: 0xfffff8003f3f6067 unhandled rdmsr: 0x641


Other times, there were 4 in a row:
kvm [1051]: vcpu5, guest rIP: 0xfffff80179106067 ignored rdmsr: 0x641
kvm [1051]: vcpu5, guest rIP: 0xfffff8017910607d ignored rdmsr: 0x606
kvm [1051]: vcpu5, guest rIP: 0xfffff80179106261 ignored rdmsr: 0x606
kvm [1051]: vcpu5, guest rIP: 0xfffff801791010bc ignored rdmsr: 0x641

Addresses vary; here is a variant:

kvm [1107]: vcpu0, guest rIP: 0xfffff801ba8b6067 ignored rdmsr: 0x641
kvm [1107]: vcpu0, guest rIP: 0xfffff801ba8b607d ignored rdmsr: 0x606
kvm [1107]: vcpu0, guest rIP: 0xfffff801ba8b6261 ignored rdmsr: 0x606
kvm [1107]: vcpu0, guest rIP: 0xfffff801ba8b10bc ignored rdmsr: 0x641

I think I got those while I was using ignore_msrs, since it says ignored and not unhandled. Regardless, they kept happening after I removed ignore_msrs.

I'm not sure why they change, but the results seem consistent: within the same host session I always get either the single one or the four consecutive ones, across VM reboots and shutdown/start cycles. Since at times the Video Card died and I found no way to recover it, I had to reboot the host, and that's when I started to notice those two patterns. Maybe they're related to whether the host did a cold boot or a reboot, but I didn't test that. Regardless, there are just two MSRs there: always 0x641, and sometimes 0x606.


W10 also automatically enters a sort of Recovery Mode (I don't know its formal name, but it should be a sort of Safe Mode replacement) after several failed boots (2-3). I saw it often since Q35 with Passthrough never worked, so after a few tries, W10 enters that mode instead. The MSRs appear just before it asks you for the Keyboard Layout, and I think I also saw them in 440FX:

vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639
vcpu0, guest rIP: 0xfffff8018aec2dfe ignored rdmsr: 0x639

(Yes, 10 in a row)
I think I also saw a variant of it where each vcpu (0-7) produced a 0x639.
Another variant in Recovery Mode included a continuous, repeating spam of:

kvm_get_msr_common: 91 callbacks suppressed
...plus the 10 previous 0x639 rdmsrs, EVERY FEW SECONDS. I think it happened only while I was ignoring MSRs. Bah, for more accurate info I should test again...


Finally, I tested booting the Gparted Live CD (gparted-live-0.26.0-2-i686.iso). I used a rather reduced script with fresh OVMF copies (of the same version that I used 3 weeks ago, mentioned in a previous mail) instead of the already existing ones, just to have a clean NVRAM. I tested with both 440FX and Q35, in both cases including the PCIe Root Port.
I booted with the first option (Default settings), but at least once I tried with the second one (Default settings, KMS) and results were the same:

kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x1c9
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x1a6
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x1a7
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x3f6
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x606
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x34
These happen consecutively as soon as I choose an option. And just before it changes to the console-style GUI asking for Keymap, these appear:
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x611
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x639
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x641
kvm [875] vcpu0, guest rIP: 0xc2047b0f unhandled rdmsr: 0x619
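
For reference, the MSR numbers seen in these logs can be mapped to names. This is a small lookup sketch; the names are given as I recall them from the Linux kernel's msr-index.h header and are an assumption worth double-checking against that file or the Intel SDM. The helper describe is made up for illustration.

```python
# MSR numbers that appear in the logs in this mail.
# ASSUMPTION: names recalled from Linux's msr-index.h, not verified here.
MSR_NAMES = {
    0x34: "MSR_SMI_COUNT",
    0x1A6: "MSR_OFFCORE_RSP_0",
    0x1A7: "MSR_OFFCORE_RSP_1",
    0x1C9: "MSR_LASTBRANCH_TOS",
    0x3F6: "MSR_PEBS_LD_LAT_THRESHOLD",
    0x606: "MSR_RAPL_POWER_UNIT",
    0x611: "MSR_PKG_ENERGY_STATUS",
    0x619: "MSR_DRAM_ENERGY_STATUS",
    0x639: "MSR_PP0_ENERGY_STATUS",
    0x641: "MSR_PP1_ENERGY_STATUS",
}

# The 0x6xx entries above are the RAPL (power/energy reporting) group.
RAPL_MSRS = {0x606, 0x611, 0x619, 0x639, 0x641}


def describe(msr):
    """Return a short human-readable label for an MSR number."""
    name = MSR_NAMES.get(msr, "unknown")
    group = " (RAPL)" if msr in RAPL_MSRS else ""
    return f"{msr:#x} {name}{group}"
```

If the names are right, all three MSRs that Windows 10 reads (0x606, 0x639, 0x641) are power and energy counters, which would fit with them being unrelated to the Passthrough problem.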

Regardless, VGA Passthrough seems to work in Q35 since I could get to the GUI Desktop (Gnome I think). It uses the Open Source radeon Driver.


Additionally, in one of my tests with Gparted, I got a Machine Check Error between the 0x34 and the 0x611:

mce: [Hardware Error]: Machine check events logged

It happened only once, and I couldn't reproduce it. According to mcelog, it was produced by...

Hardware event. This is not a software error.
MCE 0
CPU 3 BANK 0
TIME 1466601211 Wed Jun 22 10:13:31 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60

I tried to get more info since I was worried that it could be a real Hardware Error, but based on this:
unix.stackexchange.com/questions/165222/mce-error-mca-internal-parity-error
it can be a harmless product of an erratum in Haswell Processors. Besides, the VM still worked, it finished booting, and was usable, so it didn't look like a fatal error to me. But it still worries me a bit...
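
The STATUS value printed by mcelog can also be decoded by hand to confirm what the text output says. This is a minimal sketch, assuming the usual IA32_MCi_STATUS bit layout from the Intel SDM (bit 63 valid, 62 overflow, 61 uncorrected, 60 enabled, bits 52:38 corrected error count, 31:16 model-specific code, 15:0 MCA error code); the function name decode_mci_status is made up for illustration.

```python
def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its main fields.

    ASSUMPTION: field positions follow the generic Intel SDM layout;
    the corrected error count field is only meaningful on CPUs that
    report support for it.
    """
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }


# Value taken from the mcelog output above.
fields = decode_mci_status(0x90000040000F0005)
```

For this value the uncorrected bit is clear and the enabled bit is set, matching the "Corrected error" and "Error enabled" lines, and the low 16 bits are 0x0005, which is the code mcelog reports as an internal parity error.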


Finally, the Firmware rather often records SMBIOS errors related to the Video Card, but I don't know the precise moment when they are generated. They happen in both Xen and KVM, and usually when I'm testing things like I did today; during the previous weeks, when my gaming VM was continuously on, there were no errors. Common sense suggests that each error is generated around the time I create the VM with Passthrough. But since I had fewer VM boot failures than SMBIOS errors, I don't know what generates them, or if they are always generated. They look like this:

DATE              TIME         ERROR CODE  SEVERITY
06/02/2016   07:47:39   Smbios 0x0A   Bus01(DevFn00)
06/02/2016   07:52:47   Smbios 0x0A   Bus01(DevFn01)

Since I don't know when they are generated, they may be either warnings or errors (maybe they are generated when the Video Card refuses to work any longer and forces me to reboot the host). I could try to monitor them if someone tells me how to log them in real time in Linux (like I do with dmesg -w). Otherwise, I would have to write down the exact time when I start each VM session and compare it with the Firmware SMBIOS error log.
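
The manual comparison described above could be scripted. This is a rough sketch, assuming the Firmware log uses MM/DD/YYYY dates and the same clock as the host (both unverified), and that the VM start times were written down by hand; the names parse_fw_time and match_events and the 5-minute window are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def parse_fw_time(date_str, time_str):
    """Parse one Firmware log timestamp.

    ASSUMPTION: dates are MM/DD/YYYY; swap %m and %d if the Firmware
    uses DD/MM/YYYY instead.
    """
    return datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M:%S")


def match_events(session_starts, fw_events, window=timedelta(minutes=5)):
    """Pair each Firmware event with the latest VM start shortly before it.

    Returns a list of (event_time, session_start_or_None) tuples.
    """
    starts = sorted(session_starts)
    result = []
    for event in fw_events:
        candidates = [s for s in starts if timedelta(0) <= event - s <= window]
        result.append((event, candidates[-1] if candidates else None))
    return result
```

Events that end up paired with None would be the interesting ones, since they did not happen shortly after any recorded VM start.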


Also, at least once, I left the computer unattended with Windows 10 using Passthrough, and when I came back, there were two VFIO errors in dmesg saying that it couldn't switch power state from D0 or D3 (I don't recall the exact error). It forced me to reboot the host. I think that Windows 10 entered sleep mode and the Video Card followed it, but failed to go back to the full power state. I don't know if it's VFIO's fault or the guest Radeon Drivers'. I didn't try to debug this one, but I suppose the fix is as simple as disabling sleep modes.


Anyway, if you need more specific or accurate data, I could try to redo everything. Sadly, I find it rather chaotic to keep track of so many things manually, so I have to figure out how to do it in a more organized way than this mail.
