[rhos-list] Nova-network v.s. Quantum in Openstack preview

Sat Feb 16 03:52:52 UTC 2013

Hi, all,

I'm Eduardo from the KVM team. Some comments and questions below:

On Sat, Feb 16, 2013 at 02:31:24AM +0000, Shixiong Shang (shshang) wrote:
> Hi, Perry and Karen:
> 
> I did some further investigation tonight. The VM instance was
> started with a lot of parameters, among which here is the one line
> related to the CPU model:
> 
> -cpu Nehalem,+rdtscp,+vmx,+ht,+ss,+acpi,+ds,+vme -enable-kvm
> 
> 
> Based on the qemu-kvm command and the cpu_map.xml file, Nehalem and
> all of the flags are supported. However, when I tried to perform a CPU
> check, KVM crashed again. The backtrace is identical to the ones I saw
> in the failed VM instance log:

The "check" parameter asks QEMU to print warnings if some CPU features
are not supported by the host CPU, but QEMU will start the guest
normally after that. So, if you got to the "VNC server running" stage,
it means all CPU features from the QEMU "Nehalem" CPU model should be
supported by your host CPU + kernel, and the crash happened while the
guest was already running, not during the CPU feature check.

> 
> 
> [root@as-cmp1 libvirt]# /usr/libexec/qemu-kvm -cpu Nehalem,check

I am assuming you used just the above command with no extra parameters
(meaning you don't even need a disk image to reproduce the bug), right?

> VNC server running on `::1:5900'
> KVM internal error. Suberror: 2

How long does the error message take to appear, after starting qemu-kvm?

> extra data[0]: 80000003
> extra data[1]: 80000603

The data above is weird: the CPU is reporting that it was trying to
deliver an int3 (but with the interrupt type bits set to "external
interrupt", which doesn't make sense), and got another int3 interrupt
generated when trying to deliver it.

It doesn't look right (the codes don't seem to make sense), and even if
it was right, simply running qemu-kvm with no arguments shouldn't end up
generating int3 interrupts at all.
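
To spell out the decoding above, here is a minimal Python sketch. It
assumes the two "extra data" words follow the VMX
interruption-information layout described in the Intel SDM (bit 31
valid, bit 11 error code valid, bits 10:8 event type, bits 7:0 vector).
That layout is my assumption about what KVM reports here, and the names
EVENT_TYPES and decode_event_info are made up for illustration.

```python
# Assumed layout: VMX interruption-information format (Intel SDM).
# This is an illustration, not code taken from KVM or QEMU.
EVENT_TYPES = {
    0: "external interrupt",
    1: "reserved",
    2: "NMI",
    3: "hardware exception",
    4: "software interrupt",
    5: "privileged software exception",
    6: "software exception",
    7: "other/reserved",
}


def decode_event_info(value):
    """Split a 32-bit interruption-information word into its fields."""
    return {
        "valid": bool((value >> 31) & 1),
        "vector": value & 0xFF,
        "type": EVENT_TYPES[(value >> 8) & 0x7],
        "error_code_valid": bool((value >> 11) & 1),
    }


# The two words from the error report.
for name, raw in (("extra data[0]", 0x80000003),
                  ("extra data[1]", 0x80000603)):
    print(name, hex(raw), decode_event_info(raw))
```

Under that assumption, the first word decodes as vector 3 with type
"external interrupt" and the second as vector 3 with type "software
exception".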

I would test this on other machines, to make sure this is really not a
hardware defect. Could you send the contents of /proc/cpuinfo? If you
are able to install the x86info package, the output of 'x86info -v -a'
would be useful, too.

> rax 00000000000003c3 rbx 00000000000008f2 rcx 000000000000013f rdx 000000000000ffdf
> rsi 0000000000000006 rdi 000000000000c993 rsp 00000000000003aa rbp 000000000000f000
> r8  0000000000000000 r9  0000000000000000 r10 0000000000000000 r11 0000000000000000
> r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000
> rip 00000000000010e2 rflags 00000286

Interesting, RIP is different from your previous report. Does the value
change if you run "/usr/libexec/qemu-kvm -cpu Nehalem,check" again?
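
If you save the output of a few runs, something like the sketch below
can pull out the RIP value from each so they are easy to compare. It
assumes the register dump keeps the "rip <hex> rflags <hex>" form shown
above; the helper name extract_rip is hypothetical.

```python
import re

# Matches the "rip <hex>" field of the register dump printed by qemu-kvm.
RIP_RE = re.compile(r"\brip\s+([0-9a-fA-F]+)")


def extract_rip(dump_text):
    """Return RIP as an int, or None if the dump has no rip field."""
    match = RIP_RE.search(dump_text)
    return int(match.group(1), 16) if match else None


# Line copied from the report in this thread.
sample = "rip 00000000000010e2 rflags 00000286"
print(hex(extract_rip(sample)))
```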

> cs c000 (000c0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
> ds c000 (000c0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
> es f000 (000f0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
> ss 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
> fs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
> gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
> tr 0000 (feffd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0)
> ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
> gdt fc558/37
> idt 0/3ff
> cr0 10 cr2 0 cr3 0 cr4 0 cr8 0 efer 0
> 
> FYI, I am using this qemu-kvm version:
> qemu-kvm-0.12.1.2-2.335.el6.x86_64

Thanks. What are the versions of the kernel, seabios, vgabios, and gpxe
packages?

> 
> 
> The potential workaround is to use a generic CPU model, such as KVM64,
> with a performance penalty. I will give it a try and keep you posted.
> In the meantime, if you can think of anything else, please let me know
> at your earliest convenience.

If other CPU models work, it may simply indicate that some feature bit
enabled by the Nehalem CPU model may be triggering the problem.

If that's the case, one way to find out which feature is causing the
problem is to try:

$ /usr/libexec/qemu-kvm -cpu qemu64,+sse2,+sse,+fxsr,+mmx,+clflush,+pse36,+pat,+cmov,+mca,+pge,+mtrr,+sep,+apic,+cx8,+mce,+pae,+msr,+tsc,+pse,+de,+fpu,+popcnt,+x2apic,+sse4.2,+sse4.1,+cx16,+ssse3,+sse3,+i64,+syscall,+xd,+lahf_lm,model=26

I expect the bug to be reproduced easily using the above command-line.
After that, you can gradually remove features from the command-line,
until we find which one is triggering the problem.
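
To save some typing, here is a rough Python sketch that prints one
command line per feature, each with that single feature left out. It
only builds the strings and does not run qemu-kvm. The feature list is
copied from the command above; the helper names (cpu_arg,
leave_one_out) are hypothetical, and the binary path is the
/usr/libexec/qemu-kvm one used earlier in this thread.

```python
QEMU = "/usr/libexec/qemu-kvm"  # path as used earlier in this thread
BASE_MODEL = "qemu64"
EXTRA = "model=26"
FEATURES = [
    "sse2", "sse", "fxsr", "mmx", "clflush", "pse36", "pat", "cmov",
    "mca", "pge", "mtrr", "sep", "apic", "cx8", "mce", "pae", "msr",
    "tsc", "pse", "de", "fpu", "popcnt", "x2apic", "sse4.2", "sse4.1",
    "cx16", "ssse3", "sse3", "i64", "syscall", "xd", "lahf_lm",
]


def cpu_arg(features):
    """Build the value for -cpu from a list of feature names."""
    parts = [BASE_MODEL] + ["+" + name for name in features] + [EXTRA]
    return ",".join(parts)


def leave_one_out(features):
    """Yield (dropped_feature, cpu_arg) with one feature removed each time."""
    for index, name in enumerate(features):
        yield name, cpu_arg(features[:index] + features[index + 1:])


if __name__ == "__main__":
    for dropped, arg in leave_one_out(FEATURES):
        print(f"# without {dropped}")
        print(f"{QEMU} -cpu {arg}")
```

If the crash disappears for exactly one of the printed commands, the
feature named in the comment above it is the likely trigger. If several
features are involved, removing them cumulatively (as described above)
is still needed.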

> 
> Thanks for your help!
> 
> Shixiong
> 
> 

-- 
Eduardo