[vfio-users] cpu core pinning with multiple cpus

Alex Williamson alex.williamson at redhat.com
Sat Sep 12 00:19:31 UTC 2015


On Sat, 2015-09-12 at 00:23 +0200, Erik Adler wrote:
> Certain games are giving me terrible frame rates on my GTX 970.
> Generally these games are not very demanding when using bare metal.
> Unigine Valley Benchmark is doing fine at about 87% native speeds.
> Same with some other GPU intensive benchmarks. The games that have bad
> fps seem to be taxing the cpu heavily and having latency issues in
> passthrough.
> 
> I am not sure that I have paired my cores correctly. Using Alex's CPU
> latency script I get the following. Since this is a dual CPU system I
> need to keep everything on the same NUMA node.
> There is a definite pattern but I am not 100% sure that I see the
> correlation with lstopo.

Looks pretty much like it's supposed to afaict.  Take the top row for
example, CPU0 has the best latency to CPUs 0-5 and 12-17.  These are
thread0 and thread1 of the cores on node0.  CPUs 6-11 and 18-23 are on
the remote socket, so they suffer a pretty big hit.  Move down to row 6,
CPU6 on node1 and we see that the latency has flipped, 6-11 and 18-23
are now closer.


>   |  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> --+------------------------------------------------------------------------
>  0| 10  8  7  8  7  8  4  4  4  4  4  4  6  7  7  7  7  7  4  4  4  4  4  4
>  1|  9 10  8  8  8  8  4  4  4  4  4  4  7  7  8  8  8  8  4  4  4  4  4  4
>  2|  8  8 10  8  8  7  4  4  4  4  5  4  7  8  7  8  8  8  4  4  4  4  4  4
>  3|  8  8  8 10  8  8  4  5  4  4  4  4  8  7  8  7  8  8  4  4  4  4  4  4
>  4|  8  8  7  8 10  7  4  4  4  4  4  4  7  8  8  8  7  8  3  4  4  4  4  4
>  5|  8  8  8  8  8 10  4  4  4  4  4  4  7  8  8  7  8  7  4  4  4  4  4  4
>  6|  4  4  4  4  4  4 10  6  6  6  6  6  4  4  3  4  4  4  5  7  6  6  6  6
>  7|  4  4  4  5  4  4  5 10  7  7  7  7  4  4  4  5  5  4  6  7  7  7  7  7
>  8|  4  5  4  5  4  5  6  7 10  7  7  6  4  4  5  4  4  5  5  6  6  8  8  7
>  9|  5  5  5  4  4  4  6  7  8 10  8  8  5  4  5  5  5  4  6  8  7  7  8  8
> 10|  4  4  4  4  4  3  5  6  6  6 10  6  4  4  4  4  4  4  5  6  6  6  5  6
> 11|  3  3  4  3  4  4  5  5  6  6  6 10  3  3  4  4  4  4  5  6  6  6  6  5
> 12|  7  8  8  8  8  7  4  5  4  4  4  4 10  8  8  8  8  8  4  5  4  4  5  5
> 13|  8  7  8  7  7  7  3  4  4  4  4  4  8 10  8  8  7  7  4  4  4  4  4  4
> 14|  8  8  7  8  8  8  4  4  4  4  4  4  7  8 10  8  7  8  4  5  4  4  4  4
> 15|  8  7  8  6  8  8  4  4  4  4  4  4  7  7  8 10  8  7  4  5  4  4  4  4
> 16|  9  8  9  9  8  9  4  5  5  4  5  5  8  8  9  9 10  9  4  5  5  4  5  5
> 17|  8  7  8  8  8  7  4  4  4  4  4  4  7  8  8  8  7 10  4  4  4  4  4  4
> 18|  4  4  4  4  4  4  5  7  6  7  7  7  4  4  4  4  4  4 10  6  7  6  7  7
> 19|  5  5  4  4  4  4  6  6  7  7  7  6  4  4  5  4  4  4  6 10  8  7  7  7
> 20|  4  5  4  4  4  4  6  8  7  8  8  8  4  4  5  3  4  4  6  8 10  8  8  8
> 21|  4  4  4  4  4  4  6  5  6  5  6  6  4  4  4  4  4  4  5  6  6 10  7  6
> 22|  5  4  4  5  4  5  6  7  7  7  7  8  4  5  5  5  5  5  6  8  8  8 10  8
> 23|  4  4  4  4  4  4  6  7  7  6  7  6  4  4  4  4  4  4  5  7  6  7  7 10
> 
> https://i.imgur.com/PQvT2oR.png

Nice, this makes it even more clear.

> Looking at lstopo (url) I have hopefully mapped out my hardware
> correctly. In NUMA node “0” I can see my GTX 970 on PCI 10de:13c2.
> 
> http://i.imgur.com/GBczQvi.png

Yep, you definitely want to use 0-5 and 12-17.
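
For the memory side of keeping everything on one node, libvirt has a
<numatune> element.  A minimal sketch, assuming node 0 (where lstopo
shows the GTX 970) has enough free memory to back the whole 16 GiB
guest.  mode='strict' makes allocation fail rather than spill onto
node 1; mode='preferred' is the softer alternative.

```xml
<!-- Goes directly under <domain>, alongside <cputune>. -->
<!-- Assumption: node 0 is the node holding CPUs 0-5 and 12-17. -->
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
```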

> I am assuming that if I want to use HT on CPU1 my xml file should
> look like this? Have I done something wrong with how I have pinned
> cores?
> 
> <domain type='kvm'>
>   <name>Windows</name>
>   <uuid>cc52dc82-ce9a-45ff-99e6-a92ab0f42b59</uuid>
>   <memory unit='KiB'>16777216</memory>
>   <currentMemory unit='KiB'>16777216</currentMemory>
>   <vcpu placement='static'>8</vcpu>
>   <cputune>
>     <vcpupin vcpu='0' cpuset='2'/>
>     <vcpupin vcpu='1' cpuset='3'/>
>     <vcpupin vcpu='2' cpuset='4'/>
>     <vcpupin vcpu='3' cpuset='5'/>
>     <vcpupin vcpu='4' cpuset='14'/>
>     <vcpupin vcpu='5' cpuset='15'/>
>     <vcpupin vcpu='6' cpuset='16'/>
>     <vcpupin vcpu='7' cpuset='17'/>
>   </cputune>
>   <os>
>     <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
>     <loader type='rom'>/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
>   </os>
>   <features>
>     <acpi/>
>     <apic/>
>     <pae/>
>     <kvm>
>       <hidden state='on'/>
>     </kvm>
>     <vmport state='off'/>
>   </features>
>   <cpu mode='host-passthrough'>
>     <topology sockets='1' cores='4' threads='2'/>
>   </cpu>

Looks ok to me, but see my post from last week:

https://www.redhat.com/archives/vfio-users/2015-September/msg00041.html

You may have better latency without exposing threads to the guest,
leaving the other thread of each core either idle or reserved for
running the emulator.
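
A minimal sketch of that layout, reusing the cores from the XML above.
It assumes CPUs 14-17 are the thread siblings of CPUs 2-5, as the
latency table and lstopo suggest, and it drops the guest from 8 vCPUs
to 4.  Treat it as a starting point to test, not a known-good config.

```xml
<!-- These elements replace their counterparts inside <domain>. -->
<vcpu placement='static'>4</vcpu>
<cputune>
  <!-- One vCPU per physical core, first thread only. -->
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
  <vcpupin vcpu='2' cpuset='4'/>
  <vcpupin vcpu='3' cpuset='5'/>
  <!-- Assumed sibling threads, kept for the QEMU emulator threads. -->
  <emulatorpin cpuset='14-17'/>
</cputune>
<cpu mode='host-passthrough'>
  <!-- No hyperthreads exposed to the guest. -->
  <topology sockets='1' cores='4' threads='1'/>
</cpu>
```

To leave the sibling threads idle instead, point <emulatorpin> at other
node 0 CPUs that the guest does not use (0-1 or 12-13 here).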