[vfio-users] Best pinning strategy for latency / performance trade-off

Thomas Lindroth thomas.lindroth at gmail.com
Wed Feb 1 15:46:30 UTC 2017


A while ago there was a conversation on the #vfio-users irc channel about how to
use cpuset/pinning to get the best latency and performance. I said I would run
some tests and eventually did. Writing up the result took a lot of time and
there are some more test I want to run to verify the results but don't have time
to do that now. I'll just post what I've concluded instead. First some theory.

Latency in a virtual environment has many different causes:
1. There is latency in the hardware/BIOS, like system management interrupts.
2. The host operating system introduces some latency. This is often because the
   host won't schedule the VM when it wants to run.
3. The emulator adds some latency because of things like nested page tables and
   handling of virtual hardware.
4. The guest OS introduces its own latency when the workload wants to run but
   the guest scheduler won't schedule it.

Points 1 and 4 are latencies you get even on bare metal, but points 2 and 3 are
extra latency caused by the virtualisation. This post is mostly about reducing
the latency of point 2.

I assume you are already familiar with how this is usually done: by using cpuset
you can reserve some cores for exclusive use by the VM and put all system
processes on a separate housekeeping core. This allows the VM to run whenever it
wants, which is good for latency, but the downside is that the VM can't use the
housekeeping core, so performance is reduced.
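
As a rough sketch of how that partition can be set up (the exact commands aren't
part of the original discussion, and the binary name qemu-system-x86_64 is just
an assumption), the cset tool from the cpuset package can do it; the core list
below matches the low latency setup further down:

# reserve lcores 1-3,5-7 for the VM; system processes and movable kthreads
# are herded onto pcore 0 (lcores 0 and 4)
cset shield --cpu 1-3,5-7 --kthread on
# move an already running QEMU process into the shielded set
cset shield --shield --pid $(pidof qemu-system-x86_64)

Libvirt can also manage this through its own cgroup integration; the shield
above is just one way to keep host processes off the VM cores.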

By running pstree -p while the VM is running you get output like this:
...
─qemu-system-x86(4995)─┬─{CPU 0/KVM}(5004)
                       ├─{CPU 1/KVM}(5005)
                       ├─{CPU 2/KVM}(5006)
                       ├─{CPU 3/KVM}(5007)
                       ├─{CPU 4/KVM}(5008)
                       ├─{CPU 5/KVM}(5009)
                       ├─{qemu-system-x86}(4996)
                       ├─{qemu-system-x86}(5012)
                       ├─{qemu-system-x86}(5013)
                       ├─{worker}(5765)
                       └─{worker}(5766)

QEMU spawns a number of threads for different purposes. The "CPU #/KVM" threads
run the actual guest code and there is one for each virtual CPU; I call them
"VM threads" from here on. The qemu-system-x86 threads are used to emulate
virtual hardware and are called the emulator in libvirt terminology; I call them
"emulator threads". The worker threads are probably what libvirt calls
iothreads, but I treat them the same as the emulator threads and refer to both
as "emulator threads".

My CPU is an i7-4790K with 4 hyper-threaded cores for a total of 8 logical
cores. A lot of people here probably have something similar. Take a look in
/proc/cpuinfo to see how it's laid out. I number my cores like cpuinfo: I have
physical cores 0-3 and logical cores 0-7, where pcore 0 corresponds to lcores
0,4, pcore 1 to lcores 1,5, and so on.
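
A quick way to check the sibling layout without reading through /proc/cpuinfo
(assuming a reasonably recent util-linux) is:

lscpu --extended=CPU,CORE

which for the layout above shows lcores 0 and 4 on core 0, lcores 1 and 5 on
core 1, and so on.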

The goal is to partition the system processes, VM threads and emulator threads
across these 8 lcores to get good latency and acceptable performance, but to do
that I need a way to measure latency. Mainline kernel 4.9 gained a new latency
tracer called hwlat. It's designed to measure hardware latencies like SMIs, but
if you run it in a VM you get all latencies below the guest (points 1-3 above).
Hwlat bypasses the normal CPU scheduler, so it won't measure any latency from
the guest scheduler (point 4). It basically makes it possible to focus on just
the VM-related latencies.
https://lwn.net/Articles/703129/
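
For reference, the tracer is driven through the tracing filesystem. Run inside
the guest, a minimal session looks roughly like this (the mount point may be
/sys/kernel/tracing or /sys/kernel/debug/tracing depending on the distro):

cd /sys/kernel/debug/tracing
echo hwlat > current_tracer
echo 1 > tracing_on
# ... let it run for the duration of the test, then:
cat tracing_max_latency   # worst gap seen so far, in microseconds
cat trace                 # individual samples above the threshold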

We should perhaps also discuss how much latency is too much. That's up for
debate, but the Windows DPC latency checker lists 500us as green, 1000us as
yellow and 2000us as red. If a game runs at 60fps it has a deadline of 16.7ms to
render a frame. I'll just decide that 1ms (1000us) is the upper limit for what I
can tolerate.

One consequence of how hwlat works is that it also misses a lot of the point 3
latencies. Most of the latency in point 3 is caused by vm-exits: the guest does
something the hardware virtualisation can't handle and has to rely on KVM or
QEMU to emulate the behaviour. This is a lot slower than real hardware, but it
mostly only happens when the guest tries to access hardware resources, so I'll
call it IO latency. The hwlat tracer just sits and spins in kernel space and
never touches any hardware by itself. Since hwlat doesn't trigger vm-exits it
can't measure latencies caused by them either, so it would be good to have
something else that could. The way I rigged things up is to set the virtual disk
controller to AHCI, which I know has to be emulated by QEMU. I then added a RAM
block device from /dev/ram* to the VM as a virtual disk. Running the fio disk
benchmark in the VM on that disk triggers vm-exits and gets me a latency report
from fio. It's not a good solution, but it's the best I could come up with.
http://freecode.com/projects/fio
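
The exact fio invocation isn't given here, but something along these lines
exercises the emulated AHCI path and reports per-IO latency (the device name
and sizes are illustrative, not taken from the original setup):

# on the host: create a RAM block device to hand to the VM as a disk
modprobe brd rd_nr=1 rd_size=1048576     # one 1 GiB /dev/ram0

# in the guest: random 4k reads at queue depth 1 with direct IO
fio --name=vmexit-lat --filename=/dev/sdb --direct=1 --rw=randread \
    --bs=4k --ioengine=libaio --iodepth=1 --runtime=600 --time_based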

=== Low latency setup ===

Let's finally get down to business. The first setup I tried is configured for
minimum latency at the expense of performance.

The virtual CPU in this setup has 3 cores and no HT. The VM threads are pinned
to lcores 1,2,3 and the emulator threads to lcores 5,6,7. That leaves pcore 0,
which is dedicated to the host using cpuset.

Here is the layout in libvirt XML:
<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='2'/>
<vcpupin vcpu='2' cpuset='3'/>
<emulatorpin cpuset='5-7'/>
<topology sockets='1' cores='3' threads='1'/>
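
For orientation, in the full domain XML these lines sit inside <cputune> and
<cpu> respectively, roughly like this (the host-passthrough mode is just an
example, not something specified above):

<vcpu placement='static'>3</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='2'/>
  <vcpupin vcpu='2' cpuset='3'/>
  <emulatorpin cpuset='5-7'/>
</cputune>
<cpu mode='host-passthrough'>
  <topology sockets='1' cores='3' threads='1'/>
</cpu>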

And here are the results from hwlat (all hwlat tests ran for 30 min each). I
used a synthetic load to test how the latencies changed under load, using the
program stress on both guest and host
(stress --vm 1 --io 1 --cpu 8 --hdd 1).

                         mean     stdev    max(us)
host idle, VM idle:   17.2778   15.6788     70
host load, VM idle:   21.4856   20.1409     72
host idle, VM load:   19.7144   18.9321    103
host load, VM load:   21.8189   21.2839    139

As you can see, load on the host makes little difference to the latency; the
cpuset isolation works well. The slight increase of the mean might be caused by
reduced memory bandwidth. Putting the VM under load increases the latency a bit.
This might seem odd, since the idea of using hwlat was to bypass the guest
scheduler, thereby making the latency independent of what is running in the
guest. What is probably happening is that the "--hdd" part of stress accesses
the disk and this makes the emulator threads run. They are pinned to the HT
siblings of the VM threads and thereby slightly impact their latency. Overall
the latency is very good in this setup.

fio (us) min=40, max=1306, avg=52.81, stdev=12.60 iops=18454
Here is the result of the IO latency test with fio. Since the emulator threads
run mostly isolated on their own HT siblings, this result must be considered
good.

=== Low latency setup, with realtime ===

In an older post to the mailing list I said "The NO_HZ_FULL scheduler mode only
works if a single process wants to run on a core. When the VM thread runs as
realtime priority it can starve the kernel threads for long period of time and
the scheduler will turn off NO_HZ_FULL when that happens since several processes
wants to run. To get the full advantage of NO_HZ_FULL don't use realtime
priority."

Let's see how much impact this really has. The idea behind realtime priority is
to always give your preferred workload priority over unimportant workloads, but
to make any difference there has to be an unimportant workload to preempt.
Cpuset is a great way to move unimportant processes to a housekeeping CPU, but
unfortunately the kernel has some pesky kthreads that refuse to migrate. By
using realtime priority on the VM threads I should be able to out-preempt those
kernel threads and get lower latency. In this test I used the same setup as
above, but used schedtool to set round-robin priority 1 on all VM-related
threads.
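
A minimal sketch of how that can be done from the shell, assuming the binary is
called qemu-system-x86_64 (adjust the name to your system); it sets SCHED_RR
priority 1 on every thread of the QEMU process:

for tid in /proc/$(pidof qemu-system-x86_64)/task/*; do
    schedtool -R -p 1 "${tid##*/}"   # -R = round-robin, -p = priority
done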

                         mean     stdev    max(us)
host idle, VM idle:   17.6511   15.3028     61
host load, VM idle:   20.2400   19.6558     57
host idle, VM load:   18.9244   18.8119    108
host load, VM load:   20.4228   21.0749    122

The result is mostly the same. Those few remaining kthreads that I can't disable
or migrate apparently don't make much difference to the latency.

=== Balanced setup, emulator with VM threads ===

3 cores isn't a lot these days and some games, like Mad Max and Rise of the Tomb
Raider, max out the CPU in the low latency setup, which results in big frame
drops. The setup below with a virtual 2-core HT CPU would probably give OK
latency, but hyper-threading usually only gives 25-50% extra performance for
real-world workloads, so this setup would generally be slower than the low
latency setup. I didn't bother to test it.
<vcpupin vcpu='0' cpuset='2'/>
<vcpupin vcpu='1' cpuset='6'/>
<vcpupin vcpu='2' cpuset='3'/>
<vcpupin vcpu='3' cpuset='7'/>
<emulatorpin cpuset='1,5'/>
<topology sockets='1' cores='2' threads='2'/>

To get better performance I need at least a virtual 3-core HT CPU, but if the
host uses pcore 0 and the VM threads use pcores 1-3, where will the emulator
threads run? I could overallocate the system by having the emulator threads
compete with the VM threads, or by having the emulator threads compete with the
host processes. Let's try running the emulator with the VM threads first.

<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='6'/>
<vcpupin vcpu='4' cpuset='3'/>
<vcpupin vcpu='5' cpuset='7'/>
<emulatorpin cpuset='1-3,5-7'/>
<topology sockets='1' cores='3' threads='2'/>

The odd ordering of vcpupin is needed because Intel CPUs lay out HT siblings as
lcore[01234567] = pcore[01230123], while QEMU lays out the virtual CPU as
lcore[012345] = pcore[001122]. To get a 1:1 mapping I have to order them like
that.
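
If you are unsure of your host's sibling layout, one quick way to check it is:

cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list

which for the layout described above prints the pairs 0,4 / 1,5 / 2,6 / 3,7,
one line per lcore (so each pair shows up twice).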

                         mean     stdev    max(us)
host idle, VM idle:   17.4906   15.1180     89
host load, VM idle:   22.7317   19.5327     95
host idle, VM load:   82.3694  329.6875   9458
host load, VM load:  141.2461 1170.5207  20757

The result is really bad. It works OK as long as the VM is idle, but as soon as
it's under load I get bad latencies. The reason is likely that the stressor
accesses the disk, which activates the emulator, and in this setup the emulator
can preempt the VM threads. We can check if this is the case by running stress
without "--hdd".

                                       mean     stdev    max(us)
host load, VM load(but no --hdd):   57.4728  138.8211   1345

The latency is reduced quite a bit but it's still high. It's likely still the
emulator threads preempting the VM threads; accessing the disk is just one of
many things the VM can do that activates the emulator.

fio (us) min=41, max=7348, avg=62.17, stdev=14.99 iops=15715
IO latency is also a lot worse compared to the low latency setup. The reason is
that the VM threads can preempt the emulator threads while they are emulating
the disk drive.

=== Balanced setup, emulator with host ===

Pairing up the emulator threads and VM threads was a bad idea so lets try
running the emulator on the core reserved for the host. Since the VM threads run
by themselves in this setup we would expect to get good hwlat latency but the
emulator threads can be preempted by host processes so io latency might suffer.
Lets start by looking at the io latency.

fio (us) min=40, max=46852, avg=61.55, stdev=250.90 iops=15893

Yup, massive IO latency. Here is a situation where realtime priority could help:
if the emulator threads get realtime priority they can out-preempt the host
processes. Let's try that.
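
A sketch of doing that selectively, assuming the binary name from before and
that the vCPU threads keep the "CPU #/KVM" names shown in the pstree output
above; only the non-vCPU threads get realtime priority:

for tid in /proc/$(pidof qemu-system-x86_64)/task/*; do
    grep -q '^CPU .*KVM' "$tid/comm" && continue   # skip the VM threads
    schedtool -R -p 1 "${tid##*/}"                 # boost emulator/worker threads
done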

fio (us) min=38, max=2640, avg=53.72, stdev=13.61  iops=18140

That's better, but it's not as good as the low latency setup where the emulator
threads got their own lcores. To reduce the latency even more we could try to
split pcore 0 in two and run the host processes on lcore 0 and the emulator
threads on lcore 4, but this doesn't leave much CPU for the emulator (or the
host).
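
In libvirt terms that split is just a change to the emulator pinning, roughly
(the vcpupin lines from the balanced setup stay the same):

<emulatorpin cpuset='4'/>

with the host processes confined to lcore 0 by the same cpuset mechanism as
before.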

fio (us) min=44, max=1192, avg=56.07, stdev=8.52 iops=17377

The max IO latency now decreased to the same level as in the low latency setup.
Unfortunately the number of iops also decreased a bit (down 5.8% compared to the
low latency setup). I'm guessing this is because the emulator threads don't get
as much CPU time in this setup.

                         mean     stdev    max(us)
host idle, VM idle:   18.3933   15.5901    106
host load, VM idle:   20.2006   18.8932     77
host idle, VM load:   23.1694   22.4301    110
host load, VM load:   23.2572   23.7288    120

Hwlat latency is comparable to the low latency setup, so this setup gives a good
latency / performance trade-off.

=== Max performance setup ===

If 3 cores with HT isn't enough I suggest you give up, but for comparison let's
see what happens if we mirror the host CPU in the VM. Now there is no room at
all for the emulator or the host processes, so I let them schedule freely.
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='4'/>
<vcpupin vcpu='2' cpuset='1'/>
<vcpupin vcpu='3' cpuset='5'/>
<vcpupin vcpu='4' cpuset='2'/>
<vcpupin vcpu='5' cpuset='6'/>
<vcpupin vcpu='6' cpuset='3'/>
<vcpupin vcpu='7' cpuset='7'/>
<emulatorpin cpuset='0-7'/>
<topology sockets='1' cores='4' threads='2'/>

                           mean      stdev    max(us)
host idle, VM idle:    185.4200   839.7908   6311
host load, VM idle:   3835.9333  7836.5902  97234
host idle, VM load:   1891.4300  3873.9165  31015
host load, VM load:   8459.2550  6437.6621  51665

fio (us) min=48, max=112484, avg=90.41, stdev=355.10 iops=10845

I only ran these tests for 10 min each; that's all that was needed. As you can
see, it's terrible. I'm afraid that many people probably run a setup similar to
this. I ran like this myself for a while, until I switched to libvirt and
started looking into pinning. Realtime priority would probably help a lot here,
but realtime in this configuration is potentially dangerous: workloads in the
guest could starve the host, and depending on how the guest gets its input, a
reset using the hardware reset button could be needed to get the system back.

=== Testing with games ===

I want low latency for gaming, so it would make sense to test the setups with
games. This turns out to be kind of tricky: games are complicated and
interpreting the results can be hard. As an example, https://i.imgur.com/NIrXnkt.png
is a percentile plot of the frametimes in the built-in benchmark of Rise of the
Tomb Raider, captured with Fraps. The performance and balanced setups look about
the same at lower percentiles, but the low latency setup is a lot lower. This
means that the low latency setup, which is the weakest in terms of CPU power,
got a higher frame rate for some parts of the benchmark. That doesn't make sense
at first. It only starts to make sense if I pay attention to the benchmark while
it's running: Rise of the Tomb Raider loads a lot of geometry dynamically and
the low latency setup can't keep up. It has bad pop-in of textures and objects,
so the scene the GPU renders is less complicated than in the other setups, and a
less complicated scene results in a higher frame rate. An odd, counter-intuitive
result.

Overall the performance and balanced setups have the same percentile curve at
lower percentiles in every game I tested. This tells me that the balanced setup
has enough CPU power for all the games I've tried. They only differ at higher
percentiles, due to latency-induced frame drops. The performance setup always
has the worst max frametime in every game, so there is no reason to use it over
the balanced setup. The performance setup also has crackling sound over HDMI
audio in several games, even with MSI enabled. Which setup has the lowest max
frametime depends on the workload: if the game maxes out the CPU of the low
latency setup, its max frametime will be worse than the balanced setup's; if
not, the low latency setup has the best latency.

=== Conclusion ===

The balanced setup (emulator with host) doesn't have the best latency in every
workload, but I haven't found any workload where it performs poorly in terms of
max latency, IO latency or available CPU power. Even in those workloads where
another setup performed better, the balanced setup was always close. If you are
too lazy to switch setups depending on the workload, use the balanced setup as
the default configuration. If your CPU isn't a 4-core with HT, finding the best
setup for your CPU is left as an exercise for the reader.

=== Future work ===

https://vfio.blogspot.se/2016/10/how-to-improve-performance-in-windows-7.html
This was a nice trick for forcing Win7 to use the TSC. Just one problem: it
turns out it doesn't work if hyper-threading is enabled. Any time I use a
virtual CPU with threads='2', Win7 reverts to using the acpi_pm timer. I've
spent a lot of time trying to work around the problem but failed. I don't even
know why hyper-threading would make a difference for the TSC; Microsoft's
documentation is amazingly unhelpful. Even when the guest is hammering the
acpi_pm timer the balanced setup gives better performance than the low latency
setup, but I'm afraid the reduced resolution and extra indeterminism of the
acpi_pm timer might cause other problems. This is only a problem on Win7 because
modern versions of Windows should use hypervclock. I've read somewhere that it
might be possible to modify OVMF to work around the bug in Win7 that prevents
hyperv from working; with that modification it might be possible to use
hypervclock in Win7. Perhaps I'll look into that in the future. In the meantime
I'll stick with the balanced setup despite the use of acpi_pm.
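
For reference, the hypervclock that newer Windows guests use is exposed through
the standard libvirt clock configuration, roughly like this (the offset value is
just an example):

<clock offset='localtime'>
  <timer name='hypervclock' present='yes'/>
</clock>

together with the <hyperv> feature flags under <features>. Win7 can't use this
because of the bug mentioned above.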



