[libvirt] cpu affinity, isolcpus and cgroups

Henning Schild henning.schild at siemens.com
Wed Oct 14 12:42:58 UTC 2015


On Thu, 2 Jul 2015 17:27:21 +0100
"Daniel P. Berrange" <berrange at redhat.com> wrote:

> On Thu, Jul 02, 2015 at 04:42:47PM +0200, Henning Schild wrote:
> > On Thu, 2 Jul 2015 15:18:46 +0100
> > "Daniel P. Berrange" <berrange at redhat.com> wrote:
> > 
> > > On Thu, Jul 02, 2015 at 04:02:58PM +0200, Henning Schild wrote:
> > > > Hi,
> > > > 
> > > > I am currently looking into realtime VMs using libvirt. My first
> > > > starting point was reserving a couple of cores using isolcpus
> > > > and later tuning the affinity to place my vcpus on the reserved
> > > > pcpus.
> > > > 
> > > > My first observation was that libvirt ignores isolcpus. Affinity
> > > > masks of new qemus will default to all cpus and will not be
> > > > inherited from libvirtd. A comment in the code suggests that
> > > > this is done on purpose.
> > > 
> > > Ignore realtime + isolcpus for a minute. It is not unreasonable
> > > for the system admin to decide system services should be
> > > restricted to run on a certain subset of CPUs. If we let VMs
> > > inherit the CPU pinning from libvirtd, we'd be accidentally
> > > confining VMs to a subset of CPUs too. With the new cgroups
> > > layout, libvirtd lives in a separate cgroups tree, /system.slice,
> > > while VMs live in /machine.slice. So for both these reasons, when
> > > starting VMs, we explicitly ignore any affinity libvirtd has and
> > > set the VMs' mask to allow any CPU.

Since I started making heavy use of realtime priorities on 100% busy
threads, I have been running into starvation problems.
I just found a stuck qemu that still had an affinity mask of all 'f's
and no high priority yet. It got unlucky and ended up in the scheduling
queue of one of my busy cores ... that qemu never came to life.
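
(For reference, this is a minimal sketch of that kind of check with
sched_getaffinity(); the pid handling is illustrative only, it is not
what libvirt itself does.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Print the cpus a given pid may run on, roughly what
 * "taskset -p <pid>" reports as the all-'f's mask. */
int main(int argc, char **argv)
{
    cpu_set_t set;
    pid_t pid;
    int cpu;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid = atoi(argv[1]);

    CPU_ZERO(&set);
    if (sched_getaffinity(pid, sizeof(set), &set) < 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("pid %d may run on:", (int)pid);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}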

I do not remember the details of the last time we discussed the topic;
the take-away was that libvirt itself does not do policy. The policy
(affinity and priority) comes from nova, but there should be no window
in which the qemu is already running while the policy has not yet been
applied. That can cause starvation and disturb realtime workloads.
To me it seems there is such a time window. If there is, I need a way
to limit such new-born hypervisors to a cpuset; ideally they would just
inherit it from libvirtd ... isolcpus.
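
(To illustrate what I mean by inheriting: the affinity mask survives
fork()/exec(), so whoever spawns the qemu could apply the restricted
set before exec and the new-born process would never touch the
isolated cores. A minimal sketch, with cpus 0-1 as a made-up
housekeeping set and qemu-system-x86_64 only as a placeholder:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t housekeeping;
    pid_t pid;

    CPU_ZERO(&housekeeping);
    CPU_SET(0, &housekeeping);          /* made-up non-isolated cores */
    CPU_SET(1, &housekeeping);

    pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* child: restrict before exec, the exec'd binary inherits it */
        if (sched_setaffinity(0, sizeof(housekeeping), &housekeeping) < 0)
            perror("sched_setaffinity");
        execlp("qemu-system-x86_64", "qemu-system-x86_64", "-version",
               (char *)NULL);
        perror("execlp");
        _exit(1);
    }
    return 0;
}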

> > Sure, that was my first guess as well. Still I wanted to raise the
> > topic again from the realtime POV.
> > I am using a pretty recent libvirt from git but did not come across
> > the system.slice yet. Might be a matter of configuration/invocation
> > of libvirtd.
> 
> Oh, I should mention that I'm referring to OS that use systemd
> for their init system here, not legacy sysvinit
> 
> FWIW our cgroups layout is described here
> 
>   http://libvirt.org/cgroups.html

The system.slice does not contain a libvirtd.service in my case, but my
libvirtd is running inside a screen session and was not started via
systemd. Might that be causing the problem?
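
(A quick way to check is to look at /proc/<pid>/cgroup for libvirtd and
for one of the qemus; a minimal sketch, equivalent to just cat'ing the
file:)

#include <stdio.h>

/* Dump /proc/<pid>/cgroup to see which slice/scope a process
 * (libvirtd, or one of the qemus) ended up in. */
int main(int argc, char **argv)
{
    char path[64], line[512];
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/cgroup", argv[1]);

    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}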

> > 
> > > > After that I changed the code to use only the available cpus by
> > > > default. But taskset was still showing all 'f's on my qemus.
> > > > Then I traced my change down to sched_setaffinity, assuming that
> > > > some other mechanism might have reverted my hack, but it is
> > > > still in place.
> > > 
> > > From the libvirt POV, we can't tell whether the admin set isolcpus
> > > because they want to reserve those CPUs only for VMs, or because
> > > they want to stop VMs using those CPUs by default. As such,
> > > libvirt does not try to interpret isolcpus at all; it leaves it up
> > > to a higher-level app to decide on this policy.
> > 
> > I know, you have to tell libvirt that the reservation is actually
> > for libvirt. My idea was to introduce a config option in libvirt
> > and maybe sanity check it by looking at whether the pcpus are
> > actually reserved. Rik recently posted a patch to allow easy
> > programmatic checking of isolcpus via sysfs.
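
(A minimal sketch of such a check, assuming the file ends up as
/sys/devices/system/cpu/isolated in cpulist format; the exact path is
an assumption on my side:)

#include <stdio.h>

/* Read the isolcpus list from sysfs; path and format are assumed
 * to be what Rik's patch exposes (a cpulist like "2-5,8"). */
int main(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/system/cpu/isolated", "r");

    if (!f) {
        perror("/sys/devices/system/cpu/isolated");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("isolated cpus: %s", buf);
    fclose(f);
    return 0;
}
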
> 
> In libvirt we try to have a general principle that libvirt will
> provide the mechanism but not implement usage policy. So if we
> follow a strict interpretation here, then applying a CPU mask
> based on isolcpus would be out of scope for libvirt, since we
> expose a sufficiently flexible mechanism to implement any
> desired policy at a higher level.
> 
> > > In the case of OpenStack, the /etc/nova/nova.conf allows a config
> > > setting 'vcpu_pin_set' to say what set of CPUs VMs should be
> > > allowed to run on, and nova will then update the libvirt XML when
> > > starting each guest.
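
(For reference, the same pinning mechanism is also reachable at runtime
through the libvirt API rather than the guest XML; a minimal sketch,
with the guest name and pcpu numbers made up:)

#include <stdio.h>
#include <string.h>
#include <libvirt/libvirt.h>

/* Pin vcpu 0 of a running guest to pcpus 2 and 3 -- the runtime
 * counterpart of the <cputune>/<vcpupin> XML a management app
 * would write. Guest name and cpu numbers are made up. */
int main(void)
{
    virConnectPtr conn;
    virDomainPtr dom;
    unsigned char cpumap[VIR_CPU_MAPLEN(4)];
    int maplen = VIR_CPU_MAPLEN(4);

    conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    dom = virDomainLookupByName(conn, "rt-guest");
    if (!dom) {
        virConnectClose(conn);
        return 1;
    }

    memset(cpumap, 0, sizeof(cpumap));
    VIR_USE_CPU(cpumap, 2);
    VIR_USE_CPU(cpumap, 3);

    if (virDomainPinVcpu(dom, 0, cpumap, maplen) < 0)
        fprintf(stderr, "pinning vcpu 0 failed\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
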
> > 
> > I see, would it not still make sense to have that setting centrally
> > in libvirt? I am thinking about people not using nova but virsh or
> > virt-manager.
> 
> virsh aims to be a completely plain passthrough where the user is
> in total control of their setup. To a large extent that is true
> of virt-manager too. So I'd tend to expect users of both those
> apps to manually configure the CPU affinity of their VMs as and
> when they use isolcpus.
> 
> Where we'd put in policies around isolcpus would be in the apps
> like OpenStack and RHEV/oVirt which define specific usage policies
> for the system as a whole.
> 
> Regards,
> Daniel



