[libvirt] [REPOST] regarding cgroup v2 support in libvirt

Thu Oct 27 14:02:16 UTC 2016

On Fri, Oct 21, 2016 at 02:24:27PM -0400, Tejun Heo wrote:
> Hello, Daniel.
> 
> On Fri, Oct 21, 2016 at 11:19:02AM +0100, Daniel P. Berrange wrote:
> > The big question I have around cgroup v2 is state of support for all
> > controllers that libvirt uses (cpu, cpuacct, cpuset, memory, devices,
> > freezer, blkio).  IIUC, not all of these have been ported to cgroup
> > v2 setup and the cpu port in particular was rejected by Linux maintainers.
> > Libvirt has a general policy that we won't support features that only
> > exist in out of tree patches (applies to kernel and any other software
> > we build against or use).
> 
> I see and that's understandable.  However, I think supporting resource
> control through systemd can be a good way of navigating the situation.
> The back and forward compatibility issues are handled by systemd
> allowing libvirt users to make use of what's available on the system
> without burdening libvirt with complications.

I don't think that's satisfactory - the risk is that the semantic
behaviour of what is finally merged in the kernel may be different
from the semantics of the cpu controller out of tree patches. This
could in turn cause behavioural differences for existing deployed
VMs.

> > IIRC from earlier discussions, the model for dealing with processes in
> > cgroup v2 was quite different. In libvirt we rely on the ability to
> > assign different threads within a process to different cgroups, because
> > we need to control CPU scheduler parameters on different threads in
> > QEMU. eg we have vCPU threads, I/O threads and general emulator threads
> > each of which get different policies.
> 
> How thread granularity will be handled in cgroup v2 is still
> contentious but I believe that we'll eventually have something.  I
> have always been curious about the QEMU thread control tho.  What
> prevents it from using the usual nice level adjustments?  Does it
> actually require hierarchical resource distribution?

nice level adjustments only apply to individual threads. In some
cases we can apply controls to individual threads, but in other
cases we need to apply controls to multiple threads as a group.

We currently have the following children under the main CPU controller
group for a VM:

  $maincgroup
    |
    +- vcpu0  - single thread for vCPU 0
    +- vcpu1  - single thread for vCPU 1
    ...
    +- vcpuN  - single thread for vCPU N
    +- iothread0 - multiple threads for device I/O thread group 0
    +- iothread1 - multiple threads for device I/O thread group 1
    ...
    +- iothreadN - multiple threads for device I/O thread group N
    +- emulator - multiple threads (main event loop, migration, file I/O threads)

Against the top level group we set the 'shares' tunable which gives
us a relative weighting of the entire VM against other VMs.

Against each of the child groups we set quota + period, so we have
absolute control over usage from different functional parts of
QEMU.

Setting per-thread nice levels can't replicate any of this
functionality afaict.
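
As a rough illustration of the layout and tunables described above, here
is a minimal Python sketch. It is not libvirt's actual code. It assumes
the cgroup v1 cpu controller file names (cpu.shares, cpu.cfs_period_us,
cpu.cfs_quota_us), and the function names, the "demo-vm" group name and
the numeric values are all hypothetical. The demo runs against a
temporary directory rather than a real cgroup mount.

```python
import os
import tempfile


def write_tunable(group_dir, name, value):
    """Write one tunable file inside a cgroup directory."""
    with open(os.path.join(group_dir, name), "w") as f:
        f.write(str(value))


def build_vm_cpu_layout(root, vm_name, nvcpus, niothreads,
                        shares, period_us, quota_us):
    """Create the per-VM group plus one child group per functional part."""
    vm_dir = os.path.join(root, vm_name)
    os.makedirs(vm_dir, exist_ok=True)
    # Relative weighting of the whole VM against other VMs.
    write_tunable(vm_dir, "cpu.shares", shares)

    children = ["vcpu%d" % i for i in range(nvcpus)]
    children += ["iothread%d" % i for i in range(niothreads)]
    children.append("emulator")
    for child in children:
        child_dir = os.path.join(vm_dir, child)
        os.makedirs(child_dir, exist_ok=True)
        # Absolute cap for this functional part: quota per period.
        write_tunable(child_dir, "cpu.cfs_period_us", period_us)
        write_tunable(child_dir, "cpu.cfs_quota_us", quota_us)
    return children


# Demo against a scratch directory (hypothetical values throughout).
demo_root = tempfile.mkdtemp()
demo_children = build_vm_cpu_layout(demo_root, "demo-vm", 2, 1,
                                    shares=1024,
                                    period_us=100000,
                                    quota_us=50000)
```

On a real v1 host the individual thread IDs would additionally be
written to each child group's "tasks" file, which is the per-thread
placement that nice levels have no equivalent for.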

> > When I spoke with Lennart about cgroup v2, way back in Jan, he indicated
> > that while systemd can technically work with a system where some
> > controllers are mounted as v1, while others are mounted as v2, this
> > would not be an officially supported solution. Thus systemd in Fedora
> > was not likely to switch to v2 until all required controllers could use
> > v2. I'm not sure if this still corresponds to Lennart's current views, so
> > CC'ing him to confirm/deny.
> 
> The hybrid mode implemented in systemd uses cgroup v2 for process
> management (the "name=systemd" hierarchy) but keeps using v1
> hierarchies for all resource control.  For "Delegate=" users, I don't
> think it'd matter all that much.  Such users either see all v1
> hierarchies for all resource controllers as before or the v2
> hierarchy.
> 
> > I think from Libvirt POV it would greatly simplify life if we could
> > likewise restrict ourselves to dealing with hosts which are exclusively
> > v1 or exclusively v2, and not a mixture. ie we can completely isolate
> > our codebases for v1 vs v2 management, making it easier to reason about
> > and test their correctness, reducing QA testing burden.
> 
> I think that's gonna be the case.  People *may* try to mix v1 and v2
> hierarchies for resource control manually but supporting the mixture
> in any major software project would require a lot of complications
> which are difficult to justify.

Ok, that's good to know.

> > Anyway, in summary, we'd like to see v2 support of course, since that
> > is clearly the future. The big question is what we do about the situation
> > wrt not all controllers being supported in v2 - the lack of complete
> > conversion is what has stopped me from doing any work in this area
> > up to now.
> 
> What I'm suggesting now is, if available, to use systemd to set up
> resource control up to delegation point.  This also would make control
> ownership arbitration between systemd and libvirt easier to solve.

Libvirt currently uses machined to create the cgroup directory,
eg /machines/foo, and then writes settings to /machine/foo/$KEY

IIUC, Delegate=yes doesn't let you write to tunables at
the cgroup /machines/foo - it merely gives libvirt permission
to create /machines/foo/bar and write at /machines/foo/bar/$KEY.

So the Delegate=yes feature is only useful to libvirt in the
context of LXC guests, as it lets the OS libvirt spawns inside
the guest control its sub-hierarchy. Libvirt still has to
rely on the systemd DBus API to set the tunables at
/machine/foo/$KEY
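
To make the last point concrete, here is a small Python sketch of what
setting a top level tunable via systemd, rather than by writing the
cgroup file directly, could look like. It only builds a
"systemctl set-property" command line, which asks systemd over DBus to
change a unit's properties. The unit name "machine-qemu-demo.scope" and
the helper name are hypothetical, and CPUShares is used purely as an
example property. This is not how libvirt is implemented.

```python
def set_property_argv(unit, **props):
    """Build a systemctl command line that sets resource properties
    on a unit at runtime (not persisted across reboots)."""
    argv = ["systemctl", "set-property", "--runtime", unit]
    # Sort for a stable, predictable argument order.
    argv += ["%s=%s" % (key, value) for key, value in sorted(props.items())]
    return argv


# Hypothetical unit name; the command is built but not executed here.
demo_argv = set_property_argv("machine-qemu-demo.scope", CPUShares=2048)
```

The resulting list could be passed to subprocess.run() on a host where
such a unit actually exists.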

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|