[libvirt] CFS Hardlimits and the libvirt cgroups implementation

Fri Jun 10 09:20:17 UTC 2011

On Wed, Jun 08, 2011 at 02:20:23PM -0500, Adam Litke wrote:
> Hi all.  In this post I would like to bring up 3 issues which are
> tightly related: 1. unwanted behavior when using cfs hardlimits with
> libvirt, 2. Scaling cputune.share according to the number of vcpus, 3.
> API proposal for CFS hardlimits support.
> 
> 
> === 1 ===
> Mark Peloquin (on cc:) has been looking at implementing CFS hard limit
> support on top of the existing libvirt cgroups implementation and he has
> run into some unwanted behavior when enabling quotas that seems to be
> affected by the cgroup hierarchy being used by libvirt.
> 
> Here are Mark's words on the subject (posted by me while Mark joins this
> mailing list):
> ------------------
> I've conducted a number of measurements using CFS.
> 
> The system config is a 2 socket Nehalem system with 64GB ram. Installed
> is RHEL6.1-snap4. The guest VMs being used have RHEL5.5 - 32bit. I've
> replaced the kernel with 2.6.39-rc6+ with patches from
> Paul-V6-upstream-breakout.tar.bz2 for CFS bandwidth. The test config
> uses 5 VMs of various vcpu and memory sizes. Being used are 2 VMs with 2
> vcpus and 4GB of memory, 1 VM with 4vcpus/8GB, another VM with
> 8vcpus/16GB and finally a VM with 16vcpus/16GB.
> 
> Thus far the tests have been limited to cpu intensive workloads. Each VM
> runs a single instance of the workload. The workload is configured to
> create one thread for each vcpu in the VM. The workload is then capable
> of completely saturating each vcpu in each VM.
> 
> CFS was tested using two different topologies.
> 
> First vcpu cgroups were created under each VM created by libvirt. The
> vcpu threads from the VM's cgroup/tasks were moved to the tasks list of
> each vcpu cgroup, one thread to each vcpu cgroup. This tree structure
> permits setting CFS quota and period per vcpu. Default values for
> cpu.shares (1024), quota (-1) and period (500000us) were used in each VM
> cgroup and inherited by the vcpu cgroup. With these settings the workload
> generated system cpu utilization (measured in the host) of >99% guest,
> >0.1 idle, 0.14% user and 0.38 system.
> 
> Second, using the same topology, the CFS quota in each vcpu's cgroup was
> set to 250000us allowing each vcpu to consume 50% of a cpu. The cpu
> workload was run again. This time the total system cpu utilization was
> measured at 75% guest, ~24% idle, 0.15% user and 0.40% system.
> 
> The topology was changed such that a cgroup for each vcpu was created in
> /cgroup/cpu.
> 
> The first test used the default/inherited shares and CFS quota and
> period. The measured system cpu utilization was >99% guest, ~0.5 idle,
> 0.13 user and 0.38 system, similar to the default settings using vcpu
> cgroups under libvirt.
> 
> The next test, like before the topology change, set the vcpu quota
> values to 250000us or 50% of a cpu. In this case the measured system cpu
> utilization was ~92% guest, ~7.5% idle, 0.15% user and 0.38% system.
> 
> We can see that moving the vcpu cgroups from being under libvirt/qemu
> makes a big difference in idle cpu time.
> 
> Does this suggest a possible problem with libvirt?
> ------------------
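
To make the numbers in the quoted tests easier to follow, here is a
minimal sketch of the quota arithmetic. The function names are made up
for illustration; the values (period 500000us, quota 250000us, quota -1
meaning no limit, and the five VM sizes) come from the description above.
The host CPU count is not stated in the post, so the sketch only computes
the aggregate cap in units of CPUs, not an expected utilization figure.

```python
# Hypothetical helpers, not libvirt or kernel API.

def cpu_fraction(quota_us, period_us):
    """Fraction of one CPU a cgroup may use per period; None means unlimited."""
    if quota_us < 0:
        return None  # quota of -1 disables the hard limit
    return quota_us / period_us


def aggregate_cap(vcpus_per_vm, quota_us, period_us):
    """Total CPU time (in CPUs) all vcpus may use when each vcpu has its own quota."""
    frac = cpu_fraction(quota_us, period_us)
    if frac is None:
        return None
    return sum(vcpus_per_vm) * frac


# VM sizes from the quoted test setup: 2, 2, 4, 8 and 16 vcpus.
VMS = [2, 2, 4, 8, 16]

if __name__ == "__main__":
    print(cpu_fraction(250000, 500000))        # share of one CPU per vcpu
    print(aggregate_cap(VMS, 250000, 500000))  # cap summed over all vcpus
```

If the per-vcpu limit were the only constraint, both topologies would be
expected to reach the same aggregate cap, which is why the difference in
idle time between them is worth explaining.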

I can't really understand from your description what the different
setups are. You're talking about libvirt vcpu cgroups, but nothing
in libvirt does vcpu based cgroups, our cgroup granularity is always
per-VM.

> === 2 ===
> Something else we are seeing is that libvirt's default setting for
> cputune.share is 1024 for any domain (regardless of how many vcpus are
> configured).  This ends up hindering performance of really large VMs
> (with lots of vcpus) as compared to smaller ones since all domains are
> given equal share.  Would folks consider changing the default for
> 'shares' to be a quantity scaled by the number of vcpus such that bigger
> domains get to use proportionally more host cpu resource?

Well that's just the kernel default setting actually. The intent
of the default cgroups configuration for a VM, is that it should
be identical to the configuration if the VM was *not* in any
cgroups. So I think that gives some justification for setting
the cpu shares relative to the # of vCPUs by default, otherwise
we have a regression vs not using cgroups.
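
As a rough sketch of what scaling by vCPU count would mean, assuming the
simplest possible rule (1024 shares per vCPU) and that all VMs sit in
sibling cgroups competing for a fully contended host. The helper names
are hypothetical and this is not what libvirt currently does:

```python
KERNEL_DEFAULT_SHARES = 1024  # default cpu.shares value mentioned above


def scaled_shares(nvcpus, base=KERNEL_DEFAULT_SHARES):
    """Assumed rule: shares grow linearly with the number of vCPUs."""
    if nvcpus < 1:
        raise ValueError("a VM needs at least one vCPU")
    return base * nvcpus


def host_fractions(vcpus_per_vm, scale):
    """Fraction of contended host CPU each VM gets among sibling cgroups."""
    if scale:
        shares = [scaled_shares(n) for n in vcpus_per_vm]
    else:
        shares = [KERNEL_DEFAULT_SHARES for _ in vcpus_per_vm]
    total = sum(shares)
    return [s / total for s in shares]


if __name__ == "__main__":
    vms = [2, 2, 4, 8, 16]  # VM sizes from the test setup quoted earlier
    print(host_fractions(vms, scale=False))  # flat default: every VM equal
    print(host_fractions(vms, scale=True))   # proportional to vCPU count
```

With the flat default the 16 vCPU guest gets the same slice as a 2 vCPU
guest under contention; with the scaled rule its slice is proportional
to its vCPU count.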

> === 3 ===
> Besides the above issues, I would like to open a discussion on what the
> libvirt API for enabling cpu hardlimits should look like.  Here is what
> I was thinking:
> 
> Two additional scheduler parameters (based on the names given in the
> cgroup fs) will be recognized for qemu domains: 'cfs_period' and
> 'cfs_quota'.  These can use the existing
> virDomain[Get|Set]SchedulerParameters() API.  The Domain XML schema
> would be updated to permit the following:
> 
> --- snip ---
> <cputune>
>   ...
>   <cfs_period>1000000</cfs_period>
>   <cfs_quota>500000</cfs_quota>
> </cputune>
> --- snip ---

I don't think 'cfs_' should be in the names here. These absolute
limits on CPU time could easily be applicable to non-CFS schedulers
or non-Linux hypervisors.
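
For reference, the fragment as proposed could be read like this. It is
only a sketch: the element names are the ones from the proposal quoted
above (still carrying the 'cfs_' prefix), not a final schema, the '...'
placeholder is left out, and parse_cputune() is a made-up helper rather
than anything in libvirt.

```python
import xml.etree.ElementTree as ET

# The fragment from the proposal, minus the "..." placeholder.
PROPOSED = """<cputune>
  <cfs_period>1000000</cfs_period>
  <cfs_quota>500000</cfs_quota>
</cputune>"""


def parse_cputune(xml_text):
    """Return (period_us, quota_us) from a <cputune> fragment."""
    root = ET.fromstring(xml_text)
    period = int(root.findtext("cfs_period"))
    quota = int(root.findtext("cfs_quota"))
    return period, quota


if __name__ == "__main__":
    period, quota = parse_cputune(PROPOSED)
    # Share of one CPU allowed per period under this setting.
    print(quota / period)
```

Whatever names are chosen, the values reduce to an absolute limit on CPU
time per period, which is the part that is not specific to CFS.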

> To actuate these configuration settings, we simply apply the values to
> the appropriate cgroup(s) for the domain.  We would prefer that each
> vcpu be in its own cgroup to ensure equal and fair scheduling across all
> vcpus running on the system.  (We will need to resolve the issues
> described by Mark in order to figure out where to hang these cgroups).

The reason for putting VMs in cgroups is that, because KVM is multithreaded,
using Cgroups is the only way to control settings of the VM as a whole. If
you just want to control individual VCPU settings, then that can be done
without cgroups just by setting the process' scheduling priority via the normal
APIs. Creating cgroups at the granularity of individual vCPUs is somewhat
troublesome, because if the administrator has mounted other cgroup
controllers at the same location as the 'cpu' controller, then putting
each VCPU in a separate cgroup will negatively impact other aspects of
the VM. Also KVM has a number of other non-VCPU threads which consume a
non-trivial amount of CPU time, which often come & go over time. So IMHO
the smallest cgroup granularity should remain per-VM.
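
To illustrate the "normal APIs" route, here is a minimal Linux-only
sketch that adjusts the nice value of individual threads without any
cgroups. It relies on the Linux behaviour that setpriority() with
PRIO_PROCESS and a thread ID affects just that thread. The helper names
are hypothetical, and working out which of a QEMU process's threads are
vCPU threads is deliberately left out.

```python
import os


def list_thread_ids(pid):
    """Return the kernel thread IDs of a process, read from /proc (Linux only)."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/task"))


def thread_nice(tid):
    """Read the current nice value of one thread."""
    return os.getpriority(os.PRIO_PROCESS, tid)


def set_thread_nice(tid, nice):
    """Set the nice value of one thread.

    Lowering the value (raising priority) normally needs privileges.
    """
    os.setpriority(os.PRIO_PROCESS, tid, nice)


if __name__ == "__main__":
    # Demonstrated on this process; a real caller would pass the QEMU pid.
    for tid in list_thread_ids(os.getpid()):
        print(tid, thread_nice(tid))
```

Note that this only covers relative priority per thread; it does not give
an absolute cap on CPU time the way a quota does.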

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|