[libvirt] CFS Hardlimits and the libvirt cgroups implementation

Wen Congyang wency at cn.fujitsu.com
Fri Jun 10 09:45:10 UTC 2011


At 06/10/2011 05:20 PM, Daniel P. Berrange wrote:
> On Wed, Jun 08, 2011 at 02:20:23PM -0500, Adam Litke wrote:
>> Hi all.  In this post I would like to bring up 3 issues which are
>> tightly related: 1. unwanted behavior when using cfs hardlimits with
>> libvirt, 2. Scaling cputune.share according to the number of vcpus, 3.
>> API proposal for CFS hardlimits support.
>>
>>
>> === 1 ===
>> Mark Peloquin (on cc:) has been looking at implementing CFS hard limit
>> support on top of the existing libvirt cgroups implementation and he has
>> run into some unwanted behavior when enabling quotas that seems to be
>> affected by the cgroup hierarchy being used by libvirt.
>>
>> Here are Mark's words on the subject (posted by me while Mark joins this
>> mailing list):
>> ------------------
>> I've conducted a number of measurements using CFS.
>>
>> The system config is a 2 socket Nehalem system with 64GB ram. Installed
>> is RHEL6.1-snap4. The guest VMs being used have RHEL5.5 - 32bit. I've
>> replaced the kernel with 2.6.39-rc6+ with patches from
>> Paul-V6-upstream-breakout.tar.bz2 for CFS bandwidth. The test config
>> uses 5 VMs of various vcpu and memory sizes. Being used are 2 VMs with 2
>> vcpus and 4GB of memory, 1 VM with 4vcpus/8GB, another VM with
>> 8vcpus/16GB and finally a VM with 16vcpus/16GB.
>>
>> Thus far the tests have been limited to cpu intensive workloads. Each VM
>> runs a single instance of the workload. The workload is configured to
>> create one thread for each vcpu in the VM. The workload is then capable
>> of completely saturating each vcpu in each VM.
>>
>> CFS was tested using two different topologies.
>>
>> First, vcpu cgroups were created under each VM cgroup created by libvirt. The
>> vcpu threads from the VM's cgroup/tasks were moved to the tasks list of
>> each vcpu cgroup, one thread to each vcpu cgroup. This tree structure
>> permits setting CFS quota and period per vcpu. Default values for
>> cpu.shares (1024), quota (-1) and period (500000us) were used in each VM
>> cgroup and inherited by the vcpu cgroups. With these settings the workload
>> generated system cpu utilization (measured in the host) of >99% guest, >0.1% idle, 0.14% user and 0.38% system.
>>
>> Second, using the same topology, the CFS quota in each vcpu's cgroup was
>> set to 250000us, allowing each vcpu to consume 50% of a cpu (250000us of
>> quota against the default 500000us period). The cpu workload was run
>> again. This time the total system cpu utilization was measured at 75%
>> guest, ~24% idle, 0.15% user and 0.40% system.
>>
>> The topology was changed such that a cgroup for each vcpu was created in
>> /cgroup/cpu.
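
To make the two setups easier to compare, the hierarchies being measured
look roughly like this (the group names are assumptions based on the
description above; libvirt itself only creates the per-VM level):

--- snip ---
Topology 1: per-vcpu groups nested under each VM's libvirt cgroup
    /cgroup/cpu/libvirt/qemu/VM1/vcpu0
    /cgroup/cpu/libvirt/qemu/VM1/vcpu1
    ...

Topology 2: per-vcpu groups created directly under the cpu controller mount
    /cgroup/cpu/VM1-vcpu0
    /cgroup/cpu/VM1-vcpu1
    ...
--- snip ---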
>>
>> The first test used the default/inherited shares and CFS quota and
>> period. The measured system cpu utilization was >99% guest, ~0.5% idle,
>> 0.13% user and 0.38% system, similar to the default settings using vcpu
>> cgroups under libvirt.
>>
>> The next test, like before the topology change, set the vcpu quota
>> values to 250000us or 50% of a cpu. In this case the measured system cpu
>> utilization was ~92% guest, ~7.5% idle, 0.15% user and 0.38% system.
>>
>> We can see that moving the vcpu cgroups out from under libvirt/qemu
>> makes a big difference in idle cpu time.
>>
>> Does this suggest a possible problem with libvirt?
>> ------------------
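
For reference, applying such a per-vcpu quota by hand only requires writing
to the group's cpu.cfs_quota_us file. A minimal sketch in C (the
/cgroup/cpu/libvirt/qemu/... path is an assumption taken from the setup
described above, not necessarily what libvirt creates):

--- snip ---
/* Sketch only: write a CFS quota value into one vcpu cgroup. */
#include <stdio.h>

static int set_cfs_quota(const char *cgroup_path, long long quota_us)
{
    char path[512];
    FILE *fp;

    snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", cgroup_path);
    fp = fopen(path, "w");
    if (!fp)
        return -1;
    fprintf(fp, "%lld\n", quota_us);
    return fclose(fp) == 0 ? 0 : -1;
}

int main(void)
{
    /* 250000us quota against the default 500000us period = 50% of a cpu.
     * The path is a placeholder for wherever the vcpu group actually lives. */
    return set_cfs_quota("/cgroup/cpu/libvirt/qemu/VM1/vcpu0", 250000) ? 1 : 0;
}
--- snip ---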
> 
> I can't really understand from your description what the different
> setups are. You're talking about libvirt vcpu cgroups, but nothing
> in libvirt does vcpu based cgroups, our cgroup granularity is always
> per-VM.
> 
>> === 2 ===
>> Something else we are seeing is that libvirt's default setting for
>> cputune.share is 1024 for any domain (regardless of how many vcpus are
>> configured).  This ends up hindering performance of really large VMs
>> (with lots of vcpus) as compared to smaller ones since all domains are
>> given equal share.  Would folks consider changing the default for
>> 'shares' to be a quantity scaled by the number of vcpus such that bigger
>> domains get to use proportionally more host cpu resource?
> 
> Well that's just the kernel default setting actually. The intent
> of the default cgroups configuration for a VM is that it should
> be identical to the configuration if the VM was *not* in any
> cgroups. So I think that gives some justification for setting
> the cpu shares relative to the # of vCPUs by default, otherwise
> we have a regression vs not using cgroups.
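
To make that concrete, a 4-vcpu guest scaled this way would behave as if
its XML carried something like the following (assuming a simple
1024 * nvcpus rule; <shares> is the existing cputune tunable, the scaled
default is only the proposal here):

--- snip ---
<cputune>
  <!-- assumed scaling: 1024 * 4 vcpus; the current default is a flat 1024 -->
  <shares>4096</shares>
</cputune>
--- snip ---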
> 
>> === 3 ===
>> Besides the above issues, I would like to open a discussion on what the
>> libvirt API for enabling cpu hardlimits should look like.  Here is what
>> I was thinking:
>>
>> Two additional scheduler parameters (based on the names given in the
>> cgroup fs) will be recognized for qemu domains: 'cfs_period' and
>> 'cfs_quota'.  These can use the existing
>> virDomain[Get|Set]SchedulerParameters() API.  The Domain XML schema
>> would be updated to permit the following:
>>
>> --- snip ---
>> <cputune>
>>   ...
>>   <cfs_period>1000000</cfs_period>
>>   <cfs_quota>500000</cfs_quota>
>> </cputune>
>> --- snip ---
> 
> I don't think 'cfs_' should be in the names here. These absolute
> limits on CPU time could easily be applicable to non-CFS schedulers
> or non-Linux hypervisors.

Do you mean the element names should be 'period' and 'quota'?

The files provided by CFS bandwidth are named cpu.cfs_period_us
and cpu.cfs_quota_us.

I think he used 'cfs_' because it matches those filenames, but I do
not mind which element names we use.

I am working on the patch, so I want to know which element names
should be used.
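
For illustration, a minimal sketch of how the proposed parameters could be
driven through the existing API, whatever names are finally chosen (the
'cfs_period'/'cfs_quota' strings below are just the names from Adam's
proposal, and "MyGuest" and the connection URI are placeholders):

--- snip ---
#include <stdio.h>
#include <string.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom;
    virSchedParameter params[2];

    if (!conn)
        return 1;
    dom = virDomainLookupByName(conn, "MyGuest");
    if (!dom) {
        virConnectClose(conn);
        return 1;
    }

    memset(params, 0, sizeof(params));

    /* period in us, matching the XML example above */
    strncpy(params[0].field, "cfs_period", VIR_DOMAIN_SCHED_FIELD_LENGTH - 1);
    params[0].type = VIR_DOMAIN_SCHED_FIELD_ULLONG;
    params[0].value.ul = 1000000;

    /* quota in us: half the period, i.e. 50% of a cpu; -1 would mean no limit */
    strncpy(params[1].field, "cfs_quota", VIR_DOMAIN_SCHED_FIELD_LENGTH - 1);
    params[1].type = VIR_DOMAIN_SCHED_FIELD_LLONG;
    params[1].value.l = 500000;

    if (virDomainSetSchedulerParameters(dom, params, 2) < 0)
        fprintf(stderr, "failed to set scheduler parameters\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
--- snip ---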

> 
>> To actuate these configuration settings, we simply apply the values to
>> the appropriate cgroup(s) for the domain.  We would prefer that each
>> vcpu be in its own cgroup to ensure equal and fair scheduling across all
>> vcpus running on the system.  (We will need to resolve the issues
>> described by Mark in order to figure out where to hang these cgroups).
> 
> The reason for putting VMs in cgroups is that, because KVM is multithreaded,
> using Cgroups is the only way to control settings of the VM as a whole. If
> you just want to control individual VCPU settings, then that can be done
> without cgroups just by setting the process's scheduler priority via the
> normal APIs. Creating cgroups at the granularity of individual vCPUs is somewhat
> troublesome, because if the administrator has mounted other cgroups
> controllers at the same location as the 'cpu' controller, then putting
> each VCPU in a separate cgroup will negatively impact other aspects of
> the VM. Also KVM has a number of other non-VCPU threads which consume a
> non-trivial amount of CPU time, which often come & go over time. So IMHO
> the smallest cgroup granularity should remain per-VM.
> 
> 
> Daniel
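
If the per-VM cgroup stays the smallest granularity, one possible way to
preserve the intent of the proposed per-vcpu settings is to scale the quota
by the number of vcpus before writing it to the VM's cgroup. This is only a
sketch of that idea, not something libvirt does today:

--- snip ---
/* Sketch: derive a whole-VM CFS quota from a per-vcpu quota and the vcpu
 * count. The VM gets the same aggregate cap as N individually limited
 * vcpus, though a single busy vcpu may then use more than its own share. */
long long vm_cfs_quota(long long vcpu_quota_us, unsigned int nvcpus)
{
    if (vcpu_quota_us < 0)      /* -1 means "no limit" in cpu.cfs_quota_us */
        return -1;
    return vcpu_quota_us * (long long)nvcpus;
}

/* Example: 250000us per vcpu on a 4-vcpu guest gives 1000000us for the VM
 * cgroup, i.e. the whole VM may use 2 cpus worth of time per 500000us
 * period. */
--- snip ---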



