[libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA

Martin Kletzander mkletzan at redhat.com
Fri Oct 14 08:19:42 UTC 2016


On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote:
>On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote:
>> On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
>> >On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
>> >>On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
>> >>>At the moment, guests that are backed by hugepages in the host are
>> >>>only able to use policy to control the placement of those hugepages
>> >>>on a per-guest-NUMA-node basis. Policy applied globally is ignored.
>> >>>
>> >>>Such guests would use <memoryBacking><hugepages/></memoryBacking> and
>> >>>a <numatune> block with <memory mode=... nodeset=.../> but no <memnode
>> >>>.../> elements.
>> >>>
>> >>>This patch corrects this by, in this specific case, changing the QEMU
>> >>>command line from "-mem-prealloc -mem-path=..." (which cannot
>> >>>specify NUMA policy) to "-object memory-backend-file ..." (which can).
>> >>>
>> >>>Note: This is not visible to the guest and does not appear to create
>> >>>a migration incompatibility.
>> >>>
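For illustration, the two command-line shapes being compared look roughly
like this; the id, size, path and node values are invented, and the way the
backend gets attached to the machine is not shown:

  # before: preallocated hugepage backing, with no way to express NUMA policy
  qemu-system-ppc64 ... -mem-prealloc -mem-path /dev/hugepages/libvirt/qemu ...

  # after: a file-backed memory backend, which does accept a policy
  qemu-system-ppc64 ... \
      -object memory-backend-file,id=ram,size=2048M,mem-path=/dev/hugepages/libvirt/qemu,prealloc=yes,host-nodes=0,policy=bind \
      ...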
>> >>
>> >>It could make sense; I haven't tried it yet, though.  However, I still
>> >>don't see the point in using memory-backend-file.  Is it that the
>> >>allocation doesn't work well when you don't have the cpuset cgroup?
>> >>Because it certainly does work for me.
>> >
>> >Thanks for taking a look at this :-)
>> >
>> >The point of using a memory-backend-file is that with it, the NUMA policy can
>> >be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell
>> >QEMU to apply NUMA policy in the right place. It does seem odd to me to use
>> >memory-backend-file without attaching the backend to a guest NUMA node, but it
>> >seems to do the right thing in this case. (If there are guest NUMA nodes, or if
>> >hugepages aren't being used, policy is correctly applied.)
>> >
>> >I'll describe my test case in detail; perhaps there's something happening
>> >that I don't understand.
>> >
>> >* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages
>> > on node 1, and none on node 0.
>> >
>> >* I create a 2G guest using virt-install:
>> >
>> >virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
>> >
>> >* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
>> >
>> > <numatune>
>> >    <memory mode='strict' nodeset='0'/>
>> > </numatune>
>> >
>> >* "virsh start", and the machine starts (I believe it should fail due to insufficient memory satasfying the policy).
>> >* "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
>> >
>> >Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
>> >                          Node 0          Node 1           Total
>> >                 --------------- --------------- ---------------
>> >Huge                         0.00         2048.00         2048.00
>> >Heap                         8.12            0.00            8.12
>> >Stack                        0.03            0.00            0.03
>> >Private                     35.80            6.10           41.90
>> >----------------  --------------- --------------- ---------------
>> >Total                       43.95         2054.10         2098.05
>> >
>> >So it looks like it has allocated hugepages from node 1; isn't this violating
>> >the policy I set via numatune?
>> >
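For reference, the per-node hugepage reservation in the first step can be
done through sysfs; a rough sketch, with the 2048kB page size and the counts
only as examples:

  # 1024 x 2MB hugepages on node 1, none on node 0
  echo 1024 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
  echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages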
>>
>> Oh, now I get it.  We are doing our best to apply that policy to qemu
>> even when we don't have this option (memory-backend-file).  However,
>> using the backend works even better (which is probably* what we want).
>> And that's the reasoning behind this patch.
>>
>> * I'm saying probably because when I was adding numactl binding to be
>>   used together with cgroups, I was told that we couldn't change the
>>   binding afterwards and it's bad.  I feel like we could do something
>>   with that and it would help us in the future, but there needs to be a
>>   discussion, I guess.  Because I might be one of the few =)
>>
>> So to recapitulate, there are three options for affecting the
>> allocation of qemu's memory:
>>
>> 1) numactl (libnuma): it works as expected, but cannot be changed later
>>
>> 2) cgroups: so strict that it has to be applied after qemu has started;
>>    because of that it doesn't work right, especially for memory that
>>    gets pre-allocated (like hugepages).  It can be changed later, but
>>    that doesn't guarantee the memory will actually migrate.  If cgroups
>>    are unavailable, we fall back to (1) anyway
>>
>> 3) memory-backend-file's host-nodes=: this works as expected, but it
>>    cannot be used with older QEMUs, cannot be changed later, and in some
>>    cases (not your particular one) it might screw up migration if it
>>    wasn't used before.
>>
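As a rough illustration of (2), with the cgroup path depending on the
machine name and how libvirt laid out the hierarchy:

  # restrict the running QEMU to node 0; this only affects allocations made
  # after this point, already-faulted pages are not necessarily moved
  echo 0 > /sys/fs/cgroup/cpuset/machine.slice/<qemu scope>/cpuset.mems
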
>> Selecting the best option from these, plus making the code work with
>> every possibility (erroring out when you want to change the memory node
>> but we had to use (1), for example), is a pain.  We should really think
>> about that and reorganize these things for the better, going forward.
>> Otherwise we're going to overwhelm ourselves.  Cc'ing Peter to get his
>> thoughts, as he worked on some parts of this as well.
>>
>> Martin
>
>Thanks for the explanation, and I agree (I'm already a bit overwhelmed!) :-)
>
>What do you mean by "changed later"? Do you mean, if the domain XML is changed
>while the machine is running?
>

E.g. by 'virsh numatune domain 1-2'
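
Spelled out, that is roughly:

  virsh numatune domain --nodeset 1-2 --live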

>I did look at the libnuma and cgroups approaches, but I was concerned they
>wouldn't work in this case, because of the way QEMU allocates memory when
>mem-prealloc is used: the memory is allocated in the main process, before the
>CPU threads are created. (This is based only on a bit of hacking and debugging
>in QEMU, but it does seem to explain the behaviour I've seen so far.)
>

But we use numactl before QEMU is exec()'d.
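
That is roughly the moral equivalent of starting QEMU under numactl
(libvirt calls libnuma in the child process just before the exec()):

  # bind all of QEMU's allocations to node 0 from the start
  numactl --membind=0 qemu-system-ppc64 ...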

>If this is the case, it would seem to be a significant problem: if policy is
>set on the main thread, it will affect all allocations, not just the VCPU
>memory, and if it's set on the VCPU threads it won't catch the pre-allocation
>at all. (Is this what you were referring to by "it doesn't work right"?)
>

Kind of, yes.

>That was my reasoning for trying to use the backend object in this case; it was
>the only method that worked and did not require changes to QEMU. I'd prefer
>the other approaches if they could be made to work.
>

There is a workaround: you can disable the cpuset cgroup in libvirt's
QEMU driver config (qemu.conf), but that's not what you want, I guess.
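
A sketch of that knob, assuming the cgroup_controllers list:

  # leave "cpuset" out of the controllers libvirt is allowed to use
  cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]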

>I think QEMU could be altered to move the preallocations into the VCPU
>threads, but it didn't seem trivial, and I suspected the QEMU community
>would point out that there was already a way to do it using backend
>objects.  Another option would be to add a -host-nodes parameter to QEMU
>so that the policy can be given without adding a memory backend object.
>(That seems like a more reasonable change to QEMU.)
>

I think upstream won't like that, mostly because there is already a
way, and that is using a memory-backend object.  I think we could just
use that and disable changing it live, but upstream will probably want
that to be configurable or something.
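
For reference, the existing way is a memory backend tied to a guest NUMA
node; a rough sketch, with all values invented:

  -object memory-backend-file,id=ram-node0,size=2048M,mem-path=/dev/hugepages,host-nodes=0,policy=bind \
  -numa node,nodeid=0,memdev=ram-node0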

>Cheers,
>Sam.
>