[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [libvirt][RFC PATCH] add a new 'default' option for attribute mode in numatune

On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:

On 8/3/2020 7:00 PM, Martin Kletzander wrote:
On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
Hi Libvirt experts,

I would like enhence the numatune snippet configuration. Given a
example snippet:

   <memory mode="strict" nodeset="1-4,^3"/>
   <memnode cellid="0" mode="strict" nodeset="1"/>
   <memnode cellid="2" mode="preferred" nodeset="2"/>

Currently, attribute mode is either 'interleave', 'strict', or
I propose to add a new 'default'  option. I give the reason as following.

Presume we are using cgroups v1, Libvirt sets cpuset.mems for all vcpu
according to 'nodeset' in memory element. And translate the memnode
element to
qemu config options (--object memory-backend-ram) for per numa cell,
invoking mbind() system call at the end.[1]

But what if we want using default memory policy and request each guest
numa cell
pinned to different host memory nodes? We can't use mbind via qemu
config options,
because (I quoto here) "For MPOL_DEFAULT, the nodemask and maxnode
arguments must
be specify the empty set of nodes." [2]

So my solution is introducing a new 'default' option for attribute
mode. e.g.

   <memory mode="default" nodeset="1-2"/>
   <memnode cellid="0" mode="default" nodeset="1"/>
   <memnode cellid="1" mode="default" nodeset="2"/>

If the mode is 'default', libvirt should avoid generating qemu command
'--object memory-backend-ram', and invokes cgroups to set cpuset.mems
for per guest numa
combining with numa topology config. Presume the numa topology is :

   <cell id='0' cpus='0-3' memory='512000' unit='KiB' />
   <cell id='1' cpus='4-7' memory='512000' unit='KiB' />

Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and '2' for
vcpus 4-7.

Is this reasonable and feasible? Welcome any comments.

There are couple of problems here.  The memory is not (always) allocated
by the
vCPU threads.  I also remember it to not be allocated by the process,
but in KVM
in a way that was not affected by the cgroup settings.

Thanks for your reply. Maybe I don't get what you mean, could you give
me more context? But what I proposed will have no effect on other memory

Check how cgroups work.  We can set the memory nodes that a process will
allocate from.  However to set the node for the process (thread) QEMU needs to
be started with the vCPU threads already spawned (albeit stopped).  And for that
QEMU already allocates some memory.  Moreover if extra memory was allocated
after we set the cpuset.mems it is not guaranteed that it will be allocated by
the vCPU in that NUMA cell, it might be done in the emulator instead or the KVM
module in the kernel in which case it might not be accounted for the process
actually causing the allocation (as we've already seen with Linux).  In all
these cases cgroups will not do what you want them to do.  The last case might
be fixed, the first ones are by default not going to work.

That might be
fixed now,

But basically what we have against is all the reasons why we started using
QEMU's command line arguments for all that.

I'm not proposing use QEMU's command line arguments, on contrary I want
using cgroups setting to support a new config/requirement. I give a
solution about if we require default memory policy and memory numa pinning.

And I'm suggesting you look at the commit log to see why we *had* to add these
command line arguments, even though I think I managed to describe most of them
above already (except for one that _might_ already be fixed in the kernel).  I
understand the git log is huge and the code around NUMA memory allocation was
changing a lot, so I hope my explanation will be enough.

Sorry, but I think it will more likely break rather than fix stuff.
Maybe this
could be dealt with by a switch in `qemu.conf` with a huge warning above

I'm not trying to fix something, I propose how to support a new
requirement just like I stated above.

I guess we should take a couple of steps back, I don't get what you are trying
to achieve.  Maybe if you describe your use case it will be easier to reach a

Have a nice day,





Attachment: signature.asc
Description: PGP signature

[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]