[libvirt] [RFC] NUMA topology specification

Bharata B Rao bharata.rao at gmail.com
Wed Aug 24 09:13:36 UTC 2011


On Tue, Aug 23, 2011 at 7:43 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> On Fri, Aug 19, 2011 at 12:05:43PM +0530, Bharata B Rao wrote:
>> Hi,
>>
>> qemu supports specification of NUMA topology on the command line using the -numa option.
>>
>> -numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]
>>
>> I see that there is no way to specify such a NUMA topology in libvirt
>> XML. Are there plans to add support for NUMA topology specification?
>> Is anybody already working on this? If not, I would like to add this
>> support to libvirt.
>>
>> Currently the topology specification available in libvirt (<topology
>> sockets='1' cores='2' threads='1'/>) translates to the "-smp
>> sockets=1,cores=2,threads=1" option of qemu. There is no equivalent
>> in libvirt that could generate the -numa command line option of qemu.
>>
>> How about something like this ? (OPTION 1)
>>
>> <cpu>
>> ...
>> <numa nodeid='node' cpus='cpu[-cpu]' mem='size'/>
>> ...
>> </cpu>
>>
>> And we could specify multiple such lines, one for each node.
>
> I'm not sure it really makes sense having the NUMA memory config
> inside the <cpu> configuration, but I like the simplicity of
> this specification.

Yes, the memory specification doesn't really belong inside <cpu>;
maybe we could define a separate <numa> section as shown in your
examples below and put it outside of <cpu>.
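
Something along these lines, perhaps (just a sketch; the element and
attribute names are placeholders, not a proposal for the final schema):

  <cpu>
    <topology sockets='2' cores='2' threads='1'/>
  </cpu>
  <numa>
    <node cpus='0-1' mem='1024'/>
    <node cpus='2-3' mem='1024'/>
  </numa>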

>
>> -numa and -smp options in qemu do not work all that well since they
>> are parsed independently of each other, and one could specify a cpu set
>> with the -numa option that is incompatible with the sockets, cores and
>> threads specified on the -smp option. This should be fixed in qemu, but
>> given that such a problem has been observed, should libvirt tie the
>> specification of numa and smp (sockets, threads, cores) together so that
>> one is forced to specify only valid combinations of nodes and cpus in
>> libvirt?
>
> No matter what we do, libvirt is going to have to do some kind of
> semantic validation on the different info.

Right. Given that we have <vcpus> as well as <vcpu current>, libvirt
needs to ensure that the specified topology is sane: e.g. sockets x
cores x threads should match <vcpus>, and the cpus= ranges of the
nodes should together cover exactly those CPUs without overlapping.

>
>> May be something like this: (OPTION 2)
>>
>> <cpu>
>> ...
>> <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'/>
>> <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'/>
>> ...
>> </cpu>
>>
>> This should result in a 2-node system with each node having 1 socket
>> with 2 cores.
>
> This has the problem of redundancy between the specification of the
> sockets, cores & threads attributes and the new 'cpus' attribute,
> e.g. you can specify weird configs like:

Yes, sockets, cores and threads become redundant. One option is to
define them once (like we currently do inside <cpu>) and have that
serve as a common definition for all the NUMA nodes defined.
Something like this:

<cpu>
  <topology sockets='1' cores='2' threads='1'/>
  <numa cpus='0-1' mem='1024'/>
  <numa cpus='2-3' mem='1024'/>
</cpu>

This will result in a 2-node system, with each node having 1 socket
with 2 cores. But as you can see, this is restrictive since you can't
specify different topologies for different nodes. Do such
non-symmetric systems exist, and should libvirt be flexible enough to
support such NUMA topologies for VMs?
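
(For instance, a hypothetical box with one 4-core socket on node 0
and one 2-core socket on node 1 would need something like

  -numa node,nodeid=0,cpus=0-3,mem=2048
  -numa node,nodeid=1,cpus=4-5,mem=1024

which a single shared topology definition cannot express.)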

Also, it looks like nodeid (from OPTION 2 of my original mail) is
redundant; maybe we should assign increasing node ids based on the
order in which the numa topology statements appear. In the above
example, we would implicitly assign node ids 0 and 1 to the two
nodes.
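
With that, the two <numa> lines in the example above would map to
something like:

  -numa node,nodeid=0,cpus=0-1,mem=1024
  -numa node,nodeid=1,cpus=2-3,mem=1024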

Adam Litke suggested that we could omit cpus= from the specification
since it can be derived, but given that there are topologies that
don't enumerate the CPUs within a socket serially, an explicit cpus=
specification becomes necessary.
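
For instance, on a machine that interleaves CPU numbering across
sockets (cpu 0 on socket 0, cpu 1 on socket 1, cpu 2 on socket 0, and
so on), the node definitions would have to look something like this
(assuming the cpus attribute accepted a list as well as a range):

  <numa cpus='0,2' mem='1024'/>  <!-- list syntax is assumed here -->
  <numa cpus='1,3' mem='1024'/>

and that mapping can't be derived from sockets/cores/threads alone.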

>
>  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'/>
>  <topology sockets='2' cores='1' threads='1' nodeid='1' cpus='2-3' mem='size'/>
>
> Or even bogus configs:
>
>  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'/>
>  <topology sockets='4' cores='1' threads='1' nodeid='1' cpus='2-3' mem='size'/>
>
> That all said, given our current XML schema, it is inevitable that we
> will have some level of duplication of information.
>
>
> Some things that are important to consider are how this interacts with
> possible CPU / memory hotplug in the future,

So what are the issues we need to take care of here?

> and how we will be able
> to pin guest NUMA nodes to host NUMA nodes.

This would be a good thing to do in libvirt. I think libvirt should
intelligently place VMs on host nodes based on the guest topology.
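
For example (purely hypothetical syntax, nothing like this exists
today), each guest node could carry an optional attribute naming the
host node to bind to:

  <numa>
    <!-- 'hostnode' is a hypothetical attribute, just to illustrate -->
    <node cpus='0-1' mem='1024' hostnode='0'/>
    <node cpus='2-3' mem='1024' hostnode='1'/>
  </numa>

and libvirt would then pin the vcpus and the memory of each guest
node onto the given host node.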

But I don't clearly see what issues we need to take care of now while
we come up with the NUMA topology definition for a VM.

>
> For the first point, it might be desirable to create a NUMA topology
> which supports up to 8 logical CPUs, but only have 2 physical sockets
> actually plugged in at boot time.
>
> Also, I dread to question whether we want to be able to represent a
> multi-level NUMA topology, or just assume one level. If we want to
> be able to cope with multi-level topology, can we assume the levels
> are solely grouping at the socket, or will we have to consider the
> possibility of NUMA *inside* a socket?

Given that such (NUMA inside a socket) topologies exist in the real
world, maybe libvirt should support them. But I guess this will make
the libvirt specification more complex.

>
> In other words, are we associating socket numbers with NUMA nodes,
> or are we associating logical CPU numbers with NUMA nodes?
>
> This is the difference between configuring something like:
>
>  <vcpus>16</vcpus>
>  <cpu>
>    <topology sockets='4' cores='4' threads='1'/>
>  </cpu>
>  <numa>
>    <node sockets='0-1' mem='0-1024'/>
>    <node sockets='2-3' mem='1024-2048'/>
>  </numa>
>
> vs
>
>  <vcpus>16</vcpus>
>  <cpu>
>    <topology sockets='4' cores='4' threads='1'/>
>  </cpu>
>  <numa>
>    <node cpus='0-7'  mem='0-1024'/>
>    <node cpus='8-15' mem='1024-2048'/>
>  </numa>
>

What is the difference between the above two? In the first case, you
put 2 sockets in one node and 2 sockets in the second node. Since
each socket has 4 cores, you end up having 8 cores (or CPUs) in each
node. In the second case, you specified 8 CPUs per node explicitly,
which obviously means that each node should have 2 sockets. Did I
miss your point?

> vs
>
>
>  <vcpus>16</vcpus>
>  <cpu>
>    <topology sockets='4' cores='4' threads='1'/>
>  </cpu>
>  <numa>
>    <node mems='0-1024'>
>      <node cpus='0-3'/>
>      <node cpus='4-7'/>
>    </node>
>    <node mems='1024-2048'>
>      <node cpus='8-11'/>
>      <node cpus='12-15'/>
>    </node>
>  </numa>
>
> vs
>
>  ...more horrible examples...

I don't really have the right answer for the multi-level NUMA
specification; I need to think about it a bit.

Regards,
Bharata.
