[libvirt] [RFC] NUMA topology specification

Tue Aug 23 14:13:58 UTC 2011

On Fri, Aug 19, 2011 at 12:05:43PM +0530, Bharata B Rao wrote:
> Hi,
> 
> qemu supports specification of NUMA topology on command line using -numa option.
> 
> -numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]
> 
> I see that there is no way to specify such NUMA topology in libvirt
> XML. Are there plans to add support for NUMA topology specification ?
> Is anybody already working on this ? If not I would like to add this
> support for libvirt.
> 
> Currently the topology specification available in libvirt ( <topology
> sockets='1' cores='2' threads='1'/>) translates to "-smp
> sockets=1,cores=2,threads=1" option of qemu. There is no equivalent
> in libvirt that could generate -numa command line option of qemu.
> 
> How about something like this ? (OPTION 1)
> 
> <cpu>
> ...
> <numa nodeid='node' cpus='cpu[-cpu]' mem='size'/>
> ...
> </cpu>
> 
> And we could specify multiple such lines, one for each node.

I'm not sure it really makes sense to have the NUMA memory config
inside the <cpu> configuration, but I like the simplicity of
this specification.

> -numa and -smp options in qemu do not work all that well since they
> are parsed independent of each other and one could specify a cpu set
> with -numa option that is incompatible with sockets,cores and threads
> specified on -smp option. This should be fixed in qemu, but given that
> such a problem has been observed, should libvirt tie the specification
> of numa and smp (sockets,threads,cores) together so that one is forced
> to specify only valid combinations of nodes and cpus in libvirt ?

No matter what we do, libvirt is going to have to do some kind of
semantic validation on the different info.
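
As a rough illustration of the kind of semantic validation meant
here, the sketch below checks that the per-node CPU ranges cover
every CPU implied by sockets * cores * threads exactly once. It is
Python purely for brevity. The function names and the list-of-ranges
input are hypothetical, not libvirt code, and only the single
'cpu[-cpu]' range form from the -numa syntax is handled. Whether
every CPU must belong to a node is itself an assumption.

```python
def parse_cpu_range(spec):
    """Parse 'a' or 'a-b' (the cpu[-cpu] form) into a set of CPU indexes."""
    first, _, last = spec.partition("-")
    lo = int(first)
    hi = int(last) if last else lo
    if hi < lo:
        raise ValueError("descending CPU range: %r" % spec)
    return set(range(lo, hi + 1))


def validate_numa(sockets, cores, threads, node_cpus):
    """Return a list of problem strings; an empty list means consistent.

    node_cpus is a list of range strings, one per NUMA node, where the
    list index is used as the node id (an assumption of this sketch).
    """
    total = sockets * cores * threads
    problems = []
    seen = set()
    for nodeid, spec in enumerate(node_cpus):
        cpus = parse_cpu_range(spec)
        out_of_range = sorted(c for c in cpus if c >= total)
        if out_of_range:
            problems.append("node %d: CPUs %s exceed topology (%d CPUs)"
                            % (nodeid, out_of_range, total))
        overlap = sorted(cpus & seen)
        if overlap:
            problems.append("node %d: CPUs %s already assigned"
                            % (nodeid, overlap))
        seen |= cpus
    missing = sorted(set(range(total)) - seen)
    if missing:
        problems.append("CPUs %s are not assigned to any node" % missing)
    return problems
```

Called as validate_numa(1, 2, 1, ["0-1", "2-3"]), this reports node 1
as exceeding the 2-CPU topology, which is the -smp vs -numa mismatch
described in the quoted text.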

> May be something like this: (OPTION 2)
> 
> <cpu>
> ...
> <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'/>
> <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'/>
> ...
> </cpu>
> 
> This should result in a 2 node system with each node having 1 socket
> with 2 cores.

This has the problem of redundancy of specification of the sockets,
cores & threads, vs the new 'cpus' attribute. E.g. you can specify
weird configs like:

  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'/>
  <topology sockets='2' cores='1' threads='1' nodeid='1' cpus='2-3' mem='size'/>

Or even bogus configs:

  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'/>
  <topology sockets='4' cores='1' threads='1' nodeid='1' cpus='2-3' mem='size'/>
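
One way to read the difference between the two pairs above: in the
first pair each element still describes as many CPUs as its 'cpus'
range holds, just in a different shape per node, while in the second
pair node 1 claims 4 sockets x 1 core x 1 thread = 4 CPUs but lists
only 2. A minimal Python sketch of that per-element check follows.
The function name is hypothetical and this is not libvirt code.

```python
def node_is_consistent(sockets, cores, threads, cpus):
    """True if the 'cpus' range holds exactly sockets*cores*threads CPUs.

    cpus is a single 'a' or 'a-b' range string, as in the examples above.
    """
    first, _, last = cpus.partition("-")
    lo = int(first)
    hi = int(last) if last else lo
    return sockets * cores * threads == hi - lo + 1


# "Weird" pair: the shapes differ per node, but the counts agree.
weird = [node_is_consistent(1, 2, 1, "0-1"),
         node_is_consistent(2, 1, 1, "2-3")]

# "Bogus" pair: node 1 describes 4 CPUs but lists only 2.
bogus = [node_is_consistent(1, 2, 1, "0-1"),
         node_is_consistent(4, 1, 1, "2-3")]
```

A count check like this would reject the second pair but still accept
the first, so it does not remove the redundancy, it only catches the
outright contradictions.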

That all said, given our current XML schema, it is inevitable that we
will have some level of duplication of information.

Some things that are important to consider are how this interacts with
possible CPU / memory hotplug in the future, and how we will be able
to pin guest NUMA nodes to host NUMA nodes.

For the first point, it might be desirable to create a NUMA topology
which supports up to 8 logical CPUs, but only has 2 physical sockets
actually plugged in at boot time.

Also, I dread to ask whether we want to be able to represent a
multi-level NUMA topology, or just assume one level. If we want to
be able to cope with multi-level topology, can we assume the levels
are solely grouping at the socket, or will we have to consider the
possibility of NUMA *inside* a socket?

In other words, are we associating socket numbers with NUMA nodes,
or are we associating logical CPU numbers with NUMA nodes?

This is the difference between configuring something like:

  <vcpu>16</vcpu>
  <cpu>
    <topology sockets='4' cores='4' threads='1'/>
  </cpu>
  <numa>
    <node sockets='0-1' mem='0-1024'/>
    <node sockets='2-3' mem='1024-2048'/>
  </numa>

vs

  <vcpu>16</vcpu>
  <cpu>
    <topology sockets='4' cores='4' threads='1'/>
  </cpu>
  <numa>
    <node cpus='0-7'  mem='0-1024'/>
    <node cpus='8-15' mem='1024-2048'/>
  </numa>

vs

  <vcpu>16</vcpu>
  <cpu>
    <topology sockets='4' cores='4' threads='1'/>
  </cpu>
  <numa>
    <node mems='0-1024'>
      <node cpus='0-3'/>
      <node cpus='4-7'/>
    </node>
    <node mems='1024-2048'>
      <node cpus='8-11'/>
      <node cpus='12-15'/>
    </node>
  </numa>

vs

  ...more horrible examples...
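
To make the socket vs logical CPU distinction concrete: the first two
variants above describe the same guest only if logical CPUs are
numbered contiguously, socket by socket. Under that assumption (which
nothing in this thread guarantees), a socket range converts to a CPU
range as in this Python sketch. The function name is hypothetical.

```python
def sockets_to_cpus(socket_spec, cores, threads):
    """Map a socket range 'a' or 'a-b' to the logical CPU range it covers.

    Assumes CPUs are numbered contiguously, socket by socket, so socket s
    owns CPUs [s * cores * threads, (s + 1) * cores * threads - 1].
    """
    first, _, last = socket_spec.partition("-")
    lo = int(first)
    hi = int(last) if last else lo
    per_socket = cores * threads
    return "%d-%d" % (lo * per_socket, (hi + 1) * per_socket - 1)


# With cores='4' threads='1' as in the examples above:
node0 = sockets_to_cpus("0-1", 4, 1)   # "0-7"
node1 = sockets_to_cpus("2-3", 4, 1)   # "8-15"
```

The reverse direction is where the forms diverge: a CPU range such as
'0-3' out of the third variant covers one socket here, but a range
like '0-1' would split a socket and has no socket-level equivalent.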

NB, QEMU's -numa argument may well not support some of the things I
am talking about here, but we need to consider the real possibility
that QEMU's -numa arg will be extended, or replaced in the future.
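
For reference, the flat one-level form is the only one that maps
directly onto the syntax quoted at the top of the thread,
"-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]". A minimal
Python sketch of that mapping follows. The function name and the
dict-based input are hypothetical, 'mem' is passed through as a plain
size without checking units, and the list index is used as the
nodeid.

```python
def numa_args(nodes):
    """Build '-numa node,...' arguments from a list of node dicts.

    Each dict may carry 'mem' (a size) and 'cpus' (a 'cpu[-cpu]' range);
    the node's position in the list is used as its nodeid.
    """
    args = []
    for nodeid, node in enumerate(nodes):
        opts = ["node"]
        if "mem" in node:
            opts.append("mem=%s" % node["mem"])
        if "cpus" in node:
            opts.append("cpus=%s" % node["cpus"])
        opts.append("nodeid=%d" % nodeid)
        args += ["-numa", ",".join(opts)]
    return args


example = numa_args([{"cpus": "0-7", "mem": "1024"},
                     {"cpus": "8-15", "mem": "1024"}])
# ['-numa', 'node,mem=1024,cpus=0-7,nodeid=0',
#  '-numa', 'node,mem=1024,cpus=8-15,nodeid=1']
```

The socket-based and nested variants have no such direct translation,
which is why the XML should not be designed around today's -numa
syntax alone.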

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|