[Libvir] Extending libvirt to probe NUMA topology

Ryan Harper ryanh at us.ibm.com
Thu Sep 6 16:47:16 UTC 2007


* Daniel Veillard <veillard at redhat.com> [2007-09-06 08:55]:
> On Wed, Jun 13, 2007 at 10:40:40AM -0500, Ryan Harper wrote:
> > Hello all,
> > 
> > I wanted to start a discussion on how we might get libvirt to be able to
> > probe the NUMA topology of Xen and Linux (for QEMU/KVM).  In Xen, I've
> > recently posted patches for exporting topology into the [1]physinfo
> > hypercall, as well as adding a [2]hypercall to probe the Xen heap.  I
> > believe the topology and memory info is already available in Linux.
> > With these, we have enough information to be able to write some simple
> > policy above libvirt that can create guests in a NUMA-aware fashion.
> 
>   Let's restart that discussion, I would really like to see this
> implemented within the next month.

Thanks for starting this back up.

> 
> > I'd like to suggest the following for discussion:
> > 
> > (1) A function to discover topology
> > (2) A function to check available memory
> > (3) Specifying which cpus to use prior to domain start
> > 
> > Thoughts?
> 
>   Okay, following the discussions back in June and what seems available
> as APIs on various setups, I would like to suggest the following:
> 
> 1) Provide a function describing the topology as an XML instance:
> 
>    char *	virNodeGetTopology(virConnectPtr conn);
> 
> which would return an XML instance as in virConnectGetCapabilities. I
> toyed with the idea of extending virConnectGetCapabilities() to add a
> topology section in case of NUMA support at the hypervisor level, but
> it seemed to me that the two might be used at different times and that
> separating them might be a bit cleaner, though I could be convinced
> otherwise. Either way the content itself would not change much.
> I think the most important part of the call is getting the topology
> information, since the number of processors, the memory size and the
> number of NUMA cells are already available from virNodeGetInfo(). I
> suggest a format exposing the hierarchy in the XML structure, which
> will allow for more complex topologies, for example on Sun hardware:

Not having a deep libvirt background, I'm not sure I can argue one way
or another.  The topology discovery values (nr_numa_nodes, nr_cpus,
cpu_to_node) won't be changing for the lifetime of the libvirt node.

> 
> ---------------------------------
> <topology>
>   <cells num='2'>
>     <cell id='0'>
>       <cpus num='2'>
>         <cpu id='0'/>
>         <cpu id='1'/>
>       </cpus>
>       <memory size='2097152'/>
>     </cell>
>     <cell id='1'>
>       <cpus num='2'>
>         <cpu id='2'/>
>         <cpu id='3'/>
>       </cpus>
>       <memory size='2097152'/>
>     </cell>
>   </cells>
> </topology>
> ---------------------------------
> 
>   A few things to note:
>    - the <cells> element lists the top-level sibling cells
> 
>    - the <cell> element describes as children the resources available,
>      like the list of CPUs and the size of the local memory; that could
>      be extended with disk descriptions too
>      <disk dev='/dev/sdb'/>
>      and possibly other special devices (no idea what, ATM).

The only concern I have is the memory size -- I don't believe we have a
way to get at anything other than current available memory.

As far as other resources go, yes, that makes sense; I believe there is
topology information for PCI resources, though for Xen none of that is
available.
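
For what it's worth, here is a rough sketch of how a management tool
might walk that format with libxml2, assuming the element and attribute
names from the example above (virNodeGetTopology() itself would supply
the string, so both the call and the helper name here are only
illustrative):

---------------------------------
/*
 * Sketch only: walk the proposed <topology> document with libxml2.
 * virNodeGetTopology() is the proposed entry point, not an existing
 * libvirt call; element and attribute names follow the example above.
 */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

static void
print_topology(const char *xml)
{
    xmlDocPtr doc;
    xmlXPathContextPtr ctxt;
    xmlXPathObjectPtr obj;
    int i;

    doc = xmlReadMemory(xml, (int) strlen(xml), "topology.xml", NULL, 0);
    if (doc == NULL)
        return;
    ctxt = xmlXPathNewContext(doc);
    if (ctxt == NULL) {
        xmlFreeDoc(doc);
        return;
    }
    obj = xmlXPathEvalExpression((const xmlChar *) "/topology/cells/cell",
                                 ctxt);

    if (obj != NULL && obj->nodesetval != NULL) {
        for (i = 0; i < obj->nodesetval->nodeNr; i++) {
            xmlNodePtr cell = obj->nodesetval->nodeTab[i];
            xmlNodePtr child;
            xmlChar *id = xmlGetProp(cell, (const xmlChar *) "id");

            printf("cell %s:", id ? (char *) id : "?");
            xmlFree(id);

            /* children are <cpus> and <memory> per the proposed format */
            for (child = cell->children; child != NULL; child = child->next) {
                if (child->type == XML_ELEMENT_NODE &&
                    xmlStrEqual(child->name, (const xmlChar *) "memory")) {
                    xmlChar *size = xmlGetProp(child,
                                               (const xmlChar *) "size");
                    printf(" memory size=%s", size ? (char *) size : "?");
                    xmlFree(size);
                }
            }
            printf("\n");
        }
    }

    xmlXPathFreeObject(obj);
    xmlXPathFreeContext(ctxt);
    xmlFreeDoc(doc);
}
---------------------------------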

> 
>    - in case of a deeper hierarchical topology one may need to be able to
>      name sub-cells, and the format could be extended, for example, as
>      <cells num='2'>
>        <cells num='2'>
>          <cell id='1'>
>            ...
>          </cell>
>          <cell id='2'>
>            ...
>          </cell>
>        </cells>
>        <cells num='2'>
>          <cell id='3'>
>            ...
>          </cell>
>          <cell id='4'>
>            ...
>          </cell>
>        </cells>
>      </cells>
>      But that can be discussed/changed when the need arises :-)

Yep.

> 
>    - topology may later be extended with other child elements,
>      for example to expand the description with memory access costs
>      from cell to cell. I don't know what the best way is; mapping
>      an array in XML is usually not very nice.
> 
>    - the memory size is indicated in an attribute (instead of as the
>      content, as we do on domain dumps) to preserve extensibility, in
>      case we need to express more structure there (memory banks for
>      example). We could also add a free='xxxxx' attribute indicating the
>      amount available there, but as you suggested it's probably better
>      to provide a separate call for this.
>      
> I would expect that function to be available even for ReadOnly connections
> since it's descriptive only, which means it would need to be added to the
> set of proxy-supported calls. The call will of course be added to the driver
> block. The implementation on recent Xen could use the hypercall. For KVM I'm
> wondering a bit; I don't have a NUMA box around (but can probably find one).
> I assume we could either use libnuma if found at compile time or
> get the information from /proc. On Solaris there is a specific library, as
> Dan explained in the thread. I think coming first with Xen-only support
> would be fine; other hypervisors or platforms can be added later.

Agreed.
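
For the KVM/QEMU side, a rough idea of what pulling the cells out of
Linux sysfs could look like, assuming a /sys/devices/system/node layout
(the cpulist file and the parsing here are illustrative only; a
libnuma-based variant would be the compile-time alternative you
mention):

---------------------------------
/*
 * Sketch: enumerate NUMA cells from Linux sysfs for the non-Xen case.
 * Assumes a /sys/devices/system/node/nodeN/cpulist layout.
 */
#include <stdio.h>
#include <dirent.h>

static void
list_cells(void)
{
    DIR *dir = opendir("/sys/devices/system/node");
    struct dirent *ent;

    if (dir == NULL)
        return;

    while ((ent = readdir(dir)) != NULL) {
        unsigned int node;
        char path[256], cpulist[256];
        FILE *fp;

        /* only entries named nodeN are cells */
        if (sscanf(ent->d_name, "node%u", &node) != 1)
            continue;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%u/cpulist", node);
        fp = fopen(path, "r");
        if (fp == NULL)
            continue;
        if (fgets(cpulist, sizeof(cpulist), fp) != NULL)
            printf("cell %u: cpus %s", node, cpulist);
        fclose(fp);
    }
    closedir(dir);
}
---------------------------------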

> 
> 2) Function to get the free memory of a given cell:
> 
>    unsigned long virNodeGetCellFreeMemory(virConnectPtr conn, int cell);
> 
> that's relatively simple and would match the request from the initial mail,
> but I'm wondering a bit. If the program tries to do a best placement it
> will usually run that request for a number of cells, no? Maybe a call
> returning the memory amounts for a range of cells would be more appropriate.

The use-case I have in mind for virt-manager would obtain the current
free memory on all cells, from which it can then choose according to
whatever algorithm.  Getting the free memory from all cells within a
node would be a good call.  The Xen hypercall for querying this
information is done on a per-cell basis.
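
Either way works for that use case; here is a rough sketch of what the
per-cell variant looks like from the caller's side (virNodeGetCellFreeMemory()
is still only the proposed call, and the cell count is assumed to come
from the topology XML or virNodeGetInfo()):

---------------------------------
/*
 * Sketch: pick the cell with the most free memory using the proposed
 * per-cell call.  virNodeGetCellFreeMemory() does not exist yet; the
 * cell count would come from the topology XML or virNodeGetInfo().
 */
#include <libvirt/libvirt.h>

static int
best_cell(virConnectPtr conn, int ncells)
{
    int cell, best = -1;
    unsigned long mem, best_mem = 0;

    for (cell = 0; cell < ncells; cell++) {
        mem = virNodeGetCellFreeMemory(conn, cell);   /* proposed API */
        if (mem > best_mem) {
            best_mem = mem;
            best = cell;
        }
    }
    return best;
}
---------------------------------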

> 
> 3) Adding Cell/CPU placement informations to a domain description
> 
> That's where I think things start to get a bit messy. It's not that
> adding
>    <cell>1</cell>
> or
>    <cpus>
>      <pin vcpu='0' cpulist='2,3'/>
>      <pin vcpu='1' cpulist='3'/>
>    </cpus>
> alongside
>    <vcpu>2</vcpu>
> 
> would be hard, it's rather what to do if the request can't be satisfied.
> Basically I still think that the hypervisor is in a better position to
> do the placement, and putting the requirement here breaks:
>    - the virtualization: the more you rely on physical
>      hardware properties, the more you lose the benefits of virtualizing
>    - if CPUs 2 and 3 are not available/full, or if the topology changed since
>      the domain was saved, the domain may just not be able to run, or run
>      worse than if nothing had been specified.
> CPU pinning at runtime means a dynamic change; it's adaptability and makes
> a lot of sense. But saving those dynamic instant values in the process
> description sounds a bit wrong to me, because the context which led to them
> may have changed since (or may just not make sense anymore, like after a
> migration or hardware change).
> Anyway, I guess that's needed. I would tend to go the simplest way and
> just allow specifying the vcpu pinning in a very explicit way, hence
> mapping directly to the kind of capabilities already available in
> virDomainPinVcpu(), with a similar cpumap syntax as used in the virsh
> vcpupin command (i.e. a comma-separated list of CPU numbers).

With the minimal NUMA support that is available in Xen today, the best
we can do is keep guests from crossing node boundaries, that is, ensure
that the cpus the guest uses have local memory allocated.  The current
mechanism for making this happen in Xen is to supply a cpus affinity
list in the domain config file.  This ensures that the memory is local to
those cpus and that the hypervisor does not migrate the guest vcpus to
cpus on non-local cells.

While I agree that the hypervisor is in a better position to make
those choices, it would end up embedding placement policy in the
hypervisor.  Xen already does have a placement policy for cpus, but that
doesn't matter since we can re-pin vcpus before starting the domain.

What I'm looking for here is a way we can ensure that the guest config
can include a cpus list.  libvirt doesn't have to generate this list; I
would expect virt-manager or some other tool which fetched the topology
and free memory information to determine a cpulist and then "add" a
cpulist property to the domain config.

> 
>    <cpus>
>      <pin vcpu='0' cpulist='2,3'/>
>      <pin vcpu='1' cpulist='3'/>
>    </cpus>

I like this best.
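
And since that format maps directly onto the existing virDomainPinVcpu()
call, something along these lines (just a sketch; the cpulist parsing
and the helper name are illustrative, ncpus being the number of physical
cpus on the node) could apply the parsed <pin> entries before the domain
starts running:

---------------------------------
/*
 * Sketch: apply a <pin vcpu='0' cpulist='2,3'/> entry through the
 * existing virDomainPinVcpu() call.  The cpulist parsing here is only
 * illustrative; ncpus is the number of physical cpus on the node.
 */
#include <stdlib.h>
#include <string.h>
#include <libvirt/libvirt.h>

static int
pin_vcpu(virDomainPtr dom, unsigned int vcpu, const char *cpulist, int ncpus)
{
    int maplen = (ncpus + 7) / 8;              /* one bit per physical cpu */
    unsigned char *cpumap = calloc(maplen, 1);
    char *list = strdup(cpulist);
    char *tok, *saveptr = NULL;
    int ret = -1;

    if (cpumap == NULL || list == NULL)
        goto out;

    /* "2,3" -> set bits 2 and 3 in the map */
    for (tok = strtok_r(list, ",", &saveptr); tok != NULL;
         tok = strtok_r(NULL, ",", &saveptr)) {
        int cpu = atoi(tok);
        if (cpu >= 0 && cpu < ncpus)
            cpumap[cpu / 8] |= 1 << (cpu % 8);
    }

    ret = virDomainPinVcpu(dom, vcpu, cpumap, maplen);

 out:
    free(cpumap);
    free(list);
    return ret;
}
---------------------------------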


> 
> If everyone agrees with those suggestions, then I guess we can try to get
> a first Xen-3.1-based implementation.
> 
> Daniel
> 
> -- 
> Red Hat Virtualization group http://redhat.com/virtualization/
> Daniel Veillard      | virtualization library  http://libvirt.org/
> veillard at redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
> http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh at us.ibm.com