[libvirt] RFC: CPU counting in qemu driver

Daniel P. Berrange berrange at redhat.com
Mon Nov 22 13:52:41 UTC 2010


On Thu, Nov 18, 2010 at 06:51:20PM +0100, Jiri Denemark wrote:
> Hi all,
> 
> libvirt's qemu driver doesn't follow the semantics of the CPU-related
> counters in the nodeinfo structure, which is:
> 
>     nodes   : the number of NUMA cell, 1 for uniform mem access
>     sockets : number of CPU socket per node
>     cores   : number of core per socket
>     threads : number of threads per core
> 
> The qemu driver ignores the "per node" part of the sockets semantics and
> only gives the total number of sockets found on the host. That actually
> makes more sense, but we have to fix it since it doesn't follow the
> documented semantics of the public API. That is, we would do something like
> the following at the end of linuxNodeInfoCPUPopulate():
> 
>     nodeinfo->sockets /= nodeinfo->nodes;
> 
> The problem is that NUMA topology is independent of CPU topology and there
> are systems for which nodeinfo->sockets % nodeinfo->nodes != 0. An example
> is the following NUMA topology of a system with 4 CPU sockets:
> 
>     node0  CPUs: 0-5
>            total memory: 8252920
>     node1  CPUs: 6-11
>            total memory: 16547840
>     node2  CPUs: 12-17
>            total memory: 8273920
>     node3  CPUs: 18-23
>            total memory: 16547840
>     node4  CPUs: 24-29
>            total memory: 8273920
>     node5  CPUs: 30-35
>            total memory: 16547840
>     node6  CPUs: 36-41
>            total memory: 8273920
>     node7  CPUs: 42-47
>            total memory: 16547840
> 
> which shows that the cores are actually mapped via the AMD intra-socket
> interconnects. Note that this funky topology was verified to be correct, so
> it's not just a kernel bug that would result in the wrong topology being
> reported.

So you are saying that 1 physical CPU socket can be associated with
2 NUMA nodes at the same time?  If you have only 4 sockets here, then
there are 12 cores per socket, and each NUMA node covers 6 cores from
a single socket?

Can you provide the full 'numactl --hardware' output? I guess we're
facing a 2-level NUMA hierarchy, where the first level is done inside
the socket, and the second level is between sockets.

What does Xen / 'xm info' report on such a host?
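
To make the failure mode concrete, here is a standalone sketch (not libvirt
code; I'm assuming threads is 1 on this box) showing why the plain
"sockets /= nodes" normalisation falls over on such a host:

    /* 4 sockets spread across 8 NUMA nodes: the naive per-node
     * normalisation truncates to zero sockets per node. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int nodes = 8, sockets = 4, cores = 12, threads = 1;

        printf("sockets %% nodes = %u\n", sockets % nodes);   /* 4, not 0 */
        printf("sockets / nodes = %u\n", sockets / nodes);    /* 0        */

        /* ... so the nodes * sockets * cores * threads product would
         * become 0 instead of the real 48 CPUs. */
        printf("max cpus = %u\n",
               nodes * (sockets / nodes) * cores * threads);
        return 0;
    }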

> So the suggested calculation wouldn't work on such systems, and we cannot
> really follow the API semantics since they don't hold in this case.
> 
> My suggestion is to use the following code in linuxNodeInfoCPUPopulate():
> 
>     if (nodeinfo->sockets % nodeinfo->nodes == 0)
>         nodeinfo->sockets /= nodeinfo->nodes;
>     else
>         nodeinfo->nodes = 1;
> 
> That is, we would lie about the number of NUMA nodes on funky systems. If
> nodeinfo->nodes is greater than 1, then applications can rely on it being
> correct. If it's 1, applications that care about NUMA topology should
> consult /capabilities/host/topology/cells of the capabilities XML to check
> the number of NUMA nodes in a reliable way, which I guess such applications
> would do anyway.
> 
> However, if you have a better idea for fixing the issue while staying more
> compatible with the current semantics, don't hesitate to share it.

In your example it sounds like we could alternatively lie about the number
of cores per socket. E.g., instead of reporting 0.5 sockets per node with
12 cores each, report 1 socket per node with 6 cores each. Thus each of the
reported sockets would once again be associated with only 1 NUMA node at a
time.
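
A rough, untested sketch of that normalisation (field names as in the
snippet you quoted; the final nodes = 1 fallback is your original proposal):

    if (nodeinfo->sockets % nodeinfo->nodes == 0) {
        nodeinfo->sockets /= nodeinfo->nodes;
    } else if ((nodeinfo->sockets * nodeinfo->cores) % nodeinfo->nodes == 0) {
        /* e.g. 4 sockets x 12 cores over 8 nodes
         *      -> report 1 socket x 6 cores per node */
        nodeinfo->cores = nodeinfo->sockets * nodeinfo->cores / nodeinfo->nodes;
        nodeinfo->sockets = 1;
    } else {
        /* give up and pretend the host has uniform memory access */
        nodeinfo->nodes = 1;
    }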

> Note that we have the VIR_NODEINFO_MAXCPUS macro in libvirt.h which
> computes the maximum number of CPUs as (nodes * sockets * cores * threads)
> and we need to keep this working.
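
FWIW, on the 48-CPU host above (threads assumed to be 1) both variants keep
that product intact:

    nodes * sockets * cores * threads
    nodes forced to 1:             1 * 4 * 12 * 1 = 48
    1 socket, 6 cores per node:    8 * 1 *  6 * 1 = 48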


Daniel



