[libvirt] [RFC] phi support in libvirt

Feng, Shaohe shaohe.feng at intel.com
Wed Dec 21 04:51:29 UTC 2016


Thanks, Dolpher.

Reply inline.


On December 21, 2016 at 11:56, Du, Dolpher wrote:
> Shaohe was dropped from the loop, adding him back.
>
>> -----Original Message-----
>> From: He Chen [mailto:he.chen at linux.intel.com]
>> Sent: Friday, December 9, 2016 3:46 PM
>> To: Daniel P. Berrange <berrange at redhat.com>
>> Cc: libvir-list at redhat.com; Du, Dolpher <dolpher.du at intel.com>; Zyskowski,
>> Robert <robert.zyskowski at intel.com>; Daniluk, Lukasz
>> <lukasz.daniluk at intel.com>; Zang, Rui <rui.zang at intel.com>;
>> jdenemar at redhat.com
>> Subject: Re: [libvirt] [RFC] phi support in libvirt
>>
>>> On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
>>>> Hi all:
>>>>
>>>> As we know, Intel® Xeon Phi targets high-performance computing and
>>>> other parallel workloads.
>>>> Now that QEMU supports Phi virtualization, it is time for libvirt to
>>>> support Phi.
>>> Can you provide pointer to the relevant QEMU changes.
>>>
>> Xeon Phi Knights Landing (KNL) has 2 primary hardware features: one
>> is up to 288 CPUs, which needs patches that we are pushing, and
>> the other is Multi-Channel DRAM (MCDRAM), which does not need any changes
>> currently.
>>
>> Let me introduce MCDRAM a bit more: MCDRAM is on-package high-bandwidth
>> memory (~500GB/s).
>>
>> On the KNL platform, hardware exposes MCDRAM as a separate, CPU-less and
>> remote NUMA node to the OS so that MCDRAM will not be allocated by default
>> (since the MCDRAM node has no CPUs, every CPU regards the MCDRAM node as a
>> remote node). In this way, MCDRAM can be reserved for certain specific
>> applications.
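As an illustration, on the host an application can be placed onto the MCDRAM
node explicitly with numactl; the node number and program name here are just
placeholders for whatever the host actually reports:

    numactl --membind=1 ./hpc_app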
>>
>>>> Different from a traditional x86 server, Phi has a special NUMA
>>>> node with Multi-Channel DRAM (MCDRAM), but without any CPUs.
>>>>
>>>> Now libvirt requires a nonempty 'cpus' attribute for each NUMA cell, such as:
>>>> <numa>
>>>>    <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
>>>>    <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
>>>> </numa>
>>>>
>>>> In order to support Phi virtualization, libvirt needs to allow a NUMA
>>>> cell definition without the 'cpus' attribute.
>>>>
>>>> Such as:
>>>> <numa>
>>>>    <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
>>>>    <cell id='1' memory='16' unit='GiB'/>
>>>> </numa>
>>>>
>>>> When a cell has no 'cpus', QEMU will allocate its memory from MCDRAM by
>> default instead of DDR.
>>> There are separate concepts at play which your description here is mixing up.
>>>
>>> First is the question of whether the guest NUMA node can be created with
>>> only RAM or CPUs, or a mix of both.
>>> Second is the question of what kind of host RAM (MCDRAM vs DDR) is used
>>> as the backing store for the guest.
>> The guest NUMA node should be created with memory only (to keep the same
>> layout as the host's), and more importantly, that memory should be bound to
>> (come from) the host MCDRAM node.
So I suggest libvirt distinguish the MCDRAM.

And the MCDRAM NUMA config could be as follows, adding an "mcdram" attribute
to the "cell" element:
<numa>
   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
   <cell id='1' mcdram='16' unit='GiB'/>
</numa>

>>
>>> These are separate configuration items which don't need to be conflated in
>>> libvirt, i.e. we should be able to create a guest with a node containing only
>>> memory, and back that by DDR on the host. Conversely, we should be able to
>>> create a guest with a node containing memory + CPUs and back that by
>>> MCDRAM on the host (even if that means the vCPUs will end up on a different
>>> host node from its RAM).
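For the host-side binding, libvirt's existing <numatune>/<memnode> element
should already be able to express "back guest cell N by host node M"; a minimal
sketch, assuming the MCDRAM happens to be host node 1:

<numatune>
   <memnode cellid='1' mode='strict' nodeset='1'/>
</numatune>

So the guest cell layout and the MCDRAM vs DDR backing can indeed be
configured separately.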
>>> On the first point, there still appears to be some brokenness in either QEMU or
>>> Linux wrt configuration of virtual NUMA where either cpus or memory are
>>> absent from nodes.
>>> eg if I launch QEMU with
>>>
>>>      -numa node,nodeid=0,cpus=0-3,mem=512
>>>      -numa node,nodeid=1,mem=512
>>>      -numa node,nodeid=2,cpus=4-7
>>>      -numa node,nodeid=3,mem=512
>>>      -numa node,nodeid=4,mem=512
>>>      -numa node,nodeid=5,cpus=8-11
>>>      -numa node,nodeid=6,mem=1024
>>>      -numa node,nodeid=7,cpus=12-15,mem=1024
>>>
>>> then the guest reports
>>>
>>>    # numactl --hardware
>>>    available: 6 nodes (0,3-7)
>>>    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
>>>    node 0 size: 487 MB
>>>    node 0 free: 230 MB
>>>    node 3 cpus: 12 13 14 15
>>>    node 3 size: 1006 MB
>>>    node 3 free: 764 MB
>>>    node 4 cpus:
>>>    node 4 size: 503 MB
>>>    node 4 free: 498 MB
>>>    node 5 cpus:
>>>    node 5 size: 503 MB
>>>    node 5 free: 499 MB
>>>    node 6 cpus:
>>>    node 6 size: 503 MB
>>>    node 6 free: 498 MB
>>>    node 7 cpus:
>>>    node 7 size: 943 MB
>>>    node 7 free: 939 MB
>>>
>>> so it's pushed all the CPUs from nodes without RAM into the first node, and
>>> moved CPUs from the 7th node into the 3rd node.
It seems to be a bug.

He Chen, do you know how QEMU generates the NUMA nodes for the guest?
Can QEMU do a sanity check of the host physical NUMA topology and generate a
smart guest NUMA topology?

>> I am not sure why this happens, but basically, I launch QEMU like:
>>
>> -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
>> -numa node,nodeid=0,cpus=0-14,cpus=60-74,cpus=120-134,cpus=180-194,memdev=node0 \
>>
>> -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
>> -numa node,nodeid=1,cpus=15-29,cpus=75-89,cpus=135-149,cpus=195-209,memdev=node1 \
>>
>> -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
>> -numa node,nodeid=2,cpus=30-44,cpus=90-104,cpus=150-164,cpus=210-224,memdev=node2 \
>>
>> -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
>> -numa node,nodeid=3,cpus=45-59,cpus=105-119,cpus=165-179,cpus=225-239,memdev=node3 \
>>
>> -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=4,policy=bind,id=node4 \
>> -numa node,nodeid=4,memdev=node4 \
>>
>> -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=5,policy=bind,id=node5 \
>> -numa node,nodeid=5,memdev=node5 \
>>
>> -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=6,policy=bind,id=node6 \
>> -numa node,nodeid=6,memdev=node6 \
>>
>> -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=7,policy=bind,id=node7 \
>> -numa node,nodeid=7,memdev=node7 \
>>
>> (Please ignore the complex cpus parameters...)
>> As you can see, the pair of `-object memory-backend-ram` and `-numa` is
>> used to specify where the memory of the guest NUMA node is allocated
>> from. It works well for me :-)

When a "mcdram" in "cell", we banding it to the Physical numa by specify 
the "object"

<numa>
   <cell id='1'  mcdram='16' unit='GiB'/> </numa>
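Internally, such a cell could then be translated into the same kind of QEMU
arguments He Chen shows above; a rough sketch, assuming the MCDRAM was
detected as host node 4:

-object memory-backend-ram,size=16G,prealloc=yes,host-nodes=4,policy=bind,id=node1 \
-numa node,nodeid=1,memdev=node1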
>>
>>> So before considering MCDRAM / Phi, we need to fix this more basic NUMA
>>> topology setup.
>>>> Now here I'd like to discuss these questions:
>>>> 1. This feature is only for Phi at present; we
>>>>     will check for the Phi platform, where a NUMA node
>>>>     without CPUs indicates an MCDRAM node.
>>> We should not assume such semantics - it is a concept that is specific to
>>> particular Intel x86_64 CPUs. We need to consider that other architectures
>>> may have nodes without CPUs that are backed by normal DDR.
>>> IOW, we should be explicit about the presence of MCDRAM in the host.
>>>
>> Agreed, but for KNL, that is how we detect MCDRAM on the host:
>> 1. detect that the CPU family is Xeon Phi x200 (which means KNL)
>> 2. enumerate all NUMA nodes and regard the nodes that contain memory
>> only as MCDRAM nodes.
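As an illustration, step 2 can be done with the same numactl output quoted
earlier in this thread: any node whose "cpus:" line is empty is a candidate
MCDRAM node (the awk expression below is just a sketch):

    numactl --hardware | awk '$3 == "cpus:" && NF == 3 {print "node " $2 " has no CPUs"}'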


When a "mcdram" in "cell", we detect the MCDRAM, do some check and
  banding it to the Physical numa

<numa>
   <cell id='1'  mcdram='16' unit='GiB'/> </numa>

>>
>> ...
>>
>> Thanks,
>> -He




