[libvirt] [RFC 1/3] xml: introduce more config elements for NVDIMM memory

Zhong, Luyao luyao.zhong at intel.com
Fri Nov 16 16:21:54 UTC 2018



On 11/16/2018 11:35 PM, Zhong, Luyao wrote:

I maybe misunderstand some comments. so sorry for that. I have updated 
my reply.
> 
> 
> On 11/8/2018 5:09 AM, John Ferlan wrote:
>>
>>
>> On 10/16/18 10:21 PM, Luyao Zhong wrote:
>>> In order to align with QEMU ,four more parameters about NVDIMM will
>>> be introduced into Libvirt xml.
>>>
>>> 1.alignsize
>>> The 'alignsize' option allows users to specify the proper alignment. 
>>> When
>>> mmap(2) the backend files, QEMU uses the host page size by default as
>>> the alignment of mapping address. However, some backends may require
>>> alignments different from the pagesize. For example, mmap a device DAX
>>> on NVDIMM maybe 2M-aligned.
>>>
>>> 2.pmem
>>> The 'pmem' option allows users to specify whether the backend storage of
>>> memory-backend-file is a real persistent memory that can be accessed in
>>> SNIA NVM Programming Model. If the vNVDIMM backend is in host persistent
>>> memory , and 'pmem' is 'on' and QEMU is built with libpmem, qemu will
>>> guarantee the persistence of its own writes to the vNVDIMM backend.
>>>
>>> 3.persistence
>>> The 'persistence' option allows users to set platform-supported features
>>> about NVDIMM data persistence of a guest. It has two values, 'mem-ctrl'
>>> means the platform supports flushing dirty data from the memory 
>>> controller
>>> to the NVDIMMs in the event of power loss, 'cpu' means The platform 
>>> supports
>>> flushing dirty data from the CPU cache to the NVDIMMs in the event of 
>>> power
>>> loss, which contains what 'mem-ctrl' means.
>>>
>>> 4.unarmed
>>> The 'unarmed' option allows users to mark vNVDIMM read-only. Only one 
>>> type
>>> of vNVDIMM backends can guarantee the guest write persistence, which 
>>> is the
>>> device DAX on the real NVDIMM device (e.g., /dev/dax0.0). When using 
>>> other
>>> types of backends, it's suggested to set 'unarmed' option to 'on', so 
>>> the
>>> guest Linux NVDIMM driver will mark such vNVDIMM device as read-only.
>>>
>>
>> caveat: I didn't research too deeply into each of these options, but
>> I'll give you some feedback from the aspect of how you should formulate
>> your patches.
>>
>> Rather than essentially replicate what was added in formatdomain provide
>> the examples here and a summary of what the proposed XML would "look
>> like", e.g.
>>
>> Since 'alignsize' and 'pmem' seem to be related specifically to memory
>> mapped files:
>>
>>    <memory model='nvdimm'>
>>      <source>
>>        <path>/tmp/nvdimm</path>
>>        <alignsize unit='KiB'>2048</alignsize>
>>        <pmem>on</pmem>
>>      </source>
>>      ...
>>    </memory>
>>
> Got it.
> 
>> The <alignsize> would seem to similar in description to the existing
>> <pagesize> - although yes for a different model type.....
>>
> No, the 'alignsize' is totally another conception.It means our data 
> structure must aligned with a fixed size.
> 
>> The <pmem> seems to have two states "on" or "off", thus having just
>> <pmem/> similar to perhaps how nosharepages is handled.
>>
> The 'pmem' has nothing to do with the 'nosharepages'. Actually, the 
> 'nvdimm' model can only support 'shared' access mode now. The 'pmem' 
> just tell the QEMU if the backend file is a real persistence memory 
> device, so QEMU will know if itself should guarantee the write persistent.
> 
Ignore the reply above.I will check nosharepages usage first.

>> Since they're both memory mapped related changes, keep them together
>> from conf, xml, rng, etc. Then the "next" patch would be to do the qemu
>> and capability changes. With the last being any virsh changes to allow
>> modification on the command line (if possible).
>>
> Got it.
> 
>> The "persistence" seems to be less of a <memory> option and more of a
>> <machine> option, true/false?  It would thus need to be separated from
>> the others. Still similar to the first one, separate patch for conf,
>> xml, rng, html... Then patch for qemu...
>>
> It really show the platform features, but it's memory-related. So my 
> opinion is keep this along with other options in <source>, which cover 
> all configurations about backend NVDIMM on host.
> 
Sorry for misunderstanding your comment. Thank you very much for point 
that. It's really a little bit big problem. I think I need redesign this 
'persistence' part. My opinion is kicking it out and put it into another 
patch set.
>> The "unarmed" is a <target> option since it's being added in patch2
>> after node and label are processed, so it should be separate from the
>> <source> and <machine> option patches.  If it's just going to have 2
>> states, then it doesn't need <unarmed>on</unarmed> - instead it could
>> just be <unarmed/>.
>>
> Got it. My opinion is keep it in <target>, which cover all 
> configurations about guest vNVDIMM.
> 
>>> For more details, see
>>> https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt
>>
>> Not a commit message worthy...  Place it under the ---...
>>
>>>
>>> Signed-off-by: Zhong,Luyao <luyao.zhong at intel.com>
>>> ---
>>>   docs/formatdomain.html.in     | 98 
>>> ++++++++++++++++++++++++++++++++++++-------
>>>   docs/schemas/domaincommon.rng | 31 ++++++++++++--
>>>   2 files changed, 111 insertions(+), 18 deletions(-)
>>>
>>
>> As noted above, but perhaps difficult to determine - the html.in, .rng,
>> domain_conf, xml2xml tests, etc. should be in one patch.  Then another
>> patch for the qemu, capabilities, xml2argv tests, etc. in the next
>> patch. Then if you're going to add virsh options a separate patch for
>> that (although probably doesn't apply here).
>>
> Got it.
> 
>> After each patch, running make check syntax-check needs to pass.
>>
> Got it.
> 
>>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>>> index 8189959..9dec984 100644
>>> --- a/docs/formatdomain.html.in
>>> +++ b/docs/formatdomain.html.in
>>> @@ -8258,6 +8258,8 @@ qemu-kvm -net nic,model=? /dev/null
>>>     <memory model='nvdimm'>
>>>       <source>
>>>         <path>/tmp/nvdimm</path>
>>> +      <alignsize unit='KiB'>2048</alignsize>
>>> +      <pmem>on</pmem>
>>>       </source>
>>>       <target>
>>>         <size unit='KiB'>524288</size>
>>> @@ -8265,6 +8267,8 @@ qemu-kvm -net nic,model=? /dev/null
>>>         <label>
>>>           <size unit='KiB'>128</size>
>>>         </label>
>>> +      <persistence>cpu</persistence>
>>> +      <unarmed>on</unarmed>
>>>       </target>
>>>     </memory>
>>>   </devices>
>>> @@ -8339,10 +8343,38 @@ qemu-kvm -net nic,model=? /dev/null
>>>           </dl>
>>>           <p>
>>> -          For model <code>nvdimm</code> this element is mandatory 
>>> and has a
>>> -          single child element <code>path</code> that represents a path
>>> -          in the host that backs the nvdimm module in the guest.
>>> +          For model <code>nvdimm</code> this element is mandatory. The
>>> +          mandatory child element <code>path</code> represents a 
>>> path in
>>> +          the host that backs the nvdimm module in the guest. If
>>> +          <code>nvdimm</code> is provided, then the following optional
>>> +          elements can be provided as well:
>>
>> You'd need to add a since tags here for the "new" work...e.g.:
>>
>> <span class='since'>since 4.10.0</span>
>>
>> Whenever something new is added for a specific version, that's what's
>> necessary.
> 
> Got it.
> 
>>
>>>           </p>
>>> +
>>> +        <dl>
>>> +          <dt><code>alignsize</code></dt>
>>> +          <dd>
>>> +            <p>
>>> +              This element can be used to specify a proper alignment.
>>> +              When mmap(2) the backend files, QEMU uses the host page
>>> +              size by default as the alignment of mapping address. 
>>> However,
>>> +              some backends may require alignments different from 
>>> the page.
>>> +              For example, mmap a device DAX on NVDIMM maybe 
>>> 2M-aligned.
>>> +            </p>
>>> +          </dd>
>>> +
>>> +          <dt><code>pmem</code></dt>
>>> +          <dd>
>>> +            <p>
>>> +              This element can be used to specify whether the 
>>> backend storage
>>> +              of memory-backend-file is a real persistent memory 
>>> that can be
>>> +              accessed in SNIA NVM Programming Model. If the 
>>> backend  is a real
>>                                                                         ^
>> there's two spaces between backend and is
>>
> Got it. It's my mistake.
> 
>>> +              persistence memory and <code>pmem</code> is set to 
>>> 'on', QEMU will
>>> +              guarantee the persistence of its own writes to the 
>>> vNVDIMM backend,
>>> +              but which can work well with libpmem support (QEMU 
>>> configured with
>>> +              --enable-libpmem).
>>> +            </p>
>>> +          </dd>
>>> +        </dl>
>>>         </dd>
>>
>> The probably should be reworded in syntax more suitable for libvirt's
>> usage without including QEMU details.
>>
> Got it. I will reword this.
> 
>> The pmem one in particular may be hard to describe... The fact that it
>> doesn't work without libpmem does raise a couple of warning flags. I
>> would just say that if <pmem/> is there, then it's on if it's not, then
>> it's off. Internally that then isn't a tristate, it just an on/off.
>>
>>>         <dt><code>target</code></dt>
>>> @@ -8361,19 +8393,55 @@ qemu-kvm -net nic,model=? /dev/null
>>>             NUMA nodes configured.
>>>           </p>
>>>           <p>
>>> -          For NVDIMM type devices one can optionally use
>>> -          <code>label</code> and its subelement <code>size</code>
>>> -          to configure the size of namespaces label storage
>>> -          within the NVDIMM module. The <code>size</code> element
>>> -          has usual meaning described
>>> -          <a href="#elementsMemoryAllocation">here</a>.
>>> -          For QEMU domains the following restrictions apply:
>>> +          Besides, the following optional elements can be provided 
>>> as well for
>>> +          NVDIMM type devices:
>>>           </p>
>>> -        <ol>
>>> -          <li>the minimum label size is 128KiB,</li>
>>> -          <li>the remaining size (total-size - label-size) has to be 
>>> aligned to
>>> -            4KiB</li>
>>> -        </ol>
>>> +
>>> +        <dl>
>>> +          <dt><code>label</code></dt>
>>> +          <dd>
>>> +            <p>
>>> +              For NVDIMM type devices one can optionally use
>>> +              <code>label</code> and its subelement <code>size</code>
>>> +              to configure the size of namespaces label storage
>>> +              within the NVDIMM module. The <code>size</code> element
>>> +              has usual meaning described
>>> +              <a href="#elementsMemoryAllocation">here</a>.
>>> +              For QEMU domains the following restrictions apply:
>>> +            </p>
>>> +            <ol>
>>> +              <li>the minimum label size is 128KiB,</li>
>>> +              <li>the remaining size (total-size - label-size) will 
>>> be aligned to
>>> +                4KiB as default.</li>
>>> +            </ol>
>>> +          </dd>
>>> +
>>> +          <dt><code>persistence</code></dt>
>>> +          <dd>
>>> +            <p>
>>> +              The <code>persistence</code> element can be set to 
>>> "mem-ctl" or "cpu",
>>
>> "mem-ctrl"
>>
>>> +              which indicate platform-supported features about 
>>> NVDIMM data persistence.
>>> +              'mem-ctrl' means the platform supports flushing dirty 
>>> data from the memory
>>> +              controller to the NVDIMMs in the event of power loss, 
>>> 'cpu' means The platform
>>
>> means the platform
>>
>>> +              supports flushing dirty data from the CPU cache to the 
>>> NVDIMMs in the event
>>> +              of power loss, which contains what 'mem-ctrl' means.
>>> +              ACPI 6.2 Errata A added support for a new Platform 
>>> Capabilities Structure
>>> +              in NFIT, so the guest ACPI NFIT will be filled in 
>>> according to the persistence
>>> +              option.
>>> +            </p>
>>> +          </dd>
>>
>> That's a hard read - someone has to know about ACPI 6.2 Errata A and
>> whatever it is and where to find it?
>>
> Sorry, maybe I put too much details here, there is no need to have 
> knowledge about ACPI. I will delete the ACPI things.
> 
>> Also, I'm not sure this belongs on the NVDIMM line... Allowing it means
>> more than one could use this and they could use separate values, each
>> then being placed on the command line and then you don't know what goes
>> with what.
>>
>> That is what happens if we have
>>
>>    <memory model='nvdimm'>
>>    ...
>>      <target>
>>      ...
>>        <persistence>mem-ctrl</persistence>
>>      ...
>>    </memory>
>>    <memory model='nvdimm'>
>>    ...
>>      <target>
>>      ...
>>        <persistence>cpu</persistence>
>>      ...
>>    </memory>
>>
> I have not considered this situation, I need do more tests and find a 
> solution.
> 
>>> +
>>> +          <dt><code>unarmed</code></dt>
>>> +          <dd>
>>> +            <p>
>>> +              The <code>unarmed</code> element can be used to mark 
>>> vNVDIMM read-only
>>> +              through setting unarmed flag in guest NFIT.Currently 
>>> the only vNVDIMM backend
>>
>> What is NFIT?
>>
> 'NVDIMM Firmware Interface Table' defined in ACPI. I will delete this 
> for its hard to understand and not very necessary.
> 
>> Add a space before Currently
>>
> Got it.
> 
>>> +              can guarantee the guest write persistence is device 
>>> DAX on Linux, so it's
>>
>> What is DAX?
>>
> 'DAX' means 'direct access'.
> 
>>> +              suggested to set <code>unarmed</code> to 'on' when 
>>> using other types of
>>> +              backends.
>>> +            </p>
>>> +          </dd>
>>> +        </dl>
>>
>> unarmed will be similar to pmem...
>>
>> Once these are properly separated it'll be easier to review the text in
>> more detail. But again, like I noted earlier try to use more generic
>> terms and avoid specifically calling out QEMU. Although that may be
>> difficult - you have to consider the audience - they just want to know
>> what the feature does in general and how it will help them or make
>> things better for them. The details of what we rename it to under the
>> covers hides the complexity.
>>
>>>         </dd>
>>>       </dl>
>>> diff --git a/docs/schemas/domaincommon.rng 
>>> b/docs/schemas/domaincommon.rng
>>> index 099a949..cca7869 100644
>>> --- a/docs/schemas/domaincommon.rng
>>> +++ b/docs/schemas/domaincommon.rng
>>> @@ -5353,9 +5353,21 @@
>>>             </interleave>
>>>           </group>
>>>           <group>
>>> -          <element name="path">
>>> -            <ref name="absFilePath"/>
>>> -          </element>
>>> +          <interleave>
>>> +            <element name="path">
>>> +              <ref name="absFilePath"/>
>>> +            </element>
>>> +            <optional>
>>> +              <element name="alignsize">
>>> +                <ref name="scaledInteger"/>
>>> +              </element>
>>> +            </optional>
>>> +            <optional>
>>> +              <element name="pmem">
>>> +                <ref name="virOnOff"/>
>>
>> Again, I'd follow the nosharepages example here.
>>
> See my explain above, NVDIMM model only supports 'shared' access now and 
> the <pmem> has nothing to do with 'nosharepages'.
> 
Ignore the reply above.I will check nosharepages usage first.
>>> +              </element>
>>> +            </optional>
>>> +          </interleave>
>>>           </group>
>>>         </choice>
>>>       </element>
>>> @@ -5379,6 +5391,19 @@
>>>               </element>
>>>             </element>
>>>           </optional>
>>> +        <optional>
>>> +          <element name="persistence">
>>> +            <choice>
>>> +              <value>mem-ctrl</value>
>>> +              <value>cpu</value>
>>> +            </choice>
>>> +          </element>
>>> +        </optional>
>>
>> While the data is correct, the placement isn't...
>>
>>> +        <optional>
>>> +          <element name="unarmed">
>>> +            <ref name="virOnOff"/>
>>
>> Similar here to be like nosharepages.
>>
> The <unarmed> has nothing to do with 'nosharepages'.
> 
Ignore the reply above.I will check nosharepages usage first.
>> John
> Thanks for your review and so much detailed comments.
> Luyao
>>
>>> +          </element>
>>> +        </optional>
>>>         </interleave>
>>>       </element>
>>>     </define>
>>>




More information about the libvir-list mailing list