[libvirt] [RFC 1/3] xml: introduce more config elements for NVDIMM memory
Zhong, Luyao
luyao.zhong at intel.com
Fri Nov 16 16:21:54 UTC 2018
On 11/16/2018 11:35 PM, Zhong, Luyao wrote:
I maybe misunderstand some comments. so sorry for that. I have updated
my reply.
>
>
> On 11/8/2018 5:09 AM, John Ferlan wrote:
>>
>>
>> On 10/16/18 10:21 PM, Luyao Zhong wrote:
>>> In order to align with QEMU ,four more parameters about NVDIMM will
>>> be introduced into Libvirt xml.
>>>
>>> 1.alignsize
>>> The 'alignsize' option allows users to specify the proper alignment.
>>> When
>>> mmap(2) the backend files, QEMU uses the host page size by default as
>>> the alignment of mapping address. However, some backends may require
>>> alignments different from the pagesize. For example, mmap a device DAX
>>> on NVDIMM maybe 2M-aligned.
>>>
>>> 2.pmem
>>> The 'pmem' option allows users to specify whether the backend storage of
>>> memory-backend-file is a real persistent memory that can be accessed in
>>> SNIA NVM Programming Model. If the vNVDIMM backend is in host persistent
>>> memory , and 'pmem' is 'on' and QEMU is built with libpmem, qemu will
>>> guarantee the persistence of its own writes to the vNVDIMM backend.
>>>
>>> 3.persistence
>>> The 'persistence' option allows users to set platform-supported features
>>> about NVDIMM data persistence of a guest. It has two values, 'mem-ctrl'
>>> means the platform supports flushing dirty data from the memory
>>> controller
>>> to the NVDIMMs in the event of power loss, 'cpu' means The platform
>>> supports
>>> flushing dirty data from the CPU cache to the NVDIMMs in the event of
>>> power
>>> loss, which contains what 'mem-ctrl' means.
>>>
>>> 4.unarmed
>>> The 'unarmed' option allows users to mark vNVDIMM read-only. Only one
>>> type
>>> of vNVDIMM backends can guarantee the guest write persistence, which
>>> is the
>>> device DAX on the real NVDIMM device (e.g., /dev/dax0.0). When using
>>> other
>>> types of backends, it's suggested to set 'unarmed' option to 'on', so
>>> the
>>> guest Linux NVDIMM driver will mark such vNVDIMM device as read-only.
>>>
>>
>> caveat: I didn't research too deeply into each of these options, but
>> I'll give you some feedback from the aspect of how you should formulate
>> your patches.
>>
>> Rather than essentially replicate what was added in formatdomain provide
>> the examples here and a summary of what the proposed XML would "look
>> like", e.g.
>>
>> Since 'alignsize' and 'pmem' seem to be related specifically to memory
>> mapped files:
>>
>> <memory model='nvdimm'>
>> <source>
>> <path>/tmp/nvdimm</path>
>> <alignsize unit='KiB'>2048</alignsize>
>> <pmem>on</pmem>
>> </source>
>> ...
>> </memory>
>>
> Got it.
>
>> The <alignsize> would seem to similar in description to the existing
>> <pagesize> - although yes for a different model type.....
>>
> No, the 'alignsize' is totally another conception.It means our data
> structure must aligned with a fixed size.
>
>> The <pmem> seems to have two states "on" or "off", thus having just
>> <pmem/> similar to perhaps how nosharepages is handled.
>>
> The 'pmem' has nothing to do with the 'nosharepages'. Actually, the
> 'nvdimm' model can only support 'shared' access mode now. The 'pmem'
> just tell the QEMU if the backend file is a real persistence memory
> device, so QEMU will know if itself should guarantee the write persistent.
>
Ignore the reply above.I will check nosharepages usage first.
>> Since they're both memory mapped related changes, keep them together
>> from conf, xml, rng, etc. Then the "next" patch would be to do the qemu
>> and capability changes. With the last being any virsh changes to allow
>> modification on the command line (if possible).
>>
> Got it.
>
>> The "persistence" seems to be less of a <memory> option and more of a
>> <machine> option, true/false? It would thus need to be separated from
>> the others. Still similar to the first one, separate patch for conf,
>> xml, rng, html... Then patch for qemu...
>>
> It really show the platform features, but it's memory-related. So my
> opinion is keep this along with other options in <source>, which cover
> all configurations about backend NVDIMM on host.
>
Sorry for misunderstanding your comment. Thank you very much for point
that. It's really a little bit big problem. I think I need redesign this
'persistence' part. My opinion is kicking it out and put it into another
patch set.
>> The "unarmed" is a <target> option since it's being added in patch2
>> after node and label are processed, so it should be separate from the
>> <source> and <machine> option patches. If it's just going to have 2
>> states, then it doesn't need <unarmed>on</unarmed> - instead it could
>> just be <unarmed/>.
>>
> Got it. My opinion is keep it in <target>, which cover all
> configurations about guest vNVDIMM.
>
>>> For more details, see
>>> https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt
>>
>> Not a commit message worthy... Place it under the ---...
>>
>>>
>>> Signed-off-by: Zhong,Luyao <luyao.zhong at intel.com>
>>> ---
>>> docs/formatdomain.html.in | 98
>>> ++++++++++++++++++++++++++++++++++++-------
>>> docs/schemas/domaincommon.rng | 31 ++++++++++++--
>>> 2 files changed, 111 insertions(+), 18 deletions(-)
>>>
>>
>> As noted above, but perhaps difficult to determine - the html.in, .rng,
>> domain_conf, xml2xml tests, etc. should be in one patch. Then another
>> patch for the qemu, capabilities, xml2argv tests, etc. in the next
>> patch. Then if you're going to add virsh options a separate patch for
>> that (although probably doesn't apply here).
>>
> Got it.
>
>> After each patch, running make check syntax-check needs to pass.
>>
> Got it.
>
>>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>>> index 8189959..9dec984 100644
>>> --- a/docs/formatdomain.html.in
>>> +++ b/docs/formatdomain.html.in
>>> @@ -8258,6 +8258,8 @@ qemu-kvm -net nic,model=? /dev/null
>>> <memory model='nvdimm'>
>>> <source>
>>> <path>/tmp/nvdimm</path>
>>> + <alignsize unit='KiB'>2048</alignsize>
>>> + <pmem>on</pmem>
>>> </source>
>>> <target>
>>> <size unit='KiB'>524288</size>
>>> @@ -8265,6 +8267,8 @@ qemu-kvm -net nic,model=? /dev/null
>>> <label>
>>> <size unit='KiB'>128</size>
>>> </label>
>>> + <persistence>cpu</persistence>
>>> + <unarmed>on</unarmed>
>>> </target>
>>> </memory>
>>> </devices>
>>> @@ -8339,10 +8343,38 @@ qemu-kvm -net nic,model=? /dev/null
>>> </dl>
>>> <p>
>>> - For model <code>nvdimm</code> this element is mandatory
>>> and has a
>>> - single child element <code>path</code> that represents a path
>>> - in the host that backs the nvdimm module in the guest.
>>> + For model <code>nvdimm</code> this element is mandatory. The
>>> + mandatory child element <code>path</code> represents a
>>> path in
>>> + the host that backs the nvdimm module in the guest. If
>>> + <code>nvdimm</code> is provided, then the following optional
>>> + elements can be provided as well:
>>
>> You'd need to add a since tags here for the "new" work...e.g.:
>>
>> <span class='since'>since 4.10.0</span>
>>
>> Whenever something new is added for a specific version, that's what's
>> necessary.
>
> Got it.
>
>>
>>> </p>
>>> +
>>> + <dl>
>>> + <dt><code>alignsize</code></dt>
>>> + <dd>
>>> + <p>
>>> + This element can be used to specify a proper alignment.
>>> + When mmap(2) the backend files, QEMU uses the host page
>>> + size by default as the alignment of mapping address.
>>> However,
>>> + some backends may require alignments different from
>>> the page.
>>> + For example, mmap a device DAX on NVDIMM maybe
>>> 2M-aligned.
>>> + </p>
>>> + </dd>
>>> +
>>> + <dt><code>pmem</code></dt>
>>> + <dd>
>>> + <p>
>>> + This element can be used to specify whether the
>>> backend storage
>>> + of memory-backend-file is a real persistent memory
>>> that can be
>>> + accessed in SNIA NVM Programming Model. If the
>>> backend is a real
>> ^
>> there's two spaces between backend and is
>>
> Got it. It's my mistake.
>
>>> + persistence memory and <code>pmem</code> is set to
>>> 'on', QEMU will
>>> + guarantee the persistence of its own writes to the
>>> vNVDIMM backend,
>>> + but which can work well with libpmem support (QEMU
>>> configured with
>>> + --enable-libpmem).
>>> + </p>
>>> + </dd>
>>> + </dl>
>>> </dd>
>>
>> The probably should be reworded in syntax more suitable for libvirt's
>> usage without including QEMU details.
>>
> Got it. I will reword this.
>
>> The pmem one in particular may be hard to describe... The fact that it
>> doesn't work without libpmem does raise a couple of warning flags. I
>> would just say that if <pmem/> is there, then it's on if it's not, then
>> it's off. Internally that then isn't a tristate, it just an on/off.
>>
>>> <dt><code>target</code></dt>
>>> @@ -8361,19 +8393,55 @@ qemu-kvm -net nic,model=? /dev/null
>>> NUMA nodes configured.
>>> </p>
>>> <p>
>>> - For NVDIMM type devices one can optionally use
>>> - <code>label</code> and its subelement <code>size</code>
>>> - to configure the size of namespaces label storage
>>> - within the NVDIMM module. The <code>size</code> element
>>> - has usual meaning described
>>> - <a href="#elementsMemoryAllocation">here</a>.
>>> - For QEMU domains the following restrictions apply:
>>> + Besides, the following optional elements can be provided
>>> as well for
>>> + NVDIMM type devices:
>>> </p>
>>> - <ol>
>>> - <li>the minimum label size is 128KiB,</li>
>>> - <li>the remaining size (total-size - label-size) has to be
>>> aligned to
>>> - 4KiB</li>
>>> - </ol>
>>> +
>>> + <dl>
>>> + <dt><code>label</code></dt>
>>> + <dd>
>>> + <p>
>>> + For NVDIMM type devices one can optionally use
>>> + <code>label</code> and its subelement <code>size</code>
>>> + to configure the size of namespaces label storage
>>> + within the NVDIMM module. The <code>size</code> element
>>> + has usual meaning described
>>> + <a href="#elementsMemoryAllocation">here</a>.
>>> + For QEMU domains the following restrictions apply:
>>> + </p>
>>> + <ol>
>>> + <li>the minimum label size is 128KiB,</li>
>>> + <li>the remaining size (total-size - label-size) will
>>> be aligned to
>>> + 4KiB as default.</li>
>>> + </ol>
>>> + </dd>
>>> +
>>> + <dt><code>persistence</code></dt>
>>> + <dd>
>>> + <p>
>>> + The <code>persistence</code> element can be set to
>>> "mem-ctl" or "cpu",
>>
>> "mem-ctrl"
>>
>>> + which indicate platform-supported features about
>>> NVDIMM data persistence.
>>> + 'mem-ctrl' means the platform supports flushing dirty
>>> data from the memory
>>> + controller to the NVDIMMs in the event of power loss,
>>> 'cpu' means The platform
>>
>> means the platform
>>
>>> + supports flushing dirty data from the CPU cache to the
>>> NVDIMMs in the event
>>> + of power loss, which contains what 'mem-ctrl' means.
>>> + ACPI 6.2 Errata A added support for a new Platform
>>> Capabilities Structure
>>> + in NFIT, so the guest ACPI NFIT will be filled in
>>> according to the persistence
>>> + option.
>>> + </p>
>>> + </dd>
>>
>> That's a hard read - someone has to know about ACPI 6.2 Errata A and
>> whatever it is and where to find it?
>>
> Sorry, maybe I put too much details here, there is no need to have
> knowledge about ACPI. I will delete the ACPI things.
>
>> Also, I'm not sure this belongs on the NVDIMM line... Allowing it means
>> more than one could use this and they could use separate values, each
>> then being placed on the command line and then you don't know what goes
>> with what.
>>
>> That is what happens if we have
>>
>> <memory model='nvdimm'>
>> ...
>> <target>
>> ...
>> <persistence>mem-ctrl</persistence>
>> ...
>> </memory>
>> <memory model='nvdimm'>
>> ...
>> <target>
>> ...
>> <persistence>cpu</persistence>
>> ...
>> </memory>
>>
> I have not considered this situation, I need do more tests and find a
> solution.
>
>>> +
>>> + <dt><code>unarmed</code></dt>
>>> + <dd>
>>> + <p>
>>> + The <code>unarmed</code> element can be used to mark
>>> vNVDIMM read-only
>>> + through setting unarmed flag in guest NFIT.Currently
>>> the only vNVDIMM backend
>>
>> What is NFIT?
>>
> 'NVDIMM Firmware Interface Table' defined in ACPI. I will delete this
> for its hard to understand and not very necessary.
>
>> Add a space before Currently
>>
> Got it.
>
>>> + can guarantee the guest write persistence is device
>>> DAX on Linux, so it's
>>
>> What is DAX?
>>
> 'DAX' means 'direct access'.
>
>>> + suggested to set <code>unarmed</code> to 'on' when
>>> using other types of
>>> + backends.
>>> + </p>
>>> + </dd>
>>> + </dl>
>>
>> unarmed will be similar to pmem...
>>
>> Once these are properly separated it'll be easier to review the text in
>> more detail. But again, like I noted earlier try to use more generic
>> terms and avoid specifically calling out QEMU. Although that may be
>> difficult - you have to consider the audience - they just want to know
>> what the feature does in general and how it will help them or make
>> things better for them. The details of what we rename it to under the
>> covers hides the complexity.
>>
>>> </dd>
>>> </dl>
>>> diff --git a/docs/schemas/domaincommon.rng
>>> b/docs/schemas/domaincommon.rng
>>> index 099a949..cca7869 100644
>>> --- a/docs/schemas/domaincommon.rng
>>> +++ b/docs/schemas/domaincommon.rng
>>> @@ -5353,9 +5353,21 @@
>>> </interleave>
>>> </group>
>>> <group>
>>> - <element name="path">
>>> - <ref name="absFilePath"/>
>>> - </element>
>>> + <interleave>
>>> + <element name="path">
>>> + <ref name="absFilePath"/>
>>> + </element>
>>> + <optional>
>>> + <element name="alignsize">
>>> + <ref name="scaledInteger"/>
>>> + </element>
>>> + </optional>
>>> + <optional>
>>> + <element name="pmem">
>>> + <ref name="virOnOff"/>
>>
>> Again, I'd follow the nosharepages example here.
>>
> See my explain above, NVDIMM model only supports 'shared' access now and
> the <pmem> has nothing to do with 'nosharepages'.
>
Ignore the reply above.I will check nosharepages usage first.
>>> + </element>
>>> + </optional>
>>> + </interleave>
>>> </group>
>>> </choice>
>>> </element>
>>> @@ -5379,6 +5391,19 @@
>>> </element>
>>> </element>
>>> </optional>
>>> + <optional>
>>> + <element name="persistence">
>>> + <choice>
>>> + <value>mem-ctrl</value>
>>> + <value>cpu</value>
>>> + </choice>
>>> + </element>
>>> + </optional>
>>
>> While the data is correct, the placement isn't...
>>
>>> + <optional>
>>> + <element name="unarmed">
>>> + <ref name="virOnOff"/>
>>
>> Similar here to be like nosharepages.
>>
> The <unarmed> has nothing to do with 'nosharepages'.
>
Ignore the reply above.I will check nosharepages usage first.
>> John
> Thanks for your review and so much detailed comments.
> Luyao
>>
>>> + </element>
>>> + </optional>
>>> </interleave>
>>> </element>
>>> </define>
>>>
More information about the libvir-list
mailing list