[libvirt] [RFC PATCH 0/2] Introduce RDT memory bandwidth allocation support

Pavel Hrdina phrdina at redhat.com
Fri Jun 8 10:51:02 UTC 2018


On Wed, Jun 06, 2018 at 01:56:53PM +0800, bing.niu wrote:
> Hi Pavel,
> Thanks for your valuable input here. Please see my response.
> 
> On 2018-06-05 20:11, Pavel Hrdina wrote:
> > On Tue, May 29, 2018 at 06:58:01PM +0800, bing.niu at intel.com wrote:
> > > From: Bing Niu <bing.niu at intel.com>
> > > 
> > > This series introduces RDT memory bandwidth allocation support by extending
> > > the current virresctrl implementation.
> > > 
> > > The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate
> > > control over the memory bandwidth available per core. It provides a method to
> > > control applications which may be over-utilizing bandwidth relative to their
> > > priority in environments such as the data center. The details can be found in
> > > Intel's SDM, Section 17.19.7. The kernel supports MBA through the resctrl file
> > > system, the same as for CAT. Each resctrl group has an MB parameter that
> > > controls how much memory bandwidth it can utilize, expressed as a percentage.
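> > > 
> > > For example, a group's schemata file might look like this (the group name
> > > and values are illustrative, for a machine with two memory controllers):
> > > 
> > >     # cat /sys/fs/resctrl/p0/schemata
> > >     L3:0=7ff;1=7ff
> > >     MB:0=50;1=50
> > > 
> > > Here the MB line caps the group at 50% of the memory bandwidth on each
> > > node.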
> > > 
> > > In this series, MBA is enabled by enhancing the existing virresctrl
> > > implementation. The policy employed for MBA is similar to CAT's: the sum of
> > > the MBA groups' bandwidths does not exceed 100%. The enhancement of
> > > virresctrl consists of two parts:
> > > 
> > > Patch 1: Add two new structures, virResctrlInfoMB and virResctrlAllocMB, for
> > >           collecting the host system's MBA capability and the domain's memory
> > >           bandwidth allocation.
> > > 
> > > Patch 2: In the frontend XML parsing, add a new element "llc" in the cachetune
> > >           section for MBA allocation.
> > 
> > Hi,
> > 
> > Thanks for the patches.  Before we start with the actual implementation
> > it would be nice to agree on the design.
> Totally agree. The RFC code acts as a baseline for discussion.
> > 
> > ------------------------------------------------------------------------
> > 
> > The first point is that we should do it similarly to the cache
> > allocation: we will not allow "sharing" the bandwidth, so the sum must
> > not exceed 100%, as you already have it in your patches, but we need to
> > do it in a way that lets us allow "sharing" the bandwidth in the future.
> Yes, the memory bandwidth allocation policy is derived from the existing
> CAT policy in libvirt: no sharing or overlap. In the patch, I follow the
> existing CAT behavior when allocating memory bandwidth: first, calculate
> the unused memory bandwidth by subtracting all existing RDT groups'
> allocations. If we want to enable memory bandwidth sharing, we can simply
> skip this step and do the allocation directly.
> Does this fit your comment "we need to do it in a way that in the future
> we can allow to 'share' the bandwidth"?
> If anything is missing or my understanding is incorrect, please point it
> out. :)

Sounds good to me.
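
Just to illustrate the idea in code, the check could stay as simple as
this (a rough sketch under the assumption that per-node allocations are
tracked as plain percentages; the function and variable names are made
up, not the actual virresctrl API):

    #include <stddef.h>

    /* Return 0 if a new MB 'request' (in percent) still fits into the
     * bandwidth left unallocated on one node, keeping the sum of all
     * groups at or below 100%; return -1 otherwise. */
    static int
    checkFreeBandwidth(const unsigned int *allocated, size_t nallocated,
                       unsigned int request)
    {
        unsigned int used = 0;
        size_t i;

        for (i = 0; i < nallocated; i++)
            used += allocated[i];  /* percent already claimed per group */

        if (used + request > 100)
            return -1;             /* no free bandwidth left */

        return 0;
    }

Allowing "sharing" later would then just mean relaxing or dropping this
check.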

> > The second point is how the XML will look.  There are two parts: one is
> > the capabilities XML and the second is the domain XML.
> > 
> > It looks like your patches don't expose any information in the
> > capabilities; we should do that in order to let management applications
> > know that the feature is available and what values they can use.
> > 
> > ------------------------------------------------------------------------
> > 
> > I've tried to configure MBA on one machine that I have access to, which
> > has this CPU: 'Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz', and it behaves
> > strangely.  If I configure 'schemata', the output of the 'pqos -s'
> > command differs in some situations:
> > 
> >      schemata                pqos -s output
> > 
> >      MB:0=10                 MBA COS0 => 10% available
> >      MB:0=20                 MBA COS0 => 20% available
> >      MB:0=30                 MBA COS0 => 30% available
> >      MB:0=40                 MBA COS0 => 40% available
> >      MB:0=50                 MBA COS0 => 50% available
> >      MB:0=60                 MBA COS0 => 60% available
> >      MB:0=70                 MBA COS0 => 90% available
> >      MB:0=80                 MBA COS0 => 90% available
> >      MB:0=90                 MBA COS0 => 90% available
> >      MB:0=100                MBA COS0 => 100% available
> > 
> > If you look at the table you can see that for values 70-90, pqos
> > reports the available bandwidth as 90%.
> > 
> > Tested using Fedora 28:
> > kernel-4.16.13-300.fc28.x86_64
> > intel-cmt-cat-1.2.0-2.fc28.x86_64
> > 
> Hmm, that is strange. I manipulate the resctrl fs directly, so I didn't
> hit this kind of issue. I will take a look at the pqos package and let
> you know.

Yes, I was directly manipulating the resctrl fs as well, and after
modifying the schemata file its content was correct; however, I wanted to
validate what was actually configured, and to do that I used the 'pqos'
tool.  The question now is whether that tool is broken or whether, for
some values, the actual configuration differs from what was requested.
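
For reference, the kernel describes the MBA parameters under
/sys/fs/resctrl/info/MB/, e.g. (the values here are illustrative, your
machine may differ):

    # cat /sys/fs/resctrl/info/MB/min_bandwidth
    10
    # cat /sys/fs/resctrl/info/MB/bandwidth_gran
    10
    # cat /sys/fs/resctrl/info/MB/delay_linear
    1

If delay_linear were 0, the mapping from the requested percentage to the
actual throttling value would be non-linear, which might explain why
70-90 all end up as 90%, but that's just a guess on my part.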

> > ------------------------------------------------------------------------
> > 
> > Since CAT (cache allocation technology) and MBA (memory bandwidth
> > allocation) are unrelated and control different limits, we should not
> > group MBA together with CAT in our XML files.  From the sparse
> > documentation it looks like MBA is related to the memory controller.
> From Intel SDM 17.19: MBA is used to control the request rate for flushing
> data from the LLC to memory; usually MBA and the LLC have a 1:1 mapping.
> Yes, I missed exposing the capability part. Thanks for pointing that out.
> > 
> > Currently the cache allocation in capabilities XML is reported like
> > this:
> > 
> > <capabilities>
> >    <host>
> >      ...
> >      <cache>
> >        <bank id='0' level='3' type='both' size='30720' unit='KiB' cpus='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46'>
> >          <control granularity='1536' unit='KiB' type='code' maxAllocs='8'/>
> >          <control granularity='1536' unit='KiB' type='data' maxAllocs='8'/>
> >        </bank>
> >        <bank id='1' level='3' type='both' size='30720' unit='KiB' cpus='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'>
> >          <control granularity='1536' unit='KiB' type='code' maxAllocs='8'/>
> >          <control granularity='1536' unit='KiB' type='data' maxAllocs='8'/>
> >        </bank>
> >      </cache>
> >      ...
> >    </host>
> > </capabilities>
> > 
> > So the possible capabilities XML could look like this:
> > 
> > <capabilities>
> >    <host>
> >      ...
> >      <memory>
> >        <bank id='0' cpus='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46'>
> >          <control granularity='10' maxAllocs='8'/>
> >        </bank>
> >        <bank id='1' cpus='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'>
> >          <control granularity='10' maxAllocs='8'/>
> >        </bank>
> >      </memory>
> >      ...
> >    </host>
> > </capabilities>
> > 
> > The elements 'memory' and 'bank' could be named differently;
> > suggestions are welcome.
> How about changing 'bank' to 'node'?

It can definitely be 'node'; this was just to make it consistent with the
cache allocation.
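
(The 'granularity' and 'maxAllocs' values would presumably come from
/sys/fs/resctrl/info/MB/bandwidth_gran and num_closids, respectively.)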

> > Then there is the domain XML; for CAT we use this:
> > 
> > <domain>
> >    ...
> >    <cputune>
> >      ...
> >      <cachetune vcpus='0-3'>
> >        <cache id='0' level='3' type='both' size='3' unit='MiB'/>
> >        <cache id='1' level='3' type='both' size='3' unit='MiB'/>
> >      </cachetune>
> >      ...
> >    </cputune>
> >    ...
> > </domain>
> > 
> > so the possible domain XML could look like this:
> > 
> > <domain>
> >    ...
> >    <cputune>
> >      ...
> >      <memory vcpus='0-3'>
> >        <socket id='0' bandwidth='30'/>
> >        <socket id='1' bandwidth='20'/>
> >      </memory>
> >      ...
> >    </cputune>
> >    ...
> > </domain>
> > 
> > Again, the element names 'memory' and 'socket' can be named differently.
> socket --> node?

Same here, I'm OK with 'node'.
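
Just to spell out the mapping: with the XML above, <node id='0'
bandwidth='30'/> plus <node id='1' bandwidth='20'/> for one group would
translate to a schemata line like:

    MB:0=30;1=20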

> Since the existing virresctrl implementation only cared about the cache
> part during development, we may need to change some structure and
> function names when enabling MBA. What do you think?

Yes, the existing virresctrl implementation was mainly focused on cache
allocation.  There is a new patch series with some cleanups for the
existing implementation which renames a few things, but we will probably
need to rename some other functions/structures/etc.  One important note:
the rename should be done in a separate patch without any functional
changes.

Pavel