[libvirt] [PATCH 02/10] util: add interface retrieving CMT capability

John Ferlan jferlan at redhat.com
Tue Sep 11 17:23:34 UTC 2018


[...]

>> Maybe the capability output should:
>>
>> <monitor level='3' threshold='%u' maxAlloc='%u' >
>>   <feature name='%s'/>
>>   <feature name='%s'/>
>> ...
>> </monitor>
>>
>> Where the '3' is because you read from "L3_MON" and only important if you feel
>> 1 or 2 or something else would be generated eventually.
>>
> 
> The adding of 'level' attribute make it more clear for indicating the cache
> level that supports cache monitoring technology, great.
> 
> I have read your comments that newly added in the email, but I haven't
> reply them one by one for those related to the relationship for @monitor
> and @bank. Let's move on here ...
> 
> As I said, I was also confused at the time I began my first design, I also
> noticed something is not that right for putting the @monitor under the @bank.
> 
> how about placing the XML element 'monitor' under 'cache' in following style:
> 
>     <cache>
>       <bank id='0' level='3' type='both' size='55' unit='MiB' cpus='0-21,44-65'>
>         <control granularity='2816' unit='KiB' type='both' maxAllocs='16'/>
>       </bank>
>       <bank id='1' level='3' type='both' size='55' unit='MiB' cpus='22-43,66-87'>
>         <control granularity='2816' unit='KiB' type='both' maxAllocs='16'/>
>       </bank>
> +     <monitor level='@cache level' maxAllocs='@maxAlloc'
> +       <feature name='@feature names'/>
> +     </monitor>
>     </cache>

Seems reasonable.

> 
> The 'unit' is removed.
> for cache, the monitor only support feature 'llc_occupancy'.
> 

Shall I assume then that the <monitor...> would also appear in the
<memory_bandwidth> and have entries for "mbm_"? IDC what the answer is,
all I'm trying to point out is be sure that whatever API's get created
will be able to reuse the same code, but just change the prefix.

>> And then you correlate however you have to "later".  You "know" that <cache>
>> would be related to "llc_occupancy" and take it from there.
>>
>> I see no way for each feature to have a different num_rmids or
>> max__threshold_occupancy value, so that's why I'm putting them as attributes
>> of <monitor>. What is of concern is how someone knows <monitor> relates to
>> both <cache> and <memory_bandwidth> - I guess that has to be left for the
>> documentation portion.  If you wanted to name it something different than
>> <monitor> that's fine - naming is hard (TM).
>>
> 
> num_rmids is common for all kind of monitors, both cache monitor and memory
> bandwidth monitor.
> 
> But max_threshold_occupancy is special for cache monitor, more precisely,
> special for feature 'llc_occupancy',  Maybe following configuration is more
> reasonable for cache monitor:
> 
> +     <monitor level='3' maxAllocs='176'>
> +       <feature name='llc_occupancy' threshold='270336' />

This is a strange one... Part of me would say that there's then some
file llc_occupancy that has in it the @threshold value. Going into the
future if a new feature name was created, one wonders how or if it's
attribute would/could be similarly named 'threshold' or would it
"assume" the same threshold as the other uses.

Hard to know without seeing in the future.... Hint, buy some lottery
tickets too when you're there.

> +     </monitor>
> 
> And for memory bandwidth it would be:
> 
> +     <monitor level='3' maxAllocs='176'>
> +       <feature name='mbm_total_bytes'/>
> +       <feature name='mbm_local_bytes'/>
> +     </monitor>
> 
> I don't find a better name than 'monitor' for naming CMT technology in
> libvirt. I am even worse at naming.
> 

"Naming is hard" (TM, Andrea Bolognani)...  Not many are good at naming,
but everyone is good at complaining about someone else's chosen name for
something.  I think monitor is fine - I see it as "Cache Monitoring..."
or a "Memory Bandwidth Monitoring...".

Still for this one, because threshold is listed as a property, one
wonders then is there a similar "threshold" that describes the total or
local bytes value that could/should be printed.

>>
>>>>> +
>>>>> +    return 0;
>>>>>   error:
>>>>>      while (*ncontrols)
>>>>>          VIR_FREE((*controls)[--*ncontrols]);
>>>>>      VIR_FREE(*controls);
>>>>> -    goto cleanup;
>>>>> +    virStringListFree(cachemon->features);
>>>>> +    VIR_FREE(cachemon);
>>>>> +    return ret;
>>>>>  }
>>>>>
>>>>>
>>>>> diff --git a/src/util/virresctrl.h b/src/util/virresctrl.h index
>>>>> cfd56dd..51bb68b 100644
>>>>> --- a/src/util/virresctrl.h
>>>>> +++ b/src/util/virresctrl.h
>>>>> @@ -61,6 +61,19 @@ struct _virResctrlInfoMemBWPerNode {
>>>>>      unsigned int max_allocation;
>>>>>  };
>>>>>
>>>>> +typedef struct _virResctrlInfoMon virResctrlInfoMon; typedef
>>>>> +virResctrlInfoMon *virResctrlInfoMonPtr;
>>>>> +/* Information about resource monitoring group */ struct
>>>>> +_virResctrlInfoMon {
>>>>> +    /* null-terminal string list for hw supported monitor feature */
>>>>> +    char **features;
>>>>> +    size_t nfeatures;
>>>>> +    /* Maximum number of simultaneous allocations */
>>>>> +    unsigned int max_allocation;
>>>>
>>>> What kind of allocations? From your cover you state maximum number of
>>>> monitoring groups that could be created, but it's impossible to know
>>>> how this value is expected to be used by what's provided as a comment here.
>>>>
>>>
>>> Changing the comment
>>> from
>>>   /* Maximum number of simultaneous allocations */ to
>>>   /* Maximum number of monitoring groups that could be created */
>>>
>>>> The code shows this is the value from /info/L3_MON/num_rmids - I
>>>> don't see the correlation in FS name to structure name. Since you
>>>> later print a unit of 'B' I assume it's bytes of something, but the
>>>> comment seems to imply a maximum simultaneous number of something.
>>>>
>>>
>>> Looks the capability string make you confused.
>>> The XML output of cache monitor capability looks like: (leveraging the
>>> format of 'cache control' capability)
>>>
>>>         <monitor threshold='270336' unit='B' maxAllocs='176'>
>>>           <feature name='llc_occupancy'/>
>>>         </monitor>
>>>
>>> 'B' is the unit for 'threshold', 'maxAllocs' is for max_allocation
>>> (parsed through /info/L3_MON/num_rmids).
>>> The 'unit' are not limited to 'B', could be 'KiB' ...
>>
>> True, but you read a 'B' and never change that anywhere - it's also not clear
>> threshold is a 'B' value. At least when I see 'size' followed by 'unit' - I'm certain
>> the size is a related to unit. To me "threshold" is just a "value" not a necessarily
>> byte related. Could be a count of something.
> 
> You have convinced me, let's remove attribute 'unit'.
> 
>> Of course the longer wording
>> "max_threshold_occupancy"
>> doesn't help me much either.  A maximum threshold occupancy of what?
>> It's not unique to each feature name, it's unique to the L3_MON.
>>
>>
> 
> Still trying to recap its behavior better...
> 
>>
>>> This tells docs are necessary for interpreting these settings.
>>
>> clearly!
>>
>>>
>>>     "monitor": describes cache monitor capability.
>>>     "threshold": This is cache occupancy threshold value used in kernel
>>>          resource control system, and affects the actual release of  hardware
>>>          resource, the RMID (resource monitoring ID). A greater value of this
>>>          will make the request of creating a new resource monitoring group more
>>>          likely to fail if the existing number of monitoring groups reaches up
>>>          to 'maxAlloc'.
>>
>> Again it's not something a "normal consumer" would probably change...
>>
>>>     "unit": This is the unit of "threashold", could be 'B', 'KiB', 'MiB' or 'GiB'.
>>
> 
> Remove it. The threshold would be treated as byte, and make
> corresponding changes in document.
> 
>> Unless you do the logic to present it as calculated value, then what's the
>> purpose. I know there's code out there that will "prettify" output such that for
>> the example from patch 4 a value of 270336 bytes is printed as '264 MiB'.  If
>> you're not going to do that, then just present as a value and note that it's a byte
>> value.  /me no wonders if you should be sure to store this in "unsigned long
>> long" field since you mention 'GiB' an 'unsigned int' only gets you so far.
>>
>>>     "maxAllocs": This is a number that maximum number of  monitoring groups
>> could
>>>          be created.
>>>     "feature": describes the feature name supported by 'monitor'.
>>>
>>> Hope this documentation clears your confusion.
>>>
>>>> Perhaps if documentation was added I would have had my answer without
>>>> needing to go research that is.
>>>>
>>>
>>> See docs above.
>>>
>>>>> +    /* determines the occupancy at which an RMID can be freed */
>>>>
>>>> Again, alone this comment is difficult to decipher as it relates to
>>>> the structure field name. The code shows the value read is
>>>> "/info/L3_MON/max_threshold_occupancy".  It's not clear what "occupancy"
>>>> means. Is there something related to this number that some consumer
>>>> could change that would improve some performance?
>>>>
>>>
>>> "max_threshold_occupancy" is a concept involved by kernel resctrl. It
>>> is a cache value, in bytes, affects the release of hardware 'RMID', thus, affects
>>> the maximum number of monitor group could be created. Get more
>> information from
>>> the cache monitor attribute 'threshold''s description.
>>>
>>> Thanks for your efforts of the review.
>>
>> Thanks for your return of details - I'm still not sure I really understand the
>> maxAllocs and threshold values.  I see them purely as display values at this point.
>> I cannot imagine providing an interface or description that would help some
>> consumer adjust the value to fix some problem on their host. There are patch
>> series in the virtual bit bucket that have tried to do that in other areas.
>>
> 
> The 'maxAllocs' should be easily understood, while 'threshold' is very
> obscure from the description of kernel document(intel_rdt_ui.txt).
> 
> The maxAllocs number is the hardware RMID number, which is CPU chip level
> resource, each resource monitor will cost one RMID. So the maxAllocs number
> determines the maximum number monitor could be created. If hardware RMID is
> used up, the next creation of resource monitor will return an error.
> 

So somewhat similar to vHBA's where there's a limited number of
vport_ops available.  Still shouldn't maxAllocs get decremented for each
<monitor> created? Thus if <monitor>[1] has maxAllocs=176, then
<monitor>[2] would display maxAllocs=175...

See: https://wiki.libvirt.org/page/NPIV_in_libvirt and search on vports
- you'll see "vports" and "max_vports", but those are tied to the
"parent" scsi_host. As more vHBA's are created the vports count changes,
but max_vports stays the same.

> The threshold, or the max_threshold_occupancy, is harder to understand. It is
> related to the cache monitor, or the llc_occupancy feature. The following
> paragraph is my understanding of it, hope it helps you.
> 
> If we have the goal to get to know the last level cache consumption for
> special Linux process for some time. With the help of resctrl file system,
> we need to create a monitor group and then remove this monitor in order to
> save the resource. We also need to put the target process's PID into the
> monitor's 'tasks' file.
> 
> The underlying details is something like:
> 
> A hardware RMID is allocated from hardware resource pool.
> 
> The RMID is assigned to particular hardware CPU thread (or threads, RMID could
> be assigned to multiple threads at same time) during context switch.
> 
> At the same time, the RMID is tagged the last level cache lines.
> 
> When resctrl monitor removing operation is performed. The RMID could not be
> immediately reclaimed, because it is still tagged the cache lines, while these
> cache lines will be 'maintained' for some while. 
> 
> If we let this RMID to be used by another monitor immediately, the cache
> occupancy/consumption data will be inaccurate because it still tagged
> with cache lines which are used by previous Linux process.
> 
> So this 'max_threshold_occupancy' is provided by RMID manager, actually the
> kernel CPU driver, to help make the judgment that if the number of cache line
> associated with the RMID is small enough. If the cache occupancy is less than
> this safety threshold, the RMID will be released for next reuse.
> 
> To me, it hard to recap these information from a high level for a libvirt user.
> By the way, do you think if we really necessary to expose this 'threshold' to
> Libvirt user? I doubt that anyone would change it.

If something is hard to recap, then the question becomes is it
worthwhile to expose. To some degree having the data available is nice,
but knowing how to interpret or use the data is even better. At other
times though exposing the data leads to the inevitable is the value good
or bad. If bad, then what can we do to make things better.  I think in
the long run it's a question of interpretation.

Exposing performance type data is both good and bad. It's a double edged
sword of truth. Still for something like libvirt - only providing the
raw numbers is perfectly fine. Some other tool can be developed to help
interpret those values, just make it so that tool has half a chance to
figure out what the data represents. Changing the name "too much" from
the what is represented makes for harder interpretation. That may just
be the case with "rmid" and "maxAlloc". It may be that "Allocs" should
be replaced by "Monitors", but then does that mean each child element is
"one" monitor object.  Guess I still don't have a great picture in mind
regarding how this is used from a client perspective.  Right now it's
just a lot of data.

John

> 
> Thanks for review.
> Huaqiang
> 
>> John
>>
>>>> John
>>>>
>>>>> +    unsigned int cache_threshold;
>>>>> +};
>>>>> +
>>>>>  typedef struct _virResctrlInfo virResctrlInfo;  typedef
>>>>> virResctrlInfo *virResctrlInfoPtr;
>>>>>
>>>>> @@ -72,7 +85,9 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl,
>>>>>                         unsigned int level,
>>>>>                         unsigned long long size,
>>>>>                         size_t *ncontrols,
>>>>> -                       virResctrlInfoPerCachePtr **controls);
>>>>> +                       virResctrlInfoPerCachePtr **controls,
>>>>> +                       virResctrlInfoMonPtr *monitor);
>>>>> +
>>>>>
>>>>>  int
>>>>>  virResctrlInfoGetMemoryBandwidth(virResctrlInfoPtr resctrl,
>>>>>




More information about the libvir-list mailing list