[libvirt] [PATCH 2/5] virCaps: expose huge page info

Michal Privoznik mprivozn at redhat.com
Mon Jun 16 07:45:28 UTC 2014


On 16.06.2014 08:38, Martin Kletzander wrote:
> On Fri, Jun 13, 2014 at 04:30:41PM +0200, Michal Privoznik wrote:
>> On 13.06.2014 10:28, Daniel P. Berrange wrote:
>>> On Thu, Jun 12, 2014 at 07:21:47PM +0200, Martin Kletzander wrote:
>>>> On Thu, Jun 12, 2014 at 02:30:50PM +0100, Daniel P. Berrange wrote:
>>>>> On Tue, Jun 10, 2014 at 07:21:12PM +0200, Michal Privoznik wrote:
>>>>>> There are two places where you'll find info on huge pages. The first
>>>>>> one is under the <cpu/> element, where all supported huge page sizes
>>>>>> are listed. The second one is under each <cell/> element, which refers
>>>>>> to a concrete NUMA node; there, the size of the huge page pool is
>>>>>> reported. So the capabilities XML looks something like this:
>>>>>>
>>>>>> <capabilities>
>>>>>>
>>>>>>   <host>
>>>>>>     <uuid>01281cda-f352-cb11-a9db-e905fe22010c</uuid>
>>>>>>     <cpu>
>>>>>>       <arch>x86_64</arch>
>>>>>>       <model>Westmere</model>
>>>>>>       <vendor>Intel</vendor>
>>>>>>       <topology sockets='1' cores='1' threads='1'/>
>>>>>>       ...
>>>>>>       <pages unit='KiB' size='1048576'/>
>>>>>>       <pages unit='KiB' size='2048'/>
>>>>>
>>>>> Should have normal sized pages (i.e. 4k on x86) too, to avoid
>>>>> apps having to special-case small pages.
>>>>>
>>>>
>>>> Since we have to special-case small pages and the kernel (at least to
>>>> my knowledge) doesn't expose that information by the usual means, I think
>>>> reporting only hugepages is actually what we want here.  For normal
>>>> memory there are existing APIs already.
>>>>
>>>> Hugepages are different mainly because of one thing.  The fact that
>>>> some hugepages are allocated is known to the user of the machine
>>>> (be it a mgmt app or an admin), and these hugepages were allocated for
>>>> some purpose.  It is fairly OK to presume that the number of hugepages
>>>> (free or total) will change only when and if the user wants to
>>>> (e.g. running a machine with specified size and hugepages).  That
>>>> cannot be said about small pages, though, and I think that is a fair
>>>> reason to special-case normal pages like this.
>>>
>>> That difference is something that's only relevant to the person who
>>> is provisioning the machine though. For applications consuming the
>>> libvirt APIs it is not relevant. For OpenStack we really want to have
>>> normal size pages dealt with in the same way as huge pages since
>>> it will simplify our scheduler/placement logic. So I really want these
>>> APIs to do this in libvirt so that OpenStack doesn't have to reverse
>>> engineer this itself.
>>
>> But if we go this way, there are hidden pitfalls. For instance, the
>> size of the ordinary page pool. This is not exposed anywhere, and the
>> only algorithm I can think of is to take [(MemTotal on NODE #i) -
>> sum(mem taken by all huge pages)] / PAGE_SIZE. So for instance on my
>> machine, where I have one 1GB huge page and three 2MB huge pages per
>> NUMA node:
>>
>> # grep MemTotal /sys/devices/system/node/node0/meminfo
>> Node 0 MemTotal:        4054408 kB
>>
>> # cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
>> 1
>>
>> # cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> 3
>>
>> # getconf PAGESIZE
>> 4096
>>
>> (4054408 - (1*1048576 + 3*2048)) / 4 = 2999688 / 4 = 749922 ordinary
>> pages. But it's not that simple, as not all pages are available. Some are
>> reserved for DMA transfers, some for the kernel itself, etc. Without
>> overcommit it's impossible to actually allocate those nearly 3GB. Is this
>> something we really want to do?
>>
>
> I've found one other way to get the number of free normal pages.  It
> looks like nr_free_pages in /proc/zoneinfo is probably what Daniel
> wants to report.  But given that this value may already be stale by
> the time the file is parsed, I'm still not convinced it's something
> you want to report.  A bit more accurate would be reporting the amount
> of memory that might be available to the machine, although with
> overcommit settings and file caches this might not be feasible.
>

No, the zoneinfo file provides essentially the same info as
/sys/devices/system/node/node*/meminfo.

# getconf PAGESIZE
4096

# grep -i memfree /sys/devices/system/node/node3/meminfo
Node 3 MemFree:         2370272 kB

which is 2370272/4 = 592568 free pages. And the corresponding field
in /proc/zoneinfo shows:

# grep nr_free_pages /proc/zoneinfo  | tail -n 1
     nr_free_pages 592639

which is nearly the same number (I wonder where the slight difference
comes from, though). And the problem is not getting info on free pages,
but rather getting the size of the pages pool.
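
For completeness, the computation I described above would look something
like this; a minimal sketch only, assuming the sysfs layout shown earlier
(the loop over the hugepages-*kB directories and the output format are my
own):

# For each NUMA node, subtract the memory claimed by every huge page
# pool from the node's MemTotal, then divide by the ordinary page size.
pagesize_kib=$(( $(getconf PAGESIZE) / 1024 ))
for node in /sys/devices/system/node/node*; do
    total_kib=$(awk '/MemTotal/ {print $4}' "$node/meminfo")
    huge_kib=0
    for pool in "$node"/hugepages/hugepages-*kB; do
        size_kib=${pool##*hugepages-}   # "hugepages-2048kB" -> "2048kB"
        size_kib=${size_kib%kB}         # "2048kB" -> "2048"
        nr=$(cat "$pool/nr_hugepages")
        huge_kib=$(( huge_kib + size_kib * nr ))
    done
    echo "$node: $(( (total_kib - huge_kib) / pagesize_kib )) ordinary pages"
done

But as said above, the number it prints overstates what is actually
allocatable, since it still includes memory reserved for DMA, the kernel
and so on.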

Michal



