[libvirt] [PATCH 3/5] conf, schema, docs: Add support for TSEG size setting

John Ferlan jferlan at redhat.com
Fri Jun 1 12:21:52 UTC 2018


[...]

First, thanks for taking the time to elaborate - it is helpful. Much
better than just stating "no, because I don't like it" ;-).

>>
>>>> 1. Add poll-max-ns property of each iothread:
>>>> https://www.redhat.com/archives/libvir-list/2017-February/msg01047.html
>>>>
>>>
>>> This is about tunables.  It might change the performance/latency of QEMU
>>> slightly, but that's about it.
>>>
>>
>> and there are those that would find it useful to have (bz 1545732)... If
> 
> Much of what I read (underscore emphasis mine) suggests otherwise:
> 
> - "There is a lot of sentiment *against* providing too many low level knobs
>    like this _without proper guidance_ on how they should be set."
> 
> - "To address this issue QEMU implements self-tuning algorithm that
> modifies
>    the current polling time to _adapt to different workloads_ and it can
> also
>    fallback to blocking syscalls."
> 
> - "The QEMU commits say the tunables all default to sane parameters so I'm
>    inclined to say we ignore them at the libvirt level entirely."
> 
> - "I'm fine if libvirt doesn't add a dedicated API for setting <iothread>
>    polling parameters.  It's _unlikely_ that users will need to change the
>    setting.  In an emergency (e.g. disabling it due to a performance
>    regression) _they can_ use <qemu:arg value='-newarg'/>."
> 
> The only points for the polling to be enabled were along the lines of:
> 
> It _may_ help in _some_ workloads when you want a bit more throughput
> for the price of more CPU cycles.
> 
> With vague definitions of how much CPU, throughput and without
> description of how to find out if a particular workload fits this.  Even
> when all of that is there, then you need yet another explanation on how
> to calculate the value to be set.  And then it all goes down back to the
> fact that QEMU is already doing some automated balancing for this
> (because they can, because this is not part of the guest ABI).  That way
> you can never actually say if it will help and how much.
> 
> So for this one it is a clear "NO".
> 

Another opposing viewpoint is:

https://bugzilla.redhat.com/show_bug.cgi?id=1545732#c8

If it were only a "documentation issue", someone would have figured out
how to get beyond it much earlier.

FWIW: Without guidance in the Contributor's Guide about what is/isn't
acceptable, I have a feeling we'll continue to see patches such as this
and the one below.

Still, my point was less about the actual feature or its details, and
more that there are other examples where exposing low-level knobs has
been panned in the past. Given the lack of details/knowledge I have/had
about TSEG during my original review, I saw it as just another low-level
knob, and while I saw value in it, I knew there had been a high degree
of sentiment against adding such knobs in previous patches, so I wanted
to "be sure" it was a desired/necessary adjustment by more than just my
opinion (it is a community after all, right?).

BTW: I don't disagree - I too found poll-max-ns an odd tunable to add,
for many of the reasons supplied. Of course, I'm the one stuck with the
bz /-| and with providing the "bad news" or else continually moving the
bz to a future release ;-)
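
For reference, the escape hatch mentioned in the quoted text above is
libvirt's qemu XML namespace passthrough. A minimal sketch of what
disabling polling might look like - the iothread id here is purely
illustrative, and in a real setup it must not clash with any iothread
id libvirt itself generates:

  <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
    ...
    <qemu:commandline>
      <qemu:arg value='-object'/>
      <qemu:arg value='iothread,id=iothr-manual,poll-max-ns=0'/>
    </qemu:commandline>
  </domain>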

>> you don't have enough memory and your VM is paging like crazy, you just
>> add more memory.  Requires a reboot. Likewise, if your VM doesn't boot
>> you add/alter the magic TSEG value using some algorithm as described
>> above. From a 90,000 foot customer view is there a difference? It's just
>> a knob that the hypervisor has to allow something to be accomplished for
>> which libvirt provides the attribute to fine tune.
>>
> 
> Yeah, you're right.  That's why I think both of them should be exposed.
> Some small differences to other knobs, just for completeness:
> 
> - by the time you realize that the VM doesn't have enough memory, it
>   might be too late as reboot isn't that easy of a thing for some
>   production workloads
> 
> - on the other hand, you have a way to see that happening (compare it
>   to the polling interval above which you have no idea without proper
>   benchmarks)
> 
> One more thing that's common to the memory size (and I hope TSEG in
> the future) is that in mgmt apps the TSEG setting already has a place
> where to live, and it is exactly where the memory size lives currently.
> In templates.  You have a "small vm" template and a "ginormous vm" one.
> For the latter you can just add a setting of TSEG _once_ per file.
> How's TSEG better and easier than memory?  You figure it out once for
> the VM settings (and possibly firmware, but that's not going to change
> much) and then it doesn't depend on the workload, not even a little
> bit.
> 
> Anyway.
> 

True, TSEG is much more bounded and in-your-face when it doesn't work.
There is still this voice rumbling around in the back of my head that
says QEMU should be the owner of deciding upon the algorithm for the
value. Unlike a performance knob, it seems there's a solid way to
calculate a 'correct' value to make the boot work. The problem is that
if that automatic calculation ended up being wrong at some point, there
would be no way to change the value without adding a knob. So, in a
way, the knob could be the exception rather than the rule: a mechanism
to make sure the guest can boot when something outside interferes.
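
For context, the knob this series proposes lives under the <smm>
feature in the domain XML - something along these lines (hedged: this
is my reading of the patches, and the final form may differ), with the
size picked once per VM template as described above:

  <features>
    <smm state='on'>
      <tseg unit='MiB'>48</tseg>
    </smm>
  </features>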

> What I see as the differences between tunables that make it in and
> tunables that
> don't is that:
> 
> - the former are usually understandable and easy to see what they are in
> bare
>   metal.  Everyone knows what memory is in the hardware, how it looks
> like, how
>   much is "not enough" and how much is "more that needed".  We are used to
>   those things back from the hardware times, even to changing them.
> 

Still, the calculation of a proper TSEG value is based on multiple
factors (memory/vCPUs). Historically I've found these also need a fudge
factor built in - and it's the fudge factor that is the sticking point.
On real hardware you'd be told: well, you don't have enough memory, so
buy some more - it's like printing money at that point for the sales
guy. You'll be guided to buy a larger, more expensive piece than you
may need in order to "ensure future expandability". You may not use the
entire thing, but you have it. For software it's a much easier knob.
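
To make the "multiple factors plus fudge factor" point concrete, here
is a purely hypothetical sketch - the constants are made up, and this
is NOT the formula from bug 1468526 comment 8 (which is neither exact
nor easy to describe); only the shape of the calculation is the point:

  # Hypothetical illustration only -- made-up constants, not the real
  # formula.  The needed TSEG grows with guest RAM (SMM page tables
  # covering the address space) and with vCPU count (per-vCPU SMM
  # state), plus the fudge factor that is the sticking point.
  def tseg_size_mib(ram_gib, vcpus, fudge_mib=8):
      ram_cost = ram_gib // 64          # say 1 MiB per 64 GiB of RAM
      vcpu_cost = (vcpus + 15) // 16    # say 1 MiB per 16 vCPUs
      return max(8, ram_cost + vcpu_cost + fudge_mib)

  print(tseg_size_mib(4096, 288))       # a "ginormous vm" template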

> - The latter is usually something we were not able control in HW or
> didn't even
>   know it existed.  For virtual workloads it might be completely
> different, but
>   sometimes people are forgetting that.
> 

Sounds like a job for virtuned (or virtunefixd or virhighavaild). Years
ago I worked on a project that would essentially show bottlenecks in
the OS and provide the capability to "fix" them via various means
(whether CPU, memory, or disk overutilization... even deadlocks and
cluster quorum hangs).

>>>> 2. Add support for qcow2 cache (many times, but most recently):
>>>> https://www.redhat.com/archives/libvir-list/2017-September/msg00553.html
>>>>
>>>>
>>>
>>> Similarly here, it allows setting something that can be (at least
>>> slightly) abstracted and in the worst case the performance will be
>>> slightly hindered.
>>
>> This one I understand more why it would be rejected, but still providing
>> the value allows certain things to work a whole lot better. I also know
>> Berto has been "fine tuning" the algorithm in later QEMU releases - so
>> that's like hitting a moving target.
>>
> 
> This is very similar, it's just that there is no automatic balancing
> done by QEMU.  But it usually is also about how you write the docs.  The
> option can make very much sense, but if someone writes "Setting asdf can
> allow fine-tuning of the asdf value in the underlying hypervisor", then
> no matter how much that value makes sense it is not reflected in the
> docs.  That's why I tried to add all the relevant info into the docs so
> that it's clear what it is doing, how to set it, to what values and
> when.
> 
> Apart from the fact that there is a "link" to some file in the QEMU
> repository that someone is supposed to read, plus the decisions behind
> the value determination are written there (but not why they are not
> automatically calculated, or maybe I missed it), it:
> 
> - is not possible to try using the <qemu:arg value='-newarg'/> approach
> 
> - the docs say:
>   <b>In general you should leave this option alone, unless you
>      are very certain you know what you are doing.</b>
> 
> So in this particular case I wouldn't be totally against having it
> there.  If you don't want to use it, then "just don't touch that" is an
> approach that shouldn't hurt anyone.
> 

Search the formatdomain page for 'unless' - there are examples where
knobs have been added that aren't well described, and the consumer had
better know what they're doing in order to use them. Perhaps another
case of alibistic behavior (a/k/a CYA).
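
For the record, the knob itself does exist on the QEMU side even though
libvirt offers no way to reach it (hence the note in the quoted text
that the qemu:arg approach is not possible for a libvirt-managed
drive). Roughly per QEMU's docs/qcow2-cache.txt, with the default
64 KiB clusters each MiB of L2 cache covers 8 GiB of virtual disk, so a
hand-run invocation might look like (sizes illustrative):

  # 16 MiB of L2 cache fully covers a 128 GiB image with 64 KiB clusters
  qemu-system-x86_64 ... \
      -drive file=huge.qcow2,format=qcow2,l2-cache-size=16M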

[...]

>> And there are those that could say if the underlying hypervisor knows
>> that for certain memory sizes and/or vCPU counts that the TSEG will be
>> too small for specific machine types that then the underlying hypervisor
>> should be the one to "choose" a value that's programmatically appropriate,
>> which to a degree IIUC is the argument being used against allowing a
>> libvirt knob for the poll-max-ns and qcow2 cache sizes.
>>
> 
> And they would be wrong as for TSEG the hypervisor a) doesn't know that
> and b) cannot change that once it was started.
> 

I think you lost me here.... From the bz problem statement:

"The necessary size is technically predictable (see bug 1468526 comment
8 point (2a) e.g.), but the formula is neither exact nor easy to
describe, so as a first step, libvirt should please expose this value in
an optional element or attribute."

I read that as: a proper size could be calculated by the hypervisor,
but "just in case" let's make sure we have a fallback option. Perfectly
reasonable to me and, even more to the point, (so far) only for q35. Of
course it's possible I read it wrong.


John

[...]



