[libvirt] RFC: New CPU hot(un)plug API and XML

Wed Jun 15 17:10:01 UTC 2016

On Mon, Jun 13, 2016 at 02:48:51PM +0200, Peter Krempa wrote:

> # API #
> 
> As for the new API I'm thinking of the following design:
> 
> int
> virDomainVcpu(virDomainPtr domain,
>               unsigned int id,
>               unsigned int flags);
> 
>  The flags for this API would be following:
>   - usual domain modification impact:
>     * VIR_DOMAIN_SET_VCPU_CURRENT
>     * VIR_DOMAIN_SET_VCPU_LIVE
>     * VIR_DOMAIN_SET_VCPU_CONFIG
>   - for specifying the operation as the default operation would query the cpu
>     state:
>     * VIR_DOMAIN_SET_VCPU_ENABLE
>     * VIR_DOMAIN_SET_VCPU_DISABLE

We generally use flags to control behaviour of a specific action,
not to actually choose which action to run.

IOW, I'd prefer to see us have separate APIs for this

   virDomainVCPUPlug / virDomainVCPUUnplug

Or, have a single API with separate argument to reflect
the state, eg

   virDomainVCPUSetHardware(virDomainPtr domain,
                            unsigned int id,
                            unsigned int present, /* 0 == unplugged, 1 == plugged */
                            unsigned int flags);

of course, inside the QEMU driver, its fine if you want to have
these separate APIs call into common code path todo their work.

>   - misc:
>     * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug

I think in general it was/is a mistake to overload guest agent
CPU online/offline-ing, with real CPU hotplug/unplug. The two
operations are really distinct from each other, and having them
in the same API gives the misleading impression that they're
just 2 ways to doing the same thing, where as really they are
two completely different beasts with very different security
implications too.

IOW, I'd suggest we make this more explicit by having another
method

       virDomainVCPUSetGuestUsage(virDomainPtr domain,
                                  unsigned int id,
                                  unsigned int online, /* 0 == disabled, 1 == enabled */
                                  unsigned int flags);

>     * VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a numa node where the
>       cpu should be enabled/disabled rather than CPU id. This is a convenience
>       flag that will allow to add cpu to a given numa node rather than having
>       to find the correct ID.

Using flags is fine if we want to retrofit this kind of thing into an
existing API, but if we're deciding a new API, we should aim to represent
NUMA node ID explicitly I think.

Then again, I wonder if the desire for this flag is really just a reflection
on us making it too hard to figure out the correct vCPU ID for a NUMA node.

>     * VIR_DOMAIN_SET_VCPU_CORE - use thread level hotplug (see [1]). This
>                                  makes sure that the CPU will be plugged in
>                                  on platforms that require to plug in multiple
>                                  threads at once.

So the use of 'int id' rather strongly suggests we're only plugging
a single vCPU at a time. Given that with PPC you're forced to plug
an entire set of threads at once, in a single "core", I think perhaps
we should have a bitmask instead of an 'int id'. eg

    unsigned char *cpumask,
    unsigned int cpumasklen

The domain guest capabilities XML could describe the minimum required
granularity for hotplugging/unplugging CPUs. So the app would query
that and discover whether they need to plug 1 or 8 cpus at once.
The app would be responsible for setting the CPU mask to reflect
how many CPUs they want to plug at once.  That would remove the
need to have an explicit VIR_DOMAIN_SET_VCPU_CORE flag I think.
We'd simply return an error if the app didn't pay attention to the
reuqired CPU hotplug granularity described in the capabilites.

> VIR_DOMAIN_SET_VCPU_NUMA_NODE and VIR_DOMAIN_SET_VCPU_GUEST are mutually
> exclusive as the guest agent doesn't report the guest numa node the CPU is
> belonging to .
> 
> If the idea of one API that will both query and set is too nonconformist to
> our existing API design I have no problem adding Get/Set versions and/or
> explode the ADD/REMOVE flags into a separate parameter.

Yep, I'd rather do that.

> The new API will require us to add new XML that will allow to track the state
> of VCPUs individually. Internally we now have a data structure allowing to
> keep the relevant data in one place.

Presumably we're only going to track the plugged/unplugged state,
and *not* the guest OS  online/offline state. This would be another
reason to deal with guest OS online/offline via a separate API
from the real hotplug/unplug API.

> Currently we are setting data relevant to VCPUs in many places.
> 
> <domain>
>   [...]
>   <vcpu current='1'>3</vcpu>
>   [...]
>   <cputune>
>     <cpupin ... />
>   </cputune>
>   [...]
>   <cpu>
>     <numa>
>       <cell id='0' cpus='0' memory='102400' unit='KiB/>
>       <cell id='1' cpus='1-2' memory='102400' unit='KiB/>
>     </numa>
> 
> As we'll be required to keep the state for every single cpu I'm thinking of
> adding a new subelement called '<vcpus>' to <domain>. This will have a
> '<vcpu>' subelement for every configured cpu.
> 
> I'm specifically not going to add any of the cpupin or numa node ids to the
> /domain/vcpus/vcpu as input parameters to avoid introducing very compicated
> checking code that would be required to keep the data in sync.
> 
> I'm thinking of adding the numa node id as an output only attribute since it's
> relevant to the hotplug case and it's misplaced otherwise. I certainly can add
> the duplicated data as output-only attributes.
> 
> The XML with the new elements should look like:
> 
> <domain>
>   [...]
>   <vcpu current='1'>3</vcpu>
>   <vcpus>
>     <vcpu id='0' state='enabled'/> <-- option 1, no extra data
>     <vcpu id='1' state='disabled' cell='1'/> <--- option 2, just numa node,
>                                                   since it's non-obvious
>     <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
>      <!-- option 3 all the data duplicated -->
>   </vcpus>
>   [...]
>   <cputune>
>     <cpupin ... />
>   </cputune>
>   [...]
>   <cpu>
>     <numa>
>       <cell id='0' cpus='0' memory='102400' unit='KiB/>
>       <cell id='1' cpus='1-2' memory='102400' unit='KiB/>
>     </numa>

In the 'virsh capabilities' XML, the CPUs are listed underneath the
<cell>. So rather than adding <vcpu> as children of <vcpus>, I think
perhaps we should just put then under the <cell> here too. That would
avoid the need to add a cell=NN attribute.eg

   <cpu>
     <numa>
       <cell id='0' cpus='0' memory='102400' unit='KiB>
         <cpus>
           <cpu id="0" state="enabled"/>
	 </cpus>
       </cell>
       <cell id='1' cpus='1-2' memory='102400' unit='KiB>
         <cpus>
           <cpu id="1" state="disabled"/>
           <cpu id="2" state="enabled"/>
         </cpus>
       </cell>
     </numa>

> # qemu/platform implementation caveats #
> 
> When starting the VM for the first time it might be necessary to start a
> throw-away qemu process to query some details that we'll need to pass in on a
> command line. I'm not sure if this is still necessary and I'll try to avoid it
> at all cost.

If that is needed, then I think we would need to fix QEMU to let us
introspect that at time we normally probe capabilities. Spawning an
extra throw-away QEMU is absolutely not an option.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|