[libvirt] Network device abstraction aka virtual switch - V3

Fri Jun 17 13:18:10 UTC 2011

On Sun, Jun 12, 2011 at 08:29:08PM -0400, Laine Stump wrote:
> This is a followup to
> https://www.redhat.com/archives/libvir-list/2011-April/msg00591.html
> (and an even earlier draft) which I alluded to here:
> 
>    https://www.redhat.com/archives/libvir-list/2011-June/msg00383.html
> 
> Network device abstraction aka virtual switch - V3
> ==================================================
> 
> The <interface> element of a guest's domain config in libvirt has a
> <source> element that describes what resources on a host will be used
> to connect the guest's network interface to the rest of the
> world. This is very flexible, allowing several different types of
> connection (virtual network, host bridge, direct macvtap connection to
> physical interface, qemu usermode, user-defined via an external
> script), but currently has the problem that unnecessary details of the
> host resources are embedded into the guest's config; if the guest is
> migrated to a different host, and that host has a different hardware
> or network config (or possibly the same hardware, but that hardware is
> currently in use by a different guest), the migration will fail.
> 
> I am proposing a change to libvirt's network XML that will allow us to
> (optionally - old configs will remain valid) remove the host details
> from the guest's domain XML (which can move around from host to host)
> and place them in the network XML (which remains with a single host);
> the domain XML will then use existing config elements to associate
> each guest interface with a "network".
> 
> The motivating use case for this change is the "direct" connection
> type (which uses macvtap for vepa and vnlink connections directly
> between a guest and a physical interface, rather than through a
> bridge), but it is applicable for all types of connection. (Another
> hopeful side effect of this change will be to make libvirt's network
> connection model easier to realize on non-Linux hypervisors (eg,
> VMWare ESX) and for other network technologies, such as openvswitch,
> VDE, and various VPN implementations).
> 
> Background
> ==========
> 
> (parts lifted from Dan Berrange's last mail on this subject)
> 
> Currently <network> supports 3 connectivity modes
> 
>  - Non-routed network, separate subnet        (no <forward> element present)
>  - Routed network, separate subnet with NAT   (<forward mode='nat'/>)
>  - Routed network, separate subnet            (<forward mode='route'/>)
> 
> Each of these is implemented in the existing network driver by
> creating a bridge device using brctl, and connecting the guest network
> interfaces via tap devices (a detail which, now that I've stated it,
> you should promptly forget!). All traffic between that bridge and the
> outside network is done via the host's IP routing stack (ie, there is
> no physical device directly connected to the bridge)
> 
> In the future, these two additional routed modes might be useful:
> 
>  - Routed network, IP subnetting
>  - Routed network, separate subnet with VPN
> 
> The core goal of this proposal, though, is to replace type=bridge and
> type=direct from the domain interface XML with new types of <network>
> definitions so that the domain can just give "type='network'" and have
> all the necessary details filled in at runtime. This basically means
> we're adding several bridging modes (the submodes of "direct" have
> been flattened out here):
> 
>  - Bridged network, eth + bridge + tap
>  - Bridged network, eth + macvtap + vepa
>  - Bridged network, eth + macvtap + private
>  - Bridged network, eth + macvtap + passthrough
>  - Bridged network, eth + macvtap + bridge
> 
> Another "future expansion" could be to add:
> 
>  - Bridged network, with VPN
> 
> Likewise, support for other technologies, such as openvswitch and VDE
> would each be another entry on this list.
> 
> (Dan also listed each of the above "+sriov" separately, but that ends
> up being handled in an orthogonal manner (by just specifying a pool of
> interfaces for a single network), so I'm only giving the abbreviated
> list)
> 
> I. Changes to domain <interface> element
> ========================================
> 
> In many cases, the <interface> element of the domain XML will be
> identical to what is used now when connecting the interface to a
> libvirt-style virtual network:
> 
> <interface type='network'>
> <source network='red-network'/>
> <mac address='xx:xx:xx:xx:xx:xx'/>
> </interface>
> 
> Depending on the definition of the network "red-network" on the host
> the guest was started on / migrated to, this could be either a direct
> (macvtap) connection using one of the various direct modes
> (vepa/private/bridge/passthrough), a bridge (again, pointed to by the
> definition of 'red-network'), or a virtual network (using the current
> network definition syntax). This way the same guest could be migrated
> not only between macvtap-enabled hosts, but from there to a host using
> a bridge, or maybe a host in a remote location that used a virtual
> network with a secure tunnel to connect back to the rest of the
> red-network.

When I originally thought of the goal of making the guest networking
XML "host independent", I was mainly thinking in terms of avoiding
physical network device names. Obviously this design could also
enable us to change the type of connection (bridge/vepa/etc), but this
feels like a secondary goal, because I believe it would interrupt
the guest's network connections, so migration would not be
seamless to the guest.

> <virtualport> element of <interface>
> ------------------------------------
> 
> Since many of the attributes/sub-elements of <virtualport> (used by
> some modes of "direct" interface connections) are identical for all
> interfaces connecting to any given switch, most of the information in
> <virtualport> will be optional in the domain's interface definition -
> it can be filled in from a similar <virtualport> element that will be
> added to the <network> definition.
> 
> Some parameters in <virtualport> ("instanceid", for example) must be
> unique for every interface, though, so those will still be specified
> in the <interface> XML. The two <virtualport> elements will be OR'ed
> at runtime to arrive at the actual set of parameters that are
> used.
> 
> (Open Question: What should be the policy when a parameter is
> specified in both places? Should one take precedence? Or should it be
> considered an error?)

The guest <interface> XML should in general take precedence, since
that is considered a specialization. I believe some of the VEPA
parameters shouldn't be overridable per guest though, since they
are really host-level configuration.

> portgroup attribute of <source>
> -------------------------------
> 
> The <source> element of an interface definition will be able to
> optionally specify a "portgroup" attribute. If portgroup is *NOT*
> given, the default (first) portgroup of the network will be used (if
> any are defined). If portgroup *IS* specified, the source network must
> have a portgroup by that name (or the domain startup/migration will
> fail), and the attributes of that portgroup will be used for the
> connection. Here is an example <interface> definition that has both a
> reduced <virtualport> element, as well as a portgroup attribute:
> 
> <interface type='network'>
> <source network='red-network' portgroup='engineering'/>
> <virtualport type="802.1Qbg">
> <parameters instanceid="09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"/>
> </virtualport>
> <mac address='de:ad:be:ef:ca:fe'/>
> </interface>
> 
> (The specifics of what can be in a portgroup are given below)
> 
> 
> II. Changes to <network> definition
> ===================================
> 
> As Dan has pointed out, any additions to <network> must be designed so
> that existing management applications (written to understand <network>
> prior to these new additions) will at least recognize that the XML
> they've been given is for something new that they don't fully
> understand. At the same time, the new types of network definition
> should attempt to re-use as much of the existing elements/attributes
> as possible, both to make it easier to extend these applications, as
> well as to make the status displays of un-updated applications make as
> much sense as possible.
> 
> Dan's suggestion (which I obviously endorse :-) is that the new types
> of network should be specified by extending the choices for <forward
> mode='....'>.
> 
> He also suggested adding a new "layer='network|link'" attribute to
> <forward>. I'm not convinced that item is necessary (it seems
> redundant), but am including it here for sake of discussion.
> 
> The current modes are:
> 
> <forward layer='network' mode='route|nat'/>
> 
> (in addition to not listing any mode, which equates to "isolated")
> 
> Here are suggested new modes:
> 
> <forward layer='link'
>          mode='bridge-brctl|vepa|private|passthrough|bridge-macvtap'/>
> 
> A description of each:
> 
> bridge-brctl - equivalent to "<interface type='bridge'>" in the
>                interface definition. The bridge device to use would be
>                given in the existing <forward dev='xxx'>. (Dan also
>                suggests putting this in <network>'s <bridge
>                name='xxx'/> - opinions?)
>                (Question: better name for this?)
> 
> vepa         - same as "<interface type='direct'>..." with <source
>                mode='vepa'/>
> 
> private      - <interface type='direct'> ... <source mode='private'/>
> 
> passthrough  - <interface type='direct'> ... <source mode='passthrough'/>
> 
> bridge-macvtap - <interface type='direct'> ... <source mode='bridge'/>
>                (Question: better name for this?)

I like the suggestion elsewhere in this thread of detecting
whether to do macvtap vs brctl based on the interface declared,
so we could then just use 'bridge' as the name.

eg, Do macvtap mode:

    <forward mode='bridge' dev='eth0'/>

Or do brctl mode:

    <forward mode='bridge'/>
    <bridge name='br0'/>

(Remember that '<bridge>' element already exists in our schema
 so we might as well use it)
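
As a rough illustration of that detection rule, here is a minimal Python
sketch. It is not libvirt code: the function name bridge_backend is made
up, and treating XML that gives both a dev attribute and a <bridge>
element as an error is my assumption, not something decided in this
thread. Only the presence of <bridge> is checked, not its attributes.

```python
import xml.etree.ElementTree as ET


def bridge_backend(network_xml: str) -> str:
    """Guess the backend for a <network> using <forward mode='bridge'>.

    Hypothetical helper: returns 'macvtap' when <forward> names a
    physical device, 'brctl' when a <bridge> element is present.
    """
    root = ET.fromstring(network_xml)
    forward = root.find("forward")
    if forward is None or forward.get("mode") != "bridge":
        raise ValueError("not a bridge-mode network")

    has_dev = forward.get("dev") is not None
    has_bridge = root.find("bridge") is not None

    # Assumption: giving both is ambiguous, so reject it.
    if has_dev and has_bridge:
        raise ValueError("both <forward dev=...> and <bridge> given")
    if has_dev:
        return "macvtap"
    if has_bridge:
        return "brctl"
    raise ValueError("neither <forward dev=...> nor <bridge> given")


if __name__ == "__main__":
    print(bridge_backend(
        "<network><forward mode='bridge' dev='eth0'/></network>"))
    print(bridge_backend(
        "<network><forward mode='bridge'/><bridge name='br0'/></network>"))
```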

> Interface Pools
> ---------------
> 
> In many cases, a single host network may have multiple physical
> network devices associated with it (especially in the case of an
> SRIOV-capable ethernet card, which will have several "virtual
> functions" associated with a single physical ethernet connection). The
> host will at least want to balance the load of multiple guests between
> these multiple devices, and may even require (in the case of
> passthrough mode, for example) that only a single guest interface be
> attached to each host device.
> 
> The current specification for <forward> only allows for a single "dev"
> attribute, though. In order to support multiple device names, we will
> extend <forward> to allow 0 or more <interface> sub-elements:
> 
> <forward mode='vepa' dev='eth10'>
> <interface dev='eth10'/>
> <interface dev='eth11'/>
> <interface dev='eth12'/>
> <interface dev='eth13'/>
> </forward>
> 
> Note that, as a convenience, the first of these elements will always
> be a duplicate of the "dev" attribute in <forward> itself. (Is this
> necessary/desirable?)

Yes, it is a key backwards/forwards compatibility issue. Currently
applications will just be doing an XPath of "/network/forward/@dev".

New applications will want to ignore '@dev' completely and just
do  "/network/forward/interface/@dev".

If we didn't duplicate the <forward @dev/> attribute as the
first child <interface>, then new applications would have to
run two XPath queries to get the information out. We might
also want to add further attributes to <interface> in the
future, so we want all interfaces listed there regardless.
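
To make the two access patterns concrete, here is a small Python sketch
using the standard library's ElementTree. The sample XML is only an
illustration of the proposed layout. ElementTree's limited XPath
support cannot select attributes directly, so the sketch finds the
element and then reads the attribute.

```python
import xml.etree.ElementTree as ET

# Illustrative network definition following the proposed pool layout.
NETWORK_XML = """
<network>
  <name>red-network</name>
  <forward mode='vepa' dev='eth10'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
  </forward>
</network>
"""

root = ET.fromstring(NETWORK_XML)

# What existing applications do: /network/forward/@dev
legacy_dev = root.find("forward").get("dev")

# What new applications would do: /network/forward/interface/@dev
pool = [iface.get("dev") for iface in root.findall("forward/interface")]

print("legacy view:", legacy_dev)
print("pool view:  ", pool)
```

Because the first <interface> repeats the dev attribute, the pool view
alone is enough for a new application.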

> In the case of mode='passthrough', only one guest interface can be
> connected to a device at a time. libvirt will keep track of which
> devices are in use, and attempt to assign a free device; failure to
> assign a device will result in a failure of the domain to
> start/migrate. For the other direct modes, libvirt will simply keep
> track of the number of guest interfaces currently using each device,
> and attempt to keep them balanced.
> 
> (Open question: where will we keep track of this allocation/assignment?)

That's a job for the network driver. As we do when running
QEMU guests, the network driver would want to keep a persistent
state file in /var/lib/libvirt/networks to store any data like
this which needs to be preserved across libvirtd restarts/crashes.
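
A minimal sketch of what that bookkeeping might look like, in Python
purely for illustration (the real driver is C). The class name
DevicePool, the JSON state format and the state file path passed in by
the caller are all assumptions of mine. The allocation policy follows
the proposal: passthrough needs an unused device, other modes pick the
least-loaded one.

```python
import json
from pathlib import Path


class DevicePool:
    """Hypothetical per-network usage tracker with a persistent state file."""

    def __init__(self, devices, mode, state_file):
        self.mode = mode
        self.state_file = Path(state_file)
        self.usage = {dev: 0 for dev in devices}
        # Reload counts saved before a restart/crash, if any.
        if self.state_file.exists():
            saved = json.loads(self.state_file.read_text())
            for dev, count in saved.items():
                if dev in self.usage:
                    self.usage[dev] = count

    def _save(self):
        self.state_file.write_text(json.dumps(self.usage))

    def allocate(self):
        # Least-loaded device; ties go to the earliest one listed.
        dev = min(self.usage, key=self.usage.get)
        if self.mode == "passthrough" and self.usage[dev] > 0:
            # Every device already has a guest attached.
            raise RuntimeError("no free device in pool")
        self.usage[dev] += 1
        self._save()
        return dev

    def release(self, dev):
        if self.usage.get(dev, 0) == 0:
            raise ValueError(f"device {dev!r} is not in use")
        self.usage[dev] -= 1
        self._save()
```

A caller would create one DevicePool per network, call allocate() when
a guest interface starts and release() when it stops; a failed
allocate() corresponds to the domain failing to start/migrate.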

> Portgroups
> -----------
> 
> A <portgroup> (sub-element of <network>) is just a way of easily
> putting connections to the network into different classes, with each
> class having a different level/type of service. Each <network> can
> have multiple <portgroup> elements, and each <portgroup> has a name,
> as well as various attributes associated with it. The first thing we
> will use portgroups for is as an alternate place to specify
> <virtualport> parameters:
> 
> <portgroup name='engineering'>
> <virtualport type="802.1Qbg">
> <parameters managerid="11" typeid="1193047" typeidversion="2"/>
> </virtualport>
> </portgroup>
> 
> Anything that is valid in an interface's <virtualport> is also valid here.
> 
> The next thing to specify in a portgroup will be bandwidth limiting /
> QoS configuration. Since I don't know exactly what's needed for that,
> I won't specify it here.
> 
> If anything is specified both directly under <network> and in a
> <portgroup>, the value in portgroup will take precedence. (Again -
> what will the precedence of items specified in the <interface> be?)

Precedence should go from most specific to least specific, ie

  1. Guest <interface>
  2. Network <portgroup>
  3. Network top level
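
That ordering can be sketched as a simple layered merge. This is a
Python illustration only: the function names and the use of plain
dicts for <virtualport> parameters are my assumptions, and it does not
model the idea that some host-level VEPA parameters should not be
overridable per guest. The default-portgroup rule (first one defined)
is taken from the proposal.

```python
def pick_portgroup(portgroups, name=None):
    """Return the parameters of the named portgroup.

    With no name, fall back to the first portgroup defined (or nothing
    if the network has none). A named portgroup must exist.
    """
    if name is None:
        return next(iter(portgroups.values()), {})
    if name not in portgroups:
        raise LookupError(f"network has no portgroup {name!r}")
    return portgroups[name]


def effective_virtualport(network_params, portgroups, iface_params,
                          portgroup=None):
    """Merge parameters, least specific first so later layers win."""
    merged = dict(network_params)                          # 3. network top level
    merged.update(pick_portgroup(portgroups, portgroup))   # 2. portgroup
    merged.update(iface_params)                            # 1. guest interface
    return merged


if __name__ == "__main__":
    # Values borrowed from the vepa example in the proposal.
    network = {"managerid": "11", "typeid": "1193047", "typeidversion": "2"}
    groups = {"accounting": {"typeid": "22"},
              "engineering": {"typeid": "33"}}
    iface = {"instanceid": "09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"}
    print(effective_virtualport(network, groups, iface, "engineering"))
```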

> EXAMPLES
> --------
> 
> Examples of 'red-network' for different types of connections (all of
> these would work with minor variations of the interface XML given
> above, e.g. the 'vepa' version would require <virtualport> in the
> interface that specified an instanceid, and if the <interface>
> specified a portgroup, it would need to also be in the <network>
> definition (even if it was empty aside from name)).
> 
> 
> <!-- Existing usage - a libvirt virtual network -->
> <network>
> <name>red-network</name>
> <bridge name='virbr0'/>
> <forward layer='network' mode='route'/>
>         ...
> </network>
> 
> <!-- The simplest - an existing host bridge -->
> <network>
> <name>red-network</name>
> <forward mode='bridge-brctl' dev='br0'/>
> </network>
> 
> <!-- A macvtap connection to a vepa bridge -->
> <network>
> <name>red-network</name>
> <forward layer='link' mode='vepa' dev='eth10'/>
> <virtualport type='802.1Qbg'>
> <parameters managerid='11' typeid='1193047' typeidversion='2'/>
> </virtualport>
> <!-- NB: if <interface> doesn't specify portgroup, -->
> <!-- 'accounting' is assumed -->
> <portgroup name='accounting'>
> <virtualport>
> <parameters typeid='22'/>
> </virtualport>
> </portgroup>
> <portgroup name='engineering'>
> <virtualport>
> <parameters typeid='33'/>
> </virtualport>
> </portgroup>
> </network>
> 
> <!-- A macvtap passthrough connection (one guest interface per dev) -->
> <network>
> <name>red-network</name>
> <forward layer='link' mode='passthrough' dev='eth10'>
> <interface dev='eth10'/>
> <interface dev='eth11'/>
> <interface dev='eth12'/>
> <interface dev='eth13'/>
> <interface dev='eth14'/>
> <interface dev='eth15'/>
> <interface dev='eth16'/>
> <interface dev='eth17'/>
> </forward>
> </network>
> 
> =============
> 
> Open Questions:
> 
> * Is there a good reason to include the "layer='network|link'"
>   attribute in forward? (maybe just because it's useful info for a
>   management application that doesn't know the details of the modes?)
>   Or is it redundant?

I think it is likely redundant. We can leave it out for now and if
we feel a need, can add it back in the future. It was to be primarily
an output-only attribute, though I did perhaps think we could let
you do <forward layer='network'/> and use that to auto-pick a
suitable mode, but let's not bother.

> * What should be the policy when a virtualport parameter is specified
>   in both the <interface> and the <network>/<portgroup>? Should one take
>   precedence? Or should it be considered an error?
> 
> * Is it okay for the domain's own definition to specify what portgroup
>   it will be in? Or are there cases where we want to allow someone to
>   modify their domain XML, but force them into a particular portgroup
>   beyond their control?

Yes, the domain should be able to specify the portgroup. Access
control over portgroup usage is a matter for a more general
ACL system in libvirt drivers.

> * Is it really necessary/desirable for the first ethernet device in a
>   pool to be duplicated in the <forward dev='xxx'...> attribute? Or
>   can that attribute be omitted when there is a pool of devices?

Yes, it is key to getting good backwards/forwards compatibility
and simplifying app usage.

It should of course be an error if both are specified when giving
XML to libvirt and they conflict in what they say. Typically
an app should only specify one of them for input though.
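
A sketch of how such an input check could work, again in Python for
illustration only. The function name forward_devices is made up, and I
am assuming "conflicting" means the dev attribute differs from the
first <interface> child, since the proposal says the first child
duplicates the attribute.

```python
import xml.etree.ElementTree as ET


def forward_devices(network_xml: str) -> list:
    """Return the device list for <forward>, accepting either input form.

    Hypothetical normalisation: the dev attribute alone, the <interface>
    children alone, or both if they agree. Conflicting input is an error.
    """
    forward = ET.fromstring(network_xml).find("forward")
    if forward is None:
        return []

    attr_dev = forward.get("dev")
    pool = [iface.get("dev") for iface in forward.findall("interface")]

    if attr_dev is None:
        return pool
    if not pool:
        return [attr_dev]
    # Assumption: the attribute must match the first pool entry.
    if pool[0] != attr_dev:
        raise ValueError(
            f"<forward dev={attr_dev!r}> conflicts with first "
            f"<interface dev={pool[0]!r}>")
    return pool
```

On output libvirt would then always emit both forms, so old and new
applications each find what they look for.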

> * Where will we keep track of the count of guest interfaces connected
>   to each host interface device, and where will we keep track of which
>   device is being used by a particular guest interface? In the
>   network/domain XML?

In the network driver, I reckon.

> * Does anyone have better names for "bridge-brctl" and
>   "bridge-macvtap"?

'brctl' and 'macvtap' are both impl details, so we don't
really want to expose them. Just have one called 'bridge'
which is a reflection of the connection type.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|