[libvirt] Re: Supporting vhost-net and macvtap in libvirt for QEMU

Fri Dec 18 12:49:27 UTC 2009

On Thursday 17 December 2009, Anthony Liguori wrote:

> When invoking qemu directly, for the first go about, I'd expect -net 
> vhost,dev=eth0 for a raw device and -net vhost,mode=tap,tap-arguments.
> 
> Long term, there are so many possible ways to layer things, that I'd 
> really like to see:
> 
> -net vepa,dev=eth0
> 
> Which ends up invoking /usr/libexec/qemu-net-helper-vepa --arg-dev=eth0 
> --socketpair=X --try-vhost.
> 
> qemu-net-helper-vepa would do all of the fancy stuff of creating a 
> macvtap device, trying to hook that up with vhost, sending us an fd over 
> the socketpair telling us which interface it's using and what features 
> were enabled.

We need to make sure not to hardcode a dependency of VEPA on macvtap
in your example, so I'm not sure a VEPA-specific helper is helpful.

We really have a tuple of policy, kernel implementation and qemu implementation,
with many possible combinations. Currently there are at least these (ignoring
the UDP, TCP and VDE modes):

nat-socket-user
nat-bridge-tap
nat-bridge-tap+vhost
route-none-tap
route-none-tap+vhost
route-veth+macvlan-tap
route-veth+macvlan-tap+vhost
route-veth+macvlan-socket
route-veth+macvlan-socket+vhost
veb-bridge-tap
veb-bridge-tap+vhost
veb-macvlan-tap
veb-macvlan-tap+vhost
veb-macvlan-socket
veb-macvlan-socket+vhost
veb-sriov-socket
veb-sriov-socket+vhost
vepa-macvlan-tap
vepa-macvlan-tap+vhost
vepa-macvlan-socket
vepa-macvlan-socket+vhost
vepa-sriov-socket
vepa-sriov-socket+vhost
private-macvlan-tap
private-macvlan-tap+vhost
private-macvlan-socket
private-macvlan-socket+vhost
private-sriov-socket
private-sriov-socket+vhost
private-physdev-socket
private-physdev-socket+vhost

If my plans for extending macvlan for SR-IOV work out, we will also
have

bridge-sriov-tap
bridge-sriov-tap+vhost
vepa-sriov-tap
vepa-sriov-tap+vhost
private-sriov-tap
private-sriov-tap+vhost

As you can see, the policy is mostly independent from the qemu
implementation and even from the kernel implementation. Naming the
macvtap code in qemu '-net vepa' would completely mix up things
for people that want to use vepa with an SR-IOV card, or macvtap
in bridge mode!
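
To make the tuple idea concrete, here is a minimal Python sketch that splits
the names from the lists above into their three parts. The NetSetup type and
parse_setup function are made up for illustration and are not part of qemu
or libvirt; the only assumption is the policy-kernel-qemu naming used above.

```python
from typing import NamedTuple


class NetSetup(NamedTuple):
    policy: str    # e.g. "nat", "route", "veb", "vepa", "private"
    kernel: str    # e.g. "bridge", "macvlan", "sriov", "veth+macvlan"
    backend: str   # qemu side: "user", "tap" or "socket"
    vhost: bool    # True if the name ends in "+vhost"


def parse_setup(name: str) -> NetSetup:
    """Split a name such as 'vepa-macvlan-tap+vhost' into its parts."""
    policy, kernel, qemu = name.split("-")
    backend, _, accel = qemu.partition("+")
    return NetSetup(policy, kernel, backend, accel == "vhost")


# The same policy shows up with different kernel and qemu implementations.
a = parse_setup("vepa-macvlan-tap+vhost")
b = parse_setup("vepa-sriov-socket")
```

Both a and b carry the policy "vepa", but only a involves macvlan, which is
why a name like '-net vepa' cannot stand for the macvtap code alone.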

The concept with the callout to an external program to deal with the
enormous number of variations absolutely makes sense, but the
naming needs to get better. In particular, I think that the policy
should be only known between the helper and libvirt (or the user),
but not show up anywhere in qemu, which can just pass all the options
to the helper, and let that one decide what to do. E.g.
"qemu -net host,mode=vepa,dev=eth0" can result in calling
"/usr/libexec/qemu-net-helper --mode=vepa --dev=eth0 --socketpair=X
--protocols=tap,socket,vhost".

Then qemu-net-helper tries to find the best way to set up a vepa
on eth0, given the choice of tap, socket, tap+vhost or socket+vhost,
the system capabilities (sr-iov, macvlan, macvtap driver) and the
permissions of the user it is running as.
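
A rough Python sketch of that selection step, for the vepa-on-eth0 case only.
Everything here is an assumption made for illustration: the function name,
the capability strings, the preference order (vhost first, tap before
socket), and the rule that the socket path needs extra privileges because
raw packet sockets require CAP_NET_RAW.

```python
# Hypothetical preference order, most desirable first.
PREFERENCE = ["tap+vhost", "socket+vhost", "tap", "socket"]


def choose_backend(offered, capabilities, privileged):
    """Pick a backend for a vepa setup, or None if nothing fits.

    offered      -- protocols qemu listed in --protocols, e.g. {"tap", "vhost"}
    capabilities -- what the host provides, e.g. {"macvtap", "vhost"}
    privileged   -- True if the helper may open raw packet sockets
    """
    for candidate in PREFERENCE:
        parts = candidate.split("+")
        if not all(p in offered for p in parts):
            continue                      # qemu cannot speak this one
        if "vhost" in parts and "vhost" not in capabilities:
            continue                      # no vhost support on the host
        if "tap" in parts and "macvtap" not in capabilities:
            continue                      # assumed: vepa via tap needs macvtap
        if "socket" in parts and not privileged:
            continue                      # raw sockets need CAP_NET_RAW
        return candidate
    return None


best = choose_backend({"tap", "socket", "vhost"}, {"macvtap", "vhost"}, False)
```

qemu itself never sees the policy in this scheme; it only learns which of the
protocols it offered was picked.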

> That lets people infinitely extend qemu's networking support while allow 
> us to focus on just implementing backends for the interfaces we're 
> exposed to.  AFAICT, that's just /dev/vhost, /dev/net/tun, and a normal 
> socket.  The later two can be reduced to a single read/write interface 
> honestly.

Well, I think you are still required to use sendmsg/recvmsg with the
raw socket, not write/read, but aside from that I agree.
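
For illustration, this is roughly what the message-based calls look like in
Python. It is only a sketch: an AF_UNIX datagram socketpair stands in for the
raw socket so that it runs without privileges, and the send_frame and
recv_frame names are made up.

```python
import socket

# Stand-in for the packet socket: a datagram socketpair preserves frame
# boundaries the way a raw socket does, but needs no privileges.
guest_side, wire_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)


def send_frame(sock, header, payload):
    # sendmsg() accepts a list of buffers (scatter/gather), so header and
    # payload do not have to be copied into one buffer first.
    return sock.sendmsg([header, payload])


def recv_frame(sock, bufsize=65536):
    # recvmsg() returns the data together with ancillary data and message
    # flags, which a plain read() has no way to report.
    data, _ancdata, msg_flags, _addr = sock.recvmsg(bufsize, 1024)
    truncated = bool(msg_flags & socket.MSG_TRUNC)
    return data, truncated


header = bytes(14)          # placeholder for an Ethernet header
payload = b"hello"
send_frame(guest_side, header, payload)
frame, truncated = recv_frame(wire_side)
```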

> No, net/ would essentially become a series of helper programs.  What's 
> nice about this approach is that libvirt could potentially use helpers 
> too which would allow people to run qemu directly based on the output of 
> ps -ef.  Would certainly make debugging easier.

Right. Also, if we put the helpers into netcf or a similar library, more applications
that are unrelated to qemu could use them, e.g. user-mode-linux, if they
are interested.

> > Nope, not at all ;-)
> >
> > We do need to know if a VF is available or not (and if a PF has any of
> > its VFs used).
> 
> "We need to know" or "it would be nice to know"?
> 
> You can make the same argument about a physical network interface.

The difference from what we have today is that you can add an arbitrary
number of taps to a bridge, so you don't need to know if any other
guests are running when you add another one.
But when you assign a VF to a guest, you need to be sure that no other
guest uses the same VF, so this needs system-wide coordination.

libvirt can keep the state if it manages all guests, but if you want to
run guests without libvirt, you need something like lock-files.
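
A minimal sketch of the lock-file idea in Python, assuming one lock file per
VF and advisory flock() locks. The directory, the file naming and the
claim_vf function are all hypothetical; nothing like this exists in qemu or
libvirt as far as this mail is concerned.

```python
import fcntl
import os

LOCK_DIR = "/var/lock/qemu-vf"   # hypothetical location


def claim_vf(vf_name, lock_dir=LOCK_DIR):
    """Try to claim a VF system-wide.

    Returns an open file descriptor holding the lock, or None if another
    process (or another claim in this process) already holds it.
    """
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, vf_name + ".lock")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Non-blocking exclusive lock: fail right away if the VF is taken.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # Keep fd open for the lifetime of the guest; the kernel drops the
    # lock when the descriptor is closed or the process exits.
    return fd
```

Because the kernel releases the lock when the holder exits, a crashed guest
does not leave a stale claim behind, which a plain "file exists" check would.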

	Arnd