[libvirt] [Qemu-devel] [RFC 0/5] block: File descriptor passing using -open-hook-fd

Mon Jul 9 20:29:15 UTC 2012

On 07/09/2012 02:00 PM, Anthony Liguori wrote:

>> with the fd:name approach, the sequence is:
>>
>> libvirt calls getfd:name1 over normal monitor
>> qemu responds
>> libvirt calls getfd:name2 over normal monitor
>> qemu responds
>> libvirt calls transaction around blockdev-snapshot-sync over normal
>> monitor, using fd:name1 and fd:name2
>> qemu responds

This general layout is true whether we rewrite all commands to
understand fd:nnn (proposal 1) or whether we add new magic parsing
(/dev/fd/nnn of proposal 3, or even /dev/fdset/nnn of proposal 5), all
as called out in these messages:

https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg00227.html
https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01098.html

>>
>> but with -open-hook-fd, the approach would be:
>>
>> libvirt calls transaction
>> qemu calls open(file1) over hook
>> libvirt responds
>> qemu calls open(file2) over hook
>> libvirt responds
>> qemu responds to the original transaction

whereas this approach is quite different in semantics, but may indeed be
easier for qemu to implement, at the expense of some more complexity on
the part of libvirt.

At the high level, I think both approaches have one thing in common: by
refactoring all qemu code to go through qemu_open(), we can then
implement our desired complexity (whether fd:nn, /dev/fd/nnn,
/dev/fdset/nnn, or some other magic name parsing; or whether it is an
rpc call over a second socket in parallel to the monitor socket) in just
one location.  Likewise, both approaches have to deal with libvirtd
restarts (magic name parsing by changing an 'inuse' flag when the
monitor detects EOF; rpc passing by failing a qemu_open() when the rpc
socket detects EOF).

>>
>> The 'transaction' operation is thus blocked by the time it takes to do
>> two intermediate opens over a second channel, which kind of defeats the
>> purpose of making the transaction take effect with minimal guest
>> downtime.
> 
> How are you defining "guest down time"?
> 
> It's important to note that code running in QEMU does not equate to
> guest visible down time unless QEMU does an explicit vm_stop() which is
> not happening here.
> 
> Instead, a VCPU may become blocked *if* it attempts to acquire qemu_mute
> while QEMU is holding it.
> 
> If your concern is qemu_mutex being held while waiting for libvirt, it
> would be fairly easy to implement a qemu_open_async() that dropped
> allowed dropping back to the main loop and then calling a callback when
> the open completes.
> 
> It would be pretty trivial to convert qmp_transaction to use such a
> command.

In other words, remembering that transactions are divided into phases:

phase 1 - prepare: obtain all needed fds (whether by pre-opening them
via 'pass-fd' or other new 'getfd' relative, or whether by RPC calls);
no guest downtime, and with cleanup that avoids any leaks on any failures
phase 2 - commit: flush all devices and actually make the changes in
qemu state to use the fds obtained in phase 1

and where the guest downtime (if any) is more likely due to flushing
changes in phase 2

> 
> But this is all speculative.  There's no reason to believe that an RPC
> would have a noticable guest visible latency unless you assume there's
> lot contention.  I would strongly suspect that the bdrv_flush() is going
> to be a much greater source of lock contention than the RPC would be. 
> An RPC is only bound by scheduler latency whereas synchronous disk I/O
> is bound spinning a platter.
> 
>> And libvirt code becomes a lot trickier to deal with the fact
>> that two channels are in use, and that the channel that issued the
>> 'transaction' command must block while the other channel for handling
>> hooks must be responsive.
> 
> All libvirt needs to do is listen on a socket and delegate access
> according to a white list.  Whatever is providing fd's needs to have no
> knowledge of anythign other than what the guest is allowed to access
> which shouldn't depend on an executing command.

That's not quite accurate.  What the guest is allowed to access should
indeed change depending on the executing command.  That is, if I start a
guest with:

base <- delta

then I only want to permet O_RDONLY access to base but O_RDWR access to
delta.  If I then call 'blockdev-snapshot-sync', I want to change to the
situation:

base <- delta <- snap

and give O_RDWR permissions to snap; it would also be nice if qemu
attempts to reopen delta with O_RDONLY permissions (although from a
trust perspective, libvirt must assume that delta is still O_RDWR
because qemu may have been compromised and lie about the tightening of
permissions); at any rate, depending on SELinux capabilities of the
file, libvirt may be able to enforce no further writes to 'delta' by
toggling a SELinux label (obviously, this should only be done after
'blockdev-snapshot-sync' completes).

On the other hand, the user could decide to do a 'block-commit', to
squash things into:

base

where base is now O_RDWR.  But libvirt doesn't want to grant
write-access to 'base' up-front, so the whitelist allowing O_RDWR access
to base is indeed dependent on the user running a monitor command that
would cause qemu to need to change to a more permissive access in the
first place.

Maybe the only reason that I'm still leaning towards a 'pass-fd'
solution instead of a hook fd solution is that libvirt would have less
code to write to get it working.  But it was originally Dan's complaint
that an rpc solution has too much risk for deadlock or leaks; if we are
confident that we can avoid deadlock, and that the idea of passing in
fds in response to an rpc involves less bookkeeping and speculation on
libvirt's part about which monitor commands will require opening an fd,
then maybe it really is a better technical solution, even if it costs
more libvirt code to implement.

-- 
Eric Blake   eblake at redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 620 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20120709/3b13754f/attachment-0001.sig>