[libvirt] RFC / Braindump: public APIs needing data streams

Tue May 26 19:35:23 UTC 2009

On Tue, May 26, 2009 at 06:57:18PM +0100, Richard W.M. Jones wrote:
> FWIW this is the libguestfs RPC protocol:
> 
> http://et.redhat.com/~rjones/libguestfs/guestfs.3.html#communication_protocol
> http://git.et.redhat.com/?p=libguestfs.git;a=blob;f=src/guestfs_protocol.x;hb=HEAD
> 
> It's not directly relevant because at present the server is single-
> threaded and answers calls in order.

It is actually pretty relevant from the wire protocol POV, and matches
the ideas I'd been having. With your chunked encoding, you've only got
4 bytes overhead per chunk sent. I was thinking of introducing a new
message type to the existing three:

enum remote_message_direction {
    REMOTE_CALL = 0,            /* client -> server */
    REMOTE_REPLY = 1,           /* server -> client */
    REMOTE_MESSAGE = 2          /* server -> client, asynchronous [NYI] */
};

aka, 

  REMOTE_DATA_CHUNK = 3

This indicates a message which has a 'struct remote_message_header'
followed by the data. The idea of this new type, instead of REMOTE_MESSAGE,
is that we treat the payload of REMOTE_DATA_CHUNK as totally opaque and
thus avoid the extra data copies inherent in defining the payload to be
an XDR byte array.  So my idea would have 24 bytes overhead per chunk
instead of your four. It would also allow us to maintain concurrency,
with other threads making RPC calls over the same socket, and those
calls being interleaved with individual data chunk messages.
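
As a rough illustration of the framing described above, here is a minimal
sketch of how such a header might be written in front of an opaque chunk.
It assumes the header is six 32-bit big-endian fields, which would account
for the 24 bytes; the struct name 'chunk_header', its field names and the
helpers encode_chunk_header() / is_data_chunk() are hypothetical, not
existing libvirt code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three existing message types plus the proposed addition. */
enum remote_message_direction {
    REMOTE_CALL = 0,            /* client -> server */
    REMOTE_REPLY = 1,           /* server -> client */
    REMOTE_MESSAGE = 2,         /* server -> client, asynchronous */
    REMOTE_DATA_CHUNK = 3       /* proposed: header + opaque payload */
};

/* ASSUMPTION: six 32-bit fields, i.e. 24 bytes on the wire. */
struct chunk_header {
    uint32_t prog;
    uint32_t vers;
    uint32_t proc;
    uint32_t direction;
    uint32_t serial;
    uint32_t status;
};

enum { CHUNK_HEADER_LEN = 24 };

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Load a 32-bit value stored in network byte order. */
static uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write only the header; the payload is never touched or copied here.
 * Returns the number of bytes written, or 0 if 'out' is too small. */
size_t encode_chunk_header(unsigned char *out, size_t outlen,
                           const struct chunk_header *h)
{
    assert(out != NULL && h != NULL);
    if (outlen < CHUNK_HEADER_LEN)
        return 0;
    put_u32(out + 0, h->prog);
    put_u32(out + 4, h->vers);
    put_u32(out + 8, h->proc);
    put_u32(out + 12, h->direction);
    put_u32(out + 16, h->serial);
    put_u32(out + 20, h->status);
    return CHUNK_HEADER_LEN;
}

/* Receiver side: decide whether what follows is an opaque data chunk. */
int is_data_chunk(const unsigned char *hdr, size_t len)
{
    return len >= CHUNK_HEADER_LEN &&
           get_u32(hdr + 12) == (uint32_t)REMOTE_DATA_CHUNK;
}
```

Because only the header is encoded, a sender could hand the header buffer
and the caller's data buffer to the socket as two separate pieces (for
instance with writev(2)), so the payload itself never passes through XDR.
How the receiver learns the payload length is left out of this sketch.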

> These are the relevant points of the file transfer system:
> 
> - At the API level, you pass in filenames.  The caller is responsible
> for creating a named pipe in the filesystem, or passing in names like
> "/dev/fd/N".

That has the problem though, that you can't necessarily assume that the
file handle you have has the data in the same encoding you want to
process it in.   In the case of libvirtd invoking a libvirt API to
handle an RPC request, the data is coming in off the client socket
and thus needs passing through SASL/TLS decryption. To do this with an
API taking a filename, you'd need to create a named pipe, read off
the socket, write into the pipe and then pass the pipe name to the API,
which adds several more data copies. Given the RAM size of VMs, this will
have a significant impact on CPU & memory bandwidth utilization during
migration. If we can pass the data directly from SASL/TLS decryption
to the driver, then we can limit ourselves to 2 data copies in the libvirt
space. Normal RPC calls have 3 copies in libvirt, the 3rd coming from
the XDR format deserialization, but we avoid the third with the custom
message type for data streams.

> - File transfers are sent using chunked encoding.  The key was to
> allow cancellation *initiated from either side* (not as easy as it
> seems).  So if an error occurs at either end, the transfer can be
> stopped almost immediately, and synchronization can be reestablished.
> The details are in the link above.

Yes, those are the points that are particularly fun / interesting. It
looks like the scenarios you've identified there all match up to those
I've been worrying about. So that's good reassurance that I'm thinking
along the right lines. I reckon the extra 20 bytes overhead per chunk of
using an explicit message type, instead of just sending a series of
len+payload chunks, is a worthwhile tradeoff in libvirt's case to allow
better message interleaving on the socket.
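
To put a number on that tradeoff, the following small sketch adds up the
framing bytes for a whole transfer. The function name framing_overhead()
is hypothetical, and the chunk size in the note below is only an assumed
figure for illustration; nothing here fixes a chunk size.

```c
#include <assert.h>
#include <stdint.h>

/* Total framing bytes needed to send 'total_bytes' of payload split into
 * chunks of at most 'chunk_size' bytes, with 'per_chunk_overhead' bytes
 * of header in front of every chunk. */
uint64_t framing_overhead(uint64_t total_bytes, uint64_t chunk_size,
                          uint64_t per_chunk_overhead)
{
    assert(chunk_size > 0);
    /* Round up: a partial final chunk still carries a full header. */
    uint64_t chunks = (total_bytes + chunk_size - 1) / chunk_size;
    return chunks * per_chunk_overhead;
}
```

Assuming 64 KiB chunks, a 1 GiB transfer needs 16384 chunks, so 24 bytes
per chunk costs 393216 bytes of framing against 65536 bytes with a bare
4 byte length prefix. The difference of 327680 bytes is about 0.03% of
the payload.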

Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|