This post continues where the "Virtio devices and drivers overview" leaves off. After we have explained the scenario in the previous post, we are reaching the main point: how does the data travel from the virtio-device to the driver and back?

Buffers and notifications: The work routine

As stated earlier, a virtqueue is just a queue of guest’s buffers that the host consumes, either reading them or writing to them. A buffer can be read-only or write-only from the device point of view, but never both.

The descriptors can be chained, and the framing of the message can be spread whatever way is more convenient. For example, to spread a 2000 byte message in one single buffer or to use two 1000 byte buffers should be the same. 

Also, it provides driver to device notifications (doorbell) method, to signal that one or more buffers have been added to the queue, and vice-versa, devices can interrupt the driver to signal used buffers. It is up to the underlying driver to provide the right method to dispatch the actual notification, for example using PCI interruptions or memory writing: The virtqueue only standardizes the semantics of it.

As stated before, the driver and the device can advise the other to not to emit notifications to reduce its dispatching overhead. Since this operation is asynchronous we will describe how to do so in further sections.

Split virtqueue: the beauty of simplicity

The split virtqueue format separates the virtqueue into three areas, where each area is writable by either the driver or the device, but not both:

  • Descriptor Area: used for describing buffers.

  • Driver Area: data supplied by driver to the device. Also called avail virtqueue.

  • Device Area: data supplied by device to driver. Also called used virtqueue.

They need to be allocated in the driver’s memory for it to be able to access them in a straightforward way. Buffer addresses are stored from the driver's point of view, and the device needs to perform an address translation. There are many ways for the device to access it depending on the latter nature:

  • For an emulated device in the hypervisor (like qemu), the guest's address is in its own process memory.

  • For other emulated devices like vhost-net or vhost-user, a memory mapping needs to be done, like POSIX shared memory. A file descriptor to that memory is shared through vhost protocol.

  • For a real device a hardware-level translation needs to be done, usually via IOMMU.

 

Shared memory with split ring elements

Shared memory with split ring elements

Descriptor ring: Where is my data?

The descriptor area (or descriptor ring) is the first one that needs to be understood. It contains an array of a number of guest addressed buffers and its length. Each descriptor also contains a set of flags indicating more information about it. For example, the buffer continues in another descriptor buffer if the 0x1 bit is set, and the buffer is write-only for the device if the bit 0x2 is set, and is read-only if it is clear.

This is the layout of a single descriptor. We will call leN for N bits in little endian format.

struct virtq_desc { 
        le64 addr;
        le32 len;
        le16 flags;
        le16 next; // Will explain this one later in the section "Chained descriptors"
};
Listing: Split Virtqueue descriptor layout

Avail ring: Supplying data to the device

The next interesting structure is the driver area, or avail ring. Is the room where the driver places the descriptor (indexes) the device is going to consume. Note that placing a buffer here doesn’t mean that the device needs to consume immediately: virtio-net, for example, provides a bunch of descriptors for packet receiving that are only used by the device when a packet arrives, and are “ready to consume” until that moment.

The avail ring has two important fields that only the driver can write and the device only can read them: idx and flags. The idx field indicates where the driver would put the next descriptor entry in the avail ring (modulo the queue size). On the other hand, the least significant bit of flags indicates if the driver wants to be notified or not (called VIRTQ_AVAIL_F_NO_INTERRUPT).

After these two fields, an array of integers of the same length as the descriptors ring. So the avail virtqueue layout is:

struct virtq_avail {
        le16 flags;
        le16 idx;
        le16 ring[ /* Queue Size */ ];
};
Listing: Avail virtqueue layout
Figure 1 shows a descriptor table with a 2000 bytes long buffer that starts in position 0x8000, and an avail ring that still does not have any entry. After all the steps, a components diagram highlighting the descriptor area update. The first step for the driver is to allocate the buffer with the memory and fill it (this is the step 1 in the "Process to make a buffer available" diagram), and to make available on the descriptor area after that (step 2).

 

 

Figure 1: Driver writes a buffer in descriptor ring

Figure 1: Driver writes a buffer in descriptor ring

After populating descriptor entry, driver advises of it using the avail ring: It writes the descriptor index #0 in the first entry of the avail ring, and updates idx entry accordly. The result of this is shown in Figure 2. In the case that supply chained buffers, only the descriptor head index should be added this way, and avail idx would increase only by 1. This is the step 3 in the diagram.

 

Figure 2: Driver offers the buffer with avail ring

Figure 2: Driver offers the buffer with avail ring

From now on, the driver should not modify the available descriptor or the exposed buffer at any moment: It is under the device's control. Now the driver needs to notify the device if the latter has enabled notifications at that moment (more on how the device manages this later). This is the last step 4 in the diagram.

 

Diagram: Process to make a buffer available

Diagram: Process to make a buffer available

The avail ring must be able to hold the same number of descriptors as the descriptor area, and the descriptor area must have a size power of two, so idx wraps naturally at some point. For example, if the ring size is 256 entries, idx 1 references the same descriptor as idx 257, 513... And it will wrap at a 16 bit boundary. This way, neither side needs to worry about processing an invalid idx: They are all valid.

Note that descriptors can be added in any order to the avail ring, one does not need to start from descriptor table entry 0 nor continue by the next descriptor.

Chained descriptors: Supplying large data to the device

The driver can also chain more than one descriptor using its next member. If the NEXT (0x1) flag of a descriptor is set, the data continue in another buffer, making a chain of descriptors. Note that the descriptors in a chain do not share flags: Some descriptors can be read-only, and the others can be write-only. In this case, write-only descriptors must come after all write-only ones.

For example, if the driver has sent us two buffers in a chain with descriptor table indexes 0 and 1 as first operation, the device would see the scenario in Figure 3, and it would be the step 2 again.

 

Figure 3: Device sees chained buffers

Figure 3: Device sees chained buffers

Used ring: When the device is done with the data

The device employs the used ring to return the used (read or written) buffers to the driver. As the avail ring, it has the flags and idx members. They have the same layout and serve the same purpose, although the notification flag is now called VIRTQ_USED_F_NO_NOTIFY.

After them, it maintains an array of used descriptors. In this array, the device returns not only the descriptor index but also the used length in case of writing.

struct virtq_used {
        le16 flags;
        le16 idx;
        struct virtq_used_elem ring[ /* Queue Size */];
};

struct virtq_used_elem {
        /* Index of start of used descriptor chain. */
        le32 id;
        /* Total length of the descriptor chain which was used (written to) */
        le32 len;
};
Listing: Used virtqueue layout
In case of returning a chain of descriptors, only the id of the head of the chain is returned, and the total written length through all descriptors, not increasing it when data is read. The descriptor table is not touched at all, it is read-only for the device. This is step 5 in the “Process to make a buffer as used” diagram.

For example, if the device uses the chain of descriptors exposed in the Chained descriptors version:

 

Figure 4: Device returns buffer chain

Figure 4: Device returns buffer chain

 

Diagram: Process to mark a buffer as used

Diagram: Process to mark a buffer as used

Lastly, the device will notify the driver if it sees that the driver wants to be notified, using the used queue flags to know it (step 6).

Indirect descriptors: supplying a lot of data to the device

Indirect descriptors are a way to dispatch a larger number of descriptors in a batch, increasing the ring capacity. The driver stores a table of indirect descriptors (the same layout as the regular descriptors) anywhere in memory, and inserts a descriptor in the virtqueue with the flag VIRTQ_DESC_F_INDIRECT (0x4) set. The descriptor’s address and length correspond to the indirect table’s ones.

If we want to add the chain described in section Chained descriptors to an indirect table, the driver first allocates the memory region of 2 entries (32 bytes) to hold the latter (step 2 in the diagram after allocate the buffers in the step 1):

Buffer

Len

Flags

Next

0x8000

0x2000

W|N

1

0xD000

0x2000

W

...

Figure 4: Indirect table for indirect descriptors

Let’s suppose it has been allocated on memory position 0x2000, and it is the first descriptor made available. As usual, the first step is to include it in the Descriptor area (step 3 in the diagram), so it would look like:

Descriptor Area

Buffer

Len

Flags

Next

0x2000

32

I

...

Figure 5: Add indirect table to Descriptor area

After that, the steps are the same as with regular descriptors: The driver adds the index of the descriptor marked with the flag in the descriptor area to the avail ring (#0 in this case, step 4 in the diagram), and notify the device as usual (step 5).

 

Diagram: Driver make available indirect descriptors

Diagram: Driver make available indirect descriptors

For the device to use its data, and would use the same memory addresses to return its 0x3000 bytes (all 0x8000-0x9FFF and 0xD000-0xDFFF) (Step 6 and 7, same as with regular descriptors). Once used by the device, the driver can release the indirect memory or do whatever it wants with it, as it could do with any regular buffer.

 

Diagram: Device mark the indirect descriptor as used

Diagram: Device mark the indirect descriptor as used

Descriptors with INDIRECT flag cannot have NEXT or WRITE flags set, so you cannot chain indirect descriptors in the descriptor table, and the indirect table can contain at maximum the same number of descriptors as the descriptor table.

Notifications. Learning the “do not disturb” mode

In many systems used and available buffer notifications involve significant overhead. To mitigate it, each virtring maintains a flag to indicate when it wants to be notified. Remember that the driver’s one is read-only by the device, and the device’s one is read-only by the driver.

We already know all of this, and its use is pretty straightforward. The only thing you need to take care of is the asynchronous nature of this method: The side of the communication that disables or enables it can’t be sure that the other end is going to know the change, so you can miss notifications or to have more than expected.

A more effective way of notifications toggle is enabled if the VIRTIO_F_EVENT_IDX feature bit is negotiated by device and driver: Instead of disable them in a binary fashion, driver and device can specify how far the other can progress before a notification is required using an specific descriptor id. This id is advertised using a extra le16 member at the end of the structure, so they grow like this:

The struct layout is:

struct virtq_avail {              struct virtq_used {
  le16 flags;                       le16 flags;
  le16 idx;                         le16 idx;
  le16 ring[ /* Queue Size */ ];    struct virtq_used_elem ring[Q. size];
  le16 used_event;                  le16 avail_event;
};                                };

Listing 3: Event suppression struct notification

This way, every time the driver wants to make available a buffer it needs to check the avail_event on the used ring: If driver’s idx field was equal to avail_event, it’s time to send a notification, ignoring the lower bit of used ring flags member (VIRTQ_USED_F_NO_NOTIFY).

Similarly, if VIRTIO_F_EVENT_IDX has been negotiated, the device will check used_event to know if it needs to send a notification or not. This can be very effective for maintaining a virtqueue of buffers for the device to write, like in the virtio-net device receive queue.

In our next post, we're going to wrap up and take a look at a number of optimizations on top of both ring layouts which depend on the communication/device type or how each part is implemented.


About the author

Eugenio Pérez works as a Software Engineer in the Virtualization and Networking (virtio-net) team at Red Hat. He has been developing and promoting free software on Linux since his career start. Always closely related to networking, being with packet capture or classic monitoring. He enjoys to learn about how things are implemented and how he can expand them, keeping them simple (KISS) and focusing on maintainability and security.

Read full bio