This post continues where "Virtio devices and drivers overview" left off. With the scenario laid out in the previous post, we reach the main point: how does the data travel from the virtio device to the driver and back?
Buffers and notifications: The work routine
As stated earlier, a virtqueue is just a queue of guest buffers that the host consumes, either reading them or writing to them. A buffer can be read-only or write-only from the device's point of view, but never both.
Descriptors can be chained, and a message can be framed in whatever way is most convenient. For example, placing a 2000-byte message in a single buffer or splitting it across two 1000-byte buffers should be equivalent.
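To make the framing point concrete, here is a minimal sketch of how a consumer could reassemble a message without caring how it was split. The `segment` type and the `gather` function are hypothetical names used only for illustration; they are not part of virtio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one guest buffer: a pointer plus a length. */
struct segment {
    const uint8_t *data;
    size_t len;
};

/*
 * Copy a message spread over `count` segments into `out`, which can hold
 * `out_cap` bytes. Returns the number of bytes copied, or 0 if `out` is
 * too small for the whole message.
 */
size_t gather(const struct segment *segs, size_t count,
              uint8_t *out, size_t out_cap)
{
    size_t total = 0;

    assert(out != NULL);
    for (size_t i = 0; i < count; i++) {
        if (segs[i].len > out_cap - total)
            return 0;
        memcpy(out + total, segs[i].data, segs[i].len);
        total += segs[i].len;
    }
    return total;
}
```

Calling `gather` on one 2000-byte segment or on two 1000-byte segments that cover the same bytes produces the same flat message.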
A virtqueue also provides a driver-to-device notification (doorbell) method to signal that one or more buffers have been added to the queue and, vice versa, a way for devices to interrupt the driver to signal used buffers. It is up to the underlying driver to provide the right method to dispatch the actual notification, for example using PCI interrupts or memory writes: the virtqueue only standardizes its semantics.
As stated before, the driver and the device can each advise the other not to emit notifications, to reduce the dispatching overhead. Since this operation is asynchronous, we will describe how to do so in later sections.
Split virtqueue: the beauty of simplicity
The split virtqueue format separates the virtqueue into three areas, where each area is writable by either the driver or the device, but not both:
- Descriptor Area: used for describing buffers.
- Driver Area: data supplied by driver to the device. Also called avail virtqueue.
- Device Area: data supplied by device to driver. Also called used virtqueue.
They need to be allocated in the driver’s memory so that the driver can access them in a straightforward way. Buffer addresses are stored from the driver's point of view, and the device needs to perform an address translation. There are several ways for the device to access them, depending on the nature of the device:
- For an emulated device in the hypervisor (like qemu), the guest's address is in its own process memory.
- For other emulated devices like vhost-net or vhost-user, a memory mapping needs to be done, like POSIX shared memory. A file descriptor to that memory is shared through vhost protocol.
- For a real device a hardware-level translation needs to be done, usually via IOMMU.
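For the software cases above, the translation can be pictured as a lookup in a table of mapped regions. The following is only a sketch under that assumption: `mem_region` and `guest_to_host` are hypothetical names, and real implementations (qemu, vhost) keep richer structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one guest memory region mapped by the device. */
struct mem_region {
    uint64_t guest_addr; /* first guest address covered by the region */
    uint64_t size;       /* region length in bytes */
    uint8_t *host_base;  /* where the device mapped it in its own address space */
};

/*
 * Translate the guest range [addr, addr + len) into a device-side pointer.
 * Returns NULL if the range is not fully contained in a single region.
 */
void *guest_to_host(const struct mem_region *regions, size_t count,
                    uint64_t addr, uint64_t len)
{
    assert(regions != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        const struct mem_region *r = &regions[i];

        /* Written this way so the bounds check cannot overflow. */
        if (addr >= r->guest_addr && len <= r->size &&
            addr - r->guest_addr <= r->size - len)
            return r->host_base + (addr - r->guest_addr);
    }
    return NULL;
}
```

Rejecting ranges that cross the end of a region matters because buffer addresses and lengths come from the guest and should not be trusted blindly.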
Shared memory with split ring elements
Descriptor ring: Where is my data?
The descriptor area (or descriptor ring) is the first one that needs to be understood. It contains an array of guest-addressed buffers and their lengths. Each descriptor also contains a set of flags giving more information about it. For example, the buffer continues in another descriptor if the 0x1 bit is set; the buffer is write-only for the device if the 0x2 bit is set, and read-only if it is clear.
This is the layout of a single descriptor. We will write leN for an N-bit field in little endian format.
struct virtq_desc {
    le64 addr;
    le32 len;
    le16 flags;
    le16 next; // Will explain this one later in the section "Chained descriptors"
};
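As a rough illustration of the flag bits described above, the sketch below maps the descriptor to fixed-width C types and adds two helpers. It assumes a little-endian host, so each leN field is a plain uintN_t; the helper names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  0x1 /* buffer continues in the descriptor named by `next` */
#define VIRTQ_DESC_F_WRITE 0x2 /* write-only for the device; clear means read-only */

/* Assumption: little-endian host, so leN is represented as uintN_t. */
struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Two descriptors take 32 bytes, so each one must be exactly 16 bytes. */
static_assert(sizeof(struct virtq_desc) == 16, "a descriptor is 16 bytes");

/* Hypothetical helper: does the buffer continue in another descriptor? */
bool desc_has_next(const struct virtq_desc *d)
{
    return (d->flags & VIRTQ_DESC_F_NEXT) != 0;
}

/* Hypothetical helper: may the device write to this buffer? */
bool desc_is_device_writable(const struct virtq_desc *d)
{
    return (d->flags & VIRTQ_DESC_F_WRITE) != 0;
}
```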
Avail ring: Supplying data to the device
The next interesting structure is the driver area, or avail ring. It is the room where the driver places the indexes of the descriptors the device is going to consume. Note that placing a buffer here doesn’t mean that the device needs to consume it immediately: virtio-net, for example, provides a bunch of descriptors for packet receiving that are only used by the device when a packet arrives, and remain “ready to consume” until that moment.
The avail ring has two important fields that only the driver can write and the device can only read: idx and flags. The idx field indicates where the driver would put the next descriptor entry in the avail ring (modulo the queue size). The least significant bit of flags, called VIRTQ_AVAIL_F_NO_INTERRUPT, indicates whether the driver wants to be notified: when it is set, the driver is asking the device not to send notifications.
After these two fields comes an array of integers of the same length as the descriptor ring. So the avail virtqueue layout is:
struct virtq_avail {
    le16 flags;
    le16 idx;
    le16 ring[ /* Queue Size */ ];
};
Figure 1: Driver writes a buffer in descriptor ring
After populating the descriptor entry, the driver advertises it using the avail ring: it writes the descriptor index #0 in the first entry of the avail ring, and updates the idx entry accordingly. The result is shown in Figure 2. When supplying chained buffers, only the index of the descriptor at the head of the chain should be added this way, and avail idx would increase only by 1. This is step 3 in the diagram.
Figure 2: Driver offers the buffer with avail ring
From now on, the driver should not modify the available descriptor or the exposed buffer: they are under the device's control. Now the driver needs to notify the device if the latter has enabled notifications at that moment (more on how the device manages this later). This is the last step, step 4 in the diagram.
Diagram: Process to make a buffer available
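The driver side of this process could be sketched as follows. This is an illustration, not the code of any real driver: `virtq_offer` and the fixed `QUEUE_SIZE` are assumptions, the host is assumed little-endian, and the C11 fence stands in for whatever barrier the platform requires so that the device never sees the new idx before the ring entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_SIZE 256 /* assumed queue size for this sketch; a power of two */

struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct virtq_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[QUEUE_SIZE];
};

/* Hypothetical helper: offer one single-descriptor buffer to the device. */
void virtq_offer(struct virtq_desc *table, struct virtq_avail *avail,
                 uint16_t head, uint64_t addr, uint32_t len, uint16_t flags)
{
    assert(head < QUEUE_SIZE);

    /* Describe the buffer in the descriptor table. */
    table[head].addr = addr;
    table[head].len = len;
    table[head].flags = flags;
    table[head].next = 0;

    /* Step 3: place the head index in the next free avail ring entry. */
    avail->ring[avail->idx % QUEUE_SIZE] = head;

    /* Publish the entry before the idx update becomes visible. */
    atomic_thread_fence(memory_order_release);
    avail->idx++;

    /* Step 4, notifying the device, is transport specific and omitted here. */
}
```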
The avail ring must be able to hold the same number of descriptors as the descriptor area, and the size of the descriptor area must be a power of two, so idx wraps naturally at some point. For example, if the ring size is 256 entries, idx 1 references the same ring entry as idx 257, 513, and so on, and idx itself wraps at the 16-bit boundary. This way, neither side needs to worry about processing an invalid idx: they are all valid.
Note that descriptors can be added to the avail ring in any order: one does not need to start from descriptor table entry 0, nor continue with the next descriptor.
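The wrapping arithmetic is small enough to show directly. The helper names below are hypothetical; the only assumptions are a power-of-two queue size and a free-running 16-bit idx, as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Map a free-running 16-bit idx to a ring entry; queue_size must be a power of two. */
uint16_t ring_slot(uint16_t idx, uint16_t queue_size)
{
    assert(queue_size != 0 && (queue_size & (queue_size - 1)) == 0);
    return (uint16_t)(idx & (queue_size - 1));
}

/* Entries published between two idx snapshots, correct across the 16-bit wrap. */
uint16_t ring_pending(uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - old_idx);
}
```

With a queue size of 256, `ring_slot` maps idx 1, 257 and 513 to the same entry, and the unsigned subtraction in `ring_pending` keeps working after idx goes from 65535 back to 0.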
Chained descriptors: Supplying large data to the device
The driver can also chain more than one descriptor using the next member. If the NEXT (0x1) flag of a descriptor is set, the data continues in another buffer, making a chain of descriptors. Note that the descriptors in a chain do not share flags: some descriptors can be read-only, and others write-only. In this case, write-only descriptors must come after all read-only ones.
For example, if the driver has sent us two buffers in a chain with descriptor table indexes 0 and 1 as its first operation, the device would see the scenario in Figure 3, and it would be step 2 again.
Figure 3: Device sees chained buffers
Used ring: When the device is done with the data
The device employs the used ring to return the used (read or written) buffers to the driver. Like the avail ring, it has the flags and idx members. They have the same layout and serve the same purpose, although the notification flag is now called VIRTQ_USED_F_NO_NOTIFY.
After them, it maintains an array of used descriptors. In this array, the device returns not only the descriptor index but also the used length in case of writing.
struct virtq_used {
    le16 flags;
    le16 idx;
    struct virtq_used_elem ring[ /* Queue Size */ ];
};

struct virtq_used_elem {
    /* Index of start of used descriptor chain. */
    le32 id;
    /* Total length of the descriptor chain which was used (written to) */
    le32 len;
};
For example, Figure 4 shows the result when the device uses the chain of descriptors exposed in the Chained descriptors section:
Figure 4: Device returns buffer chain
Diagram: Process to mark a buffer as used
Lastly, the device will notify the driver if it sees that the driver wants to be notified, using the avail ring flags (the ones written by the driver) to know it (step 6).
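Both sides of the used ring can be sketched as below. The function names, the fixed `QUEUE_SIZE` and the driver-private `last_seen` counter are assumptions of this example, as is the little-endian host; the C11 fences stand in for the platform's real barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 256 /* assumed queue size for this sketch; a power of two */

struct virtq_used_elem {
    uint32_t id;  /* index of the head of the used descriptor chain */
    uint32_t len; /* total bytes written into the chain */
};

struct virtq_used {
    uint16_t flags;
    uint16_t idx;
    struct virtq_used_elem ring[QUEUE_SIZE];
};

/* Device side: return the chain starting at `head`, reporting `written` bytes. */
void virtq_push_used(struct virtq_used *used, uint16_t head, uint32_t written)
{
    assert(head < QUEUE_SIZE);
    struct virtq_used_elem *e = &used->ring[used->idx % QUEUE_SIZE];

    e->id = head;
    e->len = written;
    /* Make the element visible before the idx update. */
    atomic_thread_fence(memory_order_release);
    used->idx++;
}

/* Driver side: fetch the next used element the driver has not seen yet. */
bool virtq_pop_used(const struct virtq_used *used, uint16_t *last_seen,
                    struct virtq_used_elem *out)
{
    if (*last_seen == used->idx)
        return false; /* nothing new */
    /* Read the element only after observing the idx update. */
    atomic_thread_fence(memory_order_acquire);
    *out = used->ring[*last_seen % QUEUE_SIZE];
    (*last_seen)++;
    return true;
}
```

For the chain in the example, the device would call `virtq_push_used(&used, 0, 0x3000)`: only the head index is returned, together with the total number of bytes written.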
Indirect descriptors: supplying a lot of data to the device
Indirect descriptors are a way to dispatch a larger number of descriptors in a batch, increasing the ring capacity. The driver stores a table of indirect descriptors (with the same layout as the regular descriptors) anywhere in memory, and inserts a descriptor in the virtqueue with the flag VIRTQ_DESC_F_INDIRECT (0x4) set. The descriptor’s address and length correspond to those of the indirect table.
If we want to add the chain described in the Chained descriptors section to an indirect table, the driver first allocates a memory region of 2 entries (32 bytes) to hold it (step 2 in the diagram, after allocating the buffers in step 1):
| Buffer | Len    | Flags | Next |
| 0x8000 | 0x2000 | W|N   | 1    |
| 0xD000 | 0x2000 | W     | ...  |
Figure 5: Indirect table for indirect descriptors
Let’s suppose it has been allocated at memory position 0x2000, and it is the first descriptor made available. As usual, the first step is to include it in the Descriptor area (step 3 in the diagram), so it would look like:
Descriptor Area
| Buffer | Len | Flags | Next |
| 0x2000 | 32  | I     | ...  |
Figure 6: Add indirect table to Descriptor area
After that, the steps are the same as with regular descriptors: the driver adds the index of the descriptor marked with the flag in the descriptor area to the avail ring (#0 in this case, step 4 in the diagram), and notifies the device as usual (step 5).
Diagram: Driver makes indirect descriptors available
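The two structures involved in this example could be built as in the sketch below. The helper names are hypothetical and a little-endian host is assumed; the values are the ones from the tables above.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT     0x1
#define VIRTQ_DESC_F_WRITE    0x2
#define VIRTQ_DESC_F_INDIRECT 0x4

struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Fill a two-entry indirect table with the chain from the example. */
void fill_indirect_table(struct virtq_desc table[2])
{
    table[0] = (struct virtq_desc){ 0x8000, 0x2000,
                                    VIRTQ_DESC_F_WRITE | VIRTQ_DESC_F_NEXT, 1 };
    table[1] = (struct virtq_desc){ 0xD000, 0x2000, VIRTQ_DESC_F_WRITE, 0 };
}

/*
 * Build the descriptor that goes into the descriptor area: it points at the
 * indirect table and carries the table's size in bytes, not the data's.
 */
struct virtq_desc make_indirect_desc(uint64_t table_guest_addr, uint16_t entries)
{
    assert(entries > 0);
    struct virtq_desc d = {
        .addr = table_guest_addr,
        .len = (uint32_t)(entries * sizeof(struct virtq_desc)),
        .flags = VIRTQ_DESC_F_INDIRECT,
        .next = 0,
    };
    return d;
}
```

`make_indirect_desc(0x2000, 2)` yields the entry shown in the Descriptor area table: address 0x2000, length 32 and only the INDIRECT flag set.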
The device then uses the data as with any other chain: it writes to the same memory addresses to return its 0x3000 bytes (all of 0x8000-0x9FFF, and 0xD000-0xDFFF) (steps 6 and 7, same as with regular descriptors). Once it has been used by the device, the driver can release the indirect memory or do whatever it wants with it, as it could do with any regular buffer.
Diagram: Device marks the indirect descriptor as used
Descriptors with INDIRECT flag cannot have NEXT or WRITE flags set, so you cannot chain indirect descriptors in the descriptor table, and the indirect table can contain at maximum the same number of descriptors as the descriptor table.
Notifications. Learning the “do not disturb” mode
In many systems, used and available buffer notifications involve significant overhead. To mitigate it, each virtring maintains a flag to indicate when it wants to be notified. Remember that the driver’s one is read-only for the device, and the device’s one is read-only for the driver.
We already know all of this, and its use is pretty straightforward. The only thing you need to take care of is the asynchronous nature of this method: the side of the communication that disables or enables notifications can’t be sure that the other end has seen the change, so you can miss notifications or receive more than expected.
A more effective way of toggling notifications is enabled if the VIRTIO_F_EVENT_IDX feature bit is negotiated by device and driver: instead of disabling them in a binary fashion, driver and device can specify how far the other can progress before a notification is required, using a specific descriptor id. This id is advertised using an extra le16 member at the end of each structure, so the struct layouts grow like this:
struct virtq_avail {
    le16 flags;
    le16 idx;
    le16 ring[ /* Queue Size */ ];
    le16 used_event;
};

struct virtq_used {
    le16 flags;
    le16 idx;
    struct virtq_used_elem ring[ /* Queue Size */ ];
    le16 avail_event;
};
Listing 3: Event suppression struct notification
This way, every time the driver wants to make a buffer available, it needs to check avail_event on the used ring: if the driver’s idx field was equal to avail_event, it’s time to send a notification, ignoring the lower bit of the used ring flags member (VIRTQ_USED_F_NO_NOTIFY).
Similarly, if VIRTIO_F_EVENT_IDX has been negotiated, the device will check used_event to know if it needs to send a notification or not. This can be very effective for maintaining a virtqueue of buffers for the device to write, like in the virtio-net device receive queue.
In our next post, we're going to wrap up and take a look at a number of optimizations on top of both ring layouts which depend on the communication/device type or how each part is implemented.
About the author
Eugenio Pérez works as a Software Engineer in the Virtualization and Networking (virtio-net) team at Red Hat. He has been developing and promoting free software on Linux since the start of his career. His work has always been closely related to networking, whether packet capture or classic monitoring. He enjoys learning how things are implemented and how he can expand them, keeping them simple (KISS) and focusing on maintainability and security.