This is the final post of a three-post series; the previous posts are "Virtio devices and drivers overview: The headjack and the phone" and "Virtqueues and virtio ring: How the data travels."

Split virtqueue issues: Too much spinning around

While the split virtqueue shines because of the simplicity of its design, it has a fundamental problem: the avail-used buffer cycle needs to use memory in a very sparse way. This puts pressure on the CPU cache, and in the case of hardware it means several PCI transactions for each descriptor.

The packed virtqueue amends this by merging the three rings into just one location in guest memory. While this may seem complicated at first glance, it's a natural step after the split version once we realize that the device can discard and overwrite the data it has already read from the driver, and the same happens the other way around.

Supplying descriptors to the device: How to fill the device's to-do list

After initialization, following the same process described in Virtio device initialization: feature bits, and after agreeing on the VIRTIO_F_RING_PACKED feature flag, the driver and the device start with a shared blank canvas of descriptors of an agreed length (up to 2^15 = 32768 entries) at an agreed location in guest memory. The layout of these descriptors is:

struct virtq_desc { 
        le64 addr;
        le32 len;
        le16 id;
        le16 flags;
};

Listing: Memory layout of a packed virtqueue descriptor

This time, the id field is not an index for the device to look up the buffer: it is an opaque value for the device, and only has meaning for the driver.

The driver also maintains an internal single-bit ring wrap counter initialized to 1. The driver will flip its value every time it makes available the last descriptor in the ring.

As with split descriptors, the first step is to write the different fields: address, length, id and flags. However, packed descriptors take into account two new flags: AVAIL (bit 7) and USED (bit 15). To mark a descriptor as available, the driver sets the AVAIL flag equal to its internal wrap counter, and the USED flag to its inverse. While a plain binary avail/used flag would be easier to implement, it would prevent useful optimizations we will describe later.
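To make the flag arithmetic concrete, here is a minimal C sketch, not taken from the spec or any real driver, of how a driver could publish one buffer. The struct virtq bookkeeping fields (next_avail, avail_wrap_counter) are hypothetical driver-internal state:

#include <stdint.h>
#include <stdbool.h>

#define VIRTQ_DESC_F_AVAIL (1 << 7)
#define VIRTQ_DESC_F_USED  (1 << 15)

/* C view of the descriptor in the listing above; the le* fields are
 * little-endian on the wire, approximated here with uint*_t. */
struct virtq_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;
};

/* Hypothetical driver-side bookkeeping for one packed virtqueue */
struct virtq {
        struct virtq_desc *desc;  /* shared descriptor ring */
        uint16_t size;            /* ring length */
        uint16_t next_avail;      /* driver's internal avail position */
        bool avail_wrap_counter;  /* driver's internal wrap counter */
};

void driver_make_avail(struct virtq *vq, uint64_t addr, uint32_t len,
                       uint16_t id, uint16_t flags)
{
        struct virtq_desc *d = &vq->desc[vq->next_avail];

        d->addr = addr;
        d->len  = len;
        d->id   = id;

        /* AVAIL mirrors the wrap counter; USED takes its inverse */
        if (vq->avail_wrap_counter)
                flags |= VIRTQ_DESC_F_AVAIL;
        else
                flags |= VIRTQ_DESC_F_USED;

        /* A real driver inserts a write barrier here, so the device
         * never reads the new flags before the rest of the entry. */
        d->flags = flags;

        if (++vq->next_avail == vq->size) {
                vq->next_avail = 0;
                vq->avail_wrap_counter = !vq->avail_wrap_counter;
        }
}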

As an example, the driver allocates a write buffer of 0x1000 bytes at address 0x80000000 (step 1 in the diagram) and makes it the first available descriptor, setting the AVAIL flag equal to its internal wrap counter (set) in step 2. The descriptor table would then look like this:

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x80000000   0x1000   0    W|A
            ...

Figure: Descriptor table after adding the first buffer

Note that the avail and used idx columns appear in the table just for guidance; they don't exist in the descriptor table. Each side must keep an internal counter to know which position it needs to poll or write next, and the device must also track the driver's wrap counter. Lastly, as with the split virtqueue, the driver notifies the device if the latter has notifications enabled (step 3 in the diagram).

And the usual diagram of the updates. Note the lack of the avail and used rings, as only the descriptor table is needed now.

 

Diagram: Driver makes available a descriptor using a packed queue


Returning used descriptors: How the device fills the “done” list

Like the driver, the device maintains an internal single-bit ring wrap counter initialized to 1, and it knows that the driver's internal ring wrap counter also starts set. When the device searches for the first descriptor the driver has made available, it polls the first entry of the ring, looking for an avail flag equal to the driver's internal wrap counter (set in this case).

As with the split used ring, the device returns the length of the written data (if any) in the len field, along with the id of the used descriptor. Finally, the device sets both the avail (A) and used (U) flags equal to its own internal wrap counter.
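The device side can be sketched the same way, extending the hypothetical struct virtq above with assumed device-internal state (next_used, used_wrap_counter); again an illustration, not code from any implementation:

void device_mark_used(struct virtq *vq, uint16_t id, uint32_t written)
{
        struct virtq_desc *d = &vq->desc[vq->next_used];

        d->id  = id;       /* the opaque id the driver passed in */
        d->len = written;  /* bytes actually written, if any */

        /* Both A and U now mirror the device's wrap counter */
        uint16_t flags = d->flags &
                ~(VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED);
        if (vq->used_wrap_counter)
                flags |= VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED;

        /* Again, a write barrier belongs before publishing the flags */
        d->flags = flags;

        if (++vq->next_used == vq->size) {
                vq->next_used = 0;
                vq->used_wrap_counter = !vq->used_wrap_counter;
        }
}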

Following the example, the device leaves the descriptor table as in the next figure. The driver knows that the buffer has been returned because the used flag now matches the avail flag, and both match the device's internal wrap counter at the moment it wrote the descriptor. The returned address is not important: only the ID is.

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x80000000   0x1000   0    W|A|U
            ...

Figure: Descriptor table after the device returns the first buffer

 

Diagram: Device marks a descriptor as used using a packed queue


Wrapping the descriptor ring: How do the lanes stay separate?

When the driver fills the complete descriptor table, it wraps and flips its internal Driver Ring Wrap Counter. So, in the second round, available descriptors will have the avail flag clear and the used flag set, and the device has to poll for this new combination once it wraps while reading descriptors. Let's see a full example of the different situations.

Suppose we have a descriptor table with only two entries and the Driver Ring Wrap Counter starts set. The driver fills the descriptor table, making two buffers available at the beginning of the operation; having reached the end of the ring, it flips its internal wrap counter, which is now clear (0). We have the following table:

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x80000000   0x1000   0    W|A
            0x81000000   0x1000   1    W|A

Figure: Full two-entry descriptor table

After that, the device realizes that it has both descriptors, with ids #0 and #1, available: it knows the driver had its wrap counter set when it wrote them, because the avail flag is set and the used flag is clear on both. If the device uses the descriptor with id #1 first, we get the descriptor table below. Buffer #0 still belongs to the device!

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x80000000   0x1000   1    W|A|U
            0x81000000   0x1000   1    W|A

Figure: Using the first buffer out of order

Now the driver realizes that buffer #1 has been used, since the avail and used flags are the same (set) and match the device's internal wrap counter at the moment it wrote the entry. If the device now uses buffer #0, it makes the table look like this:

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x80000000   0x1000   1    W|A|U
            0x81000000   0x1000   0    W|A|U

Figure: Using the second buffer out of order

But there is a more interesting case: starting from the "Using the first buffer out of order" situation, the driver makes buffer #1 available again. In that case, the descriptor table goes directly from that figure back to a "Full two-entry descriptor table":

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x81000000   0x1000   1    W|(!A)|U
            0x81000000   0x1000   1    W|A

Figure: Full two-entry descriptor table

Note that, after the wrap, the driver needs to clear the avail flag and set the used flag, the opposite of the first round. When the device wraps looking for available buffers, it needs to start looking for this new combination, so it will stop at index 1 of the table: that entry carries the "available" combination of the previous round, not the current one. At this moment, both buffers #0 and #1 are available to the device, and it could decide to use #1 again!
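In code, this whole round-tracking discussion collapses into one small check. A sketch of the test the device could apply while polling, given its copy of the driver's wrap counter (constants as in the earlier sketch):

/* An entry is available when AVAIL equals the driver's wrap counter
 * for the current round and USED is its inverse; entries left over
 * from the previous lap fail this test. */
static bool desc_is_avail(const struct virtq_desc *d, bool driver_wrap)
{
        bool avail = d->flags & VIRTQ_DESC_F_AVAIL;
        bool used  = d->flags & VIRTQ_DESC_F_USED;

        return avail == driver_wrap && used != driver_wrap;
}

In the first round driver_wrap is set, so the W|A entries match; after the wrap it is clear, and only the avail-clear/used-set entries do.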

Chained descriptors: No more jumps

Chained descriptors work likewise: there is no need for a next field in the head (or any subsequent) descriptor of the chain, since the next descriptor in the chain always occupies the next position in the ring. However, while in the split used ring you only need to return as used the id of the head of the chain, in the packed ring you only need to return the tail id.

As in the split used ring, every time we use chained descriptors the used idx lags behind the avail idx: more than one descriptor is marked as available to the device, but only one is sent back as used to the driver. While this is not a problem in the split ring, it would cause descriptor entry exhaustion in the packed version.

The straightforward solution is to make the device mark every descriptor in the chain as used. However, this can be expensive, since we would be modifying a shared area of memory, which could cause cache bounces.

However, the driver already knows the chain, so it can skip the whole chain given only the last id. This is why we need to compare the avail/used flag pair against the driver/device wrap counters: after such a jump, with only a binary available/used flag we couldn't tell whether the next descriptor was made available in this driver round or in the next one.
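To make the skip concrete, here is a sketch of the driver reclaiming a whole chain from the single used entry the device wrote. The chain_len array (recorded per buffer id when the chain was submitted) and the driver's used-side bookkeeping are hypothetical:

/* Called once the entry at next_used has been seen to be used */
void driver_reclaim_chain(struct virtq *vq)
{
        struct virtq_desc *d = &vq->desc[vq->next_used];
        uint16_t chain = vq->chain_len[d->id];  /* slots the chain took */

        /* The device returned only the tail id; jump over the whole
         * chain, flipping the driver's copy of the used-side wrap
         * counter if the jump wraps around the ring. */
        vq->next_used += chain;
        if (vq->next_used >= vq->size) {
                vq->next_used -= vq->size;
                vq->used_wrap_counter = !vq->used_wrap_counter;
        }
}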

For example, in a four-entry ring, the driver makes available a chain of three descriptors:

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x80000000   0x1000   0    W|A
            0x81000000   0x1000   1    W|A
            0x82000000   0x1000   2    W|A
            (empty)

Figure: Three chained descriptors available

After that, the device discovers the chain (polling position 0) and marks it as used, overwriting only position 0; it skips positions 1 and 2 completely. When the driver polls for used descriptors, it will skip them too, knowing that the chain was 3 descriptors long:

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x80000000   0x1000   2    W|A|U
            0x81000000   0x1000   1    W|A
            0x82000000   0x1000   2    W|A
            (empty)

Figure: Using the descriptor chain

Now the driver produces another chain, two descriptors long, and it has to take the wrapping into account:

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x81000000   0x1000   1    W|(!A)|U
            0x81000000   0x1000   1    W|A
            0x82000000   0x1000   2    W|A
            0x80000000   0x1000   0    W|A

Figure: Making available another descriptor chain

And the device marks it as used, returning the tail id (#1), so only the first descriptor in the chain (the 4th in the table) needs to be updated.

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x81000000   0x1000   1    W|(!A)|U
            0x81000000   0x1000   1    W|A
            0x82000000   0x1000   2    W|A
            0x80000000   0x1000   1    W|A|U

Figure: Using another descriptor chain

Although the next descriptor (the 2nd) might seem available, since its avail flag differs from its used flag, the device knows it is not, because it tracks the driver's internal wrap counter: the right flag combination for this round is avail clear, used set.

Indirect descriptors: When chains are not enough

Indirect descriptors work like in the split case. First, the driver allocates a table of indirect descriptors, each with the same layout as regular packed descriptors, anywhere in memory. After that, it sets each descriptor in this indirect table to a buffer it wants to make available for the device (steps 1-2), and inserts a descriptor in the virtqueue with the flag VIRTQ_DESC_F_INDIRECT (0x4) set (step 3). The descriptor's address and length correspond to those of the indirect table.

In the packed layout, buffers must come in order in the indirect table, and their ID fields are completely ignored. Also, the only valid flag for them is VIRTQ_DESC_F_WRITE; the others are reserved and ignored by the device. As usual, the driver will notify the device if the conditions for notification are met (step 4).
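A sketch of building and publishing such a table, reusing driver_make_avail from the earlier snippet; alloc_indirect_table and next_id are hypothetical helpers, and the pointer-to-address cast assumes an identity mapping for simplicity:

#define VIRTQ_DESC_F_WRITE    (1 << 1)
#define VIRTQ_DESC_F_INDIRECT (1 << 2)

void driver_make_avail_indirect(struct virtq *vq,
                                const uint64_t buf_addr[3])
{
        /* Three regular packed descriptors, allocated anywhere */
        struct virtq_desc *table = alloc_indirect_table(3);

        for (int i = 0; i < 3; i++) {
                table[i].addr  = buf_addr[i];
                table[i].len   = 0x1000;
                table[i].id    = 0;                  /* ignored here */
                table[i].flags = VIRTQ_DESC_F_WRITE; /* only valid flag */
        }

        /* One ring descriptor covers the whole 48-byte table */
        driver_make_avail(vq, (uint64_t)(uintptr_t)table,
                          3 * sizeof(struct virtq_desc),
                          next_id(), VIRTQ_DESC_F_INDIRECT);
}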

 

Diagram: Driver makes available a descriptor using a packed queue


For example, the driver would need to allocate this 48-byte table (three 16-byte descriptors) for a three-descriptor indirect table:

 

Address      Length   ID    Flags
----------   ------   ---   -----
0x80000000   0x1000   ...   W
0x81000000   0x1000   ...   W
0x82000000   0x1000   ...   W

Figure: Three-descriptor-long indirect packed table

And if the driver places the indirect table in the first entry of the descriptor table, assuming the indirect table itself is allocated at address 0x83000000:

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x83000000   48       0    A|I
            ...

Figure: Driver makes an indirect table available

After consuming the indirect buffers, the device needs to return the id of the indirect descriptor (0 in the example) in its used descriptor. The table looks like the return of the first buffer, except for the indirect (I) flag being set:

Avail idx   Address      Length   ID   Flags      Used idx
---------   ----------   ------   --   --------   --------
            0x83000000   48       0    A|U|I
            ...

Figure: Device makes an indirect table used

After that, the device cannot access the indirect table anymore unless the driver makes it available again, so the latter can free or reuse its memory.

Notifications: how to manage interrupts?

As with the split virtqueue, each side of the communication maintains two identical structures used for controlling notifications between the device and the driver. The driver's one is read-only for the device, and the device's one is read-only for the driver.

The struct layout is:

struct pvirtq_event_suppress { 
        le16 desc;
        le16 flags; 
};

Listing: Event suppression struct notification

The flags member can take the following values:

  • 0: Notifications are enabled

  • 1: Notifications are disabled

  • 2: Notifications are enabled only for a specific descriptor, specified in the desc member.

If the flags value is 2, the other side will only notify once the descriptor placed at position desc (discarding its most significant bit) is made used/available in the round whose wrap counter matches that most significant bit. For this mode to work, the VIRTIO_F_RING_EVENT_IDX flag needs to be negotiated in Virtio device initialization: feature bits.
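For illustration, a sketch of arming that mode; the structure mirrors the listing above (le16 approximated with uint16_t), enable_notification_at is a hypothetical helper, and the value 2 is the descriptor-specific mode just described:

#include <stdint.h>
#include <stdbool.h>

struct pvirtq_event_suppress {
        uint16_t desc;
        uint16_t flags;
};

void enable_notification_at(struct pvirtq_event_suppress *es,
                            uint16_t ring_idx, bool wrap_counter)
{
        /* Low 15 bits: descriptor position; MSB: expected wrap counter */
        es->desc  = (uint16_t)(ring_idx | (wrap_counter ? 1u << 15 : 0u));
        /* A real implementation needs a write barrier before this store */
        es->flags = 2;  /* notify only for this specific descriptor */
}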

None of these mechanisms are 100% reliable, since the other side could already have sent the notification by the time we set the values, so expect notifications even when they are disabled.

Note that, since the descriptor ring size is not forced to be a power of two (in contrast with the split version), the notification structure can fit in the same page as the descriptor table. This can be advantageous for some implementations.

Summary

In this series we have taken you through the different virtio data plane layouts and their virtqueue implementations. They are the means for virtio devices and virtio drivers to exchange information.

We started by covering the simpler and less optimized split virtqueue layout. This layout is relatively easy to implement and debug, which makes it a good entry point for learning the virtio dataplane basics.

We then moved on to the packed virtqueue layout, specified in virtio 1.1, which allows exchanging requests using a more compact descriptor representation. This avoids the overhead of scattering the data through memory, reducing cache contention and the number of PCI transactions in the case of actual hardware.

We also covered a number of optimizations on top of both ring layouts, which depend on the communication/device type and on how each part is implemented. They are mainly oriented toward reducing the communication overhead, both in notifications and in memory transactions. Virtio offers a simple protocol for communicating which features and optimizations each side supports, so the two can agree on how the data is going to be exchanged, which makes it highly future-proof.

This series covered the essence of the virtio data plane and provided you with the tools to analyze and develop your own virtio devices and drivers. Note that this series summarizes the relevant sections of the virtio spec; you should refer to the spec for additional information and treat it as the source of truth.

In the next posts we will return to vDPA, including the kernel framework, hands-on blogs, and vDPA in Kubernetes.


About the author

Eugenio Pérez works as a Software Engineer in the Virtualization and Networking (virtio-net) team at Red Hat. He has been developing and promoting free software on Linux since the start of his career, always closely tied to networking, whether through packet capture or classic monitoring. He enjoys learning how things are implemented and how he can expand them, keeping them simple (KISS) and focusing on maintainability and security.
