In this post we will set up an environment and run a DPDK-based application in a virtual machine. We will go over all the steps required to set up a simple virtual switch in the host system that connects to the application in a VM. This includes a description of how to create, install and run a VM and how to install the application in it. You will learn how to create a simple setup in which you send packets from the application in the guest to a virtual switch in the host system and back. Based on this setup you will learn how to tune settings to achieve optimal performance.
Setting up
For readers interested in playing with DPDK but not in configuring and installing the required setup, we have Ansible playbooks in a GitHub repository that can be used to automate everything. Let’s start with the basic setup.
Requirements:
- A computer running a Linux distribution. This guide uses CentOS 7; however, the commands should not change significantly for other Linux distros, in particular for Red Hat Enterprise Linux 7.
- A user with sudo permissions
- ~ 25 GB of free space in your home directory
- At least 8GB of RAM
To start, we install the packages we are going to need:
sudo yum install qemu-kvm libvirt-daemon-qemu libvirt-daemon-kvm libvirt virt-install libguestfs-tools-c kernel-tools dpdk dpdk-tools
Creating a VM
First, download the latest CentOS-Cloud-Base image from the following website:
user@host $ sudo wget -O /var/lib/libvirt/images/CentOS-7-x86_64-GenericCloud.qcow2 http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud.qcow2
(Note that the URL above might change; if it does, update it to the latest qcow2 image from https://wiki.centos.org/Download.)
This downloads a preinstalled version of CentOS 7, ready to run in an OpenStack environment. Since we’re not running OpenStack, we have to clean the image. To do that, we first create a new image that is backed by the downloaded one, so the original stays untouched and can be reused in the future:
user@host $ sudo qemu-img create -f qcow2 -b /var/lib/libvirt/images/CentOS-7-x86_64-GenericCloud.qcow2 /var/lib/libvirt/images/vhuser-test1.qcow2 20G
The libvirt commands to do this can be executed with an unprivileged user (recommended) if we export the following variable:
user@host $ export LIBVIRT_DEFAULT_URI="qemu:///system"
Now, the cleaning command (change the password to your own):
user@host $ sudo virt-sysprep --root-password password:changeme --uninstall cloud-init --selinux-relabel -a /var/lib/libvirt/images/vhuser-test1.qcow2 --network --install "dpdk,dpdk-tools,pciutils"
This command mounts the filesystem and applies some basic configuration automatically so that the image is ready to boot afresh.
We need a network to connect our VM as well. Libvirt handles networks in a similar way to how it manages VMs: you can define a network using an XML file and start or stop it through the command line.
For this example, we will use a network called ‘default’ whose definition is shipped inside libvirt for convenience. The following commands define the ‘default’ network, start it and check that it’s running.
user@host $ virsh net-define /usr/share/libvirt/networks/default.xml
Network default defined from /usr/share/libvirt/networks/default.xml
user@host $ virsh net-start default
Network default started
user@host $ virsh net-list
 Name      State    Autostart   Persistent
--------------------------------------------
 default   active   no          yes
Finally, we can use virt-install to create the VM. This command line utility creates the needed definitions for a set of well known operating systems. This will give us the base definitions that we can then customize:
user@host $ virt-install --import --name vhuser-test1 --ram=4096 --vcpus=3 \
--nographics --accelerate \
--network network:default,model=virtio --mac 02:ca:fe:fa:ce:aa \
--debug --wait 0 --console pty \
--disk /var/lib/libvirt/images/vhuser-test1.qcow2,bus=virtio --os-variant centos7.0
The options used for this command specify the number of vCPUs, the amount of RAM of our VM as well as the disk path and the network we want the VM to be connected to.
Apart from defining the VM according to the options that we specified, the virt-install command should have also started the VM for us so we should be able to list it:
user@host $ virsh list
 Id   Name           State
------------------------------
 1    vhuser-test1   running
Voilà! Our VM is running. We need to make some changes to its definition soon. So we will shut it down now:
user@host $ virsh shutdown vhuser-test1
Preparing the host
DPDK helps with optimally allocating and managing memory buffers. On Linux this requires hugepage support, which must be enabled in the running kernel. Using pages larger than the usual 4K improves performance because the same amount of memory is covered by fewer pages, which in turn means fewer TLB (Translation Lookaside Buffer) misses. The TLB caches the translations from virtual to physical addresses. To allocate hugepages during boot we add the following to the kernel parameters in the bootloader configuration.
user@host $ sudo grubby --args="default_hugepagesz=1G hugepagesz=1G hugepages=6 iommu=pt intel_iommu=on" --update-kernel /boot/<your kernel image file>
Let’s understand what each of the parameters does:
- default_hugepagesz=1G: make all created hugepages 1G in size by default
- hugepagesz=1G: for the hugepages created during startup, set the size to 1G as well
- hugepages=6: create 6 hugepages (of size 1G) from the start. These should be seen after booting in /proc/meminfo
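To get a feel for the saving, the following sketch counts how many pages are needed to map the 6 GiB reserved here, once with 4 KiB pages and once with 1 GiB pages. The helper name pages_for is made up for this illustration.

```shell
#!/bin/sh
# pages_for MEM_KIB PAGE_KIB: number of pages of PAGE_KIB needed to map
# MEM_KIB of memory, rounded up. Hypothetical helper for illustration.
pages_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

mem_kib=$(( 6 * 1024 * 1024 ))            # 6 GiB expressed in KiB
pages_for "$mem_kib" 4                    # 4 KiB pages -> 1572864
pages_for "$mem_kib" $(( 1024 * 1024 ))   # 1 GiB pages -> 6
```

Every one of those pages is a potential TLB entry, which is why the larger page size leaves far less room for misses.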
Note that in addition to the hugepage settings we also added two IOMMU-related kernel parameters, iommu=pt intel_iommu=on. This will initialize Intel VT-d and the IOMMU pass-through mode that we will need for handling IO in Linux userspace. As we changed kernel parameters, now is a good time to reboot the host.
After it comes up we can check that our changes to the kernel parameters were effective by running:
user@host $ cat /proc/cmdline
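Instead of reading the command line by eye, the check can be scripted. The following is a minimal sketch; missing_kernel_args is a hypothetical helper (not part of grubby or DPDK) that prints every expected parameter that is absent from the given command line.

```shell
#!/bin/sh
# missing_kernel_args CMDLINE ARG...: print every ARG that does not appear
# as a whole word in CMDLINE. Hypothetical helper for illustration.
missing_kernel_args() (
    cmdline=" $1 "
    shift
    for arg in "$@"; do
        case "$cmdline" in
            *" $arg "*) ;;                 # present, nothing to report
            *) printf '%s\n' "$arg" ;;     # absent
        esac
    done
)

# Report which of the parameters from this guide are not active yet.
if [ -r /proc/cmdline ]; then
    missing_kernel_args "$(cat /proc/cmdline)" \
        default_hugepagesz=1G hugepagesz=1G hugepages=6 iommu=pt intel_iommu=on
fi
```

No output means all five parameters are active. Matching on whole words avoids mistaking default_hugepagesz=1G for hugepagesz=1G.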
Preparing the guest
The virt-install command created and started a VM using libvirt. To connect our DPDK-based vswitch testpmd to QEMU we need to add the definition of the vhost-user interfaces (backed by UNIX sockets) to the <devices> section of the XML:
user@host $ virsh edit vhuser-test1
Add the following to the <devices> section:
<interface type='vhostuser'>
  <mac address='56:48:4f:53:54:01'/>
  <source type='unix' path='/tmp/vhost-user1' mode='client'/>
  <model type='virtio'/>
  <driver name='vhost' rx_queue_size='256' />
</interface>
<interface type='vhostuser'>
  <mac address='56:48:4f:53:54:02'/>
  <source type='unix' path='/tmp/vhost-user2' mode='client'/>
  <model type='virtio'/>
  <driver name='vhost' rx_queue_size='256' />
</interface>
Another difference in the guest config compared to the one used for vhost-net is the use of hugepages. For that we add the following to the guest definition:
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB' nodeset='0'/>
  </hugepages>
  <locked/>
</memoryBacking>
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
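Before starting anything that consumes hugepages, it can help to confirm that enough of them are still free. The sketch below assumes a budget of five 1G pages: four for a guest created with --ram=4096 and one for a host testpmd started with --socket-mem=1024. Both the budget and the helper names are assumptions made for this illustration.

```shell
#!/bin/sh
# hugepages_free: print the HugePages_Free counter from meminfo-formatted
# text on stdin. Hypothetical helper, not part of DPDK or libvirt.
hugepages_free() {
    awk '/^HugePages_Free:/ { print $2 }'
}

# enough_hugepages FREE NEEDED: succeed if FREE pages cover NEEDED pages.
enough_hugepages() {
    [ "$1" -ge "$2" ]
}

# Assumed budget: 4 pages for the guest plus 1 page for the host testpmd.
needed=5
if [ -r /proc/meminfo ]; then
    free=$(hugepages_free < /proc/meminfo)
    if enough_hugepages "${free:-0}" "$needed"; then
        echo "hugepages ok: $free free, $needed needed"
    else
        echo "not enough hugepages: ${free:-0} free, $needed needed"
    fi
fi
```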
For the vhost-user backend to be able to access the guest memory, we need an additional setting in the guest configuration: the memory must be marked as shared (memAccess='shared'). This is an important setting; without it we won’t see any packets being transmitted:
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='3' threads='1'/>
  <numa>
    <cell id='0' cpus='0-2' memory='3145728' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>
Now we need to start our guest. Because we configured it to connect to the vhost-user UNIX sockets, we need to be sure they are available when the guest is started. This is achieved by starting testpmd, which will open the sockets for us:
user@host $ sudo testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4 \
--vdev 'net_vhost0,iface=/tmp/vhost-user1' \
--vdev 'net_vhost1,iface=/tmp/vhost-user2' -- \
--portmask=f -i --rxq=1 --txq=1 \
--nb-cores=4 --forward-mode=io
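When scripting this setup, the ordering requirement can be enforced by waiting for the socket files before starting the guest. The following is a minimal sketch; wait_for_sockets is a hypothetical helper and the 10 second timeout in the usage comment is an arbitrary choice.

```shell
#!/bin/sh
# wait_for_sockets TIMEOUT PATH...: wait until every PATH exists and is a
# UNIX socket, giving up after TIMEOUT seconds per path.
# Hypothetical helper for illustration.
wait_for_sockets() (
    timeout=$1
    shift
    for path in "$@"; do
        waited=0
        until [ -S "$path" ]; do
            if [ "$waited" -ge "$timeout" ]; then
                echo "timed out waiting for $path" >&2
                exit 1
            fi
            sleep 1
            waited=$(( waited + 1 ))
        done
    done
    exit 0
)

# Example (not executed here):
#   wait_for_sockets 10 /tmp/vhost-user1 /tmp/vhost-user2 && virsh start vhuser-test1
```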
One last thing: because we connect to the vhost-user UNIX sockets, we need to make QEMU run as root for this experiment. To do this, set user = root in /etc/libvirt/qemu.conf. This is required for our special use case but not recommended in general. In fact, readers should revert this setting after following this hands-on article by commenting out the user = root setting.
Now we can start the VM with:
user@host $ virsh start vhuser-test1
Log in as root. The first thing we do in the guest is to bind the virtio devices to the vfio-pci driver. To be able to do this we need to load the required kernel modules first.
root@guest $ modprobe vfio enable_unsafe_noiommu_mode=1
root@guest $ cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
root@guest $ modprobe vfio-pci
Let’s find out the PCI addresses of our virtio-net devices first.
root@guest $ dpdk-devbind --status net
…
Network devices using kernel driver
===================================
0000:01:00.0 'Virtio network device 1041' if=enp1s0 drv=virtio-pci unused= *Active*
0000:0a:00.0 'Virtio network device 1041' if=enp1s1 drv=virtio-pci unused=
0000:0b:00.0 'Virtio network device 1041' if=enp1s2 drv=virtio-pci unused=
In the output of dpdk-devbind, look for the virtio devices that are not marked *Active*. We can use these for our experiment. Note: addresses may be different on the reader’s system. When the guest first boots, the devices are automatically bound to the virtio-pci driver. Because we want to use them with the vfio-pci kernel module rather than with the kernel driver, we first unbind them from virtio-pci and then bind them to the vfio-pci driver.
root@guest $ dpdk-devbind.py -b vfio-pci 0000:0a:00.0 0000:0b:00.0
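Picking the candidate devices out of the status listing can also be automated. The sketch below filters output in the format shown in this section and prints the PCI addresses of the virtio devices that are not flagged *Active*. The function name is hypothetical, and the filter is meant to be run before the devices are rebound.

```shell
#!/bin/sh
# inactive_virtio_devices: read `dpdk-devbind --status net` style output on
# stdin and print the PCI address of every virtio network device whose line
# is not flagged *Active*. Hypothetical helper for illustration.
inactive_virtio_devices() {
    awk '/Virtio network device/ && !/\*Active\*/ { print $1 }'
}

# Example (not executed here):
#   dpdk-devbind --status net | inactive_virtio_devices
```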
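Now the guest is prepared to run our DPDK-based application. To make this binding permanent we could also use the driverctl utility. Substitute the PCI addresses of your own devices; the examples from here on use 0000:00:10.0 and 0000:00:11.0: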
root@guest $ driverctl -v set-override 0000:00:10.0 vfio-pci
Do the same for the second virtio-device with address 0000:00:11.0. Then list all overrides to check it worked:
user@guest $ sudo driverctl list-overrides
0000:00:10.0 vfio-pci
0000:00:11.0 vfio-pci
Generating traffic
We have installed and configured everything needed to finally run network traffic over our interfaces. Let’s start: in the host we first need to start the testpmd instance which acts as a virtual switch. We will just make it forward all packets it receives on interface net_vhost0 to net_vhost1. It needs to be started before we start the VM, because QEMU will try to connect to the UNIX sockets belonging to the vhost-user devices during initialization, and those sockets are created by testpmd.
root@host $ testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4 \
--vdev 'net_vhost0,iface=/tmp/vhost-user1' \
--vdev 'net_vhost1,iface=/tmp/vhost-user2' -- \
--portmask=f -i --rxq=1 --txq=1 \
--nb-cores=4 --forward-mode=io
Now we can launch the VM we prepared previously:
user@host $ virsh start vhuser-test1
Notice how the testpmd window now shows the vhost-user messages it received.
Once the guest has booted we can start the testpmd instance in the guest. This one will initialize the ports and the virtio-net driver that DPDK implements. Among other things this is where the virtio feature negotiation takes place and the set of common features is agreed upon.
Before we start testpmd we make sure that the vfio kernel module is loaded and bind the virtio-net devices to the vfio-pci driver:
root@guest $ dpdk-devbind.py -b vfio-pci 0000:00:10.0 0000:00:11.0
Start testpmd:
root@guest $ testpmd -l 0,1,2 --socket-mem 1024 -n 4 \
--proc-type auto --file-prefix pg -- \
--portmask=3 --forward-mode=macswap --port-topology=chained \
--disable-rss -i --rxq=1 --txq=1 \
--rxd=256 --txd=256 --nb-cores=2 --auto-start
Now we can check how many packets our testpmd instances are processing. On the testpmd prompt we enter the command ‘show port stats all’ and see the number of packets forwarded in each direction (RX/TX).
An example:
testpmd> show port stats all
######################## NIC statistics for port 0 ########################
RX-packets: 75525952 RX-missed: 0 RX-bytes: 4833660928
RX-errors: 0
RX-nombuf: 0
TX-packets: 75525984 TX-errors: 0 TX-bytes: 4833662976
Throughput (since last show)
Rx-pps: 4684120
Tx-pps: 4684120
#########################################################################
######################## NIC statistics for port 1 ########################
RX-packets: 75525984 RX-missed: 0 RX-bytes: 4833662976
RX-errors: 0
RX-nombuf: 0
TX-packets: 75526016 TX-errors: 0 TX-bytes: 4833665024
Throughput (since last show)
Rx-pps: 4681229
Tx-pps: 4681229
#########################################################################
There are different forwarding modes in testpmd. In this example we used --forward-mode=macswap, which swaps the destination and source MAC address. Other forwarding modes like ‘io’ don’t touch packets at all and will give much higher, but also even more unrealistic numbers. Another forwarding mode is ‘noisy’. It can be fine-tuned to simulate packet buffering and memory lookups.
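The counters in the example statistics also tell us what kind of traffic is circulating. Dividing the byte counter by the packet counter gives the average packet size, and the pps value is easier to read in millions. The helper names in this sketch are made up for the illustration.

```shell
#!/bin/sh
# avg_packet_size BYTES PACKETS: average packet size in bytes (integer).
avg_packet_size() {
    echo $(( $1 / $2 ))
}

# to_mpps PPS: packets per second expressed in millions, two decimals.
to_mpps() {
    awk -v pps="$1" 'BEGIN { printf "%.2f\n", pps / 1000000 }'
}

avg_packet_size 4833660928 75525952   # port 0 RX counters -> 64
to_mpps 4684120                       # port 0 Rx-pps      -> 4.68
```

So the example run moves 64-byte packets at roughly 4.7 million packets per second in each direction.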
Extra: Optimizing the configuration for maximum throughput and low latency
So far we mostly stayed with the default settings. This helped to keep the tutorial simple and easy to follow. But for those readers interested in tuning all components for the best performance we will explain what is needed to achieve this.
Optimizing host settings
We start with optimizing our host system.
There are a few settings we need to apply in the host system to achieve optimal performance. Note that you don’t necessarily have to do all these manual steps. With tuned you get a set of available tuned profiles that you can choose from. Applying the cpu-partitioning profile of tuned will take care of all the steps we will execute manually here.
Before we start explaining the tunings in detail, this is how you use the tuned cpu-partitioning profile without having to bother with all the details:
user@host $ sudo yum install tuned-profiles-cpu-partitioning
Then edit /etc/tuned/cpu-partitioning-variables.conf and set isolated_cores and no_balance_cores both to 2-7.
Now we can apply the tuned profile with the tuned-adm command:
user@host $ sudo tuned-adm profile cpu-partitioning
Rebooting to apply the changes is necessary because we add kernel parameters.
For those readers who want to know more details of what the cpu-partitioning profile does, let’s do these steps manually. If you’re not interested in this you can just skip to the next section.
Let’s assume we have eight cores in the system and we want to isolate six of them on the same NUMA node. We use two cores to run the guest virtual CPUs and the remaining four to run the data path of the application.
The most basic change is done even outside of Linux in the BIOS settings of the system. There we have to disable turbo-boost and hyper-threads. If the BIOS is for some reason not accessible, disable hyperthreads with this command:
cat /sys/devices/system/cpu/cpu*[0-9]/topology/thread_siblings_list \
| sort | uniq \
| awk -F, '{system("echo 0 > /sys/devices/system/cpu/cpu"$2"/online")}'
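Since taking CPUs offline is disruptive, it can be worth previewing which hardware threads the pipeline would pick. The following dry-run sketch only prints the sibling thread IDs and changes nothing. The function name is hypothetical, and unlike the command it previews, it also accepts sibling lists written as a range (for example 0-1), which is an assumption about how some systems report them.

```shell
#!/bin/sh
# sibling_threads: read thread_siblings_list lines on stdin and print the
# second hardware thread of each core. Both "0,4" and "0-1" list formats
# are accepted. Hypothetical dry-run helper; it changes nothing.
sibling_threads() {
    sort -u | awk -F'[,-]' 'NF > 1 { print $2 }'
}

# Example (not executed here):
#   cat /sys/devices/system/cpu/cpu*[0-9]/topology/thread_siblings_list | sibling_threads
```

Cores without a sibling (a single number per line) produce no output, so nothing would be taken offline for them.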
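After that we append the following to the kernel command line:
intel_pstate=disable isolcpus=2-7 rcu_nocbs=2-7 nohz_full=2-7
What these parameters mean is:
- intel_pstate=disable: avoid switching power states
- isolcpus, rcu_nocbs and nohz_full: keep the scheduler, RCU callbacks and clock ticks away from the listed cores; they are described in more detail in the guest settings below
Do this by running the following command, which additionally passes mce=ignore_ce to turn off the handling of corrected machine check errors:
user@host $ sudo grubby --args "intel_pstate=disable mce=ignore_ce isolcpus=2-7 rcu_nocbs=2-7 nohz_full=2-7" --update-kernel /boot/<your kernel image file>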
The NMI watchdog periodically raises non-maskable interrupts. These can reduce performance because they steal valuable cycles in which the core could handle packets instead, so we disable the watchdog with:
user@host $ echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
To a similar effect we exclude the cores we want to isolate from the writeback cpu mask. The value ffffff03 has the bits for cores 2-7 cleared:
user@host $ echo ffffff03 | sudo tee /sys/bus/workqueue/devices/writeback/cpumask
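The writeback mask does not have to be worked out by hand. The following sketch derives it from a list of isolated cores. mask_without is a hypothetical helper, and the 32-bit width is an assumption chosen to match the value used in this section.

```shell
#!/bin/sh
# mask_without NBITS CPULIST: print a hex CPU mask that is NBITS wide with
# the bits of every CPU in CPULIST (e.g. "2-7" or "1,2") cleared.
# Hypothetical helper for illustration.
mask_without() (
    nbits=$1
    mask=$(( (1 << nbits) - 1 ))          # start with all CPUs allowed
    for part in $(printf '%s' "$2" | tr ',' ' '); do
        lo=${part%-*}                     # "2-7" -> 2, "1" -> 1
        hi=${part#*-}                     # "2-7" -> 7, "1" -> 1
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask & ~(1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
)

mask_without 32 2-7   # host: isolate cores 2-7 -> ffffff03
```

The same helper gives 1 for a three-vCPU guest that isolates vCPUs 1 and 2 (mask_without 3 1,2).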
Optimizing guest settings
Similar to what we did in the host we also change the kernel parameters for the guest. Again by using grubby we add the following parameters to the configuration:
default_hugepagesz=1G hugepagesz=1G hugepages=1 intel_iommu=on iommu=pt isolcpus=1,2 rcu_nocbs=1,2 nohz_full=1,2
The meaning of the first three parameters is known from the host configuration. The others are:
- intel_iommu=on: Make use of the IOMMU.
- iommu=pt: Operate IOMMU in pass-through mode. More about what this means later.
- isolcpus=1,2: Ask kernel to isolate these cores.
- rcu_nocbs=1,2: Don’t do RCU callbacks on the cpu, offload it to other threads to avoid RCU callbacks as softirqs.
- nohz_full=1,2: Avoid scheduling clock ticks.
We want the same treatment for the guest cores handling packets as for the cores in the host, so we repeat the steps there: disable the NMI watchdog, exclude the cores from the block device writeback flusher threads, and keep IRQs away from them:
root@guest $ echo 0 > /proc/sys/kernel/nmi_watchdog
root@guest $ echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
root@guest $ clear_mask=0x6 # Isolate CPU1 and CPU2 from IRQs
for i in /proc/irq/*/smp_affinity
do
  echo "obase=16;$(( 0x$(cat $i) & ~$clear_mask ))" | bc > $i
done
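The arithmetic inside the IRQ loop is easier to follow in isolation. This sketch performs the same bit clearing on a single affinity value and prints the result with printf instead of bc. The function name is made up for the illustration.

```shell
#!/bin/sh
# clear_bits CURRENT_HEX CLEAR_MASK: print CURRENT_HEX with the bits in
# CLEAR_MASK removed, as lowercase hex. Hypothetical helper.
clear_bits() {
    printf '%x\n' $(( 0x$1 & ~$2 ))
}

clear_bits 7 0x6   # CPUs 0-2 allowed, drop CPU1 and CPU2 -> 1
clear_bits f 0x6   # CPUs 0-3 allowed, drop CPU1 and CPU2 -> 9
```

In the three-vCPU guest used here, every IRQ therefore ends up with affinity 1, meaning vCPU 0 only.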
Finally, pinning virtual CPUs to physical cores in the host makes sure the vCPUs are not scheduled onto different cores. We add the following to the guest definition:
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='6'/>
  <vcpupin vcpu='2' cpuset='7'/>
  <emulatorpin cpuset='0'/>
</cputune>
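For guests with more vCPUs, writing the pinning section by hand gets tedious. The sketch below prints a cputune fragment from a list of host cores. gen_cputune is a hypothetical helper, and its output still has to be pasted into the guest definition with virsh edit.

```shell
#!/bin/sh
# gen_cputune EMULATOR_CPU HOST_CPU...: print a <cputune> fragment that pins
# vCPU 0, 1, 2, ... to the given host cores, in order, and the emulator
# threads to EMULATOR_CPU. Hypothetical helper for illustration.
gen_cputune() (
    emulator=$1
    shift
    echo "<cputune>"
    vcpu=0
    for host_cpu in "$@"; do
        echo "  <vcpupin vcpu='$vcpu' cpuset='$host_cpu'/>"
        vcpu=$(( vcpu + 1 ))
    done
    echo "  <emulatorpin cpuset='$emulator'/>"
    echo "</cputune>"
)

# Reproduces the pinning used in this guide: vCPUs on host cores 1, 6 and 7.
gen_cputune 0 1 6 7
```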
Analyzing the performance
After going through all the performance tuning steps, let’s run our testpmd instances again to see how the number of packets per second changed.
testpmd> show port stats all
######################## NIC statistics for port 0 ########################
RX-packets: 24828768 RX-missed: 0 RX-bytes: 1589041152
RX-errors: 0
RX-nombuf: 0
TX-packets: 24828800 TX-errors: 0 TX-bytes: 1589043200
Throughput (since last show)
Rx-pps: 5207755
Tx-pps: 5207755
########################################################################
######################## NIC statistics for port 1 ########################
RX-packets: 24852096 RX-missed: 0 RX-bytes: 1590534144
RX-errors: 0
RX-nombuf: 0
TX-packets: 24852128 TX-errors: 0 TX-bytes: 1590536192
Throughput (since last show)
Rx-pps: 5207927
Tx-pps: 5207927
########################################################################
Compared to the untuned setup, the number of packets per second increased by roughly 11% (from about 4.68 to about 5.21 million packets per second on port 0). This is a very simple setup and we have no other workloads running on host and guest. In a more complex scenario the performance improvement might be even more significant.
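The relative gain can be recomputed from the Rx-pps values that port 0 reported in the two runs. A minimal sketch, with a helper name invented for this illustration:

```shell
#!/bin/sh
# pct_gain BEFORE AFTER: relative increase from BEFORE to AFTER in percent,
# printed with one decimal. Hypothetical helper for illustration.
pct_gain() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) * 100 / before }'
}

pct_gain 4684120 5207755   # port 0 Rx-pps before and after tuning -> 11.2
```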
Having built a simple setup first, in this section we concentrated on tuning the performance of the individual components. The key here is to deconfigure and disable everything that distracts cores (physical or virtual) from doing what they are supposed to do: handling packets.
We did this manually in what seems like a complicated set of commands so we can learn what is behind it all. But the truth is: all this can be achieved by installing and using tuned and the tuned-profiles-cpu-partitioning package and a simple one-line configuration file change. Even more, the single biggest impact is achieved by pinning the vCPUs to host cores.
Ansible scripts available
Setting this environment up and running is the first and fundamental step in order to understand, debug and test this architecture. In order to make it as quick and easy as possible, Red Hat’s virtio-net team has developed a set of Ansible scripts for everyone to use.
Just follow the instructions in the README and Ansible should take care of the rest.
Conclusion
We have set up and configured a host system to run a DPDK-based application and created a virtual machine that is connected to it via vhost-user interfaces. Inside the VM we ran testpmd, also built on DPDK, and used it to generate, send and receive packets in a loop between the testpmd vswitch instance in the host and the instance in the VM. The setup we looked at is a very simple one. A next step for the interested reader could be deploying and using OVS-DPDK, which is Open vSwitch built against DPDK. It’s a far more advanced virtual switch used in production scenarios.
This is the last post on the “Virtio-networking and DPDK” topic, which started with "how vhost-user came into being," and was followed by "journey to the vhost-users realm."
Prior posts / Resources
- Introducing virtio-networking: Combining virtualization and networking for modern IT
- Introduction to virtio-networking and vhost-net
- Deep dive into Virtio-networking and vhost-net
- Hands on vhost-net: Do. Or do not. There is no try
- How vhost-user came into being: Virtio-networking and DPDK
- A journey to the vhost-users realm
About the authors
Jens Freimann is a Software Engineering Manager at Red Hat with a focus on OpenShift sandboxed containers and Confidential Containers. He has been with Red Hat for more than six years, during which he has made contributions to low-level virtualization features in QEMU, KVM and virtio(-net). Freimann is passionate about Confidential Computing and has a keen interest in helping organizations implement the technology. Freimann has over 15 years of experience in the tech industry and has held various technical roles throughout his career.
Eugenio Pérez works as a Software Engineer in the Virtualization and Networking (virtio-net) team at Red Hat. He has been developing and promoting free software on Linux since the start of his career. His work has always been closely related to networking, whether packet capture or classic monitoring. He enjoys learning how things are implemented and how he can extend them, keeping them simple (KISS) and focusing on maintainability and security.