People working in the telco space with DPDK-accelerated Open vSwitch have probably seen questions about packet drops a number of times, from multiple customers. Is it OVS-DPDK, or is it the VNF? Why is it happening, and where? What happened and, more importantly, what can we do about it? In this post we'll look at troubleshooting and answering some of these questions.
About three years ago, at OVScon 2017, Ericsson came up with a proposal made of two parts:
- Exposing more advanced metrics from the PMD threads.
- Improving how dropped packets are classified in Open vSwitch.
Since then (and many OVS-DPDK escalations later), the upstream community (including a number of Red Hatters) has worked really hard on this, because one thing was clear: we had to improve the usability and observability of OVS-DPDK.
We're blind when it comes to troubleshooting, and this has to change.
Let's start with the basics: what does the modern OVS-DPDK architecture look like? (Keep in mind, this won't take hardware offload and AF_XDP scenarios into consideration.)

Figure 1 - OVS-DPDK architecture

Figure 2 - OVS-DPDK trouble spots
This post will break down the new features added to OVS, when those features are useful for troubleshooting and how to consume them.
1. OVS-DPDK PMD Performance Metrics
Okay, so this is a big one and probably the most complex to understand of the series. Ericsson pitched it highly as "The main tool for troubleshooting OVS-DPDK data plane issues."
So what does it do?
- Provides detailed statistics of what is happening in every iteration of a PMD thread, recording:
  - Cycles (aka the CPU frequency, also exposed as such)
  - Packets
  - RX batches
  - Packets per batch
  - Max vhost-user qlen (queue length)
  - Upcalls and the cycles spent in upcalls
- Those statistics are also exposed as histograms in three different forms, which can be extremely useful, especially to spot recurring patterns over time:
  - Histograms since the statistics reset (or Open vSwitch start).
  - Histograms for the last second, in slices of milliseconds.
  - Histograms for the last 999 iterations, in slices per iteration.
- As if the above wasn't enough, the icing on the cake is the detection and logging of suspicious PMD iterations. In other words, instead of scanning hours of histograms, OVS can log suspicious iterations given some patterns:
  - The PMD iteration lasts longer than US_THR microseconds (default 250).
  - The max vhost qlen exceeds a threshold Q_THR (default 128).
These Open vSwitch improvements come from experience in the field, where real-life work helped highlight which aspects of the DPDK datapath are worth tracking. And that's the most important feature for me: being able to see and correlate events over time (e.g. upcalls every 1 ms and dropped packets) is so powerful that it dramatically shortens the RCA. These advanced PMD performance metrics are disabled unless the operator turns them on, and they carry a performance impact of roughly 1%. To enable them:
ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=true
To see the statistics together with the histogram:
ovs-appctl dpif-netdev/pmd-perf-show
The command is quite useful since the stats are also available per PMD (by specifying the common -pmd [core] argument), and the cleanup is consistent with the current CLI, using:
ovs-appctl dpif-netdev/pmd-stats-clear
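Putting the three commands together, a minimal collection routine could look like the sketch below (a hypothetical helper, not an official tool; the core IDs simply match the examples that follow):

#!/bin/bash
# Sketch: enable the advanced PMD metrics, reset the counters, let traffic
# run for a measurement window, then dump the per-PMD report.
ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=true
ovs-appctl dpif-netdev/pmd-stats-clear
sleep 60                      # measurement window in seconds
for core in 2 8; do           # PMD cores of interest
    ovs-appctl dpif-netdev/pmd-perf-show -nh -pmd "$core"
done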
Let's break it down a bit.
1.1 Global PMD Stats
The top part reports global PMD statistics, similar to pmd-stats-show
but way easier to understand. In my case, a single PMD thread is displayed.
ovs-appctl dpif-netdev/pmd-perf-show -nh -pmd 2
Time: 14:02:49.792
Measurement duration: 74.074 s

pmd thread numa_id 0 core_id 2:

  Iterations:         253645450  (0.29 us/it)
  - Used TSC cycles: 182424040882  ( 98.5 % of total cycles)
  - idle iterations:  246279232  ( 42.4 % of used cycles)
  - busy iterations:    7366218  ( 57.6 % of used cycles)
  Rx packets:         148150420  (2000 Kpps, 709 cycles/pkt)
  Datapath passes:    148150420  (1.00 passes/pkt)
  - EMC hits:          46902786  ( 31.7 %)
  - SMC hits:         101243794  ( 68.3 %)
  - Megaflow hits:         3832  (  0.0 %, 1.00 subtbl lookups/hit)
  - Upcalls:                  8  (  0.0 %, 27.0 us/upcall)
  - Lost upcalls:             0  (  0.0 %)
  Tx packets:         148146584  (2000 Kpps)
  Tx batches:          11667397  (12.70 pkts/batch)
There is a clear report of:
- How long a PMD iteration takes on average (0.29 us).
- Out of the total number of iterations (253645450), how many are busy (7366218) and how many are idle (246279232); in other words, does the PMD have any spare capacity?
- How many TSC cycles were used on the given CPU, in other words the number of CPU clock cycles consumed.
- How many packets per second are received (2000 Kpps) and how many cycles are spent per packet (a quick sanity check of these numbers follows the list).
- How many passes per packet happen in the datapath (also known as recirculation; none in this case, with a value of 1.00 passes/pkt), plus statistics about the cache utilization (EMC, SMC, Megaflow, upcalls and any lost ones).
- How many packets per second are transmitted (2000 Kpps), the number of batches and, of course, how many packets are sent per batch.
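To make the relationship between those figures concrete, here is a quick back-of-the-envelope check (just a sketch reusing the numbers reported above):

# cycles/pkt is roughly the busy share of the used TSC cycles divided by the
# Rx packet count, and Kpps is simply Rx packets over the measurement duration.
awk 'BEGIN {
    used_cycles = 182424040882; busy_share = 0.576
    rx_packets  = 148150420;    duration_s = 74.074
    printf "cycles/pkt ~ %.0f\n", used_cycles * busy_share / rx_packets  # ~709
    printf "Kpps       ~ %.0f\n", rx_packets / duration_s / 1000         # ~2000
}'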
1.2 Performance Histogram
The performance histogram is not as intuitive at first but it includes a great deal of detail about performance over time.
ovs-appctl dpif-netdev/pmd-perf-show -pmd 2 Time: 14:02:49.796 Measurement duration: 74.078 s pmd thread numa_id 0 core_id 2: Iterations: 253661585 (0.29 us/it) - Used TSC cycles: 182436009747 ( 98.5 % of total cycles) - idle iterations: 246294875 ( 42.4 % of used cycles) - busy iterations: 7366710 ( 57.6 % of used cycles) Rx packets: 148160167 (2000 Kpps, 709 cycles/pkt) Datapath passes: 148160167 (1.00 passes/pkt) - EMC hits: 46905591 ( 31.7 %) - SMC hits: 101250736 ( 68.3 %) - Megaflow hits: 3832 ( 0.0 %, 1.00 subtbl lookups/hit) - Upcalls: 8 ( 0.0 %, 27.0 us/upcall) - Lost upcalls: 0 ( 0.0 %) Tx packets: 148156331 (2000 Kpps) Tx batches: 11668188 (12.70 pkts/batch) Histograms cycles/it packets/it cycles/pkt pkts/batch max vhost qlen upcalls/it cycles/upcall 499 244862637 0 246295034 100 0 0 0 0 253661747 0 253661739 999 0 716 634491 1 1371421 1096 4883015 1 3269163 1 0 1 8 1258 0 1025 725571 2 376039 2093 1115087 2 1308309 2 0 2 0 1584 0 1469 70524 3 276151 3090 1023449 3 914742 3 0 3 0 1995 0 2104 76371 4 249456 4086 331190 4 848946 4 0 4 0 2511 0 3014 955012 5 215626 5083 10891 5 703371 5 0 5 0 3162 0 4317 865926 6 178136 6080 2417 6 632981 6 0 6 0 3981 0 6183 602917 7 147110 7076 532 7 598099 7 0 7 0 5011 0 8857 619503 8 133248 8073 116 8 563776 8 0 8 0 6309 0 12686 881332 9 126671 9070 7 9 544639 9 0 9 0 7943 0 18171 948439 10 115439 10066 0 10 541891 10 0 10 0 9999 0 26027 1019270 11 120137 11063 0 11 596121 11 0 11 0 12589 0 37279 1210263 13 267800 12060 0 12 644960 12 0 12 0 15848 0 53395 189486 17 574838 13056 1 13 640917 13 0 13 0 19952 0 76480 1 22 490689 14053 0 14 604726 16 0 14 0 25118 0 109544 1 28 546363 15050 1 16 1114554 20 0 15 0 31622 2 156903 1 35 498462 16046 2 17 398686 25 0 16 0 39810 2 224736 4 45 636232 17043 0 18 354995 31 0 17 0 50118 2 321895 0 57 542871 18040 0 19 300376 38 0 18 0 63095 2 461058 0 72 500027 19036 0 20 285025 48 0 19 0 79432 2 660385 0 92 0 20033 0 21 256811 59 0 20 0 99999 4 945886 0 117 0 21030 0 22 263800 73 0 21 0 125892 2 1354815 0 148 0 22026 0 23 254168 91 0 22 0 158489 0 1940535 0 188 0 23023 0 24 265690 113 0 23 0 199526 0 2779476 0 239 0 24020 0 25 226204 140 0 24 0 251188 0 3981111 0 303 0 25016 0 26 220191 174 0 25 0 316227 0 5702243 0 385 0 26013 0 27 208123 216 0 26 0 398107 0 8167461 0 489 0 27010 0 28 228785 268 0 27 0 501187 0 11698454 0 621 0 28006 0 29 209380 332 0 28 0 630957 0 16755981 0 788 0 29003 0 30 222237 412 0 29 0 794328 0 24000000 0 1000 0 30000 0 32 1815253 512 0 30 0 1000000 0 > 0 > 0 > 5 > 0 > 0 > 0 > 0 ---------------------------------------------------------------------------------------------------------------------------------------------------------- cycles/it packets/it cycles/pkt pkts/batch vhost qlen upcalls/it cycles/upcall 719 0.58409 708 12.69560 0.00000 0.00000 67425
So what's the meaning of all of this?
- Cycles per iteration (cycles/it): this histogram tells us how the CPU and memory subsystem perform, how the NICs (different vendors, VHU, etc.) perform, and how the features enabled in Open vSwitch affect the number of cycles required to complete an iteration.
  - In the example above, most iterations complete in 499 cycles (244862637 of them), but together with the global PMD stats it's clear those are mostly the idle iterations (idle iterations: 246294875). Most busy iterations actually require between 26027 and 37279 cycles, as those two buckets count respectively 1019270 and 1210263 iterations.
- Packets per iteration (packets/it): this histogram tells us how OVS-enabled features and the traffic profile can cause packet processing to take more time.
  - In the example above, clearly the lower the number of packets (1 packet for 1371421 iterations), the more inefficient the processing was; a possible cause could be an upcall, but let's wait for the next section for that.
- Cycles per packet (cycles/pkt): this histogram tells us how the CPU and the OVS-enabled features affect packet processing.
  - In the example above, most packets (4883015 of them) have spent only 1096 cycles each.
- Packets per batch (pkts/batch): this histogram tells us how many packets can be pushed per batch.
  - In the example above there is no definitive answer, but 3269163 times only a single packet was pushed, which sounds quite bad compared with the opposite side of the histogram, where 1815253 times 32 packets were sent together. To explain this column further: for every PMD iteration, OVS counts the number of packets processed and increments the corresponding bucket (+1) for that packet count.
- Upcalls per iteration (upcalls/it): this histogram tells us how many iterations have made an upcall.
  - In the example above it's flat, and only 8 iterations have made an upcall.
- Cycles per upcall (cycles/upcall): similarly to the previous one, this histogram tells us how many cycles are spent on upcall processing.
  - The example above being flat, the cycles have little value here.
- At the bottom of each histogram, we also have the average values.
The max vhost qlen requires a different example and is one of, if not the most important histogram in this series:
- Depth of the vHost queue (max vhost qlen): this histogram tells us an extremely important value about how OVS performs. The queue length represents the buffer of the virtual interface used to push packets from the VM to OVS. In the event the buffer is full, when for example the DPDK application can't keep up, there is a packet drop, and OVS on the host is completely unaware of it (the drop is only truly visible in the VM's counters, whether the guest runs a DPDK app or the kernel).
- In the original example above, everything is zero because that PMD only handles physical interfaces; see the example below, which is for a PMD dedicated to a single VHU, followed by a guest-side check. We can clearly see how most of the time the buffer is empty (which is very good) and the peaks are mostly between 1 and 12.
ovs-appctl dpif-netdev/pmd-perf-show -pmd 8 Time: 10:33:23.320 Measurement duration: 124.554 s pmd thread numa_id 0 core_id 8: Iterations: 278641911 (0.44 us/it) - Used TSC cycles: 304655522174 ( 97.8 % of total cycles) - idle iterations: 232566988 ( 23.5 % of used cycles) - busy iterations: 46074923 ( 76.5 % of used cycles) Rx packets: 305809148 (2455 Kpps, 762 cycles/pkt) Datapath passes: 305809148 (1.00 passes/pkt) - EMC hits: 305809148 (100.0 %) - SMC hits: 0 ( 0.0 %) - Megaflow hits: 0 ( 0.0 %, 0.00 subtbl lookups/hit) - Upcalls: 0 ( 0.0 %, 0.0 us/upcall) - Lost upcalls: 0 ( 0.0 %) Tx packets: 305809148 (2455 Kpps) Tx batches: 46074923 (6.64 pkts/batch) Histograms cycles/it packets/it cycles/pkt pkts/batch max vhost qlen upcalls/it cycles/upcall 499 230082962 0 232567031 100 0 0 0 0 232567031 0 278641961 999 0 716 2348794 1 4429713 1096 37488609 1 4429713 1 4429713 1 0 1258 0 1025 126360 2 4357395 2093 7723788 2 4357395 2 4357395 2 0 1584 0 1469 440700 3 4716953 3090 857961 3 4716953 3 4716953 3 0 1995 0 2104 4178725 4 7991733 4086 4527 4 7991733 4 7991733 4 0 2511 0 3014 7490999 5 2139353 5083 13 5 2139353 5 2139353 5 0 3162 0 4317 10602715 6 2041110 6080 11 6 2041110 6 2041110 6 0 3981 0 6183 12747816 7 1821996 7076 4 7 1821996 7 1821996 7 0 5011 0 8857 7443233 8 9572653 8073 5 8 9572653 8 9572653 8 0 6309 0 12686 1538068 9 1506048 9070 2 9 1506048 9 1506048 9 0 7943 0 18171 560169 10 1028548 10066 0 10 1028548 10 1028548 10 0 9999 0 26027 1081352 11 630210 11063 0 11 630210 11 630210 11 0 12589 0 37279 56 13 2996375 12060 4 12 2636106 12 2636106 12 0 15848 0 53395 16 17 999776 13056 1 13 360269 13 360269 13 0 19952 0 76480 0 22 437824 14053 0 14 203801 16 908323 14 0 25118 0 109544 0 28 290066 15050 1 16 704523 20 467773 15 0 31622 0 156903 0 35 1115181 16046 0 17 91453 25 238188 16 0 39810 0 224736 0 45 0 17043 0 18 60406 31 168761 17 0 50118 0 321895 0 57 0 18040 0 19 45100 38 39675 18 0 63095 0 461058 0 72 0 19036 0 20 270814 48 229133 19 0 79432 0 660385 0 92 0 20033 1 21 34224 59 234547 20 0 99999 0 945886 0 117 0 21030 0 22 27280 73 306394 21 0 125892 0 1354815 0 148 0 22026 0 23 23247 91 220944 22 0 158489 0 1940535 0 188 0 23023 2 24 132301 113 28756 23 0 199526 0 2779476 0 239 0 24020 1 25 21136 140 347 24 0 251188 0 3981111 0 303 0 25016 0 26 18417 174 4 25 0 316227 0 5702243 0 385 0 26013 0 27 17121 216 2 26 0 398107 0 8167461 0 489 0 27010 0 28 77844 268 0 27 0 501187 0 11698454 0 621 0 28006 0 29 17250 332 0 28 0 630957 0 16755981 0 788 0 29003 0 30 17516 412 0 29 0 794328 0 24000000 0 1000 0 30000 0 32 1080415 512 0 30 0 1000000 0 > 0 > 0 > 0 > 0 > 0 > 0 > 0 ---------------------------------------------------------------------------------------------------------------------------------------------------------- cycles/it packets/it cycles/pkt pkts/batch vhost qlen upcalls/it cycles/upcall 1093 1.09750 761 6.63722 1.20933 0.00000 0
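Since the host has no counter for these drops, confirmation has to come from inside the guest. A minimal check, assuming a kernel virtio-net interface named eth0 in the VM (a DPDK application in the guest would expose its own statistics instead):

# Inside the VM: overall drop/error counters of the virtio interface
ip -s link show dev eth0
# Per-queue detail, where the virtio driver exposes it
ethtool -S eth0 | grep -i -E "drop|err"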
Here is the history of the past 10 milliseconds:
ovs-appctl dpif-netdev/pmd-perf-show -nh -ms 10 -pmd 8
Time: 10:59:10.376
Measurement duration: 125.730 s

pmd thread numa_id 0 core_id 8:

  Iterations:         106715062  (1.16 us/it)
  - Used TSC cycles: 309408267307  ( 98.4 % of total cycles)
  - idle iterations:   88882527  (  8.7 % of used cycles)
  - busy iterations:   17832535  ( 91.3 % of used cycles)
  Rx packets:         308696611  (2455 Kpps, 915 cycles/pkt)
  Datapath passes:    308696611  (1.00 passes/pkt)
  - EMC hits:           4013658  (  1.3 %)
  - SMC hits:         290122445  ( 94.0 %)
  - Megaflow hits:     14560489  (  4.7 %, 1.00 subtbl lookups/hit)
  - Upcalls:                  0  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:             0  (  0.0 %)
  Tx packets:         308696592  (2455 Kpps)
  Tx batches:          17832535  (17.31 pkts/batch)

   ms        iterations  cycles/it  Kpps  cycles/pkt  pkts/batch  vhost qlen  upcalls  cycles/upcall
   8582460          713       3431  2496         910          17           4        0              0
   8582459         1009       2447  2416         913          17           3        0              0
   8582458          711       3464  2500         916          17           4        0              0
   8582457          979       2517  2428         910          18           2        0              0
   8582456          859       2869  2468         911          16           3        0              0
   8582455          934       2631  2442         907          17           2        0              0
   8582454          817       3005  2453         919          14           3        0              0
   8582453         1022       2404  2441         897          17           3        0              0
   8582452          961       2556  2460         899          15           3        0              0
   8582451          890       2751  2472         899          16           3        0              0
And also the past 10 iterations:
ovs-appctl dpif-netdev/pmd-perf-show -nh -it 10 -pmd 8
Time: 10:59:12.890
Measurement duration: 128.244 s

pmd thread numa_id 0 core_id 8:

  Iterations:         108763877  (1.16 us/it)
  - Used TSC cycles: 315598998711  ( 98.4 % of total cycles)
  - idle iterations:   90590023  (  8.7 % of used cycles)
  - busy iterations:   18173854  ( 91.3 % of used cycles)
  Rx packets:         314870545  (2455 Kpps, 915 cycles/pkt)
  Datapath passes:    314870545  (1.00 passes/pkt)
  - EMC hits:           4093873  (  1.3 %)
  - SMC hits:         295925740  ( 94.0 %)
  - Megaflow hits:     14850900  (  4.7 %, 1.00 subtbl lookups/hit)
  - Upcalls:                  0  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:             0  (  0.0 %)
  Tx packets:         314870513  (2455 Kpps)
  Tx batches:          18173854  (17.33 pkts/batch)

   iter         cycles  packets  cycles/pkt  pkts/batch  vhost qlen  upcalls  cycles/upcall
   41217710251   28588       32         893          32          60        0              0
   41217710250   27551       32         860          32          65        0              0
   41217710249   26861       32         839          32          67        0              0
   41217710249   26861       32         839          32          67        0              0
   41217710248   28118       32         878          32          76        0              0
   41217710247   28515       32         891          32          78        0              0
   41217710246   28267       32         883          32          80        0              0
   41217710245   27079       32         846          32          87        0              0
   41217710245   27079       32         846          32          87        0              0
   41217710244   30394       32         949          32          80        0              0
Now let's do some analysis of the following traffic profile:
- 8 Mpps (the PMDs that follow serve only a single DPDK physical port and a single VHU, and there are two more PMDs not reported here, so the total N/S traffic is actually 8 Mpps).
- 64B packet size.
- 51K flows (as pairs of different source and destination IP addresses).
- UDP packets to simulate GTP (Ethernet, VLAN, IPv4, UDP, random payload).
The idea of this test is to generate somewhat realistic traffic and overload Open vSwitch a bit beyond its zero-loss capacity in order to produce some loss: roughly 20 packets dropped per million (a 0.002% drop rate), which is not a completely unreasonable situation. It is actually very common in the industry: a specific VNF was tested with a given traffic profile, but down the road the packet rate increased beyond what was originally expected (especially in times where people are more and more dependent on the Internet).
The traffic has been generated using OPNFV NFVBench 4.2.2 and routed back by FD.IO VPP 20.05.
ovs-appctl dpif-netdev/pmd-perf-show -nh -pmd 2 Time: 11:25:13.273 Measurement duration: 547.862 s pmd thread numa_id 0 core_id 2: Iterations: 49240680 (11.03 us/it) - Used TSC cycles: 1357982526312 ( 99.1 % of total cycles) - idle iterations: 4595286 ( 0.1 % of used cycles) - busy iterations: 44645394 ( 99.9 % of used cycles) Rx packets: 2191447001 (4000 Kpps, 619 cycles/pkt) Datapath passes: 2191447280 (1.00 passes/pkt) - EMC hits: 693804346 ( 31.7 %) - SMC hits: 1497614519 ( 68.3 %) - Megaflow hits: 28345 ( 0.0 %, 1.06 subtbl lookups/hit) - Upcalls: 38 ( 0.0 %, 19.6 us/upcall) - Lost upcalls: 0 ( 0.0 %) Tx packets: 2191421051 (4000 Kpps) Tx batches: 87439714 (25.06 pkts/batch) ovs-appctl dpif-netdev/pmd-perf-show -ms 2 -it 2 -pmd 8 Time: 11:25:13.278 Measurement duration: 547.867 s pmd thread numa_id 0 core_id 8: Iterations: 294461622 (1.83 us/it) - Used TSC cycles: 1348161965122 ( 98.4 % of total cycles) - idle iterations: 225396737 ( 5.0 % of used cycles) - busy iterations: 69064885 ( 95.0 % of used cycles) Rx packets: 2191437619 (4000 Kpps, 584 cycles/pkt) Datapath passes: 2191437619 (1.00 passes/pkt) - EMC hits: 687977303 ( 31.4 %) - SMC hits: 1503460316 ( 68.6 %) - Megaflow hits: 0 ( 0.0 %, 0.00 subtbl lookups/hit) - Upcalls: 0 ( 0.0 %, 0.0 us/upcall) - Lost upcalls: 0 ( 0.0 %) Tx packets: 2191437619 (4000 Kpps) Tx batches: 69064885 (31.73 pkts/batch) Histograms cycles/it packets/it cycles/pkt pkts/batch max vhost qlen upcalls/it cycles/upcall 499 224972808 0 225396737 100 0 0 0 0 225396737 0 294461624 999 0 716 191768 1 26927 1096 68968043 1 26927 1 26927 1 0 1258 0 1025 222053 2 32381 2093 70884 2 32381 2 32381 2 0 1584 0 1469 8787 3 38275 3090 25489 3 38275 3 38275 3 0 1995 0 2104 1875 4 13463 4086 471 4 13463 4 13463 4 0 2511 0 3014 39868 5 11784 5083 1 5 11784 5 11784 5 0 3162 0 4317 68769 6 11977 6080 0 6 11977 6 11977 6 0 3981 0 6183 44705 7 12200 7076 0 7 12200 7 12200 7 0 5011 0 8857 100338 8 18046 8073 0 8 18046 8 18046 8 0 6309 0 12686 268388 9 16991 9070 0 9 16991 9 16991 9 0 7943 0 18171 7600634 10 17344 10066 0 10 17344 10 17344 10 0 9999 0 26027 60941048 11 18199 11063 0 11 18199 11 18199 11 0 12589 0 37279 88 13 53839 12060 0 12 28223 12 28223 12 0 15848 0 53395 498 17 138721 13056 0 13 25616 13 25616 13 0 19952 0 76480 0 22 274966 14053 0 14 26656 16 98674 14 0 25118 0 109544 0 28 650540 15050 0 16 72018 20 193576 15 0 31622 0 156903 0 35 67729238 16046 0 17 40047 25 380445 16 0 39810 0 224736 0 45 0 17043 0 18 41946 31 1571912 17 0 50118 0 321895 0 57 0 18040 0 19 43953 38 539715 18 0 63095 0 461058 0 72 0 19036 0 20 67630 48 613924 19 0 79432 0 660385 0 92 0 20033 0 21 59682 59 1388530 20 0 99999 0 945886 0 117 0 21030 0 22 61755 73 7550979 21 0 125892 0 1354815 0 148 0 22026 0 23 64090 91 13169439 22 0 158489 0 1940535 0 188 0 23023 0 24 109018 113 19170248 23 0 199526 0 2779476 0 239 0 24020 0 25 85900 140 17599579 24 0 251188 0 3981111 0 303 0 25016 0 26 90086 174 6168106 25 0 316227 0 5702243 0 385 0 26013 0 27 94130 216 344121 26 0 398107 0 8167461 0 489 0 27010 0 28 207316 268 4188 27 0 501187 0 11698454 0 621 0 28006 0 29 401431 332 30 28 0 630957 0 16755981 0 788 0 29003 0 30 393401 412 0 29 0 794328 0 24000000 0 1000 0 30000 0 32 66934409 512 0 30 0 1000000 0 > 0 > 0 > 0 > 0 > 0 > 0 > 0 ---------------------------------------------------------------------------------------------------------------------------------------------------------- cycles/it packets/it cycles/pkt pkts/batch vhost qlen upcalls/it cycles/upcall 4578 7.44218 584 31.73013 9.03423 
0.00000 0 iter cycles packets cycles/pkt pkts/batch vhost qlen upcalls cycles/upcall 43061652739 18391 32 574 32 88 0 0 43061652739 18391 32 574 32 88 0 0 ms iterations cycles/it Kpps cycles/pkt pkts/batch vhost qlen upcalls cycles/upcall 10145363 432 5691 3976 595 31 27 0 0 10145362 246 9979 4024 601 31 57 0 0
Let's do some analysis:
- The first PMD thread (id 2), for the physical interface, is utilized at 99.9% as the "busy iterations" line reports; same story for the physical core, with 99.1% of TSC cycles used as reported by "Used TSC cycles."
- The second PMD thread (id 8), for the VHU interface, is utilized at 95% as the "busy iterations" line reports, while the used TSC cycles report 98.4% utilization. For this report, both the last two milliseconds (-ms 2) and the last two iterations (-it 2) are also displayed.
- The EMC hit rate is below 32%, while SMC gives a tangible boost with above a 68% hit rate. The Megaflow cache is effectively not utilized (which in turn is good for performance).
- Because the caches are so efficient, no upcalls are happening.
- Both PMD threads handle 4 Mpps each; the first one, for the physical interface, can push (TX towards the physical fabric) batches of 25 packets, while the second one, for the virtual interface, can push (TX towards the VM) batches of 32 packets (easier or faster in virtual than in physical, but the DPDK driver and NIC family can also affect this).
- On the histogram, the queue depth averages around 9.03, but many iterations actually sit much deeper, with a queue depth between 91 and 140 for a considerably high number of them.
- The histogram of the past 2 iterations reports a higher average queue depth.
- While the one of the past 2 milliseconds is much lower in comparison.
- When dealing with bursty traffic (for example 3G, where traditionally the Node B sends traffic irregularly, even every tenth of a second), per-millisecond and even per-iteration precision can tell a very different story than the arithmetic mean.
- Clearly the first PMD thread has no spare resources here. This, together with the drop stats (further down), can give a clear idea of the current status in no time.
1.3 What's an Upcall?
Well, this is something quite important in OVS, as it is one of the main causes of drops and (lack of) scalability. OVS has a main caching layer called EMC where the actions for packets are cached (the flows). If a flow for a specific packet is missing, the packet is sent to the ovs-vswitchd daemon in the slow path (figure 1) and then, typically, the datapath is populated with the related flow. This act of sending a packet to the slow path is called an upcall. Why is it important? Because the packet processing time increases dramatically. Here is where SMC comes into play, caching up to 1M flows (vs. the 8k of the EMC).
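To get a feel for how much upcall and flow-installation activity is going on, two read-only commands help (shown here as a quick sketch; both are standard ovs-appctl commands):

# Summary of the upcall handlers: flow limit, current flow count, dump duration.
ovs-appctl upcall/show

# Number of megaflows currently installed in the userspace datapath;
# a constantly churning count usually points at frequent upcalls.
ovs-appctl dpctl/dump-flows | wc -l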
1.4 What’s an Iteration?
To better understand what an iteration is in OVS-DPDK (and, more generally, in any DPDK application), see the official DPDK documentation.
1.5 Suspicious Iteration
New with this release is the ability of the userspace datapath to supervise and detect suspicious iterations in PMD performance. This is the icing on the cake. First off, to enable it, run this ovs-appctl command:
ovs-appctl dpif-netdev/pmd-perf-log-set on
In this case the performance impact is more significant, so the patch embeds a protection mechanism that disables the feature if it consumes too many resources.
Then analyze the ovs-vswitchd log file (at /var/log/openvswitch/ovs-vswitchd.log) and look for "Neighborhood of suspicious iteration". We will be able to see something like the following, which was generated by pushing 10 Mpps to my VM:
WARN|Neighborhood of suspicious iteration:
   iter         cycles  packets  cycles/pkt  pkts/batch  vhost qlen  upcalls  cycles/upcall
   22834435908   14554       32         454          32         995        0              0
   22834435907   14639       32         457          32        1024        0              0
   22834435906   14079       32         439          32        1024        0              0
   22834435905   14552       32         454          32        1021        0              0
   22834435904   14515       32         453          32        1024        0              0
   22834435903   15115       32         472          32        1024        0              0
   22834435902   14697       32         459          32        1020        0              0
   22834435901   14197       32         443          32        1018        0              0
   22834435900   14612       32         456          32        1020        0              0
   22834435899   14506       32         453          32        1018        0              0
   22834435898   14978       32         468          32        1019        0              0
Being able to log weird iterations is worth little without being able to define what suspicious means. These are the supported options:
Argument | Default value | Description
-b | 5 | Number of iterations before the suspicious iteration to be logged
-a | 5 | Number of iterations after the suspicious iteration to be logged
-e | false | Extend the logging interval if another suspicious iteration is detected before logging occurs
-ne | true | Do not extend the logging interval if another suspicious iteration is detected before logging occurs
-q | 128 | Suspicious vhost queue fill level threshold. Increase this to 512 if QEMU supports a virtio queue length of 1024
-us | 250 | Duration threshold (in microseconds) for a suspicious iteration
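For example, with a guest that advertises a 1024-entry virtio queue, something like the following raises the queue threshold and tightens the duration threshold (a sketch based on the options above; tune the values to your own environment):

ovs-appctl dpif-netdev/pmd-perf-log-set on -b 10 -a 10 -ne -q 512 -us 200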
Effectively, what this tool does is log any suspicious activity for you, without you sitting there analyzing the real-time statistics. Also, given the available timestamps, correlating events (a cron job ran, user activity, a significant clock variation, a packet burst, etc.) can be extremely useful.
1.6 Real-world
At OVScon 2017, Ericsson shared three real-world examples resolved with the PMD performance metrics and supervision; the last one is quite common. See slides 14 and 15 here.
- 20-30% packet drop in conjunction with frequent upcalls, due to the i40e driver wasting time checking the link state.
- Periodic packet drops in VMs due to VirtIO TX queue overrun, a result of the Linux CFS balancing threads across NUMA nodes.
- Periodic bursts of packets dropped in the VM due to VirtIO TX queue overrun, a result of QEMU emulator threads preempting the vCPUs where DPDK runs in the VM.
For more information, see the Ericsson deck shared at OVScon 2017, available here, up to slide 15; the rest covers another feature described further down.
2. OVS-DPDK Counters
While the previous patch focuses on exposing detailed statistics, the following patches expose counters (anybody familiar with network devices knows how fundamental rates and counters are).
2.1 Current Status
The dump-ports command has been available for some time:
ovs-ofctl dump-ports br-int
OFPST_PORT reply (xid=0x2): 6 ports
  port "patch-tun": rx pkts=67, bytes=39152, drop=?, errs=?, frame=?, over=?, crc=?
           tx pkts=3216, bytes=213001, drop=?, errs=?, coll=?
  port "vhu143d1144-93": rx pkts=83, bytes=9669, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=67, bytes=39152, drop=0, errs=?, coll=?
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=3144, errs=0, coll=0
  port "vhufafd17b2-ed": rx pkts=4598724880, bytes=257533869688, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=4648833689, bytes=260339038168, drop=0, errs=?, coll=?
  port "int-br-link0": rx pkts=9296687968, bytes=557811267416, drop=?, errs=?, frame=?, over=?, crc=?
           tx pkts=9126896579, bytes=511115758012, drop=?, errs=?, coll=?
  port "vhu85ef82f8-1e": rx pkts=4529490571, bytes=253655745272, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=4649180271, bytes=260359422240, drop=0, errs=?, coll=?
All clear, right?
There is a simple shell script to better expose those values. It yields better-organized data, not more.
take_net_stats_dpdk.sh --duration 60
########## Open vSwitch DPDK STATS over 60 seconds ##########
Port            TX         RX         TX Drops  RX Drops  TX Errors  RX Errors  TX PPS      RX PPS      TX & RX PPS  TX Drops PPS  RX Drops PPS
br-int          0          0          120       0         0          0          0           0           0            2.00          0
br-link0        62         2          0         0         0          0          1.03        0.03        1.06         0             0
br-tun          0          0          0         0         0          0          0           0           0            0             0
dpdk0           240016980  240020001  0         5226      0          0          4000283.00  4000333.35  8000616.35   0             87.10
dpdk1           240015499  240020134  0         5126      0          0          4000258.31  4000335.56  8000593.88   0             85.43
ovs-netdev      0          0          0         0         0          0          0           0           0            0             0
vhu1c8db502-c9  0          0          0         0         0          0          0           0           0            0             0
vhu49d8cd2f-24  240017011  240015522  0         0         0          0          4000283.51  4000258.70  8000542.21   0             0
vhub80d9af0-79  240016911  240016896  0         0         0          0          4000281.85  4000281.60  8000563.45   0             0
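For reference, a stripped-down version of such a wrapper can be put together with nothing more than ovs-vsctl and awk. This is a hypothetical sketch (not the take_net_stats_dpdk.sh used above): it samples the interface statistics twice and prints per-second deltas.

#!/bin/bash
# Sketch: per-port packet and drop rates over a time window, from two
# snapshots of the Interface statistics column.
BRIDGE=${1:-br-int}
DURATION=${2:-60}

snapshot() {
    for port in $(ovs-vsctl list-ports "$BRIDGE"); do
        rx=$(ovs-vsctl --if-exists get Interface "$port" statistics:rx_packets)
        tx=$(ovs-vsctl --if-exists get Interface "$port" statistics:tx_packets)
        rxd=$(ovs-vsctl --if-exists get Interface "$port" statistics:rx_dropped)
        txd=$(ovs-vsctl --if-exists get Interface "$port" statistics:tx_dropped)
        echo "$port ${rx:-0} ${tx:-0} ${rxd:-0} ${txd:-0}"
    done | sort
}

before=$(snapshot)
sleep "$DURATION"
after=$(snapshot)

echo "Port RX_PPS TX_PPS RX_Drops_PPS TX_Drops_PPS"
join <(echo "$before") <(echo "$after") | \
    awk -v d="$DURATION" '{ printf "%s %.2f %.2f %.2f %.2f\n",
        $1, ($6-$2)/d, ($7-$3)/d, ($8-$4)/d, ($9-$5)/d }'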
2.2 vHost TX Retries
The problem statement: when sending a batch of packets to a vhost-user or vhost-user-client interface, it may happen that some but not all of the packets in the batch are able to be sent to the guest. This is often because there aren't enough free descriptors in the virtqueue for all the packets in the batch to be sent.
In this case there will be a retry, with a default maximum of 8. If at any point no packets at all can be sent, it may mean the guest is not accepting packets, so there are no (more) retries. This in turn means that the buffers in the guest are full, or that the guest's packet processing has reached 100% and no more packets can be received. In the worst case, the guest may simply be misconfigured and hence unable to accept more packets.
To see the TX Retries counter:
ovs-vsctl get Interface vhu85ef82f8-1e statistics:ovs_tx_retries
0
No matter how hard I pushed the system, I couldn't generate any retries (in my scenario the guest is a DPDK application; with a kernel-based guest the result might be different).
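Even so, the counter is cheap to keep an eye on. A hypothetical polling loop over all vhost-user-client interfaces (the interface type and the one-second interval are just assumptions for the sketch):

# Print the retry counter for every vhost-user-client port once per second,
# so that a sudden increase stands out in the output.
while sleep 1; do
    for vhu in $(ovs-vsctl --bare --columns=name find Interface type=dpdkvhostuserclient); do
        echo "$(date +%T) $vhu $(ovs-vsctl get Interface "$vhu" statistics:ovs_tx_retries)"
    done
done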
2.3 vHost TX Contention
The problem statement: multiple PMD threads may need to write to the same vHost queue. Writing to a VHU queue is an atomic action and only one PMD thread can write at a time, so all the others effectively waste CPU cycles waiting for it; this is implemented as a spinlock. When it happens, packets may be dropped.
To see the TX contention counter (this system was pushed well beyond its limit):
ovs-appctl coverage/show | grep -E "Event|vhost_tx_contention" Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=a213c5ea: vhost_tx_contention 0.0/sec 0.000/sec 15538.8247/sec total: 55939769
With OVS 2.13, there is an ad-hoc command to interrogate a specific counter.
ovs-appctl coverage/read-counter vhost_tx_contention
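Two samples of that counter are enough to approximate a rate, for example (a sketch; the 10-second window is arbitrary):

# Sample the counter twice and print the average contention events per second.
a=$(ovs-appctl coverage/read-counter vhost_tx_contention)
sleep 10
b=$(ovs-appctl coverage/read-counter vhost_tx_contention)
echo "vhost_tx_contention: $(( (b - a) / 10 )) events/sec"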
2.4 vHost TX IRQs
The problem statement: when OVS-DPDK writes to a vHost queue that is handled by the guest's kernel, it needs to send an IRQ to notify the guest that some packets are available. The DPDK library call responsible for this action is eventfd_write().
A VNF handles its traffic on DPDK-accelerated interfaces, while the management traffic stays, most of the time, in kernel space. From a NEP's perspective, leveraging the kernel TCP/IP stack in the VM is actually a big plus; let's be clear, managing SSH and other remote configuration protocols does not justify the effort of writing an entire TCP/IP stack using DPDK.
The downside is that when a VNF vendor uses this management interface for many more activities (remote access, remote configuration, monitoring, config synchronization, time sync, clustering, remote health checks, data replication, etc.), OVS-DPDK (but really any DPDK-accelerated vSwitch/vRouter) will spend a considerable amount of time sending those IRQ notifications, which in turn slows down the PMD thread(s), affects other interfaces and can result in packet drops. This counter aims at tracking down exactly this issue.
To see the TX IRQs counter (in this case just an initial SSH session, NTP running in the background and that's about it):
ovs-appctl coverage/show | grep -E "Event|vhost_notification" Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=a213c5ea: vhost_notification 0.0/sec 0.067/sec 0.0225/sec total: 81
But what happens when running a heavy application over the kernel?
ovs-appctl coverage/show | grep -E "Event|vhost_notification" Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=9a784f6c: vhost_notification 8734.2/sec 6511.350/sec 479.0725/sec total: 1746806
2.5 vHost Detailed Packet Drop Statistics
This patch adds four new counters for the vHost interfaces:
Category | Counter | Description
Transmit | ovs_tx_failure_drops | Packet dropped because it could not be transmitted to the VHU, e.g. the guest VHU TX queue is full, the guest packet processing is over its maximum capacity, etc.
MTU | ovs_tx_mtu_exceeded_drops | Packet dropped because its length is bigger than the MTU configured for the given guest interface
QoS policy | ovs_tx_qos_drops | Packet dropped by the egress QoS policy of the given interface, e.g. transmitting (to the VHU) at a higher rate than the configured traffic shaping one (sending 100 Mbps while shaping at 80 Mbps)
QoS policy | ovs_rx_qos_drops | Packet dropped by the ingress QoS policy of the given interface, e.g. receiving (into OVS) at a higher rate than the configured traffic shaping one (sending 50 Mbps while shaping at 40 Mbps)
All the counters are available underneath the VHU interface (again, it's not that simple to make it drop packets here):
ovs-vsctl get Interface vhufafd17b2-ed statistics:ovs_tx_failure_drops
0
ovs-vsctl get Interface vhufafd17b2-ed statistics:ovs_tx_mtu_exceeded_drops
0
ovs-vsctl get Interface vhufafd17b2-ed statistics:ovs_tx_qos_drops
0
ovs-vsctl get Interface vhufafd17b2-ed statistics:ovs_rx_qos_drops
0
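A quick way to check all four of them on every vhost-user-client interface at once (a hypothetical one-off helper, same assumptions as the earlier polling sketch):

# Dump the four vhost drop counters for every vhost-user-client port.
for vhu in $(ovs-vsctl --bare --columns=name find Interface type=dpdkvhostuserclient); do
    for c in ovs_tx_failure_drops ovs_tx_mtu_exceeded_drops ovs_tx_qos_drops ovs_rx_qos_drops; do
        echo "$vhu $c = $(ovs-vsctl get Interface "$vhu" statistics:$c)"
    done
done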
2.6 Userspace Improved Packet Drop Statistics
Now we arrive at the big patch: this one adds 24 new counters.
Here is an approximate description:
Category | Counter | Description
Upcall drops | datapath_drop_upcall_error | Packet dropped due to an error in the upcall process
Upcall drops | datapath_drop_lock_error | Packet dropped due to upcall lock contention
RX drops | datapath_drop_rx_invalid_packet | Invalid packet dropped because its size is smaller than what is written in the Ethernet header
Datapath drops | datapath_drop_meter | Packet dropped by the OpenFlow (1.3+) meter table
Datapath drops | datapath_drop_userspace_action_error | Packet dropped due to a generic error while executing the action
Datapath drops | datapath_drop_tunnel_push_error | Packet dropped due to an error while executing the tunnel push (aka encapsulation) action
Datapath drops | datapath_drop_tunnel_pop_error | Packet dropped due to an error while executing the tunnel pop (aka decapsulation) action
Datapath drops | datapath_drop_recirc_error | Packet dropped due to an error in the recirculation (this can also happen in the tunnel pop action)
Datapath drops | datapath_drop_invalid_port | Packet dropped due to an invalid port
Datapath drops | datapath_drop_invalid_tnl_port | Packet dropped due to an invalid tunnel port while executing the pop action
Datapath drops | datapath_drop_sample_error | Packet dropped due to a sampling error
Datapath drops | datapath_drop_nsh_decap_error | Packet dropped due to an invalid NSH pop (aka decapsulation)
OpenFlow (action=drop) | drop_action_of_pipeline | Packet dropped due to pipeline errors, e.g. an error parsing the datapath actions
OpenFlow (action=drop) | drop_action_bridge_not_found | Packet dropped because the bridge, which existed at translation time, can no longer be found
OpenFlow (action=drop) | drop_action_recursion_too_deep | Packet dropped due to too many translations; this is a system limit to protect the translation from running too long or occupying too much space
OpenFlow (action=drop) | drop_action_too_many_resubmit | Packet dropped due to too many resubmits; this is a system limit to protect the translation from running too long or occupying too much space
OpenFlow (action=drop) | drop_action_stack_too_deep | Packet dropped because the stack consumes more than 64 kB; this is a system limit to protect the translation from running too long or occupying too much space
OpenFlow (action=drop) | drop_action_no_recirculation_context | Packet dropped due to a missing recirculation context (also look in the log, where the context ID is reported)
OpenFlow (action=drop) | drop_action_recirculation_conflict | Packet dropped due to a conflict in the recirculation
OpenFlow (action=drop) | drop_action_too_many_mpls_labels | Packet dropped because the MPLS pop action cannot be performed, as the packet has more labels than supported (up to 3 in OVS 2.13)
OpenFlow (action=drop) | drop_action_invalid_tunnel_metadata | Packet dropped due to invalid GENEVE tunnel metadata
OpenFlow (action=drop) | drop_action_unsupported_packet_type | Packet dropped because the packet type is not supported by the specific rule
OpenFlow (action=drop) | drop_action_congestion | Packet dropped due to an ECN (Explicit Congestion Notification) mismatch
OpenFlow (action=drop) | drop_action_forwarding_disabled | Packet dropped because forwarding on the port is disabled (e.g. it is administratively down)
The counters are displayed through the usual coverage utility. Unless triggered, they stay hidden, and the only way to reset them is to restart Open vSwitch.
Here is an example when pushing 10Mpps with 1M flows:
ovs-appctl coverage/show|grep -E "Event|drop_action|datapath" Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=55d53f83: datapath_drop_upcall_error 0.0/sec 0.000/sec 0.0108/sec total: 39 datapath_drop_lock_error 0.0/sec 0.000/sec 0.0003/sec total: 1 drop_action_of_pipeline 100.6/sec 100.967/sec 95.1669/sec total: 342601
Still in this example, at the bottom of coverage/show there is a hint that most of the counters are still at zero: "72 events never hit."
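To avoid eyeballing the full output, a small filter can surface only the drop-related counters that have actually fired (a sketch; counters that never triggered stay hidden and will not appear at all):

# Print only the drop-related coverage counters with a non-zero total.
ovs-appctl coverage/show | awk '/datapath_drop|drop_action/ && $NF > 0 { print $1, "total:", $NF }'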
3. Features Availability
Since the Fast DataPath delivery model is an incremental one, all the features mentioned above are available starting from FDP 20.B, in other words using OVS 2.11.0-48 or above, while upstream everything is available with the 2.13 release. Breaking down the availability feature by feature:
Feature | Upstream Git Commit | Upstream OVS | FDP Version | Downstream OVS
OVS PMD Performance Metrics | | 2.10 | FDP 18.09 | 2.10.0-0
vHost TX Retries | Many patches but the one exposing it is Here | 2.12 | FDP 20.A | 2.11.0-23
vHost TX Contention | | 2.13 | FDP 20.A | 2.11.0-28
vHost TX IRQs | | 2.13 | FDP 20.B | 2.11.0-37
vHost Detailed Packet Drop Statistics | | 2.13 | FDP 20.B | 2.11.0-41
Userspace Improved Packet Drop Statistics | | 2.13 | FDP 20.B | 2.11.0-42
See here for the complete Fast DataPath matrix.
4. Wrap up
Telco customers on the LTS release, OSP13, are already enjoying all of this with z-stream 12, shipped in June together with FDP 20.D, while for those on OSP16.0 (and the upcoming 16.1), FDP 20.B has been available for a few months and they just got the 20.D upgrade as well.
Before, troubleshooting OVS-DPDK was either guesswork or plenty of perf and GDB sessions (massively raising the complexity and the time required for the RCA); now we can safely say it's much easier, with more hints and tools pointing at where the problem is.
I hope some improvements will be made in the way counters are displayed and, especially, that new features will not get merged without their respective counters. But for the time being, we should be happy with what we've got here.
About the author
Federico Iezzi is an open-source evangelist who has witnessed the Telco NFV transformation. Over his career, Iezzi has achieved a number of international firsts in the public cloud space and has about a decade of experience with OpenStack. He has been following the Telco NFV transformation since 2014. At Red Hat, Federico is a member of the EMEA Telco practice as a Principal Architect.