MDS vulnerabilities explained in ~three minutes

For a more detailed technical view of MDS, please see this deeper dive with Jon Masters.

Over the past year, we have all heard about various hardware-level security vulnerabilities affecting the microprocessors that power our modern infrastructure. These issues, with names like “Meltdown”, “Spectre”, “Speculative Store Buffer Bypass”, and “Foreshadow”, collectively known as speculative execution side-channel vulnerabilities, stem from a performance optimization in which microprocessors guess ahead about future program behavior in order to reduce the time spent waiting for data to be loaded from slower external memories. Today, yet another set of vulnerabilities was disclosed, known as Microarchitectural Data Sampling (MDS). These are similar to those we have seen before, but they involve different parts of the processor.

Let’s dig into MDS: how it works, and how we can mitigate its impact upon our customers and their sensitive data. As with the other issues we have seen over the past year, we can’t fix hardware flaws with software, but we can mitigate their impact, in this case by trading some performance for improved security.

The state of the processor

Modern processors are designed to perform calculations upon data under the control of stored programs. Both the data and the program are contained within relatively large memory chips that are attached to the external interfaces of the processor. If you’ve ever looked at a modern server motherboard, this memory takes the form of parallel rows of small circuit boards (known as DIMMs, or Dual Inline Memory Modules) that run along each side of the central processor chip. The processor itself has small amounts of internal memory, known as caches, that are loaded with copies of data from memory in order to perform calculations. Data must usually be “pulled” into the caches (an automatic process) in order for the “execution units” to perform calculations.

These caches form a hierarchy in which the innermost level runs at the core speed of the processor’s execution units, while the outer levels are progressively slower (yet still much faster than the external memory chips). We often say in computer engineering that you can have small and fast or big and slow memories, but the laws of physics don’t let you have both at the same time. As a result, these innermost caches are tiny, too small in fact for even a single whole cat image (e.g. 32 kilobytes), but they are augmented by multiple additional larger levels of caches, up to many megabytes in size. Compared to the external memories (which may be many gigabytes), even the larger caches are small. As a result, data is frequently pulled into the caches, and then evicted and written back to memory in order for other data to be loaded.
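As a rough illustration, here is a minimal C sketch, assuming Linux and glibc’s _SC_LEVEL* sysconf extensions, that asks the C library how big the caches on the current machine are; on some systems these queries simply return 0.

```c
/* A minimal sketch (Linux/glibc assumed): query the cache hierarchy sizes
 * that the C library reports for the current processor. The _SC_LEVEL*
 * constants are glibc extensions and may report 0 on some systems. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("L1 data cache:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("L1D line size:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}
```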

Caches keep copies of data from external memory, so every entry in the cache has both an address and an associated data value. Caches are further split at the innermost level into those that keep copies of data, and those that keep copies of program instructions. When a program adds two numbers together, the values are typically first loaded into the innermost processor cache (the “L1” data cache, or L1D$), then the calculation is performed, and finally the result is also stored in the L1 data cache. Eventually, the result will also be written back to the correct corresponding external memory location, but this can happen at a more leisurely pace. In most modern systems, results will remain in the caches until the space needs to be freed up, because such data might be used again and keeping it in the caches reduces the time to access it.

Processor decisions: Now vs. then

In the earliest days of computing, processors ran at similar speeds to the memories to which they were attached, and which contained the data upon which they would operate. Over time, the relative difference in performance became more acute. Today, there may be whole orders of magnitude of difference between the performance of a processor and that of its external memories. Thus it is critical to ensure that data is available to the processor right when it needs it. Sophisticated hardware such as “prefetchers” helps to pull data that may be needed into the caches ahead of time, but there are still many situations in which the processor needs data that is not cached.

A common example is that of a branch condition. The processor may hit a piece of code that needs to make a decision - is it raining today? The answer may be stored in a piece of data contained in the external memory. Rather than waiting for the data to be loaded into the cache, the processor can enter a special mode known as speculation in which it will guess ahead. Perhaps the last few times this piece of code ran, the result was that it was not raining. The processor may then guess the same is still true. It will thus begin to “speculatively” execute ahead in the program based upon this potential condition, all while the data needed is being loaded from memory (“resolved”). If the guess is right, significant time can be saved.
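To make that concrete, here is a toy C sketch of such a branch. The is_raining variable is purely illustrative, and whether the processor actually speculates here depends on the microarchitecture and on whether the value is already cached.

```c
/* A toy sketch of the "is it raining?" branch: the condition lives in
 * memory, so if is_raining is not already in the cache the processor may
 * predict the branch outcome and speculatively run one path while the
 * load is still in flight. (The variable name is purely illustrative.) */
#include <stdio.h>

int is_raining;  /* loaded from "slow" external memory when first needed */

void plan_the_day(void)
{
    if (is_raining)          /* branch condition depends on a memory load */
        printf("Take an umbrella\n");
    else
        printf("Leave the umbrella at home\n");   /* likely predicted path */
}

int main(void)
{
    plan_the_day();
    return 0;
}
```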

Conversely, if the guess is wrong, the processor must discard any transient operations that it has performed as a result of misspeculation. It will then recompute the correct result. All of this happens under the hood, in the processor implementation, the so-called “microarchitecture”. This differs from the programmer-visible model of the world known as the “architecture”. At an architectural level, the program tests a condition (is it raining?) and only ever executes the correct path in the program as a result. The fact that the processor may speculate underneath is supposed to be invisible to the programmer, and for decades this was assumed to be the case.

The lessons of Meltdown and Spectre

At the start of 2018, Spectre and Meltdown taught us that these assumptions were no longer valid. It was discovered that a determined attacker could leverage “side channels” to exfiltrate information from the microarchitectural state and make it visible at an architectural level. This is typically done by exploiting the fundamental property of caches: they make access to recently used data faster. Thus, it is possible to write code that will perform a timing analysis on memory access, inferring whether a location is present in the caches or not. By performing a second (speculative) memory access that depends upon the value of sensitive data contained in a first memory location, an attacker can use cache side-channel timing analysis to reconstruct the value of the sensitive data.
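Here is a minimal sketch of that timing primitive, assuming an x86 processor and the gcc/clang intrinsics _mm_clflush and __rdtscp. The absolute cycle counts vary by CPU; the gap between a cache miss and a cache hit is what a side-channel attack measures.

```c
/* A minimal sketch of cache timing on x86 (gcc/clang intrinsics assumed):
 * flush a variable out of the caches, time a load of it, then time a
 * second load that now hits in the cache. Absolute numbers vary by CPU;
 * what matters is the miss/hit gap. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

static volatile int target;

static uint64_t time_access(volatile int *p)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;                       /* the access being timed */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

int main(void)
{
    _mm_clflush((const void *)&target);   /* evict from all cache levels */
    _mm_mfence();
    uint64_t miss = time_access(&target); /* slow: fetched from memory */
    uint64_t hit  = time_access(&target); /* fast: now in the L1 cache */
    printf("cache miss: ~%llu cycles, cache hit: ~%llu cycles\n",
           (unsigned long long)miss, (unsigned long long)hit);
    return 0;
}
```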

Meltdown exploited such a sequence of loads, in combination with an additional optimization common to many different processors allowing a data load from privileged memory to proceed in parallel with the corresponding permission check to see if the access was permitted. To save time, the processor may speculate that the access is permitted, allowing a second “dependent” load to be performed based upon the content of the first. When the access check eventually completes, the processor kills the speculative state and generates the appropriate access violation error, but by then the second load has altered the cache in a way that can be later measured by the attacker in order to reconstruct the content of the privileged memory.
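The following simplified sketch shows that “dependent load” pattern, but using an ordinary, legal secret byte rather than privileged memory, so it runs to completion instead of faulting. On a quiet machine a single pass usually recovers the value; real attacks average over many runs and contend with noise from prefetchers and other activity.

```c
/* A simplified sketch of the dependent-load pattern behind Meltdown,
 * using an ordinary (legal) secret byte. In a real attack the first load
 * reads privileged memory and faults, but the dependent access below has
 * already pulled one probe line into the cache, and a timing scan then
 * recovers which one. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define STRIDE 4096                   /* one probe slot per possible byte */
static uint8_t probe[256 * STRIDE];

int main(void)
{
    uint8_t secret = 42;              /* stand-in for privileged data */

    /* Flush the probe array so no slot starts out cached. */
    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe[i * STRIDE]);
    _mm_mfence();

    /* Dependent access: which probe line gets cached encodes the secret. */
    volatile uint8_t sink = probe[secret * STRIDE];
    (void)sink;

    /* Recover the value by timing each slot and picking the fastest. */
    int best = -1;
    uint64_t best_time = UINT64_MAX;
    for (int i = 0; i < 256; i++) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*(volatile uint8_t *)&probe[i * STRIDE];
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best_time) {
            best_time = t1 - t0;
            best = i;
        }
    }
    printf("recovered byte: %d\n", best);
    return 0;
}
```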

Meltdown exists in part because of the complexity of the infrastructure used to handle loads (reads) of memory, and stores (writes) to memory. Modern application programs don’t operate directly upon the address of data in physical memory. Instead, they use “virtual memory”, an abstraction in which each program sees nicely linear and uniform memory addresses that are translated by the processor’s Memory Management Unit (MMU) into physical addresses. This happens using a set of operating system managed structures known as Page Tables, and navigating (“walking”) through these takes some time. To speed things up, the processor has a series of Translation Lookaside Buffers (TLBs) that store a small number of recent translations.

We mitigated Meltdown by observing that successful exploitation required either that the secret data be present in the innermost data cache, or that a valid virtual memory mapping exist for the privileged data. For performance reasons, Linux and other operating systems used to have a mapping installed for all operating system memory in the address space of each and every running application. KPTI (Kernel Page Table Isolation) removes this mapping by splitting the set of translations such that there won’t be a useful translation present during a Meltdown attack. Thus we trade some performance loss (having to switch these page tables every time a program calls upon the OS in a system call) for improved security. Newer hardware removes the need for KPTI.

The differences of MDS

MDS has some similarities with the previous vulnerabilities, as well as some important differences. MDS is in fact a family of vulnerabilities in different (related) components of the processor. Unlike Meltdown, MDS doesn’t allow an attacker to directly control the target memory address from which they would like to leak data. Instead, MDS is a form of “sampling” attack in which an attacker is able to leverage cache side-channel analysis in order to repeatedly measure the stale content of certain small internal processor buffers that are used to store data as it is being loaded into the caches or written back to memory. Through a sophisticated statistical analysis, it is then possible to reconstruct the original data.

Each of the variants of MDS relies upon an optimization in the load path of Intel processors. When a load is performed, the processor performs address generation and dispatches an internal operation (known as a uop, or micro-op) to control the execution unit performing the load. Under certain conditions, the load may fault (for example, due to attempting to access memory marked as not having a valid translation), or it might be complex enough that it requires an “assist”. This is a special process in which the processor begins to execute a simple built-in microcoded program to help handle a load that can’t be implemented easily in pure hardware. In either case, the design of the processor is such that it may speculate beyond the pending fault or assist.

During the window of resulting speculation, prior to handling the fault or assist, the processor may speculatively forward stale values from internal buffers to subsequent operations. The designers knew that the processor would shortly be handling the fault or assist, and would throw out any speculated activity as a result, so this was perhaps not seen as a problem, especially in light of the previous assumptions around speculation being a “black box”. Unfortunately, this allows a modified Meltdown-like attack to extract the stale buffer state.

The MDS store buffer variant aka "Fallout"

The store buffer variant of MDS (MSBDS) targets a small structure within the processor that contains copies of recent stores made by programs to memory. Any time an application writes to a data variable, that write goes through the store buffer, and is ultimately committed to the cache on its way to external memory. The store buffer is central to a few critical performance features of modern computers. Firstly, it enables speculation (and out-of-order execution) by allowing stores to memory to be speculative, existing only in the store buffer until it is confirmed that they should really impact the architectural state of the processor. Secondly, the store buffer enables store-to-load forwarding, in which later (younger) loads in a program re-use data from older stores to the same address.
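The pattern that store-to-load forwarding accelerates looks like the following minimal C sketch. The forwarding itself happens inside the core and isn’t architecturally visible (you would need performance counters to observe it), so this only illustrates the shape of the code involved.

```c
/* A minimal sketch of the pattern store-to-load forwarding accelerates:
 * a store immediately followed by a younger load of the same address.
 * The forwarding happens inside the core; the program just sees the
 * correct value. */
#include <stdio.h>

int main(void)
{
    int balance = 0;
    volatile int *p = &balance;

    *p = 100;            /* store: enters the store buffer first          */
    int copy = *p;       /* younger load: typically serviced straight     */
                         /* from the store buffer, not from the cache     */

    printf("copy = %d\n", copy);
    return 0;
}
```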

The “store buffer” is, in fact, not even a single structure within the processor. Instead, it is distributed as part of a complex set of structures that together comprise the more abstract MOB (Memory Ordering Buffer). The MOB handles preserving x86 memory ordering and consistency semantics, sequencing the order of all loads and stores to ensure they become visible to other processors in the correct order. An individual store operation is further decomposed into two sub-operations, known as STA (STore Address) and STD (STore Data). When a store is performed by an application, the processor generates STA and STD uops that are dispatched for execution sequentially by the execution units. Thus, a store in Intel processors is not necessarily a single atomic operation internally, but it is ultimately recombined into one.

As a result of the store buffer design, entries are tagged as containing valid addresses and/or valid data. Under certain conditions, it is possible that entries containing data from previous stores may speculatively be forwarded as matching younger (more recent) loads that are falsely seen to depend upon them. Thus an attacker can monitor recent stores performed by other programs, containers, the operating system, or even other virtual machines running on the same thread. Store buffers are partitioned between the two sibling Hyper-Threads of a core, which makes MSBDS the one MDS variant that is not exploitable across two different Hyper-Threads.

Mitigation of the store buffer variant is similar to the other MDS variants. We overload a rarely used legacy x86 instruction (VERW, Verify Segment for Writing) with a side-effect that causes it to now also flush the affected internal processor microarchitectural buffers. This change is made by updating the microcode on impacted processors such that VERW retains its existing (very rarely used) semantics, but also performs the necessary flush. Then, on vulnerable processors we automatically perform boot-time patching of the Linux kernel to dynamically insert VERW code sequences whenever transitioning across certain privilege boundaries - including from one process or container to another, when entering or leaving OS (Linux kernel) code, or when switching from one virtual machine to another through hypervisor (KVM) code. There is a performance impact to this flush, and over time we may be able to further optimize it.

The fill buffer variant of MDS aka "RIDL"

The fill buffer (MFBDS) variant of MDS targets a small structure that lives alongside the L1 data cache of Intel microprocessors. When a data value is being loaded into the L1D$ from memory (or another cache), it must go through the fill buffer. The fill buffer is critical to Intel’s implementation of a non-blocking cache, allowing multiple loads into the L1D$ to proceed while others are still outstanding (the MOB will ensure that the eventual architecturally visible ordering is consistent). The fill buffer loads a sequence of bytes from memory representing a cache “line” (the unit in which caches are managed, 64 bytes in Intel implementations), all the while still servicing requests from other caches in the system to maintain a coherent view of memory during the load.

As with the store buffer before it, the fill buffer has validity bits and may contain data from previous fill operations (but marked as invalid). Under certain conditions, as described previously, it is possible that the fill buffer may speculatively forward these previous stale data values to younger dependent operations. Thus an attacker can monitor recent loads performed by other programs, containers, the operating system, or even other virtual machines running on the same core. Unlike with store buffers, fill buffers are not split between two sibling Hyper-Threads but are instead a shared resource at the core level. Thus, it is not possible to fully prevent one thread from monitoring the fill buffer activity of a peer Hyper-Thread on the same core.

This is where the most performance-impacting component of the MDS mitigation takes hold. Since we cannot fully prevent cross-thread attacks, complete mitigation of MDS may require that some users disable Intel Hyper-Threading Technology. This is typically the case when running untrusted workloads, especially containers or virtual machines in a multi-tenant environment, such as in a public cloud. In this case, part of the mitigation advice is to specify a kernel command line option (see the “mds=full,nosmt” command line option) that will both mitigate MDS and disable Intel Hyper-Threading. It is important to note that Red Hat is not disabling Intel Hyper-Threading by default. Those administrators desiring to disable Hyper-Threading need to take specific action in order to turn it off on their systems.
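On recent Linux kernels, the mitigation state and Hyper-Threading status are reported through sysfs. Here is a small C sketch (the paths assume a kernel that includes the MDS mitigation patches) that prints both.

```c
/* A small helper (recent Linux kernels assumed) that prints the kernel's
 * reported MDS mitigation status and whether SMT (Hyper-Threading) is
 * active, using the sysfs files added by the upstream mitigation work. */
#include <stdio.h>

static void show(const char *label, const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (f && fgets(line, sizeof(line), f))
        printf("%s: %s", label, line);   /* sysfs lines end in '\n' */
    else
        printf("%s: (not available on this kernel)\n", label);
    if (f)
        fclose(f);
}

int main(void)
{
    show("MDS status", "/sys/devices/system/cpu/vulnerabilities/mds");
    show("SMT active", "/sys/devices/system/cpu/smt/active");
    return 0;
}
```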

The load port variant of MDS

The load port variant of MDS targets a processor structure used during the process of loading a single data value into a processor register. Registers are small memories that store data as it is being operated upon by the execution units. Data is loaded from a cache line into a register for processing, by means of the processor load ports. There are typically only a couple of load ports, and they are competitively shared by peer sibling threads of a Hyper-Threaded core. During the process of performing a load into a register, the load port needs to be able to handle the largest possible load that it may encounter - such as a 512-bit wide vector value - in addition to the smaller 8, 16, 32, and 64-bit loads performed routinely during the course of a program.

Intel’s implementation of load ports doesn’t completely zero out previous data that might have been loaded by older instructions within a program. Instead, it tracks the size of the load, and only forwards those bits that are supposed to be accessed by a load to the internal processor register “file”. But as an optimization, loads can be forwarded (bypassed) for certain operations even while the data is in flight to the register file. This forwarded data may speculatively appear to be larger than its actual width, allowing an attacker to sample certain stale load port data. In the vector example, which is commonly used by cryptographic code, a reasonably large amount of data can be sampled, potentially allowing attackers to derive bits of cryptographic keys used by other applications. This is not in any way related to the earlier “LazyFPU” vulnerability.

In addition, load ports must handle the case that a load is not naturally aligned, meaning that the location of the load in memory is not optimal for performance, but instead crosses address boundaries such as from one cache line to another, or from the end of one page of memory into another page. In these cases, which aren’t good programming practice but are allowed by the x86 architecture, the processor may actually perform multiple loads and then have to coalesce these into a single final value as seen by the architectural state of the machine. But once again, a small window exists during which speculated instructions may see some partially stale data.
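For illustration, here is a minimal C sketch of such a split access: a 4-byte read that deliberately straddles a 64-byte cache line boundary. Using memcpy keeps the C well-defined; x86 compilers typically emit a single unaligned load for it.

```c
/* A minimal sketch of a load that is allowed by x86 but not naturally
 * aligned: a 4-byte read starting 2 bytes before a 64-byte cache line
 * boundary, which the processor has to assemble from two lines. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    /* Two adjacent 64-byte cache lines, aligned to a line boundary. */
    uint8_t *buf = aligned_alloc(64, 128);
    if (!buf)
        return 1;
    for (int i = 0; i < 128; i++)
        buf[i] = (uint8_t)i;

    uint32_t value;
    memcpy(&value, buf + 62, sizeof(value));   /* straddles bytes 62..65 */
    printf("split-line value: 0x%08x\n", value);

    free(buf);
    return 0;
}
```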

The complex behavior of the load ports is slightly different under the hood, but they are structurally similar to the fill buffer and mitigated in the same manner, by flushing any potentially stale values when transitioning between privilege levels. Unfortunately, once again, the sharing of these resources has an impact upon the safety of Intel Hyper-Threading Technology, since an attacker who is able to become co-tenant on the sibling thread of a vulnerable core whose peer thread is performing secret calculations can attempt to coerce that thread into leaking secrets.

We learned from the 2018 experience of Spectre, Meltdown, Foreshadow, and other vulnerabilities that there is now a new normal in which security researchers are likely to continue to discover speculative execution side-channels that impact modern microprocessors. MDS is then just the latest outcome of this ongoing research. We are grateful to the academic and industry partner security community for their collaboration throughout the coordinated vulnerability disclosure process as we sought appropriate mitigations and remediations that can help to keep users and their data safe from exploitation. For further information, we recommend consulting the academic papers, as well as advisories from Red Hat, Intel, and other partners.

For more information on the MDS family of vulnerabilities, please consult the Red Hat Knowledgebase article on MDS.

Jon Masters is computer architecture lead and distinguished engineer at Red Hat.