Understanding L1 Terminal Fault aka Foreshadow: What you need to know

14 agosto 2018Jon Masters20 minuti (tempo di lettura)

L1 Terminal Fault/Foreshadow explained in ~three minutes

For a more detailed technical view of L1 Terminal Fault, please see this deeper dive with Jon Masters.

How we got here: a brief history of modern microprocessor caches

Modern computer microprocessors are complex machines, filled with many optimizations designed to squeeze out every last ounce of performance available. The earliest stored program computers were, by contrast, relatively simple machines. They executed the programs provided by the user, following each of the binary encoded instructions precisely, in the order given, delivering predictable program flow and results. This worked well in the early days of computing in large part because different parts of the computer operated at similar speeds. The processor (CPU) wasn’t much faster than the memory chips attached to it, for example. As a result, the processor could simply wait to load data from memory as a program needed it.

Over time, the relative performance of the different components of computers changed dramatically. Microprocessors continued to increase in terms of frequency from a few thousands of instructions per second to millions, and eventually billions of instructions per second. At the same time, transistor counts (area) that could fit onto a single chip also increased by many orders of magnitude, again from thousands to the many billions of transistors found on today’s high-end processors. This allowed for increasingly complex designs. Yet, while the other parts of the computer also advanced (memory chips went from a few megabytes to many gigabytes), relative processor performance became much faster. As a direct consequence of this disparity in performance, processors began to become bottlenecked on slower external memory, often waiting hundreds, thousands, or millions of “cycles” for data.

Academia came to the rescue with the innovation of computer caches. Caches store copies of recently used data much closer to the processor’s functional units (the parts that perform calculations). The data values are stored in slower external memory and as they are loaded by the program, they are also stored in the processor caches. In fact, modern processors (almost) never operate directly on data stored in external memory. Instead, they are optimized to work on data contained within the caches. These caches form a hierarchy in which the highest level (L1) is closest to the “core” of the processor, while lower levels (L2, L3, etc.) are conceptually further away. Each level has different characteristics.

In digital electronics, it is often said that you can have “small and fast” or “big and slow”, but rarely can you have both. Thus, the L1 data cache within a modern processor is only on the order of 32 kilobytes (barely enough for a single cat photo) while the L3 (sometimes called LLC - Last Level Cache) closer to the external memory might be 32 megabytes or even larger in today’s highest-end servers. The manner in which caches are designed (“organized”) varies from one design to another, but in a classical inclusive design, copies of the same data will be contained within multiple levels of the cache depending upon when it was last used. A replacement algorithm within the processor will “evict” less recently used entries from the L1 data cache in order to load new data while keeping copies within the L2 or L3. Thus, it is rare that recently used data must be loaded from memory, but it may well come from one of the larger, slower, lower level caches into the L1 when needed.

Caches are a shared resource between multiple individual cores, threads, and programs. Modern processor chips contain multiple cores. Each of these behaves just as a traditional single processor computer. Each may execute individual programs or run multiple threads of the same program and operate with shared memory. Each core has its own L1 cache and may have its own L2 cache as well, while the larger L3 is usually shared by all of the cores in a processor. A technique known as cache “coherency” is then employed by the processor to keep each core’s internal copy of a memory location in sync with any copy stored in the cache of another core. This is achieved by the cores tracking ownership of memory and sending small messages to one another behind the scenes each time memory is updated.

The shared nature of caches is both a benefit, as well as a potential source of exploitation. We learned from the “Spectre” and “Meltdown” vulnerabilities that caches can serve as a potential “side-channel” through which information may be leaked. In side-channel analysis, data is not directly accessed but is instead inferred due to some property of the system under observation. In the case of caches, the relative difference in performance between cache memory, and the much slower main memory external to a processor is the whole point of using caches to begin with. Unfortunately, this same difference can be exploited by measuring the relative difference in access times to determine whether one memory location is in the cache or not. If it is contained within the cache, it has been recently accessed as a result of some other processor activity. As an example, data loaded during speculative execution by a processor will alter the cache state.

Virtual and physical memory addressing

Modern processor caches typically use a combination of virtual and physical memory addresses to reference data. Physical addresses are used to access main memory. Conceptually, you can think of the external memory DIMM or DDR chips in your computer as being a giant array of values beginning at address zero and continuing upward until the memory is exhausted. The amount of physical memory varies from one machine to another, from 8GB in a typical laptop to hundreds of gigabytes or even more in contemporary high-end server machines. Programmers once had to worry about physical memory when writing programs. In the earlier days, it was necessary for the programmer to explicitly track what was contained in each physical memory location and to avoid potential conflict with other applications that use memory.

Today’s machines use virtual memory. Virtual memory means that the operating system is able to present each application with its own conceptually isolated view of the world. Programs see memory as a nearly infinite range within which they can do whatever they like. Every time the program accesses a memory location, the address is translated by special hardware within the processor known as the Memory Management Unit (MMU). The MMU works in consort with the operating system (OS), which creates and manages a set of page tables that translate virtual memory addresses into physical ones. Physical memory is divided into tiny chunks known as pages that are typically 4KB in size. Page tables contain translations for these pages such that one 4KB size range of virtual addresses will translate into a 4KB range of physical ones. As a further optimization, page tables are hierarchical in nature with a single address being decoded through a sequence of “walks” through several layers of tables until it has been fully translated.

MMUs contain special hardware that can read and even update the OS managed page tables. These include page table “walkers” that go through the process of a table walk, as well as additional hardware that can update page table entries to indicate recently accessed data. The latter is used by the OS for tracking what data can be temporarily “paged” or “swapped” out to disk if it has not recently been used. A page table entry can be marked as “not present”, meaning that any attempt to access the associated address will generate a special condition known as a “page fault” that signals the OS to take action. Thus, the OS can intercept attempts to access data that has been previously swapped out to disk, pulling it back into memory, and resuming an application without it noticing this has happened. As a result, paging is used to create the illusion of having more physical memory than actually exists.

As you might imagine, page table walks can be expensive from a performance perspective. OS managed page tables live in real physical memory which must be loaded into the processor when read. Walking a page table can take quite a few such memory accesses, which would be prohibitively slow were it to happen every time. So instead of doing this walk on each memory access by a program, the processor will cache the result of these table walks in a separate processor structure, known as a Translation Lookaside Buffer (or TLB). Recently used translations are thus much faster to resolve into physical addresses because the processor need only search the TLB. If an entry does not exist, then the processor will perform the much more expensive page walk and populate a TLB entry, possibly evicting another entry in the process. Incidentally, we saw another unrelated attack against TLBs recently in the form of the TLBleed vulnerability.

When programs read or write to memory, these accesses go through the highest level (L1) data cache, which is (in most modern implementations) known as Virtually Indexed, Physically Tagged (VIPT). This means that the cache uses part virtual and part physical addresses to look up a memory location. As the virtual address for a load is examined, the processor will perform a simultaneous search of the TLB for the virtual to physical page translation, while beginning to search the cache for a possible matching entry by using the offset within a single page. Thus, the process of reading from a virtual memory location is extremely complicated in almost every modern processor. For the curious, this common design optimization explains why L1 data caches are typically 32KB on processors with a 4KB page size, since they are limited by the available number of offset bits within a single page to begin the cache search.

Intel processors contain a further optimization in how they handle the process of a page table walk and “terminal faults” (not present). We will dive into this further after first reviewing the speculative nature of modern processors. You can skip over the next section if you’re already familiar with vulnerabilities such as Meltdown.

Out-of-Order and Speculative execution

The adoption of caches enabled faster microprocessor performance improvements as compared with other parts of a modern computer platform. Academia and industry working together created foundational innovations - such as Out-of-Order (OoO) and speculative execution - that serve as the underpinning for the subsequent consistent gains in throughput and performance seen in recent years. As transistor counts increase and processors become more complex, the possible optimizations further advance, but they are all still built upon the the key insight of an OoO design.

In OoO, the processor is conceptually split into an “in-order” front-end and an “out-of-order” back-end. The front-end takes as input the user program. This program is sequential in nature, formed from blocks of code and occasional branches (such as “if” statements) to other blocks based upon the conditional evaluation of data values. The in-order front end dispatches instructions contained within the program to the out-of-order backend. As each of these instructions is dispatched, an entry is allocated within a processor structure known as the Re-Order Buffer (ROB). The ROB enables data dependence tracking, realizing the key innovation of OoO that instructions can execute in any order just so long as the programmer is unable to tell the difference in the end. They only see the same effect as in a sequential execution model.

The ROB effectively serves to convert an in-order machine into what is known as a “dataflow” machine, in which dependent instructions are forced to wait to execute until their input values are ready. Whenever an entry in the ROB containing a program instruction has all of its dependent data values available (e.g. loaded from memory), it will be issued to the processor functional units and the result stored back in the ROB for use by following instructions. As entries in the ROB age out (become the “oldest instruction in the machine”) they are known as retired and are made available to the programmer visible view of the machine. This is known as the “architecturally visible” state and it is identical to that obtained from a sequential execution of the program. Re-Ordering instructions in this manner provides significant speedups.

Consider this example pseudocode:

LOAD R1

LOAD R2

R3 = R1 + R2

R4 = 2

R5 = R4 + 1

R6 = R3 + 1

The example uses the letter R to designate small processor internal memory locations known as registers, or more broadly as GPRs (General Purpose Registers). There are typically only a small handful of registers, 16 in the case of Intel x86-64 machines. For convenience here, we number them R1, R2, etc. while in reality, they have other names, such as RAX, RBX, etc.

In a classical sequential model of execution, the first two instructions could cause the machine to wait (“stall”) while slow external memory locations were accessed. Caches speed this up, but even if the two values are contained within some level of the processor cache, there may still be a small delay while the load instructions are completed. Rather than wait for these, an OoO machine will skip ahead, noticing that while instruction number 3 depends upon the first two (has a “data dependency”), instructions 4, and 5 are independent. In fact, those two instructions have no dependency at all upon the earlier instructions. They can be executed at any time and don’t even depend upon external memory. Their results will be stored in the ROB until such time as the earlier instructions have also completed, at which time they too will retire.

Instruction number 6 in the previous example is known as a data dependency. Just like how instruction 3 depends upon the results of loading memory locations A and B, instruction 6 depends upon the result of adding those two memory locations. A real-world program will have many such dependencies, all tracked in the ROB, which might be quite large in size. As a result of the large size of this structure, it is possible to layer another innovation upon OoO.

Speculation builds upon OoO execution. In speculative execution, the processor again performs instructions in a sequence different from that in which the program is written, but it also speculates beyond branches in the program code.

Consider a program statement such as:

if (it_is_raining)

pack_umbrella();

The value “it_is_raining” might be contained in (slower) external memory. As a consequence, it would be useful for the processor to be able to continue to perform useful work while waiting for the branch condition to be “resolved”. Rather than stalling (as in a classical, simpler design), a speculative processor will guess (predict) the direction of a branch based upon history. The processor will continue to execute instructions following the branch, but it will tag the results in the ROB to indicate that they are speculative and may need to be thrown away. A notion of checkpointing within the processor allows speculation to be quickly undone without becoming visible to the programmer, but some artifacts of speculative activity may still be visible.

We learned from the Spectre vulnerabilities that processor branch predictors can be tricked (trained) into predicting a certain way. Then, code can be executed speculatively and have an observable effect upon the processor caches. If we can find suitable “gadgets” in existing code, we can cause deliberate speculative execution of code that ordinarily should not form part of a program flow (e.g. exceeding the bounds of an array (in Spectre-v1), and then cause dependent instructions to execute that will alter locations in the cache through which we can infer the value of data to which we should not have access). You can read more about the Spectre vulnerabilities by referring to our earlier blog.

Increasing speculation in Intel processors

Modern processors take speculation much further than simply running ahead in a program. Since the speculative apparatus has been created and is already in use, microprocessor vendors like Intel have further extended it in order to speculate upon all manner of additional possible states within the processor. This includes speculating upon the result of a page table walk by the MMU page walker during the translation of a virtual to physical memory address.

Intel defines the term “terminal fault” to mean the condition that arises whenever a Page Table Entry (PTE) is found to be “not present” during a page table walk. This typically happens because an operating system (OS) has swapped a page out to disk (or not yet demand loaded it) and marked the page as not present in order to trigger a later fault on access. As explained earlier, the illusion of swapping allows an OS to provide much more virtual memory than physical memory within the machine. Page faults will occur in the case of a not present page, and the OS can then determine what memory location needs to be swapped back in from disk.

The OS does this by using the bits of a “not present” PTE to store various housekeeping data, such as the physical location on disk containing the content of the page. The Intel Software Developer’s Manual (SDM) states that pages marked as not present in this fashion will have all of the remaining bits (other than the present bit) ignored such that they are available to the OS. Both Linux, Windows, and other operating systems make heavy use of these PTE bits for “not present” pages in order to support swapping, as well as for various other permitted purposes.

As mentioned previously, processors such as Intel’s use a common optimization in which virtual address translation is performed in parallel with cache access to the Virtually Indexed, Physically Tagged (VIPT) L1 data cache. As a further optimization, Intel realized that in the optimal case (critical logic path) a data value loaded from memory is present in the cache, and there is a valid page table translation for it. Put another way, it is less likely that the page table has an entry that is marked “not present”. As a result, modern Intel processors delay handling the present bit check slightly and forward the content of Page Table Entries (PTEs) directly to the cache control logic while simultaneously performing all of the other checks, including whether the entry is valid.A new set of speculation vulnerabilities

As in Meltdown before it, the ROB is tagged to indicate if a “not present” fault should be raised, but meanwhile, the processor will continue to speculate slightly further in the program until this fault takes effect. During this small window, any data values present in the L1 data cache for the “not present” page will nonetheless be forwarded to dependent instructions. An attack similar in concept to Meltdown can be used to read data from physical addresses if a “not present” page table entry can be created (or cause to be created) for the address, and if that physical address is currently present in the L1 data cache. This is known as an L1 Terminal Fault attack.

On Linux, an attacker could exploit this vulnerability and attempt to compromise the kernel or another application through a malicious use of the mprotect() system call to cause a “not present” page table entry for a physical address of interest that might be in the cache. If they can then trick the other application (or the kernel) into loading a secret of interest - such as a cryptographic key, password, or other sensitive data - then they can extract it using an attack similar in nature to the Meltdown exploit code. This attack may be mitigated by changing how Linux generates “not present” PTEs such that certain physical address bits are always set in the PTE (using an inexpensive masking operation), thus the processor will still forward a physical address in the “not present” case, but it will appear to be a large physical address that is outside of the range of populated physical memory in all but the most extreme cases.

Beyond bare metal

The L1TF attack against bare metal machines is trivial to mitigate through a few lines of kernel code (that is available in all of our errata releases, and has also been submitted for inclusion in upstream Linux). This mitigation has no measurable performance impact and requires systems be promptly patched . If that were the end of it, this blog post might not be necessary, and nor would the inevitable attention that will be paid to this latest vulnerability.

Unfortunately, there are several other components to this vulnerability.

One relates to Software Guard Extensions (SGX). SGX is an Intel technology, also referred to as a “secure enclave” in which users can provide secure software code that will run in a special protected “enclave” that will keep that software from being observed by even the operating system. The typical use case of SGX is to provide tampering protection for rights management, encryption, and other software. Red Hat does not ship SGX software, which is typically owned and managed directly by third-parties. Intel has protected SGX by issuing processor microcode updates designed to prevent it from being compromised through the “Foreshadow” SGX specific variant of the L1TF vulnerability. Red Hat is providing access to courtesy microcode updates to assist in deploying this mitigation.

Another variant of L1TF concerns virtualization use cases. In virtualized deployments, Intel processors implement a technology known as EPT (Extended Page Tables) in which page tables are jointly managed by both a hypervisor, a guest operating system running under that hypervisor, and the hardware. EPT replaces an older software-only approach in which the hypervisor was forced to use shadow page tables. In the older design, each time a guest operating system wanted to update its own page tables, the hypervisor would have to trap (stop the guest), update its own shadow tables (as used by the real hardware), and resume the guest. This was necessary to ensure that a guest could never try to create page tables accessing memory disallowed by the hypervisor.

EPT significantly improve performance because a guest operating system can manage its own page tables just as it would on bare metal. Under EPT, each memory access is translated multiple times, first using the guest page tables from a guest virtual address to a guest physical address, and then by the Hypervisor page tables from a guest physical address to a host physical address. This process allows all of the benefit of native bare metal hardware assisted page translation while still allowing the hypervisor to retain control since it can arrange for various traps to occur, and manage which guest physical memory is available.

When Intel implemented EPT, it was an extension of the existing paging infrastructure. It seems likely that the extra stage of translation was simply added beyond the traditional one. This worked well, with the exception of one small problem. A terminal fault condition arising in the guest virtual to physical address translation can result in an untranslated guest physical address being treated as a host physical address, and forwarded on to the L1 data cache. Thus, it is possible for a malicious guest to create an EPT page table entry that is marked as both “not present” and also contains a host physical address from which it would like to read. If that host physical address is in the L1 data cache, it can read it.

As a result, if a malicious guest can cause a Hypervisor (or another guest) to load a secret into the L1 data cache (e.g. just by using a data value during its normal operation), it can extract that data using an attack that is similar to Meltdown. When we first reproduced this attack within Red Hat earlier this year, we used a modified version of the TU Graz Meltdown code to extract data from known physical addresses in which we had stored interesting strings. While we should have seen an innocuous string we stashed in the guest physical memory, once the malicious L1TF PTE was created for the same location in the host, we read its memory instead. There are a few additional pieces required to reproduce the vulnerability that are omitted.

L1TF is a significant threat to virtualized environments, especially those that contain a mixture of trusted and untrusted virtual machines. Fortunately, L1TF can be mitigated with a modest cost to system performance. Since a successful exploit requires that data be contained within the L1 data cache on a vulnerable machine, it is possible to arrange for the L1 to be flushed before returning to a guest virtual machine in cases where secrets or other data of interest to a malicious party might have been loaded. Performing this flush is not without cost, but a refill from the L2 to the L1 cache is fast (only a few hundreds of cycles) and uses a high bandwidth internal bus that exists in these processors. Thus, the overhead is on the order of a few percent. We have both a software (fallback) cache flush, as well as an (optimized) hardware-assisted flush that is available through microcode updates.

The L1 data cache flush mitigation will be automatically enabled whenever virtualization is in use on impacted machines.

There is one further complexity to the L1TF vulnerability concerning Intel Hyper-Threading.

Simultaneous Multi-Threading

Processors may implement an optimization known as Simultaneous Multi-Threading (SMT). SMT was invented by Susan Eggers, who earlier this year received the prestigious Eckert-Mauchly Award for her contributions to the field. Eggers realized in the 1990s that greater program thread-level parallelism could be realized by splitting a single physical processor core into several lighter-weight threads. Rather than duplicating all of the resources of a full core, SMT duplicates only the more essential resources needed to have two separate threads of execution running at the same time. The idea is that expensive (in terms of transistor count) resources like caches can be shared tightly between two threads of the same program because they are often operating upon the same data. Rather than destructively competing, these threads actually serve to pull useful data into their shared caches, for example in the case of a “producer-consumer” situation in which one thread generates data used by another thread.

Intel was one of the early commercial adopters of SMT. Their implementation, known as “Hyper-Threading” has been effective, leading to average computer users referring to the number of cores and threads in their machines as a key characteristic. In Hyper-Threading, two peer threads (siblings) exist within a single core. Each is dynamically scheduled to use the available resources of the core in such a way that an overall gain in throughput of up to 30% can be realized for truly threaded applications that are operating on shared data. At the same time, effort is made to reduce the impact of unintentionally disruptive interference (for example in terms of cache footprint) between two threads that aren’t closely sharing resources.

Indeed, so good is the general implementation of Intel Hyper-Threading that in many cases it can be hard for end users to distinguish between Hyper-Threading threads and additional physical cores. Under Linux, these threaded “logical processors” are reported (in /proc/cpuinfo) almost identically to full processor cores. The only real way to tell the difference is to look at the associated topology, as described in the “core id”, “cpu cores”, and “siblings” fields of the output, or in the more structured output of commands that parse this topology, such as “lscpu”. The Linux scheduler knows the difference, of course, and it will try not to schedule unrelated threads onto the same core. Nonetheless, there are many occasions (such as in HPC applications) where the potential for interference between unrelated threads outweighs the benefit. In these cases, some users have long disabled Hyper-Threading using BIOS settings in their computer firmware.

The concept of not splitting Hyper-Threading threads across different workloads has long extended into the realm of virtualization as well. For a number of reasons, it has long made sense to assign only full cores (so-called “core scheduling”) to virtual machine instances. Two different virtual machines sharing a single core can otherwise interfere with one another’s performance as they split the underlying caches and other resources. Yet for all the potential problems that can arise, it has long been tempting to treat threads as cheap extra cores. Thus, it is common in today’s deployments to split VMs across Hyper-Threads of the same core, and technologies like OpenStack will often do this by default. This has never been a great idea, but the impact to overall security is far more significant in the presence of the L1TF vulnerability.

Hyper-Threads run simultaneously (the “S” in “SMT”), and as a result, it is possible that one thread is running Hypervisor code or another virtual machine instance, while simultaneously the peer thread is executing a malicious guest. When entering the malicious guest on the peer thread, the L1 data cache will be flushed, but unfortunately, it is not possible to prevent it from subsequently observing the cache loads performed by its peer thread. Thus, if two different virtual machines are running on the same core at the same time, it is difficult to guarantee that they cannot perform an L1TF exploit against one another and steal secrets.

The precise impact of L1TF to Hyper-Threading depends upon the specific use case and the virtualization environment being used. In some cases, it may be possible for public cloud vendors (who have often built special purpose hardware to assist in isolation) to take steps to render Hyper-Threading safe. In other cases, such as in a traditional enterprise environment featuring untrusted guest virtual machines, it may be necessary to disable Intel Hyper-Threading. Since this varies from one use case to another, and from one environment to another, Red Hat and our peers are not disabling Intel Hyper-Threading by default. Customers should instead consult our Knowledge Base article and make the appropriate determination for their own situation.

To help facilitate control over Intel Hyper-Threading, Red Hat is shipping updated kernels that include a new interface through which customers can disable Hyper-Threading at boot time. Consult the updated kernel documentation and Knowledgebase for further information.

Wrapping up

The L1TF (L1 Terminal Fault) Intel processor vulnerability is complex and in some cases requires specific actions by customers to effect a complete mitigation. Red Hat and our partners have been working to prepare for the public coordinated disclosure, and to prepare patches, documentation, training, and other materials necessary to help keep our customers and their data safe. We recommend that you always follow best security practices, including deploying the updates for the L1TF vulnerability as quickly as possible.

For further information, please consult the Red Hat Knowledgebase article on L1TF.

Jon Masters is chief ARM architect at Red Hat.