Detecting Meltdown Using Radix

This is the first post in a two-part series on how Radix can be used to detect and discover microarchitectural vulnerabilities such as Meltdown and Spectre. This first post introduces the Meltdown vulnerability, formulates Meltdown as a hardware information flow property violation, and provides details on how Tortuga Logic’s Radix solution can be used to ensure future generations of processors are not vulnerable to Meltdown.

The second blog post will review the Spectre vulnerability and outline the process of formulating microarchitectural security issues manifesting during speculation as information flow property violations. Tortuga Logic, in collaboration with a processor vendor, has developed and implemented a novel methodology for tracking data manipulated by squashed CPU instructions in order to determine if leakage through microarchitectural side-channels is possible.

What is Meltdown?

Meltdown(1), CVE-2017-5754, is a CPU vulnerability allowing unprivileged software to read arbitrary data from kernel memory. Meltdown does not exploit a bug in operating system code, but instead leverages non-transparent microarchitectural behavior allowing data brought into the CPU from a faulting load instruction to leak through timing side channels.
Figure 1 illustrates the functional processor blocks and operations involved in the Meltdown vulnerability. An attacker exploiting the Meltdown vulnerability begins by attempting to load data from kernel memory (Step 0 in Figure 1). The attacker is executing at a lower privilege level and does have adequate permissions to access the data. The process of virtual to physical address translation involves consulting the page tables through a page table walk (PTW) or accessing the relevant page table entry in the Translation Lookaside Buffer (TLB). The page table entry contains the permissions for the page containing the load address.

Meltdown exploits a race condition where data is read from the memory subsystem (red path in Figure 1) in parallel with the permissions check performed on the memory access (green path in Figure 1). Due to out-of-order execution, the data read from memory can enter the CPU (Step 2) and be used in subsequent transient instructions (Steps 3 and 4) even if the permissions check fails. A key point to note is that transient instructions can’t affect the architectural state, but can alter the state of microarchitectural structures, such as caches, in a way that is observable to software. Leaking the value of the illegally loaded data through a cache timing side-channel only requires transiently executing one or two additional instructions before the permissions check failure is handled.

Figure 1: Meltdown results from a race condition between the red and green paths.

Formulating Meltdown as an Information Flow Property

Detecting vulnerabilities like Meltdown is challenging because the illegal loading of the data and transmission through the timing side-channel occurs during transient or speculative execution, which is invisible to the programmer’s view of the processor. The behavior of processors during speculation or out-of-order execution must be analyzed in a security context before tapeout to avoid similar vulnerabilities in next generation CPUs. Meltdown is a violation of the following security requirement:

Data loaded under speculation eventually resulting in a fault or exception due to failure of a permissions check, cannot cause cache allocation.

If a verifiable security property can be written for a processor RTL design, then Meltdown-like vulnerabilities can be detected and mitigated pre-silicon.
Determining if load data affects cache allocation is extremely difficult without analysis tools that understand the concept of information flow. The transformation of load data to an address for a subsequent load, and the effects the second load has on cache state are hard to model using value-based checkers but easy to capture using hardware information flow tracking tools such as Tortuga Logic’s Radix.

Radix dynamically tracks information flows in RTL designs during simulation or emulation. Radix security rules control what design information should be tracked, when it should be tracked, and where in the design it should not propagate to.

The basic syntax of a Radix Security Rule is:

{ source } when ( condition ) =/=> { destination }

“=/=>” is the no-flow operator. The rule will fail if information from the source signal set propagates to any signal in the destination signal set. Both the source and destination consist of a set of RTL design signals. The “when” keyword specifies conditions which must hold to start tracking information from the source.

Below is the high-level structure of a Radix Security Rule capable of detecting Meltdown:

Load Data when ( Load Data Valid &&
Do Not Have Permission to Load Data ) =/=> L1D$ Tag RAM

Figure 2 shows where in a typical processor design the source and destination signals are located. The red dashed line emanating from the source shows the information flows tracked by the security rule.

Figure 2: Design signals relevant to the Radix Security Rule for detecting Meltdown

For simplicity, a single load port is shown, but Radix security rules can be easily extended to include multiple source signal sets each with their own independent “when” conditions for enabling information flow tracking. Each load data bus has a corresponding valid signal used to mark when data is present. In real processor implementations analyzed by Tortuga Logic, the load data is also qualified by a flag carrying the result of the permissions check at the same time the valid signal is asserted and the load data enters the core.
This begs the following question:

If the result of the permissions check is known before illegal data is loaded into the CPU, why is the data still allowed to enter the CPU and be manipulated by transient instructions?

The answer is that delaying all exception and fault handling to a fixed time in the future is more efficient than aborting instructions at multiple points during execution. Before the discovery of Spectre and Meltdown these sort of aggressive optimizations in CPU design were allowed as long as the delayed exception and fault handling didn’t erroneously affect the programmer’s view of the processor. Meltdown emphasized the importance of ensuring that these optimizations don’t enable data from a higher privilege level to be accessed and leaked by transient instructions executed at a lower privilege through microarchitectural side channels.

The Radix Security Rule provided tracks valid data entering the CPU that will eventually lead to an exception from the illegal load. The power of information flow tracking is that the illegal load data can be tracked through arbitrary transformations such as bit shifting, combination with other data to form an address etc. If the illegal data in any way propagates to the L1 data cache tag RAM, the Radix rule will fail and Meltdown will be detected.

Discussion

Tortuga Logic wrote the Meltdown security rule after the discovery of the Meltdown vulnerability and verified its effectiveness on one of our customers’ commercial CPUs vulnerable to Meltdown and a revised version with hardware mitigations in place. Even though this rule targets detection of a known vulnerability, the rule provides significant additional value over a directed test case for Meltdown for the following applications:

Verifying the effectiveness of mitigations
Discovery of additional microarchitectural structures acting as covert channels

Software mitigations for Meltdown incur significant overhead but are necessary for vulnerable CPUs already deployed in the field. CPU designs under active development should use the Meltdown Radix Security Rule during RTL simulation and emulation to verify that data returned from a load failing a permissions check is appropriately blocked.

If the mitigation blocks speculative load data immediately based on the permissions check and the load data is never expected to reach the CPU core, the Radix rule destination can be changed to the CPU inputs to reflect this expectation. In other hardware mitigations we have seen, the data still is brought into the CPU, but blocked before an address can be derived for the second load. In this case, it is extremely important to verify there are no additional side-channels that data can leak through even if the path to altering the data cache state is blocked.

Radix provides several analysis views, shown in Figure 3, enabling visualization of rule-specific information flows occurring in the RTL design during simulation or emulation that were invaluable when detecting Meltdown. Radix will track information from the rule source (in this case the illegal load data) anywhere in the design, not only just the rule destination signals. This allows interactive examination of the illegal load data through any microarchitectural structures inside the processor.

Figure 3: Radix analysis views enable visual exploration of information flows within the design