Arm MTE and Speculative Oracles
A note about speculative oracles and the Arm Memory Tagging Extension (MTE).
Setting up the Stack
The folks at Apple SEAR recently published an article[1] that describes mitigations against memory safety bugs that are being integrated into their ecosystem. On newer Apple Silicon parts, this includes an implementation of Arm's "Memory Tagging Extension" (MTE).
This isn't intended to be a comprehensive review of MTE or memory safety, but since people have been talking about it, I thought it would be a good excuse to write something about microarchitectural attacks against MTE.
If you feel like anything is missing from this article, or if you notice that I've made some kind of egregious mistake, feel free to reach out on social media.
MTE Overview
MTE is a way of constraining the behavior of programs that might have memory safety bugs. Instead of allowing a program to continue with unpredictable or unintended behavior, proper use of MTE is supposed to create situations where invalid or unexpected memory accesses cause the program to "fail fast" with a fault/exception[2][3].
In short, this works by adding hardware support for creating and validating tagged pointers. When allocating some piece of memory, a "tag" is propagated along with the pointer to that memory. The basic idea here is:
- Each 16-byte "granule" of physical memory can be associated with a 4-bit "allocation tag"
- Virtual addresses may also carry a "tag" (in bits [59:56])
- The ISA exposes new instructions for manipulating tags and pointers, ie.
  - irg Xd, Xn - "create a copy of pointer Xn with random tag bits"
  - stg Xt, [Xn] - "store tag bits from pointer Xt for granule Xn"
  - ldg Xt, [Xn] - "load tag bits from granule Xn"
- Loads and stores will fault if the allocation tag for the associated granule does not match the tag in the virtual address
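To make the tag-check rule concrete, here is a minimal software sketch of the scheme described above. It is only a toy model: the `TaggedMemory` class, the `TagCheckFault` exception, and the method names are hypothetical, and it ignores details of real hardware (for example, tag values that an implementation may exclude from random generation).

```python
import secrets

GRANULE = 16     # bytes covered by one allocation tag
TAG_SHIFT = 56   # the address tag lives in bits [59:56]
TAG_MASK = 0xF   # tags are 4 bits wide


class TagCheckFault(Exception):
    """Raised when the address tag and the allocation tag disagree."""


class TaggedMemory:
    """Toy model: a flat byte array plus one 4-bit tag per 16-byte granule."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.tags = [0] * (size // GRANULE)

    @staticmethod
    def split(ptr):
        """Return (address tag, untagged address) for a pointer."""
        tag = (ptr >> TAG_SHIFT) & TAG_MASK
        addr = ptr & ~(TAG_MASK << TAG_SHIFT)
        return tag, addr

    def irg(self, xn):
        """Rough analogue of irg: a copy of xn with random tag bits."""
        _, addr = self.split(xn)
        return addr | (secrets.randbelow(16) << TAG_SHIFT)

    def stg(self, xt, xn):
        """Rough analogue of stg: store xt's tag for the granule at xn."""
        tag, _ = self.split(xt)
        _, addr = self.split(xn)
        self.tags[addr // GRANULE] = tag

    def ldg(self, xn):
        """Rough analogue of ldg: read the tag of the granule at xn."""
        _, addr = self.split(xn)
        return self.tags[addr // GRANULE]

    def load(self, ptr):
        """Load one byte, faulting if the tags do not match."""
        tag, addr = self.split(ptr)
        if tag != self.tags[addr // GRANULE]:
            raise TagCheckFault(hex(ptr))
        return self.data[addr]
```

With this model, `p = mem.irg(0x20)` followed by `mem.stg(p, p)` produces a pointer that `mem.load(p)` accepts, while the same address with any other tag value raises `TagCheckFault`.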
This gives programmers a way to constrain the lifetime of a pointer, or to constrain pointer arithmetic that might occur during runtime. These are both very old problems at the root of very common bugs, especially in languages where the compiler/runtime does not necessarily track references to objects in memory or use type information to constrain pointer arithmetic.
Probabilistic Guarantees
Apart from showing you where memory safety bugs might be occurring, this is also useful if you expect someone to try and actually exploit these kinds of bugs. In that sense, MTE is also a tool for reducing the likelihood that bugs like this can be exploited successfully in a way that remains undetected by system software.[4]
This is all obviously very useful, but there are still situations where MTE can only be a probabilistic mitigation against attempts to exploit bugs like these.
For instance, consider an attacker who has gained control over all the bits in a pointer, and wants to use it to leak or corrupt some unrelated data somewhere else in the program.
If the tag bits are controlled by the attacker, nothing prevents them from simply guessing the tag for the desired memory location - although it's likely that this will take multiple attempts, and each failed attempt will [ideally] cause the program to terminate. With only four tag bits and no way to "authorize" changes to a pointer's tag, MTE cannot provide hard guarantees.
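To put rough numbers on "probabilistic", here is a small sketch of the arithmetic. It assumes that tags are uniformly distributed over all 16 values and that every wrong guess is detected; real deployments may reserve some tag values, so treat the exact figures as illustrative. The function names are hypothetical.

```python
from fractions import Fraction

TAG_BITS = 4
TAG_VALUES = 2 ** TAG_BITS  # 16 possible tag values


def p_undetected(required_guesses):
    """Probability that every one of the independent blind guesses matches."""
    return Fraction(1, TAG_VALUES) ** required_guesses


def p_at_least_one_fault(required_guesses):
    """Probability that at least one guess mismatches and raises a tag fault."""
    return 1 - p_undetected(required_guesses)
```

Under these assumptions, a single blind guess goes unnoticed with probability `p_undetected(1)`, which is 1/16, and an exploit chain that needs three independent correct guesses succeeds silently with probability 1/4096.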
Despite this, it's easy to imagine this is frustrating for an attacker trying to remain undetected while performing some kind of wildly complicated chain of exploits against some program. Ideally, instead of being forced to [potentially] crash, attackers want a way to quietly leak the tags required to continue exerting some control over the program.
Speculative Oracles
Unsurprisingly, a recent pair of papers (TikTag[5] and Sticky Tags[6]) demonstrates ways of silently leaking MTE tags. Both use speculative execution to leak the tags, with the L1D cache as the channel.
Arm has a security advisory[7] (ASA) that addresses this, and the introduction is very clear about the intended scope of MTE:
MTE stands for Memory Tagging Extension [1], and it implements a lock and key based access to memory. Allocation Tags (or locks) of 4 bits can be set on every 16-bytes of memory, and accesses to locked locations are only allowed when the address includes a matching Address Tag (or key).
MTE is designed to expose classes of memory safety issues in software that may become exploitable security vulnerabilities. MTE can provide a limited set of deterministic first line defences, and a broader set of probabilistic first line defences, against specific classes of exploits. However, the probabilistic properties are not designed to be a full solution against an interactive adversary that is able to brute force, leak, or craft arbitrary Address Tags.
[...]
As Allocation Tags are not expected to be a secret to software in the address space, a speculative mechanism that reveals the correct tag value is not considered a compromise of the principles of the architecture.
This is understandable, but maybe a little disappointing because it softens the whole "lock and key" analogy. Besides that, I still think it's useful to think carefully about exactly what is happening in this situation.
I don't want to turn this into a paper review, so I'm going to assume that you've already read the TikTag paper. In short, the authors describe two cases where a [mispredicted] speculative tag fault seems to leave architecturally-visible effects:
- In TikTag-v1, where repeated tag faults seem to prevent a younger load from completing speculatively and/or affect whether or not the load is prefetched and available after the initial branch misprediction is resolved
- In TikTag-v2, where store-to-load forwarding (STLF) eligibility seems to depend on a successful tag check, and a tag fault prevents a younger dependent load from completing speculatively
Since I don't have any MTE-compatible hardware to play with, I don't think we'll be able to say anything new or interesting about the exact conditions used to reproduce these bugs here. But despite that, we can at least say something about what's going on here conceptually.
Speculative Faults
It seems useful to think about what properties we typically expect faulting instructions to have. After all, the useful security properties of MTE come from the fact that it causes faults in certain situations.
If you think about it, all instructions that might cause a fault implicitly carry a special kind of branch. If some faulting condition is true, the instruction stream is halted and redirected to an exception handler - and in some situations, we expect this is the only architecturally-visible effect of the faulting instruction. When a fault occurs, younger instructions are necessarily not part of the [immediate] correct path through the program.
Since faulting instructions can be speculatively executed, an implementation has to decide what happens if the fault occurs speculatively. From the perspective of the machine, speculatively checking for a fault creates two possible situations:
- If the fault is not-taken, younger instructions may be part of the correct path, but only if the instruction in question is also part of the correct path (ie. is guaranteed to retire)
- If a fault is taken, younger instructions cannot be part of the correct path, regardless of whether or not the faulting instruction itself is also part of the correct path
When a fault is taken speculatively, since there are no situations where younger instructions are expected to retire, a reasonable analogy to "changing control-flow to a fault handler" in the speculative case is "halting speculative control-flow past the faulting instruction".
There are perfectly valid reasons to want to avoid speculating past a fault. If speculation were not cancelled or somehow restricted by a fault, we might risk propagating invalid values to younger dependent instructions.
For instance, imagine an incorrect speculative load that causes a page fault based on some privilege check (ie. we are in userspace and trying to load from kernel memory).
If speculation were allowed to continue across the fault [and, if the load itself is allowed to be speculatively completed], we'd risk creating a situation where younger dependent instructions are speculatively executed with data that we should not be allowed to access. If those instructions can have architecturally-visible effects, this might allow us to leak data from a different privilege domain.
Speculative Control-flow
With the TikTag cases, despite the fact that it seems like speculation is being cancelled when the fault occurs, the result of the tag check is still leaking. If cancelling speculation is supposed to prevent us from leaking values, why does this leak the result of the tag check?
This follows directly from the fact that cancelling speculation necessarily makes a difference in speculative control-flow, not just data-flow. That difference is necessarily visible when the younger speculative instructions (or other parts of the machine that are sensitive to the difference) have architecturally-visible effects. The presence [or absence] of those effects is sufficient to encode whether or not the fault occurred.
I personally think it's easier to see this using performance monitoring counters (PMCs) as a channel, rather than thinking about leaving effects on the cache and using a cycle counter. Consider the following:
_start:
    <always-mispredicted branch to _exit>
_speculative_fault:
    <maybe faulting instruction>
_speculative_marker:
    <instruction that triggers a unique/distinguishable PMC event>
    ...
_exit:
After running this code, if you discover that your counter is nonzero, this tells you that either the fault did not occur, or that speculation is simply allowed across the fault in this case. Otherwise, if the counter has not been incremented, this is probably a good indication that the fault occurred speculatively, and that speculation is not allowed across the fault.
Note that this is also true for our example about the page fault. Although we expect halting speculation to outright prevent data-flow across the fault, it still creates an oracle. When the fault does not occur, speculation continues and the effects of younger instructions show us that the fault did not occur.
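The PMC experiment can be mimicked with a toy software model, which may help show why a single bit of "did the marker run?" is enough to recover a tag. This is only a sketch of the reasoning, not an exploit and not a model of any real core: the function names and the `cancel_on_fault` switch are hypothetical, and the returned count stands in for whatever counter or cache effect the marker instruction would leave behind.

```python
def speculative_window(pointer_tag, allocation_tag, cancel_on_fault=True):
    """Model one mispredicted pass through the fault and marker instructions.

    Returns the number of marker events a hypothetical counter would record.
    Nothing in this window retires, so no architectural fault is ever raised.
    """
    fault = pointer_tag != allocation_tag
    if fault and cancel_on_fault:
        return 0  # speculation halted before the marker instruction
    return 1      # marker executed speculatively and bumped the counter


def leak_tag(allocation_tag, cancel_on_fault=True):
    """Try all 16 candidate tags; return those for which the marker fired."""
    return [
        candidate
        for candidate in range(16)
        if speculative_window(candidate, allocation_tag, cancel_on_fault)
    ]
```

When speculation is cancelled on a fault, `leak_tag(0xA)` returns only `[10]`: the one candidate that did not fault. With `cancel_on_fault=False`, every candidate fires the marker and this particular channel reveals nothing about the tag, at the cost of letting speculation run past the failed check.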
Wrapping Up
With MTE tag faults, this seems difficult because it presents implementations with an undesirable tradeoff: either (a) cancel speculation and risk leaking the existence of the fault, or (b) allow speculation and risk leaking values past the fault. To quote the ASA again:
According to FEAT_CSV3, the architectural rule against Meltdown-style issues, “Data loaded under speculation with a permission or domain fault cannot be used to” do anything that exposes its value via a side channel. However, for performance reasons, the architecture does not restrict data loaded under speculation with a Tag-Check fault.
[...]
On implementations where faulting Tag-Checks do not affect speculative execution in any way, but only yield a fault when retired, there are no observable differences that the adversary can leverage until the fault occurs.
However, this alternative creates another threat: memory tagging protection is effectively disabled during speculation, and thus Spectre attacks (potentially assisted by memory corruption vulnerabilities) are still possible.
Finally, apart from the discussion about MTE and tag faults, it's also worth mentioning that there are other situations where faults might be involved in creating oracles.
Returning to our imaginary case with the page fault: it's easy [for me] to imagine an implementation where speculative page faults might accidentally reveal whether or not memory is accessible at a certain location. This would probably be useful if the layout of virtual memory is secret and/or randomized, like in the case of operating systems with ASLR.
On the topic of defeating ASLR, there's also the case of the PREFETCH instruction in some x86 implementations.[8] This is slightly different because it involves the fact that PREFETCH does not cause page faults, but then again, maybe some of the ideas here belong in an argument for why it should remain that way.
In a reverse-engineering context, it turns out you can also use this kind of thing to speculatively fuzz instruction sets by relying on the fact that instruction decoding occurs speculatively, and that undefined instructions generally cause faults.
Or, in a situation not too dissimilar from MTE, there's the case of attacks[9] against the Arm Pointer Authentication extension, where speculative faults might leak the result of pointer verification.
References and Footnotes
[2] Apart from the case where tag check failure immediately causes an exception, MTE also includes support for generating an asynchronous signal that can be handled by system software sometime in the future - but we won't really be talking about this here.
[3] The compile-time analogy for this is probably something like ASan.
[4] In a broad sense, I see this as a "canary-in-a-coal-mine" kind of situation, and it certainly makes a lot of sense in the Apple ecosystem. Ideally, if someone is trying to do something nasty to your iPhone and ends up crashing some program with MTE tag check exceptions in the process, this is probably a good sign that system software should start sending logs off to the right people.