Commit 0d5eca3

RFC: CFI Improvements with PAuth and BTI
Improve control flow integrity for compiled WebAssembly code by utilizing two technologies from the Arm instruction set architecture - Pointer Authentication and Branch Target Identification. Copyright (c) 2021, Arm Limited.
1 parent 2821d03 commit 0d5eca3

1 file changed, 210 additions and 0 deletions

# Summary
[summary]: #summary

This RFC proposes to improve control flow integrity for compiled WebAssembly code by utilizing two
technologies from the Arm instruction set architecture: Pointer Authentication and Branch Target
Identification.

# Motivation
[motivation]: #motivation

The [security model of WebAssembly](https://webassembly.org/docs/security/) ensures that Wasm
modules execute in a sandboxed environment isolated from the host runtime. One aspect of that model
is that it provides implicit control flow integrity (CFI), for example by forcing all function call
targets to specify a valid entry in the function index space and by using a protected call stack
that is not affected by buffer overflows in the module heap. As a result, in some Wasm applications
the runtime is able to execute untrusted code safely. However, that places the burden of upholding
the security properties largely on the compiler.

On the other hand, a further aspect of the WebAssembly design is efficient execution (close to
native speed), which leads to a natural tendency towards sophisticated optimizing compilers.
Unfortunately, the additional complexity increases the risk of implementation defects and, in
particular, of compromising the security properties. For example, Cranelift has been affected by
issues such as [CVE-2021-32629][cve] that could make it possible to access the protected call stack
or memory that is private to the host runtime.

We are trying to tackle the challenge of ensuring compiler correctness with initiatives such as
expanding fuzzing and making it possible to apply formal verification to at least some parts of the
compilation process. However, it is also reasonable to consider a defense-in-depth strategy and to
evaluate mitigations for potential future issues.

Finally, Wasmtime can be used as a library and, in particular, embedded into an application that is
implemented in languages such as C and C++ that lack some of the hardening provided by Rust. In that
case the compiled WebAssembly code could provide convenient instruction sequences for attacks that
subvert normal control flow and that originate from the embedder's code, even if Cranelift and
Wasmtime themselves are free of defects.

[cve]: https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-hpqh-2wqx-7qp5

# Proposal
[proposal]: #proposal

Currently this proposal focuses on the AArch64 execution environment.

## Background

The Pointer Authentication (PAuth) extension to the Arm architecture protects function returns, i.e.
it provides back-edge CFI. It is described in section D5.1.5 of
[the Arm Architecture Reference Manual][arm-arm]. Some of the PAuth operations act as `NOP`
instructions when executed by a processor that does not support the extension.

The Branch Target Identification (BTI) extension protects other kinds of indirect branches, i.e. it
provides forward-edge CFI. It is described in section D5.4.4 of the same manual. A processor
implementation with BTI would support PAuth as well, but not necessarily vice versa. Whether BTI
applies to an executable memory page or not is controlled by a dedicated page attribute. Note that
the `BTI` "landing pad" for indirect branches acts as a `NOP` instruction when the extension is not
active (e.g. for processors that do not support BTI).

Both extensions are applicable only to the AArch64 execution state and are optional, so each CFI
technique would be employed only if the target environment provides the necessary ISA support.
Wasmtime embedders need to consider a subtlety: if they cache the result of the support check, the
cached value may be located in memory that is potentially accessible to an attacker, who could then
disable the use of PAuth and BTI in subsequent code generation. Mitigating this issue is outside the
scope of this proposal.

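As an illustration of the support check, here is a minimal sketch that decodes the feature bits
Linux reports for AArch64 processes in the `AT_HWCAP` and `AT_HWCAP2` auxiliary vector entries. The
type `CfiSupport` and the function `cfi_support_from_hwcaps` are hypothetical names introduced for
this example, not part of Wasmtime or Cranelift. Obtaining the two values (e.g. via `getauxval`) is
left to the caller, and other operating systems expose the same information differently.

```rust
/// Linux AArch64 hwcap bit assignments (from the kernel's uapi `hwcap.h`).
const HWCAP_PACA: u64 = 1 << 30; // address authentication, needed for PACIASP/RETAA
const HWCAP2_BTI: u64 = 1 << 17; // Branch Target Identification

/// Hypothetical summary of which CFI techniques the target supports.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CfiSupport {
    pub pauth: bool,
    pub bti: bool,
}

/// Decodes the two auxiliary vector values into a `CfiSupport`.
/// Assumption: `hwcap` and `hwcap2` were read from `AT_HWCAP` and `AT_HWCAP2`.
pub fn cfi_support_from_hwcaps(hwcap: u64, hwcap2: u64) -> CfiSupport {
    CfiSupport {
        pauth: (hwcap & HWCAP_PACA) != 0,
        bti: (hwcap2 & HWCAP2_BTI) != 0,
    }
}
```

Recomputing the result when needed, rather than caching it in attacker-reachable memory, sidesteps
the subtlety described above at the cost of repeating a cheap check.
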
The article [*Code reuse attacks: The compiler story*][code-reuse-attacks] provides an introduction
to the technologies.

In the Intel® 64 architecture [the Control-Flow Enforcement Technology (CET)][intel-cet] provides
similar capabilities.

[arm-arm]: https://developer.arm.com/documentation/ddi0487/gb/?lang=en
[code-reuse-attacks]: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/code-reuse-attacks-the-compiler-story
[intel-cet]: https://www.intel.com/content/www/us/en/developer/articles/technical/technical-look-control-flow-enforcement-technology.html

## Improved back-edge CFI with PAuth

The proposed implementation will add the `PACIASP` instruction to the beginning of every function
compiled by Cranelift and will replace the final return with the `RETAA` instruction.

In environments that use the DWARF format for unwinding, the implementation will be modified to
apply the `DW_CFA_AARCH64_negate_ra_state` operation immediately after the `PACIASP` instruction.

These steps can be skipped for simple leaf functions that do not construct frame records on the
stack.

As a concrete example, consider the following function:

```plain
function %f() {
    fn0 = %g()

block0:
    call fn0()
    return
}
```

Without the proposal it would result in the generation of:

```plain
  stp fp, lr, [sp, #-16]!
  mov fp, sp
  ldr x0, 1f
  b 2f
1:
  .byte 0x00, 0x00, 0x00, 0x00
  .byte 0x00, 0x00, 0x00, 0x00
2:
  blr x0
  ldp fp, lr, [sp], #16
  ret
```

And with the proposal:

```plain
  paciasp
  stp fp, lr, [sp, #-16]!
  mov fp, sp
  ldr x0, 1f
  b 2f
1:
  .byte 0x00, 0x00, 0x00, 0x00
  .byte 0x00, 0x00, 0x00, 0x00
2:
  blr x0
  ldp fp, lr, [sp], #16
  retaa
```

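The transformation can be modelled as a toy rewrite over a textual instruction list. This is only a
sketch: `apply_pac_ret` and its `sets_up_frame` parameter are hypothetical names for this example,
and the real change would be made inside the AArch64 backend's prologue and epilogue generation
rather than on strings.

```rust
/// Toy model of the proposed back-edge hardening.
///
/// `insts` is the unhardened function body, one instruction per entry.
/// `sets_up_frame` is false for simple leaf functions that do not construct
/// a frame record, in which case the body is returned unchanged.
/// Simplification: every `ret` is rewritten, not only the final one.
pub fn apply_pac_ret(insts: &[&str], sets_up_frame: bool) -> Vec<String> {
    if !sets_up_frame {
        return insts.iter().map(|s| s.to_string()).collect();
    }
    let mut out = Vec::with_capacity(insts.len() + 1);
    // Sign the return address in LR, using SP as the modifier.
    out.push("paciasp".to_string());
    for inst in insts {
        if *inst == "ret" {
            // Authenticate LR and return in a single instruction.
            out.push("retaa".to_string());
        } else {
            out.push(inst.to_string());
        }
    }
    out
}
```

Applied to the unhardened sequence above with `sets_up_frame` set to `true`, the sketch yields a
list that starts with `paciasp` and ends with `retaa`, matching the hardened listing.
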
## Enhanced forward-edge CFI with BTI

The proposed implementation will add the `BTI j` instruction to the beginning of every basic block
that is the target of an indirect branch and that is not a function prologue. Note that in the
AArch64 backend generated function calls always target function prologues, and indirect branches
that do not act like function calls appear only in the implementation of the `br_table` IR
operation. Function prologues would be covered by the pointer authentication instructions, which
also act as landing pads; as discussed before, BTI support implies PAuth.

During development one simple way to create a working prototype is to add the landing pads to the
beginning of every basic block, irrespective of whether it is the target of an indirect branch or
not. In this way it can be checked if BTI causes any issue with the rest of the runtime.

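The placement rule can be summarised with a small predicate. This is a sketch under stated
assumptions: the `Block` type and the `needs_bti_j` function are hypothetical and do not correspond
to Cranelift data structures, and function entry blocks are assumed to be covered by `PACIASP` in
the prologue, as described above.

```rust
/// Hypothetical per-block facts that a backend would need to track.
pub struct Block {
    /// The block begins with the function prologue.
    pub is_function_entry: bool,
    /// The block can be reached by an indirect branch, e.g. from `br_table`.
    pub is_indirect_branch_target: bool,
}

/// Decides whether a `bti j` landing pad should start the block.
///
/// `pad_every_block` models the prototype mode in which every basic block
/// gets a landing pad, whether or not an indirect branch targets it.
pub fn needs_bti_j(block: &Block, pad_every_block: bool) -> bool {
    if block.is_function_entry {
        // Assumption: the prologue's `paciasp` already acts as a landing pad.
        return false;
    }
    pad_every_block || block.is_indirect_branch_target
}
```
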
## CFI improvements to code that is not compiled by Cranelift

Currently the code that is not compiled by Cranelift is written in assembly, C, C++, or Rust.

Improving CFI for compiled C, C++, and Rust code with the same technologies is outside the scope of
this proposal, but in general it should be achievable by passing the appropriate parameters to the
respective compiler (e.g. the `-mbranch-protection` option of GCC and Clang).

Functions implemented in assembly will get a treatment similar to that of generated code, i.e. they
will start with the `PACIASP` instruction. However, the regular return will be preserved and will be
preceded by the `AUTIASP` instruction instead of being replaced by `RETAA`. The reason is that both
`AUTIASP` and `PACIASP` act as `NOP` instructions when executed by a processor that does not support
PAuth, thus making the assembly code generic.

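The `NOP` behaviour follows from the instruction encodings: `PACIASP`, `AUTIASP`, and `BTI` are
allocated in the A64 `HINT` space, so processors without the extensions execute them as `NOP`,
whereas `RETAA` has a separate encoding and is undefined without PAuth. The sketch below derives the
encodings from the `HINT` immediates; the helper names `hint` and `is_hint` are introduced only for
this example.

```rust
/// Encoding of `NOP`, which is `HINT #0`.
pub const NOP: u32 = 0xd503_201f;

/// Encodes `HINT #imm`; the 7-bit immediate occupies bits 11:5.
pub const fn hint(imm: u32) -> u32 {
    NOP | ((imm & 0x7f) << 5)
}

pub const PACIASP: u32 = hint(25);
pub const AUTIASP: u32 = hint(29);
pub const BTI_J: u32 = hint(36);
/// `RETAA` is not in the hint space, so it has no `NOP` fallback.
pub const RETAA: u32 = 0xd65f_0bff;

/// Returns true if `word` is some `HINT` instruction, i.e. it equals `NOP`
/// once the immediate field is cleared.
pub fn is_hint(word: u32) -> bool {
    (word & !(0x7f_u32 << 5)) == NOP
}
```
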
One potential problem in the interaction between code that is compiled by Cranelift and code that is
not is that only one side might have the CFI enhancements. However, this proposal does not have any
ABI implications, so Rust code in the Wasmtime implementation that does not use PAuth and BTI, for
example, would be able to call functions compiled by Cranelift without any issues and vice versa.
The reason is that it is the responsibility of the callee to ensure that PAuth is used correctly,
while everything is transparent to the caller. As for BTI, if an executable memory page does not
have the respective attribute set, then the extension does not have any effect on code in that page
other than introducing extra `NOP` instructions, irrespective of how the code has been reached
(e.g. via a branch from a page with BTI protections enabled); the same holds for branches out of the
unprotected page. The major exception that is relevant to Wasmtime is unwinding, but there should be
no issues as long as the above-mentioned DWARF operation is used and the system unwinder is recent
enough to understand it.

Future work that is beyond what this proposal presents may introduce further hardening that
necessitates ABI changes, e.g. by being based on
[the proposed PAuth ABI extension to ELF][pauth-abi] or something similar.

[pauth-abi]: https://github.com/ARM-software/abi-aa/blob/2021Q3/pauthabielf64/pauthabielf64.rst

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

Since the existing implementation already uses the standard back-edge CFI techniques that are
preferred in the absence of special hardware support (i.e. a separate protected stack that is not
used for buffers that could be accessed out of bounds), the alternative is not to implement the
proposal, so the rationale is based mainly on the overhead being insignificant. In terms of code
size the impact of the back-edge CFI improvements is one additional instruction per function, or two
for functions implemented in assembly.

The [Clang CFI design][clang-cfi-design] provides an idea for an alternative implementation of the
forward-edge CFI mechanism that is enabled by BTI. It involves instrumenting every indirect branch
to check if its destination is permitted. While the overhead of this approach can be reduced by
using efficient data structures for the destination address lookup and optionally limiting the
checks only to indirect function calls, it is still significantly larger than the worst-case BTI
overhead of one instruction per basic block. On the other hand, it does not require any special
hardware support, so it could be applied to all supported platforms.

[clang-cfi-design]: https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html

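To give a feel for the software alternative, the sketch below checks a destination against a bit
set of permitted addresses before an indirect branch would be taken. It is loosely inspired by the
bit vector idea in the Clang design document and is not a description of Clang's implementation;
`TargetBitset` and all of its methods are hypothetical names for this example.

```rust
/// Hypothetical set of permitted branch destinations inside one code region.
pub struct TargetBitset {
    base: usize,    // start address of the code region
    align: usize,   // alignment of valid destinations, a power of two
    slots: usize,   // number of aligned addresses in the region
    bits: Vec<u64>, // one bit per aligned address
}

impl TargetBitset {
    pub fn new(base: usize, align: usize, len_bytes: usize) -> Self {
        assert!(align.is_power_of_two());
        let slots = len_bytes / align;
        Self { base, align, slots, bits: vec![0; (slots + 63) / 64] }
    }

    /// Maps an address to its bit index, or `None` if it lies outside the
    /// region or is misaligned.
    fn slot(&self, addr: usize) -> Option<usize> {
        let offset = addr.checked_sub(self.base)?;
        if offset % self.align != 0 {
            return None;
        }
        let index = offset / self.align;
        if index < self.slots { Some(index) } else { None }
    }

    /// Marks `addr` as a permitted destination; invalid addresses are ignored.
    pub fn allow(&mut self, addr: usize) {
        if let Some(i) = self.slot(addr) {
            self.bits[i / 64] |= 1u64 << (i % 64);
        }
    }

    /// The check that instrumented code would run before each indirect branch.
    pub fn is_allowed(&self, addr: usize) -> bool {
        match self.slot(addr) {
            Some(i) => (self.bits[i / 64] & (1u64 << (i % 64))) != 0,
            None => false,
        }
    }
}
```

Even in this compact form the check costs a subtraction, a bounds test, a load, and a bit test per
indirect branch, which illustrates why a single `BTI` instruction per block compares favourably.
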
# Open questions
[open-questions]: #open-questions

- What is the performance overhead of the proposal?
- What hardening approaches are applicable to the fiber implementation? The fiber switching code
  saves the values of all callee-saved registers on the stack, i.e. memory that is potentially
  accessible to an attacker. Some of those values could be code addresses that would be used by
  indirect branches, so should we devise a scheme to authenticate them? While the regular pointer
  authentication instructions assume that they are operating on valid virtual addresses (which
  implies that the most significant bits are redundant and could be repurposed), PAuth provides
  operations to authenticate arbitrary data, which could be used in this case.
- Should we generate the operations that act as `NOP` instructions unconditionally instead (while
  still choosing the shorter alternative sequences if the target supports them)? That would
  especially help the ahead-of-time compilation use case, and could arguably reduce the amount of
  testing, i.e. no need to check both with and without CFI enhancements.
