# Summary
[summary]: #summary

This RFC proposes to improve control flow integrity for compiled WebAssembly code by using two
technologies from the Arm instruction set architecture: Pointer Authentication and Branch Target
Identification.

# Motivation
[motivation]: #motivation

The [security model of WebAssembly](https://webassembly.org/docs/security/) ensures that Wasm
modules execute in a sandboxed environment isolated from the host runtime. One aspect of that model
is that it provides implicit control flow integrity (CFI): all function call targets must specify a
valid entry in the function index space, the call stack is protected so that it is not affected by
buffer overflows in the module heap, and so on. As a result, in some Wasm applications the runtime
is able to execute untrusted code safely. However, this places much of the burden of upholding these
security properties on the compiler.

On the other hand, a further aspect of the WebAssembly design is efficient execution (close to
native speed), which leads to a natural tendency towards sophisticated optimizing compilers.
Unfortunately, the additional complexity increases the risk of implementation bugs, in particular
bugs that compromise the security properties. For example, Cranelift has been affected by issues
such as [CVE-2021-32629][cve] that could make it possible to access the protected call stack or
memory that is private to the host runtime.

We are trying to tackle the challenge of ensuring compiler correctness with initiatives such as
expanding fuzzing and making it possible to apply formal verification to at least some parts of the
compilation process. However, it is also reasonable to consider a defense-in-depth strategy and to
evaluate mitigations for potential future issues.

Finally, Wasmtime can be used as a library and, in particular, embedded into an application written
in a language such as C or C++ that lacks some of the hardening provided by Rust. In that case the
compiled WebAssembly code could provide convenient instruction sequences (gadgets) for attacks that
originate in the embedder's code and subvert normal control flow, even if Cranelift and Wasmtime
themselves are free of defects.

[cve]: https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-hpqh-2wqx-7qp5

# Proposal
[proposal]: #proposal

This proposal currently focuses on the AArch64 execution environment.

## Background

The Pointer Authentication (PAuth) extension to the Arm architecture protects function returns, i.e.
it provides back-edge CFI. It is described in section D5.1.5 of
[the Arm Architecture Reference Manual][arm-arm]. Some of the PAuth operations act as `NOP`
instructions when executed by a processor that does not support the extension.

The Branch Target Identification (BTI) extension protects other kinds of indirect branches, i.e. it
provides forward-edge CFI; it is described in section D5.4.4. A processor implementation that
supports BTI also supports PAuth, but not necessarily vice versa. Whether BTI is enforced for an
executable memory page is controlled by a dedicated page attribute; pages with the attribute set are
called guarded pages. Note that the `BTI` "landing pad" for indirect branches acts as a `NOP`
instruction when the extension is not active (e.g. on processors that do not support BTI).

Both extensions apply only to the AArch64 execution state and are optional, so each CFI technique
would be employed only if the target environment provides the necessary ISA support. Wasmtime
embedders need to consider a subtlety: if they cache the result of the feature check, the cached
value may reside in memory that is potentially accessible to an attacker, who could then disable the
use of PAuth and BTI in subsequent code generation. Mitigating this issue is outside the scope of
this proposal.
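A minimal sketch of such a check in Rust follows. It assumes the embedder relies on the standard library's `std::arch::is_aarch64_feature_detected!` macro, where the `paca` feature corresponds to PAuth address authentication and `bti` to BTI. `CfiSupport` and `detect_cfi_support` are hypothetical names used for illustration, not Wasmtime API. The sketch deliberately says nothing about where a caller might cache the result.

```rust
/// Which of the two CFI extensions the host processor supports.
/// (Hypothetical type for illustration; not part of Wasmtime's API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CfiSupport {
    /// FEAT_PAuth address authentication (std feature name "paca").
    pub pauth: bool,
    /// FEAT_BTI (std feature name "bti").
    pub bti: bool,
}

/// Queries the host processor at run time.
#[cfg(target_arch = "aarch64")]
pub fn detect_cfi_support() -> CfiSupport {
    CfiSupport {
        pauth: std::arch::is_aarch64_feature_detected!("paca"),
        bti: std::arch::is_aarch64_feature_detected!("bti"),
    }
}

/// Both extensions exist only in the AArch64 execution state, so code
/// generated for other targets never uses them.
#[cfg(not(target_arch = "aarch64"))]
pub fn detect_cfi_support() -> CfiSupport {
    CfiSupport { pauth: false, bti: false }
}
```

A code generator would then emit the PAuth instructions only when `pauth` is true, and emit BTI landing pads and request guarded pages only when `bti` is true.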

The article [*Code reuse attacks: The compiler story*][code-reuse-attacks] provides an introduction
to both technologies.

In the Intel® 64 architecture, [the Control-Flow Enforcement Technology (CET)][intel-cet] provides
similar capabilities.

[arm-arm]: https://developer.arm.com/documentation/ddi0487/gb/?lang=en
[code-reuse-attacks]: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/code-reuse-attacks-the-compiler-story
[intel-cet]: https://www.intel.com/content/www/us/en/developer/articles/technical/technical-look-control-flow-enforcement-technology.html

## Improved back-edge CFI with PAuth

The proposed implementation will add the `PACIASP` instruction to the beginning of every function
compiled by Cranelift and will replace the function's return instruction with `RETAA`.

In environments that use the DWARF format for unwinding, the unwind information will additionally
contain the `DW_CFA_AARCH64_negate_ra_state` operation immediately after the `PACIASP` instruction.
This tells the unwinder that, from that point on, the return address is signed.

These steps can be skipped for simple leaf functions that do not construct frame records on the
stack.
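The prologue and epilogue changes described above can be sketched as a hypothetical helper that emits assembly text. This is not Cranelift's actual code generator: `FrameOptions` and `frame_code` are illustrative names. `.cfi_negate_ra_state` is the GNU assembler spelling of `DW_CFA_AARCH64_negate_ra_state`. The other `.cfi_*` directives that a real prologue needs are omitted.

```rust
/// Code generation options relevant to this sketch (hypothetical).
#[derive(Clone, Copy)]
pub struct FrameOptions {
    /// Sign the return address with PACIASP and return with RETAA.
    pub sign_return_address: bool,
    /// Emit DWARF call frame information as assembler `.cfi_*` directives.
    pub emit_dwarf_cfi: bool,
}

/// Returns the prologue and the epilogue (ending with the return) of a
/// function as lines of assembly text.
pub fn frame_code(opts: FrameOptions, builds_frame_record: bool) -> (Vec<String>, Vec<String>) {
    let mut prologue = Vec::new();
    let mut epilogue = Vec::new();
    // Simple leaf functions keep LR in the register instead of storing it
    // in a frame record, so the signing steps are skipped for them.
    let sign = opts.sign_return_address && builds_frame_record;
    if sign {
        // Sign LR, using SP as the modifier.
        prologue.push("paciasp".to_string());
        if opts.emit_dwarf_cfi {
            // DW_CFA_AARCH64_negate_ra_state: from here on the unwinder
            // must treat the return address as signed.
            prologue.push(".cfi_negate_ra_state".to_string());
        }
    }
    if builds_frame_record {
        prologue.push("stp fp, lr, [sp, #-16]!".to_string());
        prologue.push("mov fp, sp".to_string());
        epilogue.push("ldp fp, lr, [sp], #16".to_string());
    }
    // RETAA authenticates LR against the same SP value and then returns.
    let ret = if sign { "retaa" } else { "ret" };
    epilogue.push(ret.to_string());
    (prologue, epilogue)
}
```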

As a concrete example, consider the following function:

```plain
function %f() {
    fn0 = %g()

block0:
    call fn0()
    return
}
```

Without the proposal, Cranelift would generate the following code (the eight zero bytes are a
placeholder for the address of `%g`, which is filled in by a relocation):

```plain
  stp fp, lr, [sp, #-16]!
  mov fp, sp
  ldr x0, 1f
  b 2f
1:
  .byte 0x00, 0x00, 0x00, 0x00
  .byte 0x00, 0x00, 0x00, 0x00
2:
  blr x0
  ldp fp, lr, [sp], #16
  ret
```

With the proposal, the code becomes:

```plain
  paciasp
  stp fp, lr, [sp, #-16]!
  mov fp, sp
  ldr x0, 1f
  b 2f
1:
  .byte 0x00, 0x00, 0x00, 0x00
  .byte 0x00, 0x00, 0x00, 0x00
2:
  blr x0
  ldp fp, lr, [sp], #16
  retaa
```
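The following toy software model of return-address signing shows what the `paciasp`/`retaa` pair protects against. It is a sketch under several simplifying assumptions:

- `key` and the `toy_mac` mixing function stand in for the secret per-process key and the cryptographic algorithm used by real hardware.
- The address is assumed to occupy the low 48 bits, so the PAC fits in the 16 bits above it. The real PAC width depends on the configured virtual address size.
- An authentication failure is modeled as `None` rather than a fault.

`sign_lr` and `auth_lr` are hypothetical names. The `sp` argument plays the role of the modifier, just as `PACIASP` and `RETAA` use the stack pointer.

```rust
/// In this toy model the address occupies the low 48 bits and the PAC is
/// stored in the 16 otherwise unused bits above it.
const VA_BITS: u32 = 48;
const VA_MASK: u64 = (1u64 << VA_BITS) - 1;

/// Toy keyed mixing function standing in for the real PAC algorithm.
/// NOT cryptographically secure; for illustration only.
fn toy_mac(key: u64, value: u64, modifier: u64) -> u64 {
    let mut x = value ^ key.rotate_left(17) ^ modifier.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Models PACIASP: inserts a PAC computed from LR and the modifier SP.
pub fn sign_lr(key: u64, lr: u64, sp: u64) -> u64 {
    let addr = lr & VA_MASK;
    // Keep the top 16 bits of the hash as the PAC. The toy model never uses
    // an all-zero PAC, so a raw (unsigned) address can never authenticate;
    // real hardware gives no such guarantee, only a low forgery probability.
    let mut pac = toy_mac(key, addr, sp) >> VA_BITS;
    if pac == 0 {
        pac = 1;
    }
    addr | (pac << VA_BITS)
}

/// Models the check in RETAA/AUTIASP: returns the plain address if the PAC
/// matches, or `None` where the hardware would end up faulting.
pub fn auth_lr(key: u64, signed_lr: u64, sp: u64) -> Option<u64> {
    let addr = signed_lr & VA_MASK;
    if sign_lr(key, addr, sp) == signed_lr {
        Some(addr)
    } else {
        None
    }
}
```

Suppose an attacker overwrites the saved `lr` in the frame record with the address of a gadget. The upper bits then do not carry a matching PAC, so the final check fails instead of redirecting control flow. Binding the signature to `sp` also makes a signed address captured in one frame unlikely to verify in another.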

## Enhanced forward-edge CFI with BTI

The proposed implementation will add the `BTI j` instruction to the beginning of every basic block
that is the target of an indirect branch and that is not a function prologue. Note that in the
AArch64 backend, generated function calls always target function prologues, and indirect branches
that do not act as function calls appear only in the lowering of the `br_table` IR operation.
Function prologues are covered by the `PACIASP` instruction, which also acts as a landing pad for
indirect calls; as discussed before, BTI support implies PAuth.
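The landing-pad rules that this placement relies on can be summarized by a simplified model of the check that the processor performs on a guarded page. The sketch makes the following assumptions:

- `br_table` is lowered to an indirect `BR` through a register other than X16 or X17, and calls use `BLR`.
- `PACIASP` accepts indirect calls only.

It ignores the extra branch type produced by `BR X16`/`BR X17`. It also ignores the system-register control over whether `PACIASP` accepts indirect jumps. `Arrival`, `FirstInsn`, and `landing_ok` are hypothetical names.

```rust
/// How control reached an instruction (simplified model of the
/// architectural branch type, PSTATE.BTYPE).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Arrival {
    /// Sequential execution, a direct branch (B, BL, B.cond) or a return.
    DirectOrSequential,
    /// An indirect jump such as the `BR` assumed for `br_table`.
    IndirectJump,
    /// An indirect call such as `BLR`.
    IndirectCall,
}

/// The instruction found at the branch target.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FirstInsn {
    BtiC,
    BtiJ,
    BtiJc,
    Paciasp,
    /// Any other instruction, e.g. `stp` or `add`.
    Other,
}

/// Returns false where the processor would raise a Branch Target exception.
pub fn landing_ok(arrival: Arrival, first: FirstInsn, guarded_page: bool) -> bool {
    if !guarded_page {
        // BTI checks only apply to pages with the guarded page attribute.
        return true;
    }
    match arrival {
        Arrival::DirectOrSequential => true,
        Arrival::IndirectJump => matches!(first, FirstInsn::BtiJ | FirstInsn::BtiJc),
        Arrival::IndirectCall => {
            matches!(first, FirstInsn::BtiC | FirstInsn::BtiJc | FirstInsn::Paciasp)
        }
    }
}
```

In this model, a `br_table` target that starts with `BTI j` and a Cranelift function that starts with `PACIASP` are both valid. An indirect jump redirected into the middle of a block, or an indirect call that lands on `BTI j`, is rejected.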

During development, a simple way to create a working prototype is to add landing pads to the
beginning of every basic block, whether or not it is the target of an indirect branch. This makes it
possible to check whether BTI causes any issues with the rest of the runtime.
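Both placement strategies can be sketched over a tiny hypothetical IR. `Terminator`, `Placement`, and `blocks_needing_bti_j` are illustrative names, not Cranelift's API. Block 0 is the entry block, whose prologue already starts with `PACIASP`. The sketch conservatively marks the `br_table` default block too. Whether that block is reached through the jump table or by a direct branch after the bounds check depends on the lowering, and an unnecessary `BTI j` only costs one `NOP`-like instruction.

```rust
use std::collections::BTreeSet;

/// Block terminators in a tiny hypothetical IR (not Cranelift's API).
pub enum Terminator {
    Jump(usize),
    Brif { then_block: usize, else_block: usize },
    /// Assumed to be lowered to a bounds check plus an indirect `BR`.
    BrTable { default: usize, targets: Vec<usize> },
    Return,
}

pub enum Placement {
    /// Prototype: a landing pad at the start of every non-entry block.
    EveryBlock,
    /// Proposed design: only blocks that a `br_table` can jump to.
    IndirectTargetsOnly,
}

/// Returns the indices of the blocks that should start with `BTI j`.
/// Block 0 is the entry block; its prologue starts with PACIASP instead.
pub fn blocks_needing_bti_j(blocks: &[Terminator], placement: Placement) -> BTreeSet<usize> {
    let mut out = BTreeSet::new();
    match placement {
        Placement::EveryBlock => out.extend(1..blocks.len()),
        Placement::IndirectTargetsOnly => {
            for term in blocks {
                if let Terminator::BrTable { default, targets } = term {
                    // Conservatively include the default block as well.
                    out.insert(*default);
                    out.extend(targets.iter().copied());
                }
            }
        }
    }
    out
}
```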

## CFI improvements to code that is not compiled by Cranelift

Currently, the code that is not compiled by Cranelift is written in assembly, C, C++, or Rust.

Improving CFI for compiled C, C++, and Rust code with the same technologies is outside the scope of
this proposal, but in general it should be achievable by passing the appropriate options to the
respective compiler (for example, `-mbranch-protection=standard` for GCC and Clang).

Functions implemented in assembly will be treated similarly to generated code, i.e. they will start
with the `PACIASP` instruction. However, the regular return will be preserved and preceded by the
`AUTIASP` instruction, instead of being replaced by `RETAA`. The reason is that both `PACIASP` and
`AUTIASP` act as `NOP` instructions when executed by a processor that does not support PAuth
(unlike `RETAA`), so the same assembly code works on all processors.
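`PACIASP` and `AUTIASP` are safe to use unconditionally because they are encoded in the `HINT` space, and processors that lack the corresponding extension execute those encodings as `NOP`. `RETAA` is not in that space. The sketch below computes the encodings from the `HINT #imm` pattern: `PACIASP` is `HINT #25`, `AUTIASP` is `HINT #29`, and `BTI c`, `BTI j`, and `BTI jc` are `HINT #34`, `#36`, and `#38`. The constant and function names are chosen for illustration.

```rust
/// Encodes the AArch64 `HINT #imm` instruction (imm is the 7-bit CRm:op2 field).
pub const fn hint(imm: u32) -> u32 {
    0xd503_201f | ((imm & 0x7f) << 5)
}

pub const NOP: u32 = hint(0);
pub const PACIASP: u32 = hint(25);
pub const AUTIASP: u32 = hint(29);
pub const BTI_C: u32 = hint(34);
pub const BTI_J: u32 = hint(36);
pub const BTI_JC: u32 = hint(38);
/// RETAA and RET are ordinary branch encodings outside the HINT space.
pub const RETAA: u32 = 0xd65f_0bff;
pub const RET: u32 = 0xd65f_03c0;

/// True if `insn` lies in the HINT space; hints that a processor does not
/// implement execute as NOP.
pub fn is_hint(insn: u32) -> bool {
    (insn & 0xffff_f01f) == 0xd503_201f
}
```

An assembly function of the form `paciasp; ...; autiasp; ret` therefore runs unchanged on older processors. `retaa`, by contrast, would be an undefined instruction there.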

One potential problem in the interaction between code that is compiled by Cranelift and code that
is not is that only one side might have the CFI enhancements. However, this proposal does not have
any ABI implications. For example, Rust code in the Wasmtime implementation that does not use PAuth
and BTI can call functions compiled by Cranelift without any issues, and vice versa. For PAuth, the
callee is responsible for signing and authenticating its own return address, so everything is
transparent to the caller. For BTI, if an executable memory page does not have the respective
attribute set, the extension has no effect on that page other than the extra `NOP` instructions.
This holds irrespective of how the code has been reached (e.g. via a branch from a page with BTI
protection enabled), and the same applies to branches out of such an unprotected page. The major
exception that is relevant to Wasmtime is unwinding, but there should be no issues as long as the
DWARF operation mentioned above is used and the system unwinder is recent enough to support it.

Future work beyond this proposal may introduce further hardening that requires ABI changes, e.g.
based on [the proposed PAuth ABI extension to ELF][pauth-abi] or something similar.

[pauth-abi]: https://github.com/ARM-software/abi-aa/blob/2021Q3/pauthabielf64/pauthabielf64.rst

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

The existing implementation already uses the standard back-edge CFI technique that is preferred in
the absence of special hardware support, i.e. a separate protected stack that is not used for
buffers that could be accessed out of bounds. The only real alternative is therefore not to
implement the proposal, and the rationale rests mainly on the overhead being expected to be
insignificant. In terms of code size, the back-edge CFI improvements add one instruction per
function, or two for functions implemented in assembly.
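As a rough worked example of the code-size arithmetic, here is a hypothetical helper. It counts 4 bytes per AArch64 instruction and also includes the BTI landing pads. The function name and inputs are illustrative. It estimates code size only and says nothing about the performance overhead, which remains an open question.

```rust
/// Size of every AArch64 instruction in bytes.
const INSN_BYTES: usize = 4;

/// Rough code-size overhead estimate in bytes (hypothetical helper).
pub fn cfi_code_size_overhead(
    cranelift_funcs_with_frame: usize,
    asm_funcs: usize,
    bti_landing_pads: usize,
) -> usize {
    // Cranelift: PACIASP is added and RET becomes RETAA, so +1 instruction.
    // Assembly: PACIASP and AUTIASP are added and RET is kept, so +2.
    // BTI: one `BTI j` per block that is an indirect branch target.
    INSN_BYTES * (cranelift_funcs_with_frame + 2 * asm_funcs + bti_landing_pads)
}
```

For instance, 1000 Cranelift functions with frame records, 10 assembly functions, and 50 `br_table` target blocks would add 4 * (1000 + 2 * 10 + 50) = 4280 bytes.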

The [Clang CFI design][clang-cfi-design] suggests an alternative implementation of the forward-edge
CFI mechanism that BTI enables: instrumenting every indirect branch to check whether its destination
is permitted. The overhead of this approach can be reduced by using efficient data structures for the
destination address lookup and, optionally, by limiting the checks to indirect function calls. Even
so, it is still significantly larger than the worst-case BTI overhead of one instruction per basic
block. On the other hand, it does not require any special hardware support, so it could be applied
to all supported platforms.
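A minimal sketch of the software-only alternative follows. It assumes a single set of permitted targets, kept in a sorted vector and checked by binary search before each instrumented indirect branch. This is a deliberate simplification, not Clang's actual scheme: Clang groups targets by type and lays them out in jump tables or bit vectors, so most checks reduce to a few arithmetic instructions. `AllowedTargets` and `guarded_call` are hypothetical names. Even this simple version executes several comparisons per branch, whereas a BTI landing pad is a single instruction at the target.

```rust
/// Permitted indirect-branch targets, kept sorted for binary search.
/// (Hypothetical sketch; not Clang's actual data structure.)
pub struct AllowedTargets {
    sorted: Vec<u64>,
}

impl AllowedTargets {
    pub fn new(mut targets: Vec<u64>) -> Self {
        targets.sort_unstable();
        targets.dedup();
        AllowedTargets { sorted: targets }
    }

    /// The check that instrumentation would insert before each indirect branch.
    pub fn check(&self, target: u64) -> bool {
        self.sorted.binary_search(&target).is_ok()
    }
}

/// An instrumented indirect call: performs the check and only then invokes
/// `call`; `None` stands for the abort that a real implementation would trigger.
pub fn guarded_call<R>(
    allowed: &AllowedTargets,
    target: u64,
    call: impl FnOnce(u64) -> R,
) -> Option<R> {
    if allowed.check(target) {
        Some(call(target))
    } else {
        None
    }
}
```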

[clang-cfi-design]: https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html

# Open questions
[open-questions]: #open-questions

- What is the performance overhead of the proposal?
- What hardening approaches are applicable to the fiber implementation? The fiber switching code
saves the values of all callee-saved registers on the stack, i.e. in memory that is potentially
accessible to an attacker. Some of those values could be code addresses that are later used by
indirect branches, so should we devise a scheme to authenticate them? The regular pointer
authentication instructions assume that they operate on valid virtual addresses, whose most
significant bits are redundant and can therefore be repurposed. However, PAuth also provides an
operation that computes an authentication code for arbitrary data (`PACGA`), which could be used in
this case.
- Should we instead generate the operations that act as `NOP` instructions unconditionally (while
still choosing the shorter alternative sequences if the target supports them)? That would
especially help the ahead-of-time compilation use case, and could arguably reduce the amount of
testing, since there would be no need to test both with and without the CFI enhancements.
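A toy sketch of the fiber idea follows, assuming a hypothetical save/restore pair in the fiber switching code:

- The spilled callee-saved registers are chained through a keyed MAC, which stands in for repeated `PACGA` invocations.
- The stack pointer serves as the initial modifier.
- The resulting tag is checked before the values are reloaded.

The mixing function is not cryptographic. Its 64-bit tag is also wider than the 32-bit code that `PACGA` produces. `SavedContext`, `save_context`, and `restore_context` are illustrative names.

```rust
/// Toy keyed mixing function standing in for PACGA. NOT cryptographically
/// secure, and it produces 64 bits, whereas PACGA produces a 32-bit code.
fn toy_mac(key: u64, value: u64, modifier: u64) -> u64 {
    let mut x = value ^ key ^ modifier.rotate_left(29);
    x = (x ^ (x >> 33)).wrapping_mul(0xff51_afd7_ed55_8ccd);
    x = (x ^ (x >> 33)).wrapping_mul(0xc4ce_b9fe_1a85_ec53);
    x ^ (x >> 33)
}

/// Callee-saved registers spilled by a hypothetical fiber switch, plus a tag.
pub struct SavedContext {
    pub regs: Vec<u64>,
    pub tag: u64,
}

/// Chains the MAC over all registers, starting from the stack pointer so
/// that the saved context is bound to its location.
fn context_tag(key: u64, regs: &[u64], sp: u64) -> u64 {
    regs.iter().fold(sp, |acc, &reg| toy_mac(key, reg, acc))
}

pub fn save_context(key: u64, regs: &[u64], sp: u64) -> SavedContext {
    SavedContext { regs: regs.to_vec(), tag: context_tag(key, regs, sp) }
}

/// Returns the registers only if the tag still matches; `None` means tampering.
pub fn restore_context(key: u64, ctx: &SavedContext, sp: u64) -> Option<&[u64]> {
    if context_tag(key, &ctx.regs, sp) == ctx.tag {
        Some(ctx.regs.as_slice())
    } else {
        None
    }
}
```

Because each mixing step is invertible in both its value and its modifier, changing any single saved register, or restoring at a different stack pointer, always changes the tag in this toy model. A real 32-bit code would only make such forgeries unlikely.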