RFC: CFI Improvements with PAuth and BTI

akirilov-arm · akirilov-arm · commit 0d5eca378e93 · 2021-11-26T14:35:50.000Z
Improve control flow integrity for compiled WebAssembly code
by utilizing two technologies from the Arm instruction set
architecture - Pointer Authentication and Branch Target
Identification.

Copyright (c) 2021, Arm Limited.
diff --git a/accepted/cfi-improvements-with-pauth-and-bti.md b/accepted/cfi-improvements-with-pauth-and-bti.md
@@ -0,0 +1,210 @@
+# Summary
+[summary]: #summary
+
+This RFC proposes to improve control flow integrity for compiled WebAssembly code by utilizing two
+technologies from the Arm instruction set architecture: Pointer Authentication and Branch Target
+Identification.
+
+# Motivation
+[motivation]: #motivation
+
+The [security model of WebAssembly](https://webassembly.org/docs/security/) ensures that Wasm
+modules execute in a sandboxed environment isolated from the host runtime. One aspect of that model
+is that it provides implicit control flow integrity (CFI) by forcing all function call targets to
+specify a valid entry in the function index space, by using a protected call stack that is not
+affected by buffer overflows in the module heap, and so on. As a result, in some Wasm applications
+the runtime is able to execute untrusted code safely. However, that places much of the burden of
+upholding these security properties on the compiler.
+
+On the other hand, a further aspect of the WebAssembly design is efficient execution (close to
+native speed), which leads to a natural tendency towards sophisticated optimizing compilers.
+Unfortunately, the additional complexity increases the risk of implementation problems and in
+particular compromises of the security properties. For example, Cranelift has been affected by
+issues such as [CVE-2021-32629][cve] that could make it possible to access the protected call stack
+or memory that is private to the host runtime.
+
+We are trying to tackle the challenge of ensuring compiler correctness with initiatives such as
+expanding fuzzing and making it possible to apply formal verification to at least some parts of the
+compilation process. However, it is also reasonable to consider a defense in depth strategy and to
+evaluate mitigations for potential future issues.
+
+Finally, Wasmtime can be used as a library and in particular embedded into an application that is
+implemented in languages such as C and C++ that lack some of the hardening provided by Rust. In
+that case the compiled WebAssembly code could provide convenient instruction sequences for attacks
+that subvert normal control flow and that originate from the embedder's code, even if Cranelift and
+Wasmtime themselves lack any defects.
+
+[cve]: https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-hpqh-2wqx-7qp5
+
+# Proposal
+[proposal]: #proposal
+
+Currently this proposal focuses on the AArch64 execution environment.
+
+## Background
+
+The Pointer Authentication (PAuth) extension to the Arm architecture protects function returns, i.e.
+provides back-edge CFI. It is described in section D5.1.5 of
+[the Arm Architecture Reference Manual][arm-arm]. Some of the PAuth operations act as `NOP`
+instructions when executed by a processor that does not support the extension.
+
+The Branch Target Identification (BTI) extension protects other kinds of indirect branches, i.e. it
+provides forward-edge CFI; it is described in section D5.4.4 of the same manual. A processor
+implementation with BTI would support PAuth as well, but not necessarily vice versa. Whether BTI
+applies to an executable memory page or not is controlled by a dedicated page attribute. Note that
+the `BTI` "landing pad" for indirect branches acts as a `NOP` instruction when the extension is
+not active (e.g. for processors that do not support BTI).
+
+Both extensions are applicable only to the AArch64 execution state and are optional, so each CFI
+technique would be employed only if the target environment provides the necessary ISA support.
+Wasmtime embedders need to consider a subtlety: if they cache the result of the check, the cached
+value may be located in memory that is accessible to an attacker, who could then disable the use of
+PAuth and BTI in subsequent code generation. Mitigating this issue is outside the scope of this
+proposal.
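+
+A minimal Rust sketch of how the ISA check might feed into code generation settings is shown here.
+It is an illustration under stated assumptions rather than Wasmtime's implementation: `CfiSettings`,
+`select_cfi` and `detect_cfi` are hypothetical names, and the sketch assumes that the standard
+library's `is_aarch64_feature_detected!` macro with the `"paca"` and `"bti"` feature names is an
+acceptable way to query the host.
+
+```rust
+/// Which CFI techniques the code generator should use (hypothetical type for this sketch).
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct CfiSettings {
+    pub use_pauth: bool,
+    pub use_bti: bool,
+}
+
+/// Enable each technique only if the target provides the corresponding ISA support.
+pub fn select_cfi(has_pauth: bool, has_bti: bool) -> CfiSettings {
+    CfiSettings {
+        use_pauth: has_pauth,
+        use_bti: has_bti,
+    }
+}
+
+/// Query the host on AArch64 (assumption: the standard library detection macro is sufficient).
+#[cfg(target_arch = "aarch64")]
+pub fn detect_cfi() -> CfiSettings {
+    select_cfi(
+        std::arch::is_aarch64_feature_detected!("paca"),
+        std::arch::is_aarch64_feature_detected!("bti"),
+    )
+}
+
+/// Neither extension exists outside the AArch64 execution state.
+#[cfg(not(target_arch = "aarch64"))]
+pub fn detect_cfi() -> CfiSettings {
+    select_cfi(false, false)
+}
+```
+
+An embedder that stores the returned value is in exactly the situation described above, so where
+that value lives matters as much as how it is computed.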
+
+The article [*Code reuse attacks: The compiler story*][code-reuse-attacks] provides an introduction
+to the technologies.
+
+In the Intel® 64 architecture [the Control-Flow Enforcement Technology (CET)][intel-cet] provides
+similar capabilities.
+
+[arm-arm]: https://developer.arm.com/documentation/ddi0487/gb/?lang=en
+[code-reuse-attacks]: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/code-reuse-attacks-the-compiler-story
+[intel-cet]: https://www.intel.com/content/www/us/en/developer/articles/technical/technical-look-control-flow-enforcement-technology.html
+
+## Improved back-edge CFI with PAuth
+
+The proposed implementation will add the `PACIASP` instruction to the beginning of every function
+compiled by Cranelift and will replace the final return with the `RETAA` instruction.
+
+In environments that use the DWARF format for unwinding, the implementation will be modified to
+apply the `DW_CFA_AARCH64_negate_ra_state` operation immediately after the `PACIASP` instruction.
+
+These steps can be skipped for simple leaf functions that do not construct frame records on the
+stack.
+
+As a concrete example, consider the following function:
+
+```plain
+function %f() {
+    fn0 = %g()
+
+block0:
+    call fn0()
+    return
+}
+```
+
+Without the proposal it would result in the generation of:
+
+```plain
+  stp fp, lr, [sp, #-16]!
+  mov fp, sp
+  ldr x0, 1f
+  b 2f
+1:
+  .byte 0x00, 0x00, 0x00, 0x00
+  .byte 0x00, 0x00, 0x00, 0x00
+2:
+  blr x0
+  ldp fp, lr, [sp], #16
+  ret
+```
+
+And with the proposal:
+
+```plain
+  paciasp
+  stp fp, lr, [sp, #-16]!
+  mov fp, sp
+  ldr x0, 1f
+  b 2f
+1:
+  .byte 0x00, 0x00, 0x00, 0x00
+  .byte 0x00, 0x00, 0x00, 0x00
+2:
+  blr x0
+  ldp fp, lr, [sp], #16
+  retaa
+```
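+
+The difference between the two listings can be modelled with a small Rust sketch. It is only an
+illustration, not Cranelift's actual code: `FunctionInfo`, `prologue` and `epilogue` are
+hypothetical names, and instructions are represented as plain strings.
+
+```rust
+/// Hypothetical summary of a function for this sketch.
+pub struct FunctionInfo {
+    /// True if the function constructs a frame record (saves FP and LR on the stack).
+    pub has_frame_record: bool,
+}
+
+/// Instructions at function entry. `PACIASP` signs LR before it is stored to the stack.
+pub fn prologue(func: &FunctionInfo, use_pauth: bool) -> Vec<&'static str> {
+    let mut insts = Vec::new();
+    if func.has_frame_record {
+        if use_pauth {
+            insts.push("paciasp");
+        }
+        insts.push("stp fp, lr, [sp, #-16]!");
+        insts.push("mov fp, sp");
+    }
+    insts
+}
+
+/// Instructions at function exit. `RETAA` authenticates LR and returns in one instruction.
+pub fn epilogue(func: &FunctionInfo, use_pauth: bool) -> Vec<&'static str> {
+    let mut insts = Vec::new();
+    if func.has_frame_record {
+        insts.push("ldp fp, lr, [sp], #16");
+        insts.push(if use_pauth { "retaa" } else { "ret" });
+    } else {
+        // Simple leaf function: LR never leaves the register, so no signing is needed.
+        insts.push("ret");
+    }
+    insts
+}
+```
+
+For a function with a frame record and PAuth enabled, `prologue` yields `paciasp` followed by the
+usual frame setup and `epilogue` ends with `retaa`; for a leaf function without a frame record the
+extra steps are skipped.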
+
+## Enhanced forward-edge CFI with BTI
+
+The proposed implementation will add the `BTI j` instruction to the beginning of every basic block
+that is the target of an indirect branch and that is not a function prologue. Note that in the
+AArch64 backend, generated function calls always target function prologues, and indirect branches
+that do not act like function calls appear only in the implementation of the `br_table` IR
+operation. Function prologues would be covered by the pointer authentication instructions, which
+also act as landing pads; as discussed before, BTI support implies PAuth.
+
+During development one simple way to create a working prototype is to add the landing pads to the
+beginning of every basic block, irrespective of whether it is the target of an indirect branch or
+not. In this way it can be checked if BTI causes any issue with the rest of the runtime.
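+
+The selection rule for landing pads can be sketched in Rust as follows. This is a simplified model
+and not Cranelift's IR: `Terminator` and `blocks_needing_bti_j` are hypothetical names, blocks are
+identified by index, and every target listed in a `BrTable` is assumed to be reached through an
+indirect branch.
+
+```rust
+use std::collections::BTreeSet;
+
+/// Hypothetical, simplified block terminators for this sketch.
+pub enum Terminator {
+    /// Direct branch to a single block; no landing pad required.
+    Jump(usize),
+    /// Indirect branch through a jump table, as generated for `br_table`.
+    BrTable(Vec<usize>),
+    /// Function return.
+    Return,
+}
+
+/// Return the blocks that should start with `bti j`. `terminators[i]` ends block `i`.
+pub fn blocks_needing_bti_j(terminators: &[Terminator], entry: usize) -> BTreeSet<usize> {
+    let mut pads = BTreeSet::new();
+    for terminator in terminators {
+        if let Terminator::BrTable(targets) = terminator {
+            pads.extend(targets.iter().copied());
+        }
+    }
+    // The function prologue is excluded: it is covered by the pointer authentication
+    // instruction, which also acts as a landing pad.
+    pads.remove(&entry);
+    pads
+}
+```
+
+The prototype approach described above corresponds to returning every block index except the entry
+block, which trades a few extra instructions for a simpler implementation.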
+
+## CFI improvements to code that is not compiled by Cranelift
+
+Currently the code that is not compiled by Cranelift is in assembly, C, C++, or Rust.
+
+Improving CFI for compiled C, C++, and Rust code with the same technologies is outside the scope of
+this proposal, but in general it should be achievable by passing the appropriate parameters to the
+respective compiler.
+
+Functions implemented in assembly will be treated similarly to generated code, i.e. they will
+start with the `PACIASP` instruction. However, the regular return will be preserved and will be
+preceded by the `AUTIASP` instruction instead of being replaced by `RETAA`. The reason is that both
+`AUTIASP` and `PACIASP` act as `NOP` instructions when executed by a processor that does not
+support PAuth, thus making the assembly code generic.
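+
+The reason these instructions degrade to `NOP` is that they are allocated in the A64 `HINT`
+encoding space, whereas `RETAA` is not. The Rust sketch below checks this on the instruction
+encodings; the helper names `hint` and `is_hint_space` are made up for illustration.
+
+```rust
+/// Encode the A64 `HINT #imm` instruction (7-bit immediate in bits 11:5).
+pub const fn hint(imm: u32) -> u32 {
+    0xD503_201F | ((imm & 0x7F) << 5)
+}
+
+pub const PACIASP: u32 = hint(25);
+pub const AUTIASP: u32 = hint(29);
+pub const BTI_J: u32 = hint(36);
+/// `ret` using the default link register x30.
+pub const RET: u32 = 0xD65F_03C0;
+/// Combined authenticate-and-return; not a hint instruction.
+pub const RETAA: u32 = 0xD65F_0BFF;
+
+/// True if `inst` lies in the `HINT` space, where unimplemented hints execute as `NOP`.
+pub const fn is_hint_space(inst: u32) -> bool {
+    (inst & 0xFFFF_F01F) == 0xD503_201F
+}
+```
+
+`is_hint_space` holds for `PACIASP`, `AUTIASP` and `BTI_J` but not for `RETAA`, which is why the
+generic assembly sequence uses `AUTIASP` followed by a regular `RET`.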
+
+One potential problem in the interaction between code that is compiled by Cranelift and code that is
+not is that only one side might have the CFI enhancements. However, this proposal does not have any
+ABI implications, so Rust code in the Wasmtime implementation that does not use PAuth and BTI, for
+example, would be able to call functions compiled by Cranelift without any issues and vice versa.
+The reason is that it is the responsibility of the callee to ensure that PAuth is used correctly,
+while everything is transparent to the caller. As for BTI, if an executable memory page does not
+have the respective attribute set, then the extension does not have any effect, except for
+introducing extra `NOP` instructions, irrespective of how the code has been reached (e.g. via a
+branch from a page with BTI protections enabled); similarly for branches out of the unprotected
+page. The major exception that is relevant to Wasmtime is unwinding, but there should be no issues
+as long as the above-mentioned DWARF operation is used and the system unwinder is recent.
+
+Future work that is beyond what this proposal presents may introduce further hardening that
+necessitates ABI changes, e.g. by being based on
+[the proposed PAuth ABI extension to ELF][pauth-abi] or something similar.
+
+[pauth-abi]: https://github.com/ARM-software/abi-aa/blob/2021Q3/pauthabielf64/pauthabielf64.rst
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+Since the existing implementation already uses the standard back-edge CFI techniques that are
+preferred in the absence of special hardware support (i.e. a separate protected stack that is not
+used for buffers that could be accessed out of bounds), the only alternative is not to implement
+the proposal, so the rationale is based mainly on the overhead being insignificant. In terms of
+code size the impact of the back-edge CFI improvements is one additional instruction per function,
+or two for functions implemented in assembly.
+
+The [Clang CFI design][clang-cfi-design] provides an idea for an alternative implementation of the
+forward-edge CFI mechanism that is enabled by BTI. It involves instrumenting every indirect branch
+to check if its destination is permitted. While the overhead of this approach can be reduced by
+using efficient data structures for the destination address lookup and optionally limiting the
+checks only to indirect function calls, it is still significantly larger than the worst-case BTI
+overhead of one instruction per basic block. On the other hand, it does not require any
+special hardware support, so it could be applied to all supported platforms.
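+
+To make the comparison concrete, the software-only alternative amounts to a check like the one in
+the following Rust sketch before every indirect branch. It is a deliberately simple stand-in: the
+names `AllowedTargets` and `checked_indirect_branch` are hypothetical, and a sorted list with
+binary search takes the place of whatever more efficient data structure a real implementation
+would use.
+
+```rust
+/// Hypothetical set of permitted indirect-branch destinations, kept sorted for binary search.
+pub struct AllowedTargets {
+    sorted: Vec<usize>,
+}
+
+impl AllowedTargets {
+    pub fn new(mut addresses: Vec<usize>) -> Self {
+        addresses.sort_unstable();
+        addresses.dedup();
+        Self { sorted: addresses }
+    }
+
+    /// Lookup cost is logarithmic in the number of permitted destinations.
+    pub fn permits(&self, destination: usize) -> bool {
+        self.sorted.binary_search(&destination).is_ok()
+    }
+}
+
+/// The check that instrumentation would run before each indirect branch.
+/// Returns the destination if it is permitted, or the rejected address otherwise.
+pub fn checked_indirect_branch(
+    allowed: &AllowedTargets,
+    destination: usize,
+) -> Result<usize, usize> {
+    if allowed.permits(destination) {
+        Ok(destination)
+    } else {
+        Err(destination)
+    }
+}
+```
+
+Even in this small form the check costs a lookup and a conditional branch per indirect branch,
+whereas BTI adds at most a single landing pad instruction at the destination.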
+
+[clang-cfi-design]: https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html
+
+# Open questions
+[open-questions]: #open-questions
+
+- What is the performance overhead of the proposal?
+- What hardening approaches are applicable to the fiber implementation? The fiber switching code
+saves the values of all callee-saved registers on the stack, i.e. memory that is potentially
+accessible to an attacker. Some of those values could be code addresses that would be used by
+indirect branches, so should we devise a scheme to authenticate them? While the regular pointer
+authentication instructions assume that they are operating on valid virtual addresses (which implies
+that the most significant bits are redundant and could be repurposed), PAuth provides operations to
+authenticate arbitrary data, which could be used in this case.
+- Should we generate the operations that act as `NOP` instructions unconditionally instead (while
+still choosing the shorter alternative sequences if the target supports them)? That would
+especially help the ahead of time compilation use case, and could arguably reduce the amount of
+testing, i.e. no need to check both with and without CFI enhancements.