Hot reloading: bytecode analysis

This is part of an ongoing series on hot reloading, a core feature of Sanable Engine. It is closely related to Unreal’s live coding and Visual Studio’s edit-and-continue, with one key difference: those systems only rewrite changed bytecode, whereas Sanable’s hot reload unloads and reloads an entire DLL while also fixing object layouts.

While writing about hot reloading and handling polymorphism by observing constructor behavior against a moving background, I noticed a number of common compiler behaviors that make it possible to detect vptrs with high accuracy by analyzing constructor bytecode and watching for one key instruction (and one older idiom). Although this requires per-architecture instruction definitions, it can be performed on any constructor without side effects, and it can predict function outcomes without even calling them.

After giving myself a two-month crash course in low-level computing, I can confidently say this technique works, but is quite tricky.

Sanable uses the Capstone disassembly engine for its incredible selection of supported architectures, but any disassembler should be capable of backing this method.

CPU basics

Computers have several layers of storage. The disk is the slowest but biggest (up to several terabytes), and memory is faster but smaller, typically measured in gigabytes. Inside the CPU, there are typically three levels of cache, which are smaller and faster than memory, each level faster than the last, ranging from megabytes to kilobytes. Finally, registers are blazingly fast, and might only add up to ~500 bytes of storage, less than 1/10th of the text of this article.

Caches aren’t meant to be directly read from user code, so we only care about the memory and registers that assembly lets us address. CPUs come equipped with a number of general-purpose registers for calculation, and several special-purpose registers, including ones that track the call stack and which instruction is currently being executed.

For our analysis purposes, we can consider there to be four types of values that can live in memory or registers:

  • Magic values are values that we want to watch but that have no determinable value. Each magic value has a unique ID and an offset. They are mostly used to hold the “this” pointer, and in this context a magic value represents a pointer to one of that object’s fields.
  • Unknown values come from reading unset memory or registers, performing nonsensical math (like adding two magics) or reinterprets, or when two execution branches disagree on a value. The only property we care about when tracking an unknown is how much space it occupies.
  • Known constants may be set by bytecode, passed from the outside as function arguments, or inferred from execution state (like the instruction pointer register).
  • Flag values are a bitfield where any bit may be 0, 1, or unknown. They should only be used in branching instructions, but may be treated as a known constant under certain circumstances. Constants can always be converted to flags.
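The four categories above could be modeled as a tagged union. What follows is a minimal sketch, not Sanable's actual representation: the type names, the field layout, and the helper functions `toFlags`, `addValues` and `fieldOffset` are all hypothetical, and only 64-bit addition is modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <variant>

// Hypothetical types; the article does not show Sanable's real definitions.
struct Magic    { uint32_t id; int64_t offset; };   // watched pointer, e.g. "this" + offset
struct Unknown  { uint8_t size; };                  // only the width in bytes is tracked
struct Constant { uint64_t value; };                // known number or address
struct Flags    { uint64_t bits; uint64_t known; }; // a bit is trusted only where 'known' is 1

using Value = std::variant<Magic, Unknown, Constant, Flags>;

// Constants can always be converted to flags: every bit is known.
inline Flags toFlags(Constant c) { return Flags{c.value, ~uint64_t{0}}; }

// Addition as an analysis pass might model it for 64-bit operands.
inline Value addValues(const Value& a, const Value& b) {
    const Magic* ma = std::get_if<Magic>(&a);
    const Magic* mb = std::get_if<Magic>(&b);
    const Constant* ca = std::get_if<Constant>(&a);
    const Constant* cb = std::get_if<Constant>(&b);
    if (ca && cb) return Constant{ca->value + cb->value};
    // Magic plus constant stays magic: same object, different field.
    if (ma && cb) return Magic{ma->id, ma->offset + static_cast<int64_t>(cb->value)};
    if (ca && mb) return Magic{mb->id, mb->offset + static_cast<int64_t>(ca->value)};
    // Magic plus magic, or anything involving unknowns or flags, is nonsensical here.
    return Unknown{8};
}

// Offset of a field pointer within its object; only meaningful for magic values.
inline int64_t fieldOffset(const Value& v) {
    assert(std::holds_alternative<Magic>(v));
    return std::get<Magic>(v).offset;
}
```

With this model, adding a constant 16 to a "this" pointer yields a magic value with the same ID and offset 16, while adding two magic values degrades to an 8-byte unknown.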

Object layout and vptr detection

In x86-64 position-independent code, vptrs are most frequently referred to in bytecode with an instruction called Load Effective Address (LEA), but depending on compiler and CPU architecture differences, an equivalent idiom with one or two Moves may be used instead. Both serve the same purpose: turning a position-independent reference within a DLL into an absolute address. To handle both cases, whenever we update our instruction pointer we tag its value with a flag that is only kept when the value is added to another known constant. We can filter for this flag later to get only vptrs and type info pointers.

lea rax,[rip+relVtbl]          ; Convert vtable address from position-independent form (an offset from the end of this instruction) to an absolute address, and store in RAX temporarily
mov QWORD PTR [rcx+0x???],rax  ; Move from RAX into its location in our object (the "this" pointer lives in RCX)

; Note: Most compilers place the vptr at offset 0 within the object, so the
; store usually targets [rcx] directly. LEA can only write to a register, so
; the single-instruction form is a Move that stores the vtable address as an
; immediate (absVtbl here stands for that already-absolute address):
mov QWORD PTR [rcx+0x???],absVtbl  ; Place the vtable address directly into its destination
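The instruction-pointer tag described above can be sketched in a few lines. This is an illustrative model rather than Sanable's implementation: `Const`, `readIp`, `addConst`, `mulConst`, `Store` and `vptrOffsets` are hypothetical names, and dropping the tag when both operands carry it is an assumption the article does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: a known constant plus a tag saying it was derived from RIP.
struct Const {
    uint64_t value;
    bool ipDerived;
};

// Reading the instruction pointer yields a tagged constant.
inline Const readIp(uint64_t nextInsnAddr) { return {nextInsnAddr, true}; }

// The tag survives addition only when exactly one operand carries it
// (tagged + tagged is treated as meaningless; that part is an assumption).
inline Const addConst(Const a, Const b) {
    return {a.value + b.value, a.ipDerived != b.ipDerived};
}

// Any other arithmetic drops the tag.
inline Const mulConst(Const a, Const b) { return {a.value * b.value, false}; }

// A store into the object under construction, recorded during analysis.
struct Store { int64_t thisOffset; Const value; };

// Keep only stores whose value still carries the tag: vptr and type info candidates.
inline std::vector<int64_t> vptrOffsets(const std::vector<Store>& stores) {
    std::vector<int64_t> out;
    for (const Store& s : stores)
        if (s.value.ipDerived) out.push_back(s.thisOffset);
    return out;
}

// Usage: models "lea rax,[rip+0x2F00]" followed by a store at offset 0.
inline void example() {
    Const vtable = addConst(readIp(0x1000), Const{0x2F00, false});
    Const scaled = mulConst(vtable, Const{2, false});  // no longer an address
    std::vector<Store> stores = {{0, vtable}, {8, Const{42, false}}, {16, scaled}};
    std::vector<int64_t> found = vptrOffsets(stores);
    assert(found.size() == 1 && found[0] == 0);
}
```

Filtering on the tag rather than on the mnemonic means the same pass covers both the LEA form and the Move forms, as long as each architecture definition propagates the tag consistently.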

However, this will be insufficient for virtual inheritance. Some compilers create a full object constructor for user code to call, which sets up vptrs and calls the type’s base object constructor, which in turn calls its parents’ base object constructors. Others have only one constructor, and place flags in registers before the constructor call to control whether vptr setting takes place.

To fix this, we must also implement branching support, and in turn, test/compare instructions. However, this also opens us up to the need to detect if our branching would rely on indeterminate data (an unknown, or a flag bit set to unknown). Sanable addresses this by running both branches until they converge, then discarding values that aren’t equal.
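As a rough illustration of the branch handling described above, the sketch below decides whether a conditional jump is determinable from a tri-state flag bit, and joins two register states by keeping only the values both branches agree on. The four-register state and the names `Reg`, `State`, `Taken`, `FlagBit`, `jumpIfZero` and `joinStates` are hypothetical simplifications.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical simplified state: each register is a known constant or unknown (nullopt).
using Reg = std::optional<uint64_t>;
using State = std::array<Reg, 4>;

enum class Taken { Yes, No, Indeterminate };

// A flag bit can only decide a branch when its value is known.
struct FlagBit { bool value; bool known; };

inline Taken jumpIfZero(FlagBit zeroFlag) {
    if (!zeroFlag.known) return Taken::Indeterminate;  // must explore both paths
    return zeroFlag.value ? Taken::Yes : Taken::No;
}

// Join two states at the point where the branches converge:
// a register keeps its value only if both branches agree on it.
inline State joinStates(const State& a, const State& b) {
    State out{};  // every register starts as unknown
    for (std::size_t i = 0; i < out.size(); ++i)
        if (a[i] == b[i]) out[i] = a[i];
    return out;
}

// Usage: register 1 differs between the branches, so it is discarded.
inline void example() {
    State thenBranch = {Reg{1}, Reg{7}, std::nullopt, Reg{3}};
    State elseBranch = {Reg{1}, Reg{9}, std::nullopt, Reg{3}};
    State joined = joinStates(thenBranch, elseBranch);
    assert(joined[0].has_value() && *joined[0] == 1);
    assert(!joined[1].has_value());
    assert(!joined[2].has_value());
    assert(joined[3].has_value() && *joined[3] == 3);
}
```

A real pass would join memory and flag state the same way, and would need a rule for branches that never converge; neither is covered here.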

Intercepting malloc

This approach also lets us detect when malloc or other allocation functions would have been called. If we intercept them, we can skip complex heap allocation logic and create a magic value with a new ID to represent an object at some indeterminate-but-important location, with its own memory space that prevents it from being accidentally written to. It’s even easier to prevent compilers from zero-filling before the constructor is called by blocking the memset call in our thunk from executing.
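One way to picture the interception is a lookup on every simulated call: if the target is a known allocator, the call is skipped and a fresh magic value is returned in its place. The `CallInterceptor` class below is a hypothetical sketch, and reserving ID 0 for the "this" pointer is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_set>

// Hypothetical magic value: a unique ID plus an offset into that object.
struct Magic { uint32_t id; int64_t offset; };

class CallInterceptor {
public:
    // Register the absolute address of an allocation function such as malloc.
    void addAllocator(uint64_t address) { allocators_.insert(address); }

    // Called when the analysis reaches a call instruction.
    // Returns the value to place in the return register, or nullopt
    // when the target is not a known allocator.
    std::optional<Magic> onCall(uint64_t target) {
        if (allocators_.count(target) == 0) return std::nullopt;
        return Magic{nextId_++, 0};  // new object at an indeterminate location
    }

private:
    std::unordered_set<uint64_t> allocators_;
    uint32_t nextId_ = 1;  // 0 is reserved for "this" in this sketch
};

// Usage: two allocations get distinct IDs, other calls are not intercepted.
inline void example() {
    CallInterceptor interceptor;
    interceptor.addAllocator(0x7000);
    std::optional<Magic> first = interceptor.onCall(0x7000);
    std::optional<Magic> second = interceptor.onCall(0x7000);
    assert(first.has_value() && second.has_value());
    assert(first->id != second->id);
    assert(!interceptor.onCall(0x1234).has_value());
}
```

Because each allocation gets its own ID, writes through one magic pointer cannot be confused with writes through another, or with writes to the object being constructed.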

However, C standard library functions are often wrappers that call the real implementation in the C runtime, similar to dllimport. In other words, the apparent addresses of malloc from two different DLLs won’t match. We can be reasonably certain that the last indirect function call in the wrapper will point at the real implementation, which we can retrieve fairly easily and place in our list of allocators.
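The "last indirect call" heuristic can be sketched over a pre-decoded instruction list. The `Insn` structure and `lastIndirectCallTarget` are hypothetical, and the sketch assumes a disassembler has already resolved each call's target address.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical decoded instruction: only what the heuristic needs.
enum class Kind { Other, DirectCall, IndirectCall };
struct Insn { Kind kind; uint64_t target; };  // target is ignored for Kind::Other

// Walk the wrapper backwards and return the target of its last indirect call.
inline std::optional<uint64_t> lastIndirectCallTarget(const std::vector<Insn>& wrapper) {
    for (auto it = wrapper.rbegin(); it != wrapper.rend(); ++it)
        if (it->kind == Kind::IndirectCall) return it->target;
    return std::nullopt;  // no indirect call: treat the wrapper itself as the function
}

// Usage: the second indirect call wins, the direct call is ignored.
inline void example() {
    std::vector<Insn> wrapper = {
        {Kind::Other, 0},
        {Kind::IndirectCall, 0x5000},
        {Kind::DirectCall, 0x6000},
        {Kind::IndirectCall, 0x7000},
        {Kind::Other, 0},
    };
    std::optional<uint64_t> real = lastIndirectCallTarget(wrapper);
    assert(real.has_value() && *real == 0x7000);
}
```

The returned address would then be added to the allocator list, so that wrappers from different DLLs resolve to the same entry.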

Extending to WebAssembly

The WebAssembly VM is considerably different from traditional CPUs. It doesn’t have general-purpose registers and can only calculate on the stack, which is separate from general program memory. While it has a stack pointer, it isn’t stored in a register but can be read onto the stack by the global.get instruction. Position-independent vptr loading is even more idiomatic:

global.get  __memory_base          ; Stands in for instruction pointer        stack: __memory_base
i32.const   vtable for Base@MBREL  ; Refers to relative address               stack: __memory_base, vtable offset
i32.add                            ; Produce absolute address of vtable       stack: absolute vtable address
...
i32.store   0                      ; Write to memory
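The address computation in this sequence can be simulated with a small operand stack that tags values derived from __memory_base, mirroring the instruction-pointer tag used on x86. This is a hedged sketch: the `Op` enum, `StackValue` and `run` are hypothetical, only three opcodes are modeled, and the final i32.store (which also needs a destination address on the stack) is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stack slot: a 32-bit value plus a tag for "__memory_base plus offset".
struct StackValue { uint32_t value; bool baseRelative; };

enum class Op { GlobalGetMemoryBase, I32Const, I32Add };
struct Insn { Op op; uint32_t imm; };  // imm is only used by I32Const

// Execute the modeled opcodes and return the final operand stack.
inline std::vector<StackValue> run(const std::vector<Insn>& code, uint32_t memoryBase) {
    std::vector<StackValue> stack;
    for (const Insn& in : code) {
        switch (in.op) {
        case Op::GlobalGetMemoryBase:
            stack.push_back({memoryBase, true});  // stands in for the instruction pointer
            break;
        case Op::I32Const:
            stack.push_back({in.imm, false});
            break;
        case Op::I32Add: {
            assert(stack.size() >= 2);
            StackValue b = stack.back(); stack.pop_back();
            StackValue a = stack.back(); stack.pop_back();
            // The tag survives only when exactly one operand carries it.
            stack.push_back({a.value + b.value, a.baseRelative != b.baseRelative});
            break;
        }
        }
    }
    return stack;
}
```

Running the three modeled instructions with a memory base of 0x1000 and a vtable offset of 0x40 leaves a single tagged value, 0x1040, on the stack; a store of that value into the object marks a vptr candidate.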