Bare-metal Programming on the Apple M2
This is a short write-up about creating a bare-metal microbenchmarking environment for the Apple M2 (specifically, for my new 13" MacBook Pro with the T8112).
WARNING: If it isn't already obvious, interacting with hardware like this is inherently risky. There are no hard guarantees that any of this is perfectly safe.
I have yet to brick or otherwise damage my MacBook Pro this way (as far as I can tell), but nevertheless, you should avoid doing this on your own hardware if you cannot accept that risk.
Thanks to the hard work of the Asahi Linux contributors, it's quite easy to write and run bare-metal code on devices with Apple Silicon. I thought it might be fun to take advantage of this, and attempt to create some kind of environment where I can run experiments on the ARM cores in a more-controlled setting (ie. without needing to run from userspace within macOS).
All of the source code from this article is available at eigenform/m2e.
About m1n1
m1n1 is the bootloader component of the Asahi Linux project, used to handle early SoC/hardware initialization before loading and booting into Linux.
Apart from that, m1n1 is also an extremely useful development tool on Apple Silicon machines because it exposes a "proxy" debugging interface over the USB ports.
Instead of loading an operating system, m1n1's proxy mode provides a kind of "sandbox" environment, allowing you to interact with the machine over USB after the primary CPU core and basic SoC features are online.
Running our code with m1n1
Originally, I thought I was going to end up writing a tiny kernel.
m1n1 makes it easy to chainload other binaries (in exactly the same way that a typical Asahi Linux installation chainloads uboot, and then a Linux kernel image). m1n1 also has the ability to act like a hypervisor for other targets, although I haven't gotten around to exploring this.
After thinking about it for a while, I decided it would be sufficient to implement everything on top of the m1n1 proxy. The list of things we need from a "kernel" here is fairly short, and m1n1's proxy mode already covers all of them:
- We can arbitrarily interact with memory
- We can arbitrarily run code on a particular physical CPU core
- We can move result data back to a host machine
Importantly, the m1n1 source includes Python libraries that make it very easy to do this. After experimenting for a while, I had a simple script that loads an ELF image into memory and starts running on another CPU core. Interacting with m1n1 this way is a breeze:
from m1n1.setup import *
from construct import *

TARGET_CPU = 4
HEAP_SIZE  = 512 * (1024 * 1024)
CODE_SIZE  = (1024 * 1024)

ResultContext = Struct(
    "result"  / Hex(Int64ul),
    "payload" / Hex(Int64ul),
    "len"     / Hex(Int64ul),
)

def exec_elf(elf_file: str) -> ResultContext:
    """ Load and execute an ELF image on-top of m1n1.
    The target ELF is expected to return a pointer to a [ResultContext]
    which describes the address/length of some result data.
    """
    with GuardedHeap(u.heap) as heap:
        img = TargetELF(elf_file)
        code = heap.memalign((1024 * 1024), CODE_SIZE)
        heap = heap.memalign((1024 * 1024), HEAP_SIZE)
        print(f"[*] Allocated for code @ {code:016x}")
        print(f"[*] Allocated for heap @ {heap:016x}")
        entrypt = img.load(code)
        print(f"[*] Entrypoint: {entrypt:016x}")
        res = p.smp_call_sync(TARGET_CPU, entrypt, heap)
        print(f"[!] Returned with res={res:016x}")
        ctx = iface.readstruct(res, ResultContext)
    return ctx

p.smp_start_secondaries()
p.mmu_init_secondary(TARGET_CPU)
ctx = exec_elf("./path/to/my.elf")
The only funky part about this is that the location of our binary in memory isn't necessarily fixed. This means that whatever code we're going to run needs to be position-independent. Luckily, fixing up relocations after loading ELF segments isn't too difficult (I omitted all of the ELF parsing code here in TargetELF because it's not very interesting).
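For a static position-independent image like this, the entries that need patching are typically just R_AARCH64_RELATIVE relocations, and applying one amounts to adding the image's load address to the stored addend. m2e does this host-side in the Python TargetELF code, but the operation itself looks roughly like this (a sketch in Rust, purely for illustration):

// A sketch of what applying an R_AARCH64_RELATIVE relocation boils down to.
// In m2e this is done from the host in Python before the image runs; the
// names and layout here just follow the ELF64 spec.

/// One entry from the .rela section (Elf64_Rela layout).
#[repr(C)]
struct Rela64 {
    r_offset: u64, // where to patch, relative to the image base
    r_info: u64,   // relocation type (in the low 32 bits)
    r_addend: i64, // value to add to the load address
}

const R_AARCH64_RELATIVE: u64 = 1027;

/// Patch each relative relocation after copying the segments to `base`.
unsafe fn apply_relocations(base: u64, relas: &[Rela64]) {
    for rela in relas {
        if (rela.r_info & 0xffff_ffff) == R_AARCH64_RELATIVE {
            let target = (base + rela.r_offset) as *mut u64;
            target.write_volatile(base.wrapping_add_signed(rela.r_addend));
        }
    }
}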
More about the Environment
m1n1 boots on the primary CPU core (cpu0), which happens to be one of the "Blizzard" E-cores. Since I'm mainly interested in poking at the P-cores here, we need to be able to run on the other "secondary" CPU cores.
Firmware on Apple machines (I think this is called "iBoot") passes an "Apple Device Tree" (ADT) to system software, exposing information about the underlying hardware. The ADT for the T8112 indicates that the CPU cores are split into two clusters: E-cores (cpu{0..3}) and P-cores (cpu{4..7}).
That smp_start_secondaries() call at the end of the script tells m1n1 to initialize all of the secondary cores. After that, all the other cores are idling, waiting for an interrupt from m1n1 that schedules some work. Since we're expecting to use RAM, we also need to tell m1n1 to enable the MMU on the target core via mmu_init_secondary().
In our case, smp_call_sync() is sufficient to cause our code to run on a particular core in EL2. Luckily, this setup doesn't seem to cause any other [unexpected] interrupts to occur on the target core. Plus, since these aren't SMT (simultaneous multithreading) cores, our binary is effectively the only thing running on the target CPU. (Wahoo!)
Running Rust binaries on m1n1
I'm most comfortable programming in [nightly] Rust, and since I already have a bit of experience doing freestanding/bare-metal Rust, getting an environment for #![no_std] AArch64 binaries was mostly painless (see the m2e crate).
Like I mentioned before, we want to make sure we're emitting position-independent code in this situation. As far as I can tell, setting relocation-model to pie in my target configuration was sufficient to deal with this (apart from remembering to fix up relocations when we're finally loading the ELF).
Dynamic Allocations
While it's definitely not necessary for us to have dynamic allocations here, I thought that having an allocator would make things easier to use. If you look at the Python script from before, you can see that we use m1n1 to allocate some space for a heap before passing the base address to our binary.
I hadn't needed to deal with writing an allocator in Rust before, and I found GlobalAlloc to be a lot less scary than I imagined! The environment is very simple [and single-threaded!], so I didn't have to spend a lot of time worrying about the allocator implementation.
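While I won't claim this is exactly what m2e does, the whole thing can be about as small as a bump allocator over the heap region that the Python script reserved for us. A minimal sketch (assuming the heap size matches the HEAP_SIZE reserved by the script, and that nothing ever needs to be freed):

use core::alloc::{GlobalAlloc, Layout};
use core::cell::UnsafeCell;

/// Size of the heap region reserved by the host-side script.
const HEAP_SIZE: usize = 512 * 1024 * 1024;

/// A trivial bump allocator: allocations only move a cursor forward,
/// and dealloc() is a no-op. Fine for short-lived, single-threaded runs.
struct BumpAllocator {
    state: UnsafeCell<(usize, usize)>, // (next free address, end of heap)
}

// Safety: the environment is single-threaded, so no synchronization needed.
unsafe impl Sync for BumpAllocator {}

impl BumpAllocator {
    const fn new() -> Self {
        Self { state: UnsafeCell::new((0, 0)) }
    }
    /// Called once from main() with the heap base passed in by m1n1.
    pub fn init(&self, heap_base: usize) {
        unsafe { *self.state.get() = (heap_base, heap_base + HEAP_SIZE); }
    }
}

unsafe impl GlobalAlloc for BumpAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let (next, end) = *self.state.get();
        // Round the cursor up to the requested alignment (a power of two).
        let start = (next + layout.align() - 1) & !(layout.align() - 1);
        if start + layout.size() > end {
            return core::ptr::null_mut();
        }
        (*self.state.get()).0 = start + layout.size();
        start as *mut u8
    }
    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {}
}

#[global_allocator]
static ALLOCATOR: BumpAllocator = BumpAllocator::new();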
Exit and Error-handling
As of right now, I have a pretty simple scheme for moving data back to the host machine: we just return a pointer back to m1n1 for a structure containing a result code, plus another pointer/length pair describing a buffer with whatever result data we care about.
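On the Rust side, that structure only needs to mirror the layout the Python script reads back with iface.readstruct(). A minimal sketch (the actual definition in m2e may differ slightly):

/// Mirrors the ResultContext Struct from the Python script:
/// three little-endian u64 fields.
#[repr(C)]
pub struct ResultContext {
    pub result: u64,  // result code (e.g. ResultCode::OK)
    pub payload: u64, // address of the result buffer
    pub len: u64,     // length of the result buffer in bytes
}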
The panic handler is very simple, and just restores the original stack pointer, link register, and non-volatile GPRs from m1n1 before returning with an error code. I'll eventually get around to using this to make error-handling a lot more friendly (ie. by passing panic information back to the host), but this seemed sufficient for now.
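Concretely, the shape of that handler is something like this (a sketch: ResultCode::Panic is an assumed variant name, and exit() here stands in for m2e's actual exit path, which restores the saved stack pointer, link register, and GPRs on the way out):

use core::panic::PanicInfo;

// A sketch of the panic handler: return to m1n1 with a distinct error code.
// ResultCode::Panic is assumed here; the actual variant may be named
// differently.
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    // Eventually, _info could be copied into the result payload so the
    // host can print the panic message.
    exit(ResultCode::Panic)
}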
Breakpoint
Now we're finally free to write some mostly normal-looking Rust binaries. The non-returning main is just an artifact of how I decided to glue up the entrypoint/exit functions, but that should be easy to clean up.
// m2e/m2e-rs/src/bin/template.rs
#![no_std]
#![no_main]

use m2e::common::*;
use m2e::mem::*;

#[no_mangle]
pub fn main(heap_base: usize) -> ! {
    // We always have to initialize the allocator!
    ALLOCATOR.init(heap_base);

    let myvec = vec![0xffu8; 0x100];
    CONTEXT.get().set_payload(
        myvec.as_ptr(),
        core::mem::size_of_val(&*myvec)
    );

    exit(ResultCode::OK)
}
... and here's what a run through the ELF loader looks like: all we're doing here is allocating some 0xffs on the heap and returning them to my host machine with m1n1.
$ run-elf.py
TTY> CPU init (MIDR: 0x611f0320)...
TTY> CPU part: 0x32 rev: 0x10
TTY> CPU: M2 Blizzard
...
Fetching ADT (0x0005C000 bytes)...
m1n1 base: 0x8039c0000
TTY> Starting secondary CPUs...
TTY> Starting CPU 1 (0:0:1)... Started.
TTY> Starting CPU 2 (0:0:2)... Started.
TTY> Starting CPU 3 (0:0:3)... Started.
TTY> Starting CPU 4 (0:1:0)... Started.
TTY> Starting CPU 5 (0:1:1)... Started.
TTY> Starting CPU 6 (0:1:2)... Started.
TTY> Starting CPU 7 (0:1:3)... Started.
[*] Allocated for code @ 000000080de00000
[*] Allocated for heap @ 000000080df00000
[*] Writing segment to 000000080de00000 (0000000000000214)
[*] Writing segment to 000000080de00218 (00000000000000e4)
[*] Writing segment to 000000080de00300 (00000000000001f0)
[*] RELA @ 000000080de004e0 => 000000080de00400
[*] RELA @ 000000080de004e8 => 000000080de00300
[*] Entrypoint: 000000080de00104
[*] Running ...
[!] Returned with res=000000080de00300
Container:
result = 0x0000000000000003
payload = 0x000000082DEFFF00
len = 0x0000000000000100
00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
00000010 *
Later on, I will probably be writing more about exploring aspects of the P-core ("Avalanche") microarchitecture with this setup, so stay tuned! Thanks for reading!