Playing with AMD IBS

10 September 2021

Miscellania

Table of Contents
Instruction-based Sampling (IBS)
Microcoded Instructions
Writing ibstrace
Enumerating MSRs
'Till Next Time

Anecdote about having fun with AMD's "instruction-based sampling" features.

A while ago, I built a new desktop machine with a Ryzen 9 3950X. I had never owned any AMD silicon before this, so I thought it might be kind of interesting to read the programmer's manual and learn a little bit about Zen 2.

One thing I wanted to try was using some performance-monitoring features (specifically, AMD's "instruction-based sampling"). On Linux, the perf subsystem already supports IBS - but having no performance-critical code that I need to measure, I decided to do something that's slightly outside of the intended use-case for performance monitoring features like this.

This is probably the first in a series of write-ups that I'll be doing about these experiments. I'm going to use this one to set up some context, and later entries will be more about particular experiments I'm going to do. I've been working on a lot of different unrelated projects over the past year or so, so most of this post is retrospective.

This series is not going to be about using IBS to learn about particular instructions and performance monitoring in the typical sense. I'm using IBS here to learn weird details about the microarchitecture, and I approached it from a slightly more "adversarial," security-research kind of angle.

Instruction-based Sampling (IBS)

Like other modern processors, AMD machines (including Zen 2 machines) have special performance monitoring features (called IBS) that are distinct from "architectural performance-monitoring counters" that are otherwise consistent across all implementations of the x86-64 ISA. As far as I can tell, it's a bit like Intel's feature called "Processor Event-based Sampling" (PEBS).

Typically when reasoning about modern microarchitectures, it's useful to make a distinction between a "front-end" and a "back-end" part of a pipeline. IBS features are also split this way:

"IBS fetch sampling" reflects the front-end of the CPU core, where we collect data about instructions being fetched and decoded into simpler micro-ops
"IBS op/execution sampling" reflects the back-end, providing data about instructions after being dispatched, executed, and retired

I'm only going to be talking about the back-end "op sampling" part of IBS, which as far as I can tell works something like this:

When op sampling is enabled, a programmable counter (of either clock-cycles or dispatched micro-ops) iterates until overflow, causing IBS to tag the next micro-op
The tagged micro-op moves through the CPU core's pipeline
When the tagged micro-op is retired, information is written to the IBS registers, and a non-maskable interrupt delivered to the core indicates that a new sample is available
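The counting-and-tagging part of those steps can be modeled in a few lines. This is only a toy simulation of the behavior as I understand it, not a description of the hardware: the `OpSampler` type, its fields, and the choice to count dispatched ops (rather than cycles) are all made up for illustration.

```rust
/// Toy model of IBS op sampling: count dispatched ops, and tag the op
/// that follows each counter overflow. All names here are hypothetical.
struct OpSampler {
    max_count: u64, // programmed sampling period
    cur_count: u64, // running counter
    tag_next: bool, // set once the counter has reached the period
}

impl OpSampler {
    fn new(max_count: u64) -> Self {
        Self { max_count, cur_count: 0, tag_next: false }
    }

    /// Called once per dispatched op; returns true if this op gets tagged.
    fn dispatch(&mut self) -> bool {
        if self.tag_next {
            // The counter overflowed on an earlier op: tag this one, restart.
            self.tag_next = false;
            self.cur_count = 0;
            return true;
        }
        self.cur_count += 1;
        if self.cur_count >= self.max_count {
            self.tag_next = true;
        }
        false
    }
}

/// Indices of the ops that would be tagged in a stream of `num_ops` ops.
fn tagged_ops(max_count: u64, num_ops: usize) -> Vec<usize> {
    let mut sampler = OpSampler::new(max_count);
    (0..num_ops).filter(|_| sampler.dispatch()).collect()
}

fn main() {
    // With a period of 4, every fifth op in the stream is tagged.
    println!("{:?}", tagged_ops(4, 12));
}
```

The point of the model is just that the sampled op is whichever one happens to follow the overflow, which is why a tight loop around one instruction ends up sampling each of its micro-ops sooner or later.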

On a whim back in March/April, I played a little bit with some existing tools for using IBS (see jlgreathouse/AMD_IBS_Toolkit). One thing I found really, really interesting is that you will occasionally sample load/store micro-ops, and IBS will collect the associated linear and physical addresses along with the width of the access.

Feeling a little drunk with a newfound power, I had the following thoughts:

"Can I use IBS to glean some arcane or otherwise undocumented knowledge about how the machine is implemented?"
"From an 'adversarial' security-research point-of-view, can we use this as a kind of foothold into understanding or reverse-engineering other aspects of the machine?"

I haven't been able to find any cases of other folks "playing" with IBS like this, so I thought it would be a nice avenue for some research. I've been working at this on-and-off for a while alongside other projects, and I consider these as still mostly open questions. There are all sorts of paths for some research in this direction.

Microcoded Instructions

If I were designing a chip, I'd want programmers to know at least a little bit about how certain instructions and certain parts of the machine are implemented, because that makes both of our lives easier in a sense. However, the way that some instructions are implemented is probably too complicated or otherwise not-relevant-enough to programmers in some cases.

The first "opaque" thing I could think of was microcoded instructions. If you have any experience with x86-64, it's not too much of a stretch to assume that lots of microcoded instructions probably decode into multiple load/store micro-ops. So I decided to run with this idea.

Luckily, AMD's software optimization guide provides a spreadsheet that lists out which instructions are microcoded (also, if you're playing along at home, you might want to know that AMD refers to my 3950X as "Family 17h, Model 71h").

The first microcoded instruction that came to my mind was CPUID, mostly because I already had an intuition about how it might be implemented, and that it probably had something to do with the machine's model-specific registers (or MSRs). MSRs are special registers that expose all kinds of different system-level functionality to programmers. They're typically used for configuring different parts of the machine.

With this in mind, I wrote a little program that pins itself to a core, runs a CPUID instruction in a tight loop, and then parses out only the load/store micro-op samples associated with CPUID. If you do this, you will see something like this:

====== EAX 80000001
store=0 load=1 width=1 phys_addr=000000000000008c
store=0 load=1 width=4 phys_addr=00000000000000d4
store=0 load=1 width=8 phys_addr=0000000000000120
store=0 load=1 width=4 phys_addr=0000000000000134
store=0 load=1 width=4 phys_addr=00000000000002e0
store=0 load=1 width=4 phys_addr=00000000000002e4
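If you want to post-process output in this shape, a small parser is enough. This is a minimal sketch that assumes the exact `key=value` text format printed above; the `MemSample` struct and `parse_sample` function are hypothetical names for illustration and aren't taken from any existing tool.

```rust
/// One load/store sample, mirroring the fields printed above.
/// The struct and parser are illustrative, not part of any existing tool.
#[derive(Debug, PartialEq)]
struct MemSample {
    store: bool,
    load: bool,
    width: u8,
    phys_addr: u64,
}

/// Parse a line like `store=0 load=1 width=4 phys_addr=00000000000000d4`.
/// Returns None for headers, blank lines, or anything with missing fields.
fn parse_sample(line: &str) -> Option<MemSample> {
    let (mut store, mut load, mut width, mut phys_addr) = (None, None, None, None);
    for field in line.split_whitespace() {
        let (key, val) = field.split_once('=')?;
        match key {
            "store" => store = Some(val == "1"),
            "load" => load = Some(val == "1"),
            "width" => width = val.parse::<u8>().ok(),
            "phys_addr" => phys_addr = u64::from_str_radix(val, 16).ok(),
            _ => return None,
        }
    }
    Some(MemSample {
        store: store?,
        load: load?,
        width: width?,
        phys_addr: phys_addr?,
    })
}

fn main() {
    let text = "====== EAX 80000001\n\
                store=0 load=1 width=1 phys_addr=000000000000008c\n\
                store=0 load=1 width=4 phys_addr=00000000000000d4";
    // Header lines fail to parse and are skipped by filter_map.
    for sample in text.lines().filter_map(parse_sample) {
        println!("{:#06x} ({} bytes)", sample.phys_addr, sample.width);
    }
}
```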

If these samples aren't noise, it suggests that the leaf at 0x8000_0001 is implemented by performing these memory accesses. Like I mentioned before, I expected this a little bit. However, it raised a bunch of nice questions. Assuming that these locations are MSRs (or some kind of more generic storage for special parts of the architectural state):

MSRs are typically just a single value; why are there so many accesses here?
Why do they have physical addresses? Are these storage locations on some bus that might be shared by parts of the system?
Are there other ways of making the machine issue loads or stores like this, apart from decoding a microcoded instruction?

If you think about the programming model for x86-64 in a certain way, the CPUID instruction is (in some sense) a way for unprivileged programs to glean information about the underlying hardware without having access to an RDMSR instruction. RDMSR is a privileged instruction in x86-64, and is only used in the context of programs like firmware and operating-systems.

And naturally, RDMSR itself is also a microcoded instruction. If these accesses are directed at some kind of special address-space for MSRs, it seems like IBS might be a good way to figure out what's going on.

Writing ibstrace

In order to continue hypothesizing about all of this, I figured that I'd want a reliable way to execute and sample privileged instructions. I was also unhappy using a tool that I didn't quite understand, so I decided to make my own little environment.

A super-rigorous test environment would probably involve writing your own tiny kernel solely for playing with this. I've done comparatively little bare-metal x86 programming (outside of hacking on the Linux kernel and UEFI things here-and-there), so I decided to shelve that idea and settle for writing a Linux kernel module and some user programs.

I wrote a kernel module called ibstrace with only three goals in mind:

Compatibility with AMD Family 17h Model 71h
Running arbitrary chunks of privileged code on a single target core
Collecting op samples while some code is running on the target core

It's a bit hacky, but I think it's suitable for this kind of exploration. I also wrote a Rust crate for myself called ibst-rs that I'm using to send code to the kernel module and retrieve samples, among other things.

You can see all my code and notes at eigenform/ibstrace, but I have to warn you: it's highly unsafe by design, there are no guarantees of compatibility with your machine, and I've most definitely made subtle mistakes.

One really cool thing I wanted to do was to have some way of emitting code I want to measure at runtime. I have to take the time here to plug the dynasm-rs crate, which lets you do exactly that. It's hard for me to overstate how useful this crate is.

For example, here's how I'm emitting tests that measure RDMSR instructions:

/// Wrapper around dynasm for emitting a simple loop (decrementing RSI).
#[macro_export]
macro_rules! emit_test_iters_rsi {
    ($num_iter:expr, $($t:tt)*) => { {
        let mut asm = Assembler::<X64Relocation>::new().unwrap();
        dynasm!(asm
            ; .arch x64
            ; mov   rsi, $num_iter as _
            ; ->loop_start:
            $($t)*
            ; sub   rsi, 1
            ; jne   ->loop_start
            ; mov   rax, 42
            ; ret
        );
        asm.finalize().unwrap()
    } }
}

/// Emit a test for a particular invocation of RDMSR.
pub fn emit_msr_test(msr: u32, iters: usize) -> ExecutableBuffer {
    emit_test_iters_rsi!(iters, 
        ; mov   ecx, msr as _
        ; rdmsr
    )
}

/// Test an MSR, returning a list of samples from the kernel module.
fn test_msr_single(fd: i32, msr: u32, iters: usize) -> Box<[Sample]> {
    let code = ibst::codegen::emit_msr_test(msr, iters);
    let msg = ibst::ioctl::UserBuf::new(
        code.ptr(AssemblyOffset(0)), code.len()
    );
    ibst::measure(fd, &msg)
}

This way, we can programmatically generate whole sets of test cases for code that we want to measure with IBS! (I think this is really cool!!)
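To make it concrete what ends up in the buffer, here's a hand-encoded version of just the loop body (`mov ecx, imm32` followed by `rdmsr`). This is a sketch for intuition only: `encode_rdmsr_body` is a hypothetical helper that isn't part of ibst-rs, the MSR numbers in `main` are arbitrary examples, and in the real code dynasm takes care of the encoding and the surrounding loop.

```rust
/// Hand-encode `mov ecx, imm32; rdmsr` for a given MSR number.
/// Only the loop body is covered; this is not how the real code emits it.
fn encode_rdmsr_body(msr: u32) -> Vec<u8> {
    let mut code = vec![0xB9]; // mov ecx, imm32 (opcode B9, then the immediate)
    code.extend_from_slice(&msr.to_le_bytes()); // immediate is little-endian
    code.extend_from_slice(&[0x0F, 0x32]); // rdmsr
    code
}

fn main() {
    // Arbitrary example MSR numbers to generate test bodies for.
    for msr in [0xc000_0080u32, 0xc000_0081, 0xc000_0082] {
        let bytes = encode_rdmsr_body(msr);
        println!("{:08x}: {:02x?}", msr, bytes);
    }
}
```

Generating one body per MSR number like this is the same idea as calling `emit_msr_test` in a loop, just without the assembler doing the work.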

Enumerating MSRs

I knew that I was going to end up using this to learn about MSRs, but in order to deal with that I had to do something kind of goofy first.

The RDMSR instruction takes a register number in ECX, and it actually faults if you try to specify an unimplemented or otherwise inaccessible register. If I'm going to end up reading a bunch of MSRs, I should probably figure out which ones are valid.

Note: Unfortunately, according to the programming manual, micro-ops tagged by IBS don't produce sample data if the original instruction is aborted. I haven't been able to play around with the behavior in that space yet. I'll probably talk more about this in a later post.

I don't want to crash my machine over and over again, so I'd really like a list of all the valid MSRs, but I don't trust that the processor programming reference (PPR) for this chip is going to really list all of the valid MSRs.

On Linux, you can use /dev/cpu/n/msr to read an MSR from a particular CPU core. If you try to read an invalid MSR from this device, the kernel will magically handle the exception for you, and you'll just get -EIO in return. Now, so long as nothing totally undefined or unexpected happens, it's reasonable to expect that you can use this to just enumerate all of the MSRs on a machine. Consider the following:

//! Goofy way of enumerating "acceptable" MSRs via /dev/cpu/n/msr, where the
//! word "acceptable" here means "cases where RDMSR doesn't fault."

use std::collections::BTreeMap;

pub fn msr_open(core_id: usize) -> Result<i32, &'static str> {
    let path = format!("/dev/cpu/{}/msr", core_id);
    match nix::fcntl::open(path.as_str(), nix::fcntl::OFlag::O_RDONLY, 
                                nix::sys::stat::Mode::S_IRUSR) {
        Ok(fd) => Ok(fd),
        Err(e) => match e {
            nix::Error::Sys(eno) => match eno {
                nix::errno::Errno::EACCES => Err("Permission denied"),
                _ => panic!("{}", eno),
            },
            _ => panic!("{}", e),
        },
    }
}

pub fn msr_close(fd: i32) {
    use nix::unistd::close;
    match close(fd) {
        Ok(_) => {},
        Err(e) => panic!("{}", e),
    }
}

pub fn msr_read(fd: i32, msr: u32) -> Result<u64, &'static str> {
    let mut buf = [0u8; 8];
    match nix::sys::uio::pread(fd, &mut buf, msr as i64) {
        Ok(_) => Ok(u64::from_le_bytes(buf)),
        Err(e) => match e {
            nix::Error::Sys(eno) => match eno {
                nix::errno::Errno::EIO => Err("Unsupported MSR"),
                _ => panic!("{}", eno),
            },
            _ => panic!("{}", e),
        },
    }
}


fn main() -> Result<(), &'static str> {
    const TGT_CORE: usize = 0;
    let mut output = BTreeMap::new();

    // You probably want to pin this to the same core you're reading from.
    // It seems to run *much* faster when I do this.

    let this_pid = nix::unistd::Pid::from_raw(0);
    let mut cpuset = nix::sched::CpuSet::new();
    cpuset.set(TGT_CORE).unwrap();
    nix::sched::sched_setaffinity(this_pid, &cpuset).unwrap();

    let fd = match msr_open(TGT_CORE) {
        Ok(fd) => fd,
        Err(e) => return Err(e),
    };

    for msr in 0x0000_0000..=0x0000_1000 {
        if let Ok(val) = msr_read(fd, msr) {
            eprintln!("Found MSR {:08x}", msr);
            output.insert(msr, val);
        }
    }
    for msr in 0xc000_0000..=0xc002_0000 {
        if let Ok(val) = msr_read(fd, msr) {
            eprintln!("Found MSR {:08x}", msr);
            output.insert(msr, val);
        }
    }
    for (msr, val) in &output {
        println!("{:08x}: {:016x}", msr, val);
    }

    msr_close(fd);
    Ok(())
}
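If you'd rather not depend on the nix crate for this, the same read can be sketched with only the standard library. This is an alternative to the listing above rather than what I actually ran: `read_u64_at` and `try_msr` are hypothetical names, and the sketch assumes the msr device treats the file offset as the MSR number, as described above.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Read 8 bytes at `offset` and interpret them as a little-endian u64.
/// For an msr device file, the offset is the MSR number.
fn read_u64_at(file: &File, offset: u64) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    file.read_exact_at(&mut buf, offset)?;
    Ok(u64::from_le_bytes(buf))
}

/// One-shot helper: try a single MSR on a single core.
/// `None` means the device could not be opened or the read was rejected.
fn try_msr(core_id: usize, msr: u32) -> Option<u64> {
    let file = File::open(format!("/dev/cpu/{}/msr", core_id)).ok()?;
    read_u64_at(&file, u64::from(msr)).ok()
}

fn main() {
    // Needs root and the msr device to be present; prints None otherwise.
    println!("{:x?}", try_msr(0, 0xc000_0080));
}
```

For a full enumeration you'd open the file once and call `read_u64_at` in a loop, instead of reopening the device for every MSR like `try_msr` does.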

To my surprise, you can actually do this: it doesn't take too long to try all 32-bit values, and it didn't hose my machine or cause anything weird to happen. The results are mostly expected: no MSRs hidden at "unusual" 32-bit values different from the expected values of architectural MSRs, but plenty of valid MSRs (5305 in total) that are within expected ranges, and many that are not documented in the PPR for my particular chip.
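With thousands of hits, it's easier to eyeball the result as contiguous ranges than as individual register numbers. Here's a minimal sketch of that post-processing step, assuming a sorted list of MSR numbers (like the keys of the `BTreeMap` in the listing above); `contiguous_ranges` is a hypothetical helper and the numbers in `main` are made-up examples, not my actual results.

```rust
/// Collapse a sorted list of MSR numbers into inclusive (start, end) ranges.
fn contiguous_ranges(msrs: &[u32]) -> Vec<(u32, u32)> {
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for &msr in msrs {
        if let Some(last) = ranges.last_mut() {
            // Extend the current range if this MSR directly follows it.
            if last.1.checked_add(1) == Some(msr) {
                last.1 = msr;
                continue;
            }
        }
        // Otherwise start a new single-element range.
        ranges.push((msr, msr));
    }
    ranges
}

fn main() {
    // Made-up example input; real input would come from the enumeration.
    let found = [0x10, 0x11, 0x12, 0x1b, 0xc000_0080, 0xc000_0081];
    for (start, end) in contiguous_ranges(&found) {
        println!("{:08x}..={:08x} ({} MSRs)", start, end, end - start + 1);
    }
}
```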

'Till Next Time

This is a pretty good retrospective on "what I was thinking when I started exploring this," but I haven't shown you many observations yet. I don't know exactly what I want to touch on next, but I'll definitely show some data from tests I've tried, and probably take some time to piece together a more complete picture of what I think is going on here.

Also, here's some other material and extra context related to some of the things I've described here, or other existing work that was floating around in my head while starting to think about this:

AMD's Developer Guides, Manuals & ISA Documents
The Zen 2 Microarchitecture (on Wikichip)
"How many registers does an x86-64 CPU have?"
Neat 34C3 presentation on x86 microcode (on YouTube)

Thanks for reading!