Playing with PMCs on Zen 2 Machines

26 March 2022

Miscellanea

Table of Contents
About PMCs
Taking Measurements with lamina

Anecdote about having fun with x86 performance-monitoring counters.

The last time I wrote, I talked about some of my experiments with the Instruction-based Sampling (IBS) functionality on Zen 2 machines. This time, we're talking about exploring something somewhat more general.

About PMCs

Like other machines implementing the x86-64 ISA, Zen 2 has performance-monitoring counter (PMC) registers that can be used to count different [micro]architectural events. On Zen 2, there's a set of six PMCs: each has an associated PERF_CTL MSR used to program the counter, and a PERF_CTR MSR that increments when events occur in the machine.
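To make the register layout concrete, here is a minimal sketch of how the PERF_CTL/PERF_CTR pairs map onto MSR addresses. It assumes the extended core counter range that starts at MSR 0xC001_0200, with control and counter registers interleaved. The function and constant names are made up for illustration and are not part of lamina.

```rust
/// Number of core PMCs available on Zen 2, as described in the text.
const NUM_PMCS: u32 = 6;

/// Assumed base address of PERF_CTL0 (extended core counter MSR range).
const PERF_CTL_BASE: u32 = 0xC001_0200;

/// MSR address of the PERF_CTL register for counter `n` (hypothetical helper).
fn perf_ctl_msr(n: u32) -> Option<u32> {
    // Control registers sit at even offsets from the base.
    (n < NUM_PMCS).then(|| PERF_CTL_BASE + 2 * n)
}

/// MSR address of the PERF_CTR register for counter `n` (hypothetical helper).
fn perf_ctr_msr(n: u32) -> Option<u32> {
    // Each counter register directly follows its control register.
    (n < NUM_PMCS).then(|| PERF_CTL_BASE + 2 * n + 1)
}
```

Returning `None` for an out-of-range index keeps a typo from turning into a write to an unrelated MSR.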

AMD publishes a Processor Programming Reference (PPR) for each processor family and model, and this is where the available events are defined. Each event is identified by a 12-bit event select, and can be further qualified by an 8-bit unit mask, which for some events narrows down what actually gets counted.

For example, the Processor Programming Reference for AMD Family 17h Model 71h defines the "Load-Store Dispatch" event PMCx029, which counts the number of micro-ops that are dispatched to the load-store units. Setting different bits in the unit mask allows the user to count specific types of load-store micro-ops: loads, stores, or combined load/stores.

Apart from specifying an event, PERF_CTL registers have different bits that can be used to control counting in other ways:

Events can be edge-triggered or level-triggered
You can fire interrupts when counters overflow
You can count events from user and/or privileged threads
You can count events from SVM guests and/or the SVM host
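As a rough illustration of how these controls combine with the event select and unit mask, here is a sketch that packs them into a 64-bit PERF_CTL value. The bit positions follow my reading of the AMD documentation (event select split across bits 7:0 and 35:32, unit mask in 15:8, and so on) and should be checked against the PPR for your part. The `PerfCtl` struct is hypothetical and is not lamina's `PerfCtlDescriptor`.

```rust
/// Hypothetical description of one PERF_CTL register value.
#[derive(Clone, Copy, Debug, Default)]
struct PerfCtl {
    event: u16,       // 12-bit event select (e.g. 0x029)
    unit_mask: u8,    // 8-bit unit mask
    user: bool,       // count events in user mode
    os: bool,         // count events in privileged mode
    edge: bool,       // edge-triggered instead of level-triggered
    interrupt: bool,  // raise an interrupt on overflow
    enable: bool,     // enable the counter
    guest_only: bool, // count only while running an SVM guest
    host_only: bool,  // count only while running the SVM host
}

impl PerfCtl {
    /// Pack the fields into the assumed PERF_CTL bit layout.
    fn encode(&self) -> u64 {
        let ev = (self.event & 0x0fff) as u64;
        (ev & 0xff)                               // EventSelect[7:0]
            | ((self.unit_mask as u64) << 8)      // UnitMask
            | ((self.user as u64) << 16)          // user mode
            | ((self.os as u64) << 17)            // OS mode
            | ((self.edge as u64) << 18)          // edge detect
            | ((self.interrupt as u64) << 20)     // interrupt enable
            | ((self.enable as u64) << 22)        // counter enable
            | ((ev >> 8) << 32)                   // EventSelect[11:8]
            | ((self.guest_only as u64) << 40)    // guest only
            | ((self.host_only as u64) << 41)     // host only
    }
}
```

For instance, counting PMCx029 in user mode with the low three unit mask bits set (an arbitrary choice here) would look like `PerfCtl { event: 0x029, unit_mask: 0x07, user: true, enable: true, ..Default::default() }.encode()`.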

Taking Measurements with lamina

NOTE: All of this is crafted specifically for looking at my Ryzen 9 3950X on Linux 5.16. Also, you should know that these tools are somewhat hacky and very unsafe by design. You may want to proceed with caution if you decide to replicate experiments with these tools.

I wrote a simple Linux kernel module that lets me program the PMC registers with an ioctl(). The functionality is wrapped up in a Rust crate (see eigenform/lamina), which is where I'm keeping all of my experiments in this space.

Here, I'm only really interested in measuring code that's running in userspace. This means that we need some quick way to read the counters. The RDPMC instruction is a shortcut for reading a counter (otherwise you'd use RDMSR, which is privileged), but it's only available in userspace when the PCE bit is set in CR4.
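For reference, a bare-bones userspace wrapper around RDPMC might look like the sketch below. It is not how lamina reads counters; the function names are hypothetical. Calling `rdpmc` faults unless CR4.PCE is set and the selected counter exists, so treat it as unsafe in the fullest sense.

```rust
/// Combine the two 32-bit halves that RDPMC returns in EDX:EAX.
fn combine(lo: u32, hi: u32) -> u64 {
    ((hi as u64) << 32) | (lo as u64)
}

/// Read performance counter `index` with RDPMC, fenced on both sides.
/// Hypothetical helper: faults if userspace RDPMC is not permitted.
#[cfg(target_arch = "x86_64")]
unsafe fn rdpmc(index: u32) -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        core::arch::asm!(
            "lfence",
            "rdpmc",
            "lfence",
            in("ecx") index, // counter index goes in ECX
            out("eax") lo,   // low 32 bits of the counter
            out("edx") hi,   // high bits of the counter
            options(nostack),
        );
    }
    combine(lo, hi)
}
```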

On Linux 5.16, I think this functionality already exists because of the perf subsystem, and you can apparently manage it with the following (as root):

# Allow RDPMC from any userspace process
echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
# Go back to the default (RDPMC only for processes with a perf event mapped)
echo 1 > /sys/bus/event_source/devices/cpu/rdpmc
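If you want a test program to check this setting before it tries RDPMC, something like the following sketch could work. It assumes the usual meaning of the sysfs values as I understand them from the kernel's x86 perf code (0: disabled, 1: only for processes with a perf event mapped, 2: always allowed). The enum and function names are invented for this example.

```rust
use std::fs;

/// Assumed meanings of the values in the sysfs `rdpmc` file.
#[derive(Debug, PartialEq, Eq)]
enum RdpmcMode {
    Disabled,       // "0"
    PerfEventsOnly, // "1" (default)
    Always,         // "2"
}

/// Parse the contents of the sysfs file (hypothetical helper).
fn parse_rdpmc_mode(s: &str) -> Option<RdpmcMode> {
    match s.trim() {
        "0" => Some(RdpmcMode::Disabled),
        "1" => Some(RdpmcMode::PerfEventsOnly),
        "2" => Some(RdpmcMode::Always),
        _ => None,
    }
}

/// Read the current mode; `None` if the file is missing or unreadable.
fn current_rdpmc_mode() -> Option<RdpmcMode> {
    let s = fs::read_to_string("/sys/bus/event_source/devices/cpu/rdpmc").ok()?;
    parse_rdpmc_mode(&s)
}
```

A test harness could then bail out early unless `current_rdpmc_mode()` returns `Some(RdpmcMode::Always)`.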

Apart from that, there are actually a few other things that I'm doing to prepare my machine for handling tests (see this script):

The kernel NMI watchdog is disabled (it typically consumes counter 0)
Simultaneous multithreading (SMT) is disabled
Frequency scaling features are disabled
All of the measured code is pinned to a single hardware thread
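Part of this checklist can be sanity-checked from the test program itself. The sketch below is not the script mentioned above; it only reads two files that I'm assuming are present on a typical recent kernel (`/proc/sys/kernel/nmi_watchdog` and `/sys/devices/system/cpu/smt/active`) and reports anything that looks wrong. All names are hypothetical.

```rust
use std::fs;

/// Parse a "0"/"1" style kernel flag file.
fn parse_flag(s: &str) -> Option<bool> {
    match s.trim() {
        "0" => Some(false),
        "1" => Some(true),
        _ => None,
    }
}

/// Read a flag file; `None` if it is missing or has unexpected contents.
fn read_flag(path: &str) -> Option<bool> {
    parse_flag(&fs::read_to_string(path).ok()?)
}

/// Collect human-readable warnings about the measurement environment.
fn environment_warnings() -> Vec<String> {
    // (path, message to emit when the flag reads as "1")
    let checks = [
        ("/proc/sys/kernel/nmi_watchdog", "NMI watchdog is enabled"),
        ("/sys/devices/system/cpu/smt/active", "SMT is active"),
    ];
    let mut warnings = Vec::new();
    for (path, msg) in checks {
        match read_flag(path) {
            Some(true) => warnings.push(msg.to_string()),
            Some(false) => {}
            None => warnings.push(format!("could not read {path}")),
        }
    }
    warnings
}
```

Frequency scaling and thread pinning are left out here, since how those are controlled varies more between setups.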

Despite this, testing in Linux userspace will always be somewhat noisy, even when the window between two measurements is very small. There aren't really any guarantees that we haven't been rescheduled by the kernel for some reason. I'm mainly relying on repeated testing to compensate for this.

Here's an example of what I'm using to emit code wrapped in RDPMC measurements:

#[macro_export]
macro_rules! emit_rdpmc_test_all {
    ($($body:tt)*) => { {
        let mut asm = Assembler::<X64Relocation>::new().unwrap();

        // <emit a function prologue>
        // ...
		
        // Take some measurements
        dynasm!(asm
            ; mov rcx, 5 ; lfence ; rdpmc ; lfence ; sub r14, rax
            ; mov rcx, 4 ; lfence ; rdpmc ; lfence ; sub r13, rax
            ; mov rcx, 3 ; lfence ; rdpmc ; lfence ; sub r12, rax
            ; mov rcx, 2 ; lfence ; rdpmc ; lfence ; sub r11, rax
            ; mov rcx, 1 ; lfence ; rdpmc ; lfence ; sub r10, rax
            ; mov rcx, 0 ; lfence ; rdpmc ; lfence ; sub  r9, rax
        );

        // Measured code goes here
        $($body)*

        // Take another set of measurements and compute the difference
        dynasm!(asm
            ; mov rcx, 0 ; lfence ; rdpmc ; lfence ; add  r9, rax
            ; mov rcx, 1 ; lfence ; rdpmc ; lfence ; add r10, rax
            ; mov rcx, 2 ; lfence ; rdpmc ; lfence ; add r11, rax
            ; mov rcx, 3 ; lfence ; rdpmc ; lfence ; add r12, rax
            ; mov rcx, 4 ; lfence ; rdpmc ; lfence ; add r13, rax
            ; mov rcx, 5 ; lfence ; rdpmc ; lfence ; add r14, rax
        );

        // <save our results and emit a function epilogue>
        // ...
	
        asm.finalize().unwrap()
    } }
}

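Note that the LFENCE instructions are necessary: RDPMC is not a serializing instruction, so without them it could execute out of order with respect to the surrounding code. Here, we're using them to make sure that earlier instructions have completed before RDPMC reads the counter, and that later instructions don't start until it has.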
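The SUB-then-ADD pattern in the macro may look odd at first, so here is the same arithmetic written out in plain Rust. This sketch assumes each accumulator register starts at zero (presumably arranged by the elided prologue); `measure_delta` is a made-up name for illustration.

```rust
/// Model of one accumulator register across the two measurement blocks.
fn measure_delta(start: u64, end: u64) -> u64 {
    // Assumed initial state: the register is zeroed in the prologue.
    let mut acc: u64 = 0;
    // First block: `sub reg, rax` leaves the negated starting count.
    acc = acc.wrapping_sub(start);
    // Second block: `add reg, rax` turns that into `end - start`.
    acc = acc.wrapping_add(end);
    acc
}
```

Doing it this way means each counter needs only one register for the whole measurement, with no temporary to hold the starting value.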

My library is mostly a collection of Rust macros and templates like this, plus the interfaces for collecting and analyzing measurements.

Working with runtime code generation macros here is especially nice because it makes things very readable (at least, to me?), and I'm pretty happy with it. It's basically the same setup as eigenform/ibstrace.

Here's an example of a test that measures six different events over four NOP instructions:

use lamina::*;
use lamina::ctx::PMCContext;
use lamina::pmc::PerfCtlDescriptor;
use lamina::event::Event;

fn main() -> Result<(), &'static str> {
    // The kernel module always instruments PMCs on core 0
    lamina::util::pin_to_core(0);

    // Context for interactions with the kernel module
    let mut ctx = PMCContext::new()?;

    // Declare a set of events that we want to measure
    let mut pmc = PerfCtlDescriptor::new()
        .set(0, Event::ExRetCops(0x00))      // Retired ops
        .set(1, Event::DeSrcOpDisp(0x03))    // Dispatched ops
        .set(2, Event::ExRetInstr(0x00))     // Retired instructions
        .set(3, Event::LsNotHaltedCyc(0x00)) // Cycles not in halt
        .set(4, Event::LsIntTaken(0x00))     // Interrupts taken
        .set(5, Event::LsSmiRx(0x00));       // SMIs received

    // Tell the kernel to enable a set of counters
    ctx.write(&pmc);

    // Take an empty baseline/floor measurement 
    let code = emit_rdpmc_test_all!();
    let mut test = PMCTest::new("floor", &code, &pmc);
    test.run_iter(4096);
    test.print();

    // Measure something interesting
    let code = emit_rdpmc_test_all!(
        ; nop
        ; nop
        ; nop
        ; nop
    );
    let mut test = PMCTest::new("4 nops", &code, &pmc);
    test.run_iter(4096);
    test.print();

    Ok(())
}

Since we're always going to be dealing with a set of repeated tests, the results for each event have an associated minimum (the lowest number of observed events), a maximum (the highest number of observed events), and a mode (the most commonly observed number of events). In some cases, it's also useful to see the whole distribution of observed counter values. For example, running the test above yields the following results:

# Test 'floor'
ExRetCops(0)             min=23    max=23    mde=23    dist={23: 4096}
DeSrcOpDisp(3)           min=12    max=16    mde=16    dist={12: 1, 16: 4095}
ExRetInstr(0)            min=25    max=25    mde=25    dist={25: 4096}
LsNotHaltedCyc(0)        min=371   max=371   mde=371   dist={371: 4096}
LsIntTaken(0)            min=0     max=0     mde=0     dist={0: 4096}
LsSmiRx(0)               min=0     max=0     mde=0     dist={0: 4096}

# Test '4 nops'
ExRetCops(0)             min=27    max=27    mde=27    dist={27: 4096}
DeSrcOpDisp(3)           min=20    max=20    mde=20    dist={20: 4096}
ExRetInstr(0)            min=29    max=29    mde=29    dist={29: 4096}
LsNotHaltedCyc(0)        min=372   max=372   mde=372   dist={372: 4096}
LsIntTaken(0)            min=0     max=0     mde=0     dist={0: 4096}
LsSmiRx(0)               min=0     max=0     mde=0     dist={0: 4096}
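The min/max/mode/dist columns are simple to compute from a list of raw samples. The sketch below shows one way to do it; it is not lamina's implementation, and the `Summary` type and `summarize` function are hypothetical. When two values tie for the mode, this version picks the smaller one.

```rust
use std::collections::BTreeMap;

/// Summary statistics for one event across repeated runs.
#[derive(Debug, PartialEq)]
struct Summary {
    min: u64,
    max: u64,
    mode: u64,
    dist: BTreeMap<u64, usize>, // observed value -> number of runs
}

/// Summarize a slice of counter deltas; `None` if there are no samples.
fn summarize(samples: &[u64]) -> Option<Summary> {
    let mut dist: BTreeMap<u64, usize> = BTreeMap::new();
    for &s in samples {
        *dist.entry(s).or_insert(0) += 1;
    }
    // BTreeMap keys are sorted, so the ends are the min and max.
    let min = *dist.keys().next()?;
    let max = *dist.keys().next_back()?;
    // Walk in ascending order; a strict `>` keeps the smallest value on ties.
    let mut mode = min;
    let mut best = 0;
    for (&value, &count) in &dist {
        if count > best {
            best = count;
            mode = value;
        }
    }
    Some(Summary { min, max, mode, dist })
}
```

Feeding it 4095 samples of 16 and one sample of 12 reproduces the `DeSrcOpDisp` line from the floor test: min 12, max 16, mode 16.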

Afterwards, we can take the difference between modes here to obtain the actual number of events that are relevant to our measured code. We can see that for the 4 NOP instructions:

There were 4 retired ops (NOP is a single micro-op)
There were 4 dispatched ops
There were 4 retired instructions
There was 1 cycle not-in-halt (consistent with NOP having a throughput of at least 4 per cycle!)
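The subtraction is trivial, but for completeness here it is applied to the modes from the two result tables above. The helper name is made up; a signed result is used so that a noisy test that somehow lands below the floor shows up as a negative number instead of wrapping.

```rust
/// Events attributable to the measured body: test mode minus floor mode.
fn events_for_body(floor_mode: u64, test_mode: u64) -> i64 {
    test_mode as i64 - floor_mode as i64
}

/// (event name, floor mode, '4 nops' mode), copied from the output above.
const MODES: [(&str, u64, u64); 4] = [
    ("ExRetCops", 23, 27),
    ("DeSrcOpDisp", 16, 20),
    ("ExRetInstr", 25, 29),
    ("LsNotHaltedCyc", 371, 372),
];

/// Compute the per-event deltas for the table above.
fn deltas() -> Vec<(&'static str, i64)> {
    MODES
        .iter()
        .map(|&(name, floor, test)| (name, events_for_body(floor, test)))
        .collect()
}
```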

It's also worth mentioning that in this setup, our emitted code is always flushed from the cache with CLFLUSH before each test run.