Playing with PMCs on Zen 2 Machines
Anecdote about having fun with x86 performance-monitoring counters.
The last time I wrote, I talked about some of my experiments with the Instruction-based Sampling (IBS) functionality on Zen 2 machines. This time, we're talking about exploring something somewhat more general.
About PMCs
Like other machines implementing the x86-64 ISA, Zen 2 has performance-monitoring counter (PMC) registers that can be used to count different [micro]architectural events. On Zen 2, there's a set of six PMCs: each has an associated PERF_CTL MSR used to program the counter, and a PERF_CTR MSR that increments when events occur in the machine.
AMD publishes a Processor Programming Reference (PPR) guide for different machines - this is where the different types of events are defined. Each event is identified by a 12-bit event select, and many events are further qualified by an 8-bit unit mask that narrows down which sub-events are counted.
For example, the Processor Programming Reference for AMD Family 17h Model 71h defines the "Load-Store Dispatch" event PMCx029, which counts the number of micro-ops that are dispatched to the load-store units. Setting different bits in the unit mask allows the user to count particular types of load-store micro-ops: loads, stores, or combined load/stores.
Apart from specifying an event, PERF_CTL registers have different bits that can be used to control counting in other ways:
- Events can be edge-triggered or level-triggered
- You can fire interrupts when counters overflow
- You can count events from user and/or privileged threads
- You can count events from SVM guests and/or the SVM host
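To make the layout a bit more concrete, here's a minimal sketch of packing these fields into a PERF_CTL value. The struct and function names are made up for illustration (this is not lamina's API), and the bit positions and MSR numbers reflect my reading of the PPR, so double-check them against the PPR for your part before writing anything to an MSR.

```rust
/// A subset of the PERF_CTL fields (hypothetical type, for illustration only).
#[derive(Clone, Copy, Debug, Default)]
pub struct PerfCtl {
    pub event: u16,       // 12-bit event select, e.g. 0x029
    pub unit_mask: u8,    // 8-bit unit mask
    pub user: bool,       // count events in user mode (bit 16)
    pub os: bool,         // count events in privileged mode (bit 17)
    pub edge: bool,       // edge-triggered counting (bit 18)
    pub interrupt: bool,  // interrupt on counter overflow (bit 20)
    pub enable: bool,     // counter enable (bit 22)
    pub guest_only: bool, // count only in SVM guests (bit 40)
    pub host_only: bool,  // count only in the SVM host (bit 41)
}

impl PerfCtl {
    /// Pack the fields into a 64-bit PERF_CTL value.
    pub fn encode(&self) -> u64 {
        let event = (self.event & 0x0fff) as u64;
        (event & 0xff)                      // EventSelect[7:0]
            | ((self.unit_mask as u64) << 8)
            | ((self.user as u64) << 16)
            | ((self.os as u64) << 17)
            | ((self.edge as u64) << 18)
            | ((self.interrupt as u64) << 20)
            | ((self.enable as u64) << 22)
            | ((event >> 8) << 32)          // EventSelect[11:8]
            | ((self.guest_only as u64) << 40)
            | ((self.host_only as u64) << 41)
    }
}

/// MSR number of PERF_CTL for counter `n` (0..=5), assuming the
/// interleaved MSRC001_020x layout.
pub const fn perf_ctl_msr(n: u32) -> u32 {
    0xC001_0200 + 2 * n
}

/// MSR number of PERF_CTR for counter `n` (0..=5).
pub const fn perf_ctr_msr(n: u32) -> u32 {
    0xC001_0201 + 2 * n
}
```

For instance, counting PMCx029 with unit mask 0x03 in user mode only means setting `event: 0x029`, `unit_mask: 0x03`, `user: true`, and `enable: true`, and leaving everything else at its default.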
Taking Measurements with lamina
NOTE: All of this is crafted specifically for looking at my Ryzen 9 3950X on Linux 5.16. Also, you should know that these tools are somewhat hacky and very unsafe by design. You may want to proceed with caution if you decide to replicate experiments with these tools.
I wrote a simple Linux kernel module that lets me instrument the PMC registers with an ioctl(). The functionality is wrapped up in a Rust crate (see eigenform/lamina) that I'm using to keep all of my experiments in this space.
Here, I'm only really interested in measuring code that's running in userspace.
This means that we need some quick way to read the counters from userspace. Typically you'd read a counter with RDMSR, but that instruction is privileged. The RDPMC instruction is a shortcut for reading a counter, but it's only available in userspace when the PCE bit is set in CR4.
On Linux 5.16, I think this functionality already exists because of the perf subsystem, and you can apparently manage it with the following (as root):
# Allow RDPMC use from any userspace task
echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
# Restore the default (RDPMC only for tasks with an active perf event mapping)
echo 1 > /sys/bus/event_source/devices/cpu/rdpmc
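Before running any tests, it can be handy to check which mode the machine is currently in. Here's a small sketch that reads the same sysfs file. The type and function names are hypothetical, and the meaning of the values (0 = never, 1 = only with a perf event mapped, 2 = always) is my understanding of how the kernel treats this file.

```rust
use std::fs;

/// How the kernel currently treats RDPMC in userspace (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum RdpmcMode {
    Disabled, // "0": RDPMC always faults in userspace
    PerfOnly, // "1": only tasks with a mapped perf event may use it
    Always,   // "2": any userspace task may use it
}

/// Parse the contents of the `rdpmc` sysfs attribute.
pub fn parse_rdpmc_mode(s: &str) -> Option<RdpmcMode> {
    match s.trim() {
        "0" => Some(RdpmcMode::Disabled),
        "1" => Some(RdpmcMode::PerfOnly),
        "2" => Some(RdpmcMode::Always),
        _ => None,
    }
}

/// Read the current mode; returns None if the file is missing or unreadable.
pub fn current_rdpmc_mode() -> Option<RdpmcMode> {
    let s = fs::read_to_string("/sys/bus/event_source/devices/cpu/rdpmc").ok()?;
    parse_rdpmc_mode(&s)
}
```

A test harness could bail out early unless `current_rdpmc_mode()` returns `Some(RdpmcMode::Always)`, rather than taking a fault on the first RDPMC.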
Apart from that, there are actually a few other things that I'm doing to prepare my machine for handling tests (see this script):
- The kernel NMI watchdog is disabled (it typically consumes counter 0)
- Simultaneous multithreading (SMT) is disabled
- Frequency scaling features are disabled
- All of the measured code is pinned to a single hardware thread
Despite this, testing in Linux userspace will always be somewhat noisy, even when the window between two measurements is very small. There aren't really any guarantees that we haven't been rescheduled by the kernel for some reason. I'm mainly relying on repeated testing to compensate for this.
Here's an example of what I'm using to emit code wrapped in RDPMC measurements:
#[macro_export]
macro_rules! emit_rdpmc_test_all {
    ($($body:tt)*) => { {
        let mut asm = Assembler::<X64Relocation>::new().unwrap();
        // <emit a function prologue>
        // ...
        // Take some measurements
        dynasm!(asm
            ; mov rcx, 5 ; lfence ; rdpmc ; lfence ; sub r14, rax
            ; mov rcx, 4 ; lfence ; rdpmc ; lfence ; sub r13, rax
            ; mov rcx, 3 ; lfence ; rdpmc ; lfence ; sub r12, rax
            ; mov rcx, 2 ; lfence ; rdpmc ; lfence ; sub r11, rax
            ; mov rcx, 1 ; lfence ; rdpmc ; lfence ; sub r10, rax
            ; mov rcx, 0 ; lfence ; rdpmc ; lfence ; sub r9, rax
        );
        // Measured code goes here
        $($body)*
        // Take another set of measurements and compute the difference
        dynasm!(asm
            ; mov rcx, 0 ; lfence ; rdpmc ; lfence ; add r9, rax
            ; mov rcx, 1 ; lfence ; rdpmc ; lfence ; add r10, rax
            ; mov rcx, 2 ; lfence ; rdpmc ; lfence ; add r11, rax
            ; mov rcx, 3 ; lfence ; rdpmc ; lfence ; add r12, rax
            ; mov rcx, 4 ; lfence ; rdpmc ; lfence ; add r13, rax
            ; mov rcx, 5 ; lfence ; rdpmc ; lfence ; add r14, rax
        );
        // <save our results and emit a function epilogue>
        // ...
        asm.finalize().unwrap()
    } }
}
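Note that the LFENCE instructions are necessary: RDPMC is not a serializing instruction, so without them the counter read could be executed out-of-order with respect to the surrounding code. Here, we're using them to make sure that each RDPMC completes precisely at this point in the instruction stream.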
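For comparison, here's roughly what a single fenced read looks like when written with Rust inline assembly instead of dynasm. This is only a sketch (the function names are mine, not part of lamina): it assumes RDPMC returns the counter in EDX:EAX, assumes the counters are 48 bits wide, and will fault unless CR4.PCE is set.

```rust
/// Combine the EDX:EAX halves returned by RDPMC into a single value.
pub fn combine(lo: u32, hi: u32) -> u64 {
    ((hi as u64) << 32) | (lo as u64)
}

/// Difference between two readings of a 48-bit counter, tolerating a
/// single wraparound between them.
pub fn delta48(before: u64, after: u64) -> u64 {
    after.wrapping_sub(before) & ((1u64 << 48) - 1)
}

/// Read counter `index` with an LFENCE on either side of the RDPMC.
///
/// # Safety
/// Faults unless userspace RDPMC is enabled and `index` names a valid counter.
#[cfg(target_arch = "x86_64")]
pub unsafe fn rdpmc_fenced(index: u32) -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        core::arch::asm!(
            "lfence", // wait for older instructions to complete
            "rdpmc",  // counter selected by ECX, result in EDX:EAX
            "lfence", // keep younger instructions from starting early
            in("ecx") index,
            out("eax") lo,
            out("edx") hi,
            options(nostack),
        );
    }
    combine(lo, hi)
}
```

The macro above only keeps the low half in RAX, which is fine when the window between two reads is tiny. A general-purpose wrapper probably wants the full value, plus something like `delta48` to deal with wraparound.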
My library is mostly a collection of Rust macros and templates like this, plus the interfaces for collecting and analyzing measurements.
Working with runtime code generation macros here is especially nice because it makes things very readable (at least, to me?), and I'm pretty happy with it. It's basically the same setup as eigenform/ibstrace.
Here's an example of a test that measures six different events over four NOP instructions:
use lamina::*;
use lamina::ctx::PMCContext;
use lamina::pmc::PerfCtlDescriptor;
use lamina::event::Event;
fn main() -> Result<(), &'static str> {
    // The kernel module always instruments PMCs on core 0
    lamina::util::pin_to_core(0);
    // Context for interactions with the kernel module
    let mut ctx = PMCContext::new()?;
    // Declare a set of events that we want to measure
    let mut pmc = PerfCtlDescriptor::new()
        .set(0, Event::ExRetCops(0x00))      // Retired ops
        .set(1, Event::DeSrcOpDisp(0x03))    // Dispatched ops
        .set(2, Event::ExRetInstr(0x00))     // Retired instructions
        .set(3, Event::LsNotHaltedCyc(0x00)) // Cycles not in halt
        .set(4, Event::LsIntTaken(0x00))     // Interrupts taken
        .set(5, Event::LsSmiRx(0x00));       // SMI interrupts taken
    // Tell the kernel to enable a set of counters
    ctx.write(&pmc);
    // Take an empty baseline/floor measurement
    let code = emit_rdpmc_test_all!();
    let mut test = PMCTest::new("floor", &code, &pmc);
    test.run_iter(4096);
    test.print();
    // Measure something interesting
    let code = emit_rdpmc_test_all!(
        ; nop
        ; nop
        ; nop
        ; nop
    );
    let mut test = PMCTest::new("4 nops", &code, &pmc);
    test.run_iter(4096);
    test.print();
    Ok(())
}
Since we're always going to be dealing with a set of repeated tests, the results for each event have an associated minimum (the lowest number of observed events), a maximum (the highest number of observed events), and a mode (the most commonly observed number of events). In some cases, it's also useful to see the whole distribution of observed counter values. For example, running the test above yields the following results:
# Test 'floor'
ExRetCops(0) min=23 max=23 mde=23 dist={23: 4096}
DeSrcOpDisp(3) min=12 max=16 mde=16 dist={12: 1, 16: 4095}
ExRetInstr(0) min=25 max=25 mde=25 dist={25: 4096}
LsNotHaltedCyc(0) min=371 max=371 mde=371 dist={371: 4096}
LsIntTaken(0) min=0 max=0 mde=0 dist={0: 4096}
LsSmiRx(0) min=0 max=0 mde=0 dist={0: 4096}
# Test '4 nops'
ExRetCops(0) min=27 max=27 mde=27 dist={27: 4096}
DeSrcOpDisp(3) min=20 max=20 mde=20 dist={20: 4096}
ExRetInstr(0) min=29 max=29 mde=29 dist={29: 4096}
LsNotHaltedCyc(0) min=372 max=372 mde=372 dist={372: 4096}
LsIntTaken(0) min=0 max=0 mde=0 dist={0: 4096}
LsSmiRx(0) min=0 max=0 mde=0 dist={0: 4096}
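The summary printed for each event boils down to a histogram over the repeated runs. A minimal sketch of that reduction might look like the following. The names are hypothetical, and breaking ties toward the smaller value is my own choice, not necessarily what lamina does.

```rust
use std::collections::BTreeMap;

/// Summary of repeated observations of one counter (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Summary {
    pub min: u64,
    pub max: u64,
    pub mode: u64,
    pub dist: BTreeMap<u64, usize>, // observed value -> number of runs
}

/// Reduce a list of per-run counter deltas to min/max/mode/distribution.
/// Returns None when there are no samples.
pub fn summarize(samples: &[u64]) -> Option<Summary> {
    let mut dist: BTreeMap<u64, usize> = BTreeMap::new();
    for &s in samples {
        *dist.entry(s).or_insert(0) += 1;
    }
    // BTreeMap keys are sorted, so the first and last keys are min and max.
    let min = *dist.keys().next()?;
    let max = *dist.keys().next_back()?;
    // Most frequent value; ties are broken toward the smaller value.
    let mode = dist
        .iter()
        .max_by_key(|&(value, count)| (*count, std::cmp::Reverse(*value)))
        .map(|(value, _)| *value)?;
    Some(Summary { min, max, mode, dist })
}
```

Feeding this 4095 samples of 16 and a single sample of 12 gives the same summary as the DeSrcOpDisp line in the 'floor' test: min=12, max=16, mode=16.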
Afterwards, we can take the difference between the modes of the two tests to obtain the number of events that are actually attributable to our measured code. We can see that for the 4 NOP instructions:
- There were 4 retired ops (NOP is a single micro-op)
- There were 4 dispatched ops
- There were 4 retired instructions
- There was 1 cycle not-in-halt (so NOPs are going through at a rate of at least 4 per cycle!)
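The arithmetic here is just a per-counter subtraction of the floor from the test. As a tiny sketch (the function name is hypothetical), using the modes from the two runs above:

```rust
/// Subtract the baseline ("floor") mode from the test mode for each counter.
/// The result is signed because noise can occasionally push a test
/// measurement below the floor.
pub fn subtract_floor(floor: &[u64], test: &[u64]) -> Vec<i64> {
    floor
        .iter()
        .zip(test)
        .map(|(&f, &t)| t as i64 - f as i64)
        .collect()
}
```

With the floor modes `[23, 16, 25, 371]` (retired ops, dispatched ops, retired instructions, cycles) and the '4 nops' modes `[27, 20, 29, 372]`, this returns `[4, 4, 4, 1]`.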
It's also worth mentioning that in this setup, our emitted code is always flushed from the cache with CLFLUSH before each test run.
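For reference, flushing a buffer of emitted code from userspace can be sketched with the CLFLUSH intrinsic from `core::arch`. This is not lamina's actual implementation: the function names are made up, and the 64-byte cache line size is an assumption that I believe holds on Zen 2.

```rust
/// Assumed cache line size in bytes.
const CACHE_LINE: usize = 64;

/// Number of cache lines touched by `len` bytes starting at address `addr`.
pub fn cache_lines(addr: usize, len: usize) -> usize {
    if len == 0 {
        return 0;
    }
    let first = addr / CACHE_LINE;
    let last = (addr + len - 1) / CACHE_LINE;
    last - first + 1
}

/// Flush every cache line backing `buf`, then fence so the flushes are
/// ordered before whatever runs next.
#[cfg(target_arch = "x86_64")]
pub fn flush_range(buf: &[u8]) {
    use core::arch::x86_64::{_mm_clflush, _mm_mfence};
    if buf.is_empty() {
        return;
    }
    let base = buf.as_ptr();
    unsafe {
        // One flush per 64-byte step covers every line except possibly
        // the last one when the buffer isn't line-aligned...
        for off in (0..buf.len()).step_by(CACHE_LINE) {
            _mm_clflush(base.add(off));
        }
        // ...so flush the line holding the final byte as well.
        _mm_clflush(base.add(buf.len() - 1));
        _mm_mfence();
    }
}
```

Flushing the final byte separately is slightly redundant for aligned buffers, but it keeps the loop simple and an extra CLFLUSH on an already-flushed line is harmless.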