μarchonomicon: Return Instructions

Table of Contents

  1. Setting up the Stack
  2. Motivating Call and Return
    1. Indirection
    2. Example: RISC-V
    3. Example: ARM
    4. Example: x86
  3. Predicting a Return Instruction
  4. Mispredicting a Return Instruction: Capacity
  5. Mispredicting a Return Instruction: Breaking a Promise

Some notes and impressions about the properties of return instructions.

Setting up the Stack

Somewhere along the road, we decided that it's useful to represent programs as "a stack of different procedures." This has made a lot of people very happy and has been widely regarded as a good move. An ISA usually makes this easier for us by explicitly defining "call" and "return" instructions for entering and exiting procedures. But how are these implemented? And why?

Most of this is just an excuse to talk about branch predictor behavior [yet again] - but don't worry, it ends up being extremely simple, and there's only a little bit of context to deal with first.

Motivating Call and Return

Let's assume that we're dealing with some RISC-like pseudocode. Imagine we have a program like this, where all of the jmp targets can be known statically before runtime. All of these can be regarded as direct changes to control-flow: each jmp has a single unambiguous target address.

start:
    jmp foo_1
_ret_1:
    ...
    jmp foo_2
_ret_2:
    ...
    halt
foo_1:
    mov r0, #0xdead 
    jmp _ret_1
foo_2:
    mov r0, #0xdead 
    jmp _ret_2

This program uses the same procedure twice. It's kind of like the expected behavior of "call and return", but it has some duplicated code that we want to get rid of. How are we supposed to do this exactly? We can't just define a single instance of foo and expect to continue using jmp like before:

start:
    jmp foo
    ...
    jmp foo
    ...
    halt
foo:
    mov r0, #0xdead 
    jmp ??? // Uh oh, where are we supposed to return to?

Before, we had two copies of foo: one for each time (or "each distinct point in the program") where it needed to be used. Now, we're left with no way to distinguish when foo was called, and with no way to get back to where we came from! Notice that all of the ambiguity here occurs when we need to return from a procedure back to the caller.

From a purely "static" point of view, we simply don't have enough information to correctly express this program, and our version of the jmp instruction here isn't sufficient for doing this. In order to return from foo, we'd need the machine to have some way of remembering the path that we took into foo during runtime.

Indirection

As far as I can tell, an ISA really only needs two things to give us this behavior:

  1. A "call" instruction, which jumps to the address of some procedure and simultaneously stores the address of the next-sequential instruction somewhere (this is usually called the "return address")
  2. A "return" instruction, which loads a return address from somewhere and jumps to it

In this sense, a return instruction is really just an indirect change in control-flow with a special purpose: return to the instruction that comes after the most-recent previous procedure call.

Indirect branches are very special because they do not necessarily have a single fixed target address: instead, they let us write programs that dynamically load a target address from some part of the architectural ("programmer-visible") state.

Different instruction sets specify this behavior in slightly different ways. Here are a couple of easy examples to look at while you stew on this.

Example: RISC-V

I think this is most obvious in the case of the RISC-V ISA:

  • "Call" behavior is captured by JAL (relative jump and link) and JALR (indirect jump with rs1 and link), which both save the return address to the register specified by rd

  • "Return" behavior is a special case of JALR where rs1 is the register containing the return address, and rd is left unused (set to the zero register x0).

Rearticulating our pseudocode from before into RISC-V:

start:
    jal x1, foo // Put the return address in x1
    ...
    jal x1, foo // Put the return address in x1
    ...
foo:
    ...
    jalr x0, 0(x1) // Return to the address in x1

This is enough to give us the "call and return" behavior that we expect. In RISC-V, the return instruction is very obviously indistinguishable from an indirect branch.

It's also worth mentioning that the RISC-V specification suggests that the programmer can use x1 as a "link register" for this purpose, but the JAL and JALR instructions do not do this for us automatically. In this case, the idea of a dedicated link register is a convention that isn't actually embodied in the behavior of any instruction defined by the ISA.
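
In practice, assemblers that follow the standard calling convention alias x1 as ra and provide call and ret pseudoinstructions on top of JAL and JALR. A rough sketch (the exact expansions can vary, especially for distant targets):

start:
    call foo       // Typically expands to 'jal ra, foo' (or 'auipc' plus 'jalr' for far targets)
    ...
foo:
    ...
    ret            // Typically expands to 'jalr x0, 0(ra)'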

Example: ARM

On the other hand, the ARM ISA goes one step further and actually enshrines the idea of a link register by providing instructions which explicitly use it:

  • The BL and BLR instructions automatically save the return address to the link register
  • The RET instruction is an indirect jump where:
    • The encoding is distinct from BLR and does not save a return address
    • The source register operand defaults to the link register

This lifts some of the burden in the case of "leaf functions." Since these instructions always use the link register, the programmer isn't always forced to worry about where the return address will be.

start:
    bl foo // Put the return address in the link register
    ...
    bl foo // Put the return address in the link register
    ...
foo: 
    ...
    ret    // Return to the address in the link register

However, in ARM and RISC-V, the idioms for call and return don't necessarily help programmers in cases where procedure calls are expected to be nested. In those cases, we still need to keep track of the "call stack" of previous return addresses. This typically involves copying the previous return address somewhere else: usually to some location in a remote memory.
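
For example, a non-leaf AArch64 function typically spills the link register (alongside the frame pointer) in its prologue and reloads it in its epilogue, so that an inner bl doesn't clobber it. A rough sketch, with hypothetical function names:

outer:
    stp x29, x30, [sp, #-16]!  // Save the frame pointer and link register on the stack
    bl  inner                  // Overwrites the link register (x30) with a new return address
    ...
    ldp x29, x30, [sp], #16    // Restore the original frame pointer and link register
    ret                        // Return to outer's caller
inner:
    ...
    ret                        // Leaf function: the link register is still intact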

Example: x86

The x86 approach is slightly different in that the return address is not kept in a general-purpose register at all. Instead, return addresses are stored in main memory, at the location given by the stack pointer rsp.

  • CALL decrements the stack pointer, and then stores the return address at the location given by the new stack pointer
  • RET loads the return address from the location given by the stack pointer, and then increments the stack pointer

Instead of expecting the programmer to provide a GPR, the x86 ISA assumes that the return address will always be recovered by loading the value from the top of the call stack [in some remote memory].

start:
    call foo // Put the return address in [rsp]
    ...
    call foo // Put the return address in [rsp]
    ...
foo:
    ...
    ret      // Return to the address in [rsp]

This makes it somewhat simpler to deal with leaf functions and nested calls, so long as the programmer is able to faithfully maintain the stack and the value of the stack pointer.
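
For instance, with nested calls, each call pushes its own return address on top of the previous one, and each ret pops the most recent: as long as rsp points at the right place when a ret executes, everything unwinds in order. A sketch with hypothetical labels:

start:
    call outer   // Push _ret_0, jump to outer
_ret_0:
    ...
outer:
    ...
    call inner   // Push _ret_1 on top of _ret_0
_ret_1:
    ...
    ret          // Pop _ret_0, return to _ret_0 in start
inner:
    ...
    ret          // Pop _ret_1, return to _ret_1 in outer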

Predicting a Return Instruction

Unsurprisingly, the typical scheme for predicting return instructions is implemented with a stack-like structure for storing predicted return addresses: this is usually called the "Return Stack Buffer" (RSB) or "Return Address Stack" (RAS).

A RAS is maintained [transparently by hardware] in the same way that we manage an architectural stack, although the capacity of the RAS is comparatively very small:

  • When a call is encountered, push the return address on top of the RAS
  • When a return is encountered, pop the most-recent entry to make a prediction

This takes advantage of the assumption that a meaningful return address always links a return instruction to a previous call instruction. In this sense, the presence of a return instruction always constitutes a kind of "hint" to the machine.

Now assuming that the programmer has succeeded at keeping track of return addresses, predictions from the RAS should always correspond to the correct architectural return address.
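
As a sketch, here is the two-call program from before with both return addresses labeled explicitly:

start:
    call foo     // Push ret_addr_1 onto the RAS (and onto the architectural stack)
ret_addr_1:
    ...
    call foo     // Push ret_addr_2 onto the RAS (and onto the architectural stack)
ret_addr_2:
    ...
foo:
    ...
    ret          // Pop the most-recent RAS entry to predict the return target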

This eliminates the latency associated with needing to resolve the return address, and it means that the machine can begin dealing with the instructions at ret_addr_2 [and beyond!] ahead of time.

As a result of this, any reasonable machine should always predict that the target of a return is the next-sequential address following the most-recent, previous call instruction.

The cost of not making this assumption is pretty high, and basically amounts to stalling the pipeline for the completion of each and every return instruction. Branch prediction schemes for other kinds of branches can become pretty complicated, but the case for return instructions is comparatively very simple and very effective.

Mispredicting a Return Instruction: Capacity

Since the capacity of a RAS implementation is very limited, it's very easy to imagine situations where the RAS will necessarily fail at keeping track of a return address: this always happens in cases where the depth of nested procedure calls exceeds the capacity of the RAS.

Imagine we have a toy machine with only three RAS entries, and we'd like to run a program with four consecutive calls followed by four consecutive returns:
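
A sketch of what this might look like, assuming four nested procedures foo, bar, baz, and leaf (the intermediate names are just for illustration), with the state of the RAS annotated at each step:

    call foo     // RAS: [ foo_ret ]
    call bar     // RAS: [ foo_ret, bar_ret ]
    call baz     // RAS: [ foo_ret, bar_ret, baz_ret ]
    call leaf    // RAS: [ bar_ret, baz_ret, leaf_ret ] -- foo_ret is discarded to make room
    ret          // Pop leaf_ret -- RAS: [ bar_ret, baz_ret ]
    ret          // Pop baz_ret  -- RAS: [ bar_ret ]
    ret          // Pop bar_ret  -- RAS: [ ]
    ret          // RAS is empty -- no meaningful prediction available for foo_ret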

When we reach the fourth call, the return address foo_ret ends up being discarded from the RAS in order to make room for leaf_ret. By the time we reach the fourth return, we find that the RAS is empty and cannot be used to make a meaningful prediction. The fourth call amounts to an overflow condition in the RAS, and the fourth return is forced to deal with the consequences of an underflow condition.

This picture assumes that storage for a RAS is distinct from the storage associated with a branch target buffer (BTB), which is usually much larger and used to cache target addresses for many different kinds of branches.

In addition to relying on a RAS, you might also decide to treat return instructions like any other indirect branch and store their target addresses in a BTB after they've been resolved. This is one way of handling situations where the RAS is empty and cannot provide a prediction.

This happens to be part of the mechanism behind Retbleed: for sufficiently deep call stacks in privileged code, return instructions may also be subject to "branch target injection." If the BTB entry associated with a return instruction can be influenced by unprivileged code, this amounts to control over speculation into privileged code (which an attacker might try to use as a gadget for leaking information).

Mispredicting a Return Instruction: Breaking a Promise

There's another idea which is kind of obvious, but I think it's still interesting: this predictor has a "blind spot," and it's trivially easy for a program to defeat this branch prediction scheme. If a program can change the value of the return address in between call and return, that return instruction must always end up being mispredicted.
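
As a sketch in x86-style code: a procedure can simply overwrite the return address that call pushed onto the stack before returning (the label elsewhere here is hypothetical). The ret still does the architecturally correct thing, but the RAS has no way of knowing about the substitution:

start:
    call foo                    // Push _ret; the RAS also records _ret
_ret:
    ...
foo:
    lea rax, [rip + elsewhere]  // Compute some other target address
    mov [rsp], rax              // Overwrite the return address on top of the stack
    ret                         // Architecturally jumps to 'elsewhere', but the RAS predicts _ret
elsewhere:
    ...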

This is ultimately a consequence of the fact that, at least in all of the machines we've considered here, return addresses always end up being part of the architectural state in some way or another. For whatever reason, it always ends up being the programmer's responsibility to deal with storage for return addresses, and some machines can only partially abstract this away from us, if at all.

I get the impression that this is a projection of some "deep truth" about how we've collectively decided to design general-purpose computers, but I'm not sure that I have a single coherent package for it.

Please consider these hand-wavy, half-baked thoughts instead (in no particular order):

  • The utility behind an indirect branch is that it represents a place in the program where we'd like to relax the distinction between control and data.

  • We ideally want machines to handle arbitrarily deep stacks of procedures, but when we end up implementing this, all of the memories large enough for this task are necessarily very far away from the core.

  • Instruction sets in general-purpose machines specify call and return instructions in a pretty relaxed way; the use of call and return really only constitutes a promise to the machine that a program is free to (intentionally or unintentionally) break at any time.

  • Can we imagine a machine where the use of call and return is more like a contract, and less like a promise? Is this something we might want?