reflexive.space

Classic Stack-Smashing

Table of Contents

  1. Buffer-Overflow Bugs
  2. Controlling Execution
  3. Non-Executable Stacks
  4. NX Stack: Implementation
  5. Randomization
  6. Address-Space Randomization: Implementation
  7. Stack Canaries
  8. Stack Canaries: Implementation
  9. Topics for Next Time

This is an example of a classic stack-smashing attack, and some discussion about associated mitigations in Linux.

Buffer-Overflow Bugs

First, we'll write some simple vulnerable code and compile it:

/* ex.c */
#include <unistd.h>

int foo() {
	char buffer[32];
	read(0, buffer, 512);
	return 0;
}

int main() {
	foo();
	return 0;
}

Here, the foo() function allocates a 32-byte buffer, but then attempts to read up to 512 bytes into it. This is a canonical buffer-overflow bug.

I find that Python is pretty useful for building nasty input in situations like this, so let's generate some to fill up this buffer:

""" gen.py """
#!/usr/bin/env python
with open('input', 'wb') as f:
    f.write(b'\x41'*32)

Let's also look at the stack with gdb and strategize a little. We'll set a breakpoint right before we return from foo() into the main function too. It's important to note here that, as we add data after our 32 bytes of 0x41, our input is going to grow downwards in this representation.

(gdb) break *foo+30
Breakpoint 1 at 0x400779
(gdb) run < input
...
Breakpoint 1, 0x0000000000400779 in foo ()
(gdb) x/10xg $rsp
0x7fffffffe870: 0x4141414141414141      0x4141414141414141
0x7fffffffe880: 0x4141414141414141      0x4141414141414141
0x7fffffffe890: 0x00007fffffffe8b0      0x0000000000400799
0x7fffffffe8a0: 0x00007fffffffe998      0x0000000100000000
0x7fffffffe8b0: 0x00000000004007a0      0x00007ffff7814511

Now, the value immediately after our input at 0x7fffffffe890 represents the saved base pointer of the previous frame (in this case, the base pointer for the main function). This is not particularly interesting to us - however, after this is the value of the return address at 0x7fffffffe898. When foo() returns, our processor will set its instruction pointer -- the %rip register on 64-bit x86 platforms -- to the value of the return address and continue execution in the main function.

Because we can potentially write 512-32 = 480 bytes past the end of the buffer, the bug affords us control over this return address, meaning that we have the ability to break the normal flow of execution within the program.

Controlling Execution

Let's build input again, this time adding some bytes to write over the saved %rbp and the return address:

#!/usr/bin/env python
from struct import pack
with open('input', 'wb') as f: 
    f.write(b'\x41'*32 + \
	pack("<Q", 0x7fffffffe8b0) + \
	pack("<Q", 0xdeadbeef))

The calls to struct.pack here are just for organizing our bytes properly -- < means "little-endian ordering" and Q means "an unsigned 8-byte integer".
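For instance, packing the return-address value from above yields the exact little-endian byte string that will land on the stack:

```python
from struct import pack

# "<Q" = little-endian, unsigned 8-byte integer: the least-significant
# byte comes first, which is how x86-64 lays values out in memory.
payload = pack("<Q", 0xdeadbeef)
print(payload)  # b'\xef\xbe\xad\xde\x00\x00\x00\x00'
```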

We append 8 bytes to overwrite the saved base pointer of the previous frame (here just using the address from the gdb output, although this doesn't necessarily matter). Then, we'll write over the lower 4 bytes of the return address with 0xdeadbeef. Here's what it looks like when we step through execution in gdb again:

(gdb) run < input
...
Breakpoint 1, 0x0000000000400779 in foo ()
(gdb) x/10xg $rsp
0x7fffffffe870: 0x4141414141414141      0x4141414141414141
0x7fffffffe880: 0x4141414141414141      0x4141414141414141
0x7fffffffe890: 0x00007fffffffe8b0      0x00000000deadbeef
0x7fffffffe8a0: 0x00007fffffffe998      0x0000000100000000
0x7fffffffe8b0: 0x00000000004007a0      0x00007ffff7814511
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00000000deadbeef in ?? ()

Our application throws SIGSEGV when execution attempts to return to the main function. There's no mapped (let alone executable) page at address 0xdeadbeef, so the instruction fetch faults. Let's disassemble the main function for a moment while we reconsider our choice:

0x0000000000400780 <+0>:     55      push   %rbp             
0x0000000000400781 <+1>:     48 89 e5        mov    %rsp,%rbp
0x0000000000400784 <+4>:     48 83 ec 10     sub    $0x10,%rsp
0x0000000000400788 <+8>:     89 7d fc        mov    %edi,-0x4(%rbp)
0x000000000040078b <+11>:    48 89 75 f0     mov    %rsi,-0x10(%rbp)
0x000000000040078f <+15>:    b8 00 00 00 00  mov    $0x0,%eax
0x0000000000400794 <+20>:    e8 c2 ff ff ff  callq  0x40075b <foo>
0x0000000000400799 <+25>:    b8 00 00 00 00  mov    $0x0,%eax
0x000000000040079e <+30>:    c9      leaveq 
0x000000000040079f <+31>:    c3      retq   

What if, instead of writing 0xdeadbeef into %rip, we wrote the address of some other code in our binary? Let's try overwriting %rip with 0x40078f, which is the address in the .text section of our program right before we call foo() in the main function! Here's what execution looks like:

(gdb) run < input
...
Breakpoint 1, 0x0000000000400779 in foo ()
(gdb) x/10xg $rsp
0x7fffffffe870: 0x4141414141414141      0x4141414141414141
0x7fffffffe880: 0x4141414141414141      0x4141414141414141
0x7fffffffe890: 0x00007fffffffe8b0      0x000000000040078f
0x7fffffffe8a0: 0x00007fffffffe998      0x0000000100000000
0x7fffffffe8b0: 0x00000000004007a0      0x00007ffff7814511
(gdb) step
Single stepping until exit from function foo,
which has no line number information.
0x000000000040078f in main ()
(gdb) step
Single stepping until exit from function main,
which has no line number information.

Breakpoint 1, 0x0000000000400779 in foo ()
(gdb) cont
Continuing.
[Inferior 1 (process 5069) exited normally]

Notice how we entered our *foo+30 breakpoint twice! We wrote over %rip with the address of the instruction just before the call to foo() in main. Upon returning from foo(), we just jumped backwards in the code to call foo() again instead of continuing on with the main function.

Now, you might very reasonably start wondering:

"If we control the instruction pointer and have the ability to write all over the stack, what's preventing us from just writing machine code at the beginning of that buffer and then executing it by writing over the return address with the address of the buffer?"

On modern platforms, there are actually a couple of different things preventing us from just writing code on the stack and executing it. Here's a [probably non-exhaustive] list of the reasons why I haven't attempted this in the example above:

Non-Executable Stacks

First of all, modern processors have a feature which allows an operating system to mark particular pages of virtual memory as non-executable. For instance, x86-64 reserves the highest-order bit of each page-table entry for use as the no-execute (NX) bit:

/* ~ arch/x86/includes/asm/pgtable_types.h @ linux-4.11.3 */ 
...
#define _PAGE_BIT_PRESENT	0	/* is present */
#define _PAGE_BIT_RW		1	/* writeable */
#define _PAGE_BIT_USER		2	/* userspace addressable */
#define _PAGE_BIT_PWT		3	/* page write through */
#define _PAGE_BIT_PCD		4	/* page cache disabled */
...				...	...
#define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */

Modern GCC (really, the linker) marks the stack non-executable by default when compiling and linking your application. One place you can see this is in the /proc/$pid/maps entries, or in objdump output:

$ cat /proc/$(pgrep ex1)/maps|grep stack
7ffd69ae9000-7ffd69b0a000 rw-p 00000000 00:00 0	 	 [stack]

$ objdump -p bin/ex1|grep -A1 STACK
STACK off    0x0000000000000000 vaddr 0x0000000000000000 [...]
      filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-

You can control this with the -z [no]execstack linker flag (which GCC passes through to ld). Compare the above output with this version of ex1.c compiled with an executable stack:

$ cat /proc/$(pgrep ex1-exec)/maps|grep stack
7fff07f11000-7fff07f32000 rwxp 00000000 00:00 0          [stack]

$ objdump -p bin/ex1-exec|grep -A1 STACK
STACK off    0x0000000000000000 vaddr 0x0000000000000000 [...]
      filesz 0x0000000000000000 memsz 0x0000000000000000 flags rwx
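We can also do the maps check programmatically. Here's a small sketch (Linux-only, since it relies on /proc) that reads the current process's own maps file and reports the permissions on the stack VMA:

```python
def stack_permissions(maps_path="/proc/self/maps"):
    """Return the permission string (e.g. 'rw-p') of the [stack] VMA."""
    with open(maps_path) as f:
        for line in f:
            # Each line: address-range perms offset dev inode [pathname]
            fields = line.split()
            if fields and fields[-1] == "[stack]":
                return fields[1]
    return None

print(stack_permissions())
```

Running this under a stock CPython should print something like rw-p, matching the objdump output above.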

NX Stack: Implementation

The objdump output comes from reading the binary's program header table. An ELF file contains instructions for how to build the image of some process in memory -- each program header describes a segment, and contains a p_flags field which describes the associated permissions. load_elf_binary() in fs/binfmt_elf.c is the kernel handler for ELF binaries. This appears to be the code responsible for reading p_flags and determining whether or not the stack will be executable:

/* ~ fs/binfmt_elf.c:797 @ linux-4.11.3 */
...
	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++)
		switch (elf_ppnt->p_type) {
		case PT_GNU_STACK:
			if (elf_ppnt->p_flags & PF_X)
				executable_stack = EXSTACK_ENABLE_X;
			else
				executable_stack = EXSTACK_DISABLE_X;
			break;
...
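The same PT_GNU_STACK check is easy to reproduce from userspace. Here's a rough sketch that walks the ELF64 program header table by hand (offsets hard-coded for 64-bit little-endian ELF) and reports the stack flags the kernel would see:

```python
import sys
from struct import unpack_from

PT_GNU_STACK = 0x6474e551
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4

def gnu_stack_flags(path):
    """Return the PT_GNU_STACK permission string, or None if absent."""
    with open(path, "rb") as f:
        elf = f.read()
    if elf[:4] != b"\x7fELF" or elf[4] != 2:     # ELF64 little-endian only
        return None
    e_phoff = unpack_from("<Q", elf, 0x20)[0]     # program header table offset
    e_phentsize, e_phnum = unpack_from("<HH", elf, 0x36)
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        p_type, p_flags = unpack_from("<II", elf, base)
        if p_type == PT_GNU_STACK:
            return ("r" if p_flags & PF_R else "-") + \
                   ("w" if p_flags & PF_W else "-") + \
                   ("x" if p_flags & PF_X else "-")
    return None

print(gnu_stack_flags(sys.executable))
```

On a typical Linux system this prints rw- for binaries linked with a non-executable stack, just like the objdump output.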

Eventually, this code calls the setup_arg_pages() function in fs/exec.c which passes this information into a call to mprotect_fixup().

/* ~ fs/exec.c:718 @ linux-4.11.3 */
...
	/*
	 * Adjust stack execute permissions; explicitly enable for
	 * EXSTACK_ENABLE_X, disable for EXSTACK_DISABLE_X and leave alone
	 * (arch default) otherwise.
	 */
	if (unlikely(executable_stack == EXSTACK_ENABLE_X))
		vm_flags |= VM_EXEC;
	else if (executable_stack == EXSTACK_DISABLE_X)
		vm_flags &= ~VM_EXEC;
	vm_flags |= mm->def_flags;
	vm_flags |= VM_STACK_INCOMPLETE_SETUP;

	ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
			vm_flags);
...

Ultimately, contiguous regions of memory are described by vm_area_struct VMA objects in the kernel. mprotect_fixup() is the actual operation responsible for setting permissions on the stack's VMA here with vma->vm_flags = newflags.

The permissions on a VMA do determine the permissions on the underlying pages: the kernel translates vm_flags into a pgprot_t (see vm_get_page_prot()), and those protection bits are applied to the page-table entries as pages in the region are faulted in or updated.

Randomization

On modern platforms, base addresses for memory allocations are randomized by an operating system feature called ASLR (Address Space Layout Randomization). Observe this code:

/* ex2.c */
#include <unistd.h>
#include <stdio.h>

int foo() {
	char buffer[512];
	printf("buffer is at %p\n", &buffer);
	read(0, buffer, 2048);
	return 0;
}

int main() {
	foo();
	return 0;
}

If we run this a couple of times, you'll notice that the address of char buffer[512] is not fixed in any sense:

$ for i in {0..10..1}; do ./ex2 <<<"foo"; done
buffer is at 0x7fff0d7a64e0
buffer is at 0x7ffccd9451e0
buffer is at 0x7ffd2b1bd210
buffer is at 0x7ffd92dca5b0
buffer is at 0x7ffe4df69740
buffer is at 0x7ffc95787d00
buffer is at 0x7ffc9a42a860
buffer is at 0x7ffee99e5110
buffer is at 0x7ffe10df6430
buffer is at 0x7fff625094d0
buffer is at 0x7ffe95630e10

Say that we had written a bunch of nasty machine code onto the stack, had control over the instruction pointer, and even had an executable stack to work with. In order to point %rip at some machine code in our input, we'd need to know the starting address of our input beforehand.

The position of our program's .text section in memory is perfectly deterministic (this binary isn't position-independent), which is why breaking the flow of execution was easy in the first example. Here, it's not so simple. The most obvious strategy would be to guess addresses at random, but with the entropy in a 64-bit stack address, that's not a winning proposition.
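We can observe the same effect without the C helper. This sketch spawns a few child interpreters, each of which prints the address of a freshly allocated ctypes buffer; with ASLR enabled, the addresses should differ from run to run:

```python
import subprocess
import sys

# Each child allocates a 512-byte buffer and prints its address.
child = ("import ctypes; buf = ctypes.create_string_buffer(512); "
         "print(hex(ctypes.addressof(buf)))")

addrs = [
    subprocess.run([sys.executable, "-c", child],
                   capture_output=True, text=True).stdout.strip()
    for _ in range(5)
]
for a in addrs:
    print(a)
```

(If randomize_va_space is disabled, or the children are run under a tool like setarch -R, the addresses will repeat.)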

At this point, you might consider that there are other sections of our program to target that are executable. [Un]fortunately, the bases of those sections are probably randomized too! For example, shared libraries:

$ for i in {0..10..1}; do ldd ex2 | grep libc; done
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f154ff10000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007ff3db6e5000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fe8e9022000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fa4f4469000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fb632b44000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f5f666cd000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fe732fff000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f67170b0000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f5dbfdd7000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fe710a38000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f3e710b4000)

Address-Space Randomization: Implementation

Again, we're back in fs/binfmt_elf.c. Looks like the handler for ELF binaries (load_elf_binary(), in case you forgot) sets current->flags accordingly for the new process. The PF_RANDOMIZE flag designates that the virtual address space of the process will be randomized when it is initialized.

/* ~ fs/binfmt_elf.c:867 @ linux-4.11.3 */
...
	if(!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
		current->flags |= PF_RANDOMIZE;
	
	setup_new_exec(bprm);
	...

In the same file, randomize_stack_top() is called to suitably randomize the top of the stack (immediately before calling setup_arg_pages() as described in the previous section -- this function also adds some randomness). Early in setup_new_exec(), the mm->mmap_legacy_base field in the memory descriptor is randomized by calling arch_pick_mmap_layout() in arch/x86/mm/mmap.c. It also looks like that function randomizes the gaps between allocations.

arch_mmap_rnd() is the actual function that generates random numbers for the offsets with get_random_long().

Immediately after this, load_elf_binary() loops through and maps the rest of the segments into memory, adding random offsets to the base addresses. This piece appears to randomize the load bias for position-independent (ET_DYN) objects:

/* ~ fs/binfmt_elf.c:873 @ linux-4.11.3 */
...
	/* Now we do a little grungy work by mmapping the ELF image into
	   the correct location in memory. */
	for(i = 0, elf_ppnt = elf_phdata;
	    i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {

	    ...
		} else if (loc->elf_ex.e_type == ET_DYN) {
			...
			load_bias = ELF_ET_DYN_BASE - vaddr;
			if (current->flags & PF_RANDOMIZE)
				load_bias += arch_mmap_rnd();
			...

... and later, the data segment is randomized too via arch_randomize_brk() in arch/x86/kernel/process.c (which is basically just a wrapper around randomize_page() in drivers/char/random.c):

/* ~ fs/binfmt_elf.c:1077 @ linux-4.11.3 */
...
	if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) {
		current->mm->brk = current->mm->start_brk =
			arch_randomize_brk(current->mm);
...

In case you're unfamiliar, randomize_va_space here refers to the name of the sysctl parameter used to control address space randomization (which is usually turned on to some degree by default).

/* ~ Documentation/sysctl/kernel.txt */
...
randomize_va_space:

This option can be used to select the type of process address
space randomization that is used in the system, for architectures
that support this feature.

0 - Turn the process address space randomization off.  This is the
    default for architectures that do not support this feature anyways,
    and kernels that are booted with the "norandmaps" parameter.

1 - Make the addresses of mmap base, stack and VDSO page randomized.
    This, among other things, implies that shared libraries will be
    loaded to random addresses.  Also for PIE-linked binaries, the
    location of code start is randomized.  This is the default if the
    CONFIG_COMPAT_BRK option is enabled.

2 - Additionally enable heap randomization.  This is the default if
    CONFIG_COMPAT_BRK is disabled.

    There are a few legacy applications out there (such as some ancient
    versions of libc.so.5 from 1996) that assume that brk area starts
    just after the end of the code+bss.  These applications break when
    start of the brk area is randomized.  There are however no known
    non-legacy applications that would be broken this way, so for most
    systems it is safe to choose full randomization.

    Systems with ancient and/or broken binaries should be configured
    with CONFIG_COMPAT_BRK enabled, which excludes the heap from process
    address space randomization.
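You can query the current setting from userspace; here's a quick sketch (Linux-only, since the sysctl file won't exist elsewhere):

```python
from pathlib import Path

path = Path("/proc/sys/kernel/randomize_va_space")
meaning = {"0": "off", "1": "stack/mmap/VDSO randomized",
           "2": "full, including heap"}

if path.exists():
    level = path.read_text().strip()
    msg = f"randomize_va_space = {level} ({meaning.get(level, 'unknown')})"
else:
    msg = "randomize_va_space sysctl not available on this system"
print(msg)
```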

Stack Canaries

One simple way of mitigating the threat of stack-smashing is to compile your application with GCC's -fstack-protector flag:

$ make ex1-fixed
gcc ex1.c -o ../bin/ex1-stack-protector -fstack-protector
 
$ ./bin/ex1-stack-protector < input
*** stack smashing detected ***: ./bin/ex1-stack-protector terminated

...
Aborted (core dumped)

Our program just aborts now, but how is this accomplished? Let's see what gdb has to say about this after we feed it some harmless input (only 32 bytes):

$ gdb -batch -ex 'file bin/ex1-stack-protector' -ex 'break *foo+45' \
> -ex 'run < input' -ex 'x/10xg $rsp' -ex 'disas main'
Breakpoint 1 at 0x400593

Breakpoint 1, 0x0000000000400593 in foo ()
0x7fffffffdf70:	0x4141414141414141	0x4141414141414141
0x7fffffffdf80:	0x4141414141414141	0x4141414141414141
0x7fffffffdf90:	0x00000000004005d0	0xa719e95716303300
0x7fffffffdfa0:	0x00007fffffffdfb0	0x00000000004005bc
0x7fffffffdfb0:	0x00000000004005d0	0x00007ffff7a56511
Dump of assembler code for function main:
   0x00000000004005ae <+0>:	push   %rbp
   0x00000000004005af <+1>:	mov    %rsp,%rbp
   0x00000000004005b2 <+4>:	mov    $0x0,%eax
   0x00000000004005b7 <+9>:	callq  0x400566 <foo>
   0x00000000004005bc <+14>:	mov    $0x0,%eax
   0x00000000004005c1 <+19>:	pop    %rbp
   0x00000000004005c2 <+20>:	retq   
End of assembler dump.

It looks like the boundary between the two frames has changed slightly! Here, 0x7fffffffdfa8 contains the return address which points %rip back into the main function. GCC has padded the space between our buffer and the previous frame with an extra 8-byte value.

The 8-byte value 0xa719e95716303300 sitting before the saved base pointer and return address is called the stack canary. This is the mechanism that -fstack-protector uses to detect stack-smashing. In order to see how, we'll write 64 bytes past the start of the buffer this time and look at the disassembly for foo() in gdb again. First, some input:

#!/usr/bin/env python
from struct import pack
with open('input', 'wb') as f:
    f.write(b'\x41'*40                    # fill buffer + padding
        + pack("<Q", 0xbbbbbbbbbbbbbbbb)  # write over canary
        + pack("<Q", 0xcccccccccccccccc)  # write over saved base pointer
        + pack("<Q", 0x4005b2))           # write over return address

Adding the flag has changed the behaviour of foo() to some degree. By default, -fstack-protector changes the prologue and epilogue of functions that (a) allocate buffers >8 bytes on the stack; and/or (b) call alloca() (for explicitly allocating memory on the stack). Here's what it looks like now:

Dump of assembler code for function foo:
   0x0000000000400566 <+0>:	push   %rbp
   0x0000000000400567 <+1>:	mov    %rsp,%rbp
   0x000000000040056a <+4>:	sub    $0x30,%rsp
+  0x000000000040056e <+8>:	mov    %fs:0x28,%rax
+  0x0000000000400577 <+17>:	mov    %rax,-0x8(%rbp)
+  0x000000000040057b <+21>:	xor    %eax,%eax
   0x000000000040057d <+23>:	lea    -0x30(%rbp),%rax
   0x0000000000400581 <+27>:	mov    $0x200,%edx
   0x0000000000400586 <+32>:	mov    %rax,%rsi
   0x0000000000400589 <+35>:	mov    $0x0,%edi
   0x000000000040058e <+40>:	callq  0x400460 <read@plt>
   0x0000000000400593 <+45>:	mov    $0x0,%eax
+  0x0000000000400598 <+50>:	mov    -0x8(%rbp),%rcx
+  0x000000000040059c <+54>:	xor    %fs:0x28,%rcx
+  0x00000000004005a5 <+63>:	je     0x4005ac <foo+70>
+  0x00000000004005a7 <+65>:	callq  0x400450 <__stack_chk_fail@plt>
   0x00000000004005ac <+70>:	leaveq 
   0x00000000004005ad <+71>:	retq   
End of assembler dump.

And these are the new instructions used to set up and check the canary on the stack:

mov    %fs:0x28,%rax	; Put canary value in $rax
mov    %rax,-0x8(%rbp)	; Put canary value on the stack
xor    %eax,%eax	; Clear $rax	

...    ...		; Your function goes here

mov    -0x8(%rbp),%rcx	; Read the canary value on the stack
xor    %fs:0x28,%rcx	; xor with known-good value
je     0x4005ac <foo+70> ; if 0, continue execution
callq  0x400450 <__stack_chk_fail@plt> ; otherwise, abort
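The check is simple enough to model directly. Here's a toy simulation of the frame layout (a 32-byte buffer with a random 8-byte canary right behind it, standing in for %fs:0x28) and the epilogue comparison:

```python
import os
from struct import pack, unpack

CANARY = unpack("<Q", os.urandom(8))[0]   # stands in for %fs:0x28

def foo(data):
    # Prologue: a 32-byte buffer with the canary placed right behind it.
    frame = bytearray(pack("<32sQ", b"\x00" * 32, CANARY))
    # The vulnerable read(): copy attacker bytes with no bounds check.
    n = min(len(data), len(frame))
    frame[:n] = data[:n]
    # Epilogue: compare the canary slot against the known-good value.
    saved = unpack("<Q", bytes(frame[32:40]))[0]
    return "ok" if saved == CANARY else "*** stack smashing detected ***"

print(foo(b"\x41" * 32))   # fits in the buffer: canary intact
print(foo(b"\x41" * 48))   # overflows: canary clobbered
```

The point of randomizing the canary is visible here: an attacker who must write through the canary slot to reach the return address can't know what 8-byte value to preserve.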

Stack Canaries: Implementation

It looks like there's a stack canary field in task_struct. You can see a call to get_random_long() in kernel/fork.c:

/* ~ kernel/fork.c:533 @ linux-4.11.3 */
...
	setup_thread_stack(tsk, orig);
	clear_user_return_notifier(tsk);
	clear_tsk_need_resched(tsk);
	set_task_stack_end_magic(tsk);

#ifdef CONFIG_CC_STACKPROTECTOR
	tsk->stack_canary = get_random_long();
#endif
...

And from the task_struct definition in include/linux/sched.h:

/* ~ include/linux/sched.h:483 @ linux-4.11.3 */

#ifdef CONFIG_CC_STACKPROTECTOR
	/* Canary value for the -fstack-protector GCC feature: */
	unsigned long			stack_canary;
#endif

The value is actually initialized in init/main.c in the start_kernel() function by calling boot_init_stack_canary() from arch/x86/include/asm/stackprotector.h. This happens when the kernel starts:

/* ~ arch/x86/include/asm/stackprotector.h:60 @ linux-4.11.3 */
...

static __always_inline void boot_init_stack_canary(void)
{
	u64 canary;
	u64 tsc;

#ifdef CONFIG_X86_64
	BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
	/*
	 * We both use the random pool and the current TSC as a source
	 * of randomness. The TSC only matters for very early init,
	 * there it already has some randomness on most systems. Later
	 * on during the bootup the random pool has true entropy too.
	 */
	get_random_bytes(&canary, sizeof(canary));
	tsc = rdtsc();
	canary += tsc + (tsc << 32UL);

	current->stack_canary = canary;
#ifdef CONFIG_X86_64
	this_cpu_write(irq_stack_union.stack_canary, canary);
#else
	this_cpu_write(stack_canary.canary, canary);
#endif
}

...

Presumably this is how the references to %fs:0x28 in the disassembly work (0x28 is 40 in decimal, the same fixed offset the kernel insists on above). I actually don't know how the segment registers work, so I guess I ought to let stackprotector.h speak for itself here:

/*
 * GCC stack protector support.
 *
 * Stack protector works by putting predefined pattern at the start of
 * the stack frame and verifying that it hasn't been overwritten when
 * returning from the function.  The pattern is called stack canary
 * and unfortunately gcc requires it to be at a fixed offset from %gs.
 * On x86_64, the offset is 40 bytes and on x86_32 20 bytes.  x86_64
 * and x86_32 use segment registers differently and thus handles this
 * requirement differently.
 *
 * On x86_64, %gs is shared by percpu area and stack canary.  All
 * percpu symbols are zero based and %gs points to the base of percpu
 * area.  The first occupant of the percpu area is always
 * irq_stack_union which contains stack_canary at offset 40.  Userland
 * %gs is always saved and restored on kernel entry and exit using
 * swapgs, so stack protector doesn't add any complexity there.
 *
 * On x86_32, it's slightly more complicated.  As in x86_64, %gs is
 * used for userland TLS.  Unfortunately, some processors are much
 * slower at loading segment registers with different value when
 * entering and leaving the kernel, so the kernel uses %fs for percpu
 * area and manages %gs lazily so that %gs is switched only when
 * necessary, usually during task switch.
 *
 * As gcc requires the stack canary at %gs:20, %gs can't be managed
 * lazily if stack protector is enabled, so the kernel saves and
 * restores userland %gs on kernel entry and exit.  This behavior is
 * controlled by CONFIG_X86_32_LAZY_GS and accessors are defined in
 * system.h to hide the details.
 */

Topics for Next Time

I think the next post might actually proceed with exploitation in the face of some of the mitigations explored here (probably not all of them at once). There will most likely be some discussion about ret2libc, and potentially some really bad attempts at return-oriented programming (although we'll see how that goes).