Tiny 'hello, world' Binaries

5 May 2018

Miscellania

Table of Contents
Getting Rid of the Symbol Table
Ditching the Standard Libraries
Working Outside of the Compiler

A friend of mine brought up a great question some months ago:

"Why does GCC spit out a huuuge ELF when I compile the simplest possible 'hello, world' program? Shouldn't it be a lot smaller than ~8Kb? That seems pretty big ..."

NOTE: For reference, I'm using GCC 7.3.1. Your results might vary depending on which version of GCC you happen to have on your system. Also, it's worth mentioning that this isn't meant to be a comprehensive overview of the ELF file format -- I assume a hint of familiarity from the reader.

On the off chance that you've never seen 'hello, world' in C, it looks something like this:

// hello_typical.c
#include <stdio.h>
int main(){ 
	printf("hello, world\n");
}

The resulting ELF is over 8000 bytes! What gives?

[0]:[/tmp/]$ gcc -o 0_hello hello_typical.c
[0]:[/tmp/]$ ./0_hello
hello, world
[0]:[/tmp/]$ wc -c 0_hello
8360 0_hello
[0]:[/tmp/]$

If you're looking for the high-level answer to this question, it's basically something like: "... because ELF files aren't just code. An ELF is a recipe with tons of extra data that tells your kernel how to build an image of a process in memory." It might also involve an answer like: "... because you have to load the standard libraries! Somebody implemented all of printf() just for you, among other things!"

8K is pretty big! What does it take to make our ELF smaller? I've been meaning to learn more about using GCC, so I thought this was interesting and decided to take a day to play with it.

Before we get into details, you should know that GNU Binutils includes a really useful tool called objdump which you can use to analyze the structure of particular ELF files¹.

Getting Rid of the Symbol Table

ELF files contain symbol tables, which are basically just mappings from "symbolic names for certain things" to "locations of the relevant data." You can use objdump -t or readelf -s to dump the symbol table/s from some executable. The symbol table contains things like:

Mappings from ELF section names to addresses
(like .text, .data, .bss, .rodata, and friends!)
Mappings from "function names" to addresses
Entries for function names in shared libraries
(addresses are filled-out at runtime by your linker!)
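
To make "a mapping from a name to a location" a bit more concrete, here is a minimal sketch in Python of what a single entry in a 64-bit symbol table looks like on disk. It assumes the 24-byte Elf64_Sym layout described in man elf. The address and size used for main are the ones objdump reports in the listing just below; the name offset (0x50) and section index (13) are made-up placeholders.

```python
import struct

# Elf64_Sym (see `man elf`): st_name, st_info, st_other, st_shndx, st_value, st_size
ELF64_SYM = struct.Struct("<IBBHQQ")

STB_GLOBAL = 1   # binding: visible outside the object file
STT_FUNC = 2     # type: the symbol names a function

def pack_symbol(name_offset, bind, sym_type, shndx, value, size):
    """Build the raw bytes of one symbol table entry."""
    st_info = (bind << 4) | sym_type
    return ELF64_SYM.pack(name_offset, st_info, 0, shndx, value, size)

def unpack_symbol(raw):
    """Decode one entry back into a dictionary."""
    st_name, st_info, st_other, st_shndx, st_value, st_size = ELF64_SYM.unpack(raw)
    return {
        "name_offset": st_name,   # index into the string table, not the name itself
        "bind": st_info >> 4,
        "type": st_info & 0xF,
        "shndx": st_shndx,
        "value": st_value,
        "size": st_size,
    }

# `main` as objdump reports it: address 0x62a, 0x17 bytes long, global function.
# The name offset and section index are hypothetical placeholders.
entry = pack_symbol(0x50, STB_GLOBAL, STT_FUNC, 13, 0x62A, 0x17)
symbol = unpack_symbol(entry)
print(len(entry), hex(symbol["value"]), symbol["size"])
```

Each entry is 24 bytes, and the name itself lives separately in a string table, so every symbol costs a little more than that.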

For example, here's all the symbols associated with the .text section in our binary right now:

[0]:[/tmp/]$ objdump --section=.text -t 0_hello
0_hello:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000520 l    d  .text	0000000000000000              .text
0000000000000550 l     F .text	0000000000000000              deregister_tm_clones
0000000000000590 l     F .text	0000000000000000              register_tm_clones
00000000000005e0 l     F .text	0000000000000000              __do_global_dtors_aux
0000000000000620 l     F .text	0000000000000000              frame_dummy
00000000000006c0 g     F .text	0000000000000002              __libc_csu_fini
0000000000000650 g     F .text	0000000000000065              __libc_csu_init
0000000000000520 g     F .text	000000000000002b              _start
000000000000062a g     F .text	0000000000000017              main

The full output across all sections is much larger than this! Unless you need symbols for debugging, the symbol table isn't necessary for your application to load. An easy way to save some space here is to tell GCC to strip out all of these symbols. This shaved off 2216 bytes!

[0]:[/tmp/]$ gcc -s -o 1_hello_no_symtable hello_typical.c
[0]:[/tmp/]$ objdump --section=.text -t 1_hello_no_symtable
1_hello_no_symtable:     file format elf64-x86-64

SYMBOL TABLE:
no symbols

[0]:[/tmp/]$ wc -c 1_hello_no_symtable
6144 1_hello_no_symtable
[0]:[/tmp/]$

It's also worth mentioning that ELF files have dynamic symbol tables which are distinct from the one that we just stripped off. These are typically required for the binary to execute if it's linked dynamically against some shared libraries. For example, our trusty libc functions are resolved by filling out the dynamic symbol table during runtime. This is what our dynamic symbol table looks like:

[0]:[/tmp/]$ readelf --dyn-syms /tmp/1_hello_no_symtable

Symbol table '.dynsym' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTab
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND puts@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.2.5 (2)
     4: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     5: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
     6: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND __cxa_finalize@GLIBC_2.2.5 (2)

Relocation and dynamic linking at runtime require a lot of extra data in our ELF -- there are quite a few sections that hold the metadata necessary for supporting this, e.g. the .interp, .dyn{sym,str}, .got, .plt, and .rela.* sections.

In any case, the actual machine code for puts()² is not emitted by GCC when we point it at our code. Instead, puts() has already been compiled and lives on our system as part of the C standard libraries. In order to actually have program execution jump into puts(), there are a few things that need to happen.

At compile-time, GCC needs to somehow emit branching instructions without knowing the address of puts() beforehand. If you take a look at the emitted code, you'll notice that we jump into some offset from a symbol called plt! The PLT (Procedure Linkage Table) is the section in our ELF that helps us jump to the actual address filled in by the dynamic linker³.
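
As a rough mental model only (this is a toy in Python, not what the machine code actually looks like), the indirection can be sketched like this: the compiled code always calls a fixed stub, and the stub jumps through a table slot that gets filled in later. The names got, plt_puts, library_puts, and dynamic_linker_bind are all invented for the illustration.

```python
# Toy model of calling a library function through a table that is filled in
# at load time. All names here are invented for illustration.

got = {"puts": None}          # one slot per imported function, empty at "compile time"

def plt_puts(text):
    """Fixed stub the compiled code calls; it jumps via the table slot."""
    target = got["puts"]
    if target is None:
        raise RuntimeError("puts has not been resolved yet")
    return target(text)

def library_puts(text):
    """Stand-in for the real puts() living in a shared library."""
    return text + "\n"

def dynamic_linker_bind(name, function):
    """Stand-in for the dynamic linker writing the real address into the slot."""
    got[name] = function

def main():
    # main() only ever refers to the stub, never to library_puts directly.
    return plt_puts("hello, world")

dynamic_linker_bind("puts", library_puts)
print(main(), end="")
```

The point of the toy is just that main() can be "compiled" before anybody knows where puts() lives, at the cost of carrying the table and the stub around.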

There are a lot of other sections that you can simply remove from this binary in order to get it smaller -- but it'll never be really small unless we can get rid of all the sections necessary for dealing with functions in libraries that we link against.

Ditching the Standard Libraries

Of course, the easiest way for us to skip this whole process is to tell GCC that we don't want to link up with the standard libraries by passing the -nostdlib flag. At this point, we run into an obvious problem:

[0]:[/tmp]$ gcc -s -nostdlib hello_typical.c -o 2_hello_nostdlib
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000000000002f0
/tmp/ccgYzJBU.o: In function `main':
hello_typical.c:(.text+0xc): undefined reference to `puts'
collect2: error: ld returned 1 exit status

If we don't have printf() we'll have to find some way of emitting the machine code necessary for printing to the screen. Luckily, we can just use some assembly to issue the write syscall! At this point it would be kind of a hassle to deal with writing assembly inline in C, so let's just switch to pure assembly. write is syscall number 1 and takes three arguments: a file descriptor to write to, the address of a string, and a size:

// hello_64.S

// Here's our string
string:	.ascii "hello, world\n"
main:
	// Syscall number 1
	movq $1, %rax

	// stdout is file descriptor 1
	movq $1, %rdi

	// `$string` will resolve the address of our string
	movq $string, %rsi

	// Our string is 13 bytes
	movq $13, %rdx

	// Do the syscall, then issue the `exit` syscall (number 60).
	// In the interest of being small, let's just have `exit` return
	// whatever happens to be in %rdi
	syscall
	movq $60, %rax
	syscall

The calling convention here for x86-64 is basically:

Put the syscall number in %rax
Arguments are %rdi, %rsi, %rdx (and then %r{10,8,9})
The return value falls out in %rax
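
The same three pieces of information show up if you reach the write syscall from a higher-level language. As a small sketch, Python's os.write() takes the file descriptor and the bytes (the length is implied by the bytes object) and hands back the number of bytes written, which is what falls out in %rax in the assembly version. A pipe is used here instead of stdout only so that the result can be read back and checked.

```python
import os

message = b"hello, world\n"

# A pipe gives us a file descriptor we can write to and then read back from.
read_fd, write_fd = os.pipe()

# Same information the assembly passes in registers: fd, buffer, length.
written = os.write(write_fd, message)
os.close(write_fd)

echoed = os.read(read_fd, 64)
os.close(read_fd)

print(written, echoed)
```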

My GCC complains about relocations when I compile this, but we don't need to generate position-independent code here, so we can just use the -no-pie flag. Portability is perhaps the main reason behind the existence of shared libraries like libc, and therefore part of the trade-off we make when we want to reduce the size of binaries.

Unfortunately, there's actually another thing we need to do in order to get this working:

[0]:[/tmp]$ gcc -s -nostdlib -no-pie hello_64.S -o 2_hello_nostdlib
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000000004000d4
[0]:[/tmp]$ ./2_hello_nostdlib
Segmentation fault (core dumped)

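The program entry point, _start, is normally provided for us by the C runtime startup code that gets linked in alongside the standard libraries. With -nostdlib it's gone, so the linker falls back to the beginning of .text, which in our case is the string rather than any code. Instead of having the distinction of a main(), let's just write the whole thing in _start: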

// hello_64.S
.global _start
string:	.ascii "hello, world\n"
_start:
	movq	$1, %rax
	movq	$1, %rdi
	movq	$string, %rsi
	movq	$13, %rdx
	syscall
	movq	$60, %rax
	syscall

Now we're in a position to see how small 'hello, world' can be:

[0]:[/tmp]$ gcc -s -nostdlib -no-pie hello_64.S -o 2_hello_nostdlib
[0]:[/tmp]$ ./2_hello_nostdlib
hello, world
[1]:[/tmp]$ wc -c 2_hello_nostdlib
560 2_hello_nostdlib

560 bytes! That's almost 15 times smaller than the original! Interestingly, we can actually continue saving space if we consider that this is a 64-bit version. We should save space if we can make sure all the addresses are 32 bits wide.

// hello_32.S
.global _start
string:	.ascii "hello, world\n"
_start:
	movl	$4, %eax
	movl	$1, %ebx
	movl	$string, %ecx
	movl	$13, %edx
	int	$0x80
	movl	$1, %eax
	int	$0x80

The 32-bit calling convention is just a little bit different here: the syscall number goes in %eax, the arguments go in %ebx, %ecx, and %edx, and we enter the kernel with int $0x80 instead of syscall. It's also worth mentioning that Linux has different numbers for the 32-bit syscalls. write is 4 here, and exit is 1.

[0]:[/tmp]$ gcc -s -no-pie -m32 -nostdlib hello_32.S -o 3_hello_nostdlib_32 
[0]:[/tmp]$ ./3_hello_nostdlib_32
hello, world
[1]:[/tmp]$ wc -c 3_hello_nostdlib_32
392 3_hello_nostdlib_32
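
For a sense of how little code is actually in there, here is a sketch in Python that hand-encodes the seven instructions from hello_32.S. It assumes the standard IA-32 encodings (opcode 0xB8 plus the register number, followed by a 32-bit immediate, for movl; 0xCD followed by an 8-bit immediate for int), and it assumes the string ends up at address 0x08048054, which is where it sits in the readelf and xxd output later on.

```python
import struct

# Register numbers used by the one-byte `mov r32, imm32` opcodes (0xB8 + number).
REGISTERS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}

def movl(immediate, register):
    """Encode `movl $immediate, %register` (5 bytes)."""
    return bytes([0xB8 + REGISTERS[register]]) + struct.pack("<I", immediate)

def int_(vector):
    """Encode `int $vector` (2 bytes)."""
    return bytes([0xCD, vector])

STRING_ADDRESS = 0x08048054  # assumption: taken from the later readelf output

code = b"".join([
    movl(4, "eax"),               # write
    movl(1, "ebx"),               # stdout
    movl(STRING_ADDRESS, "ecx"),  # pointer to the string
    movl(13, "edx"),              # length
    int_(0x80),
    movl(1, "eax"),               # exit
    int_(0x80),
])

print(len(code), code.hex())
```

That's 29 bytes of code plus the 13-byte string, 42 (0x2a) bytes in total, which matches the size of .text in the readelf output further down.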

392 bytes is 21 times smaller than our original 8360-byte version! If you take a look at the ELF sections, we've cut them down to just two here: .text and .note.gnu.build-id. We can pass -Wl,--build-id=none to GCC so that only .text remains, cutting the size down even more:

[0]:[/tmp]$ gcc -s -no-pie -m32 -nostdlib hello_32.S -Wl,--build-id=none -o 4_hello_nostdlib_32_opt
[1]:[/tmp]$ wc -c 4_hello_nostdlib_32_opt
264 4_hello_nostdlib_32_opt

Working Outside of the Compiler

Beyond this point, I couldn't get GCC to make a binary any smaller than 264 bytes. You would probably have to build an ELF file by hand if you wanted to get smaller. I'm not going to build one by hand, but let's see if we can modify the one that GCC gave us and make it smaller. GCC enshrines certain conventions for building ELF files, but remember that what GCC emits isn't necessarily the bare minimum that Linux needs in order to load the program into memory and execute it⁴.

[0]:[/tmp]$ readelf -e 4_hello_nostdlib_32_opt
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x8048061
  Start of program headers:          52 (bytes into file)
  Start of section headers:          144 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         1
  Size of section headers:           40 (bytes)
  Number of section headers:         3
  Section header string table index: 2

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        08048054 000054 00002a 00  AX  0   0  1
  [ 2] .shstrtab         STRTAB          00000000 00007e 000011 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x000000 0x08048000 0x08048000 0x0007e 0x0007e R E 0x1000

 Section to Segment mapping:
  Segment Sections...
   00     .text
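
These numbers account for every one of the 264 bytes. Here's a quick sketch of the arithmetic in Python, using only values copied from the readelf output above:

```python
# All values below are copied from the `readelf -e` output.
ELF_HEADER_SIZE = 52
PROGRAM_HEADER_SIZE = 32
PROGRAM_HEADER_COUNT = 1
TEXT_OFFSET, TEXT_SIZE = 0x54, 0x2A
SHSTRTAB_OFFSET, SHSTRTAB_SIZE = 0x7E, 0x11
SECTION_HEADERS_OFFSET = 144
SECTION_HEADER_SIZE = 40
SECTION_HEADER_COUNT = 3

headers_end = ELF_HEADER_SIZE + PROGRAM_HEADER_SIZE * PROGRAM_HEADER_COUNT
text_end = TEXT_OFFSET + TEXT_SIZE
shstrtab_end = SHSTRTAB_OFFSET + SHSTRTAB_SIZE
padding = SECTION_HEADERS_OFFSET - shstrtab_end
file_size = SECTION_HEADERS_OFFSET + SECTION_HEADER_SIZE * SECTION_HEADER_COUNT

print(hex(headers_end))      # where .text starts
print(hex(text_end))         # where the loaded segment ends
print(padding)               # gap before the section header table
print(file_size)             # total size on disk
print(file_size - text_end)  # bytes that are never loaded into memory
```

The headers end exactly where .text begins (0x54), .text ends where the single LOAD segment ends (0x7e), and everything after that is section bookkeeping.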

Looking at the header data, there are three section header entries starting at offset 0x90 in our binary, each taking up 40 bytes, plus another 0x11 bytes (starting at offset 0x7e) for the string table that holds the section names. After a little bit of research, it turns out that the actual loader in Linux doesn't use these bytes for anything, so you can just lop them off! -- although this may have the effect of breaking some of the ELF-parsing programs that we've been using:

[0]:[/tmp]$ cat tiny.py
#!/usr/bin/python
from os import chmod

with open("4_hello_nostdlib_32_opt", "rb") as f:
    data = bytearray(f.read())

output2 = data[:0x7e]

with open("5_hello_tiny", "wb") as f:
    f.write(output2)

chmod("5_hello_tiny", 0o755)
[0]:[/tmp]$ ./tiny.py; ./5_hello_tiny; wc -c 5_hello_tiny
hello, world
126 5_hello_tiny
[0]:[/tmp]$ xxd 5_hello_tiny
00000000: 7f45 4c46 0101 0100 0000 0000 0000 0000  .ELF............
00000010: 0200 0300 0100 0000 6180 0408 3400 0000  ........a...4...
00000020: 9000 0000 0000 0000 3400 2000 0100 2800  ........4. ...(.
00000030: 0300 0200 0100 0000 0000 0000 0080 0408  ................
00000040: 0080 0408 7e00 0000 7e00 0000 0500 0000  ....~...~.......
00000050: 0010 0000 6865 6c6c 6f2c 2077 6f72 6c64  ....hello, world
00000060: 0ab8 0400 0000 bb01 0000 00b9 5480 0408  ............T...
00000070: ba0d 0000 00cd 80b8 0100 0000 cd80       ..............
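
If you want to read that hexdump field by field, here is a minimal sketch in Python that unpacks the 52-byte ELF32 header and the 32-byte program header with struct. The byte string is copied from the xxd output above, and the field names follow man elf.

```python
import struct

# Bytes of 5_hello_tiny, copied from the xxd output.
image = bytes.fromhex(
    "7f454c46010101000000000000000000"
    "02000300010000006180040834000000"
    "90000000000000003400200001002800"
    "03000200010000000000000000800408"
    "008004087e0000007e00000005000000"
    "0010000068656c6c6f2c20776f726c64"
    "0ab804000000bb01000000b954800408"
    "ba0d000000cd80b801000000cd80"
)

ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")   # Elf32_Ehdr, 52 bytes
ELF32_PHDR = struct.Struct("<8I")                   # Elf32_Phdr, 32 bytes

(e_ident, e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
 e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
 e_shstrndx) = ELF32_HEADER.unpack_from(image, 0)

(p_type, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_flags,
 p_align) = ELF32_PHDR.unpack_from(image, e_phoff)

print("file size      :", len(image))
print("entry point    :", hex(e_entry))
print("section headers:", e_shnum, "at offset", hex(e_shoff))
print("segment        :", hex(p_vaddr), "file size", hex(p_filesz))

# Offset of the entry point inside the file, and the string that precedes it.
entry_offset = e_entry - p_vaddr + p_offset
print("entry offset   :", hex(entry_offset))
print("string         :", image[e_phoff + e_phentsize:entry_offset])
```

Note that e_shoff still says 0x90 and e_shnum still says 3, even though the file is now only 126 bytes long. That dangling reference is the kind of thing that can trip up ELF-parsing tools.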

It's definitely possible to make these even smaller. I have a number of ideas.

We can probably put the 'hello, world' string somewhere in the ELF header, move the code up to fill the old location of the string, and then patch the last byte in the entry point address, and the last byte in the pointer to the string loaded into %ecx. If we trim off the space and newline from our string, the remaining 11 characters should fit right into the e_ident bytes that follow the magic number and class byte at the top! This is actually pretty easy and saves us 13 bytes.

My other thought was to try writing the code over parts of the ELF header, which is the first 52 bytes of the file (everything before offset 0x34).

We can probably collide the code with the program header to some degree. At first, I was able to simply move the code back all the way up to the least-significant byte in the p_flags field without throwing SIGSEGV or anything nasty. This saves us another 7 bytes.

Interestingly, if you keep pushing the code backwards, the code and the program header eventually overlap in a way that still yields a valid executable when the code starts at offset 0x41!

Here's the code I used to slice up the file:

#!/usr/bin/python
from os import chmod

with open("4_hello_nostdlib_32_opt", "rb") as f:
    data = bytearray(f.read())

output2 = data[:0x7e]
code = output2[0x61:]
entrypoint_offset = 0x41

# Instead of letting the string live at 0x54, stuff it into the unused space
# in the e_ident bytes at the beginning of the file!
output2[0x05:0x10] = bytearray(b'hello,world')

# Patch the pointer to the 'hello,world' string
code[0x0b] = 0x5

# Now, move the code backwards, filling up the space we freed up, and also 
# intersecting with parts of the program header
output2[entrypoint_offset:] = code

# Patch the entrypoint address
output2[0x18] = entrypoint_offset

with open("6_hello_tiny", "wb") as f:
    f.write(output2)
chmod("6_hello_tiny", 0o755)

And here's what it looks like in the shell. The resulting valid ELF is only 94 bytes! That's ~89 times smaller than what we started with!

[0]:[/tmp]$ ./tiny.py; ./6_hello_tiny
hello,world[1]:[/tmp]$ wc -c 6_hello_tiny
94 6_hello_tiny
[0]:[/tmp]$ xxd 6_hello_tiny
00000000: 7f45 4c46 0168 656c 6c6f 2c77 6f72 6c64  .ELF.hello,world
00000010: 0200 0300 0100 0000 4180 0408 3400 0000  ........A...4...
00000020: 9000 0000 0000 0000 3400 2000 0100 2800  ........4. ...(.
00000030: 0300 0200 0100 0000 0000 0000 0080 0408  ................
00000040: 00b8 0400 0000 bb01 0000 00b9 0580 0408  ................
00000050: ba0d 0000 00cd 80b8 0100 0000 cd80       ..............
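
It's worth looking at what the program header decodes to once the code has been moved on top of it. The following is only a sketch in Python that re-reads the 32 bytes at offset 0x34 from the xxd output above; it makes no claim about why the kernel is happy with the resulting values.

```python
import struct

# Bytes of 6_hello_tiny, copied from the xxd output.
image = bytes.fromhex(
    "7f454c460168656c6c6f2c776f726c64"
    "02000300010000004180040834000000"
    "90000000000000003400200001002800"
    "03000200010000000000000000800408"
    "00b804000000bb01000000b905800408"
    "ba0d000000cd80b801000000cd80"
)

PROGRAM_HEADER_OFFSET = 0x34          # e_phoff, unchanged by the patching
FIELDS = ("p_type", "p_offset", "p_vaddr", "p_paddr",
          "p_filesz", "p_memsz", "p_flags", "p_align")

values = struct.unpack_from("<8I", image, PROGRAM_HEADER_OFFSET)
program_header = dict(zip(FIELDS, values))

for name in FIELDS:
    print(f"{name:9} {program_header[name]:#010x}")

# Permission bits live in the low three bits of p_flags: X=1, W=2, R=4.
PF_X, PF_W, PF_R = 1, 2, 4
permissions = program_header["p_flags"] & (PF_X | PF_W | PF_R)
print("readable and executable:", permissions == (PF_R | PF_X))
```

The first three fields are untouched, but p_paddr, p_filesz, p_memsz, p_flags, and p_align are now made of instruction bytes. The low byte of p_flags happens to be the 0x05 from the patched string pointer, so the segment still reads as readable and executable.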

This is about as deep as I go for today. The moral of the story is something along the lines of: "Compilers and shared libraries are nice, but sometimes the overhead isn't worth it if you care about the size of your binaries!"

In case you're interested in pushing onward, I'd guess your best option at this point would be to write a whole valid ELF header which is also somehow interpretable as a 'hello,world' binary in x86-64.

Neat, huh?

¹

Also, in case you didn't know, there's actually a section 5 manpage on the ELF format (see man elf) which is super useful!

²

Note here that GCC actually seems to optimize our printf() call into puts().

³

I'm pretty sure the GOT (Global Offset Table) section also has something to do with this. The branching instructions in the PLT are indirect jumps that dereference some address at some offset from the GOT.

⁴

You can review the ELF loader code at fs/binfmt_elf.c in the kernel source tree if you really want to know what's necessary and sufficient. I'd actually like to build an ELF header from scratch sometime with this code in mind, sort of in the spirit of this article's attempts to create the smallest possible ELF.

⁵

Observe the output of readelf -x .interp <my_elf_file> and be amazed! On Linux, the .interp section contains the path to the dynamic linker ld.so