Exploring Kernel Tracing with eBPF
Some thoughts after watching Brendan Gregg's LCA2017 talk about developments in eBPF tracing. Figured I'd play around with it a bit. It's really cool!
In the past I've done a bit of playing around with ftrace, which I thought was extremely cool and potentially very useful, although I've really only used it to look at the in-kernel call stacks for particular things that I was interested in. There are obviously far-reaching implications here for engineers who work with Linux systems, in that the work done on eBPF facilities in the kernel is a massive step towards secure, low-overhead, easier ways to peer into the state of live systems. These tools (plus perf, systemtap, and such) are indispensable for performance engineers like Brendan and others. However, I am [at least, immediately] more interested in using them for pedagogical purposes.
Anyhow, I'll be showing off little pieces of iovisor/bcc, which is a framework for easily creating eBPF programs for kernel tracing. BPF was originally developed for letting userspace filter messages passed through sockets, although nowadays it's used in a couple of different places throughout the kernel as well (e.g. netfilter when you use iptables, and seccomp for system call filtering)! The project has bindings for Python, which I'll be playing around with today. I think this will mostly be a quick exploration into attaching eBPF programs to kprobes.
Starting With a Simple Example
Here's your standard kind of tool using the bcc
libraries. Consider the following code:
#!/usr/bin/python
# who-calls-execve.py

from bcc import BPF
import ctypes as c

timer = 0

# Our eBPF program
code = """
#include <linux/sched.h>

/* Marshal data into this struct */
struct my_struct {
    u32 pid;
    u64 timestamp;
    char name[TASK_COMM_LEN];
};

/* Send data to userspace via the perf ring-buffer */
BPF_PERF_OUTPUT(events);

int foo(struct pt_regs *ctx) {
    struct my_struct un = {};
    un.pid = bpf_get_current_pid_tgid();
    un.timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(&un.name, sizeof(un.name));
    events.perf_submit(ctx, &un, sizeof(un));
    return 0;
}
"""

# Submit our eBPF program to the kernel, coupling it to the kprobe for execve()
b = BPF(text=code)
b.attach_kprobe(event="sys_execve", fn_name="foo")

# Format of the data to unmarshal from the perf buffer
TASK_COMM_LEN = 16
class output(c.Structure):
    _fields_ = [("pid", c.c_ulonglong),
                ("timestamp", c.c_ulonglong),
                ("name", c.c_char * TASK_COMM_LEN)]

# Function to call when we unmarshal data
def callback(reg, data, size):
    global timer
    ret = c.cast(data, c.POINTER(output)).contents
    if timer == 0:
        timer = ret.timestamp
    delta = (ret.timestamp - timer) / 1000000000
    print("%-18.9f %-16s %-6d" % (delta, ret.name, ret.pid))

# Wait for events, using our callback() function to unmarshal data
b["events"].open_perf_buffer(callback)

# Main loop
while 1:
    b.kprobe_poll()
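As an aside, the c.cast() trick in callback() is easy to demo in isolation: build a byte buffer shaped like one perf-buffer record and run it through the same ctypes structure. (Note that the C struct's u32 pid is followed by four bytes of zeroed padding before the u64 timestamp, which is why reading it as a c_ulonglong works out on a little-endian machine.) The field values below are made up for illustration:

```python
import ctypes as c

TASK_COMM_LEN = 16

class output(c.Structure):
    _fields_ = [("pid", c.c_ulonglong),
                ("timestamp", c.c_ulonglong),
                ("name", c.c_char * TASK_COMM_LEN)]

# Hand-build a fake record: pid 1234, timestamp 5 ns, comm "bash"
# (assumes a little-endian host, like x86)
raw = c.create_string_buffer(
    (1234).to_bytes(8, "little")
    + (5).to_bytes(8, "little")
    + b"bash".ljust(TASK_COMM_LEN, b"\x00"),
    c.sizeof(output))

# The same cast our callback() performs on the perf-buffer pointer
rec = c.cast(raw, c.POINTER(output)).contents
print(rec.pid, rec.timestamp, rec.name)  # 1234 5 b'bash'
```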
At this point, perhaps you exclaim:
"Hey wait! - didn't you mention Python bindings?"
"Why is there a bunch of C? It's disgusting."
Yes, but you don't write eBPF programs in Python - you write them in C (in Python). Interestingly, the Linux kernel itself contains a verifier and a JIT compiler, which are used to check and then execute eBPF programs after they've been submitted through the bpf() syscall. This isn't your standard C either - it's a restricted version of C without globals, loops, and a few other things that might cause problems for the kernel. Otherwise, one imagines a situation where it becomes easy [for lax programmers, or malicious actors] to consume kernel resources by passing a buggy eBPF program for the kernel to run. The Python bindings are an interface for easily setting up code to be compiled into bytecode before being submitted to bpf().
Also keep in mind that your kernel version does matter if you're interested in playing around with this stuff. You might only have partial access to certain eBPF helper functions depending on how new your kernel is. Luckily, the folks working on the IOVisor repository seem to maintain this handy chart!
In this particular program there are a couple of helper functions doing the interesting work. For example, bpf_ktime_get_ns() gives us the time in nanoseconds, and bpf_get_current_pid_tgid() retrieves the process ID and thread-group ID of the current process. After we're done pulling out the things we find interesting, we just fill up my_struct and let events.perf_submit() ship our data over to userspace.
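One subtlety the helper's name hides: bpf_get_current_pid_tgid() actually packs two identifiers into a single u64 - the thread-group ID in the upper 32 bits and the kernel's pid (what userspace calls the thread ID) in the lower 32. Since our struct field is a u32, the assignment quietly keeps only the low half. A sketch of the bit layout, with made-up values:

```python
# bpf_get_current_pid_tgid() returns (tgid << 32) | pid; the kernel's
# "pid" is the thread ID, and "tgid" is what ps(1) reports as the PID.
val = (4321 << 32) | 1234   # hypothetical return value

tgid = val >> 32            # 4321: the process (thread-group) ID
pid = val & 0xFFFFFFFF      # 1234: the thread ID, kept by our u32 field

print(tgid, pid)  # 4321 1234
```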
These lines are of particular interest:
# Submit our eBPF program, coupling it to the kprobe for execve()
b = BPF(text=code)
b.attach_kprobe(event="sys_execve", fn_name="foo")
These statements translate our C into eBPF bytecode and then attach it to the kprobe for the execve() syscall. Take a look at this strace output after I temporarily cut out the rest of the code following these lines:
[0][][~/proj/bpf]$ >/dev/null sudo strace -v -e trace=bpf python who-calls-execve.py
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, ...}, 48) = 3
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=26, ...}, 48) = 4
+++ exited with 0 +++
The function of the bpf() syscall is multiplexed via its first argument. The first call appears to set up some shared memory for passing data between kernel and userspace, and the second call is the one that actually submits our bytecode to the kernel.
The manpages for bpf(2)
provide us with this explanation of eBPF maps:
"eBPF maps are a generic data structure for storage of different data types. Data types are generally treated as binary blobs, so a user just specifies the size of the key and the size of the value at map-creation time. In other words, a key/value for a given map can have an arbitrary structure. A user process can create multiple maps (with key/value-pairs being opaque bytes of data) and access them via file descriptors."
At the end, we register a callback function for unmarshalling data that uses the ctypes library, then sit in a loop waiting for our selected kprobe to trigger our eBPF program. Now, every time that execve() is called across our system, our little Python program will spit out a line with the name and PID of the process that called execve()! Oh, and I forgot to mention that you always need CAP_SYS_ADMIN to call bpf(), so don't forget to preface with sudo.
[0][][~/proj/bpf]$ sudo python who-calls-execve.py
0.000000000 b'panel' 29347
0.000887345 b'panel' 29348
0.920194382 b'vim' 29349
0.922226181 b'bash' 29352
0.922254946 b'bash' 29353
1.001668396 b'panel' 29354
1.002102664 b'panel' 29355
2.002823650 b'panel' 29356
2.003521383 b'panel' 29357
3.004389624 b'panel' 29358
3.005049868 b'panel' 29359
4.006655327 b'panel' 29360
4.008514914 b'panel' 29361
^C[0][][~/proj/bpf]$
Pulling arguments to execve()
So this is pretty cool - we've asked the Linux kernel to record the PID and "name" of all processes
that call execve()
. Why not also ask the kernel to record the first argument to execve()
so we
can see what binaries are being called?
We'll add another field to my_struct
so we can unmarshal the data:
struct my_struct {
u32 pid;
u64 timestamp;
char name[TASK_COMM_LEN];
char path[32];
};
Then, in the foo() function that gets attached to the kprobe, we'll add an argument that will auto-magically be filled in by execve()'s first argument (a string called path):
int foo(struct pt_regs *ctx, char *path) {
    ...
    bpf_probe_read(&un.path, sizeof(un.path), path);
    ...
    return 0;
}
In the interest of security, eBPF programs can only perform memory operations on their own stacks. Here, bpf_probe_read() is another helper function which reads memory from elsewhere for us, copying it so that we can ship it back to userspace. This is an example of explicitly using bpf_probe_read(), although note that the bcc tooling does a lot of legwork in calling this for you when necessary, by using clang to rewrite whatever C you've written. Then, all we need to do is add a field to the ctypes structure we're using to unmarshal data, and add a line to our callback function.
...
PATH_LEN = 32   # must match char path[32] in the eBPF struct

class output(c.Structure):
    _fields_ = [("pid", c.c_ulonglong),
                ("timestamp", c.c_ulonglong),
                ("name", c.c_char * TASK_COMM_LEN),
                ("path", c.c_char * PATH_LEN)]
...
def callback(reg, data, size):
    ...
    print("%-18.9f %-16s %-5d %-16s" % (delta,
                                        ret.name.decode('ascii'),
                                        ret.pid,
                                        ret.path.decode('ascii')))
Now we can see that the output is obviously from the clock loop on my [admittedly hacky] desktop panel
script calling date
and sleep
every second:
[0][][~/proj/bpf]$ sudo python who-calls-execve.py
0.000000000 panel 8432 /usr/bin/date
0.001826737 panel 8433 /usr/bin/sleep
1.003943415 panel 8434 /usr/bin/date
1.005622976 panel 8435 /usr/bin/sleep
2.007018396 panel 8436 /usr/bin/date
2.007790844 panel 8437 /usr/bin/sleep
^C[0][][~/proj/bpf]$
Example with connect()
Let's try to write one that shows us which processes are calling out over the network, along with a corresponding destination IP address. We'll do this by attaching a program to tcp_connect(), which pulls relevant data out of the struct sock passed to the function. All we need to do is add a field to our struct my_struct:
struct my_struct {
    ...
    u64 addr;
};
Then, in our function foo(), which is attached to the kprobe, we accept a struct sock *sk and pull out the intended destination address:
int foo(struct pt_regs *ctx, struct sock *sk) {
    struct my_struct un = {};
    struct sock_common s = sk->__sk_common;
    un.addr = s.skc_daddr;
    ...
}
Since we're attaching to tcp_connect(), we'll name that event in our call to BPF.attach_kprobe():
b.attach_kprobe(event="tcp_connect", fn_name="foo")
The kernel's internal representation of the destination IP address in struct sock_common needs to be unfurled into a string. I had to do a little bit of poking around on http://lxr.free-electrons.com/ to find what I was looking for! It's an excellent resource for swiftly making sense of the kernel source. If you're interested in the kernel and haven't heard of it, I'd highly recommend checking it out!
In our callback() function, all we need to do is import socket and struct, then call socket.inet_ntop() to get a nice human-readable result:
def callback(reg, data, size):
    ...
    print("%-18.9f %-16s %-5d %-16s" % (delta,
                                        ret.name.decode('ascii'),
                                        ret.pid,
                                        socket.inet_ntop(socket.AF_INET, struct.pack("I", ret.addr))))
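To convince yourself the byte-order gymnastics work out: skc_daddr stores the IPv4 address in network byte order, so on a little-endian x86 machine 127.0.0.1 shows up in our field as the integer 0x0100007f. Packing that with the native "I" format just writes the bytes back out in their original network order, which is exactly the form inet_ntop() expects. A quick sanity check (assuming a little-endian host):

```python
import socket
import struct

# 127.0.0.1 in network byte order, read back as a native u32 on x86
addr = 0x0100007f

packed = struct.pack("I", addr)   # b'\x7f\x00\x00\x01' on little-endian
print(socket.inet_ntop(socket.AF_INET, packed))  # 127.0.0.1
```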
..and that should be about it! I actually don't know what the warning below is about - presumably just some weird problem with the kernel headers. Check it out though!:
[0][][~/proj/bpf]$ sudo python who-calls-connect.py
In file included from /virtual/main.c:3:
In file included from include/net/sock.h:51:
In file included from include/linux/netdevice.h:38:
In file included from include/linux/dmaengine.h:20:
In file included from include/linux/device.h:24:
In file included from include/linux/pinctrl/devinfo.h:21:
In file included from include/linux/pinctrl/consumer.h:17:
In file included from include/linux/seq_file.h:10:
include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
[-Wtautological-compare]
if (id < 0 || id >= READING_MAX_ID)
~~ ^ ~
1 warning generated.
0.000000000 Socket Thread 11131 127.0.0.1
0.012109194 Socket Thread 11131 127.0.0.1
6.929860198 nc 29226 127.0.0.1
11.181322456 nc 29241 127.0.2.1
16.092408807 nc 29254 127.0.3.3
The first two requests are from Firefox pointed at localhost, and the others are just from calling netcat!
Other Experiments?
These two programs are rudimentary examples compared to some of the prebuilt tools that come along in the bcc repository. I might end up playing around with these libraries some more in the future when the need arises. I think it might be cool to build a little toy monitoring application that pulls in some statistics from interesting functions in the kernel and throws them into a web application for visualization. Recently I've also been meaning to explore how namespaces are represented in the kernel - I might end up using bcc to aid in that exploration.