Exploring Kernel Tracing with eBPF
Some thoughts after watching Brendan Gregg's LCA2017 talk about developments in eBPF tracing. Figured I'd play around with it a bit. It's really cool!
In the past I've done a bit of playing around with ftrace, which I thought was extremely cool and potentially very useful, although I've really only used it to look at the in-kernel call stacks for particular things that I was interested in. There are obviously far-reaching implications here for engineers who work with Linux systems, in that the work done on eBPF facilities in the kernel is a massive step towards secure, low-overhead, easier ways to peer into the state of live systems. These tools (plus perf, systemtap, and such) are indispensable for performance engineers like Brendan and others. However, I am [at least, immediately] more interested in using them for pedagogical purposes.
Anyhow, I'll be showing off little pieces of iovisor/bcc, which is a framework for easily creating eBPF programs for kernel tracing. BPF was originally developed for letting userspace filter messages passed through sockets, although nowadays it's used in a couple of different places throughout the kernel as well (e.g. netfilter when you use iptables, and seccomp for system call filtering)! The project has bindings for Python, which I'll be playing around with today. I think this will mostly be a quick exploration into attaching eBPF programs to kprobes.
Starting With a Simple Example
Here's your standard kind of tool using the bcc
libraries. Consider the following code:
#!/usr/bin/python
# who-calls-execve.py

from bcc import BPF
import ctypes as c

timer = 0

# Our eBPF program
code = """
#include <linux/sched.h>

/* Marshal data into this struct */
struct my_struct {
    u32 pid;
    u64 timestamp;
    char name[TASK_COMM_LEN];
};

/* Send data to userspace via the perf ring-buffer */
BPF_PERF_OUTPUT(events);

int foo(struct pt_regs *ctx) {
    struct my_struct un = {};
    un.pid = bpf_get_current_pid_tgid();
    un.timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(&un.name, sizeof(un.name));
    events.perf_submit(ctx, &un, sizeof(un));
    return 0;
}
"""

# Submit our eBPF program to the kernel, coupling it to the kprobe for execve()
b = BPF(text=code)
b.attach_kprobe(event="sys_execve", fn_name="foo")

# Format of the data to unmarshal from the perf buffer
TASK_COMM_LEN = 16
class output(c.Structure):
    _fields_ = [("pid", c.c_ulonglong),
                ("timestamp", c.c_ulonglong),
                ("name", c.c_char * TASK_COMM_LEN)]

# Function to call when we unmarshal data
def callback(reg, data, size):
    global timer
    ret = c.cast(data, c.POINTER(output)).contents
    if timer == 0:
        timer = ret.timestamp
    delta = (ret.timestamp - timer) / 1000000000
    print("%-18.9f %-16s %-6d" % (delta, ret.name, ret.pid))

# Wait for events, using our callback() function to unmarshal data
b["events"].open_perf_buffer(callback)

# Main loop
while 1:
    b.kprobe_poll()
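As an aside, the c.cast() trick in callback() is easy to demo in isolation: build a byte buffer shaped like one perf-buffer record and run it through the same ctypes structure. (Note that the C struct's u32 pid is followed by four bytes of zeroed padding before the u64 timestamp, which is why reading it as a c_ulonglong works out on a little-endian machine.) The field values below are made up for illustration:

```python
import ctypes as c

TASK_COMM_LEN = 16

class output(c.Structure):
    _fields_ = [("pid", c.c_ulonglong),
                ("timestamp", c.c_ulonglong),
                ("name", c.c_char * TASK_COMM_LEN)]

# Hand-build a fake record: pid 1234, timestamp 5 ns, comm "bash"
# (assumes a little-endian host, like x86)
raw = c.create_string_buffer(
    (1234).to_bytes(8, "little")
    + (5).to_bytes(8, "little")
    + b"bash".ljust(TASK_COMM_LEN, b"\x00"),
    c.sizeof(output))

# The same cast our callback() performs on the perf-buffer pointer
rec = c.cast(raw, c.POINTER(output)).contents
print(rec.pid, rec.timestamp, rec.name)  # 1234 5 b'bash'
```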
At this point, perhaps you exclaim:
"Hey wait! - didn't you mention Python bindings?"
"Why is there a bunch of C? It's disgusting."
Yes, but you don't write eBPF programs in Python - you write them in C (in Python). Interestingly, the Linux kernel itself contains a verifier and a JIT compiler, which are used to check and then execute eBPF programs after they've been submitted through the bpf() syscall. This isn't your standard C either - it's a restricted version of C without globals, loops, and a few other things that might cause problems for the kernel. Otherwise, one imagines a situation where it becomes easy [for lax programmers, or malicious actors] to consume kernel resources by passing a buggy eBPF program for the kernel to run. The Python bindings are an interface for easily setting up code to be compiled into bytecode before being submitted to bpf().
Also keep in mind that your kernel version does matter if you're interested in playing around with this stuff. You might only have partial access to certain eBPF helper functions depending on how new your kernel is. Luckily, the folks working on the IOVisor repository seem to maintain this handy chart!
In this particular program there are a couple of helper functions doing the interesting work. For example, bpf_ktime_get_ns() gives us the time in nanoseconds, and bpf_get_current_pid_tgid() retrieves the process ID and thread-group ID of the current process. After we're done pulling out the things we find interesting, we just fill up my_struct and let events.perf_submit() ship our data over to userspace.
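One subtlety the helper's name hides: bpf_get_current_pid_tgid() actually packs two identifiers into a single u64 - the thread-group ID in the upper 32 bits and the kernel's pid (what userspace calls the thread ID) in the lower 32. Since our struct field is a u32, the assignment quietly keeps only the low half. A sketch of the bit layout, with made-up values:

```python
# bpf_get_current_pid_tgid() returns (tgid << 32) | pid; the kernel's
# "pid" is the thread ID, and "tgid" is what ps(1) reports as the PID.
val = (4321 << 32) | 1234   # hypothetical return value

tgid = val >> 32            # 4321: the process (thread-group) ID
pid = val & 0xFFFFFFFF      # 1234: the thread ID, kept by our u32 field

print(tgid, pid)  # 4321 1234
```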
These lines are of particular interest:
# Submit our eBPF program, coupling it to the kprobe for execve()
b = BPF(text=code)
b.attach_kprobe(event="sys_execve", fn_name="foo")
These statements translate our C into eBPF bytecode and then attach it to the kprobe for the execve() syscall. Take a look at this strace output after I temporarily cut out the rest of the code following these lines:
[0][][~/proj/bpf]$ >/dev/null sudo strace -v -e trace=bpf python who-calls-execve.py
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, ...}, 48) = 3
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=26, ...}, 48) = 4
+++ exited with 0 +++
The function of the bpf() syscall is multiplexed via its first argument. The first call appears to set up some shared memory for passing data between kernel and userspace, and the second call is the one that actually submits our bytecode to the kernel.
The manpages for bpf(2)
provide us with this explanation of eBPF maps:
"eBPF maps are a generic data structure for storage of different data types. Data types are generally treated as binary blobs, so a user just specifies the size of the key and the size of the value at map-creation time. In other words, a key/value for a given map can have an arbitrary structure. A user process can create multiple maps (with key/value-pairs being opaque bytes of data) and access them via file descriptors."
At the end, we register a callback function for unmarshalling data that uses the ctypes library, then sit in a loop waiting for our selected kprobe to trigger our eBPF program. Now, every time that execve() is called across our system, our little Python program will spit out a line with the name and PID of the process that called execve()! Oh, and I forgot to mention that you always need CAP_SYS_ADMIN to call bpf(), so don't forget to preface with sudo.
[0][][~/proj/bpf]$ sudo python who-calls-execve.py
0.000000000 b'panel' 29347
0.000887345 b'panel' 29348
0.920194382 b'vim' 29349
0.922226181 b'bash' 29352
0.922254946 b'bash' 29353
1.001668396 b'panel' 29354
1.002102664 b'panel' 29355
2.002823650 b'panel' 29356
2.003521383 b'panel' 29357
3.004389624 b'panel' 29358
3.005049868 b'panel' 29359
4.006655327 b'panel' 29360
4.008514914 b'panel' 29361
^C[0][][~/proj/bpf]$
Pulling arguments to execve()
So this is pretty cool - we've asked the Linux kernel to record the PID and "name" of all processes
that call execve()
. Why not also ask the kernel to record the first argument to execve()
so we
can see what binaries are being called?
We'll add another field to my_struct
so we can unmarshal the data:
struct my_struct {
u32 pid;
u64 timestamp;
char name[TASK_COMM_LEN];
char path[32];
};
Then, in the foo() function that gets attached to the kprobe, we'll add an argument that will auto-magically be filled in by execve()'s first argument (a string called path):
int foo(struct pt_regs *ctx, char *path) {
    ...
    bpf_probe_read(&un.path, sizeof(un.path), path);
    ...
    return 0;
}
In the interest of security, eBPF programs can only perform memory operations on their own stacks. Here, bpf_probe_read() is another helper function which reads memory from elsewhere for us, copying it so that we can ship it back to userspace. This is an example of explicitly using bpf_probe_read(), although note that the bcc tooling does a lot of legwork in calling this for you when necessary, by using clang to rewrite whatever C you've written. Then, all we need to do is add a field to the ctypes structure we're using to unmarshal data, and add a line to our callback function.
...
PATH_LEN = 32   # must match char path[32] in the eBPF struct

class output(c.Structure):
    _fields_ = [("pid", c.c_ulonglong),
                ("timestamp", c.c_ulonglong),
                ("name", c.c_char * TASK_COMM_LEN),
                ("path", c.c_char * PATH_LEN)]
...
def callback(reg, data, size):
    ...
    print("%-18.9f %-16s %-5d %-16s" % (delta,
                                        ret.name.decode('ascii'),
                                        ret.pid,
                                        ret.path.decode('ascii')))
Now we can see that the output is obviously from the clock loop on my [admittedly hacky] desktop panel
script calling date
and sleep
every second:
[0][][~/proj/bpf]$ sudo python who-calls-execve.py
0.000000000 panel 8432 /usr/bin/date
0.001826737 panel 8433 /usr/bin/sleep
1.003943415 panel 8434 /usr/bin/date
1.005622976 panel 8435 /usr/bin/sleep
2.007018396 panel 8436 /usr/bin/date
2.007790844 panel 8437 /usr/bin/sleep
^C[0][][~/proj/bpf]$
Example with connect()
Let's try to write one that shows us which processes are calling out over the network, along with a corresponding destination IP address. We'll do this by attaching a program to tcp_connect(), which pulls relevant data out of the struct sock passed to the function. All we need to do is add a field to our struct my_struct:
struct my_struct {
    ...
    u64 addr;
};
Then, in our function foo(), which is attached to the kprobe, we accept a struct sock *sk and pull out the intended destination address:
int foo(struct pt_regs *ctx, struct sock *sk) {
    struct my_struct un = {};
    struct sock_common s = sk->__sk_common;
    un.addr = s.skc_daddr;
    ...
}
Since we're attaching to tcp_connect(), we'll name that event in our call to BPF.attach_kprobe():
b.attach_kprobe(event="tcp_connect", fn_name="foo")
The kernel's internal representation of the destination IP address in struct sock_common needs to be unfurled into a string. I had to do a little bit of poking around on http://lxr.free-electrons.com/ to find what I was looking for! It's an excellent resource for swiftly making sense of the kernel source. If you're interested in the kernel and haven't heard of it, I'd highly recommend checking it out!
In our callback() function, all we need to do is import socket and struct, then call socket.inet_ntop() to get a nice human-readable result:
def callback(reg, data, size):
    ...
    print("%-18.9f %-16s %-5d %-16s" % (delta,
                                        ret.name.decode('ascii'),
                                        ret.pid,
                                        socket.inet_ntop(socket.AF_INET, struct.pack("I", ret.addr))))
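To convince yourself the byte-order gymnastics work out: skc_daddr stores the IPv4 address in network byte order, so on a little-endian x86 machine 127.0.0.1 shows up in our field as the integer 0x0100007f. Packing that with the native "I" format just writes the bytes back out in their original network order, which is exactly the form inet_ntop() expects. A quick sanity check (assuming a little-endian host):

```python
import socket
import struct

# 127.0.0.1 in network byte order, read back as a native u32 on x86
addr = 0x0100007f

packed = struct.pack("I", addr)   # b'\x7f\x00\x00\x01' on little-endian
print(socket.inet_ntop(socket.AF_INET, packed))  # 127.0.0.1
```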
..and that should be about it! I actually don't know what the warning below is about - presumably just some weird problem with the kernel headers. Check it out though!:
[0][][~/proj/bpf]$ sudo python who-calls-connect.py
In file included from /virtual/main.c:3:
In file included from include/net/sock.h:51:
In file included from include/linux/netdevice.h:38:
In file included from include/linux/dmaengine.h:20:
In file included from include/linux/device.h:24:
In file included from include/linux/pinctrl/devinfo.h:21:
In file included from include/linux/pinctrl/consumer.h:17:
In file included from include/linux/seq_file.h:10:
include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
[-Wtautological-compare]
if (id < 0 || id >= READING_MAX_ID)
~~ ^ ~
1 warning generated.
0.000000000 Socket Thread 11131 127.0.0.1
0.012109194 Socket Thread 11131 127.0.0.1
6.929860198 nc 29226 127.0.0.1
11.181322456 nc 29241 127.0.2.1
16.092408807 nc 29254 127.0.3.3
The first two requests are from Firefox pointed at localhost, and the others are just from calling netcat!
Other Experiments?
These two programs are rudimentary examples compared to some of the prebuilt tools that come along in the bcc repository. I might end up playing around with these libraries some more in the future when the need arises. I think it might be cool to build a little toy monitoring application that pulls in some statistics from interesting functions in the kernel and throws them into a web application for visualization. Recently I've also been meaning to explore how namespaces are represented in the kernel - I might end up using bcc to aid in that exploration.