The recent release of Red Hat Enterprise Linux 7.6 enables the extended Berkeley Packet Filter (eBPF) in-kernel virtual machine, which can be used for system tracing. In this blog we introduce the basic concepts of this technology and a few example use cases. We also present some of the existing tooling built on top of eBPF.

Before starting with eBPF, it's worth noting that the traditional Berkeley Packet Filter, available via setsockopt(SO_ATTACH_FILTER), remains available unmodified.

eBPF enables programmers to write code which is executed in kernel space in a more secure and restricted environment. Yet this environment still enables them to create tools which would otherwise require writing a new kernel module.

The eBPF in Red Hat Enterprise Linux 7.6 is provided as a Tech Preview and thus doesn't come with full support and is not suitable for deployment in production. It is provided with the primary goal of gaining wider exposure, and potentially moving to full support in the future.

eBPF in Red Hat Enterprise Linux 7.6 is enabled only for tracing purposes, which allows attaching eBPF programs to probes, tracepoints and perf events. Other use cases such as eBPF socket filters or eXpress Data Path (XDP) are not enabled at this stage.

Design of eBPF

eBPF introduces a new syscall, bpf(2). This syscall is used for all eBPF operations such as loading programs, attaching them to certain events, creating eBPF maps and accessing map contents from tools. We'll talk about eBPF maps later.
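As a rough illustration of what happens under the hood, the following minimal sketch calls bpf(2) directly from Python to create an eBPF map. The syscall number, constants and structure layout below are assumptions for x86_64 and are not taken from this article, so treat it purely as an illustration; the bcc framework shown later wraps all of this for you.

import ctypes

libc = ctypes.CDLL(None, use_errno=True)

NR_BPF = 321            # assumed value of __NR_bpf on x86_64
BPF_MAP_CREATE = 0      # bpf(2) command: create a new eBPF map
BPF_MAP_TYPE_HASH = 1   # hash map type

# Leading fields of union bpf_attr as used by BPF_MAP_CREATE
class BpfMapCreateAttr(ctypes.Structure):
    _fields_ = [("map_type", ctypes.c_uint32),
                ("key_size", ctypes.c_uint32),
                ("value_size", ctypes.c_uint32),
                ("max_entries", ctypes.c_uint32),
                ("map_flags", ctypes.c_uint32)]

# Hash map with u32 keys and u64 values, up to 64 entries
attr = BpfMapCreateAttr(BPF_MAP_TYPE_HASH, 4, 8, 64, 0)
fd = libc.syscall(NR_BPF, BPF_MAP_CREATE, ctypes.byref(attr), ctypes.sizeof(attr))
if fd < 0:
    raise OSError(ctypes.get_errno(), "bpf(BPF_MAP_CREATE) failed")
print("created eBPF map, fd = %d" % fd)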

eBPF programs are written in a special assembly language. When an application uses the bpf(2) syscall to load a program into the kernel, the eBPF verifier inspects the code for safe execution. The special assembly language allows the programs to run in a sandboxed virtual machine with access only to a limited set of resources and data, and the verifier is designed to ensure that the program safely terminates. This means that an eBPF program should be safe to run, even in production, as it is designed not to cause any unwanted side-effects.

Once loaded into the kernel, these programs are just-in-time compiled into native machine code.

eBPF maps provide a generic key-value store which can be accessed both from in-kernel eBPF programs and from userland applications via the bpf(2) syscall. Maps can thus be used to pass data from the kernel to userland and vice versa.
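To make this more concrete, here is a minimal sketch using the bcc framework described later in this post: the eBPF program increments a counter in a hash map every time the kill(2) syscall handler is entered, and the Python side then reads that map via the bpf(2) syscall. The program and map names (count_kill, counts) are our own and not part of any shipped tool:

from bcc import BPF
from time import sleep

# Kernel side: count invocations of the kill(2) syscall handler in a hash map.
prog = """
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);

int count_kill(struct pt_regs *ctx)
{
    u32 key = 0;
    u64 zero = 0, *val;

    val = counts.lookup_or_init(&key, &zero);
    (*val)++;
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("kill"), fn_name="count_kill")

sleep(10)

# Userland side: read the shared map through the bpf(2) syscall.
for key, value in b["counts"].items():
    print("kill() was called %d times" % value.value)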

A limited set of kernel helper functions may be called from eBPF programs. These functions provide read and write access to eBPF maps, can be used to retrieve the current processor ID and task pointer, and also provide access to perf events.

Note that eBPF programs always need to be compiled against the kernel headers of the specific kernel on which they will run.

Quick start guide to eBPF with bcc-tools

eBPF has been enabled since the Red Hat Enterprise Linux 7.6 Beta release, so the first step is to ensure we are running kernel 3.10.0-940.el7 or newer with eBPF support:

# uname -r
3.10.0-940.el7.x86_64

Developing tools based on eBPF can require deep knowledge of the kernel. Fortunately, many such tools have already been created and are ready to use. They can be used on their own and can also serve as a reference for creating new eBPF programs.

eBPF programs can be attached to any function in the kernel with access to its arguments, so only users with the CAP_SYS_ADMIN capability can use the bpf(2) syscall. Therefore, I am running all of the examples below as the root user.

Development of tracing tools using eBPF can be simplified by using the BPF Compiler Collection, BCC. Many useful pre-created tools also ship as part of the bcc-tools package. For more details about BCC and the tools it provides, see the project's GitHub page.

To start using bcc-tools, install the respective package:

# yum install bcc-tools

These tools require no configuration and are ready to be used. Let's say we want to trace all kill signals being sent by processes running on the machine. There's a tool, killsnoop(8), for this purpose:

# /usr/share/bcc/tools/killsnoop
TIME      PID    COMM             SIG  TPID   RESULT
17:40:14  18310  bash             15   18315  0

This is the output of the killsnoop(8) command after I sent a signal to the process with PID 18315 using the kill(1) command.

Using eBPF with SystemTap

There are a few other interesting tools which use eBPF internally. SystemTap release 3.2 includes a BPF backend which can use eBPF to run stap scripts, rather than using kernel modules as traditional SystemTap does.

To use this backend, install systemtap using yum:

# yum install systemtap

RHEL-7.6 Beta comes with systemtap-3.3:

# rpm -qa systemtap
systemtap-3.3-2.el7.x86_64

SystemTap uses debuginfo to understand function arguments, so we also need to install the kernel-debuginfo package from the respective channels:

# subscription-manager repos --enable=rhel-7-server-debug-rpms
# yum install kernel-debuginfo-$(uname -r)

Let's say I'd like to create a tool similar to the killsnoop tool presented above. That tool just traces the kill(2) syscall, so we can trace the respective kernel function for the same outcome:

# stap --runtime=bpf -e 'probe kernel.function("sys_kill") { printf("PID %d sends signal %d\n", $pid, $sig); }'
PID 13197 sends signal 15
PID 13197 sends signal 15

This output was printed while I was running these commands in another shell:

# kill 13197
# kill 13197

More details on stapbpf can be found in the stapbpf(8) man page or in Aaron Merey's post "Introducing stapbpf - SystemTap's new BPF backend."

Debugging of eBPF

Let's verify that running these tools really uses eBPF for execution. RHEL-7.6 comes with bpftool, which can be used to list and dump the eBPF programs loaded in the running kernel.

# yum install bpftool
# bpftool prog list
3: kprobe  name syscall__kill  tag 46f0a9df02801539  gpl
    loaded_at Aug 30/16:23  uid 0
    xlated 240B  jited 172B  memlock 4096B  map_ids 3
4: kprobe  name do_ret_sys_kill  tag ff8388e5d5f0b53a  gpl
    loaded_at Aug 30/16:23  uid 0
    xlated 392B  jited 251B  memlock 4096B  map_ids 3,4

We can see that two eBPF programs were loaded on the machine while I was running the killsnoop tool. This is because killsnoop traces both the call entry and the return, so two BPF programs are attached. The one named syscall__kill is used to trace entry into the kill(2) syscall handler, and the one named do_ret_sys_kill traces returns from this handler (in order to record the function's return value).

Another option for listing processes that are using eBPF is to run the bpflist(8) tool from bcc-tools:

# /usr/share/bcc/tools/bpflist
PID    COMM             TYPE  COUNT
13159  killsnoop        prog  2
13159  killsnoop        map   2

We can use bpftool to dump and disassemble one of these programs:

# bpftool prog dump xlated id 3
   0: (79) r7 = *(u64 *)(r1 +104)
   1: (79) r6 = *(u64 *)(r1 +112)
   2: (85) call bpf_get_current_pid_tgid#56176
   3: (63) *(u32 *)(r10 -4) = r0
   4: (b7) r1 = 0
   5: (7b) *(u64 *)(r10 -16) = r1
   6: (7b) *(u64 *)(r10 -24) = r1
   7: (7b) *(u64 *)(r10 -32) = r1
   8: (67) r0 <<= 32
   9: (77) r0 >>= 32
  10: (7b) *(u64 *)(r10 -40) = r0
  11: (bf) r1 = r10
  12: (07) r1 += -24
  13: (b7) r2 = 16
  14: (85) call bpf_get_current_comm#56400
  15: (67) r0 <<= 32
  16: (77) r0 >>= 32
  17: (55) if r0 != 0x0 goto pc+10
  18: (63) *(u32 *)(r10 -32) = r7
  19: (63) *(u32 *)(r10 -28) = r6
  20: (18) r1 = map[id:3]
  22: (bf) r2 = r10
  23: (07) r2 += -4
  24: (bf) r3 = r10
  25: (07) r3 += -40
  26: (b7) r4 = 0
  27: (85) call bpf_map_update_elem#56240
  28: (b7) r0 = 0
  29: (95) exit

This is the eBPF assembly used by the killsnoop program to collect the data it needs.

Creating eBPF tools

Let's take a deep dive and look at how the killsnoop(8) tool is implemented. We can see that the tool itself is actually a Python script:

# file /usr/share/bcc/tools/killsnoop
/usr/share/bcc/tools/killsnoop: Python script, ASCII text executable

A closer look at this script reveals that it contains some C code, stored as quoted text in the variable bpf_text.

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct val_t {
    u64 pid;
    int sig;
    int tpid;
    char comm[TASK_COMM_LEN];
};

struct data_t {
    u64 pid;
    int tpid;
    int sig;
    int ret;
    char comm[TASK_COMM_LEN];
};

BPF_HASH(infotmp, u32, struct val_t);
BPF_PERF_OUTPUT(events);

int syscall__kill(struct pt_regs *ctx, int tpid, int sig)
{
    u32 pid = bpf_get_current_pid_tgid();
    FILTER

    struct val_t val = {.pid = pid};
    if (bpf_get_current_comm(&val.comm, sizeof(val.comm)) == 0) {
        val.tpid = tpid;
        val.sig = sig;
        infotmp.update(&pid, &val);
    }

    return 0;
};

int do_ret_sys_kill(struct pt_regs *ctx)
{
    struct data_t data = {};
    struct val_t *valp;
    u32 pid = bpf_get_current_pid_tgid();

    valp = infotmp.lookup(&pid);
    if (valp == 0) {
        // missed entry
        return 0;
    }

    bpf_probe_read(&data.comm, sizeof(data.comm), valp->comm);
    data.pid = pid;
    data.tpid = valp->tpid;
    data.ret = PT_REGS_RC(ctx);
    data.sig = valp->sig;

    events.perf_submit(ctx, &data, sizeof(data));
    infotmp.delete(&pid);

    return 0;
}
"""

The bpf_text string contains the C code which is compiled into BPF assembly and passed to the kernel. The BPF_HASH() macro tells the bcc framework to create a BPF map of type hash named infotmp. This map is used to pass data between the sys_kill kprobe and the respective kretprobe which executes on exit from the sys_kill() function. The map is indexed by the sender's process ID (pid), so the key type is u32. The value type val_t is used to pass the necessary data to the return probe.

BPF_PERF_OUTPUT() creates a perf event buffer which is later used to pass data to userland. We can see that the final event contains all the important information about the signal: the PIDs of the sending and receiving processes, the signal number itself, the return value of the kill() syscall and the name of the sender. In the sys_kill() kprobe we don't yet know the result of the call; that only becomes visible on return from the function.

The code for the syscall kprobe and kretprobe is defined further down in the functions syscall__kill() and do_ret_sys_kill(). The kprobe function stores the target process PID, the signal number and the name of the sender in the infotmp map. The kretprobe looks this data up in the infotmp map using the current process PID as the index, adds the return value of the call, and submits the data to the perf event buffer.

The rest of the script is rather trivial:

# initialize BPF
b = BPF(text=bpf_text)
kill_fnname = b.get_syscall_fnname("kill")
b.attach_kprobe(event=kill_fnname, fn_name="syscall__kill")
b.attach_kretprobe(event=kill_fnname, fn_name="do_ret_sys_kill")

Here we initialize the bcc BPF framework and tell it to attach the syscall__kill() function from the C code to the kprobe defined for the kill(2) syscall, and do_ret_sys_kill() to the return probe of the same syscall.

class Data(ct.Structure):
    _fields_ = [
        ("pid", ct.c_ulonglong),
        ("tpid", ct.c_int),
        ("sig", ct.c_int),
        ("ret", ct.c_int),
        ("comm", ct.c_char * TASK_COMM_LEN)
    ]

The data structure layout is defined to match the struct data_t in the C code.

# process event
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents

    if (args.failed and (event.ret >= 0)):
        return

    print("%-9s %-6d %-16s %-4d %-6d %d" % (strftime("%H:%M:%S"),
        event.pid, event.comm.decode(), event.sig, event.tpid, event.ret))

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    b.perf_buffer_poll()

Finally, we poll for perf events and print each received event.

Writing eBPF tools requires a deeper understanding of Linux kernel internals in order to identify which kprobes need to be employed to collect the data. The BCC framework provides multiple ways to collect, store, sort and analyse the data so that the amount of data passed to userspace is minimal. Once such tools are created, eBPF can provide an effective way to collect data from a running system and offers a new way to monitor the status of all kernel subsystems.
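To illustrate how BCC lets the kernel do the aggregation, here is a small sketch that summarizes the read sizes requested through vfs_read() into an in-kernel log2 histogram, so only the final summary is copied to userspace when it is printed. The probe target and all names here (trace_vfs_read, sizes) are our own choices, not part of bcc-tools:

from bcc import BPF
from time import sleep

# Kernel side: aggregate the third argument of vfs_read() (the requested
# read size) into a log2 histogram; only the histogram leaves the kernel.
prog = """
#include <uapi/linux/ptrace.h>

BPF_HISTOGRAM(sizes);

int trace_vfs_read(struct pt_regs *ctx)
{
    size_t count = PT_REGS_PARM3(ctx);   // third argument of vfs_read()
    sizes.increment(bpf_log2l(count));
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="vfs_read", fn_name="trace_vfs_read")

print("Tracing vfs_read() sizes for 10 seconds...")
sleep(10)

# Userland side: print the aggregated histogram.
b["sizes"].print_log2_hist("read size (bytes)")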
