The recent release of Red Hat Enterprise Linux 7.6 enables the extended Berkeley Packet Filter (eBPF) in-kernel virtual machine, which can be used for system tracing. In this blog we introduce the basic concepts of this technology and a few example use cases. We also present some of the existing tooling built on top of eBPF.
Before starting with eBPF, it's worth noting that the traditional Berkeley Packet Filter, available via setsockopt(SO_ATTACH_FILTER), is still available unmodified.
eBPF enables programmers to write code which gets executed in kernel space in a secure, restricted environment, yet one that lets them create tools which would otherwise require writing a new kernel module.
eBPF in Red Hat Enterprise Linux 7.6 is provided as a Tech Preview and thus doesn't come with full support and is not suitable for deployment in production. It is provided with the primary goal of gaining wider exposure, and potentially moving to full support in the future.
eBPF in Red Hat Enterprise Linux 7.6 is enabled only for tracing purposes, which allows attaching eBPF programs to probes, tracepoints and perf events. Other use cases, such as eBPF socket filters or eXpress Data Path (XDP), are not enabled at this stage.
Design of eBPF
eBPF introduces a new syscall, bpf(2). This syscall is used for all eBPF operations, such as loading programs, attaching them to certain events, creating eBPF maps, and accessing map contents from tools. We'll talk about eBPF maps later.
eBPF programs are written in a special assembly language. When an application uses the bpf(2) syscall to load a program into the kernel, the eBPF verifier inspects the code to check that it can execute safely. The programs run in a sandboxed virtual machine with access only to a limited set of resources and data, and the verifier is designed to ensure that each program safely terminates. This means that an eBPF program should be safe to run, even in production, as it is designed not to cause any unwanted side effects.
Once loaded into the kernel, the programs are just-in-time compiled into native machine code.
eBPF maps provide a generic key-value store which can be accessed both from eBPF programs loaded in the kernel and from userland applications via the bpf(2) syscall. Maps can thus be used to pass data from kernel to userland and vice versa.
A limited set of kernel helper functions may be called from eBPF programs. These functions provide read and write access to eBPF maps, can be used to retrieve the current processor id and task pointer, and also provide access to perf events.
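To make maps and helpers concrete, below is a minimal sketch written with the bcc framework (covered in the next section). It is illustrative only, not one of the shipped tools, and the map name, probe point and counting logic are our own choices: it counts clone(2) calls per process in an eBPF hash map using the bpf_get_current_pid_tgid() helper, and then reads the same map from userland, where bcc wraps the bpf(2) syscall for us.

#!/usr/bin/python
# Illustrative sketch (not a shipped tool): count clone(2) calls per
# process in an eBPF hash map, then read the map from userland.
from time import sleep
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);   /* eBPF map: pid -> number of clone() calls */

int trace_clone(struct pt_regs *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;   /* helper function call */
    u64 zero = 0, *val;

    val = counts.lookup_or_init(&pid, &zero);     /* map access from kernel */
    (*val)++;
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")

sleep(5)
# Userland side: bcc reads the same map via the bpf(2) syscall.
for pid, count in b["counts"].items():
    print("pid %d called clone() %d times" % (pid.value, count.value))

Because the counting happens in kernel space, only the aggregated per-pid totals need to cross into userland.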
Note that eBPF programs always need to be compiled against the kernel headers of the specific kernel on which they will run.
Quick start guide to eBPF with bcc-tools
eBPF is enabled from the Red Hat Enterprise Linux 7.6 Beta release onwards, so the first step is to ensure we are running a Linux kernel of version at least 3.10.0-940.el7 with eBPF support:
# uname -r
3.10.0-940.el7.x86_64
Developing tools based on eBPF can require deep knowledge of the kernel. Fortunately, many such tools have already been created and are ready to use. They can be used on their own and can also serve as a reference for creating new eBPF programs.
eBPF programs can be attached to any function in the kernel with access to its arguments, thus only users with the CAP_SYS_ADMIN capability can use the bpf(2) syscall. Therefore I am running all the examples below as the root user.
Development of tracing tools using eBPF can be simplified by using the BPF Compiler Collection (BCC). Many useful pre-created tools also ship as part of the bcc-tools package. For more details about BCC and the tools it provides, see the project's GitHub page.
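To give a flavour of what writing with BCC looks like, here is the canonical bcc "Hello World" (a sketch, assuming the bcc Python bindings pulled in by the packages below are available): it embeds a one-line eBPF program as C text, auto-attaches it to the kernel's sys_clone() function, and prints the trace output.

#!/usr/bin/python
# Canonical bcc "Hello World": print a message on every clone() syscall.
from bcc import BPF

# Naming the C function kprobe__sys_clone makes bcc auto-attach it
# as a kprobe on the kernel's sys_clone() function.
prog = """
int kprobe__sys_clone(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}
"""
BPF(text=prog).trace_print()   # prints /sys/kernel/debug/tracing/trace_pipe

bpf_trace_printk() writes to the shared trace pipe, which is fine for experiments; real tools pass data through BPF maps or perf buffers, as shown later in this post.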
To start using bcc-tools, install the respective package:
# yum install bcc-tools
These tools require no configuration and are ready to be used. Let's say we want to trace all kill signals being sent by processes running on the machine. There's a tool, killsnoop(8), for exactly this purpose:
# /usr/share/bcc/tools/killsnoop
TIME      PID    COMM             SIG  TPID   RESULT
17:40:14  18310  bash             15   18315  0
This is the output of the killsnoop(8) command after I sent a signal to the process with pid 18315 using the kill(1) command.
Using eBPF with SystemTap
There are a few other interesting tools which internally use eBPF. SystemTap release 3.2 and later include a BPF backend which can run stap scripts via eBPF rather than via kernel modules, as traditional SystemTap does.
To use this backend, install systemtap using yum:
# yum install systemtap
RHEL-7.6 Beta comes with systemtap-3.3:
# rpm -qa systemtap
systemtap-3.3-2.el7.x86_64
SystemTap uses debuginfo to understand function arguments, so we also need to install the kernel-debuginfo package from the respective channels:
# subscription-manager repos --enable=rhel-7-server-debug-rpms
# yum install kernel-debuginfo-$(uname -r)
Let's say I'd like to create a tool similar to the killsnoop tool presented above. That tool just traces the kill(2) syscall, so we can trace the respective kernel function for the same outcome:
# stap --runtime=bpf -e 'probe kernel.function("sys_kill") { printf("PID %d sends signal %d\n", $pid, $sig); }'
PID 13197 sends signal 15
PID 13197 sends signal 15
This output was printed while I was running these commands in another shell:
# kill 13197
# kill 13197
More details on stapbpf can be found in the stapbpf(8) man page or in Aaron Merey's post "Introducing stapbpf - SystemTap's new BPF backend."
Debugging of eBPF
Let's verify that running these tools really uses eBPF for execution. RHEL-7.6 comes with bpftool, which can be used to list and dump the eBPF programs loaded in the running kernel.
# yum install bpftool
# bpftool prog list
3: kprobe  name syscall__kill  tag 46f0a9df02801539  gpl
    loaded_at Aug 30/16:23  uid 0
    xlated 240B  jited 172B  memlock 4096B  map_ids 3
4: kprobe  name do_ret_sys_kill  tag ff8388e5d5f0b53a  gpl
    loaded_at Aug 30/16:23  uid 0
    xlated 392B  jited 251B  memlock 4096B  map_ids 3,4
We can see two eBPF programs loaded on the machine while the killsnoop tool was running. This is because killsnoop traces both the entry to and the return from the kill(2) syscall handler, so two BPF programs are attached. The one named syscall__kill traces the entry of the handler, and the one named do_ret_sys_kill traces returns from it (in order to record the function's return value).
Another option to list processes using eBPF is to run the bpflist(8) tool from bcc-tools:
# /usr/share/bcc/tools/bpflist
PID    COMM             TYPE     COUNT
13159  killsnoop        prog     2
13159  killsnoop        map      2
We can use bpftool to dump and disassemble one of these programs:
# bpftool prog dump xlated id 3
  0: (79) r7 = *(u64 *)(r1 +104)
  1: (79) r6 = *(u64 *)(r1 +112)
  2: (85) call bpf_get_current_pid_tgid#56176
  3: (63) *(u32 *)(r10 -4) = r0
  4: (b7) r1 = 0
  5: (7b) *(u64 *)(r10 -16) = r1
  6: (7b) *(u64 *)(r10 -24) = r1
  7: (7b) *(u64 *)(r10 -32) = r1
  8: (67) r0 <<= 32
  9: (77) r0 >>= 32
 10: (7b) *(u64 *)(r10 -40) = r0
 11: (bf) r1 = r10
 12: (07) r1 += -24
 13: (b7) r2 = 16
 14: (85) call bpf_get_current_comm#56400
 15: (67) r0 <<= 32
 16: (77) r0 >>= 32
 17: (55) if r0 != 0x0 goto pc+10
 18: (63) *(u32 *)(r10 -32) = r7
 19: (63) *(u32 *)(r10 -28) = r6
 20: (18) r1 = map[id:3]
 22: (bf) r2 = r10
 23: (07) r2 += -4
 24: (bf) r3 = r10
 25: (07) r3 += -40
 26: (b7) r4 = 0
 27: (85) call bpf_map_update_elem#56240
 28: (b7) r0 = 0
 29: (95) exit
This is the eBPF assembly used by the killsnoop program to collect the data it needs.
Creating eBPF tools
Let's take a deep dive and look at how the killsnoop(8) tool is implemented. We can see that the tool itself is actually a Python script:
# file /usr/share/bcc/tools/killsnoop
/usr/share/bcc/tools/killsnoop: Python script, ASCII text executable
A closer look at this script reveals that it contains some C code, quoted as text in the variable bpf_text.
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct val_t {
    u64 pid;
    int sig;
    int tpid;
    char comm[TASK_COMM_LEN];
};

struct data_t {
    u64 pid;
    int tpid;
    int sig;
    int ret;
    char comm[TASK_COMM_LEN];
};

BPF_HASH(infotmp, u32, struct val_t);
BPF_PERF_OUTPUT(events);

int syscall__kill(struct pt_regs *ctx, int tpid, int sig)
{
    u32 pid = bpf_get_current_pid_tgid();

    FILTER

    struct val_t val = {.pid = pid};
    if (bpf_get_current_comm(&val.comm, sizeof(val.comm)) == 0) {
        val.tpid = tpid;
        val.sig = sig;
        infotmp.update(&pid, &val);
    }

    return 0;
};

int do_ret_sys_kill(struct pt_regs *ctx)
{
    struct data_t data = {};
    struct val_t *valp;
    u32 pid = bpf_get_current_pid_tgid();

    valp = infotmp.lookup(&pid);
    if (valp == 0) {
        // missed entry
        return 0;
    }

    bpf_probe_read(&data.comm, sizeof(data.comm), valp->comm);
    data.pid = pid;
    data.tpid = valp->tpid;
    data.ret = PT_REGS_RC(ctx);
    data.sig = valp->sig;

    events.perf_submit(ctx, &data, sizeof(data));
    infotmp.delete(&pid);

    return 0;
}
"""
The bpf_text string contains the C code which is compiled into BPF assembly and passed to the kernel. The BPF_HASH() macro tells the bcc framework to create a BPF map of type hash named infotmp. This map is used to pass data between the sys_kill kprobe and the respective kretprobe, which executes on exit from the sys_kill() function. The map is indexed by the sender's process id (pid), hence the key type u32. A value of type val_t is used to pass the necessary data to the return probe.
BPF_PERF_OUTPUT() creates a perf event buffer which is later used to pass data to userland. We can see that the final event contains all the important information about the signal: the pids of the sending and receiving processes, the signal number itself, the return value of the kill() syscall, and the name of the sender. In the sys_kill() kprobe we don't yet know the result of the call; it only becomes visible on return from the function.
The code for the syscall kprobe and kretprobe is defined further on in the functions syscall__kill() and do_ret_sys_kill(). The kprobe function stores the target process pid, the signal number and the name of the sender in the infotmp map. The kretprobe looks up this data in the infotmp map using the current process pid as the index, adds the return value of the call, and submits the data to the perf event buffer.
The rest of the script is rather trivial:
# initialize BPF
b = BPF(text=bpf_text)
kill_fnname = b.get_syscall_fnname("kill")
b.attach_kprobe(event=kill_fnname, fn_name="syscall__kill")
b.attach_kretprobe(event=kill_fnname, fn_name="do_ret_sys_kill")
Here we initialize the bcc BPF framework and tell it to attach the syscall__kill() function from the C code to the kprobe defined for the kill(2) syscall, and do_ret_sys_kill() to the return probe of the same syscall.
class Data(ct.Structure):
    _fields_ = [
        ("pid", ct.c_ulonglong),
        ("tpid", ct.c_int),
        ("sig", ct.c_int),
        ("ret", ct.c_int),
        ("comm", ct.c_char * TASK_COMM_LEN)
    ]
The data structure layout is defined to match the struct data_t in the C code.
# process event
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents
    if (args.failed and (event.ret >= 0)):
        return
    print("%-9s %-6d %-16s %-4d %-6d %d" % (strftime("%H:%M:%S"),
        event.pid, event.comm.decode(), event.sig, event.tpid, event.ret))

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    b.perf_buffer_poll()
Finally, we poll for perf events and print each event received.
Writing eBPF tools requires a deeper understanding of Linux kernel internals in order to identify which kprobes to employ to collect the data. The BCC framework provides multiple ways to collect, store, sort and analyse the data in kernel space, so that the amount of data passed to userspace is minimal. Once such tools are created, eBPF provides an effective way to collect data from a running system and offers a new way to monitor the status of all kernel subsystems.
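As an illustration of that in-kernel aggregation, here is one more hedged sketch in the same bcc style (again our own example, not a shipped tool): it summarises the sizes returned by read(2) into an in-kernel log2 histogram, so only the final buckets cross the kernel/userspace boundary.

#!/usr/bin/python
# Illustrative sketch: aggregate read(2) return sizes into an in-kernel
# log2 histogram so only summarized buckets are passed to userspace.
from time import sleep
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HISTOGRAM(dist);    /* in-kernel log2 histogram */

int do_ret_sys_read(struct pt_regs *ctx)
{
    long ret = PT_REGS_RC(ctx);            /* bytes returned by read() */
    if (ret > 0)
        dist.increment(bpf_log2l(ret));    /* aggregate in kernel space */
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kretprobe(event=b.get_syscall_fnname("read"), fn_name="do_ret_sys_read")

print("Tracing read() sizes for 10 seconds...")
sleep(10)
b["dist"].print_log2_hist("bytes")   # render the summarized histogram in userland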