Issue #5 March 2005
Gaining insight into the Linux® kernel with Kprobes
by William Cohen
Introduction
Kernel developers have often resorted to "diagnostic print statements" to understand what is occurring in the Linux kernel. This technique is painful because a new kernel must be built and installed, and the machine must then be rebooted with it. Each new experiment requires another reboot, which can take minutes on some machines.
Developers have found the ability to inspect the operation of unmodified executables to be very useful. For userspace applications, developers can use debuggers to set breakpoints at specific locations in the unmodified executable. When the processor encounters a breakpoint, the developer uses the debugger to inspect program state and gain insight into how the program is operating (or failing). This method of examining a program has several advantages over the traditional technique of compiling "diagnostic print statements" into it:
- The developer does not change the source code of the original program.
- The developer avoids unintended changes caused by rebuilding the executable.
- The developer can avoid the expense of recompiling the program and restarting the program each time something else is examined. In some cases it may not be feasible for the developer to rebuild the application.
Due to interrupt handling, it is not feasible to completely stop the Linux kernel and wait for the developer to type in commands. However, it is possible to place snippets of instrumentation code in the kernel to collect information at specific locations, such as whether a specific function is being executed and the state of variables. The recent 2.6 Linux kernels, including the x86 kernel in the upcoming Fedora Core 4, allow developers to gather information about the Linux kernel's operation without compiling or booting a new kernel. This is implemented with Kprobes, a dynamic instrumentation system. This article describes how Kprobes operates and provides kernel instrumentation examples.
Kprobes
Kprobes is a dynamic instrumentation system in the mainline 2.6 Linux kernel and will be enabled in the soon to be released x86 Fedora Core 4 kernels. Kprobes allows one to gather additional information about kernel operation without recompiling or rebooting a kernel. Kprobes enables locations in the kernel to be instrumented with code, and the instrumentation code runs when the processor encounters that probe point. Once the instrumentation code completes execution, the kernel continues normal execution.
The Kprobes instrumentation is built as a kernel module. Thus, rather than having to recompile and reboot the system with an instrumented kernel, a kprobe instrumentation module can be written, compiled, and loaded on the system. There is no need to reboot the system. Once the instrumentation module has served its purpose, it can be unloaded, and the kernel returned to its normal operation.
There are two types of kernel probes available: kprobes and jprobes. A kprobe inserts a probe at a specific instruction. The instrumentation provided by a kprobe could be inserted anywhere in a function, thus the kprobe code cannot make assumptions about local variables or arguments passed into the function being probed. A jprobe instruments the entry of a function and allows the probe to examine the arguments passed into the probed function.
The kprobe support in the kernel provides simple data structures and a set
of functions to allow the insertion and removal of kernel probes. A data
structure is filled out and registered with a call to either the
register_kprobe or
register_jprobe function. The data
structure passed to the register function must remain allocated until the
kernel probe is unregistered with either a matching
unregister_kprobe or
unregister_jprobe. Table 1, "Kernel probes management functions," summarizes the functions used to register
and unregister the probes. The register functions return zero if the
operation was successful and a negative value if the operation was
unsuccessful.
| int register_kprobe(struct kprobe *p); |
| int register_jprobe(struct jprobe *p); |
| void unregister_kprobe(struct kprobe *p); |
| void unregister_jprobe(struct jprobe *p); |
Listing 1, kprobe data structure shows the fields of struct
kprobe. The
addr field is the linear address of the
instruction being probed. The developer needs to determine the appropriate
address for addr. In the examples in this
article the address was an exported function and could be placed in the
code. In other cases you may have to examine the
System.map file or the disassembled kernel code to
find the appropriate value for the address. The
pre_handler field is a function pointer
to the function run before the execution of the probed instruction. The
post_handler field is a function pointer
to the function executed following the execution of the instruction. The
fault_handler field is a pointer to the
function to run if there is a fault during the execution of the probe
code.
struct kprobe {
        /* elided fields for internal state information */
        kprobe_opcode_t *addr;
        kprobe_pre_handler_t pre_handler;
        kprobe_post_handler_t post_handler;
        kprobe_fault_handler_t fault_handler;
        /* elided fields for internal state information */
};
The jprobe is built on top of the basic kprobe. The jprobes simplify the
instrumentation of function entries and allow one to inspect the arguments
passed to the function. The struct
jprobe contains a struct
kprobe for the kprobe information related
to the jprobe. There are two pieces of information that need to be entered
into the struct jprobe: the entry field
which points to the instrumentation function that has the same arguments
list as the instrumented function and the
addr field in kp. The other fields in the
struct kprobe are filled out when the
jprobe is registered.
struct jprobe {
        struct kprobe kp;
        kprobe_opcode_t *entry; /* probe handling code to jump to */
};
The execution of a kprobe has similarities to the execution of a
breakpoint set by a debugger. The instruction at the kernel probe location
is saved in a buffer, and the instruction at that location is replaced by
a breakpoint instruction. When the processor encounters the breakpoint,
the trap handler is invoked. A check is made to determine whether there is
a kprobe registered at this location. If there is no probe registered for
that location, the breakpoint is passed on to the normal handler. If a
probe is found, the pre_handler function
is executed, the probed instruction is executed, then the
post_handler function is executed. The
execution resumes at the instruction following the probed instruction.
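
The sequence above can be sketched as a small userspace model. This is not kernel code: every name here (model_probe, model_register, model_trap, and so on) is hypothetical, the "probed instruction" is just a function pointer standing in for the saved original instruction, and allowing only one probe per address is a simplification made by the model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only: none of these names are kernel APIs. */
#define MODEL_MAX_PROBES 8

struct model_probe {
    unsigned long addr;                          /* address of the probed instruction */
    void (*pre_handler)(struct model_probe *);   /* runs before the instruction */
    void (*post_handler)(struct model_probe *);  /* runs after the instruction */
};

struct model_probe *model_table[MODEL_MAX_PROBES];

/* Records the order of events so the sequence can be inspected. */
char model_trace[32];

void model_trace_add(char c)
{
    size_t n = strlen(model_trace);

    assert(n + 1 < sizeof(model_trace));    /* keep room for the terminator */
    model_trace[n] = c;
    model_trace[n + 1] = '\0';
}

struct model_probe *model_find(unsigned long addr)
{
    for (int i = 0; i < MODEL_MAX_PROBES; i++)
        if (model_table[i] && model_table[i]->addr == addr)
            return model_table[i];
    return NULL;
}

/* Returns 0 on success and a negative value on failure. */
int model_register(struct model_probe *p)
{
    if (model_find(p->addr))
        return -1;                  /* this model allows one probe per address */
    for (int i = 0; i < MODEL_MAX_PROBES; i++) {
        if (!model_table[i]) {
            model_table[i] = p;     /* the caller must keep *p allocated */
            return 0;
        }
    }
    return -2;                      /* table full */
}

void model_unregister(struct model_probe *p)
{
    for (int i = 0; i < MODEL_MAX_PROBES; i++)
        if (model_table[i] == p)
            model_table[i] = NULL;
}

/*
 * Model of the trap handler. Returns 1 if a probe handled the breakpoint
 * and 0 if it would be passed on to the normal breakpoint handler.
 */
int model_trap(unsigned long addr, void (*saved_insn)(void))
{
    struct model_probe *p = model_find(addr);

    if (!p)
        return 0;
    if (p->pre_handler)
        p->pre_handler(p);
    saved_insn();                   /* stands in for the saved instruction */
    if (p->post_handler)
        p->post_handler(p);
    return 1;
}

/* Sample handlers that only record when they ran. */
void model_pre(struct model_probe *p)  { (void)p; model_trace_add('P'); }
void model_post(struct model_probe *p) { (void)p; model_trace_add('Q'); }
void model_insn(void)                  { model_trace_add('I'); }
```

With a probe registered at a made-up address such as 0x1000 using model_pre and model_post, a call to model_trap(0x1000, model_insn) leaves "PIQ" in model_trace: pre-handler, instruction, post-handler. A trap at an address with no registered probe returns 0, which corresponds to handing the breakpoint to the normal handler.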
Examples
This article contains two examples: one example using a kprobe and the
other example using a jprobe. Almost all block device I/O goes
through the function
generic_make_request. It is useful to
instrument generic_make_request to
observe its operation. Both examples instrument the
generic_make_request function.
You need to have the kernel-devel RPM matching the
running kernel installed to build these examples. Listing 3, Makefile shows the simple makefile used to build the
instrumentation modules after the kernel-devel RPM
has been installed. There are two source files in the directory:
kprobebio.c and jprobebio.c. In
conjunction with the makefile supplied by
kernel-devel, this makefile creates
kprobebio.ko and jprobebio.ko,
the kernel modules.
Assuming that the kernel-devel RPM matching the
running kernel is installed, you can create the modules with the following
command:
make -C /lib/modules/`uname -r`/build M=`pwd` modules
Kprobe example
The kprobe example kprobebio.c in Listing 4, kprobebio.c demonstrates how to count the number of times the
generic_make_request function is
called. Since the kprobe is a module, the instrumentation is inserted
when the module is loaded. When the instrumentation is removed, the
results of the instrumentation are written to
/var/log/messages by a printk in
this example. Other means of extracting the data are possible.
The include for linux/kprobes.h contains the needed
data structures for kprobes and jprobes. The include for
linux/blkdev.h declares the function
generic_make_request, which is needed
to put the probe in the correct location.
The inst_generic_make_request function
is the instrumentation function that is called each time the
generic_make_request function is
called. Normally, as in this case, the instrumentation function returns
a value of 0 to indicate that the instrumented instruction should be handled
normally.
The function init_module sets up the
kprobe data structure and starts the instrumentation. There is only an
instrumentation function to execute before the probed instruction:
pre_handler. Thus, the
post_handler and
fault_handler are set to NULL. The
address of the instrumented function is set in
kp.addr. The data structure is
registered via register_kprobe. After
the register_kprobe, the
instrumentation is operating and counting the number of times that
generic_make_request is called. The
cleanup_module unregisters the probe
and then writes the data to /var/log/messages via a
printk.
The instrumentation is started as root with the following command:
/sbin/insmod kprobebio.ko
The instrumentation is shut down as root with the following command:
/sbin/rmmod kprobebio
When the module is unloaded, the data is written to
/var/log/messages. Listing 5, Output of kprobebio module in /var/log/messages shows the output from this particular
example.
Feb 23 12:09:20 slingshot kernel: kprobe registered
Feb 23 12:09:31 slingshot kernel: kprobe unregistered
Feb 23 12:09:31 slingshot kernel: generic_make_request() called 52 times.
Jprobe example
Another useful mechanism provided by Kprobes support is Jprobes. Jprobes
allow instrumentation of the function entry and access to the arguments
passed into the instrumented function. Listing 6, jprobebio.c
shows the code to generate instrumentation that counts the number of
times that generic_make_request is
called. The example in Listing 6, jprobebio.c also accumulates the
number of sectors moved in the requests and keeps track of the calls and
sectors on a per-device basis.
The linux/bio.h is included to describe the data
structure used by
generic_make_request. This is required
because the instrumentation function
inst_generic_make_request now has the
same arguments as the original
generic_make_request function. These
arguments can be accessed inside the instrumentation function. For this
example the bio pointer is examined to determine the device for which
the request is being made and the number of sectors being transferred. A
simple hash table is implemented to separate the data for the different
devices.
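
The per-device bookkeeping described above can be sketched in userspace as follows. This is a minimal model under stated assumptions, not the code from Listing 6: the types and names (dev_stats, request_stats, record_request) are hypothetical, the device key is an opaque pointer standing in for the block device that the real handler would read from the bio, and linear probing is just one simple way to build the hash table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: these types and names are hypothetical. */
#define DEV_HASH_SIZE 16

struct dev_stats {
    const void *bdev;        /* key: stands in for the block device pointer */
    unsigned long calls;     /* number of requests seen for this device */
    unsigned long sectors;   /* total sectors in those requests */
    int in_use;              /* nonzero once the slot holds a device */
};

struct request_stats {
    unsigned long total_calls;
    unsigned long total_sectors;
    struct dev_stats table[DEV_HASH_SIZE];
};

/* Hash a pointer into a slot index; low bits are dropped because of alignment. */
unsigned int dev_hash(const void *bdev)
{
    return (unsigned int)(((uintptr_t)bdev >> 4) % DEV_HASH_SIZE);
}

/*
 * Find the slot for a device, creating it on first use (linear probing).
 * Returns NULL when the table is full and the device is not present.
 */
struct dev_stats *dev_lookup(struct request_stats *rs, const void *bdev)
{
    unsigned int start = dev_hash(bdev);

    for (unsigned int i = 0; i < DEV_HASH_SIZE; i++) {
        struct dev_stats *slot = &rs->table[(start + i) % DEV_HASH_SIZE];

        if (!slot->in_use) {
            slot->in_use = 1;
            slot->bdev = bdev;
            return slot;
        }
        if (slot->bdev == bdev)
            return slot;
    }
    return NULL;
}

/* What the instrumentation function would do for each request. */
void record_request(struct request_stats *rs, const void *bdev,
                    unsigned long sectors)
{
    struct dev_stats *slot;

    assert(rs != NULL);
    rs->total_calls++;
    rs->total_sectors += sectors;

    slot = dev_lookup(rs, bdev);
    if (slot) {                 /* totals are still kept if the table is full */
        slot->calls++;
        slot->sectors += sectors;
    }
}
```

Starting from a zero-initialized struct request_stats, recording 26 requests of 8 sectors for one device and 93 requests of 8 sectors for another yields totals of 119 calls and 952 sectors, split as 208 and 744 sectors per device. The 8-sector request size is an assumption chosen so the figures line up with the sample output in this article.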
Another significant difference between kprobes and jprobes is how the
instrumentation function is exited. In a jprobe there needs to be an
explicit jprobe_return rather than a
kprobe function's return 0;.
A jprobe uses a struct jprobe to describe the instrumentation
point. In this example the entry is made to point to the
inst_generic_make_request
function. In init_module the addr member of the kprobe
field (kp) in the jprobe struct is initialized to point at the function
being instrumented,
generic_make_request. The other
fields in the kprobe field are set up appropriately for the jprobe
when the register_jprobe function is
called.
When the module is removed from the kernel,
cleanup_module is executed. This unregisters the
probe and prints out the recorded data much in the same way that the
earlier kprobe example operates. Like the kprobebio example,
the jprobebio module's instrumentation is started when it is loaded into the
kernel with an insmod command and writes out the
data when the module is removed with an rmmod
command. Listing 7, Output of the jprobebio module in /var/log shows the output of
the module in /var/log/messages.
Feb 23 13:55:01 slingshot kernel: plant jprobe at c024f900, handler addr e09e4000
Feb 23 13:55:02 slingshot crond(pam_unix)[5969]: session closed for user root
Feb 23 13:55:21 slingshot kernel: jprobe unregistered
Feb 23 13:55:21 slingshot kernel: generic_make_request() called 119 times for 952 sectors.
Feb 23 13:55:21 slingshot kernel: bdev 0xcb199da8 (3,5) 26 208 sectors.
Feb 23 13:55:21 slingshot kernel: bdev 0xdf00eda8 (3,2) 93 744 sectors.
The future
The examples in this article show how to write simple instrumentation using the Kprobes support in the Fedora Core 4 kernels. However, one might notice that the instrumentation is written in raw C code, and it is quite possible to crash the machine if the instrumentation code has a flaw in it. The Kprobes mechanism is also a very low-level interface that simply places individual probes where directed. There is no predefined library that selects groups of probe points to measure things that a regular user might be interested in. Thus, Kprobes currently requires a good understanding of the kernel, both to know which locations to instrument to get data and to analyze the collected data to produce a meaningful result.
An effort has started to address these deficiencies in the current kprobe instrumentation: SystemTap. SystemTap will provide a safer language for writing the instrumentation and a library of useful instrumentation.




