Issue #11 September 2005

Instrumenting the Linux Kernel with SystemTap

Introduction

The goal of SystemTap is to provide infrastructure that simplifies gathering information about the running Linux kernel so that it can be further analyzed. This can assist in identifying the underlying cause of a performance or functional problem. SystemTap is designed to eliminate the need for the developer to go through the tedious instrument, recompile, install, and reboot sequence required to collect data on the operation of the kernel. The recent addition of Kprobes to the Linux kernel provides the needed support but does not provide an easy-to-use infrastructure. SystemTap provides a simple command line interface and scripting language for writing kernel instrumentation.

SystemTap is still under development and evolving; do not use it on production systems, and expect that things will change. However, SystemTap in its current state can still be a useful tool for developers. This article describes how to install SystemTap on a Fedora™ Core 4 machine, presents some example SystemTap instrumentation scripts, and explains how SystemTap is implemented.

Setting up SystemTap

The examples for this article were generated on a typical ThinkPad® T41 laptop with an Intel® Pentium® M processor. Table 1, “Hardware and software configuration of machine”, details the hardware and software configuration used for the examples. The vast majority of the software is the stock Fedora Core 4 software, simply updated via yum on September 13, 2005. There are three additional RPMs required to run SystemTap: the kernel-debuginfo RPM, which provides the information about the locations of variables and functions in the kernel; the kernel-devel RPM, which SystemTap requires to build the kernel modules; and the systemtap RPM itself.

Check the version of the running kernel with the uname -r command. If the matching kernel-devel RPM is not installed, root can install it with the following command:

yum install kernel-devel

The debuginfo RPMs are not installed by default because they are large and rarely needed. However, the matching kernel-debuginfo RPM for Fedora Core 4 can be installed in the same way as kernel-devel:

yum install kernel-debuginfo
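Because both RPMs have to match the running kernel, it can help to check for them before running stap. The following is a minimal shell sketch, not part of the original setup. The function names required_kernel_rpms and check_kernel_rpms are hypothetical, and the sketch assumes package names of the form shown in Table 1 (the package name followed by the uname -r string).

```shell
#!/bin/sh
# Hypothetical helper: print the names of the kernel RPMs that must match
# the kernel release given as the first argument.
required_kernel_rpms() {
    release="$1"
    for pkg in kernel-devel kernel-debuginfo; do
        printf '%s-%s\n' "$pkg" "$release"
    done
}

# Hypothetical helper: report which of those RPMs are not installed for
# the running kernel. Requires the rpm command.
check_kernel_rpms() {
    for rpm_name in $(required_kernel_rpms "$(uname -r)"); do
        rpm -q "$rpm_name" >/dev/null 2>&1 || echo "missing: $rpm_name"
    done
}
```

Calling check_kernel_rpms on a correctly prepared machine prints nothing. Any line it prints names an RPM that still has to be installed.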

For i386, the updated Fedora Core 4 kernel, kernel-devel, and kernel-debuginfo RPMs can be obtained from the following URLs (or one of the Fedora mirror sites):

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/4/i386/
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/4/i386/debug/

For i386, a current version of the SystemTap RPM can be obtained from the Fedora Core 4 testing directory:

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/4/i386

The SystemTap RPM is installed like any other RPM. This RPM provides the SystemTap translator (stap) and the supporting runtime libraries.

Hardware    ThinkPad T41
            Intel Pentium M 1.6GHz
            1024 KB Cache
            512MB DRAM
            40GB Hard disk
Software    Fedora Core 4 (updated via yum 20050913)
            kernel-2.6.12-1.1447_FC4
            kernel-devel-2.6.12-1.1447_FC4
            kernel-debuginfo-2.6.12-1.1447_FC4
            systemtap-0.4-1
Table 1. Hardware and software configuration of machine

After the SystemTap RPM is installed, you should be able to run stap, the command that compiles an instrumentation script, installs it, and starts data collection. To verify that the command is available and to determine which version of SystemTap is installed, execute the command stap -V, which produces output similar to:

SystemTap translator/driver (version 0.4 built 2005-09-07)
Copyright (C) 2005 Red Hat, Inc.
This is free software; see the source for copying conditions.

Because SystemTap installs instrumentation in the kernel, you need either to run the commands as root or to have an entry in /etc/sudoers that gives your login the required privileges.

Simple SystemTap script example

Refer to Example 1, “A script to count the number of times generic_make_request is called”. This is equivalent to the kprobebio.c example in the Red Hat Magazine March 2005 article, Gain insight into the Linux kernel with Kprobes. Due to SystemTap's design, this example is much smaller than the equivalent raw kprobe example in that article.

There are a few basic types available in SystemTap scripts, such as integers and strings. The type in use is inferred from the context; there is no explicit typing. Thus, in Example 1, “A script to count the number of times generic_make_request is called”, the variable count_generic_make_request has no declared type. The global keyword indicates that it can be used by any of the probes and functions in the script.

The actual probes in SystemTap start with the keyword probe, followed by a description of where to place the probe and the script body to run when the probe executes. In Example 1, “A script to count the number of times generic_make_request is called”, there are three probes. The first probe instruments the kernel function generic_make_request. The probe begin executes before any other probe in the script and is typically used for initialization. In this case it prints a message so you know when the instrumentation starts. The probe end executes when the instrumentation script is being shut down and usually handles output of the collected information, as in this example.

/* kprobebio.stp
   This is a simple module to get information about block I/O operations.
   Will Cohen
*/

global count_generic_make_request

probe kernel.function("generic_make_request")
{
	++count_generic_make_request;
}

probe begin { log("starting probe") }

probe end
{
	log("ending probe")
	log("generic_make_request() called "
	 . string(count_generic_make_request)
	 . " times.");
}
Example 1. A script to count the number of times generic_make_request is called

Running the script is relatively simple. Run the stap command either as root or as a user whose /etc/sudoers entry provides the required root privileges. The console output of stap kprobebio.stp is as follows:

starting probe
ending probe
generic_make_request() called 24 times.

The stap command compiles the script into a loadable kernel module and loads the module. When the module is loaded, the starting probe message is printed. This indicates that data is being collected. The user presses Control-C to conclude the data collection. SystemTap then removes the probes and prints the number of times that the function generic_make_request() was called.

Note:
The first time that you install instrumentation with stap you may see messages about relayfs.ko not being valid. SystemTap uses relayfs to move data from kernel to user space. If no relayfs module exists, SystemTap builds a relayfs module for the kernel. You should not see that message on subsequent runs.

A slightly less simple SystemTap script example

Assume you would like to get a better idea of the number of read and write system calls performed on the system and the amount of data read and written. Refer to Example 2, “A script to accumulate the amount of data read and written by the read and write system calls”. The beginning of the listing declares the global variables that store the state information. There are two probes collecting the data: one for the reads and one for the writes. Each time the sys_read or sys_write function is called, the associated count is incremented. The script also accesses the function argument count, written as $count in the probe body, to track the number of bytes for each operation. Note that count is the size the caller requests, so the totals are an upper bound on the data actually transferred. The end probe writes out the data.

/* rwblock.stp
   This is a simple script to count the number of read and write operations
   and accumulate the number of bytes for each.

   Will Cohen
*/

global count_sys_read
global count_sys_write
global sys_read_bytes
global sys_write_bytes

probe kernel.function("sys_read")
{
	++count_sys_read;
	sys_read_bytes += $count;
}

probe kernel.function("sys_write")
{
	++count_sys_write;
	sys_write_bytes += $count;
}

probe begin { log("starting probe") }

probe end
{
	log("ending probe")
	log(string(sys_read_bytes) . " bytes read with "
	. string(count_sys_read) . " calls to sys_read(), avg size of read "
	. string(sys_read_bytes/count_sys_read) );
	log(string(sys_write_bytes) . " bytes written with "
	. string(count_sys_write) . " calls to sys_write(), avg size of write "
 	. string(sys_write_bytes/count_sys_write) );
}
Example 2. A script to accumulate the amount of data read and written by the read and write system calls

The following is example output when the system is fairly idle and the probe is run for approximately a minute with the command stap rwblock.stp.

starting probe
ending probe
1748216 bytes read with 662 calls to sys_read(), avg size of read 2640
124725 bytes written with 483 calls to sys_write(), avg size of write 258

There are two things that are obvious from this output: there is an order of magnitude more data being read than written on the system, and the average size of a read is much larger than a write (2640 bytes versus 258 bytes).
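The figures behind these two observations can be rechecked by hand. The following shell sketch is only a worked recomputation from the sample output above, not part of the original script, and its variable names are chosen for illustration. Shell arithmetic truncates to an integer, which matches the whole-number averages the script prints.

```shell
#!/bin/sh
# Figures copied from the sample output of rwblock.stp above.
read_bytes=1748216;  read_calls=662
write_bytes=124725;  write_calls=483

# Integer division, so fractional parts are discarded.
avg_read=$((read_bytes / read_calls))
avg_write=$((write_bytes / write_calls))
bytes_ratio=$((read_bytes / write_bytes))

echo "average read size:  $avg_read bytes"    # prints 2640
echo "average write size: $avg_write bytes"   # prints 258
echo "read/write bytes:   ${bytes_ratio}x"    # prints 14x
```

The ratio of roughly 14 to 1 is what the text above calls an order of magnitude.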

SystemTap implementation

SystemTap takes a compiler-oriented approach to generating instrumentation. Refer to Figure 1, “Flow of data in SystemTap” for the overall diagram of SystemTap used in this discussion. In the upper right-hand corner of the diagram is probe.stp, the probe script the developer has written. The translator parses this script into parse trees and checks the input for syntax errors while doing so. The translator then performs elaboration, pulling in additional code from the script library and determining the locations of probe points and variables from the debug information. After the elaboration is complete, the translator generates probe.c, the C source for the kernel module.

The probe.c file is compiled into a regular kernel module, probe.ko, using the GCC compiler. The compilation may pull in support code from the runtime libraries. After GCC has generated probe.ko, the SystemTap daemon is started to collect the output of the instrumentation module. The instrumentation module is loaded into the kernel, and data collection starts. Data from the instrumentation module is transferred to user space via relayfs and displayed by the daemon. When the user presses Control-C, the daemon unloads the module, which also shuts down the data collection process.

Figure 1. Flow of data in SystemTap

Future work

SystemTap is a relatively new tool and is evolving rapidly. Its syntax and operation are being refined as people gain experience using it. Work is progressing to fix bugs found in SystemTap and to make the instrumentation process easier.

Work has begun on writing instrumentation to collect particular pieces of information a user might be interested in. These tapsets will provide useful building blocks for additional instrumentation.

SystemTap makes heavy use of the debug information produced when executables are built. There is a need to improve the quality of the debug information generated by the GCC compiler to provide better mappings between the executable binary and the source code. As you may have noticed when installing the kernel-debuginfo RPM, it is rather large. Work is being pursued to factor out the redundant information in the debug information to produce smaller debuginfo RPMs.

About the author

William Cohen is a performance tools engineer at Red Hat, Inc. Will received his BS in electrical engineering from the University of Kansas. He earned an MSEE and a PhD from Purdue University. In his spare time he bicycles and takes pictures with his digital cameras.