Issue #11 September 2005
Instrumenting the Linux Kernel with SystemTap
by William Cohen
- Introduction
- Setting up SystemTap
- Simple SystemTap script example
- A slightly less simple SystemTap script example
- SystemTap implementation
- Future work
- Further reading
- About the author
Introduction
The goal of SystemTap is to provide infrastructure that simplifies gathering information about the running Linux kernel so that it can be further analyzed. This can assist in identifying the underlying cause of a performance or functional problem. SystemTap is designed to eliminate the need for the developer to go through the tedious instrument, recompile, install, and reboot sequence required to collect data on the operation of the kernel. The recent addition of Kprobes to the Linux kernel provides the needed support but does not provide an easy-to-use infrastructure. SystemTap provides a simple command line interface and scripting language for writing kernel instrumentation.
SystemTap is still under development and evolving; do not use it on production systems, and expect that things will change. However, SystemTap in its current state can still be a useful tool for developers. This article describes how to install SystemTap on a Fedora™ Core 4 machine, presents some example SystemTap instrumentation scripts, and explains how SystemTap is implemented.
Setting up SystemTap
The examples for this article were generated on a typical ThinkPad® T41
laptop with an Intel® Pentium® M processor. Table 1, “Hardware and software configuration of machine”
details the hardware and
software configuration used for the examples. The vast majority of the
software is the stock Fedora Core 4 software, simply updated via yum on
September 13, 2005. Three additional RPMs are required to
run SystemTap: the kernel-debuginfo RPM, which provides the information
about the locations of variables and functions in the kernel; the
kernel-devel RPM, which SystemTap requires to build the kernel modules; and
the systemtap RPM itself.
Check for the version of the kernel running by using the uname
-r command. If the
needed kernel-devel RPM is not installed, it can be
installed via the following command by root:
yum install kernel-devel
The debuginfo RPMs are not installed by default due to size and the rare
need for them. However, the matching kernel-debuginfo
RPM for Fedora Core 4 can be installed in a manner similar to
kernel-devel using:
yum install kernel-debuginfo
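Because both packages have to match the running kernel, it can help to spell out the exact package names before installing anything. The following is a minimal sketch and not part of the original setup: the helper name required_pkgs is hypothetical, and it assumes the naming convention visible in Table 1 (the package name, a hyphen, then the kernel release string reported by uname -r).

```shell
#!/bin/sh
# Hypothetical helper: print the package names that must match a kernel release.
# Assumes the convention <package>-<kernel release>, as in Table 1.
required_pkgs() {
    release="$1"
    for pkg in kernel-devel kernel-debuginfo; do
        printf '%s-%s\n' "$pkg" "$release"
    done
}

# Example with the kernel release used in this article.
# On a live system, pass "$(uname -r)" instead.
required_pkgs "2.6.12-1.1447_FC4"
```

Each printed name can then be passed to rpm -q to check whether that exact version is already installed.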
For i386, the updated Fedora Core 4 kernel, kernel-devel, and
kernel-debuginfo RPMs can
be obtained from the following URLs (or one of the Fedora mirror sites):
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/4/i386/
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/4/i386/debug/
For i386, a current version of the SystemTap RPM can be obtained from the Fedora Core 4 testing directory:
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/4/i386
The SystemTap RPM is installed like any other RPM. This RPM provides the
SystemTap translator (stap) and support runtime
libraries.
Table 1. Hardware and software configuration of machine
| Hardware | ThinkPad T41; Intel Pentium M 1.6GHz; 1024 KB cache; 512MB DRAM; 40GB hard disk |
| Software | Fedora Core 4 (updated via yum 20050913); kernel-2.6.12-1.1447_FC4; kernel-devel-2.6.12-1.1447_FC4; kernel-debuginfo-2.6.12-1.1447_FC4; systemtap-0.4-1 |
After the SystemTap RPM is installed, you should be able to run
stap, the command that compiles an instrumentation
script, installs it, and starts data collection.
To verify that the command is available and to determine which version of
SystemTap is installed, execute the command stap
-V, which produces output similar to:
SystemTap translator/driver (version 0.4 built 2005-09-07)
Copyright (C) 2005 Red Hat, Inc.
This is free software; see the source for copying conditions.
Due to the nature of SystemTap you will either need to run the commands as root
or have an entry in /etc/sudoers for your login to have the
required privileges to install the instrumentation in the kernel.
Simple SystemTap script example
Refer to Example 1, “A script to count the number of times generic_make_request is called”.
This is equivalent to the kprobebio.c
example in the Red Hat Magazine March 2005 article,
Gain insight
into the Linux kernel with Kprobes. Due to
SystemTap's design, this example is much smaller than the equivalent raw
kprobe example in that article.
There are a few basic types available in SystemTap scripts, such as
integers and strings. The type in use is inferred from the context; there is
no explicit typing. Thus, in Example 1, “A script to count the number of times generic_make_request is called” the variable
count_generic_make_request has no declared type. The global keyword
indicates that it can be used by any of the probes and functions in the script.
The actual probes in SystemTap start with the keyword probe, followed by
a description of where to place the probe and the script body to run when
the probe executes. In Example 1, “A script to count the number of times generic_make_request is called” there are three
probes. The first probe instruments the actual kernel function
generic_make_request. The probe begin executes
before any other probe in the script executes and is typically used for
initialization. In this case it prints a message so you know when the
instrumentation starts. The probe end executes when the instrumentation script is
being shut down and usually handles output of the collected information, as
in this example.
Example 1. A script to count the number of times generic_make_request is called
/* kprobebio.stp
   This is a simple module to get information about block I/O operations.
   Will Cohen
*/
global count_generic_make_request

probe kernel.function("generic_make_request")
{
  ++count_generic_make_request;
}

probe begin { log("starting probe") }

probe end
{
  log("ending probe")
  log("generic_make_request() called "
      . string(count_generic_make_request)
      . " times.");
}
Running the script is relatively simple. Run the stap
command either as root or as a user whose entry in /etc/sudoers
grants root privileges. The console output of stap
kprobebio.stp is as follows:
starting probe
ending probe
generic_make_request() called 24 times.
The stap command compiles the script into a loadable
kernel module and loads the module. When the module is loaded, the
starting probe message is printed, which
indicates that data is being collected. The user presses Control-C
to conclude the data collection. SystemTap then removes the probes and prints
out the number of times that the function
generic_make_request() was called.
Note: The first time that you install instrumentation with stap, you may see messages about relayfs.ko not being valid. SystemTap uses relayfs to move data from kernel to user space. If no relayfs module exists, SystemTap builds a relayfs module for the kernel. You should not see that message on subsequent runs.
A slightly less simple SystemTap script example
Assume you would like to get a better idea of the number of read and write
system calls performed on the system and the amount of data read and
written on the system. Refer to Example 2, “A script to accumulate the amount of data read and written by the read and write system calls”.
The beginning of the listing declares the global
variables that store the state information. Two probes collect
the data: one for the reads and one for the writes. Each time the
sys_read or sys_write function is
called, the associated count is incremented. Another thing to note about
this script is that the function argument count is accessed
(as $count) to track the
number of bytes that the operation reads or writes. The
probe end writes out the data.
Example 2. A script to accumulate the amount of data read and written by the read and write system calls
/* rwblock.stp
   This is a simple script to count the number of read and write operations
   and accumulate the number of bytes for each.
   Will Cohen
*/
global count_sys_read
global count_sys_write
global sys_read_bytes
global sys_write_bytes

probe kernel.function("sys_read")
{
  ++count_sys_read;
  sys_read_bytes += $count;
}

probe kernel.function("sys_write")
{
  ++count_sys_write;
  sys_write_bytes += $count;
}

probe begin { log("starting probe") }

probe end
{
  log("ending probe")
  log(string(sys_read_bytes) . " bytes read with "
      . string(count_sys_read) . " calls to sys_read(), avg size of read "
      . string(sys_read_bytes/count_sys_read) );
  log(string(sys_write_bytes) . " bytes written with "
      . string(count_sys_write) . " calls to sys_write(), avg size of write "
      . string(sys_write_bytes/count_sys_write) );
}
The following is example output when the system is fairly idle and the probe is
run for approximately a minute with the command stap
rwblock.stp.
starting probe
ending probe
1748216 bytes read with 662 calls to sys_read(), avg size of read 2640
124725 bytes written with 483 calls to sys_write(), avg size of write 258
There are two things that are obvious from this output: there is an order of magnitude more data being read than written on the system, and the average size of a read is much larger than a write (2640 bytes versus 258 bytes).
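The figures in this output can be checked by hand. The sketch below redoes the arithmetic in the shell, whose $(( )) arithmetic is integer-only and so truncates the averages the same way the reported values are truncated. The variable names are illustrative only; the numbers are copied from the output above.

```shell
#!/bin/sh
# Totals and call counts copied from the example output.
read_bytes=1748216
read_calls=662
write_bytes=124725
write_calls=483

# Integer division: any fractional part is discarded.
avg_read=$((read_bytes / read_calls))
avg_write=$((write_bytes / write_calls))

# Roughly how many times more bytes were read than written.
ratio=$((read_bytes / write_bytes))

echo "avg size of read:  $avg_read"
echo "avg size of write: $avg_write"
echo "read/write byte ratio: about $ratio"
```

The ratio of about 14 is what the "order of magnitude" remark refers to. Note also that the end probe in Example 2 divides by the call counts, so a run in which no reads or no writes were observed would present it with a zero divisor.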
SystemTap implementation
SystemTap takes a compiler-oriented approach to generating
instrumentation. Refer to Figure 1, “Flow of data in SystemTap” for an overall diagram of
SystemTap used in this discussion. In the upper right-hand
corner of the diagram is probe.stp, the probe
script the developer has written. The translator parses this script into
parse trees, checking the input for syntax errors as it does so. The
translator then performs elaboration, pulling in additional code from the
script library and determining the locations of probe points and variables from
the debug information. After the elaboration is complete, the translator
generates probe.c, the kernel module in C.
The probe.c file is compiled into a regular kernel
module, probe.ko, using the GCC compiler. The
compilation may pull in support code from the runtime libraries. After
GCC has generated the probe.ko, the SystemTap daemon
is started to collect the output of the instrumentation module. The
instrumentation module is loaded into the kernel, and data collection is
started. Data from the instrumentation module is transferred to user-space
via relayfs and displayed by the daemon. When the user hits Control-C the
daemon unloads the module, which also shuts down the data collection
process.
Future work
SystemTap is a relatively new tool and, as a result, is evolving rapidly. The syntax and operation are being refined as people gain experience using SystemTap. Work is progressing to fix bugs found in SystemTap and to enhance it to make the instrumentation process easier.
Work has begun on writing instrumentation to collect particular pieces of information a user might be interested in. These tapsets will provide useful building blocks for additional instrumentation.
SystemTap makes heavy use of the debug information generated during the
generation of executables. There is a need to improve the quality of the
debug information generated by the GCC compiler to provide better mappings
between the executable binary and the source code. As you may have noticed
when installing the kernel-debuginfo RPM, the kernel
debuginfo RPM is rather large. Work is being pursued to factor out the
redundant information in the debug information to produce smaller
debuginfo RPMs.
Further reading
- Gaining insight into the Linux kernel with Kprobes by William Cohen, March 2005.
- SystemTap website




