Red Hat, Inc.'s Network Console and Crash Dump Facility

Michael K. Johnson

johnsonm@redhat.com

Abstract

In Red Hat Linux Advanced Server 2.1, Red Hat, Inc. provides its first crash dump facility. Unlike traditional crash dump facilities, this facility dumps memory images to a centralized server via the network. This paper summarizes the reasons for this unusual choice, explains how to set up Red Hat, Inc.'s network crash dump facility (netdump) from both the client and server standpoints, provides some information on implementation considerations such as security and performance, and mentions some potential future directions. It also explains how to use network console output logging, a related feature introduced at the same time.


Table of Contents
Netdump Rationale
How do I set this up?
Implementation considerations
What might the future hold?

Netdump Rationale

Why crash dumps at all?

The goal of a crash dump facility is to support fault analysis, and in particular exhaustive first fault analysis, when software or hardware bugs manifest as system crashes (in Linux parlance, an Oops, BUG(), or panic). First fault analysis means correcting a bug without having to reproduce it. Linux has traditionally provided an abbreviated crash signature that includes the processor state (on the processor that registered the crash), a stack trace, and a limited instruction trace. The utility of these signatures has been proven over the years; they nearly always provide all the information required to debug a fault, even at first fault.

The network console functionality provides the ability to log all kernel messages, including Linux crash signature messages, to a network syslog server. This has very low system requirements; it merely requires a simple syslog server (any Linux system can serve as a syslog server) that allows incoming network logging. This allows first fault analysis of the majority of crashes.
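To make the server side of this concrete, the following is a minimal sketch of a receiver for network-logged kernel messages. It assumes (this is not stated above) that messages arrive as classic BSD-syslog UDP datagrams with an optional `<PRI>` prefix, where the priority value encodes facility and severity. The names `parse_syslog_datagram` and `serve`, and the fallback of user.notice for messages without a prefix, are choices made for this illustration; a production system would use a real syslog daemon instead.

```python
import socket

# Severity names in numeric order (0 = most severe), as used by BSD syslog.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def parse_syslog_datagram(data: bytes):
    """Split one datagram into (facility, severity_name, message).

    Assumption: the payload may start with "<PRI>", where
    PRI = facility * 8 + severity. Facility 0 is the kernel.
    """
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    # Fallback chosen for this sketch when no prefix is present: user.notice.
    facility, severity, message = 1, 5, text
    if text.startswith("<"):
        end = text.find(">")
        digits = text[1:end] if end > 1 else ""
        if digits and end <= 4 and all(c in "0123456789" for c in digits):
            pri = int(digits)
            facility, severity = pri >> 3, pri & 7
            message = text[end + 1:]
    return facility, SEVERITIES[severity], message


def serve(host="0.0.0.0", port=514):
    """Print every received message; binding to port 514 needs privileges."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, (addr, _port) = sock.recvfrom(4096)
        facility, severity, message = parse_syslog_datagram(data)
        print(f"{addr} facility={facility} {severity}: {message}")
```

Calling `serve()` on the logging host would print each incoming message along with the address of the machine that sent it, which is enough to capture a crash signature from a machine that can no longer write to its own disk.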

However, some crashes involve more subtle problems. Some of these problems are not easy, or even possible, to fix after seeing a single Linux crash signature message. Successful first fault analysis of these kinds of problems is sometimes enabled by the ability to look at a memory dump of the kernel image. It is no guarantee, but in certain kinds of cases it significantly increases the odds of successful first fault analysis.

Obviously, we would rather that crashes never happened in the first place, and we work hard to make them unlikely. However, we are honest; in reality, no software is perfect, and neither is hardware. (A common cause of crashes, perhaps the common cause of crashes, is faulty hardware, and sometimes the hardware fault leaves a recognizable footprint in the signature or memory image.) So the next best thing to perfection is to limit both the downtime and potential damage of a crash.

Why dump to network instead of disk?

The classical UNIX crash dump facility dumps the memory image to the swap partition, and then provides a facility to analyze it before reusing the swap partition for swap. This is what many experienced UNIX system administrators are used to and expect.

Classical UNIX operating systems provide support for a very limited set of hardware devices. Company X sells both the hardware and the operating system, and thus can operate within a very constrained set of device drivers, and support only those device drivers. This level of control over the hardware makes it relatively easy for the responsible vendors (that is, those who exhibit care for users' data) to implement special miniature non-interrupt-driven device drivers that run in write-protected memory to dump data to the swap partition. This highly restricted case can be audited and protected.

Linux is not internally designed like a classical UNIX operating system here. Major differences include:

So what? Two main problems come up: failure to dump a memory image, and overwriting parts of file systems because the crash has damaged some data structures or code being used to do the dump. Do not laugh; the latter happens in real life. Drivers, the SCSI layer, and other intermediate data structures or code are as common a place as any for bugs that cause a crash. A simple failure to dump the memory image is the more common of the two, and can be caused by a myriad of problems, including failures in interrupt handling (for example, interrupts being disabled at the time of the crash, a common problem), locks taken and not released, and data structures that are inconsistent at the time of the crash, causing the system to wait forever.

By contrast, network devices are simple and are easy to modify to support a non-interrupt-driven polled mode. Even a bug in a network device driver is unlikely to disable the crash dump over the network, because the code path used for network crash dump is highly restricted. The entire network stack can crash and network crash dump can still work, because the network crash dump code implements a separate, small, but standards-compliant subset of the UDP protocol sufficient to perform the crash dump. Interrupts can be disabled and arbitrary locks can be held indefinitely, and the network crash dump will still function perfectly.
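To illustrate how little code a send-only subset of UDP requires, here is a sketch that builds a complete UDP datagram by hand, including the checksum over the IPv4 pseudo-header. It is written in Python purely for illustration and is not the netdump code, which lives in the kernel; the function names and the example addresses and ports are hypothetical.

```python
import socket
import struct

UDP_PROTOCOL = 17  # IP protocol number for UDP


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum, padding odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """IPv4 pseudo-header that the UDP checksum also covers."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, UDP_PROTOCOL, udp_length))


def build_udp_datagram(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    """Return the 8-byte UDP header followed by the payload."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    covered = pseudo_header(src_ip, dst_ip, length) + header + payload
    checksum = ~ones_complement_sum(covered) & 0xFFFF
    if checksum == 0:
        checksum = 0xFFFF  # zero on the wire means "no checksum computed"
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload
```

A receiver validates such a datagram by summing the pseudo-header and the whole datagram, checksum included, and checking that the result is 0xFFFF. No socket state, routing table, or interrupt handling is involved in producing the bytes, which is why a path like this can keep working when the regular network stack cannot.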

Dumping to the network is much more manageable than dumping to disk. The dumps are preserved for comparative analysis, something that is not easy to do with disk-based crash dumps. It is possible (though not yet implemented) to compare the crash signature to existing signatures and choose not to store some dumps. It is trivially possible to do initial programmatic analysis and provide that analysis when notifying administrators of the crash. Space for storing crash dumps becomes a centrally managed resource, rather than an afterthought. After completing the dump (or rebooting without dumping if the network crash dump storage server requests it after analyzing the signature or checking disk space or doing any other tests), the crashed computer can reboot and be immediately available for normal use, without having to go through a manual, interactive stage of crash dump retrieval and storage.
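The server-side decision described above (store the dump, or tell the client to reboot without dumping) could be sketched as follows. Signature comparison is described above as not yet implemented, so this is only a hypothetical policy: the function names, the choice of SHA-256, and the whitespace-only normalization are all assumptions made for illustration. Real crash signatures contain addresses and register values that vary between otherwise identical crashes, so a practical comparison would need a smarter normalization step.

```python
import hashlib
import re


def signature_key(signature: str) -> str:
    """Hash a crash signature after collapsing whitespace differences."""
    normalized = re.sub(r"\s+", " ", signature).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def should_store_dump(signature, seen_keys, dump_bytes, free_bytes, reserve_bytes=0):
    """Return (store, reason) for one incoming crash.

    seen_keys is a set of keys for signatures already stored; it is
    updated only when the dump is accepted.
    """
    key = signature_key(signature)
    if key in seen_keys:
        return False, "duplicate signature"
    if dump_bytes + reserve_bytes > free_bytes:
        return False, "insufficient disk space"
    seen_keys.add(key)
    return True, "new signature"
```

Passing the free space in as a parameter keeps the policy separate from how the server measures it; on a real server that value could come from `shutil.disk_usage()` on the dump directory.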

Dumping to the network is more reliable. The dump-to-disk path has many more potential points of failure, in hardware as well as software, than the dump-to-network path. For the same reason, successful first fault analysis is also much more likely when the underlying fault turns out to be in hardware rather than software.

Why not save the memory image over a reboot?

There are patches available for Linux, and implementations for some versions of the UNIX operating system, which reboot without clearing memory, allow the memory image to be saved to disk by a known non-corrupted kernel image (as far as that can be known, of course), and then reboot again after the image has been saved. There are several problems with this approach, particularly on the PC hardware platform.

First, there is no standard way to preserve memory in PCs on a warm reboot. Most PC platforms clear memory on warm reboot, and while a few chipsets provide the ability to avoid this, it is not standardized, not generally available, sometimes changes even between revisions of the same firmware, and thus is not a reasonable solution for a general-purpose operating system.

There is a much bigger problem, however. Fundamentally, significant amounts of memory must be reserved for this before the crash, or large amounts of memory must be located afterward, either to hold the crashed kernel image before reboot or to hold the new, trusted kernel image during the reboot process. (On PC systems, we do not have the luxury of having firmware that knows how to load a special crash dump kernel into reserved memory, so we do not even have the option of reserving a moderate amount of memory for loading a special kernel into after a fault.)

This memory is usually harvested from memory already in use for other things, such as user-space pages, page cache, buffer cache, and so forth, which limits the amount of information available in the dump, reducing the number of clues available as to the cause of the crash and thus reducing the likelihood of successful first fault analysis. In addition, there is no general guarantee that sufficient memory will be available in the sacrificed locations, so there is a distinct possibility that the dump process will fail from simple lack of available memory. Imagine, for example, that the kernel crash was caused by a memory leak that consumed the available memory.

By dumping memory before rebooting, we are able to reserve only the memory needed by the crash dump code itself (a few pages of memory), and also have the possibility to later extend the information provided in the dump to include hardware state and other information that might not be preserved across a warm reboot.