Copyright © 2002 by Red Hat, Inc.
Abstract
In Red Hat Linux Advanced Server 2.1, Red Hat, Inc. provides its first crash dump facility. Unlike traditional crash dump facilities, this facility dumps memory images to a centralized server via the network. This paper summarizes the reasons for this unusual choice, explains how to set up Red Hat, Inc.'s network crash dump facility (netdump) both from the client and server standpoint, provides some information on implementation considerations such as security and performance, and mentions some potential future directions. It also explains how to use network console output logging, a related feature which has been introduced at the same time.
The goal of a crash dump facility is to provide fault analysis, particularly exhaustive first fault analysis (fixing a bug without first having to reproduce it), in the case of software or hardware bugs that manifest as system crashes (in Linux parlance, Oops, BUG(), or panic). Linux has traditionally provided an abbreviated signature of a crash which includes the processor state (on the processor that registered the crash), a stack trace, and a limited instruction trace. The utility of these signatures has been proved over the years; they nearly always provide all the information that is required to debug a fault, even at first fault.
The network console functionality provides the ability to log all kernel messages, including Linux crash signature messages, to a network syslog server. This has very low system requirements; it merely requires a simple syslog server (any Linux system can serve as a syslog server) that allows incoming network logging. This allows first fault analysis of the majority of crashes.
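To make the "simple syslog server" requirement concrete, here is a minimal sketch of a UDP receiver that appends incoming messages to a file. It is an illustration only, not the syslog daemon shipped with any distribution. It assumes messages arrive as BSD-syslog-style datagrams with a `<PRI>` prefix on the conventional syslog port 514; the names `split_priority` and `serve` are made up for this example.

```python
import socket

# Severity names from the BSD syslog convention (0 is the most severe).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def split_priority(datagram: bytes):
    """Split a '<PRI>text' datagram into (facility, severity, text).

    PRI encodes facility * 8 + severity; kernel messages use facility 0.
    Datagrams without a valid prefix come back with facility and severity None.
    """
    text = datagram.decode("utf-8", errors="replace").rstrip("\n")
    if text.startswith("<"):
        end = text.find(">")
        if end > 1 and text[1:end].isdecimal():
            pri = int(text[1:end])
            return pri // 8, SEVERITIES[pri % 8], text[end + 1:]
    return None, None, text


def serve(log_path: str, host: str = "0.0.0.0", port: int = 514) -> None:
    """Append every received datagram to log_path, tagged with its sender."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))  # ports below 1024 normally require root
        with open(log_path, "a", encoding="utf-8") as log:
            while True:
                data, (sender, _sender_port) = sock.recvfrom(65535)
                facility, severity, text = split_priority(data)
                log.write(f"{sender} facility={facility} severity={severity} {text}\n")
                log.flush()  # a crash signature should reach disk promptly
```

Calling `serve("/var/log/netconsole.log")` (a hypothetical path) would run the receiver until interrupted. In practice an existing syslog daemon configured to accept remote messages fills this role.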
However, some crashes involve more subtle problems. Some of these problems are not easy, or even possible, to fix after seeing a single Linux crash signature message. Successful first fault analysis of these kinds of problems is sometimes enabled by the ability to look at a memory dump of the kernel image. It is no guarantee, but in certain kinds of cases it significantly increases the odds of successful first fault analysis.
Obviously, we would rather that crashes never happened in the first place, and we work hard to make them unlikely. However, we are honest; in reality, no software is perfect, and neither is hardware. (A common cause of crashes, perhaps the most common cause, is faulty hardware, and sometimes the hardware fault leaves a recognizable footprint in the signature or memory image.) So the next best thing to perfection is to limit both the downtime and potential damage of a crash.
The netdump service:
Saves memory images of up to the first 4GB of memory on a server somewhere on the network.
Saves a textual representation of the Oops/panic/BUG message and preceding kernel messages in a file associated with the memory image, and after the memory image has been saved, attempts to append task and memory state information in textual form (the equivalent of
Sends console messages to a syslog server in addition to, or instead of, sending them to a netdump server.
Requires a supported network adapter on the client machine (the machine whose memory image is being saved).
Requires a server with sufficient storage space and network capacity to receive and store the dumps.
Requires some manual setup on both the client and server side (as of this writing).
Requires that packets be able to traverse between the client and server, both for ssh connections and for UDP packets aimed at the netdump port on client and server (default is port 6666).
Currently requires that the server not change IP address (this can be worked around).
The classical UNIX crash dump facility dumps the memory image to the swap partition, and then provides a facility to analyze it before reusing the swap partition for swap. This is what many experienced UNIX system administrators are used to and expect.
Classical UNIX operating systems provide support for a very limited set of hardware devices. Company X sells both the hardware and the operating system, and thus can operate within a very constrained set of device drivers, and support only those device drivers. This level of control over the hardware makes it relatively easy for the responsible vendors (that is, those who exhibit care for users' data) to implement special miniature non-interrupt-driven device drivers that run in write-protected memory to dump data to the swap partition. This highly restricted case can be audited and protected.
Linux is not internally designed like a classical UNIX operating system here. Major differences include:
SCSI controllers (except for a few hardware RAID controllers) use a rich and complex generic SCSI layer that does not deal well with interrupts and timers not functioning. (This is true of some UNIX operating systems, but not all.)
Interrupt routing and mechanisms differ from system to system.
There are a great many different SCSI controllers, several of which are in common use in enterprise-class environments. Most of these controllers were not designed to support falling back to a simple, non-interrupt-driven mode for providing crash dump facility.
The standard PC BIOS firmware lacks facilities common to traditional dedicated-hardware UNIX servers that would make swap space dumping potentially safer.
So what? Two main problems come up: failure to dump a memory image, and overwriting parts of file systems because the crash has damaged some data structures or code being used to do the dump. Do not laugh, the latter happens in real life; drivers, the SCSI layer, and other intermediate data structures or code are as common a place as any for bugs that cause a crash. A simple failure to dump the memory image is the more common of the two, and can be caused by a myriad of problems, including failures in interrupt handling (for example, interrupts being disabled at the time of the crash; a common problem), locks taken and not released, and data structures that are inconsistent at the time of the crash causing the system to wait forever.
By contrast, network devices are simple, are easy to modify to enable a non-interrupt-driven polled mode, and even if there is a bug in a network device driver, it is entirely likely not to disable the crash dump over the network, because the code path used for network crash dump is highly restricted. The entire network stack can crash and network crash dump can still work, because the network crash dump code implements a separate small but standard-compliant subset of the UDP protocol sufficient to perform the crash dump. Interrupts can be disabled, arbitrary locks can be held indefinitely, and the network crash dump will still function perfectly.
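To give a sense of how small a standards-compliant UDP sender can be, the following sketch builds a complete UDP segment by hand, including the checksum over the IPv4 pseudo-header. It is not the netdump kernel code, which this paper does not reproduce; the function names and the addresses in the usage note are invented for illustration.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum used by IP and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_udp_segment(src_ip: str, dst_ip: str, src_port: int,
                      dst_port: int, payload: bytes) -> bytes:
    """Build a UDP header plus payload with a valid checksum."""
    length = 8 + len(payload)  # the UDP header is always 8 bytes
    # Pseudo-header: source, destination, zero byte, protocol 17 (UDP), length.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, length))
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    # A computed checksum of zero is transmitted as all ones.
    checksum = ones_complement_sum(pseudo + header + payload) or 0xFFFF
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload
```

For example, `build_udp_segment("10.0.0.1", "10.0.0.2", 6666, 6666, b"panic")` returns a 13-byte segment. A sender that also prepends fixed IP and Ethernet headers and hands the frame to a polled driver needs no sockets, routing tables, or interrupts, which is the property the paragraph above relies on.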
Dumping to the network is much more manageable than dumping to disk. The dumps are preserved for comparative analysis, something that is not easy to do with disk-based crash dumps. It is possible (though not yet implemented) to compare the crash signature to existing signatures and choose not to store some dumps. It is trivially possible to do initial programmatic analysis and provide that analysis when notifying administrators of the crash. Space for storing crash dumps becomes a centrally managed resource, rather than an afterthought. After completing the dump (or rebooting without dumping if the network crash dump storage server requests it after analyzing the signature or checking disk space or doing any other tests), the crashed computer can reboot and be immediately available for normal use, without having to go through a manual, interactive stage of crash dump retrieval and storage.
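The server-side policy described above might look roughly like the following sketch. Signature comparison is not implemented in netdump as of this writing, so everything here is hypothetical: the function names, the choice to mask hexadecimal addresses before hashing, and the reasons returned are assumptions made only to illustrate the idea.

```python
import hashlib
import re

# Runs of 8 to 16 hex digits, optionally prefixed with 0x, are treated as
# addresses. This is a simplifying assumption, not a rule taken from netdump.
HEX_ADDRESS = re.compile(r"\b(?:0x)?[0-9a-fA-F]{8,16}\b")


def signature_key(oops_text: str) -> str:
    """Hash a crash signature with raw addresses masked out, so that crashes
    differing only in addresses map to the same key."""
    normalized = HEX_ADDRESS.sub("ADDR", oops_text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def should_store_dump(oops_text: str, known_keys: set,
                      free_bytes: int, dump_bytes: int):
    """Return (store, reason) for an incoming crash report."""
    if free_bytes < dump_bytes:
        return False, "insufficient disk space"
    if signature_key(oops_text) in known_keys:
        return False, "signature already on file"
    return True, "new signature"
```

A server could obtain `free_bytes` from `shutil.disk_usage(path).free` and, when the answer is `False`, tell the client to reboot without dumping. Masking every address is deliberately coarse; a real policy would need to decide which fields of the signature are significant.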
Dumping to the network is more reliable. There are many more potential points of not only software, but also hardware failure in the dump-to-disk path than in the dump-to-network path. In the case of discovering hardware failure instead of a software failure, successful first fault analysis is much more likely for this reason as well.
There are patches available to Linux, and implementations for some versions of the UNIX operating system, which reboot without clearing memory, allowing the memory image to be saved to disk by a known non-corrupted kernel image (as known as is possible, of course) and then rebooting again after the image has been saved. There are several problems with this approach, particularly on the PC hardware platform.
First, there is no standard way to preserve memory in PCs on a warm reboot. Most PC platforms clear memory on warm reboot, and while a few chipsets provide the ability to avoid this, it is not standardized, not generally available, sometimes changes even between revisions of the same firmware, and thus is not a reasonable solution for a general-purpose operating system.
There is a much bigger problem, however. Fundamentally, significant amounts of memory must be reserved for this before the crash, or large amounts of memory must be located afterward, either to use to put the crashed kernel image into before reboot or to put the new, trusted kernel image into during the reboot process. (On PC systems, we do not have the luxury of having firmware that knows how to load a special crash dump kernel into reserved memory, so we do not even have the option of reserving a moderate amount of memory for loading a special kernel into after a fault.)
This memory is usually harvested from memory already in use for other things, such as user-space pages, page cache, buffer cache, and so forth, which limits the amount of information available in the dump, reducing the number of clues available as to the cause of the crash and thus reducing the likelihood of successful first fault analysis. In addition, there is no general guarantee that sufficient memory will be available in the sacrificed locations, so there is a distinct possibility that the dump process will fail from simple lack of available memory. Imagine, for example, a kernel crash caused by a memory leak that ate up available memory.
By dumping memory before rebooting, we are able to reserve only the memory needed by the crash dump code itself (a few pages of memory), and also have the possibility to later extend the information provided in the dump to include hardware state and other information that might not be preserved across a warm reboot.
We believe that at this time, network crash dumps are the most general and supportable option for crash dumps on the generic PC platform. This does not mean that it is always the best option for every situation, but that it is the best choice for us to support for the general case now. In some cases, other forms of crash dump may be appropriate, and we may choose to support other forms of crash dump as well at some indefinite time in the future.