There is no encryption of the data that is sent over the wire. We therefore recommend that you deploy the network crash dump facility only on networks that you trust with the contents of system memory. Switched networks will help prevent casual sniffing, but are not a panacea.
It would not be impossible to add encryption at a future date, but it would dramatically increase the footprint of the code needed. One of the benefits of network crash dump is its small footprint; a bad pointer causing random code to be overwritten is statistically less likely to damage code that has a smaller footprint.
The hardware address (macaddr) of the netdump server or first-hop router is stored when the IP address is looked up at the time the module is inserted. This helps prevent spoof attacks, and simplifies the network crash dump client code. If you change the hardware or IP address of the netdump server or first-hop router, you will need to reload the netdump service on the clients. If you expect to do this regularly, you may wish to set up a cron job (perhaps as a script in the /etc/cron.daily/ directory) that reloads the netdump service.
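A minimal sketch of such a cron script follows. It assumes that the client is controlled by an init script named netdump that accepts "restart" through the service command; that command, the RELOAD_CMD variable, and the file name are assumptions for illustration, so adjust them to match your installation.

```shell
#!/bin/bash
# Hypothetical /etc/cron.daily/netdump-reload (the file name is an assumption).
# Reloads the netdump client so that it looks up the IP and hardware address
# of the netdump server or first-hop router again.

# Assumption: "service netdump restart" reloads the client on this system.
# Set RELOAD_CMD in the environment to use a different command.
RELOAD_CMD=${RELOAD_CMD:-service netdump restart}

reload_netdump() {
    # Left unquoted on purpose so the command is split into words.
    $RELOAD_CMD >/dev/null 2>&1
}

# cron mails any output to the administrator, so only report failures.
reload_netdump || echo "netdump reload failed: $RELOAD_CMD" >&2
```

Because the script is silent on success, a daily run produces mail only when the reload fails.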
Another safeguard against spoofing is a 64-bit random number (the cookie) that the client communicates to the server when the netdump facility is initialized; this random number must be present in all commands. This prevents certain spoof attacks and also protects against denial-of-service attacks: commands to send pages of memory are ignored unless they carry the cookie, so a rogue server that does not know the 64-bit random number cannot trigger a dump. This gives a measure of protection against blind denial-of-service attacks on switched networks.
In our tests, we have achieved roughly wire speed (approximately 12 MB/s) on an unloaded 100 Mbit Ethernet network. This translates into approximately 5 to 6 minutes to dump a memory image of a system with 4 GB of memory. That is an upper limit, since we do not dump memory between 4 GB and 64 GB on IA-32 machines with PAE capability. While this may seem like a long time for a system to be down, we avoid the second reboot that several swap space crash dump mechanisms require. With swap space crash dump, the crash dump still has to be put somewhere before the server can resume normal duties. Network crash dump is therefore at least competitive in terms of server downtime, and in many cases it can be significantly faster, because the analysis can be offloaded to the crash dump server.
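The dump time quoted above is simply the memory size divided by the throughput. The following sketch does that arithmetic for other memory sizes or network speeds; the function name dump_seconds is hypothetical, and the estimate assumes the transfer is sustained at the given rate for the whole dump.

```shell
#!/bin/bash
# dump_seconds MEMORY_MB THROUGHPUT_MB_PER_S
# Prints the estimated dump time in whole seconds, rounded up.
dump_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 4 GB (4096 MB) of memory at the measured 12 MB/s:
secs=$(dump_seconds 4096 12)
echo "4 GB at 12 MB/s: ${secs} s ($(( secs / 60 )) min $(( secs % 60 )) s)"
# prints: 4 GB at 12 MB/s: 342 s (5 min 42 s)
```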
While a dump is in progress, interrupts are disabled. That means that the keyboard cannot be used to reboot a server during a dump; to reboot a server you will have to use the reset button, the power switch, or a remote power switch unit.
If you are using a hardware watchdog, it should be configured to give you enough time to perform the dump before rebooting the machine. Normally, hardware watchdogs are not used in situations where crash dumps are desired, because the possibility of rebooting within an "arbitrary" amount of time (arbitrary relative to the crash dump code) conflicts with the goal of first fault analysis. We do not recommend mixing network crash dumps and hardware watchdogs. However, there is no conflict between netdump's syslog facility and hardware watchdogs, and very often, the messages logged are sufficient without the memory image to determine the source of the problem.
One compromise approach to first fault analysis on systems with hardware watchdogs is to use netdump to record everything except the actual memory image; this can be done nearly instantaneously. To do this, in the netdump-crash script, check the IP address ($1 in the script) against the set of machines with hardware watchdogs, and for machines with hardware watchdogs, exit with a return code of 1. This will cause the extra debugging information normally requested after the memory image has been transferred to be transferred without the memory image, giving more debugging information without taking several minutes to dump the contents of memory. Here is an example:
#!/bin/bash
# /var/crash/scripts/netdump-crash
case $1 in
    10.0.14.*)
        # these machines have hardware watchdog cards; do not try
        # to store a memory image
        exit 1
        ;;
esac
exit 0
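If the set of machines with hardware watchdogs changes often, the patterns in the example above could be kept in a separate list rather than in the script itself. The following is a sketch of that variation only; the has_watchdog function, the WATCHDOG_LIST variable, and the list file path are hypothetical names and are not part of netdump.

```shell
#!/bin/bash
# Sketch: look the client IP address up in a list file instead of
# hard-coding patterns in netdump-crash.
# WATCHDOG_LIST is a hypothetical path holding one shell pattern per line
# (for example 10.0.14.*); blank lines and lines starting with # are skipped.
WATCHDOG_LIST=${WATCHDOG_LIST:-/var/crash/scripts/watchdog-hosts}

# has_watchdog IP: succeeds if IP matches any pattern in the list file.
has_watchdog() {
    [ -r "$WATCHDOG_LIST" ] || return 1
    # The extra test handles a last line that has no trailing newline.
    while read -r pattern || [ -n "$pattern" ]; do
        case $pattern in
            ''|'#'*) continue ;;
        esac
        case $1 in
            $pattern) return 0 ;;   # unquoted, so it is matched as a pattern
        esac
    done < "$WATCHDOG_LIST"
    return 1
}

# In netdump-crash, the case statement would then become:
#   if has_watchdog "$1"; then exit 1; fi
#   exit 0
```

A missing or unreadable list file is treated as "no watchdog machines", so the memory image is still requested in that case.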