
Set up self-healing services with systemd

Quiet your pager and keep users happy by using automation to recover from common IT service failures.

It's a fact of life. Systems, software, and services fail. Keeping users happy and the pager quiet is always at the front of every sysadmin's mind. Therefore, knowing how to handle service failure quickly, efficiently, and (ideally) automatically is a hallmark of a capable (and well-rested) sysadmin. This article walks you through a few ways systemd can help you mitigate failure in your services.

Restart failed units

Systemd makes it very easy to restart a unit when it fails. Sometimes, this is all you really need. I have worked with buggy software that occasionally encounters an irrecoverable error, crashes, and must be restarted. Ideally, you would be able to fix the underlying software problem, but that isn't always within your control.

The following unit restarts the service if it fails. Restart=on-failure restarts the service after an unclean exit code, an unclean signal, a timeout, or a watchdog timeout, but not after a clean exit:

[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure

Check out the systemd service documentation for more restart options.

The StartLimitBurst=2 and StartLimitIntervalSec=30 settings tell systemd to allow at most two starts of the service within a 30-second interval. If another start is attempted within that window, systemd refuses it, puts the unit in a failed state, and stops trying to restart it. This ensures that if the service is truly broken, systemd won't continuously try to restart it. Tune these settings to values that make sense for your workload.

You can clear the failed state and reset the start rate-limit counter with the systemctl reset-failed command.
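
As a rough sketch of how that might look in practice, the following helper checks whether a unit is in the failed state and resets it only if so. The function name reset_if_failed is hypothetical; systemctl is-failed and systemctl reset-failed are standard systemctl subcommands.

```shell
#!/bin/bash

# Hypothetical helper: clear a unit's failed state (and with it the
# start rate-limit counter) only when the unit is actually failed.
reset_if_failed() {
    local unit="$1"
    if systemctl is-failed --quiet "$unit"; then
        systemctl reset-failed "$unit"
        echo "reset failed state of $unit"
    else
        echo "$unit is not in a failed state"
    fi
}

# Example invocation (needs sufficient privileges):
#   reset_if_failed my-app.service
```

Resetting does not start the service again; follow it with systemctl start if that is what you want.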


Take action on failure

Restarting a service is great, but taking specific actions when a unit fails is even better. Maybe you're using software with a known bug that requires a cache file to be deleted when it crashes, or perhaps you want to initiate a script that collects logs and system information so that the problem can be diagnosed. Systemd allows you to specify units that run when a service fails.

This example specifies OnFailure=my-app-recovery.service to tell systemd that if my-app.service enters a failed state, it should start the my-app-recovery unit:

[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2
OnFailure=my-app-recovery.service

[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure
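
If the service unit ships with a package and you would rather not edit it directly, a drop-in file is one way to attach the OnFailure= setting. The sketch below assumes the usual /etc/systemd/system/<unit>.d/ drop-in location; the function name write_onfailure_dropin and the file name on-failure.conf are made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: write a drop-in that adds OnFailure= to a unit.
#   $1 = base directory (normally /etc/systemd/system)
#   $2 = unit to extend
#   $3 = unit to start when $2 fails
write_onfailure_dropin() {
    local base="$1" unit="$2" handler="$3"
    local dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Unit]\nOnFailure=%s\n' "$handler" > "$dir/on-failure.conf"
}

# Example invocation (as root), followed by a reload so systemd sees it:
#   write_onfailure_dropin /etc/systemd/system my-app.service my-app-recovery.service
#   systemctl daemon-reload
```

Taking the base directory as a parameter lets you try the helper against a scratch directory before pointing it at /etc/systemd/system.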

The my-app-recovery unit is just a one-shot service unit that runs a recovery script:

[Unit]
Description=My App Recovery

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/my-app-recovery.sh

This script could do anything: apply a workaround to get the service running again, fire off an alert to a monitoring system, or archive temporary logs and application state for troubleshooting. In this case, it just writes a message to a temp file, then resets and restarts the service:

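#!/bin/bash

# Leave a marker showing that recovery ran.
echo 'Attempting to recover!' > /tmp/recovery_info
# Clear the failed state and the start rate-limit counter, then restart.
systemctl reset-failed my-app
systemctl restart my-app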
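
A recovery script that gathers troubleshooting data instead might look something like the sketch below. The function name collect_diagnostics, the output file names, and the 200-line limit are arbitrary choices for illustration; journalctl and systemctl status are used with their standard options.

```shell
#!/bin/bash

# Hypothetical helper: save recent logs and status for a unit.
#   $1 = unit name
#   $2 = directory to write into
collect_diagnostics() {
    local unit="$1" outdir="$2"
    mkdir -p "$outdir" || return 1
    # Last 200 journal lines for the unit.
    journalctl --unit="$unit" --lines=200 --no-pager > "$outdir/journal.log"
    # systemctl status exits non-zero for failed units, so ignore its status.
    systemctl status "$unit" --no-pager > "$outdir/status.txt" || true
}

# Example invocation from a recovery script:
#   collect_diagnostics my-app.service "/var/tmp/my-app-diag-$(date +%Y%m%dT%H%M%S)"
```

Capturing the data before any reset or restart keeps the evidence from the failed run intact.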

When my-app.service enters a failed state, its logs clearly show that the OnFailure dependencies have been triggered:

Aug 30 03:04:30 server01 systemd[1]: my-app.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Scheduled restart job, restart counter is at 1.
Aug 30 03:04:30 server01 systemd[1]: Stopped My App.
Aug 30 03:04:30 server01 systemd[1]: Started My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Scheduled restart job, restart counter is at 2.
Aug 30 03:04:32 server01 systemd[1]: Stopped My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Start request repeated too quickly.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:32 server01 systemd[1]: Failed to start My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Triggering OnFailure= dependencies.

Be careful when restarting services from an OnFailure script. You don't want a scenario where your script is so good at restarting the service that you never learn there's a problem. It's wise to have the script send some kind of signal to your alerting system so that you know a failure occurred.
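
One minimal way to do that, sketched here under the assumption that your alerting system watches syslog or the journal, is to log an error-priority message before attempting the restart. The function name alert_and_restart and the my-app-recovery tag are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: record the failure where alerting can see it,
# then attempt the restart.
alert_and_restart() {
    local unit="$1"
    # Error-priority message tagged my-app-recovery; assumes something
    # downstream alerts on this tag or priority.
    logger -p daemon.err -t my-app-recovery \
        "$unit failed; automatic recovery attempted"
    systemctl reset-failed "$unit"
    systemctl restart "$unit"
}

# Example invocation from a recovery script:
#   alert_and_restart my-app.service
```

Logging first means the alert is still recorded even if the restart itself fails.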

Have you tried turning it off and on again?

Every sysadmin knows the value of a good restart for fixing a strange problem, and you might be tempted to just throw a reboot in your OnFailure script. Thankfully, systemd includes built-in functionality to trigger system restarts on unit failures. In this example, the system will gracefully reboot when the unit fails:

[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2
FailureAction=reboot

[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure

There are several valid values for FailureAction, including poweroff and the more abrupt reboot-force and reboot-immediate, so be sure to review the systemd unit documentation for a complete understanding of its capabilities.
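
Because a reboot discards the running state you would normally inspect, it can help to review the unit's logs from the previous boot afterward. This sketch assumes the journal is persistent (for example, /var/log/journal exists); the function name previous_boot_logs is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: show a unit's log entries from the previous boot.
# Assumes persistent journal storage; with a volatile journal, entries
# from earlier boots are not available.
previous_boot_logs() {
    local unit="$1"
    journalctl --boot=-1 --unit="$unit" --no-pager
}

# Example invocation after the system comes back up:
#   previous_boot_logs my-app.service
```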


Automated recovery

Keeping services running smoothly is the goal of any dedicated sysadmin, but automatically handling failure scenarios differentiates the rookies from the seasoned veterans. Systemd includes powerful features for automating your responses to keep services running. In this article, you learned about a few simple systemd features that will help you keep your systems in good working order.


Anthony Critelli

Anthony Critelli is a Linux systems engineer with interests in automation, containerization, tracing, and performance. He started his professional career as a network engineer and eventually made the switch to the Linux systems side of IT. He holds a B.S. and an M.S.
