It's a fact of life. Systems, software, and services fail. Keeping users happy and the pager quiet is always at the front of every sysadmin's mind. Therefore, knowing how to handle service failure quickly, efficiently, and (ideally) automatically is a hallmark of a capable (and well-rested) sysadmin. This article walks you through a few ways systemd can help you mitigate failure in your services.
Restart failed units
Systemd makes it very easy to restart a unit when it fails. Sometimes, this is all you really need. I have worked with buggy software that occasionally encounters an irrecoverable error, crashes, and must be restarted. Ideally, you would be able to fix the underlying software problem, but that isn't always within your control.
The following service unit restarts the service if it fails. Restart=on-failure covers a broad range of failure scenarios, including unclean exit codes, unclean signals, timeouts, and watchdog expirations:
[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2
[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure
Check out the systemd service documentation for more restart options.
The StartLimitBurst=2 and StartLimitIntervalSec=30 settings tell systemd that if the unit is started more than twice within 30 seconds, it should enter a failed state and stop trying to restart. This ensures that if the service is truly broken, systemd won't continuously try to restart it. You should always tune these settings to values that make sense for your workload.
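One way to tune these values without editing the unit file itself is a drop-in override. The sketch below assumes the unit is named my-app.service. The write_restart_dropin function, the restart-limits.conf file name, and the numbers (five starts per 300 seconds, with a five-second pause between restarts) are illustrative assumptions, not recommendations:

```shell
#!/bin/bash
# Hypothetical helper: writes a drop-in that overrides the restart limits
# for my-app.service. The values are examples only; tune them per workload.
write_restart_dropin() {
    local dropin_dir="$1"   # e.g. /etc/systemd/system/my-app.service.d
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/restart-limits.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
RestartSec=5
EOF
}

# Typical use on a real host (requires root):
#   write_restart_dropin /etc/systemd/system/my-app.service.d
#   systemctl daemon-reload
```

When choosing values, keep in mind that the start limit can only be reached if StartLimitBurst restarts, each delayed by RestartSec, fit inside StartLimitIntervalSec.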
You can reset the failure counter with the systemctl reset-failed command.
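If you script this step, it can help to check the unit's state first. Here is a minimal sketch that uses the systemctl is-failed and systemctl reset-failed subcommands. The reset_if_failed function and the SYSTEMCTL variable are assumptions made for this example:

```shell
#!/bin/bash
# SYSTEMCTL is an assumption of this sketch: it lets the command be swapped
# out (for example, with a stub while testing). It defaults to systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Hypothetical helper: clear the failed state of a unit only if it is failed.
reset_if_failed() {
    local unit="$1"
    if "$SYSTEMCTL" is-failed --quiet "$unit"; then
        "$SYSTEMCTL" reset-failed "$unit" && echo "reset failed state of $unit"
    else
        echo "$unit is not in a failed state"
    fi
}

# Example: reset_if_failed my-app.service
```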
Take action on failure
Restarting a service is great, but taking specific actions when a unit fails is even better. Maybe you're using software with a known bug that requires a cache file to be deleted when it crashes, or perhaps you want to initiate a script that collects logs and system information so that the problem can be diagnosed. Systemd allows you to specify units that run when a service fails.
This example specifies OnFailure=my-app-recovery.service to tell systemd that if my service fails, it should start the my-app-recovery unit:
[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2
OnFailure=my-app-recovery.service
[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure
The my-app-recovery unit is just a one-shot service unit that runs a recovery script:
[Unit]
Description=My App Recovery
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/my-app-recovery.sh
This script could do anything: Perform some manual workaround to get the service running again, fire off an alert to a monitoring system, or zip up some temporary logs and application state for troubleshooting. In this case, it's just writing a message to a temp file and restarting the service:
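#!/bin/bash
# Leave a marker showing that recovery was attempted.
echo 'Attempting to recover!' > /tmp/recovery_info
# Clear the failed state and start-limit counter, then start the service again.
systemctl reset-failed my-app
systemctl restart my-app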
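A recovery script could also bundle logs and application state for troubleshooting before it restarts anything. The following is a minimal sketch of that idea. The collect_diagnostics function and both example paths are assumptions, since this article doesn't say where my-app keeps its files:

```shell
#!/bin/bash
# Hypothetical helper: bundle a directory of application logs or state into
# a timestamped archive and print the archive's path.
collect_diagnostics() {
    local src_dir="$1"    # e.g. /var/log/my-app (assumed path)
    local out_dir="$2"    # e.g. /var/tmp/my-app-diagnostics (assumed path)
    local stamp archive
    stamp="$(date +%Y%m%dT%H%M%S)"
    archive="$out_dir/my-app-$stamp.tar.gz"

    mkdir -p "$out_dir" || return 1
    # -C keeps paths inside the archive relative to src_dir.
    tar -czf "$archive" -C "$src_dir" . || return 1
    echo "$archive"
}

# Example, run before restarting the service:
#   collect_diagnostics /var/log/my-app /var/tmp/my-app-diagnostics
```

Keep the output directory outside the source directory so the archive doesn't try to include itself.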
When my-app.service enters a failed state, its logs clearly show that the OnFailure dependencies have been triggered:
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Scheduled restart job, restart counter is at 1.
Aug 30 03:04:30 server01 systemd[1]: Stopped My App.
Aug 30 03:04:30 server01 systemd[1]: Started My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Scheduled restart job, restart counter is at 2.
Aug 30 03:04:32 server01 systemd[1]: Stopped My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Start request repeated too quickly.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:32 server01 systemd[1]: Failed to start My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Triggering OnFailure= dependencies.
Be careful with restarting services within an OnFailure script. You don't want to have a scenario where your script is so good at restarting the service that you never know there's a problem. It's wise to send some kind of signal to your alerting system each time the script runs so that you know the service has hit a failure condition.
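One possible guard is to cap the number of automatic recoveries and log every attempt where your monitoring can see it. This sketch is only an illustration: the record_recovery_attempt function, the state file path, and the limit of three attempts are all assumptions:

```shell
#!/bin/bash
# Hypothetical helper: count recovery attempts in a state file and report
# whether another automatic restart should be attempted.
# Returns 0 while the count is within max_attempts, 1 once it is exceeded.
record_recovery_attempt() {
    local state_file="$1"   # assumed path, e.g. /run/my-app-recovery.count
    local max_attempts="$2"
    local count=0
    [ -r "$state_file" ] && count="$(cat "$state_file")"
    count=$((count + 1))
    echo "$count" > "$state_file"
    [ "$count" -le "$max_attempts" ]
}

# Sketch of use inside my-app-recovery.sh:
#   if record_recovery_attempt /run/my-app-recovery.count 3; then
#       logger -p daemon.warning -t my-app-recovery "restarting my-app"
#       systemctl reset-failed my-app
#       systemctl restart my-app
#   else
#       logger -p daemon.crit -t my-app-recovery "giving up; manual action needed"
#   fi
```

Because /run is normally cleared at boot, a counter stored there starts over after a reboot. An alert rule can then match on the my-app-recovery tag in the system log.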
Have you tried turning it off and on again?
Every sysadmin knows the value of a good restart for fixing a strange problem, and you might be tempted to just throw a reboot in your OnFailure script. Thankfully, systemd includes built-in functionality to trigger system restarts on unit failures. In this example, the system will gracefully reboot when the unit fails:
[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2
FailureAction=reboot
[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure
There are several valid values for FailureAction, so be sure to review the systemd unit documentation for a complete understanding of its capabilities.
Automated recovery
Keeping services running smoothly is the goal of any dedicated sysadmin, but automatically handling failure scenarios differentiates the rookies from the seasoned veterans. Systemd includes powerful features for automating your responses to keep services running. In this article, you learned about a few simple systemd features that will help you keep your systems in good working order.
About the author
Anthony Critelli is a Linux systems engineer with interests in automation, containerization, tracing, and performance. He started his professional career as a network engineer and eventually made the switch to the Linux systems side of IT. He holds a B.S. and an M.S. from the Rochester Institute of Technology.