It's a fact of life. Systems, software, and services fail. Keeping users happy and the pager quiet is always at the front of every sysadmin's mind. Therefore, knowing how to handle service failure quickly, efficiently, and (ideally) automatically is a hallmark of a capable (and well-rested) sysadmin. This article walks you through a few ways systemd can help you mitigate failure in your services.

Restart failed units

Systemd makes it very easy to restart a unit when it fails. Sometimes, this is all you really need. I have worked with buggy software that occasionally encounters an irrecoverable error, crashes, and must be restarted. Ideally, you would be able to fix the underlying software problem, but that isn't always within your control.

The following service unit will restart a service if it fails. Restart=on-failure restarts the service after any unclean exit: a non-zero exit code, termination by an unclean signal, a timeout, or a watchdog failure. It does not restart a service that exits cleanly:

[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure

Check out the systemd service documentation for more restart options.

The StartLimitBurst=2 and StartLimitIntervalSec=30 settings tell systemd to allow at most two starts of the service within a 30-second interval. If another start is attempted inside that interval, systemd refuses it, puts the unit in a failed state, and stops trying to restart it. This ensures that if the service is truly broken, systemd won't continuously try to restart it. You should always tune these settings to values that make sense for your workload.

You can reset the failed state and the start rate limit counter with the systemctl reset-failed command.
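
When tuning these values, keep in mind that the restart delay interacts with the limit. The sketch below is a rough back-of-the-envelope check, not systemd's actual logic. It assumes the service fails immediately on every start, so that starts are spaced RestartSec apart (100ms by default). The function name start_limit_reachable is made up for this illustration. Under those assumptions, the first refused start would come roughly StartLimitBurst times RestartSec after the first start, so the limit can only trip if that delay fits inside StartLimitIntervalSec:

```shell
#!/bin/bash
# Hypothetical helper: rough check of whether the start limit can ever trip.
# Assumes the service fails instantly on every start, so starts are spaced
# RestartSec apart. Arguments: RestartSec (ms), StartLimitBurst,
# StartLimitIntervalSec (ms).
start_limit_reachable() {
    local restart_ms=$1 burst=$2 interval_ms=$3
    # The first refused start would come burst * RestartSec after the first one.
    [ $(( restart_ms * burst )) -le "$interval_ms" ]
}

# Values from the unit above: default RestartSec=100ms, burst 2, interval 30s.
if start_limit_reachable 100 2 30000; then
    echo "start limit can be reached"
else
    echo "start limit is never reached; systemd would keep restarting"
fi
```

With the values from the unit above, the check passes. If you set RestartSec=20 while keeping the same burst and interval, the third start would land about 40 seconds after the first, outside the 30-second interval, and under these assumptions the unit would never reach the failed state.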


Take action on failure

Restarting a service is great, but taking specific actions when a unit fails is even better. Maybe you're using software with a known bug that requires a cache file to be deleted when it crashes, or perhaps you want to initiate a script that collects logs and system information so that the problem can be diagnosed. Systemd allows you to specify units that run when a service fails.

This example specifies OnFailure=my-app-recovery.service to tell systemd that if my service fails, it should start the my-app-recovery unit:

[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2
OnFailure=my-app-recovery.service

[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure

The my-app-recovery unit is just a one-shot service unit that runs a recovery script:

[Unit]
Description=My App Recovery

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/my-app-recovery.sh

This script could do anything: perform a workaround to get the service running again, fire off an alert to a monitoring system, or zip up some temporary logs and application state for troubleshooting. In this case, it just writes a message to a temp file and restarts the service:

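#!/bin/bash

# Record that a recovery attempt was made
echo 'Attempting to recover!' > /tmp/recovery_info
# Clear the failed state and start rate limit counter, then restart the service
systemctl reset-failed my-app
systemctl restart my-app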
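
If you would rather have the recovery script gather logs for troubleshooting, a minimal sketch might look like the following. The collect_diagnostics function name, the 200-line limit, and the /tmp/my-app-diagnostics directory are assumptions chosen for this example, not part of the original setup:

```shell
#!/bin/bash
# Hypothetical helper: save recent journal entries for a unit into a directory
# and print the path of the file that was written.
collect_diagnostics() {
    local unit=$1 outdir=$2
    local outfile="$outdir/${unit}-$(date +%Y%m%d-%H%M%S).log"
    mkdir -p "$outdir" || return 1
    if command -v journalctl >/dev/null 2>&1; then
        # Last 200 journal lines for the unit; tolerate an unreadable journal.
        journalctl -u "$unit" -n 200 --no-pager > "$outfile" 2>&1 || true
    else
        echo "journalctl not available on this host" > "$outfile"
    fi
    echo "$outfile"
}

# Example call; the output directory is an assumption, pick one that suits you.
collect_diagnostics my-app /tmp/my-app-diagnostics
```

Calling this before systemctl restart in the recovery script means the journal entries from the crash are captured before the next run adds more output.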

When my-app enters a failed state, its logs clearly show that the OnFailure dependencies have been triggered:

Aug 30 03:04:30 server01 systemd[1]: my-app.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 03:04:30 server01 systemd[1]: my-app.service: Scheduled restart job, restart counter is at 1.
Aug 30 03:04:30 server01 systemd[1]: Stopped My App.
Aug 30 03:04:30 server01 systemd[1]: Started My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Scheduled restart job, restart counter is at 2.
Aug 30 03:04:32 server01 systemd[1]: Stopped My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Start request repeated too quickly.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Failed with result 'exit-code'.
Aug 30 03:04:32 server01 systemd[1]: Failed to start My App.
Aug 30 03:04:32 server01 systemd[1]: my-app.service: Triggering OnFailure= dependencies.

Be careful with restarting services within an OnFailure script. You don't want to have a scenario where your script is so good at restarting the service that you never know there's a problem. It's wise to send a signal to your alerting system from the script so that you know whenever the service hits a failure condition.
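
One way to keep a recovery script from hiding a persistent problem is to cap how many times it may restart the service and raise an alert once the cap is reached. The following is only a sketch under assumptions: the function names, the /run/my-app-recovery.count state file, the cap of three attempts, and the use of logger as the alerting hook are all illustrative choices, not something the original script does:

```shell
#!/bin/bash
# Hypothetical helper: allow at most max_attempts recoveries, tracked in a file.
# Returns 0 (and records the attempt) while under the cap, 1 once it is reached.
recovery_allowed() {
    local state_file=$1 max_attempts=$2
    local count=0
    if [ -f "$state_file" ]; then
        count=$(cat "$state_file")
    fi
    if [ "$count" -ge "$max_attempts" ]; then
        return 1
    fi
    echo $(( count + 1 )) > "$state_file"
}

# Hypothetical wrapper for the recovery script: restart while under the cap,
# otherwise log a critical message that an alerting system can pick up.
recover_or_alert() {
    if recovery_allowed /run/my-app-recovery.count 3; then
        systemctl reset-failed my-app
        systemctl restart my-app
    else
        logger -p daemon.crit -t my-app-recovery \
            "recovery limit reached, not restarting my-app"
    fi
}
```

The recovery script would call recover_or_alert in place of its two systemctl lines. Because the state file lives under /run, the count starts over after a reboot. You would also need to delete the file once the service is confirmed healthy, or the cap will eventually block legitimate recoveries.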

Have you tried turning it off and on again?

Every sysadmin knows the value of a good restart for fixing a strange problem, and you might be tempted to just throw a reboot in your OnFailure script. Thankfully, systemd includes built-in functionality to trigger system restarts on unit failures. In this example, the system will gracefully reboot when the unit fails:

[Unit]
Description=My App
StartLimitIntervalSec=30
StartLimitBurst=2
FailureAction=reboot

[Service]
ExecStart=/usr/local/sbin/my-app.sh
Restart=on-failure

There are several valid values for FailureAction, including none (the default), reboot-force, and poweroff, so be sure to review the systemd unit documentation for a complete understanding of its capabilities.


Automated recovery

Keeping services running smoothly is the goal of any dedicated sysadmin, but automatically handling failure scenarios differentiates the rookies from the seasoned veterans. Systemd includes powerful features for automating your responses to keep services running. In this article, you learned about a few simple systemd features that will help you keep your systems in good working order.


About the author

Anthony Critelli is a Linux systems engineer with interests in automation, containerization, tracing, and performance. He started his professional career as a network engineer and eventually made the switch to the Linux systems side of IT. He holds a B.S. and an M.S. from the Rochester Institute of Technology.
