In this article, I present a few tricks for handling error conditions. Some strictly fall under error handling (a reactive way to deal with the unexpected), while others are techniques to avoid errors before they happen.
Case study: A simple script that downloads a hardware report from multiple hosts and inserts it into a database.
Say that you have a cron job on each one of your Linux systems, and you have a script to collect the hardware information from each:
#!/bin/bash
# Script to collect the status of lshw output from home servers
# Dependencies:
# * LSHW: http://ezix.org/project/wiki/HardwareLiSter
# * JQ: http://stedolan.github.io/jq/
#
# On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
# 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
# Author: Jose Vicente Nunez
#
declare -a servers=(
dmaf5
)
DATADIR="$HOME/Documents/lshw-dump"
/usr/bin/mkdir -p -v "$DATADIR"
for server in ${servers[*]}; do
echo "Visiting: $server"
/usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
done
wait
for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
/usr/bin/jq '.["product","vendor", "configuration"]' $lshw
done
If everything goes well, you collect your files in parallel. Because you don't have more than ten systems, you can afford to ssh to all of them at the same time and then show the hardware details of each one.
Visiting: dmaf5
lshw-dump.json 100% 54KB 136.9MB/s 00:00
"DMAF5 (Default string)"
"BESSTAR TECH LIMITED"
{
"boot": "normal",
"chassis": "desktop",
"family": "Default string",
"sku": "Default string",
"uuid": "00020003-0004-0005-0006-000700080009"
}
Here are some of the reasons things can go wrong:
- Your report didn’t run because the server was down
- You couldn't create the directory where the files need to be saved
- The tools you need to run the script are missing
- You can't collect the report because your remote machine crashed
- One or more of the reports is corrupt
The current version of the script has a problem: it will run from beginning to end, errors or not:
./collect_data_from_servers.sh
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json 100% 54KB 48.8MB/s 00:00
scp: /var/log/lshw-dump.json: No such file or directory
scp: /var/log/lshw-dump.json: No such file or directory
parse error: Expected separator between values at line 3, column 9
Next, I demonstrate a few ways to make your script more robust and, in some cases, recover from failure.
The nuclear option: Failing hard, failing fast
The proper way to handle errors is to check if the program finished successfully or not, using return codes. It sounds obvious, but return codes (an integer stored in the bash $? variable) sometimes have a broader meaning. The related $! variable holds the PID of the last background command, which you will see in action later. The bash man page tells you:
For the shell’s purposes, a command which exits with a zero exit
status has succeeded. An exit status of zero indicates success.
A non-zero exit status indicates failure. When a command
terminates on a fatal signal N, bash uses the value of 128+N as
the exit status.
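To make those conventions concrete, here is a small, hypothetical demonstration (it is not part of the case-study script):

```shell
#!/bin/bash
# Hypothetical demo of exit status conventions; paths omitted for brevity.
true
echo "true exited with: $?"     # 0 = success

false
echo "false exited with: $?"    # non-zero = failure (1 for false)

# A command killed by a fatal signal N exits with 128+N.
# SIGTERM is 15, so the status below is 128+15=143:
sleep 30 &
kill -TERM $!
wait $!
echo "terminated sleep exited with: $?"
```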
As usual, you should always read the man pages of the programs you're calling to see what the conventions are for each of them. If you've programmed in a language like Java or Python, you're most likely familiar with exceptions, their different meanings, and how not all of them are handled the same way.
If you add set -o errexit to your script, from that point forward it will abort the execution if any command exits with a code != 0. But errexit isn't applied when executing functions inside an if condition, so instead of remembering that exception, I'd rather do explicit error handling.
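The caveat is easy to see in a hypothetical sketch: errexit is suspended while a function runs as an if condition, so this script survives a failure it would otherwise abort on:

```shell
#!/bin/bash
set -o errexit

function flaky {
    false                          # fails, but errexit is suspended here...
    echo "still running after the failure"
}

if flaky; then                     # ...because flaky is tested by the if
    echo "flaky reported success"
fi

echo "script reached the end"      # errexit never fired
```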
Take a look at version two of the script. It's slightly better:
1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
5 # * JQ: http://stedolan.github.io/jq/
6 #
7 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/ )
8 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
9 # Author: Jose Vicente Nunez
10 #
11 set -o errtrace # Enable the err trap, code will get called when an error is detected
12 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
13 declare -a servers=(
14 macmini2
15 mac-pro-1-1
16 dmaf5
17 )
18
19 DATADIR="$HOME/Documents/lshw-dump"
20 if [ ! -d "$DATADIR" ]; then
21 /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
22 fi
23 declare -A server_pid
24 for server in ${servers[*]}; do
25 echo "Visiting: $server"
26 /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
27 server_pid[$server]=$! # Save the PID of the scp of a given server for later
28 done
29 # Iterate through all the servers and:
30 # Wait for the return code of each
31 # Check the exit code from each scp
32 for server in ${!server_pid[*]}; do
33 wait ${server_pid[$server]}
34 test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
35 done
36 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
37 /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
38 done
Here's what changed:
- Lines 11 and 12: I enable the error trace and add a ‘trap’ to tell the user there was an error and there is turbulence ahead. You may want to kill your script here instead; I'll show you later why that may not be the best idea.
- Line 20: If the directory doesn't exist, then try to create it on line 21. If directory creation fails, then exit with an error.
- Line 27: After running each background job, I capture the PID and associate it with the machine (a 1:1 relationship).
- Lines 33-35: I wait for each scp task to finish, get its return code, and if there was an error, abort.
- Line 37: I parse each file with jq; if parsing fails, the ERR trap announces it.
So how does the error handling look now?
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json 100% 54KB 146.1MB/s 00:00
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
scp: /var/log/lshw-dump.json: No such file or directory
As you can see, this version is better at detecting errors but it's very unforgiving. Also, it doesn’t detect all the errors, does it?
When you get stuck and you wish you had an alarm
The code looks better, except that sometimes scp could get stuck on a server (while trying to copy a file) because the server is too busy to respond or is just in a bad state.
Another example is trying to access a directory through NFS, where $HOME is mounted from an NFS server:
/usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt
And you discover hours later that the NFS mount point is stale and your script is stuck.
A timeout is the solution, and GNU timeout comes to the rescue:
/usr/bin/timeout --kill-after 20.0s 10.0s /usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt
Here, timeout asks nicely (TERM signal) for the process to stop 10.0 seconds after it has started. If it's still running 20.0 seconds after that signal, timeout sends a KILL signal (the kill -9 treatment). If in doubt, check which signals are supported on your system (kill -l, for example).
If that isn't clear from my explanation, then look at this small demonstration:
/usr/bin/time /usr/bin/timeout --kill-after=10.0s 20.0s /usr/bin/sleep 60s
real 0m20.003s
user 0m0.000s
sys 0m0.003s
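A related detail worth knowing, although the script doesn't use it: GNU timeout exits with status 124 when the command timed out, so you can tell "too slow" apart from an ordinary failure. A hypothetical check:

```shell
#!/bin/bash
# Hypothetical check of GNU timeout's exit status; 124 means "timed out".
timeout 1.0s sleep 10s
status=$?
if [ "$status" -eq 124 ]; then
    echo "WARNING: the command timed out; a retry may help"
elif [ "$status" -ne 0 ]; then
    echo "ERROR: the command failed on its own, status=$status"
fi
```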
Go back to the original script, add a few more options, and you have version three:
1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * Open SSH: http://www.openssh.com/portable.html
5 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
6 # * JQ: http://stedolan.github.io/jq/
7 # * timeout: https://www.gnu.org/software/coreutils/
8 #
9 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
10 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
11 # Author: Jose Vicente Nunez
12 #
13 set -o errtrace # Enable the err trap, code will get called when an error is detected
14 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
15
16 declare -a dependencies=(/usr/bin/timeout /usr/bin/ssh /usr/bin/jq)
17 for dependency in ${dependencies[@]}; do
18 if [ ! -x $dependency ]; then
19 echo "ERROR: Missing $dependency"
20 exit 100
21 fi
22 done
23
24 declare -a servers=(
25 macmini2
26 mac-pro-1-1
27 dmaf5
28 )
29
30 function remote_copy {
31 local server=$1
32 echo "Visiting: $server"
33 /usr/bin/timeout --kill-after 25.0s 20.0s \
34 /usr/bin/scp \
35 -o BatchMode=yes \
36 -o logLevel=Error \
37 -o ConnectTimeout=5 \
38 -o ConnectionAttempts=3 \
39 ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json
40 return $?
41 }
42
43 DATADIR="$HOME/Documents/lshw-dump"
44 if [ ! -d "$DATADIR" ]; then
45 /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
46 fi
47 declare -A server_pid
48 for server in ${servers[*]}; do
49 remote_copy $server &
50 server_pid[$server]=$! # Save the PID of the scp of a given server for later
51 done
52 # Iterate through all the servers and:
53 # Wait for the return code of each
54 # Check the exit code from each scp
55 for server in ${!server_pid[*]}; do
56 wait ${server_pid[$server]}
57 test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
58 done
59 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
60 /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
61 done
What are the changes?
- Lines 16-22: Check that all the required dependency tools are present. If one cannot be executed, then ‘Houston we have a problem.’
- Line 30: Created a remote_copy function, which uses a timeout (line 33) to make sure the scp finishes no later than 45.0s after it starts.
- Line 37: Added a connection timeout of 5 seconds instead of the TCP default.
- Line 38: Added a retry to scp: 3 attempts that wait 1 second between each.
There are other ways to retry when there's an error.
Waiting for the end of the world: How and when to retry
You noticed there's an added retry on the scp command. But that only retries failed connections; what if the command fails in the middle of the copy?
Sometimes you want to just fail because there's very little chance of recovering from an issue (a system that requires hardware fixes, for example). Alternatively, you can fall back to a degraded mode, meaning that your system keeps working without the updated data. In those cases, it makes no sense to wait forever, but only for a specific amount of time.
To keep this brief, here are the changes to remote_copy (version four):
#!/bin/bash
# Omitted code for clarity...
declare REMOTE_FILE="/var/log/lshw-dump.json"
declare MAX_RETRIES=3
# Blah blah blah...
function remote_copy {
local server=$1
local retries=$2
local now=1
status=0
while [ $now -le $retries ]; do
echo "INFO: Trying to copy file from: $server, attempt=$now"
/usr/bin/timeout --kill-after 25.0s 20.0s \
/usr/bin/scp \
-o BatchMode=yes \
-o logLevel=Error \
-o ConnectTimeout=5 \
-o ConnectionAttempts=3 \
${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
status=$?
if [ $status -ne 0 ]; then
sleep_time=$(((RANDOM % 60)+ 1))
echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
/usr/bin/sleep ${sleep_time}s
else
break # All good, no point in waiting...
fi
((now=now+1))
done
return $status
}
DATADIR="$HOME/Documents/lshw-dump"
if [ ! -d "$DATADIR" ]; then
/usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
fi
declare -A server_pid
for server in ${servers[*]}; do
remote_copy $server $MAX_RETRIES &
server_pid[$server]=$! # Save the PID of the scp of a given server for later
done
# Iterate through all the servers and:
# Wait for the return code of each
# Check the exit code from each scp
for server in ${!server_pid[*]}; do
wait ${server_pid[$server]}
test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
done
# Blah blah blah, process the files you just copied...
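The random sleep between attempts helps keep all of your machines from retrying at the same moment. Another common pattern, not used in this script, is exponential backoff, where the wait doubles after each failure. A hypothetical helper (paths omitted for brevity):

```shell
#!/bin/bash
# Hypothetical sketch: retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_RETRIES command [args...]
function retry_with_backoff {
    local max_retries=$1
    shift
    local attempt=1
    local delay=1
    while [ "$attempt" -le "$max_retries" ]; do
        "$@" && return 0       # command succeeded, stop retrying
        echo "WARNING: attempt $attempt of $max_retries failed, waiting ${delay}s"
        sleep "$delay"
        delay=$((delay * 2))   # 1s, 2s, 4s, 8s...
        ((attempt++))
    done
    return 1                   # all attempts failed
}

# Example: retry_with_backoff 3 scp server:/var/log/lshw-dump.json .
```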
How does it look now? In this run, I have one system down (mac-pro-1-1) and one system without the file (macmini2). You can see that the copy from server dmaf5 works right away, but for the other two there are retries, each waiting a random interval between 1 and 60 seconds, before the script finally gives up:
INFO: Trying to copy file from: macmini2, attempt=1
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
INFO: Trying to copy file from: dmaf5, attempt=1
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '60 seconds' before re-trying...
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '32 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=2
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '18 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=2
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '3 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=3
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '6 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=3
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '47 seconds' before re-trying...
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
If I fail, do I have to do this all over again? Using a checkpoint
Suppose that the remote copy is the most expensive operation of this whole script, and that you're willing or able to re-run it, maybe using cron or by hand twice a day, to ensure you pick up the files if one or more systems are down.
You could, for the day, create a small ‘status cache’, where you record only the successful processing operations per machine. If a system is in there, then don't bother checking it again for that day.
Some programs, like Ansible, do something similar and allow you to retry a playbook on a limited number of machines after a failure (--limit @/home/user/site.retry
).
A new version (version five) of the script has code to record the status of the copy (lines 15-33):
15 declare SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
16 declare YYYYMMDD=$(/usr/bin/date +%Y%m%d)|| exit 100
17 declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
18 # Logic to clean up the cache dir on daily basis is not shown here
19 if [ ! -d "$CACHE_DIR" ]; then
20 /usr/bin/mkdir -p -v "$CACHE_DIR"|| exit 100
21 fi
22 trap "/bin/rm -rf $CACHE_DIR" INT TERM # KILL cannot be trapped
23
24 function check_previous_run {
25 local machine=$1
26 test -f $CACHE_DIR/$machine && return 0|| return 1
27 }
28
29 function mark_previous_run {
30 machine=$1
31 /usr/bin/touch $CACHE_DIR/$machine
32 return $?
33 }
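Here is the checkpoint idea in miniature, as a self-contained, hypothetical demo (it uses mktemp instead of the script's dated cache directory, and bare command names for brevity):

```shell
#!/bin/bash
# Hypothetical standalone demo of the marker-file checkpoint pattern.
CACHE_DIR=$(mktemp -d) || exit 100

function check_previous_run {
    test -f "$CACHE_DIR/$1"    # marker file exists = machine already done
}

function mark_previous_run {
    touch "$CACHE_DIR/$1"
}

check_previous_run dmaf5 || echo "dmaf5: no marker, copying"
mark_previous_run dmaf5
check_previous_run dmaf5 && echo "dmaf5: marker found, skipping"
rm -rf "$CACHE_DIR"
```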
Did you notice the trap on line 22? If the script is interrupted (killed), I want to make sure the whole cache is invalidated.
Then add this new helper logic into the remote_copy function (lines 52-81):
52 function remote_copy {
53 local server=$1
54 check_previous_run $server
55 test $? -eq 0 && echo "INFO: $1 ran successfully before. Not doing again" && return 0
56 local retries=$2
57 local now=1
58 status=0
59 while [ $now -le $retries ]; do
60 echo "INFO: Trying to copy file from: $server, attempt=$now"
61 /usr/bin/timeout --kill-after 25.0s 20.0s \
62 /usr/bin/scp \
63 -o BatchMode=yes \
64 -o logLevel=Error \
65 -o ConnectTimeout=5 \
66 -o ConnectionAttempts=3 \
67 ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
68 status=$?
69 if [ $status -ne 0 ]; then
70 sleep_time=$(((RANDOM % 60)+ 1))
71 echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
72 /usr/bin/sleep ${sleep_time}s
73 else
74 break # All good, no point in waiting...
75 fi
76 ((now=now+1))
77 done
78 test $status -eq 0 && mark_previous_run $server
79 test $? -ne 0 && status=1
80 return $status
81 }
The first time it runs, a new message for the cache directory is printed:
./collect_data_from_servers.v5.sh
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh'
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh/20210612'
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
If you run it again, the script knows that dmaf5 is good to go, with no need to retry the copy:
./collect_data_from_servers.v5.sh
INFO: dmaf5 ran successfully before. Not doing again
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
Imagine how much this speeds things up when you have more machines that should not be revisited.
Leaving crumbs behind: What to log, how to log, and verbose output
If you're like me, you want a bit of context to correlate with when something goes wrong. The echo statements in the script are nice, but what if you could add a timestamp to them?
If you use logger, you can save the output to journald for later review (even aggregation with other tools out there). The best part is that you unlock the power of journalctl right away.
So instead of just doing echo, you can also add a call to logger, like this, using a new bash function called ‘message’:
SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
FULL_PATH=$(/usr/bin/realpath ${BASH_SOURCE[0]})|| exit 100
set -o errtrace # Enable the err trap, code will get called when an error is detected
trap "echo ERROR: There was an error in ${FUNCNAME[0]-main context}, details to follow" ERR
declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
function message {
message="$1"
func_name="${2-unknown}"
priority=6
if [ -z "$2" ]; then
echo "INFO:" $message
else
echo "ERROR:" $message
priority=0
fi
/usr/bin/logger --journald<<EOF
MESSAGE_ID=$SCRIPT_NAME
MESSAGE=$message
PRIORITY=$priority
CODE_FILE=$FULL_PATH
CODE_FUNC=$func_name
EOF
}
You can see that you can store separate fields as part of the message, like the priority, the script that produced the message, etc.
So how is this useful? Well, you could get the messages between 1:26 PM and 1:27 PM, only errors (priority=0), and only for this script (collect_data_from_servers.v6.sh), like this, with the output in JSON format:
journalctl --since 13:26 --until 13:27 --output json-pretty PRIORITY=0 MESSAGE_ID=collect_data_from_servers.v6.sh
{
"_BOOT_ID" : "dfcda9a1a1cd406ebd88a339bec96fb6",
"_AUDIT_LOGINUID" : "1000",
"SYSLOG_IDENTIFIER" : "logger",
"PRIORITY" : "0",
"_TRANSPORT" : "journal",
"_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
"__REALTIME_TIMESTAMP" : "1623518797641880",
"_AUDIT_SESSION" : "3",
"_GID" : "1000",
"MESSAGE_ID" : "collect_data_from_servers.v6.sh",
"MESSAGE" : "Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '45 seconds' before re-trying...",
"_CAP_EFFECTIVE" : "0",
"CODE_FUNC" : "remote_copy",
"_MACHINE_ID" : "60d7a3f69b674aaebb600c0e82e01d05",
"_COMM" : "logger",
"CODE_FILE" : "/home/josevnz/BashError/collect_data_from_servers.v6.sh",
"_PID" : "41832",
"__MONOTONIC_TIMESTAMP" : "25928272252",
"_HOSTNAME" : "dmaf5",
"_SOURCE_REALTIME_TIMESTAMP" : "1623518797641843",
"__CURSOR" : "s=97bb6295795a4560ad6fdedd8143df97;i=1f826;b=dfcda9a1a1cd406ebd88a339bec96fb6;m=60972097c;t=5c494ed383898;x=921c71966b8943e3",
"_UID" : "1000"
}
Because this is structured data, other log collectors can go through all your machines and aggregate your script logs, and then you don't just have data, you have information.
You can take a look at the whole version six of the script.
Don't be so eager to replace your data until you've checked it.
As you may have noticed, from the very beginning I've been copying a corrupted JSON file over and over:
Parse error: Expected separator between values at line 4, column 11
ERROR parsing '/home/josevnz/Documents/lshw-dump/lshw-dmaf5-dump.json'
That’s easy to prevent: copy the file into a temporary location, and if the file is corrupted, don't attempt to replace the previous version (and leave the bad one for inspection). Lines 99-107 of version seven of the script:
function remote_copy {
local server=$1
check_previous_run $server
test $? -eq 0 && message "$1 ran successfully before. Not doing again" && return 0
local retries=$2
local now=1
status=0
while [ $now -le $retries ]; do
message "Trying to copy file from: $server, attempt=$now"
/usr/bin/timeout --kill-after 25.0s 20.0s \
/usr/bin/scp \
-o BatchMode=yes \
-o logLevel=Error \
-o ConnectTimeout=5 \
-o ConnectionAttempts=3 \
${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json.$$
status=$?
if [ $status -ne 0 ]; then
sleep_time=$(((RANDOM % 60)+ 1))
message "Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..." ${FUNCNAME[0]}
/usr/bin/sleep ${sleep_time}s
else
break # All good, no point in waiting...
fi
((now=now+1))
done
if [ $status -eq 0 ]; then
/usr/bin/jq '.' ${DATADIR}/lshw-$server-dump.json.$$ > /dev/null 2>&1
status=$?
if [ $status -eq 0 ]; then
/usr/bin/mv -v -f ${DATADIR}/lshw-$server-dump.json.$$ ${DATADIR}/lshw-$server-dump.json && mark_previous_run $server
test $? -ne 0 && status=1
else
message "${DATADIR}/lshw-$server-dump.json.$$ Is corrupted. Leaving for inspection..." ${FUNCNAME[0]}
fi
fi
return $status
}
Choose the right tools for the task and prep your code from the first line
One very important aspect of error handling is proper coding. If you have bad logic in your code, no amount of error handling will make it better. To keep this short and bash-related, I'll give you a few hints below.
You should ALWAYS check for syntax errors before running your script:
bash -n my_bash_script.sh
Seriously. It should be as automatic as performing any other test.
Read the bash man page and get familiar with must-know options, like:
set -xv
my_complicated_instruction1
my_complicated_instruction2
my_complicated_instruction3
set +xv
Use ShellCheck to check your bash scripts
It's very easy to miss simple issues when your scripts start to grow large. ShellCheck is one of those tools that saves you from making mistakes.
shellcheck collect_data_from_servers.v7.sh
In collect_data_from_servers.v7.sh line 15:
for dependency in ${dependencies[@]}; do
^----------------^ SC2068: Double quote array expansions to avoid re-splitting elements.
In collect_data_from_servers.v7.sh line 16:
if [ ! -x $dependency ]; then
^---------^ SC2086: Double quote to prevent globbing and word splitting.
Did you mean:
if [ ! -x "$dependency" ]; then
...
If you're wondering, the final version of the script, after passing ShellCheck, is here. Squeaky clean.
You noticed something with the background scp processes
You probably noticed that if you kill the script, it leaves some forked processes behind. That isn't good, and it's one of the reasons I prefer tools like Ansible or Parallel to handle this type of task on multiple hosts, letting the frameworks do the proper cleanup for me. You can, of course, add more code to handle this situation.
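If you do want the script to clean up after itself, one option (a sketch, not part of any of the article's versions) is an EXIT trap that kills any background jobs still running:

```shell
#!/bin/bash
# Hypothetical cleanup sketch: reap leftover background jobs on exit,
# whether the script ends normally, hits an error, or is interrupted.
# xargs -r (GNU): do nothing if the job list is empty.
trap 'jobs -p | xargs -r kill 2>/dev/null' EXIT

sleep 60 &    # stand-in for a long-running remote_copy
sleep 60 &
echo "exiting now; the EXIT trap kills both background jobs"
```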
This bash script could potentially create a fork bomb: it has no control over how many processes it spawns at the same time, which is a big problem in a real production environment. There is also a limit on how many concurrent ssh sessions you can have (let alone the bandwidth they consume). Again, I wrote this fictional example in bash to show you how you can always improve a program to handle errors better.
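If you want to stay in bash, a hypothetical way to cap concurrency is to check the number of running jobs before launching another copy (wait -n requires bash 4.3 or newer; the host list here is made up):

```shell
#!/bin/bash
# Hypothetical sketch: run at most MAX_JOBS background copies at once.
declare -r MAX_JOBS=2
declare -a servers=(macmini2 mac-pro-1-1 dmaf5 raspberrypi)

for server in "${servers[@]}"; do
    # Already at the limit? Block until any one background job finishes.
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        wait -n    # requires bash 4.3 or newer
    done
    sleep 1 &      # stand-in for: remote_copy "$server" &
    echo "Visiting: $server"
done
wait               # reap whatever is still running
echo "All copies finished"
```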
Let's recap
1. You must check the return code of your commands. That could mean deciding to retry until a transitory condition improves or to short-circuit the whole script.
2. Speaking of transitory conditions, you don't need to start from scratch. You can save the status of successful tasks and then retry from that point forward.
3. Bash 'trap' is your friend. Use it for cleanup and error handling.
4. When downloading data from any source, assume it's corrupted. Never overwrite your good data set with fresh data until you have done some integrity checks.
5. Take advantage of journalctl and custom fields. You can perform sophisticated searches looking for issues, and even send that data to log aggregators.
6. You can check the status of background tasks (including sub-shells). Just remember to save the PID and wait on it.
7. And finally: Use a Bash lint helper like ShellCheck. You can install it on your favorite editor (like VIM or PyCharm). You will be surprised how many errors go undetected on Bash scripts...
If you enjoyed this content or would like to expand on it, contact the team at enable-sysadmin@redhat.com.
About the author
Proud dad and husband, software developer and sysadmin. Recreational runner and geek.