High-performance computing (HPC) clusters need a parallel filesystem mounted on every node, and that storage is mounted over the InfiniBand (IB) network. The storage vendor adds the mount command for the Lustre filesystem to /etc/rc.local so that it mounts at boot time.
When I was building an HPC cluster, I needed to mount this storage on all of my master and compute nodes. However, whenever I rebooted a node, it failed to mount the Lustre filesystem. Ultimately, I wrote a script to solve this problem.
The Crux
Services often depend on other services, or on the network, being up before they can start. You can run into this kind of issue when you work with multiple original equipment manufacturers (OEMs) at once and an application from one OEM depends on a component from another. These dependencies make it hard to coordinate with all of the OEMs to resolve the issue.
That was my situation: I was getting support from neither the storage vendor nor my InfiniBand service provider. In cases like this, the system administrators at the system integrator have to resolve the issue themselves and get all services running reliably.
Basically, the storage OEMs add the mount command to /etc/rc.local to mount the network filesystem on boot. That command only succeeds if the IB network is already active, because the filesystem is mounted from the storage over the InfiniBand network.
In my case, the InfiniBand network took too long to become active, and Lustre tried to mount the filesystem before that happened.
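To make that condition concrete, here is a minimal sketch of how the port states reported by ibstat could be inspected. It is not part of my original setup: the helper names ib_port_states and ib_all_active are made up for illustration, and the sample text only imitates the layout of ibstat output rather than being captured from a real node. The sketch treats the fabric as ready only when every listed port reports Active, which is an assumption that may be stricter than your site needs.

```shell
#!/bin/bash
# Hypothetical helpers: read ibstat-style text on stdin.

# Print the state of every port. Only lines whose first field is "State:"
# are used, so "Physical state:" lines are ignored.
ib_port_states() {
    awk '$1 == "State:" { print $2 }'
}

# Succeed only if at least one port is listed and every port is Active.
ib_all_active() {
    local states
    states=$(ib_port_states)
    [ -n "$states" ] || return 1
    # grep -v selects lines that are NOT "Active"; finding one means failure.
    ! printf '%s\n' "$states" | grep -qv '^Active$'
}

# Illustrative sample (assumed layout), one port up and one port down.
sample_output() {
    printf 'Port 1:\n\tState: Active\n\tPhysical state: LinkUp\n'
    printf 'Port 2:\n\tState: Down\n\tPhysical state: Disabled\n'
}

if sample_output | ib_all_active; then
    echo "all ports active"
else
    echo "at least one port is not active"
fi
```

On a real node, the same check would read from the command itself, for example ibstat | ib_all_active.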
Creativity > Problem
To avoid this issue, I wrote a script that waits for InfiniBand before mounting, so that my filesystem mounts automatically at boot. I saved this script in a shared directory (in an HPC cluster, /home and /lscratch are NFS shared among all nodes) and then executed it from /etc/rc.local or a cron job:
[root@master ~]# cat /lscratch/lustre-script.sh
#!/bin/bash
for i in {1..150}
do
    # Re-read the current state of InfiniBand on every attempt.
    h=$(ibstat | grep State | awk '{print $2}')
    echo "Current Status of InfiniBand $h"
    if [[ "$h" == "Active" ]]
    then
        echo "IB is UP, mounting lustre File System."
        /bin/mount -o flock,defaults -t lustre 192.168.43.20@o2ib,192.168.43.23@o2ib1:192.168.43.26@o2ib,192.168.43.28@o2ib1:/lustreshare /storage
        # Exit with the status of the mount command.
        exit
    else
        echo "IB is down at checking $i"
        echo "Retrying.."
        sleep 2
    fi
done
# Double quotes let $(hostname) expand in the subject line.
mail -s "Unable to mount Lustre file System on $(hostname)" admin@mydomain.com <<< 'InfiniBand is not active, therefore did not try to mount the Lustre File System. Need your attention'
[root@master ~]# chmod +x /lscratch/lustre-script.sh
Here, I’ve saved the script under the shared directory /lscratch and made it executable. The script breaks down as follows:
- It reads the status of the InfiniBand network with ibstat, filters the State line from the command’s output, and stores the second column in a variable. The script looks for the word Active, which tells us that the InfiniBand network is up.
- A for loop tests the InfiniBand network’s status up to 150 times.
- In each iteration, the script prints the current state of the InfiniBand network.
- An if condition then checks that state, looking for the word Active. If the condition is true, the script prints the sentence: IB is UP, mounting lustre File System.
- After printing the sentence, the script tries to mount the Lustre filesystem and then exits. If the condition fails, the script prints two sentences, sleeps for two seconds, and then starts the next iteration.
With these steps, the script checks the InfiniBand state up to 150 times, two seconds apart, so it runs for up to 300 seconds. If the script finds Active, it mounts the Lustre filesystem and exits. The last line of the script sends an email to a specified address if the InfiniBand network does not become active within those 300 seconds.
Now, you can execute this script from /etc/rc.local. I’ve added the following entry to /etc/rc.local, which also creates a log of the script’s execution:
[root@master ~]# tail -n 1 /etc/rc.local
bash /lscratch/lustre-script.sh >> /lscratch/lustre-script.log 2>&1
These logs can help you monitor the script.
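Because /lscratch is shared among all nodes, every node appends to the same log file. A small helper could prefix each message with a timestamp and the node name so that entries from different nodes can be told apart. This is only a sketch under that assumption; the function name log is my own choice for illustration and is not part of the script shown earlier.

```shell
#!/bin/bash
# Hypothetical helper: print a message prefixed with a timestamp and the
# node name (uname -n), so lines from different nodes stay distinguishable
# in a shared log file.
log() {
    printf '%s %s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$(uname -n)" "$*"
}

log "IB is down at checking 1"
```

Inside the script, calls to log would take the place of the plain echo lines.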
We can also add a cron job for our script. Here is the entry from my setup, which runs after each reboot:
[root@master ~]# crontab -l
@reboot bash /lscratch/lustre-script.sh >> /lscratch/lustre-script.log 2>&1
Using this script, my Lustre mount command waits until the InfiniBand network becomes Active. This is the basic script I wrote to help my Lustre mount command. Following the same pattern, you can write your own script that waits for a prerequisite service to start before launching one that depends on it.
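As a rough illustration of that pattern, the retry loop can be pulled out into a reusable function. The sketch below is an assumption about how one might generalize it, not something from my cluster: the name wait_for and its argument order are hypothetical, and the demonstration uses the stand-in command true instead of a real health check.

```shell
#!/bin/bash
# Hypothetical helper: wait_for ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds after each failed try. Returns 0 on success, 1 if every try failed.
wait_for() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $i of $attempts failed, retrying in ${delay}s" >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Demonstration with a stand-in check that always succeeds.
if wait_for 3 0 true; then
    echo "dependency is up, starting the dependent step"
else
    echo "dependency never came up" >&2
fi
```

With 150 attempts and a delay of 2, this reproduces the 300-second window used for InfiniBand; the check itself can be any command that exits with status 0 once the prerequisite is ready.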
About the author
I'm a techie guy with lots of love for Linux. I started my career as a Linux administrator on a US-based project. Later, I got an opportunity to work with HPC clusters, where I learned several other products. I love to teach, write blogs, troubleshoot complex issues, and write scripts to automate tasks. I also love to read books and watch movies/web series.