How to approach router (and other) network incidents
When a network incident hits, the right approach can make the difference between chaos and a quick fix. Now that you know the problem is with the router, let’s look at how to approach it so you can get down to business quickly and efficiently.
This is the second article in a series on the router challenges that most system administrators face, drawn from experience on corporate networking teams. It assumes that you are familiar with the commands and interface for the type of router you operate, so the focus is on ways of thinking and acting when tracking down and resolving issues.
Your worst enemy during any high-impact incident is panic, because it prevents logical thinking and rational action and results in a wild stream of uncoordinated activities. Upset users who keep calling or shouting, and managers who "demand" an immediate resolution, can easily make the situation worse. So can well-meaning teammates who "fix" something you just changed, or who make uncoordinated changes to other network components in the hope of resolving the problem quickly. When this chaos occurs, you can soon find yourself in a much more complicated situation with completely new issues.
Find a trusted source of reference
I often use services from sunet.se, the Swedish University Network, which has been operational since the early 1980s. Its server, ping.sunet.se (184.108.40.206), can be used to reliably check network connectivity to and from the internet using two of the tools discussed in the previous article, ping and traceroute.
The ping utility uses Internet Control Message Protocol (ICMP) echo request and echo reply messages; ICMP has no port numbers. The traceroute utility sends, by default, a sequence of User Datagram Protocol (UDP) packets to ports 33434 to 33534.
On Linux, traceroute can also send ICMP echo request packets (-I), TCP SYN packets (-T), or packets of an arbitrary IP protocol (-P). This way, you can enter the command:
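# traceroute -T -p 22 220.127.116.11
to check if port 22 is reachable on the router, which makes it possible to initiate an SSH connection. The -T option sends TCP SYN probes; without it, traceroute sends UDP probes, which say nothing about an SSH listener.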
Again, I recommend that you read Anthony Critelli’s article, A beginner’s guide to network troubleshooting in Linux, where he explores the functional commands I have mentioned, plus several more.
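As a minimal sketch of the reference check described above, the following shell function pings a trusted host and reports the result. The function name check_reference and the REF_HOST variable are illustrative, and the -c and -W options assume the Linux ping utility.

```shell
#!/bin/sh
# Reference host named in this article; swap in whichever host you trust.
REF_HOST="${REF_HOST:-ping.sunet.se}"

# check_reference HOST
# Sends three ICMP echo requests (waiting up to 2 seconds for each reply)
# and prints whether the host answered. Returns non-zero if it did not.
check_reference() {
    if ping -c 3 -W 2 "$1" >/dev/null 2>&1; then
        echo "reachable: $1"
    else
        echo "unreachable: $1"
        return 1
    fi
}
```

Running check_reference "$REF_HOST" from a machine behind the router, and again from one outside it, helps narrow down on which side the connectivity breaks. If the host is unreachable, traceroute to the same host shows how far the packets get.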
Communicate and document
You and your team must be in complete sync during a high-impact incident. Share, talk, and communicate. Make sure you have alternative channels of communication like Slack or WhatsApp that allow you to stay connected for the duration of the incident and communicate who is doing what.
Use your cell phones or laptops for the chat, or whatever else works. This capability is a tremendous asset because chances are that you will not be in the same room, or perhaps not even in the same location. Being spread out also allows for different viewpoints, which is key when troubleshooting network issues.
You need to document what you have done. Doing this will help you to achieve structured troubleshooting and allow you, if necessary, to roll back. More importantly, if you need to involve other more senior technicians, you will be able to answer their first request: "Tell me what you have done."
The chat logs can also be used as part of the documentation, so agree upfront who will record the chat room history (or maybe do all of this yourself to prevent loss of information).
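One lightweight way to document actions as you go is a timestamped log file that every change gets appended to. The sketch below is only an illustration: the log_action name and the INCIDENT_LOG location are hypothetical, and a shared chat channel or ticket can serve the same purpose.

```shell
#!/bin/sh
# Hypothetical log location; override INCIDENT_LOG to suit your team.
INCIDENT_LOG="${INCIDENT_LOG:-./incident.log}"

# log_action MESSAGE...
# Appends one tab-separated line: UTC timestamp, the acting user, and
# a free-text description of what was done.
log_action() {
    printf '%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "${USER:-unknown}" \
        "$*" >> "$INCIDENT_LOG"
}
```

For example, log_action "disabled interface eth1 on edge router" records who did what and when, which is exactly what a senior technician will ask for first and what you need in order to roll back.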
Make use of visual reporting
If you can explain something to a less technical colleague (e.g., your manager), chances are that you will gain a supporter who can protect you from having to send unnecessary reports to other layers of management.
Ask your local manager to take care of communications to other units and upper management. Here is where graphs and colored trace results are very useful.
Have a plan
If you are able to show a plan of action, this puts everyone more at ease, even if the plan is rudimentary or the audience does not understand all of the steps.
Having a plan and explaining it to someone less technical also means you challenge yourself to go through the incident and the planned resolution in simpler terms. Doing this can be a catalyst for finding a solution. It is always better for the affected community to receive messages that outline a basic plan of action and show progress by indicating which steps have been completed, instead of the rather lofty and recurring chant, “We are working on it.”
Apply preventative measures
And finally, before anything happens at all, prepare your organization for how to act during a critical incident. Agree on who will be the communicator and regularly share this information with the rest of the organization. Doing this will ensure that you can focus on what you and your team need to do during an incident. Everyone should have their purpose and know how, and when, to communicate. Enforced change control through documentation and an automation tool like Ansible is essential.
Ansible should be regarded as an essential change control tool because software deployments to the router, along with its configuration and subsequent config changes, can all be driven from Ansible playbooks. With the playbooks kept under version control and run from a central server, you have a record of what was deployed, when, where, and by whom. This makes it easy to answer the question “Has anything been changed?” and the follow-up question “What was changed?” Ansible Tower is the central server that can host these playbooks for easy access and good security.
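Whatever tool manages the changes, answering “what was changed” comes down to comparing the current configuration with a known-good baseline. The following is a rough sketch, assuming both have been exported to plain-text files; the config_changed name and the file paths are illustrative, and nothing here talks to a router or to Ansible.

```shell
#!/bin/sh
# config_changed BASELINE CURRENT
# Prints a unified diff of the two files followed by a one-line summary.
# Returns 0 if they are identical, 1 if they differ, and diff's own
# error status (greater than 1) if the files could not be compared.
config_changed() {
    if diff -u "$1" "$2"; then
        echo "no changes since baseline"
    else
        status=$?
        if [ "$status" -eq 1 ]; then
            echo "configuration differs from baseline"
        else
            echo "could not compare files" >&2
        fi
        return "$status"
    fi
}
```

A call such as config_changed baseline.cfg current.cfg (both file names hypothetical) then shows the changed lines, which can go straight into the incident documentation.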
So, let’s review. Stay calm and don’t panic. Make sure you are connected to your own team and other relevant teams. Collaborate and share in a structured way. Make sure the designated communications person is ready and can shield you all from a reporting frenzy. Find a trusted point of reference like an internet provider. Have a plan and document what you have done.
In a future article, we will look at how to actually troubleshoot the router.