When you have exhausted most of your troubleshooting measures, there are a few additional actions to consider. Here is a guide to support you through the last few steps of resolving your router issues.
Drawing on experience from corporate networking teams, this is the fourth article in a series on the router challenges that most system administrators face. It assumes that you are familiar with the commands and interface of the router you are operating, so the focus is on ways of thinking and acting when tracking down and resolving issues.
To reboot or not
Restarting the router is still a valid approach to addressing connectivity issues. While you can unplug the power cable from your home router, doing so might be a different challenge altogether with a company router possibly situated in a data center far away. So, a reboot might require you to physically send someone to flick the switch or unplug the cable.
The repercussions of a reboot can also be more severe if the router has additional interfaces that support transactional operations (e.g., databases). Panic and stress are your worst enemies in situations like these. Shouting managers and upset users might cause hasty decisions, which could make the situation worse.
If the primary router is the one having issues and its configuration is not properly mirrored to the secondary router, taking the primary off-line might cause the secondary to start distributing out-of-date routing information, making the problem worse.
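One way to check the mirroring before taking the primary off-line is to compare exported copies of both configurations. The following is a minimal sketch, assuming you have already saved each router's configuration as text; the function name `config_drift` and the sample route lines are hypothetical, and some lines (such as hostnames) may legitimately differ between the two routers.

```python
import difflib


def config_drift(primary: str, secondary: str) -> list[str]:
    """Return unified-diff lines between two config texts (empty if identical)."""
    return list(
        difflib.unified_diff(
            primary.splitlines(),
            secondary.splitlines(),
            fromfile="primary",
            tofile="secondary",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical configuration fragments, for illustration only.
    primary_cfg = "ip route 10.0.0.0 255.0.0.0 192.0.2.1\n"
    secondary_cfg = "ip route 10.0.0.0 255.0.0.0 192.0.2.254\n"

    drift = config_drift(primary_cfg, secondary_cfg)
    if drift:
        print("Configurations differ; review before failing over:")
        print("\n".join(drift))
    else:
        print("Configurations match.")
```

An empty result means the two texts are identical. Anything else is a list of differences to review before the secondary takes over.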
Always work step-by-step and don’t use any shortcuts.
Apply updates regularly
Keeping routers up-to-date and avoiding the buildup of technical debt is preferable, even though an update might itself contain issues. If you run the update routine regularly and with a high degree of automation, preferably using a tool like Ansible Tower, you can also roll back in the rare case that an update causes issues.
There is a much bigger risk, both from a security and functionality point of view, if you let your routers lag behind in the update schedule. Having a configuration management database (CMDB), or equivalent, where you can find an overview of all network components and their life cycle management history will help you to maintain a dependable network environment.
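As an illustration of the overview such an inventory provides, here is a minimal sketch that flags routers lagging behind the update schedule. The record fields, router names, and the 90-day threshold are assumptions made for this example; a real CMDB will have its own schema and query interface.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RouterRecord:
    """Hypothetical inventory entry; a real CMDB schema will differ."""
    name: str
    firmware: str
    last_updated: date


def overdue(records, today, max_age_days=90):
    """Return the names of routers not updated within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [r.name for r in records if r.last_updated < cutoff]


if __name__ == "__main__":
    # Made-up inventory data, for illustration only.
    inventory = [
        RouterRecord("core-1", "1.4.2", date(2024, 5, 1)),
        RouterRecord("core-2", "1.2.0", date(2023, 11, 15)),
    ]
    print(overdue(inventory, today=date(2024, 6, 1)))
```

Passing `today` explicitly rather than reading the clock inside the function keeps the check easy to test and to rerun for a past date.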
Check Common Vulnerabilities and Exposures alerts
The Common Vulnerabilities and Exposures (CVE) program collects and publishes known weaknesses in software and hardware, and it very likely has entries covering your network product. Cisco, for example, publishes this kind of information in its security advisories.
Connect with the console cable
Again, if the router is in a different geographical location from you, this action might not be an option. However, if you are the person on the ground and have access to the infrastructure center, give this a try. Consult the router manual to make sure that you understand the messages relayed back to your console.
Check the log files
A classic error is having overgrown log files that bloat the system and slow operations down to a crawl (or a complete halt). In this case, you need to flush the log files and start over, while making sure that the logs are "eating their own tail" so that they never exceed a maximum size. This rotation can usually be controlled through time or size parameters, and in some cases both criteria can be applied.
Often these are log files stored locally on the router, unless they are being pulled off using SNMP (or another mechanism) and stored elsewhere. Of course, this means that you have to be able to access the router first.
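Routers have their own settings for limiting log size, so consult the manual for the exact commands. To illustrate the "eating their own tail" principle itself, here is a minimal sketch using Python's standard `RotatingFileHandler`, for example on a host that stores logs pulled off the router. The function name, logger name, and size limits are assumptions chosen for this example.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_capped_logger(path, max_bytes=1_000_000, backups=3, name="router-events"):
    """Create a logger whose files are capped in size and number.

    When the active file would reach max_bytes, it is renamed to path.1
    (older files shift to .2, .3, ...) and the oldest backup is dropped.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Remove handlers left over from an earlier call to avoid duplicate lines.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_capped_logger("router.log", max_bytes=10_000, backups=2)
    for i in range(1000):
        log.info("interface status check %d", i)
```

With these settings, disk usage stays bounded at roughly `max_bytes * (backups + 1)`. For time-based rotation, the same module provides `TimedRotatingFileHandler`.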
Call tech support
A support contract is usually not valued until it’s needed, so in preparation for an event like this, make sure you know where the support contract is. Also, make sure that you have easy access to important details like the contract number, the expiry date, and the class of incident you are reporting. If the contract requires the names of those allowed to call in, make sure that the list is up-to-date.
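These details can be checked as part of a regular routine rather than during an incident. The following is a minimal sketch of such a check; the `SupportContract` fields, the contract number, and the caller names are hypothetical, and an empty caller list is taken to mean that the contract does not restrict who may call.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SupportContract:
    """Hypothetical record of the details support will ask for."""
    number: str
    expires: date
    authorized_callers: list[str] = field(default_factory=list)


def readiness_problems(contract, caller, today):
    """Return a list of problems that would delay a support call."""
    problems = []
    if not contract.number:
        problems.append("contract number missing")
    if contract.expires < today:
        problems.append("contract expired")
    # An empty list means the contract does not restrict callers.
    if contract.authorized_callers and caller not in contract.authorized_callers:
        problems.append(f"{caller} is not on the authorized caller list")
    return problems


if __name__ == "__main__":
    # Made-up contract data, for illustration only.
    contract = SupportContract("C-0001", date(2024, 12, 31), ["alice", "bob"])
    print(readiness_problems(contract, caller="carol", today=date(2025, 1, 15)))
```

An empty result means nothing in the record should hold up the call.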
If the supporting organization or company lacks access to the router, you will end up explaining what’s happening to them screen-by-screen. This process is risky, slow, and prone to even more errors due to misunderstandings. Make sure that you can either share your screen or send screenshots to support. That way, they can see exactly where you are and what options are available.
Make sure that you test this procedure before you are stuck in the middle of an incident. Practice makes perfect, so a regular check with the support organization to ensure smooth communication is of the essence.
Also, make sure that you and the support organization use the same nomenclature. For example: "I have a product connected to the router." The support organization assumes you are talking about a technical device, but you are perhaps talking about one of the products your company sells, where the data for that product is passing through the router. Valuable time can be lost due to these misunderstandings, and the wrong remediations can be applied and cause new issues.
Remote access is faster, but not without risk. When you have someone from support accessing the router, make sure they explain every step of what they’re doing. You need to understand and take notes regarding any change they invoke. Again, this process is your lifeline to back out in case things go from bad to worse. Documentation is also what will support you once the incident has been resolved.
Applying preventative measures
Most of us have taken part in fire drills or first aid training, but how many of you have trained for a technical scenario involving router and network incidents? Being prepared and having trained for these procedures before an incident happens is valuable and will save you time and frustration once a real one occurs. Training on communication and checking your support contracts will allow you and your team to correct, improve, and update the procedures and supporting documentation.
Split up into different rooms, set up a network, and mock an incident. Also, arrange to have someone disturb the recovery work by claiming they need the room for something else, just to create more confusion and see how the team reacts.
Having already done what you and your team can do for the incident, it could be that you reach the limit of your capacity and need to call support for further assistance, or perhaps you need to purchase a new router because the hardware is broken. Don’t be alarmed or shy once you realize that you need assistance. If you followed a documented procedure and are still unable to resolve the issue, then it is time to call support. Continue documenting and ensure that you have logs to support what was done, which will then help your team the next time there is an incident.