Once a router incident is resolved, it is time to collate and summarize the work that was done so that you and the team can gain valuable lessons ahead of the next incident. If you are in a regulated industry, this step is especially crucial because there are legal requirements to always have proof of what was done, when, and by whom.
This is the fifth and last article in a series on the network router challenges that most system administrators face, drawn from experience on corporate networking teams. This guide assumes that you are familiar with the commands and interface for the type of router you are operating, so the focus is on ways of thinking and acting when it comes to tracking down and resolving issues.
Practice makes perfect
This can't be said enough. Since you probably don’t have a network or router crisis right now, ask yourself: have you ever arranged network or router disaster training? Now could be a good time to do so. Facing a network outage without any training will most likely lead to a longer resolution time and a weaker overall result. During this training, you will have the opportunity to write down and practice procedures like the ones outlined here. This process will inspire confidence and promote collaboration so you are as ready as you can be when (not if) a high-impact incident occurs.
You don’t want to put yourself into a situation where you have to explain to management: "I called someone from support, and they did something, and now it looks like it’s fixed." As mentioned before, if you work in a regulated industry (e.g., defense or medical), not having proper documentation is frequently regarded as a transgression.
Let’s take a look at the key things to consider during the post-mortem documentation phase. First, properly documenting all steps is of the essence. Everyone will learn from this process. Documentation also forces people to work more slowly and think more, which helps to avoid panic.
Also, chat logs are splendid. You get time stamps and can see who was involved and how events unfolded. Having a shared chat is good and efficient, but you need a moderator, especially during high-impact incidents. The moderator will ensure that options are explored sequentially and not in parallel. They will also ensure that everyone is involved and can have their say. Remember that the person who can type the fastest is not necessarily the person who has the best solutions. Find a balance.
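To show how a chat log can feed the post-mortem, here is a minimal Python sketch that turns an exported log into a sorted timeline, a list of participants, and a duration. The line format (`YYYY-MM-DD HH:MM:SS <name> message`), the sample messages, and the function names are all assumptions for illustration. Your chat tool will export something different, so adjust the pattern accordingly.

```python
import re
from datetime import datetime

# Assumed export format (hypothetical): "2024-03-01 14:02:11 <alice> text"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) <([^>]+)> (.*)$")


def parse_chat_log(text):
    """Return a time-sorted list of (timestamp, author, message) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        stamp, author, message = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        entries.append((when, author, message))
    entries.sort(key=lambda entry: entry[0])
    return entries


def participants(entries):
    """Return the set of people who wrote in the channel."""
    return {author for _, author, _ in entries}


def duration(entries):
    """Return the time between the first and last message, or None if empty."""
    if not entries:
        return None
    return entries[-1][0] - entries[0][0]


# Made-up sample data, only here to make the sketch runnable.
SAMPLE_LOG = """\
2024-03-01 14:02:11 <alice> Core router stopped forwarding on uplink 2
2024-03-01 14:05:40 <bob> Checking interface counters now
2024-03-01 14:21:03 <alice> Rolled back last config change, traffic is flowing
"""

if __name__ == "__main__":
    timeline = parse_chat_log(SAMPLE_LOG)
    for when, author, message in timeline:
        print(f"{when.isoformat()}  {author}: {message}")
    print("Involved:", ", ".join(sorted(participants(timeline))))
    print("Duration:", duration(timeline))
```

A timeline like this is only raw material. The moderator or incident owner still has to annotate which messages were decisions and which were dead ends.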
Finding the incident’s root cause is important, or the issue is likely to recur at an equally inconvenient time. Remember that a reboot is a temporary workaround, not a permanent solution. If rebooting is the only thing that resolves the issue, then you should replace the hardware.
The actual root cause is most likely found in one of the following categories: human error, software bugs, overload, outdated hardware, or hardware failure. Occasionally, it stays unknown.
Most of the time, systems do what we tell them. Unfortunately, there are times that we (by mistake or insufficient training) tell them to do the wrong thing. Situations such as "opposing admins," where two colleagues implement different things unaware of the other, can also cause incidents.
Stress and fatigue are probably the most common causes of manual errors. Hence, it is much more effective to use a tool like Red Hat Ansible to script all changes, which also makes it easy to revert to a previous state. As a bonus, you can always see who deployed what, and where.
With so many updates, the opportunity for bugs to enter the stage is always present. Many bugs are insignificant, and we will never notice them, but every now and then something bad happens and there is an incident. Always make sure you can safely and easily revert to a previous state when it comes to both configuration and firmware.
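Being able to revert starts with knowing exactly what changed. The sketch below compares a saved configuration snapshot with the current one using Python's standard `difflib` module. The configuration text, the labels `previous` and `current`, and the function names are made up for illustration. How you fetch and store snapshots depends on your router and tooling and is not shown here.

```python
import difflib


def config_diff(previous, current):
    """Return unified-diff lines between two configuration texts."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


def changed_lines(diff):
    """Return only added and removed lines, skipping the two file headers."""
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical snapshots; real ones would come from your backup location.
PREVIOUS = "hostname edge-1\ninterface eth0\n mtu 1500\n"
CURRENT = "hostname edge-1\ninterface eth0\n mtu 9000\n"

if __name__ == "__main__":
    diff = config_diff(PREVIOUS, CURRENT)
    if not diff:
        print("No configuration drift since the last snapshot.")
    for line in diff:
        print(line)
```

Attaching a diff like this to the incident report shows precisely what a rollback would undo, which is easier to review than two full configuration files.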
At times, network traffic places a great strain on your network components, and a router is one of those that can become overloaded. The problem could come from the sheer volume of connections and data being transmitted. A malicious attack can have this result, but the same thing could come from a temporary surge in requests that will soon revert to normal. Be patient and check carefully in these situations to see if you need to do something or just wait it out.
A figure I have heard quoted is that something like 90% of all the data ever created has come about in the last three years. Whatever the exact number, data volumes keep growing rapidly, which means network components like routers are under ever-increasing pressure. Monitor the health of your routers so that you can address overload in advance by re-routing traffic or upgrading the routers.
In my opinion, outdated hardware often results in what I described above: permanent overload. However, there is also the situation where a piece of hardware is not compatible with a newer function or protocol. In that case, the hardware does not have enough capacity to be reliable, and it starts causing incidents.
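One way to tell a temporary surge from lasting overload is to look at how many consecutive monitoring samples exceed a limit. The following Python sketch does that for a list of utilization percentages. The 80% threshold, the five-sample window, and the function name are assumptions chosen for illustration, not recommendations. Tune them to your own hardware and polling interval.

```python
def classify_load(samples, threshold=80.0, sustained=5):
    """Classify utilization samples (percent) as 'normal', 'spike', or 'sustained'.

    'sustained' means at least `sustained` consecutive samples above `threshold`.
    'spike' means the threshold was exceeded, but only in shorter runs.
    """
    longest = 0  # longest run of consecutive samples above the threshold
    run = 0      # length of the run currently being counted
    for value in samples:
        run = run + 1 if value > threshold else 0
        longest = max(longest, run)
    if longest >= sustained:
        return "sustained"
    if longest > 0:
        return "spike"
    return "normal"


if __name__ == "__main__":
    # Made-up utilization samples, for example one reading per minute.
    print(classify_load([40, 95, 50, 45, 42, 41]))
    print(classify_load([85, 90, 92, 88, 91, 60]))
```

A "spike" result suggests waiting and watching, while a "sustained" result is the cue to consider re-routing traffic or planning an upgrade.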
Don’t leave upgrading the network components for last.
Overheated or out-of-warranty components eventually break. If the router is component-based, you might be able to just replace the faulty interface. But if it is more integrated, or perhaps old enough to be due for replacement, then go for a newer product.
Sometimes the root cause remains unknown. I have worked in network environments where we were not able to establish the root cause, nor were we able to replicate the behavior that caused the incident. Deleted logs prevented us from viewing what happened, and the hardware or software providers could not guide us any further.
In that situation, once you have done what you can and the network or router is operational again, there is not much more to do except perhaps tweak the monitoring and logging in case the problem recurs.
Applying preventative measures
It’s best to be prepared. Create documentation and make sure it is available in an easily accessible form. Have procedures in place for how to handle an incident. Keep your support contracts available and up-to-date. Arrange continuous training and learning so that you and your team hold relevant certifications, and combine it with regular incident training. Be active in relevant communities where you can share and pick up useful information.
These are the best ways to stay ahead of the game.
Hopefully, this series has left you feeling more comfortable with the idea of an incident happening. Let’s review this final stage.
Documentation, including logs from both the router and chat rooms, is most valuable when it comes to the wrap-up. Graphs of data transmissions that show the state before and after the resolution are useful. In the documentation, always refer to the involved people using full names, plus relevant email addresses and other contact details.
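If you want the wrap-up to follow the same shape every time, a small structured record can help. The Python sketch below is one possible layout, not a standard: the class names, field names, and sample values (including the `example.com` addresses) are all hypothetical. It captures who was involved, what was done, and when, and renders it as plain text.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Person:
    full_name: str
    email: str


@dataclass
class IncidentRecord:
    title: str
    started: datetime
    resolved: datetime
    root_cause: str
    people: list = field(default_factory=list)
    # Each action is a (timestamp, full name, description) tuple.
    actions: list = field(default_factory=list)

    def render(self):
        """Return the record as a plain-text report."""
        lines = [
            f"Incident: {self.title}",
            f"Started:  {self.started.isoformat()}",
            f"Resolved: {self.resolved.isoformat()}",
            f"Duration: {self.resolved - self.started}",
            f"Root cause: {self.root_cause}",
            "People involved:",
        ]
        lines += [f"  {p.full_name} <{p.email}>" for p in self.people]
        lines.append("Actions taken:")
        lines += [
            f"  {when.isoformat()}  {who}: {what}"
            for when, who, what in self.actions
        ]
        return "\n".join(lines)


def example_record():
    """Build a made-up record, only to demonstrate the layout."""
    return IncidentRecord(
        title="Uplink 2 stopped forwarding",
        started=datetime(2024, 3, 1, 14, 2),
        resolved=datetime(2024, 3, 1, 14, 21),
        root_cause="Human error: conflicting configuration change",
        people=[Person("Alice Example", "alice@example.com")],
        actions=[
            (datetime(2024, 3, 1, 14, 20), "Alice Example", "Rolled back last change"),
        ],
    )


if __name__ == "__main__":
    print(example_record().render())
```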
Document everything in a way that a new team member can pick up a procedure and know what to do. Be clear. It will help everyone the next time around.