Networking for sysadmins: Recognizing router problems
Drawing on my experience in the IT corridors of corporate networking teams, this article is the first in a series on the router challenges that most system administrators face. This guide assumes that you are familiar with the commands and interface of the router you are operating, so the focus is on what to consider and do when tracking down and resolving issues.
Investigating connectivity issues
So, how do you know which network device is causing the issue? Don't feel discouraged if you have trouble answering this question because it is easy to be misled. The fastest check is the good old ICMP echo request, better known as ping. If you are on the same subnet as the router and cannot get a response, you are already closing in on the issue.
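As a minimal sketch of that first check, the helpers below find the default gateway and ping it. They assume a Linux host with the iproute2 `ip` command and the iputils `ping`; the function names `default_gateway` and `ping_gateway` are made up for this illustration.

```shell
# Read `ip route show` output on stdin and print the default gateway address.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Ping the default gateway three times, waiting up to 2 seconds per reply.
# Assumes iproute2 and iputils are installed (typical on Linux).
ping_gateway() {
    gw=$(ip route show | default_gateway)
    if [ -z "$gw" ]; then
        echo "no default gateway found" >&2
        return 1
    fi
    ping -c 3 -W 2 "$gw"
}
```

Running `ping_gateway` from a host on the router's subnet gives a quick yes or no before you reach for heavier tools.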
There is a great article by Anthony Critelli on network troubleshooting where he describes how to use traceroute and other useful commands. You can use these tools to investigate until you find the problem.
Let’s take a look at what you need to keep in mind.
If an issue has occurred with a router that controls the path to or from the internet, the effect is immediate. Few network monitoring systems are faster than your users at detecting that the internet connection is down.
Even if your users don’t send word, your monitoring systems will most likely send alerts—if not, then you have some configuring to do. Another dependable indicator that your internet-connecting router is down is when outbound company email starts to queue up.
Any connection involving the internet usually entails a firewall or two, so make sure you are troubleshooting the device that is the actual cause. Also, keep in mind that if a troublesome router is in front of the corporate directory server and the error occurs during a weekend or toward the end of a working day, the error might not be immediately obvious since the directory server is only used when sessions initiate or re-initiate.
And please remember that outbound connectivity is just one side of the challenge.
Inbound connectivity is equally important to assess when determining if the failure is complete or partial, so you can best direct your efforts. A quick check for inbound connectivity is to send a message from your private internet-based account to your company account. If you’ve already determined that your outbound connection is down and this check also fails, you know that the connection is down both ways. By looking at the email headers (the Received trace lines that each mail server adds), you can determine where the email failed.
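To make the header check concrete, here is a small sketch that prints the Received lines from a saved message, one hop per line. It assumes the message was saved as a plain-text file; `received_hops` is a hypothetical helper name, not a standard tool.

```shell
# Print each Received header of a mail message on one line.
# Reads the file(s) given as arguments, or stdin if none are given.
received_hops() {
    awk '
        { sub(/\r$/, "") }                  # tolerate CRLF line endings
        /^$/ { exit }                       # a blank line ends the headers
        /^[ \t]/ {                          # folded continuation line
            if (in_received) { sub(/^[ \t]+/, ""); hop = hop " " $0 }
            next
        }
        {
            if (in_received) print hop      # flush the previous hop
            in_received = (tolower($0) ~ /^received:/)
            if (in_received) hop = $0
        }
        END { if (in_received) print hop }
    ' "$@"
}
```

Mail servers add each Received line at the top, so reading the output from bottom to top follows the message from sender to recipient. The last hop that appears is roughly where delivery stopped.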
Checking the company website from outside the firewall can also be useful. You probably even have a web monitoring service already in place, and there are plenty to go around—if you don’t have one, this starting point gives advice and also offers tools you can use immediately.
Of course, there are other potential culprits (not explored in this document) that can cause or contribute to connectivity incidents:
- Cables (e.g., physically disconnected or broken)
- Domain Name Servers (e.g., down or configuration issues)
- Firewalls (e.g., down or configuration issues)
- Switches (e.g., broken or not connected properly)
- Network Interface Cards (e.g., broken or outdated NIC drivers)
- Internet Service Providers (e.g., down or configuration issues)
Investigating the router itself
If at this point you strongly believe the router is the issue, try to SSH into the router to determine if there is a slow connection or no connection. Many times, a bit of patience can pay off and avoid unnecessary actions. The router could be running a high CPU load, which makes it take "forever" to respond, making it difficult to log in.
A good start is to understand what the SSH protocol is and how it works. Key problems are not uncommon in organizations that use shared, anonymous "admin" accounts to access network components and servers. You can find information on how to configure SSH for Red Hat Enterprise Linux 7, specifically, here.
It is good to know if you are going through a firewall when attempting to connect to the router. The rules in the firewall could have been changed without you knowing. If you can’t SSH into the router at all, ensure that you can connect to the router on port 22 and get a response. Port 22 might have been blocked, in which case you need to connect to the router via another port, such as 80 or 443. The classic CLI is, of course, the source of truth, but a graphical tool can be useful for sharing results with other admins and managers (who inevitably show up at this point).
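One way to test whether port 22 answers, without depending on a particular netcat variant, is bash's built-in `/dev/tcp` redirection wrapped in the coreutils `timeout` command. This is only a sketch under those assumptions (bash built with network redirections, GNU coreutils present); `port_open` is a hypothetical name.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# Usage: port_open <host> <port> [timeout_seconds]
port_open() {
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For example, `port_open 192.0.2.1 22 && echo "port 22 answers"` (the address is a placeholder). An exit status of 124 comes from `timeout` itself and suggests the packets were silently dropped, while a quick failure with another status suggests the connection was refused.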
Once you establish that port 22 is available, the next thing to check is the user credentials. In this case, the assumption is that you connect as root:
$ ssh root@corporatebigrouter.com
Check if SSH is enabled on the router by issuing the command:
$ sh 126.96.36.199 ssh
SSH Enabled - version 1.99
Authentication timeout: 120 secs; Authentication retries: 3
However, when you try to connect, you could get an error message similar to:
ssh: Could not resolve hostname corporatebigrouter.com: Name or service not known
To ensure that this error is not a DNS issue, use the router’s IP address to connect. If the host does not answer on the port at all, for example because a firewall silently drops the packets or the host is unreachable, you might get an error message like this:
ssh: connect to host 188.8.131.52 port 22: Connection timed out
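A rejected SSH key shows up only after the TCP connection succeeds, typically as a "Permission denied (publickey)" message. If instead the host is reachable but nothing is listening on port 22, or a firewall actively rejects the connection, the error message would be this: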
ssh: connect to host 184.108.40.206 port 22: Connection refused
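The mapping from OpenSSH client messages to likely causes can be sketched as a small lookup. The labels are made up for illustration, and the matching is a rough heuristic rather than a complete list of what `ssh` can print; `classify_ssh_error` is a hypothetical helper.

```shell
# Map an ssh client error message to a rough, illustrative cause label.
classify_ssh_error() {
    case "$1" in
        *"Could not resolve hostname"*) echo "dns" ;;
        *"Connection timed out"*)       echo "filtered-or-unreachable" ;;
        *"Connection refused"*)         echo "nothing-listening-or-rejected" ;;
        *"Permission denied"*)          echo "authentication" ;;
        *)                              echo "unknown" ;;
    esac
}
```

A timeout usually points at dropped packets or an unreachable host, a refusal at a closed or actively rejected port, and "Permission denied" at credentials or keys, since it can only appear once the connection itself has been made.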
Also, most routers have a graphical web interface available via port 80, so that is yet another way to connect and check if you can access the router and get a response. Most of us would agree that the graphical interface can play tricks and is perhaps not as reliable as the CLI, but even so, it is another way to connect.
It might be worth mentioning port forwarding because this method might have been used and could be the reason for failing to connect. A common example is port 80 forwarded to port 180 and 443 forwarded to port 1443. Be careful not to use ports that are already reserved for something else. A list of well-known ports can be found here.
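If you want to probe the web interface from the command line, including on a forwarded port, a sketch with `curl` could look like the following. It assumes `curl` is installed; `web_ui_status` is a hypothetical name, and the ports you try are whatever your own forwarding rules use.

```shell
# Print the HTTP status code returned by a router's web interface.
# Usage: web_ui_status <host> <port> [http|https]
web_ui_status() {
    host=$1
    port=$2
    scheme=${3:-http}
    # -k accepts self-signed certificates, -m 5 gives up after five seconds,
    # and -w prints only the status code (000 means no HTTP response at all).
    curl -k -s -o /dev/null -m 5 -w '%{http_code}\n' "$scheme://$host:$port/"
}
```

For example, `web_ui_status 192.0.2.1 1443 https` would test the forwarded HTTPS port from the example above (the address is a placeholder). Any real status code, even 401 or 403, shows that the router's web server is alive.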
Understanding network alerts
There will surely be network monitoring tools that start to complain and send error messages. You need to understand what the messages really mean and if they are triggered for the correct reason.
Modern network devices will, of course, generate log files and SNMP traps that can be picked up by a monitoring tool. SNMP agents listen for queries on UDP port 161 and send traps to UDP port 162; SNMP over TLS or DTLS uses ports 10161 and 10162 instead. These messages give you additional opportunities to check if the router is available or not.
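Besides waiting for traps, you can poll the router's SNMP agent yourself. The sketch below assumes the Net-SNMP command-line tools are installed and that the router answers SNMP v2c queries with a read-only community string; both are assumptions about your environment, and `snmp_uptime` is a hypothetical name.

```shell
# Ask a device's SNMP agent (UDP port 161) for its uptime.
# Usage: snmp_uptime <host> [community]
snmp_uptime() {
    if ! command -v snmpget >/dev/null 2>&1; then
        echo "snmpget (Net-SNMP) is not installed" >&2
        return 127
    fi
    # 1.3.6.1.2.1.1.3.0 is sysUpTime.0 from the standard system group.
    # -t 2 -r 1 means a 2-second timeout and one retry.
    snmpget -v 2c -c "${2:-public}" -t 2 -r 1 "$1" 1.3.6.1.2.1.1.3.0
}
```

An answer means the router's control plane is alive even if SSH is sluggish, and a recently reset uptime value hints at an unplanned reboot.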
There might also be what I refer to as "silly devices": devices with rudimentary functionality such as "round-robin," older switches, or even network hubs. These devices can cause failures that only trigger low-level alarms, or more generally obscure the network’s status.
Some components are used for failover, which means that the connection could still be up but the primary or preferred route is not being used. In this case, the traffic now takes a different route that can be slower or even not intended to be used. Load balancing could generate similar results if your network monitoring is not set up to detect these issues.
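One simple way to notice that traffic has moved to a different route is to compare the current traceroute hops against a baseline you saved while the network was healthy. This is a sketch that assumes the common Linux `traceroute` output format with `-n`; the helper names and the baseline file are hypothetical.

```shell
# Read `traceroute -n` output on stdin and print the first address of each hop.
hop_addresses() {
    awk '$1 ~ /^[0-9]+$/ { print $2 }'
}

# Return 0 if the current path to <host> matches the saved baseline.
# Usage: path_matches_baseline <baseline_file> <host>
path_matches_baseline() {
    traceroute -n "$2" 2>/dev/null | hop_addresses | cmp -s "$1" -
}
```

You could create the baseline with `traceroute -n example.org | hop_addresses > baseline.txt` and later run `path_matches_baseline baseline.txt example.org || echo "route changed"`. Load-balanced paths can legitimately vary between runs, so treat a mismatch as a prompt to look closer rather than as proof of failover.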
Applying preventative measures
Preventing a router from having issues comes down to maintenance, routines, and documentation. Make sure the firmware remains up to date, that the router and its firmware are still supported, and keep track of your support contract. Script all software changes using a tool like Red Hat Ansible. Document what you plan to do, use logs to document what was done, and comment on the outcome.
It is always preferable to have a backup router, especially if the service is critical. If you are on a smaller budget and in a smaller location, there is still a case for a spare, because the business needs to keep running. Even a simpler router might keep you going until help arrives.
And finally, practice dealing with outages. You and your team must learn how to collaborate in challenging situations.
When there is a high-impact incident, make sure you work in a structured way. Consider the alarms you are getting and use multiple angles when checking both outbound and inbound connectivity. A router has several interfaces and often more than one port active, so you have options. Other network components can play tricks and divert your attention. Make sure you address the actual issue. Practice makes perfect—mock an incident and work on it together with your team.