With router issues confirmed, and after making sure you approach your network issue in an efficient way, here is a guide on how to troubleshoot the problem. This technique can also be used as a template for structuring your own work, as well as for communicating with colleagues who want to know what’s going on.
This guide assumes that you are familiar with the commands and interface for the type of router you are operating, so the focus is on ways of thinking and acting when it comes to tracking down and resolving issues.
Physical changes
Start by examining physical changes. If cables or network interfaces were changed, the router was replaced, or there was any other physical movement of the equipment, there is a good chance that one of the cables is not connected properly or is broken.
Try to resist the urge, in the midst of an incident, to tidy up a crow’s nest of patch cables connecting to the router, unless you have a fresh set of cables next to you and the know-how to ensure full functionality (half-duplex is not full).
Hardware failure
Oh, blasted! At an awkward moment Darth Vader noticed that his light saber was out of battery.
A quick and surprisingly accurate way of determining if there is a hardware issue is to check the LED lights on the router. Red is usually bad, but not always, so check the manual for your particular router to find out which LED lights should be active, what color they should have, and if they should be solid light or blinking.
Also, if you can connect to the router through one interface, you can view the router’s general status. That action will show you if any of the other interfaces have a hardware failure. See Anthony Critelli’s article, A beginner’s guide to network troubleshooting in Linux, for more details.
Router basics
Once you are connected to the router and have looked for hardware failures, check the other basics and then work from there. Here are the commands used by Cisco, as an example:
| Command | Description |
|---|---|
| show version | Provides an overview of the router. |
| show interfaces | Provides an overview of all interfaces in the router. |
| show logging | Tells you what kind of logging is configured. |
| show tech-support | Provides information about CPU and memory utilization through a combination of commands. |
Note: More Cisco router commands can be found here.
So, now you should know if the software is up-to-date, if the interfaces are configured and active, and if the router is overloaded or not. This is a good start.
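As a rough illustration of the "interfaces are configured and active" check, here is a minimal Python sketch that scans saved show interfaces output for interfaces that are not up/up. It assumes the usual Cisco IOS status line ("<name> is up, line protocol is down"); the function name and the sample text are made up for this example, and other platforms will need a different pattern.

```python
import re

# Matches the status line that begins each interface section in
# Cisco IOS `show interfaces` output, for example:
#   GigabitEthernet0/1 is up, line protocol is down
STATUS_LINE = re.compile(
    r"^(?P<name>\S+) is (?P<link>administratively down|up|down), "
    r"line protocol is (?P<protocol>up|down)"
)


def interfaces_needing_attention(show_interfaces_output):
    """Return (name, link, protocol) for every interface that is not up/up."""
    problems = []
    for line in show_interfaces_output.splitlines():
        match = STATUS_LINE.match(line)
        if match is None:
            continue  # indented detail lines do not match the pattern
        link, protocol = match["link"], match["protocol"]
        if link == "administratively down":
            continue  # shut down on purpose, so not treated as a fault
        if (link, protocol) != ("up", "up"):
            problems.append((match["name"], link, protocol))
    return problems


# Illustrative sample text (an assumption, not captured from a real device).
SAMPLE = """\
GigabitEthernet0/0 is up, line protocol is up
  MTU 1500 bytes
GigabitEthernet0/1 is up, line protocol is down
GigabitEthernet0/2 is administratively down, line protocol is down
"""

print(interfaces_needing_attention(SAMPLE))
```

Skipping administratively down interfaces is a design choice for this sketch: during an incident you usually care about links that should be working but are not.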
Firmware update
If you have verified that the firmware is out-of-date, don’t just slap on the latest version unless you have a verified plan of how to return to the previous state. During a high profile incident, you are probably in no fit state to read pages of release notes, leaving you unable to assess what other issues might be introduced with the new firmware. Should something go wrong and you don’t know how to back out, the problem just got bigger.
The worst scenario is if the router updates, and then on restart, refuses to come back up. In a big company scenario, you typically have double routers and failover, so you can take the misbehaving router offline while troubleshooting and have the secondary router manage the full load. This situation also means that you must ensure that the secondary router has the latest routing tables.
Never assume. Always check first.
Firmware updates might also reset or change configured values, which means (again) that you need a plan regarding how to reapply the current configuration or return to a previous state. This is where software like Red Hat Ansible comes in handy. Ansible can version, store, and apply configurations and software/firmware for all infrastructure components. Doing this will save you a lot of time and trouble. It also will provide sufficient logs as part of the documentation to show what was changed, by whom, and when.
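One simple way to spot values that a firmware update reset or changed is to save the configuration before the update and diff it against the configuration afterward. The sketch below uses Python's standard difflib module; the function name and the configuration snippets are hypothetical and only serve as an illustration.

```python
import difflib


def config_drift(before, after):
    """Return unified-diff lines describing what changed between two configs."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before-update",
            tofile="after-update",
            lineterm="",
        )
    )


# Hypothetical configuration snippets, for illustration only.
BEFORE = "hostname edge-1\nlogging buffered 64000\nip ssh version 2\n"
AFTER = "hostname edge-1\nlogging buffered 4096\nip ssh version 2\n"

for line in config_drift(BEFORE, AFTER):
    print(line)
```

An empty result means the two snapshots are identical. Anything else is a list of lines to review and, if needed, reapply.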
Another anonymous admin
Larger companies usually have more than one sysadmin managing network components, and with accounts like sudo and admin, it is not always clear who did what, even with log analysis.
With anonymous accounts and a high-pace workload, plus the added pressure of the internet not being available, it’s even harder to remember who did what and when. If your company is not already using a change management process, this is a good opportunity to consider doing so. In the simplest form, change management is a document where you write down which component you are working on, what date it is, and then start each line with a timestamp and write what you did. Even this simplest form of record is better than none.
With the help of documentation, you can go back and check if any change was completed recently that could potentially cause problems. Even if you are the only admin, I would recommend that you use a change management process (however simple) and document what you plan to do, what you did, and what the outcome was.
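The simplest form of change log described above (component, date, then one timestamped line per action) can be sketched in a few lines of Python. The function names and the "who" field are assumptions added for this example; a text file or wiki page works just as well.

```python
from datetime import datetime, timezone


def start_change_log(component, when=None):
    """Return the header lines of a new change log for one component."""
    when = when or datetime.now(timezone.utc)
    return [f"Component: {component}", f"Date: {when:%Y-%m-%d}"]


def record(log, action, who, when=None):
    """Append one timestamped line describing what was done and by whom."""
    when = when or datetime.now(timezone.utc)
    log.append(f"{when:%H:%M:%S}Z [{who}] {action}")
    return log


# Fixed timestamps keep the example reproducible; omit them in real use.
log = start_change_log(
    "edge-router-1", datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
)
record(
    log,
    "Backed up running configuration",
    "alice",
    datetime(2024, 1, 15, 9, 2, 30, tzinfo=timezone.utc),
)
print("\n".join(log))
```

Recording a named person on every line is what makes the log useful when several admins share a generic account.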
Primary router disagreement
Most corporate networks have more than one router, and close to the internet connection, there is usually failover functionality to ensure high availability. If incorrectly configured, this setup can cause arguments among the routers (trust me) about which is the primary router, and if a change is implemented in one router the other might not accept it. A classic example is when the primary router is taken offline and the secondary rises to power with obsolete routing information.
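To catch a secondary router holding obsolete routing information before it takes over, you can compare the two routing tables. The Python sketch below assumes you have already exported each table into a dictionary that maps prefix to next hop, and that both routers are supposed to agree on that mapping, which may not hold in every design. The function name and the sample routes are hypothetical.

```python
def compare_routes(primary, secondary):
    """Compare two {prefix: next_hop} tables and report the differences."""
    primary_prefixes = set(primary)
    secondary_prefixes = set(secondary)
    return {
        # Routes the secondary would lack if it took over right now.
        "missing": sorted(primary_prefixes - secondary_prefixes),
        # Routes only the secondary knows about, possibly obsolete.
        "stale": sorted(secondary_prefixes - primary_prefixes),
        # Same prefix on both routers, but a different next hop.
        "mismatched": sorted(
            prefix
            for prefix in primary_prefixes & secondary_prefixes
            if primary[prefix] != secondary[prefix]
        ),
    }


# Hypothetical routing data using documentation address ranges.
PRIMARY = {
    "192.0.2.0/24": "10.0.0.1",
    "198.51.100.0/24": "10.0.0.2",
    "203.0.113.0/24": "10.0.0.3",
}
SECONDARY = {
    "192.0.2.0/24": "10.0.0.1",
    "198.51.100.0/24": "10.0.0.9",
    "10.99.0.0/16": "10.0.0.4",
}

print(compare_routes(PRIMARY, SECONDARY))
```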
Protocol perception
Routers can use several routing protocols, such as OSPF, RIP, EIGRP, and BGP. If a configuration error leaves routers using different protocols, the resulting issues can be catastrophic or intermittent, so make sure every router uses the protocol defined in your standard.
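A quick way to check this across many devices is to scan saved configurations for routing protocol stanzas and flag anything outside your standard. This Python sketch assumes Cisco IOS-style configuration text, where each routing process starts with a line such as "router ospf 1". The function names, router names, and configuration snippets are hypothetical.

```python
import re

# Cisco IOS-style configs start each routing process with "router <protocol>".
ROUTER_STANZA = re.compile(r"^router (ospf|rip|eigrp|bgp)\b", re.MULTILINE)


def protocols_in_config(config_text):
    """Return the set of routing protocols enabled in one configuration."""
    return set(ROUTER_STANZA.findall(config_text))


def off_standard(configs, allowed):
    """Map each router name to the protocols it runs outside the allowed set."""
    findings = {}
    for name, text in configs.items():
        unexpected = sorted(protocols_in_config(text) - set(allowed))
        if unexpected:
            findings[name] = unexpected
    return findings


# Hypothetical configurations, for illustration only.
CONFIGS = {
    "core-1": "hostname core-1\nrouter ospf 1\n network 10.0.0.0 0.0.0.255 area 0\n",
    "core-2": "hostname core-2\nrouter rip\n version 2\n",
    "edge-1": "hostname edge-1\nrouter ospf 1\nrouter bgp 65001\n",
}

print(off_standard(CONFIGS, allowed={"ospf"}))
```

The allowed set can hold more than one protocol, for example OSPF internally and BGP at the edge, if that is what your standard specifies.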
Security breach
This is a scary one and should send you off to change passwords and SSH keys immediately to prevent additional damage and block the risk of being locked out of the system. Storing configurations in GitHub and using Ansible to retrieve them, and enforcing the desired state on network components, is a great way to prevent bad configurations from gaining a foothold in the network.
A former employee (perhaps even a sysadmin) with a grudge might have the ability to wreak havoc in the network unless you have a policy of regularly changing passwords and keeping strict control of user accounts. Many years ago, a company had the generic account "admin" and the secret password "penguin" on all network components, and guess what? Operations were disrupted for almost 24 hours, and thanks to everything being 100% manual, it took almost six months to weed out the old admin password from all components.
Communicate with the team, align with the security officer, gather evidence (document), and follow company routines (which, most likely, involve a police report).
Physically disconnect different segments of the network to contain the damage. Doing so gives you time to assess the damage and work out a plan of action to restore operations. Remember that panic is also your enemy, especially in these situations.
Router update
Any change management record that involves a router and is listed as "zero impact" should never be allowed or trusted. Updating a router is by default “high impact” because it has the same destructive potential as an excavator going through a porcelain shop. Make sure you have a working backup of the router configuration before attempting any sort of configuration change.
However, if a change is implemented and the network goes belly up, and you have no backup and are not sure what the configuration looked like before the change, you have to adopt a "simpler is better" strategy. Work your way toward a basic level of routing and then take it from there. A full restore in this scenario will most likely take time. It should involve more than one admin and have both backup and documentation as part of the result.
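A "working backup" means more than a file that exists. As a minimal Python sketch, assuming you have already retrieved the configuration as text, the following saves it with a checksum and then re-reads the file to confirm it is intact. The function names, label, and configuration text are hypothetical, and this only checks the stored copy; it does not prove the configuration restores cleanly onto a device.

```python
import hashlib
import tempfile
from pathlib import Path


def back_up_config(config_text, directory, label):
    """Write the config to directory/label.cfg and return (path, sha256)."""
    data = config_text.encode("utf-8")
    path = Path(directory) / f"{label}.cfg"
    path.write_bytes(data)  # bytes avoid newline translation on any platform
    return path, hashlib.sha256(data).hexdigest()


def backup_is_intact(path, expected_digest):
    """Re-read the backup from disk and confirm it matches the checksum."""
    actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return actual == expected_digest


# Hypothetical usage with a temporary directory standing in for backup storage.
with tempfile.TemporaryDirectory() as backup_dir:
    saved_path, digest = back_up_config(
        "hostname edge-1\nip ssh version 2\n", backup_dir, "edge-1-pre-change"
    )
    print(saved_path.name, backup_is_intact(saved_path, digest))
```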
Preventative measures
A tool like Ansible can keep track of which firmware and configurations are deployed, by whom, and at what time. You can use Ansible and enforce that all changes go through this tool.
Ansible also lets you keep a component in a "desired state," meaning that if someone tries to manually change the configuration by logging into the device, Ansible will restore the intended configuration within a defined time (e.g., 60 seconds).
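The "desired state" idea can be sketched independently of any tool: read the current configuration, compare it with the intended one, and reapply if they differ. The Python below is a conceptual illustration only and does not show how Ansible works internally. The function name and the read_config and apply_config callables are hypothetical stand-ins for whatever talks to your device.

```python
def enforce_desired_state(desired, read_config, apply_config):
    """Reapply the desired config if the device drifted; return True if it did."""
    if read_config() == desired:
        return False  # already in the desired state, nothing to do
    apply_config(desired)
    return True


# A dictionary stands in for a real device in this example.
device = {"config": "hostname edge-1\n"}  # someone removed a line by hand
DESIRED = "hostname edge-1\nip ssh version 2\n"

corrected = enforce_desired_state(
    DESIRED,
    read_config=lambda: device["config"],
    apply_config=lambda cfg: device.update(config=cfg),
)
print(corrected, device["config"] == DESIRED)
```

Running a check like this on a schedule (for example, every 60 seconds) is what turns a one-off comparison into continuous enforcement.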
Roundup
You can use this document as a checklist and work your way through it in order to avoid unstructured troubleshooting, which can lead to even more issues. It is essential to avoid adding stress and confusion to an already stressful situation. Make sure that the work you and your team perform is well structured and closely aligned. The worst scenario is, of course, a security breach, in which case you need to contain the damage, for example by physically disconnecting network segments.
Want more on networking topics? Check out the Linux networking cheat sheet.
About the author
Member of the Red Hat Accelerators and Red Hat Chapter Lead at Capgemini. More than 30 years of international career in the IT industry starting with servers and workstations, software development and later moved on to system administration managing a global network spanning +100 countries with more than 70.000 users. Experience from both setting up and shutting down data centers, migrating data and users between platforms. Many years of experience in recovering broken or unfinished IT projects.