Linux troubleshooting 101: System performance
Busy systems on a network used by multiple local users (or thousands of web users) experience performance problems during their lifecycles. Only systems that aren't busy are immune to the performance issues that plague us all. This article explores the usual suspects for finding and fixing performance problems.
What follows are generic guidelines, a basic summary of "places to start." Each problem is different, but as you gain more experience, you'll have a better idea of where and how to start looking for a particular problem. I believe that you can be taught troubleshooting basics, but you can't be taught experience or intuition. Those both come with time. Also, note that some problems manifest themselves in such a way that you start down one path and are led to another. This is frustrating but normal. For example, certain disk problems can cause your CPU use to spike, and memory problems can masquerade as disk performance issues. Start with the easy stuff and then work your way into the more complex. Don't complicate your life more than is necessary. Sometimes you just need to replace a network cable or reboot a system. Simple, but effective.
Reversing recent changes
Making changes in a production environment is necessary. Documenting those changes is mandatory. You'll be glad you did when something goes wrong, and it will. The odd thing about making changes on Linux (or any other system) is that the change itself might work perfectly when you make it, but in a day or two, your system performance suffers. Before you do anything else, check your change documentation to see if any recent changes were made to the system. Changes include software patches, updates of any kind, hardware replacements or upgrades, driver updates, firmware updates, code pushes, new software installs, and configuration changes.
When you check your change documentation, compare recent changes with the problems you're having. After making the usual system checks, you should reverse your changes one at a time to see which one can be traced to your performance root cause. Sometimes, you'll find that certain update "clusters" are not compatible, or must be installed or applied in a particular order. Always check your vendor documentation to see if this is the case.
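If your change documentation has gaps, the filesystem can fill in some of them. The following is a minimal sketch, not a replacement for real change tracking: the function name recently_changed is my own, and it assumes that configuration lives under a directory such as /etc and that modification times haven't been reset by a restore or a sync.

```shell
# List regular files under a directory that were modified within the
# last N days (default 7), sorted by path.
# Usage: recently_changed DIR [DAYS]
recently_changed() {
  find "$1" -type f -mtime "-${2:-7}" 2>/dev/null | sort
}

# Example:
#   recently_changed /etc 3
```

Anything this turns up that isn't in your change log is worth a closer look before you start reversing the documented changes.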
Update, update, update
You can avoid many of the performance problems associated with software and hardware bugs by keeping everything updated, especially when it comes to server-side software (rather than client-side, like a web browser). Client-side should be updated too, of course, but that's a different discussion.
Yes, it's a full-time job to keep all of your systems updated. There's always something that needs to be updated on a system: BIOS, firmware, drivers, the operating system, applications, agents, security software, databases, backup software, and so on. This task never ends. Decide how often you need to update, or follow your organization's patching policy, and then plan, schedule, and apply those updates. At one of my jobs, we patched once a week. Doing so was a pain. It required us to pull an all-nighter once a week, which gets old fast. There's no avoiding regular updates, though. You have to update to be sure that your systems are secure and have the latest stability patches.
If your systems are up-to-date and there are no newer updates available, you can generally rule out updates and patches as a performance problem root cause.
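As a rough sketch of how you might check this on a dnf-based distribution (an assumption on my part; other package managers have their own equivalents), dnf check-update exits with status 0 when nothing is pending, 100 when updates are available, and 1 on error. The helper name describe_update_status is hypothetical.

```shell
# Translate the exit status of `dnf check-update` into a short message.
# Statuses: 0 = nothing to update, 100 = updates available, 1 = error.
describe_update_status() {
  case "$1" in
    0)   echo "up to date" ;;
    100) echo "updates available" ;;
    *)   echo "check failed (exit status $1)" ;;
  esac
}

# Example on a dnf-based host:
#   dnf check-update --quiet >/dev/null 2>&1
#   describe_update_status "$?"
```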
Hardware limitations and failures
In my experience, everyone (programmers, network administrators, management, and vendors) wants to blame the infrastructure for all performance problems. They all collectively believe that infrastructure is the weakest link and that's where the breaks are most likely to occur, so you'll have to prove that it isn't your hardware causing the problem before anyone will take action. I agree to a point, but it's a little annoying when that's the first assumption, rather than one that's investigated simultaneously with other potential causes.
There are generally four hardware components that can either fail or reach limitations that can cause you problems: CPU, network, memory, and disk. There are other components that can fail too, such as power supplies, but these "big four" are the most common culprits and the first places you should look when you have a problem.
These days most server systems have multi-core, multi-processor CPU banks. If you have a CPU problem, it might be caused by a defect in the CPU itself. Finding the specific CPU that's giving you a problem is beyond the scope of this article. If you suspect an actual CPU failure or anomaly, call your system vendor for advice. It's likely that they have diagnostic routines you can run that will identify the problem CPU. Beyond that, they'll send out a technician to replace one CPU or all of them.
So, other than a flat-out CPU failure, what do you look for when you suspect a CPU problem? Check top to see if any processes are overloading your CPU(s). To sort top by CPU use, run top and then type P (Shift+P). Look at the processes burning up your CPU cycles. Are the ones at the top of the list system processes or applications? If they're system processes, check your uptime. If you reboot regularly, the uptime shouldn't be extremely high.
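When you want the same information in a script or a report rather than an interactive screen, ps can stand in for top. This is a small sketch under a few assumptions: the function name cpu_hogs and the 50 percent default are mine, and it expects the three-column output of ps -eo pid,pcpu,comm on standard input. Note that ps reports CPU use averaged over a process's lifetime, so its numbers won't match top's moment-to-moment figures exactly.

```shell
# Read `ps -eo pid,pcpu,comm` output on stdin and print every process
# whose %CPU is at or above the threshold (default 50), busiest first.
# Output format: %CPU PID COMMAND
cpu_hogs() {
  awk -v limit="${1:-50}" 'NR > 1 && $2 + 0 >= limit + 0 { print $2, $1, $3 }' |
    sort -rn
}

# Example:
#   ps -eo pid,pcpu,comm | cpu_hogs 80
```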
If you find a particular application using an abnormal amount of CPU cycles, restart the application to see if the problem persists. If the process is system-related, try restarting the process if possible. If not, reboot the system. Yes, reboot the system.
Troubleshooting bonus (rebooting)
Yes, you need to reboot at least once a month. I know there are plenty of arguments against this practice, but a good reboot solves a lot of problems, rules out a lot of issues, and helps you diagnose hardware problems with minimal effort. Powering off the system occasionally is also good practice, because bringing a system up from a cold boot can identify a lot of hardware problems that might hide on a running system. You'll also be able to narrow down issues if the performance problem persists after a reboot.
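To keep yourself honest about that monthly reboot, you can check uptime in a script. The sketch below assumes a Linux host where /proc/uptime holds the seconds since boot as its first field. The function names and the 30-day cutoff are my own illustration of "once a month."

```shell
# Print whole days of uptime. With no argument, read seconds since boot
# from /proc/uptime; a number of seconds can be passed in instead.
uptime_days() {
  secs="${1:-$(cut -d' ' -f1 /proc/uptime)}"
  awk -v s="$secs" 'BEGIN { printf "%d\n", s / 86400 }'
}

# Succeed (exit status 0) when the host has been up more than 30 days.
reboot_overdue() {
  [ "$(uptime_days "$@")" -gt 30 ]
}

# Example:
#   reboot_overdue && echo "$(hostname): schedule a reboot"
```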
The next most obvious place to look when troubleshooting performance is memory use. Memory problems can manifest themselves in different ways that obscure the fact that memory is indeed the problem. If you find that during the course of a day your system's memory is drained off, the first thing to check is your logging. I know it sounds crazy, but capturing logs almost cost a company I used to work for millions of dollars. I noticed in the performance reports that our cluster system's memory was being drained off during the day. There were many gigabytes of memory available, so this problem shouldn't have been happening. Additionally, performance got worse as the day wore on. Every night at midnight, everything would come back. What happened at midnight, you ask? Log rotation. Apparently, someone had turned on debugging for logs, which meant that tens of gigabytes per day were being collected, backed up, and stored unnecessarily. And, it was draining our memory. Once discovered and fixed, the performance came back in full force and alleviated the need to spend millions of dollars on additional systems for this huge cluster.
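A quick way to spot runaway logging like that is to list the largest files under your log directory. This is only a sketch: largest_files is a hypothetical helper, /var/log is an assumed location, and it relies on GNU find's -printf option.

```shell
# Print the N largest regular files under a directory (default 10) as
# "bytes path", biggest first. Requires GNU find for -printf.
largest_files() {
  find "$1" -type f -printf '%s %p\n' 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example:
#   largest_files /var/log 5
```

If one log dwarfs the others or grows noticeably between two runs, check whether someone left debugging turned on.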
You should also look at swap space if you suspect a memory problem. Use the free -m command to check physical and virtual (swap) memory usage. My system is idle in this output, so the result isn't dramatic:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:            821         200         288          10         333         484
Swap:             0           0           0
If you're using a lot of swap, your system might be doing what *nix administrators call "thrashing." Thrashing, contrary to what skateboarders do, is a bad thing for us. You don't want your system to thrash. Thrashing can also appear as a disk problem if it's severe enough. If your system is so busy paging in and out that it affects disk performance, you need to act immediately by restarting the offending process. Now, don't get me wrong. Swap is set up and configured for paging things to disk, but when it causes a performance problem, this issue needs to be fixed.
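If you'd rather have a one-line summary you can log or alert on, the same numbers are available in /proc/meminfo. The sketch below assumes a kernel recent enough to report MemAvailable. The function name mem_summary and the output format are my own.

```shell
# Summarize memory pressure from /proc/meminfo-style input on stdin.
# Prints the percentage of RAM still available and the percentage of
# swap in use, or "swap=none" when no swap is configured.
mem_summary() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { stotal = $2 }
    $1 == "SwapFree:"     { sfree = $2 }
    END {
      if (total > 0)  printf "available=%d%% ", avail * 100 / total
      if (stotal > 0) printf "swap_used=%d%%\n", (stotal - sfree) * 100 / stotal
      else            printf "swap=none\n"
    }'
}

# Example:
#   mem_summary < /proc/meminfo
```

A high swap_used figure alone doesn't prove thrashing. Sustained non-zero values in the si and so columns of vmstat 1 are a better sign that the system is actively paging.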
A lot of modern systems have so much memory that disk-based swap isn't used at all. Some administrators feel that it's a waste of disk space. For me, whether I configure swap depends on the system's purpose and the amount of RAM it has. Swap considerations are really for another article, but I'll say that how you handle swap is up to you. I don't think the old rule of 1.5x RAM is a good formula anymore. Think about it. If your system has 128GB RAM, that would mean that you configure 192GB of disk for swap space. Ridiculous. I might set up 16GB at most for that system if I configured swap at all.
In rare cases, your RAM can be bad, or go bad. I've had it happen. You should also be careful as to which type of RAM you purchase for a system if you're upgrading. Match what you have or replace it all if you can't match it. Don't mix speeds, caches, or brands. Also, use the recommended RAM type for your system. Using off-brands or mismatched RAM is a disaster waiting to happen.
Finally, errant programs can cause memory issues. Java-based programs have historically caused me the most grief. Some Java programmers don't handle garbage collection or memory release correctly, and problems ensue when loads are high or when certain calls are made. I always start by restarting the process. My next option is to check top for the amount of memory consumed by the program (type M, Shift+M, inside top to sort by memory). If all my checking and process restarting doesn't work, I reboot the system. If the problem begins again, I'll go to the programmer and complain and provide my reports.
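Reports are more convincing with numbers, so here's a sketch of one way to collect them. It assumes a Linux /proc filesystem, where each process's status file has a VmRSS line giving resident memory in kB. The function names, sample count, and interval are hypothetical defaults.

```shell
# Print the resident set size (VmRSS, in kB) from /proc/<pid>/status
# input on stdin.
rss_kb() {
  awk '$1 == "VmRSS:" { print $2 }'
}

# Sample one process repeatedly to see whether its memory keeps growing.
# Usage: watch_rss PID [SAMPLES] [INTERVAL_SECONDS]
watch_rss() {
  i=0
  while [ "$i" -lt "${2:-5}" ]; do
    printf '%s %s kB\n' "$(date +%T)" "$(rss_kb < "/proc/$1/status")"
    i=$((i + 1))
    sleep "${3:-60}"
  done
}

# Example:
#   watch_rss 4242 10 30
```

A figure that climbs steadily under a constant load and never comes back down is the kind of evidence worth handing to the programmer.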
Disks fail. That is a strong but true assertion. Even SSDs fail at some point, so prepare for disk failure. Remember that RAID is not the same as a backup, and that disks and partitions fill up, which degrades their performance. If you suspect a disk is your performance killer, the first thing to look at is available space with a quick df -h:
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        397M     0  397M   0% /dev
tmpfs           411M     0  411M   0% /dev/shm
tmpfs           411M   11M  400M   3% /run
tmpfs           411M     0  411M   0% /sys/fs/cgroup
/dev/sda2        16G  1.8G   14G  12% /
/dev/sda1       495M  152M  344M  31% /boot
tmpfs            83M     0   83M   0% /run/user/1000
You can see above that there are no full or almost full filesystems on my server.
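On a server with dozens of mounts, it's easier to let a script pick out the full ones. The following sketch assumes the POSIX output format of df -P, where the fifth column is the percentage used and the sixth is the mount point. The function name full_filesystems and the 90 percent default are my own choices.

```shell
# Read `df -P` output on stdin and print every filesystem whose use is
# at or above the threshold (default 90) as "use% mountpoint".
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)
    if (use + 0 >= limit + 0) print $5, $6
  }'
}

# Example:
#   df -P | full_filesystems 85
```

Run against the output above, this prints nothing, which is what you want to see.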
The next item to check is whether you have a failed disk. I can't simulate a disk failure, but some server systems let you know when they have failed disks. For example, some of my old servers showed an amber light rather than a green light when something was wrong. Pay attention to your hardware indicators. I also had servers that had a small LCD screen that notified me of failures and errors. These tools were helpful when the operating system didn't notify me that there was a problem.
A failed disk impacts performance, regardless of configuration. Redundant RAID configurations don't guarantee performance should a member disk fail. Instead, they protect your data through redundancy. In other words, your data is intact, but your users and customers will be unhappy due to sluggish performance. Expect performance issues when a member disk fails.
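If you happen to use Linux software RAID (md), which is an assumption on my part since hardware controllers report status through their own tools, /proc/mdstat shows each array's member map. A healthy two-disk mirror reads [UU], and an underscore such as [U_] marks a missing member. The function name degraded_arrays is hypothetical.

```shell
# Read /proc/mdstat-style input on stdin and print the name of every md
# array whose member map (for example [U_]) shows a missing disk.
degraded_arrays() {
  awk '
    /^md[^ ]* :/        { name = $1 }
    /\[[U_]*_[U_]*\]/   { print name }
  '
}

# Example:
#   degraded_arrays < /proc/mdstat
```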
If you have a sluggish system, check the physical server and all of its components, alerts, and messages. This step is for those who have access to physical servers. So many system administrators have to deal with remote or hosted systems and therefore do not have this kind of access.
Network problems due to hardware are somewhat rare, but they do happen. A jabbering NIC, a bad cable, or a failed switch or switch port can be the source of much frustration for a system administrator. And, if you add in switch port or network misconfiguration on the host itself, you now have a recipe for a lot of hair-pulling. Sometimes it's hard to find the source of a network problem because the issue can be local, at the switch, or somewhere beyond the switch. You have to look at each level separately to find the problem.
Check your other hosts for comparison. Is the problem localized to a single host, is it confined to a single group, or is it system-wide? This check will help you identify whether the problem is local, if it's confined to a single switch, if it affects an entire rack or row, or if the problem is more widespread.
Check your local network configurations. Check changelogs to see if something has recently changed. Next, do a physical check of your NIC. Do the lights look correct to you? Does the cable look good and does the plug appear undamaged? Does the wire configuration look correct? Check the entire length of the cable for physical damage, if possible. Check the physical switch and the cable terminator in the switch for physical defects.
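Alongside the physical inspection, the kernel's per-interface counters can hint at a bad cable, port, or NIC. This sketch reads them from sysfs on a Linux host. The function name nic_errors is my own, eth0 is only an example interface name, and SYSFS_NET is a hypothetical override that exists just so the function can be pointed at a test directory.

```shell
# Print error and drop counters for one network interface from sysfs.
# SYSFS_NET (hypothetical override) defaults to /sys/class/net.
nic_errors() {
  stats="${SYSFS_NET:-/sys/class/net}/$1/statistics"
  for counter in rx_errors tx_errors rx_dropped tx_dropped; do
    printf '%s=%s\n' "$counter" "$(cat "$stats/$counter")"
  done
}

# Example:
#   nic_errors eth0
```

The absolute numbers matter less than the trend. Run it twice a few minutes apart, and if the error counters keep climbing, you have something concrete to show the network admin.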
Either check the switch configuration yourself or ask a network admin to do so. Physically check the switch location or refer to your documentation to find the correct port to report to the network admin. If the configuration looks good, have the network admin perform a quick reset on the port. Also, ask the admin about the last switch update and last reboot date.
Depending on your job and where you work, you might not have control or visibility beyond your switch. Work with network admins, ISPs, or hosting providers to further locate a network performance problem. Personal experience tells me that unless a network problem is widespread, network admins want proof of what you've checked that led you to blame the network. For this reason, I placed network troubleshooting last in the list. I can't count the number of times I've heard those frustrating words: "It's not the network, man. It must be infrastructure." And then a dial tone.
There are no shortcuts to gaining troubleshooting knowledge. You can learn and be prepared, but unfortunately, experience is the best teacher because you have to experience failures before you get a real feel for troubleshooting in the trenches. Even simulated failures don't give you the same experience as a real failure, with real users asking when things will be fixed, and real managers looking at you like it's your fault that the company is losing money, and irked that your keyboard isn't making any noise.
Troubleshooting problems isn't the fun part of being a sysadmin, but it is a necessary part. In fact, I'm not sure if there are any fun parts, and they're all necessary. Being a sysadmin is stressful, and troubleshooting problems is a large part of that stress. I've given you pointers in an attempt to lower that stress, but it's still up to you to gain experience and confidence in putting them to use.