Sysadmin careers: The 5 steps of problem solving
In the previous article, 4 problem-solving strategies for sysadmins, we looked at methods of reasoning about problems that relate to computer hardware and software. We saw that problem-solving approaches like MAPs and other symptom-fix methods have significant limitations. It is also clear that proprietary, closed software systems do not lend themselves to reasoned approaches, while open systems like Linux and open source software, in general, are intimately knowable and thus tractable to reason and logic.
One of the best things that my mentors helped me with was the formulation of a defined reasoning process that I could always use for solving problems of nearly any type. That process, the algorithm, is very closely related to the scientific method and is what we will cover in this article.
During the research for my book, The Linux Philosophy for SysAdmins, I discovered a short article titled, How the Scientific Method Works, that describes the scientific method using a diagram very much like the one I have created for my "five steps of problem-solving."
Solving problems of any kind is art, science, and, some would say, perhaps a bit of magic, too. Solving technical problems, such as those that occur with computers, requires a good deal of specialized knowledge as well. Any approach to solving problems of any nature, including problems with Linux, must include more than just a list of symptoms and the steps necessary to fix or circumvent the problems that caused the symptoms. This so-called "symptom-fix" approach looks good on paper to many managers, but it really sucks in practice. The best way to approach problem-solving is with a large base of knowledge of the subject and a strong methodology.
The five steps of problem-solving
There are five basic steps that are involved in the problem-solving process, as shown in Figure 1. This algorithm is very similar to that of the scientific method but is specifically intended for solving technical problems.
You probably already follow these steps when you troubleshoot a problem but do not even realize it. These steps are universal and apply to solving most any type of problem, not just problems with computers or Linux. I used these steps for years with various types of problems without realizing it. Having them codified for me made me much more effective at solving problems because, when I became stuck, I could review the steps I had taken, verify where I was in the process, and restart at any appropriate step.
You may have heard a couple of other terms applied to problem-solving in the past. The first three steps of this process are also known as problem determination, that is, finding the root cause of the problem. The last two steps are problem resolution, which is actually fixing the problem.
The next sections cover each of these five steps in more detail.
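As a rough illustration, the five steps and the two phases described above can be written down as a small Python sketch. `Step`, `phase`, and `restart_points` are hypothetical names invented for this example, not part of any library. The sketch only encodes what the text states: the order of the steps, the split between problem determination and problem resolution, and the idea that you can restart at any earlier step when you get stuck.

```python
from enum import IntEnum


class Step(IntEnum):
    """The five steps, in the order the article lists them."""
    KNOWLEDGE = 1
    OBSERVATION = 2
    REASONING = 3
    ACTION = 4
    TEST = 5


def phase(step: Step) -> str:
    """Return the phase a step belongs to.

    The first three steps are problem determination;
    the last two are problem resolution.
    """
    return "determination" if step <= Step.REASONING else "resolution"


def restart_points(current: Step) -> list[Step]:
    """Steps you could restart from: the current one or any earlier one."""
    return [s for s in Step if s <= current]


if __name__ == "__main__":
    # Print each step with the phase it belongs to.
    for step in Step:
        print(f"{step.value}. {step.name.title():<12} {phase(step)}")
```

Writing the steps down this way is only a memory aid. The real work is in what you do at each step.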
Knowledge
Knowledge of the subject in which you are attempting to solve a problem is the first step. All of the articles I have seen about the scientific method seem to assume this as a prerequisite. However, the acquisition of knowledge is an ongoing process, driven by curiosity and augmented by the knowledge gained from using the scientific method to explore and learn more through experimentation. You must be knowledgeable about Linux at the very least, and furthermore, you must be knowledgeable about the other factors that can interact with and affect Linux. Hardware, the network, and even environmental factors like temperature, humidity, and the electrical environment in which the Linux system operates can affect it.
Knowledge can be gained by reading books and web sites about Linux and those other topics. You can attend classes, seminars, and conferences. You can also set up a number of physical or virtual Linux computers in a networked environment. And, of course, there is much to learn through interaction with other knowledgeable people. You learn when you resolve a problem or discover a new cause for a particular type of problem, even when an attempt to fix a problem results in a temporary failure.
Classes are also valuable in providing us with new information. My personal preference is to play—uh, experiment—with Linux or a particular piece such as networking, name services, DHCP, Chrony, and more. Then I take a class or two to help me internalize the knowledge I have gained.
Remember, "without knowledge, resistance is futile," to paraphrase the Borg. Knowledge is power.
Observation
The second step in solving the problem is to observe its symptoms. It is important to take note of all of the problem symptoms, but also to observe what is working properly. This is not the time to try to fix the problem; merely observe. Another important part of observation is to ask yourself questions about what you see and what you do not see. Aside from the questions you need to ask that are specific to the problem, there are some general questions to ask:
- Is this problem caused by hardware, Linux, application software, or perhaps by lack of user knowledge or training?
- Is this problem similar to others I have seen?
- Is there an error message?
- Are there any log entries pertaining to the problem?
- What was taking place on the computer just before the error occurred?
- What did I expect to happen if the error had not occurred?
- Has anything about the system hardware or software changed recently?
Other questions will reveal themselves as you work to answer these. The important thing to remember here is not these specific questions, but rather to gather as much information as possible. This increases the knowledge you have about this specific problem instance and aids in finding the solution.
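Two of the questions above, whether there are log entries pertaining to the problem and what was taking place just before the error, lend themselves to a small script. The following is a minimal sketch, assuming log lines that begin with an ISO 8601 timestamp followed by a space. The sample log text, the `entries_near` function, and the keyword list are all made up for illustration; real log formats vary by service and distribution. Note that it returns every entry in the window, flagged or not, because what is working properly is also part of the observation.

```python
import re
from datetime import datetime, timedelta

# Made-up sample data for illustration only; not output from any real program.
SAMPLE_LOG = """\
2024-05-01T10:14:02 app: started
2024-05-01T10:20:00 app: config reloaded
2024-05-01T10:29:40 app: ERROR disk write failed
2024-05-01T11:45:00 app: job finished
"""

# Words that often mark trouble; an assumption, adjust to your own logs.
KEYWORDS = re.compile(r"error|fail|warn", re.IGNORECASE)


def entries_near(log_text, incident, window_minutes=15):
    """Return (timestamp, message, flagged) for each line logged in the
    window_minutes leading up to the incident time."""
    start = incident - timedelta(minutes=window_minutes)
    found = []
    for line in log_text.splitlines():
        stamp, _, message = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with an ISO timestamp
        if start <= when <= incident:
            found.append((when, message, bool(KEYWORDS.search(message))))
    return found


if __name__ == "__main__":
    incident_time = datetime(2024, 5, 1, 10, 30, 30)
    for when, message, flagged in entries_near(SAMPLE_LOG, incident_time):
        marker = "!!" if flagged else "  "
        print(marker, when.isoformat(), message)
```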
As you gather data, never assume that the information obtained from someone else is correct. Observe everything yourself. This can be a major problem if you are working with someone who is at a remote location. Careful questioning is essential, and tools that allow remote access to the system in question are extremely helpful when attempting to confirm the information that you are given.
Tip: When questioning a person at a remote site, never ask leading questions; they will try to be helpful by answering with what they think you want to hear.
At other times the answers you receive will depend upon how much or how little knowledge the person has of Linux and computers in general. When a person knows, or thinks they know, about computers, the answers you receive may contain assumptions that can be difficult to disprove. Rather than ask, "Did you check…," it is better to have the other person actually perform the task required to check the item. And rather than telling the person what they should see, simply have the user explain or describe to you what they do see. Again, remote access to the machine can allow you to confirm the information you are given.
The best problem solvers are those who never take anything for granted. They never assume that the information they have is 100% accurate or complete. When the information you have seems to contradict itself or the symptoms, start over from the beginning as if you have no information at all.
In almost all of the jobs I have had in the computer business, we have always tried to help each other out, and this was true when I was at IBM. I have always been very good at fixing things, and there were times when I would show up to support another customer engineer (CE) who was having a particularly difficult time finding the source of a problem. The first thing I would do is assess the situation. I would ask the primary CE what they had done so far to locate the problem. After that, I would start over from the beginning. I always wanted to see the results myself. Many times that paid off because I would observe something that others had missed. In one very strange incident, I fixed a large computer by sitting on it.
This took place while I was an IBM CE in Lima, Ohio, in about 1976. Two of us were installing an IBM System/3, which was smaller than an IBM mainframe, like a 360 or 370, but still large enough to need a room of its own, high voltage power, and significant air cooling.
We had assembled the main CPU and had started to attach the IBM 1403 line printer controller when we ran into the problem. The printer controller was contained in a slightly lower than desktop-height unit to the left of the CPU. That nice large work surface is just the right height to sit on.
We had just bolted the printer controller to the frame of the CPU and were doing one of the very many checks built into the installation instructions. We connected the leads of an Ohm meter between the frame of the CPU and a specific terminal on the power supply of the printer controller. The result was supposed to be an open circuit, that is, infinite resistance, which would indicate that the hot leads of the power supply were not shorted to the frame. In this case, there was a short—zero resistance—which was bad.
There would not have been a spectacular display of noise and fireworks like you see on TV, but it would have been a problem as it would prevent the computer from powering up. Best to catch this while it was still being assembled rather than later. After an hour of searching, we still had not found the cause. We called the support center for the System/3 in Boca Raton, Florida, and were guided through several further problem determination steps that were unsuccessful.
A bit frustrated, I sat on the printer control unit. Out of the corner of my eye, I saw the needle on the Ohm meter swing to indicate an open circuit. I mentioned this to the other CE and to Vern in Boca Raton, who would later be one of my own mentors when I went down there for a few years as a Course Development Representative.
We removed the top, where I had perched, from the controller, and with a bit of luck, found that one of the bolts holding the top to the frame of the printer controller had come loose and fallen into the power supply and caused the short. When I sat on the top of the controller, the frame moved just enough to cause the bolt to no longer make the contact required to produce the short. Removing that loose bolt from the power supply fixed the problem.
Vern, who was responsible for the System/3 support at that time, made some changes to the instructions to cover this problem in case it happened again. He also worked with the manufacturing people to ensure that it did not happen again, putting in place a check to ensure that the bolt was properly tightened during the build process.
The thing to remember is to really observe what is going on in all parts of the system. Pay attention to everything, and don't ignore the slightest clue. Sometimes watching top, htop, glances, or one of the other utilities used to monitor the internal functioning of the kernel or the network can provide a momentary glimpse of something—a clue—that gets us started in the right direction.
And sometimes it takes just a bit of luck, like sitting on the printer control unit.
Reasoning
Use reasoning skills to combine the information from your observations of the symptoms with your knowledge to determine a probable cause for the problem. We discussed the different types of reasoning in some detail in my previous article Sysadmin careers: 4 problem-solving strategies. The process of reasoning through your observations of the problem, your knowledge, and your past experience is where art and science combine to produce inspiration, intuition, or some other mystical mental insight into the root cause of the problem.
In some cases, this is a fairly easy process. You can see an error code and look up its meaning from the sources available to you. Or perhaps you observe a symptom that is familiar, and you know what steps might resolve it. You can then apply the vast knowledge you have gained by reading about Linux and the documentation provided with Linux to reason your way to the cause of the problem.
In other cases, reasoning can be a very difficult and lengthy part of the problem determination process. The hardest cases involve symptoms you have never seen or a problem that is not resolved by any of the methods you have used. These are the ones that require more work and especially more reasoning.
It helps to remember that the symptom is not the problem. The problem causes the symptom. You want to discover the true problem, not just the symptom.
Action
Now is the time to perform the appropriate repair action. This is usually the simple part. The hard part is what came before—figuring out what to do. After you know the cause of the problem, it is easy to determine the correct repair action to take. The specific action you take will depend upon the cause(s) of the problem.
Remember, we are fixing the root cause, not just trying to get rid of or cover up the symptom.
Make only one change at a time. If there are several actions that can be taken that might correct the cause of a problem, only make the one change or take the one action that is most likely to resolve the root cause. The selection of the corrective action with the highest probability of fixing the problem is what you are trying to do here. Whether it is your own experience telling you which action to take, or the experiences of others, move down the list from highest to lowest probability, one action at a time. Test the results after each action.
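The one-change-at-a-time rule can be sketched in a few lines of Python. This is a minimal illustration, not a real repair tool: `Candidate`, `fix_one_at_a_time`, the likelihood numbers, and the simulated `system` dictionary are all hypothetical, and the likelihoods stand in for your own judgment or the experience of others.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    name: str
    likelihood: float          # your own estimate, 0.0 to 1.0
    apply: Callable[[], None]  # makes exactly one change


def fix_one_at_a_time(candidates, test) -> Optional[str]:
    """Apply candidates from most to least likely, testing after each one.

    Returns the name of the action after which the test passed,
    or None if no candidate fixed the problem.
    """
    ordered = sorted(candidates, key=lambda c: c.likelihood, reverse=True)
    for candidate in ordered:
        candidate.apply()   # one change only
        if test():          # test before trying anything else
            return candidate.name
    return None


if __name__ == "__main__":
    # A simulated system; in real life each action would change a real host.
    system = {"service_running": False, "cache_cleared": False}
    candidates = [
        Candidate("clear cache", 0.2, lambda: system.update(cache_cleared=True)),
        Candidate("restart service", 0.7, lambda: system.update(service_running=True)),
    ]
    print(fix_one_at_a_time(candidates, lambda: system["service_running"]))
    # prints: restart service
```

Because the test runs after every single action, the function can name the action that preceded the passing test, and the less likely candidates are never applied at all.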
Test
After taking some overt repair action, the repair should be tested. This usually means performing the task that failed in the first place, but it could also be a single, simple command that illustrates the problem.
We make a single change, taking one potential corrective action, and then test the results of that action. This is the only way we can be certain which corrective action fixed the problem. If we were to take several corrective actions and then test one time, there would be no way to know which action was responsible for fixing the problem. This is especially important if we want to walk back the ineffective changes we made after finding the solution.
If the repair action has not been successful, you should begin the procedure over again. If there are additional corrective actions you can take, return to the action step and continue doing so until you have run out of possibilities or have learned to a certainty that you are on the wrong track.
Be sure to check the original observed symptoms when testing. It is possible that they have changed due to the action you have taken, and you need to be aware of this in order to make informed decisions during the next iteration of the process. Even if the problem has not been resolved, the altered symptom could be very valuable in determining how to proceed.
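One simple way to keep track of changed symptoms is to record them before and after each action and compare the two sets. The sketch below assumes you note each symptom as a short text string; the `compare_symptoms` function and the sample symptoms are hypothetical and exist only for this example.

```python
def compare_symptoms(before: set[str], after: set[str]) -> dict[str, set[str]]:
    """Classify symptoms by comparing the sets observed before and after an action."""
    return {
        "resolved": before - after,    # seen before, gone now
        "persisting": before & after,  # still present
        "new": after - before,         # appeared after the action
    }


if __name__ == "__main__":
    before = {"login fails", "disk full warning"}
    after = {"disk full warning", "slow login"}
    for kind, symptoms in compare_symptoms(before, after).items():
        print(f"{kind}: {sorted(symptoms)}")
```

A non-empty "new" set is exactly the kind of altered symptom worth carrying into the next iteration, even when the original problem persists.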
Iteration
As you work through a problem, it will be necessary to iterate through at least some of the steps. If, for example, performing a given corrective action does not resolve the problem, you may need to try another action that has also been known to resolve the problem in the past. Figure 1 shows that you may need to iterate to any previous step in order to continue. It may be necessary to go back to the observation step and gather more information about the problem. I have also found that sometimes it was a good idea to go back to the knowledge step and gather more basic knowledge. This includes reading or rereading manuals and man pages, using search engines, whatever is necessary to gain the knowledge required to continue past the point where I was blocked.
Be flexible, and don't hesitate to step back and start over if nothing else produces some forward progress.
Concluding thoughts
In this article, we have looked at one way to approach fixing problems that applies to many non-technical things as well as to computer hardware and software. What we have discussed here is one algorithm for problem-solving that can be used with the reasoning methodologies we explored in the first article. The flexibility of this particular combination is extremely powerful.
I am not telling you that you "should" use this method. However, if you go all Zen and analyze your own method for solving problems, you will very likely find that it is already very close to the algorithm I describe here. I suggest that you do take the time to analyze your own methods. I think you will find it a productive use of time that will be quite enlightening.
Resources
- Skills You Need website, Critical Thinking Skills
- Wikipedia, Reason
- Butte College, Deductive, Inductive, and Abductive Reasoning
- Harris, William, How the Scientific Method Works
- Both, David, The Linux Philosophy for SysAdmins, Ch. 23
David Both
David Both is an open source software and GNU/Linux advocate, trainer, writer, and speaker who lives in Raleigh, NC. He is a strong proponent of and evangelist for the "Linux Philosophy." David has been in the IT industry for over 50 years.