4 problem solving strategies for sysadmins
As a sysadmin, part of my responsibility in many of my jobs has been to assist with hiring new employees. I participated in plenty of technical interviews with people who had passed many certification tests and who had fine resumes. I also participated in many interviews in which we were looking for Linux skills, but very few of those applicants had certifications. This was at a time when Microsoft certifications were the big thing but still during the early days of Linux in the data center. So very few of the Linux applicants were yet certified.
We usually started these interviews with questions designed to determine the limits of the applicant's knowledge. Then we would get into the more interesting questions, ones that would test their ability to reason through a problem to find a solution. I made some very interesting observations. Few of the Windows certificate holders could reason their way through the scenarios we presented, while a very large percentage of the Linux applicants were able to do so. The standard approach of the Windows-certified applicants was to reboot the computer without doing any real problem analysis. After that, their normal methodology was to follow a scripted sequence of actions that would, hopefully, resolve the problem, based on the probabilities that specific symptoms would be fixed by specific actions. There was never any attempt to understand why specific actions should be taken or to locate the root cause of the problem.
I call this the "symptom—fix" method. It is basically a MAP—a series of choices—that can be followed with little knowledge of how the underlying systems work.
I think that result was due in part to the fact that obtaining the Windows certificates relied upon memorization rather than actual hands-on experience, and the fact that Windows is a closed system that prevents sysadmins from truly understanding how it works. I think that the Linux applicants did so much better because Linux is open on multiple levels, and logic and reason can be used to identify and resolve any problem. Any sysadmin who has been using Linux for some time has had to learn about the architecture of Linux and has had a decent amount of experience with the application of knowledge, logic, and reason to the solution of problems.
The vast majority of the applicants who were able to reason through the trouble scenarios we set for them tended to have significant experience with Unix and Linux. In my opinion, this is because Unix and Linux users and sysadmins think about solving problems differently from those who use more restrictive operating systems. Using and administering Unix and Linux systems require a higher level of reasoning skills. The unconstrained nature of Linux also invites us to learn and improve those skills. Armed with a deep knowledge of a powerful operating system, a thorough understanding of the available tools, and well-developed critical thinking skills, Linux sysadmins are capable of resolving problems quickly and with great freedom in their choice and use of tools.
Please do not misunderstand me. There are many very smart sysadmins who work with Windows and other closed and proprietary operating systems. All of these very smart sysadmins also use critical thinking and reasoning to solve problems. The real issue is that the closed nature of the systems on which they work severely restricts the possibilities that are available to them.
IBM has used MAPs—Maintenance Analysis Procedures—for many years to assist its employees and customers in problem-solving for both computer hardware and software. I frequently used the MAPs while I was a Customer Engineer (CE) at IBM in the late 1970s. These were well-designed MAPs that, when used by properly trained CEs, were able to identify the problem and suggest a few possible fixes in order by highest probability.
If you want to see a modern version of IBM's MAPs, check out the IBM Support Knowledge Center.
If the MAPs did not work for a particular problem, we had to use our knowledge and reasoning skills to dig down even deeper. We could at least be sure that the MAPs had gotten us into the right area.
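To make the structure of a MAP concrete, here is a minimal sketch of a symptom-fix table in Python. The symptoms, fixes, and probabilities are invented for illustration and do not come from any real IBM MAP; the point is only that the lookup returns fixes ordered by probability and has nothing to offer when the symptom is not in the table.

```python
# Hypothetical symptom-fix table. Every entry and probability here is an
# illustrative assumption, not data from a real MAP.
MAP = {
    "overheating": [
        ("clean dust from air intake", 0.3),
        ("replace case fan", 0.6),
        ("lower room temperature", 0.1),
    ],
}


def follow_map(symptom):
    """Return the candidate fixes for a symptom, most probable first.

    An empty list means the MAP has no entry for the symptom, which is
    the point where knowledge and reasoning have to take over.
    """
    candidates = MAP.get(symptom, [])
    ordered = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [fix for fix, _probability in ordered]


print(follow_map("overheating"))
print(follow_map("intermittent data corruption"))
```

Note that the function never looks at how the system works. It can only repeat what the table already contains.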
Many help desks use similar procedures—many of them with less than stellar results. How many times have you been told to reboot your computer when it did not fix the problem? Sometimes rebooting a device brings it back to a working state, but it never fixes the root cause of the problem. Rebooting is a common approach to fixing broken computers and other devices when restrictive systems are involved. Many times it is the only way that works because the restrictive and closed systems cannot be truly known in the same way as open systems, particularly operating systems like Linux.
In my opinion, the real problem is that proprietary software and hardware are unknowable. There is no way to use logic and reasoning to resolve problems on complex systems that are unknowable. Troubleshooting unknowable things must rely on MAPs because there is no other way.
One of the causes of this is that many devices cannot be repaired at all. For example, my former ISP changed my router out at least four times because they started failing in one way or another. Besides, it is faster and far less expensive to swap out a cheap bit of hardware than it is to open it up and figure out what is wrong with it—and most of the time, these are single-board devices, so there is really nothing to replace except that board.
One cannot use reason for unknowable things. Only knowable systems can lend themselves to logic and reason.
Critical thinking is a key component of what makes Linux and Unix sysadmins so good at what we do. It gives us the ability to look at the symptoms of the problem, to determine what is important and what is not, to connect those symptoms to previous experiences or knowledge we have, and to use that to determine one or more possible root causes of the problem.
One common ability that all sysadmins use to solve problems is reasoning. After our critical thinking has enabled us to look at the symptoms of a problem, we can use different forms of reasoning to determine some possible root causes of the presenting symptoms in order to determine the next steps.
There are four widely recognized forms of reasoning, and we sysadmins use all of them to help us resolve problems. We use inductive, deductive, abductive, and integrated reasoning to lead us to a conclusion that points to one or more possible causes for the observed symptoms. Let's briefly look at these forms of reasoning and see how they apply to problem-solving.
Deductive reasoning is the most common form of reasoning, and the one of which most of us are aware. It is used to draw conclusions about specific instances from a general rule that is itself derived from large numbers of more general observations. For example, the following syllogism illustrates deductive reasoning and its primary flaw.
General rule: Elevated temperatures in a computer are caused by the failure of a mechanical device, a fan.
Observational instance: My computer is overheating.
Conclusion: The fan in my computer has failed.
Many times, this line of deductive reasoning can be successful at resolving problems with overheating. However, the conclusion is entirely dependent upon the accuracy of both the rule and the current observation. Consider other possibilities. The ambient temperature in the computer room may be extraordinarily high, resulting in higher temperatures inside the computer. Or the heat radiator fins on the CPU may be clogged with dust, which reduces the airflow, thus reducing the efficacy of the cooling system. I can think of other possible causes as well.
There is a huge fallacy in this syllogism as in all deductive reason. The rule and the assertion must always be correct for the conclusion to be true. This fallacy does not make it wrong to use this type of reasoning, but it does inform us that we do need to be careful.
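The syllogism above can be written as a few lines of Python. This is only a sketch: the `deduce` function and the `actual_cause` value are hypothetical, chosen to show that a logically valid deduction still produces a wrong answer when the rule itself is incomplete.

```python
def deduce(rule, observation):
    """Apply a general rule to one observation.

    rule is a (premise, conclusion) pair. If the observation matches the
    premise, the conclusion follows; otherwise the rule says nothing.
    """
    premise, conclusion = rule
    return conclusion if observation == premise else None


# General rule from the syllogism: overheating is caused by a failed fan.
rule = ("computer is overheating", "a fan has failed")

conclusion = deduce(rule, "computer is overheating")

# Assumed ground truth for this illustration; the rule knows nothing about it.
actual_cause = "CPU heat sink fins are clogged with dust"

print("Deduced cause:", conclusion)
print("Deduction matches reality:", conclusion == actual_cause)
```

The logic inside `deduce` is never at fault here. The error comes entirely from the rule that was fed into it.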
Inductive reasoning flows in the other direction. It arrives at a general rule from a few observations, sometimes only a single one. This sample of inductive reasoning also shows the potential for built-in fallacies.
Observations: The failure of a fan caused my computer to overheat.
Conclusion: Computers always overheat because of fan failures.
Actually, there are more equally bad conclusions that could be drawn from this. One is that all fan failures cause computers to overheat, which is also not true. Another is that all computer fans fail. Yet another is that all computers will overheat due to fan failures. Here again, we must be careful of the conclusions we reach. In this type of inductive reasoning, we are very likely to synthesize a general rule that can lead us astray when we apply the rule as an assertion in a deductive syllogism.
Both deductive and inductive reason contain the seeds of their own failure due to the incorrect assumption that all of the evidence is available and that all of the assertions are true. Both of those types of reasoning are rigid and inflexible. Neither deductive nor inductive reasoning allows for possibility, probability, incomplete data, incorrect assertions, randomness, intuition, or creativity.
Let's explore this for a moment. First, I stipulate that, in this thought experiment, we have no experience or training of any kind to help us determine the cause of the problem.
My computer is overheating. I can feel the top of the case, and it is much hotter than it ever has been in my past experience. I turn the computer off, and after opening the case, I turn it back on for a moment. I can now see that a large case fan is not rotating. Because I have no basis on which to reason that the failing case fan is the problem, I just take a chance and replace it with a new working one. This fixes the problem, and the computer no longer overheats.
I use a bit of inductive reasoning as follows.
Observation: I fixed an overheating computer by replacing the case fan.
Rule: Replacing the case fan will fix computer overheating problems.
I have taken a single instance and generalized it into a rule. Now let's look at another problem. In this case, a different computer is overheating. Here is my deductive logic.
Rule: Replacing the case fan fixes computer overheating problems.
Assertion: The computer is overheating.
Conclusion: I should replace the case fan.
In this bit of deductive logic, I have taken the rule I created from my single experience with overheating and applied it to the second instance of a computer overheating. I have taken a general rule and applied it to a specific instance. The logic is correct. There is no fault in the logic, but replacing the case fan does not solve the problem. Why? Because in this second instance, the power supply is overheating because the air intake is clogged with dust from the environment.
The difficulty here lies first in the fact that the rule we generated from our first overheating experience was faulty because it was too general. The second problem is that this single faulty rule, combined with the rigidity imposed by this form of reasoning, forced me to a conclusion that seemed to be the only possible cause of the problem, so I stopped looking for other root causes. The logic led me to not even bother to check whether the fan was working.
Another problem with this kind of rigid logic is that there is no flexibility for other possibilities. Our ruleset was too limited to solve the problem. This raises the questions of whether we can ever have a ruleset large enough to solve all possible problems and whether any single rule can be complex enough to resolve even a single symptom all the time.
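The thought experiment above can be replayed in code. The following sketch is hypothetical throughout: `induce` turns a single observed fix into a rule, `deduce` applies that rule without question, and the `second_computer` record stands in for the machine whose power supply intake was clogged.

```python
def induce(cases):
    """Generalize observed (symptom, fix) pairs into rules.

    A single observation is enough to create a rule, which is exactly
    the weakness being illustrated.
    """
    return {symptom: fix for symptom, fix in cases}


def deduce(rules, symptom):
    """Return the one fix the ruleset prescribes for a symptom, if any."""
    return rules.get(symptom)


# First computer: replacing the case fan happened to fix the overheating.
rules = induce([("overheating", "replace case fan")])

# Second computer: same symptom, different root cause (illustrative data).
second_computer = {
    "symptom": "overheating",
    "root_cause": "power supply air intake clogged with dust",
    "fix_that_works": "clean power supply air intake",
}

prescribed = deduce(rules, second_computer["symptom"])
solved = prescribed == second_computer["fix_that_works"]

print("Prescribed fix:", prescribed)
print("Problem solved:", solved)
```

The ruleset offers exactly one answer and no way to express doubt about it, so the investigation stops at the wrong fix.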
You see where I am going with this?
Abductive reasoning is a third recognized form of reasoning, and it is more complex while being more flexible. It allows for incomplete information and probabilities that specific relationships are present. It also allows that sometimes the best way to proceed is with an educated guess based on the available information.
Abductive reasoning takes the full body of whatever data is available—our observations—and allows us to draw conclusions that point to one or more of the most likely root causes of the observed symptoms. Abductive reasoning works regardless of whether we have all of the information or not. It allows us to draw conclusions based on the best information we have on hand. It allows flexibility because any rules we have put in place from previous inductive reasoning and any conclusions that we draw from those rules using deductive reasoning are not rigidly enforced.
With abductive reasoning, we need not accept the conclusion as the only possible result, as inductive and deductive reasoning require. We are free to adjust our body of rules and to restart our reasoning process with new data, including the knowledge that the previous line of reasoning was incorrect in this case. This freedom to reason is the foundation for integrated reasoning.
I believe that sysadmins use all three of those previously discussed forms of reasoning to resolve problems. We do it so seamlessly that it is difficult to identify the specific portions of our thought processes that represent any one of the three recognized forms of reasoning. This type of combinatorial reasoning, rather than a single style, is what successful sysadmins use. It is called integrated reasoning.
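One way to picture abductive reasoning is as a ranking of candidate causes rather than a single verdict. The sketch below is an assumption-laden toy: the hypotheses and the observations each one would explain are made up for illustration, and the score is simply a count of matching observations.

```python
# Hypothetical knowledge: each candidate cause and the observations it
# would explain. These sets are illustrative assumptions.
HYPOTHESES = {
    "case fan failed": {"case is hot", "fan not spinning"},
    "air intake clogged with dust": {"case is hot", "dust on intake"},
    "room too warm": {"case is hot", "high ambient temperature"},
}


def rank_causes(observations):
    """Rank candidate causes by how many current observations they explain.

    Works with incomplete data: every cause that explains at least one
    observation stays on the list, best-supported first. Ties are broken
    alphabetically so the result is deterministic.
    """
    scored = []
    for cause, explained in HYPOTHESES.items():
        score = len(explained & observations)
        if score > 0:
            scored.append((score, cause))
    return [cause for score, cause in sorted(scored, key=lambda p: (-p[0], p[1]))]


# With only one symptom, nothing is ruled out.
print(rank_causes({"case is hot"}))

# New data reorders the list instead of invalidating the whole process.
print(rank_causes({"case is hot", "dust on intake"}))
```

Because the result is an ordered list of possibilities, the next step is to test the most likely cause while keeping the others available if that test fails.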
For example, I already have rules in place about overheating that I use to deduce possible causes. Integrated reasoning adds flexibility: it lets me analyze the problem with limited information and use additional testing to obtain more data. It also allows for the inductive process that can add more rules to the ruleset we use in our deductive process. It is also possible to disregard and discard rules that are clearly incorrect, outdated, or no longer needed.
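The interplay described here can be sketched as a small loop. Everything in this example is hypothetical: the cause names are placeholders, and the lambdas in `checks` stand in for real tests a sysadmin would run by hand. Known rules are tried first, and when none is confirmed, a newly observed cause is added to the ruleset and the diagnosis is repeated.

```python
def diagnose(symptom, rules, checks):
    """Test every known cause for a symptom.

    Returns two lists: causes confirmed by their check and causes ruled out.
    """
    confirmed, ruled_out = [], []
    for cause in rules.get(symptom, []):
        if checks[cause]():
            confirmed.append(cause)
        else:
            ruled_out.append(cause)
    return confirmed, ruled_out


# Rules from earlier experience (deductive starting point).
rules = {"overheating": ["fan failed", "power supply overheating"]}

# Stand-ins for real inspections; the boolean results are assumptions.
checks = {
    "fan failed": lambda: False,
    "power supply overheating": lambda: False,
    "hard drive has no airflow": lambda: True,
}

confirmed, ruled_out = diagnose("overheating", rules, checks)

if not confirmed:
    # Known rules are exhausted. Further observation suggests a new cause,
    # so the ruleset grows (inductive step) and the diagnosis runs again.
    rules["overheating"].append("hard drive has no airflow")
    confirmed, ruled_out = diagnose("overheating", rules, checks)

print("Confirmed:", confirmed)
print("Ruled out:", ruled_out)
```

A rule that keeps landing in `ruled_out` is also a candidate for removal, which corresponds to discarding rules that are incorrect or outdated.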
Integrated reasoning feels seamless to me, and perhaps it seems that way to you as well. I barely know that I am doing it, and there is little or no indication when I switch from deductive to abductive reasoning, for example, as I progress in the process of problem-solving. Integrated reasoning, intentional or not, conscious or not, helps me to avoid the pitfalls of "should." Not always but certainly most of the time. By understanding my own reasoning process, I can more easily recognize when I do get stuck in the "should" trap and more easily find my way out of it. For our overheating computer, this might mean a reasoning process more like this.
The computer is overheating, and I know from previous experience that there are at least two possible causes. I check over the computer and discover that none of the fans are failing and that the power supply is not overheating. Since neither of the two possible causes that I already know about is the source of the current problem, I do some further checking using both the hddtemp command and the touchy-feely method, both of which show that one hard drive is very hot.
I could replace the hard drive, but I noticed that there is no airflow around that hard drive. Further exploration reveals that there is a place to install a fan that would create a cooling flow of air over that hard drive. I install a new fan. I then check the hard drive, and its temperature is now much cooler.
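The temperature check in this step could be scripted. The sketch below assumes that hddtemp prints one line per drive in the form `device: model: NN°C`; the sample text, the drive models, and the 50-degree limit are all made up for illustration, so verify the output format and a sensible limit for your own drives before relying on it.

```python
import re

# Assumed hddtemp line format: "/dev/sdX: <model>: <temp>°C".
# Lines that do not match (for example, a sleeping drive) are skipped.
LINE = re.compile(r"^(?P<device>/dev/\S+):\s*(?P<model>.+?):\s*(?P<temp>\d+)\s*°C\s*$")


def hot_drives(hddtemp_output, limit_c=50):
    """Return {device: temperature} for drives hotter than limit_c.

    limit_c is an arbitrary illustrative threshold, not a vendor figure.
    """
    hot = {}
    for line in hddtemp_output.splitlines():
        match = LINE.match(line)
        if match and int(match["temp"]) > limit_c:
            hot[match["device"]] = int(match["temp"])
    return hot


# Invented sample output; real model strings and readings will differ.
sample = (
    "/dev/sda: EXAMPLE DISK A: 36°C\n"
    "/dev/sdb: EXAMPLE DISK B: 58°C\n"
    "/dev/sdc: EXAMPLE DISK C: drive is sleeping\n"
)

print(hot_drives(sample))
```

A reading like this only confirms the symptom. It says nothing about why the drive is hot, which is where the rest of the investigation comes in.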
In the case of this actual problem, I did not just blindly replace the overheating component and then assume that the problem was resolved. The hard drive itself was not the cause of the problem despite the observable fact that it was very hot. The lack of a fan to provide cooling airflow was a culprit, and there were other contributing factors. Even though the fan provided enough airflow to cool the drive down to near-normal levels, I was curious about why it was so hot, so I checked its usage patterns using System Activity Reporter—SAR. The SAR logs showed that the drive was in constant heavy use.
Additional investigation using htop and glances showed that the /home filesystem was being heavily accessed by a program called Baloo. I did some research on Baloo, which turns out to be a file indexer that is a part of the KDE desktop environment—a program that clearly uses a lot of system resources with no discernible positive result.
My filesystems were spread out over two physical hard drives, but the two most used, /home and /var, were on the same drive. My first step, in order to reduce the stress on that hard drive, was to install a new hard drive to spread the load and move the most heavily used filesystem, /home, to that new drive.
I also figured out how to turn off Baloo, and that reduced the disk activity in /home to nearly zero except for my own productive work.
In reality, there were multiple causes for this single symptom of overheating, and all of the fixes I implemented were appropriate. The root cause was a rogue program that produced heavy activity in a single filesystem. This caused a high level of disk activity, which overheated the disk drive. The lack of airflow over the drive due to the absence of a cooling fan only exacerbated the problem.
Yes, this was a real incident and not especially uncommon. Following the rigid logic forms could never bring us to the place where we would truly solve that problem and reduce the chances of it occurring again. Abductive reasoning allows us to be logical as well as creative and to think outside the alleged box. It also allows us to take preventative measures to ensure that the same or related problems do not recur.
Abductive reasoning allows us to learn from our experiences. This is true not just when things go right and we solve the problem but also, and especially, when things go wrong and we do not get it right.
In this article, we have looked at forms of reasoning about problems that apply to many non-technical things as well as to computer hardware and software. We have seen in this article that problem-solving approaches like MAPs and other symptom-fix methods have significant limitations. Specific reasoning methodologies can be used within the framework of an algorithm for problem-solving. Open source software lends itself better to reasoning due to its openness and knowability than does closed, proprietary software.
Of course, these defined styles of reasoning are artificial structures that are intended to give philosophers, psychologists, psychiatrists, and cognitive scientists a vocabulary and common structural referents for discussing and exploring how we think. These purely artificial structures should not be construed as limits on how sysadmins should work. They are merely tools to enable us to understand ourselves and how we think.
In the next article in this series, we will explore a problem-solving algorithm based on the scientific method.
Skills You Need website, Critical Thinking Skills
Butte College, Deductive, Inductive, and Abductive Reasoning
Harris, William, How the Scientific Method Works
Both, David, The Linux Philosophy for SysAdmins, Chapter 23