As car owners, it is only a matter of time before we find ourselves in an automotive repair shop attempting to explain what is wrong with the vehicle in question. At best, the mechanic can see, hear or experience the problem firsthand. Second best is to make some ridiculous vocal sound to simulate the issue in hopes that the mechanic will understand exactly the sound being mimicked. My personal success rate with these vocal imitations is, unfortunately, lower than I would like.
Typically the mechanic will need to begin by understanding some basics: the car’s make, model, engine, other notable features, followed by a clear description of what the problem is. Routinely a test drive or physical inspection is next. It might even be required to connect the car to a diagnostic computer to analyze the internal computer module’s logs or error codes. As frustrating, time consuming and expensive as this can all be, it is a required process to determine how to fix the issue.
Switching gears (pun intended) from cars to computer systems, I can fantasize about how great (and humorous) it would be for a vocal sound to give me that moment of enlightenment enabling me to respond with, "Oh, yeah! Here’s your problem!" Sadly, database, application and web servers do not make sound. Maybe a failing server cooling fan or dying hard disk, but those make up only a small fraction of support cases.
Thus, the focus of this post is how to provide quality input for a Red Hat Support Case to achieve the desired answer or root cause analysis as quickly as possible. Unfortunately, a description of "Application is slow" or "Server crashes" with no additional insight into the system typically leads to a series of time-consuming questions and requests for diagnostic data. Inaccurately described issues (and even worse, incorrect assumptions) can result in the wrong solution or remedy being provided. I often see premature assumptions cause much wasted time as tunings are applied which do not fix the problem. My favorite phrase is, "let’s collect good diagnostic output and let the data lead us."
Familiarity with the following tools in advance is highly recommended. Learning a new tool in the middle of an outage is a poor idea and can mean missing the opportunity to collect crucial diagnostic data. The Knowledge Solution "Proactive data collection and analysis tools for Red Hat Enterprise Linux" serves as a great landing page for this topic, but I would like to comment on its importance a bit further below, as well as demonstrate the Red Hat Support Tool.
Kdump and vmcores
Kdump is a utility that, when the Linux kernel hangs, panics or "crashes", can dump the memory state at that moment into a vmcore file. Very often a final Root Cause Analysis cannot be provided without this file because logs simply do not provide enough information. The important thing to note is that kdump must be pre-configured as part of your standardized default configuration. Your support resources should be familiar with it so that when there are problems, they know how to use this tool to collect the data needed. If the machine "hangs" and is power cycled, or kdump was not preconfigured, it’s too late. You’ve missed the chance to collect this valuable data.
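One way to confirm ahead of time that kdump is armed is a small pre-flight check. The sketch below is only an illustration: the function name `kdump_ready` is hypothetical, and it assumes the kernel exposes `/proc/cmdline` and `/sys/kernel/kexec_crash_loaded`, as current Red Hat Enterprise Linux kernels do. It checks that memory has been reserved for the capture kernel and that the capture kernel has been loaded.

```shell
#!/bin/sh
# Pre-flight check: is a crash kernel reserved and loaded?
# kdump_ready is a hypothetical helper name. Both paths are parameters
# so the function can also be exercised against ordinary test files.
kdump_ready() {
    cmdline_file="${1:-/proc/cmdline}"
    loaded_file="${2:-/sys/kernel/kexec_crash_loaded}"

    # crashkernel= on the kernel command line reserves memory
    # for the capture kernel at boot time.
    if ! grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "no crashkernel= reservation on the kernel command line"
        return 1
    fi

    # kexec_crash_loaded reads 1 once the capture kernel is loaded.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded (is the kdump service running?)"
        return 1
    fi

    echo "kdump appears ready"
}
```

Calling `kdump_ready` with no arguments inspects the running system. A check like this could be run from configuration management so that an unprepared host is noticed before the outage rather than during it.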
The aforementioned collection of proactive resources has a link to tell you everything you need to know about kdump. However, here are a few of my tips and myth busters on kdump:
- Vmcore files can be very large, but with the default compression enabled, they are typically only a few GB in size. They should not be equal to the memory size of the server unless you configure kdump to collect all memory, which is rarely needed. The makedumpfile options "-c -d 31" should keep them nice and small. However, they are often too large to attach to support cases, so this fine Knowledge Solution will tell you how to upload large files.
- Use a target dump server (usually NFS) as a standard location to dump vmcore files. Usually 200-500 GB of shared space is sufficient. This is especially helpful with virtual guests where there is a need to minimize the virtual guest disks. Just be mindful of how long it will take to write a vmcore file over the network and test this out in advance.
- Understand how to issue SysRq or NMI events to trigger a core dump if a system "hangs". If you reboot the server, you have lost the opportunity to capture data of the "hang". This guide will explain how to configure remote console tools such as HP iLO, Dell DRAC, IBM IMM, and VMware virtual consoles.
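As a rough illustration of the tips above, the configuration fragments below show an NFS dump target, the "-c -d 31" collector options, and the kernel settings that let SysRq or an NMI trigger a panic. The NFS server name, export path and sysctl file name are placeholders I made up for the example; treat this as a sketch to adapt and verify against the kdump documentation for your release, not a recommended configuration.

```text
# /etc/kdump.conf (fragment)
# Hypothetical NFS export; replace with your own dump server.
nfs nfs.example.com:/export/vmcore
# Directory, relative to the dump target, where vmcores are written.
path /var/crash
# -c compresses the dump; -d 31 filters out zero, cache, user and free pages.
core_collector makedumpfile -c --message-level 1 -d 31

# /etc/sysctl.d/99-crash.conf (fragment; the file name is a placeholder)
# Enable the SysRq key functions on the console.
kernel.sysrq = 1
# Panic, and therefore start kdump, when an unknown NMI is received.
kernel.unknown_nmi_panic = 1
```

After restarting the kdump service, a test crash can be triggered as root on a non-production system with `echo c > /proc/sysrq-trigger`. This confirms that the vmcore lands on the target and shows how long the write takes over the network.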
Red Hat Support Tool
I LOVE this tool. I use it daily to interact with support cases. It is a CLI that allows you to interact with the Red Hat Customer Portal and Cases directly from a Red Hat Enterprise Linux system. No web browser required. You can use this tool to:
- Query for knowledge solutions: Diagnose a File, Error Code, or a String
- Open, edit and attach files to a support case
- Set global or per-user configurations (such as in your automounted /home/user)
- Save your Red Hat Customer Portal login and/or password for convenience (the tool prompts you)
- And more, such as diagnostic analysis...
Below is an example of how to install and initially use it, although the official Red Hat Support Tool article will give you much greater detail. I will manually set the User, Password and Proxy so that it does not prompt me for this information. I am saving credentials for my own convenience but you will likely want to choose differently to meet your own security requirements.
[root@rhel72 ~]# yum install -y redhat-support-tool
[root@rhel72 ~]# su - tbowling
[tbowling@rhel72 ~]$ redhat-support-tool config user mylogin-tbowling
[tbowling@rhel72 ~]$ redhat-support-tool config password mysekret
[tbowling@rhel72 ~]$ redhat-support-tool config proxy_url http://proxy.example.com
[tbowling@rhel72 ~]$ redhat-support-tool search sosreport
Title: Sosreport hangs
ID: 83473
State: Verified: This solution has been verified to work by Red Hat Customers and Support Engineers for the specified product version(s).
URL: https://access.redhat.com/solutions/83473
-------------------------------------------------------------------------------
Title: [Troubleshooting] Sosreport debug
ID: 41884
State: Unverified: This solution has not yet been verified to work by Red Hat customers.
URL: https://access.redhat.com/solutions/41884
-------------------------------------------------------------------------------
Title: What kind of information does sosreport get?
ID: 1149933
State: Verified: This solution has been verified to work by Red Hat Customers and Support Engineers for the specified product version(s).
URL: https://access.redhat.com/solutions/1149933
-------------------------------------------------------------------------------
Title: What is a sosreport and how to create one in Red Hat Enterprise Linux 4.6 and later?
ID: 3592
State: Verified: This solution has been verified to work by Red Hat Customers and Support Engineers for the specified product version(s).
URL: https://access.redhat.com/solutions/3592
-------------------------------------------------------------------------------
Next, let’s open a case from the interactive prompt and attach a sosreport. Note that you can supply answers with short or long options, or let the tool prompt you. I am providing some options to save space.
[tbowling@rhel72 ~]$ redhat-support-tool
Welcome to the Red Hat Support Tool.
Command (? for help): opencase --severity=4 --product="Red Hat Enterprise Linux" --version=7.2
Please enter a summary (or 'q' to exit): Testing redhat-support-tool
Please enter a description (Ctrl-D on an empty line when complete):
This is just a test of the awesome redhat-support-tool.
Would you like to assign a case group to this case (y/N)? n
Would you like to see if there's a solution to this problem before opening a support case? (y/N) n
-------------------------------------------------------------------------------
Support case 01707773 has successfully been opened.
Please attach a SoS report to support case 01707773. Create a SoS report as the root user and execute the following command to attach the SoS report directly to the case:
redhat-support-tool addattachment -c 01707773 <path to sosreport>
Would you like to attach a file to 01707773 at this time? (y/N) y
Please provide the full path to the file (or 'q' to exit): sosreport-rhel72beta-20160920164443.tar.xz
Please provide a description or enter to accept default (or 'q' to exit): sosreport
Uploading sosreport-rhel72beta-20160920164443.tar.xz to the case ...completed successfully.
Unfortunately, problems will arise. Things will break. Applications will fail. When they do, quickly providing clear descriptions of the problem and environment can save a great deal of time in getting the answers desired. Here are some tips that can reduce the time it takes for a support engineer to reach quality conclusions:
A good clear description of the environment and the issue can save a tremendous amount of time and confusion when troubleshooting. Descriptions such as "System Hung" or "Application Slow" are very vague and not helpful. More information is needed to guide support engineers on where and what to look for.
What are the systems (IP Addresses/Hostnames) and applications involved?
What is the nature of the application (batch job processing, Java app, database, etc)?
Were any changes made to the environment recently that could have changed the behavior?
Can the issue be reproduced? If so, clearly defined reproducible steps can be extremely helpful to understand the specific issue.
If hung or crashed, did kdump collect a vmcore file?
Is the system part of a cluster or other complex management system? If so, those logs must also be reviewed and the vendor consulted. For example, a rebooted cluster node always needs to have the cluster logs reviewed.
Document all discussions and explanations in the case so that other support resources can quickly understand and not miss critical information when brought in to help.
Is there anything else that could help a support engineer understand the environment and give clues as to where and what to look for?
SOSReports provide support engineers an excellent understanding of the system, the current state and configuration, and logs to aid in troubleshooting. Chances are that a sosreport will be requested so providing it early can save time.
A good patching policy to keep your environment up to date can prevent many unplanned outages due to known bugs, as well as simplify urgent security patching. Not every bug or security fix is eligible for backporting to all versions, so an updated environment makes it much easier to stay stable and current on available security errata updates. And yes, proactive patching policies prove to be much easier and more cost-effective than urgently scrambling to patch a large environment.
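Several of the environment questions above can be answered by a short script whose output is pasted into the case description. The following is a minimal sketch under my own assumptions: the function name `case_summary`, the field labels and the layout are all hypothetical, and the default crash directory is simply the common `/var/crash` location. It is not a Red Hat format and does not replace a sosreport.

```shell
#!/bin/sh
# Print a short environment summary for a support case description.
# case_summary and its field labels are illustrative only.
# Both paths are parameters so the function can be tried on test files.
case_summary() {
    release_file="${1:-/etc/redhat-release}"
    crash_dir="${2:-/var/crash}"

    echo "Hostname: $(uname -n)"
    echo "Kernel:   $(uname -r)"

    if [ -r "$release_file" ]; then
        echo "Release:  $(cat "$release_file")"
    else
        echo "Release:  unknown"
    fi

    # Any entry under the crash directory suggests kdump saved a vmcore.
    if [ -d "$crash_dir" ] && [ -n "$(ls -A "$crash_dir" 2>/dev/null)" ]; then
        echo "Vmcore:   found under $crash_dir"
    else
        echo "Vmcore:   none found under $crash_dir"
    fi
}
```

Running `case_summary` with no arguments reports on the local system. The application details, recent changes and reproduction steps still have to be written by a person who knows the environment.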
There is no perfect solution when it comes to troubleshooting. Troubleshooting can often feel like an artform. Sometimes very strange things happen that are difficult to understand or reproduce. Quality communication and diagnostic data are currently our best ways to diagnose an issue. AND if you can vocally generate the sound of dropped network packets, data corruption or a script failing, I WANT TO HEAR IT!
Terry Bowling is a TAM in the NA Central region. He has been migrating workloads from UNIX to Linux since 1998 and has supported business environments for major Telecom and Pharmaceutical companies. Most recently he has been focused on enabling our customers to migrate to container platforms using Red Hat Atomic/OpenShift, as well as adopting SAP HANA.
Twitter: https://twitter.com/terry_bowling #RHTAM
A Red Hat Technical Account Manager (TAM) is a specialized product expert who works collaboratively with IT organizations to strategically plan for successful deployments and help realize optimal performance and growth. The TAM is part of Red Hat’s world class Customer Experience and Engagement organization and provides proactive advice and guidance to help you identify and address potential problems before they occur. Should a problem arise, your TAM will own the issue and engage the best resources to resolve it as quickly as possible with minimal disruption to your business.