Kubernetes troubleshooting: 6 ways to find and fix issues

September 15, 2022 | Satyajit Das | 5-minute read

Kubernetes is the most popular container orchestration platform. It offers a wide range of functionality, including scaling, self-healing, storage, secrets, and more. Most problems with such a rich platform stem from its complexity: you need to understand many of its key features to use it well. In this article, I cover some of its less obvious aspects to improve your experience running Kubernetes in production.

This article (and an accompanying YouTube video) goes over some common troubleshooting approaches using real-world scenarios. I'll assume you have basic knowledge of Kubernetes. If you are a beginner, first read What is Kubernetes? and then come back to get a quick overview of Kubernetes capabilities and troubleshooting.

1. You get an OOMKilled error

Imagine you have incorporated Prometheus and Grafana monitoring tools in your Kubernetes cluster, and you create a rule that identifies when pods become consistently unavailable. It sends a notification through an automated phone call or chat message informing you when your pods are not available.

If you run kubectl get pods and see that some pods are being restarted, the next thing to do is to check why. You can do this by using:

kubectl describe pod myPodName -n myNamespace

You may see a message that looks something like this:

State:          Running
  Started:      Sun, 16 Feb 2020 10:20:09 +0000
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Sun, 16 Feb 2020 09:27:39 +0000
  Finished:     Sun, 16 Feb 2020 10:20:08 +0000
Restart Count:  7

OOMKilled means that a container in the pod exceeded its memory limit and was killed (exit code 137 indicates a SIGKILL), after which Kubernetes restarted it. The Restart Count field in the describe output shows how often this has happened. The obvious solution is to increase the memory limit. You can do this by running kubectl edit deployment myDeployment -n myNamespace and editing the memory limit.

However, the pod may be running out of memory because of a leak caused by a bug in your application, in which case a higher limit only delays the next restart. Look at your logs to see whether there is a valid reason for the higher memory usage (for example, an increase in the number of requests).
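
The check and the fix above can also be scripted. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, and the 512Mi value in the usage comment is only an illustration that you should replace with a figure based on your own measurements.

```shell
# Print the reason each container in a pod was last terminated
# (for example, OOMKilled).
# Usage: last_termination_reason myPodName myNamespace
last_termination_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Set a new memory limit on every container in a deployment.
# Usage: raise_memory_limit myDeployment myNamespace 512Mi
raise_memory_limit() {
  kubectl set resources deployment "$1" -n "$2" --limits=memory="$3"
}
```

Changing the limit this way updates the pod template, so Kubernetes rolls out new pods just as it does after kubectl edit.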

2. You see sudden jumps in load and scale

It's important to create a metric (and display it on a dashboard) to track the number of requests your application receives per second. This gives you a sense of what load is normal for your application.

Sometimes there can be a sudden jump in load. You might be notified about this if you track your service's success rate. You could also detect this by comparing your current load against historical data. You want to be notified by a Prometheus rule when a problem like this occurs.

In this scenario, you have two choices: You can either increase the CPU allocation, using the same steps you used for the memory limit, or you can increase the number of pod instances (replicas). Increasing the number of replicas is the recommended method. For example, to scale to five replicas:

kubectl scale deployment myDeployment --replicas=5

If you are already tracking the number of requests per second, then you have a rough idea of how many instances you need to handle extra requests.
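
As a rough worked example of that estimate, the sketch below divides the expected load by the load one pod can handle and rounds up. The function name is hypothetical, and the per-pod capacity is an assumption you would need to measure for your own application.

```shell
# Estimate how many replicas are needed for a given load.
# $1 = expected requests per second for the whole service
# $2 = requests per second one pod can handle (an assumption; measure it)
# Usage: replicas_needed 1200 250
replicas_needed() {
  # Integer ceiling division: round up so capacity never falls below the load.
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

With these illustrative numbers, replicas_needed 1200 250 prints 5, which matches the replica count used in the scale command above.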

If you use an autoscaler, you can automate this process.
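
One way to set this up is with a HorizontalPodAutoscaler created through kubectl autoscale. This is a sketch under a few assumptions: the function names are hypothetical, the minimum, maximum, and CPU target values are placeholders, and the cluster needs a metrics source (such as metrics-server) and CPU requests on the containers for the autoscaler to work. Flag names can differ between kubectl releases, so check kubectl autoscale --help for your version.

```shell
# Create an autoscaler that keeps average CPU utilization near a target.
# Usage: autoscale_deployment myDeployment 2 10 80
autoscale_deployment() {
  kubectl autoscale deployment "$1" --min="$2" --max="$3" --cpu-percent="$4"
}

# Show current versus target utilization and the current replica count.
# Usage: show_autoscaler myDeployment
show_autoscaler() {
  kubectl get hpa "$1"
}
```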

3. Roll back a faulty deployment

If a recent deployment causes issues, one of the fastest and easiest ways to remedy this is by using rollbacks. To see the deployment history:

kubectl rollout history deployment myDeployment

The output looks similar to this:

kubectl rollout history deployment myDeployment
deployment.extensions/myDeployment
REVISION CHANGE-CAUSE
1               <none>
2               <none>
3               <none>
4               <none>

The most recent deployment is the one with the highest number (4, in this example). Perform a rollback to the previous deployment (3):

kubectl rollout undo deployment myDeployment --to-revision=3

You will see new pods created.

Kubernetes keeps only a limited number of old revisions to roll back to (10 by default). To control how many are kept, set revisionHistoryLimit in the deployment spec.

For example:

spec:
  replicas: 2
  revisionHistoryLimit: 20

This keeps the 20 most recent revisions of the deployment available for rollback.
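
A history full of <none> entries makes it hard to choose a revision. The sketch below shows one possible way to record a change cause, inspect a revision before rolling back to it, and wait for a rollout to finish. The function names and the example message are hypothetical.

```shell
# Record why the current revision was created so that it appears
# in the CHANGE-CAUSE column of the rollout history.
# Usage: record_change_cause myDeployment "Upgrade image to v2"
record_change_cause() {
  kubectl annotate deployment "$1" kubernetes.io/change-cause="$2" --overwrite
}

# Inspect the pod template of one revision before rolling back to it.
# Usage: show_revision myDeployment 3
show_revision() {
  kubectl rollout history deployment "$1" --revision="$2"
}

# Wait until a rollout (including a rollback) has finished.
# Usage: wait_for_rollout myDeployment
wait_for_rollout() {
  kubectl rollout status deployment "$1"
}
```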

4. Access a specific log

Assuming you have proper logging in place, you can look at your logs to identify the cause of trouble and when it occurred:

kubectl logs myPodName

However, if the container has restarted, this command shows only the logs of the current instance, and the cause of the crash is usually in the logs of the previous one. In this case, run the following command to get the logs from the previous instance:

kubectl logs myPodName --previous

If you have multiple containers running inside the same pod, you must specify the container name (with the -c flag) to see its logs. If you're using a logging service, it usually takes a while for the most recent logs to appear there. In this case, it's often better to check the logs with the above commands than to rely on dashboards.
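
The variations described above can be combined. This is a small sketch with hypothetical function names; myContainer and the time and line-count values in the usage comments are placeholders.

```shell
# Logs from one container of a multi-container pod.
# Usage: container_logs myPodName myContainer
container_logs() {
  kubectl logs "$1" -c "$2"
}

# Logs from the previous (crashed) instance of that container.
# Usage: previous_container_logs myPodName myContainer
previous_container_logs() {
  kubectl logs "$1" -c "$2" --previous
}

# Only recent lines, with timestamps, to narrow down when trouble started.
# Usage: recent_logs myPodName 15m 200
recent_logs() {
  kubectl logs "$1" --since="$2" --tail="$3" --timestamps
}
```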

5. SSH into your pod

If none of the above tips worked, it might make sense to open a shell inside the pod, much as you would use Secure Shell (SSH) to log in to a server, to perform some basic checks. For instance, you can determine whether you can see the files you expect in the filesystem and whether the log files are present. You can also check whether you're able to make a connection request to some other service directly from this pod. To open a shell in a pod:

kubectl exec -it myPodName -- sh

This lets you access the pod through a shell window.
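
For quick checks you may not need an interactive session at all. The sketch below runs a single command in the pod and, for multi-container pods, opens a shell in a named container. The function names are hypothetical, and the /var/log path in the usage comment is only an example location.

```shell
# Run one command in a pod without opening an interactive shell.
# Usage: pod_run myPodName ls -l /var/log
pod_run() {
  pod="$1"
  shift
  # Everything after "--" is passed to the container unchanged.
  kubectl exec "$pod" -- "$@"
}

# Open an interactive shell in a specific container of a multi-container pod.
# Usage: pod_shell myPodName myContainer
pod_shell() {
  kubectl exec -it "$1" -c "$2" -- sh
}
```

Both helpers assume the container image includes the command you run (minimal images may not ship a shell).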

6. Troubleshoot CrashLoopBackOff and ImagePullBackOff errors

You might have a monitoring tool like Grafana to monitor the number of instances in your service at any given point. Normally, you want to have a certain minimum number of instances running depending on the size of the load. If that minimum isn't met, it triggers an alert. When the problem is CrashLoopBackOff (your pod is starting, crashing, starting again, and then crashing again), the affected instances can't serve requests, so your service may not return a 200 success code. If you're receiving errors, this can also be an indication of a performance problem.

If a kubectl get pods command returns the following output, then you know you have a pod in a CrashLoopBackOff state:

kubectl get pods
NAME                   READY  STATUS            RESTARTS   AGE
myDeployment1-89234... 1/1    Running           1          17m
myDeployment1-46964... 0/1    CrashLoopBackOff  2          1m

There can be many reasons for this error. You may have to do kubectl describe pod to get to the root of this. Here's a summary of possible reasons and some tips:

Your Dockerfile doesn't define a command (CMD or ENTRYPOINT) that keeps running, so the container exits immediately after starting. Kubernetes then restarts it again and again when the pod is managed by a deployment or ReplicaSet.
You have used the same port for two containers inside the same pod. All containers inside the same pod share the same internet protocol (IP) address, so they can't listen on the same port. You need a separate port for every container within your pod.
Kubernetes can't pull the image you have specified, so the container never starts. This shows up as ImagePullBackOff rather than CrashLoopBackOff.
Run kubectl logs podName to get more information about what caused the error.
If you don't see anything useful in your logs, consider deploying your application with a sleep command that keeps the container alive for a few minutes. This might help you see some logs before the application crashes. It also might help you figure out whether your application has a code bug or a mistake in its configuration. If it has a configuration problem, you may not see any code errors (because it fails before it reaches them).
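
To tell these cases apart quickly, you can query the container state and the pod's events directly. This is a minimal sketch with hypothetical function names; both helpers look in the current namespace, so add -n myNamespace if your pod lives elsewhere.

```shell
# Print why each container in a pod is waiting, for example
# CrashLoopBackOff or ImagePullBackOff.
# Usage: waiting_reason myPodName
waiting_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
}

# List events for one pod sorted by time (image pull failures show up here).
# Usage: pod_events myPodName
pod_events() {
  kubectl get events --field-selector involvedObject.name="$1" --sort-by=.lastTimestamp
}
```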

Wrap up

Kubernetes is a great tool for automating many of the manual processes involved in deploying, managing, and scaling containerized applications. If anything goes wrong, try these six troubleshooting methods to find and fix the source of the problem.

This article is based on Kubernetes troubleshooting examples, originally published on the Tennexas blog, and is reused with permission.

About the author

Satyajit Das

Satyajit is a software engineer based in Cambridge, UK. He mainly works with Golang and a bit of React. His work also includes technologies like Kubernetes, Docker, gRPC, Elastic, and RabbitMQ, and he regularly posts videos on his YouTube channel about these topics.

He earned his bachelor's degree from IIT Kharagpur and his master's from Ecole Polytechnique, completed his master's thesis at Caltech, and earned his PhD from Cambridge University.
