Kubernetes is the most popular container orchestration platform. It offers a wide range of features, including scaling, self-healing, storage management, secrets management, and more. The main difficulty with such a rich platform is its complexity: you need to understand many of its key features to use it well. I'll cover some of its less obvious aspects to improve your experience running Kubernetes in production.

This article (and an accompanying YouTube video) goes over some common troubleshooting approaches using real-world scenarios. I'll assume you have basic knowledge of Kubernetes. If you are a beginner, first read What is Kubernetes? and then come back to get a quick overview of Kubernetes capabilities and troubleshooting.


1. You get an OOMKilled error

Imagine you have incorporated Prometheus and Grafana monitoring tools in your Kubernetes cluster, and you create a rule that identifies when pods become consistently unavailable. It sends a notification through an automated phone call or chat message informing you when your pods are not available.

If you run kubectl get pods and see that some pods are being restarted, the next thing to do is to check why. You can do this by using:

kubectl describe pod myPodName -n myNamespace

You may see a message that looks something like this:

State:  Running
Started: Sun, 16 Feb 2020 10:20:09 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Sun, 16 Feb 2020 09:27:39 +0000
Finished: Sun, 16 Feb 2020 10:20:08 +0000
Restart Count: 7

OOMKilled means that a container in the pod exceeded its memory limit and was killed, so Kubernetes restarted it. Exit code 137 (128 + 9) indicates that the process was terminated with SIGKILL. The describe output also shows the restart count. The obvious fix is to raise the memory limit by running kubectl edit deployment myDeployment -n myNamespace and editing the memory limit.

However, the cause might be a memory leak from a bug in your application, in which case raising the limit only delays the next restart. Look at your logs to see whether there is a valid reason for the higher memory use (for example, an increase in the number of requests).
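
If you want to check for this condition from a script rather than by reading the describe output, here is a minimal Python sketch. It assumes you feed it the JSON printed by kubectl get pod myPodName -o json; the function name and the sample data (including the container name myContainer) are hypothetical and only mirror the output shown above.

```python
import json


def find_oom_killed(pod_json: str) -> list:
    """Return (container name, restart count) pairs for containers
    whose last termination reason was OOMKilled."""
    pod = json.loads(pod_json)
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty if the container has never been restarted
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append((status["name"], status.get("restartCount", 0)))
    return hits


# Hypothetical sample, trimmed to the fields the function reads
sample = json.dumps({
    "status": {
        "containerStatuses": [{
            "name": "myContainer",
            "restartCount": 7,
            "lastState": {
                "terminated": {"reason": "OOMKilled", "exitCode": 137}
            },
        }]
    }
})

print(find_oom_killed(sample))  # [('myContainer', 7)]
```

Reading the JSON form is usually less brittle than parsing the human-readable describe text, which is not meant to be a stable format.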

2. You see sudden jumps in load and scale

It's important to create a metric (and display it on a dashboard) to track the number of requests your application receives per second. This gives you a sense of what's normal for your target.

Sometimes there can be a sudden jump in load. You might be notified about this if you track your service's success rate. You could also detect this by comparing your current load against historical data. You want to be notified by a Prometheus rule when a problem like this occurs.

In this scenario, you have two choices: You can increase the CPU allocation, using the same steps you used for memory, or you can increase the number of pod replicas. Increasing the number of replicas is the recommended method. For example, to scale to five replicas:

kubectl scale deployment myDeployment --replicas=5

If you are already tracking the number of requests per second, then you have a rough idea of how many instances you need to handle extra requests.
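
As a rough illustration of that estimate, here is a small Python sketch. The per-pod capacity and the 20% headroom are assumptions you would replace with numbers measured for your own service; the function name is hypothetical.

```python
import math


def replicas_needed(requests_per_second: float,
                    capacity_per_pod: float,
                    headroom: float = 0.2) -> int:
    """Estimate how many replicas are needed to serve the given load.

    capacity_per_pod is the requests per second one pod handles comfortably
    (an assumption you measure yourself). headroom adds a safety margin.
    """
    if capacity_per_pod <= 0:
        raise ValueError("capacity_per_pod must be positive")
    padded_load = requests_per_second * (1 + headroom)
    # Always keep at least one replica running
    return max(1, math.ceil(padded_load / capacity_per_pod))


# 400 requests/s, each pod handles about 100 requests/s, 20% headroom
print(replicas_needed(400, 100))  # 5
```

The result is only a starting point for the --replicas value; watch your success rate after scaling to confirm it was enough.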

If you use an autoscaler, you can automate this process.
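
To give a sense of what an autoscaler does, the Horizontal Pod Autoscaler documentation describes its core calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The Python sketch below is a simplified version of that formula only. The function name, the replica bounds, and the sample numbers are hypothetical, and the real controller also accounts for things like pod readiness and stabilization windows that are not modeled here.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         min_replicas: int = 1,
                         max_replicas: int = 10,
                         tolerance: float = 0.1) -> int:
    """Simplified sketch of the Horizontal Pod Autoscaler scaling formula."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured bounds
    return max(min_replicas, min(max_replicas, desired))


# 3 replicas at 90% average CPU with a 50% target
print(hpa_desired_replicas(3, 90, 50))  # 6
```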

3. Roll back a faulty deployment

If a recent deployment causes issues, one of the fastest and easiest ways to remedy this is by using rollbacks. To see the deployment history:

kubectl rollout history deployment myDeployment

The output looks similar to this:

kubectl rollout history deployment myDeployment
deployment.extensions/myDeployment
REVISION CHANGE-CAUSE
1               <none>
2               <none>
3               <none>
4               <none>

The most recent deployment is the one with the highest number (4, in this example). Perform a rollback to the previous deployment (3):

kubectl rollout undo deployment myDeployment --to-revision=3

You will see new pods created.
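
If you script your rollbacks, you need to pick the revision number from the history output. The following Python sketch shows one way to do that by parsing the text printed by kubectl rollout history. The function name is hypothetical, and the sample is the output shown above.

```python
def previous_revision(history_output: str):
    """Return the second-highest revision number in the output of
    `kubectl rollout history`, or None if there is nothing to roll back to."""
    revisions = []
    for line in history_output.splitlines():
        fields = line.split()
        # Revision rows start with a number; the header lines do not
        if fields and fields[0].isdigit():
            revisions.append(int(fields[0]))
    if len(revisions) < 2:
        return None
    return sorted(revisions)[-2]


sample = """deployment.extensions/myDeployment
REVISION CHANGE-CAUSE
1               <none>
2               <none>
3               <none>
4               <none>
"""

print(previous_revision(sample))  # 3
```

Keep in mind that after a rollback, Kubernetes records the restored configuration as a new revision, so run the history command again before the next rollback instead of reusing an old number.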

It's wise to set your deployment history to save a specific number of versions. Do so by setting revisionHistoryLimit.

For example:

spec:
  replicas: 2
  revisionHistoryLimit: 20

This keeps the 20 most recent revisions available for rollback.


4. Access a specific log

Assuming you have proper logging in place, you can look at your logs to identify the cause of trouble and when it occurred:

kubectl logs myPodName 

However, if the container has restarted, this command shows only the logs of the current instance, which may not include the failure you are investigating. In this case, run the following command to get the logs from the previous instance:

kubectl logs myPodName --previous

If you have multiple containers running inside the same pod, you must specify the container name with the -c flag to see its logs. If you're using a logging service, it usually takes a while for the most recent logs to appear there. In this case, it's often better to look at the logs with the above commands than to rely on dashboards.
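
If you call these commands from a script, a small helper keeps the flag combinations straight. This is a minimal Python sketch: the function name and the container name myContainer are hypothetical, while -n, -c, --previous, and --tail are standard kubectl logs flags.

```python
import shlex


def kubectl_logs_args(pod, namespace=None, container=None,
                      previous=False, tail=None):
    """Build the argument list for a `kubectl logs` call."""
    args = ["kubectl", "logs", pod]
    if namespace:
        args += ["-n", namespace]
    if container:
        # Needed when the pod runs more than one container
        args += ["-c", container]
    if previous:
        # Logs from the instance that ran before the last restart
        args.append("--previous")
    if tail is not None:
        # Only the last N lines
        args.append(f"--tail={tail}")
    return args


args = kubectl_logs_args("myPodName", container="myContainer", previous=True)
print(shlex.join(args))  # kubectl logs myPodName -c myContainer --previous
```

You can pass the returned list to subprocess.run to execute the command against your cluster.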


5. Open a shell in your pod

If none of the above tips worked, it might make sense to open a shell inside the pod to perform some basic checks. This is often loosely called "SSHing into the pod," although kubectl exec does not actually use Secure Shell (SSH). For instance, you can check whether the files you expect are in the filesystem and whether the log files are present. You can also check whether you can connect to another service directly from this pod. To open a shell in a pod:

kubectl exec -it myPodName -- sh

This gives you an interactive shell inside the pod's container.

6. Troubleshoot CrashLoopBackOff and ImagePullBackOff errors

You might have a monitoring tool like Grafana to track the number of instances of your service at any given point. Normally, you want a certain minimum number of instances running, depending on the size of the load. If that minimum isn't met, an alert is triggered. When the problem is CrashLoopBackOff (your pod is starting, crashing, starting again, and then crashing again), the affected pod can't serve requests, so your service may return errors instead of a 200 success code. If you're receiving errors, check whether your pods are in this state.

If a kubectl get pods command returns the following output, then you know you have a pod in a CrashLoopBackOff state:

kubectl get pods
NAME                   READY  STATUS            RESTARTS   AGE
myDeployment1-89234... 1/1    Running           1          17m
myDeployment1-46964... 0/1    CrashLoopBackOff  2          1m

There can be many reasons for this error. You may have to do kubectl describe pod to get to the root of this. Here's a summary of possible reasons and some tips:

  • Your Dockerfile doesn't have a command (CMD), so your pod immediately exits after starting. Kubernetes automatically restarts the pod when it's managed by a deployment or ReplicaSet.
  • You have used the same port for two containers inside the same pod. All containers inside the same pod have the same internet protocol (IP) address. They are not permitted to use the same ports. You need a separate port for every container within your pod.
  • Kubernetes can't pull the image you have specified, so the container never starts. In this case, the pod status shows ImagePullBackOff (or ErrImagePull) rather than CrashLoopBackOff.
  • Run kubectl logs podName to get more information about what caused the error.
  • If you don't see anything useful in your logs, consider deploying your application with a sleep command for a few minutes. This might help you see some logs before the application crashes. It also might help you figure out whether your application has a code bug or mistake in its configuration. If it has a configuration problem, you may not see any code errors (because it fails before it reaches them).
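
To spot these states across many pods at once, you can filter the kubectl get pods output in a script. The Python sketch below parses the plain-text table shown earlier. The function name and the set of states it looks for are my own choices, and for anything beyond a quick check, parsing kubectl get pods -o json is likely more robust.

```python
BACKOFF_STATES = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def failing_pods(get_pods_output: str) -> dict:
    """Map pod name to status for pods in a crash or image-pull back-off
    state, given the plain-text output of `kubectl get pods`."""
    lines = [line for line in get_pods_output.splitlines() if line.strip()]
    if not lines:
        return {}
    # Locate the columns by header name instead of hard-coding positions
    header = lines[0].split()
    name_index = header.index("NAME")
    status_index = header.index("STATUS")
    result = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= status_index:
            continue
        if fields[status_index] in BACKOFF_STATES:
            result[fields[name_index]] = fields[status_index]
    return result


sample = """NAME                   READY  STATUS            RESTARTS   AGE
myDeployment1-89234... 1/1    Running           1          17m
myDeployment1-46964... 0/1    CrashLoopBackOff  2          1m
"""

print(failing_pods(sample))  # {'myDeployment1-46964...': 'CrashLoopBackOff'}
```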

Wrap up

Kubernetes is a great tool for automating many of the manual processes involved in deploying, managing, and scaling containerized applications. If anything goes wrong, try these six troubleshooting methods to find and fix the source of the problem.


This article is based on Kubernetes troubleshooting examples, originally published on the Tennexas blog, and is reused with permission.


About the author

Satyajit is a software engineer based in Cambridge, UK. He mainly works with Golang and a bit of React. His work also includes technologies like Kubernetes, Docker, gRPC, Elastic, and RabbitMQ, and he regularly posts videos on his YouTube channel about these topics.

He earned his bachelor's degree from IIT Kharagpur and his master's from Ecole Polytechnique, completed his master's thesis at Caltech, and earned his PhD from Cambridge University.
