Last Updated: January 2026
Introduction
Kubernetes is a powerful container orchestration platform, but when something breaks, troubleshooting can feel like a daunting task. Pods crash, services stop responding, deployments fail, and when you try to debug, the logs don’t always tell the full story.
If the platform breaks and you are the engineer managing it, you probably have questions like:
- Why is my Pod not up and running?
- Why is my application down even though the pods are healthy?
- What steps should I follow to troubleshoot this issue?
Then thank me later, because this is the exact guide for you.
This article, Kubernetes Troubleshooting Guide: A Complete Step-by-Step Approach, explains how to troubleshoot Kubernetes issues so that anyone reading it gets a clear picture of which resource to look at when something breaks in the cluster.
This article works for:
- Self-managed Kubernetes
- Managed Kubernetes services on cloud platforms (GKE/EKS/AKS)
- Production and non-production clusters
- Enterprise Kubernetes (OpenShift)
“This will be Part 1 of my multi-part Kubernetes troubleshooting series.”
Who Can Use This Guide
- DevOps engineers (from beginner to senior level)
- Anyone who has just started learning Kubernetes
- Developers who run their apps on Kubernetes
- SREs who need a structured troubleshooting approach
Why Kubernetes Troubleshooting Feels Hard
Kubernetes issues are often complex and hard to debug because the problems are distributed. Errors can surface at multiple layers, and Kubernetes may show you the symptoms of an error rather than its root cause.
Multiple Failure Layers
Kubernetes has multiple failure layers:
- Application
- Pod
- Container
- Network
- Storage
- Node
- Control Plane
The Golden Rule
Never jump to conclusions. Follow a systematic approach whenever an issue occurs; you can use the same top-to-bottom order listed in the failure layers above. Do not troubleshoot resources at random. By answering four simple questions, you can solve most Kubernetes issues.
- What resource is broken?
- Where is it broken?
- Why is it broken?
- How to prevent it?
A Step-by-Step Approach to Kubernetes Troubleshooting
Step 1. Identify the Scope of the Issue
Ask yourself:
- Is a single pod or container affected?
- Are one or more services affected?
- Is the entire cluster affected?
Commands to run to check the scope (see the example after this list):
- kubectl get pods -A
- kubectl get svc -A
- kubectl get nodes
If a single pod is failing, the issue is likely related to that pod or its containers, and you should focus on debugging at the pod/container level. If many pods across namespaces are failing, the issue is more likely at the node or cluster level.
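A minimal sketch of how this scope check looks in practice (the extra flags here are optional refinements on the commands above):

```bash
# List every pod that is not in the Running phase, across all namespaces
# (completed Job pods in the Succeeded phase will also show up here)
kubectl get pods -A --field-selector=status.phase!=Running

# Check whether any node is NotReady or under pressure
kubectl get nodes -o wide

# Recent cluster-wide events, sorted by time, often point to the failing layer
kubectl get events -A --sort-by=.lastTimestamp | tail -20
```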
Step 2. Checking Pod Status
Pods are usually the first place where a Kubernetes cluster surfaces errors, so start by checking their status and restart counts, as shown below.
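For example (the namespace name below is a placeholder), look at the STATUS and RESTARTS columns first; values like CrashLoopBackOff, ImagePullBackOff, or Pending already narrow down the problem:

```bash
# Pod status in a specific namespace; -o wide also shows which node each pod is scheduled on
kubectl get pods -n my-namespace -o wide

# Watch pods live to catch restart loops as they happen
kubectl get pods -n my-namespace -w
```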
Step 3. Describing the Pod (Most Problems Get Solved Here)
Run kubectl describe pod <pod-name> (see the example after this list) and look for:
- The Events section, for recent events related to pod health and errors
- Volume mount errors
- Image pull errors, such as ImagePullBackOff or missing image manifests
- Resource shortage errors and failed health probes
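A quick sketch, assuming a pod called my-app-pod in the namespace my-namespace (both placeholders):

```bash
# Describe the failing pod; the Events section at the bottom is usually the most useful part
kubectl describe pod my-app-pod -n my-namespace

# Pull only the events recorded against that pod
kubectl get events -n my-namespace --field-selector involvedObject.name=my-app-pod
```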
Step 4. Check Container Logs
Run kubectl logs <pod-name> (see the example after this list) and look for:
- Log lines about missing application configuration
- Log lines about failed database connections
- Permission-related errors
- OOMKilled messages
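A few log variations worth knowing (pod, namespace, and container names below are placeholders):

```bash
# Logs from the current container
kubectl logs my-app-pod -n my-namespace

# If the pod runs multiple containers, pick one explicitly
kubectl logs my-app-pod -n my-namespace -c my-container

# Logs from the previous (crashed) container instance -- essential for restart loops
kubectl logs my-app-pod -n my-namespace --previous

# Follow logs live while you reproduce the issue
kubectl logs my-app-pod -n my-namespace -f
```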
Step 5. Verify Resource Requests and Limits
Setting CPU/memory requests and limits higher or lower than what the application container actually needs can lead to silent failures.
- kubectl describe pod <- shows the CPU and memory requests/limits configured on the pod
- kubectl top pod <- gives a quick snapshot of CPU and memory usage per pod
- kubectl top node <- gives a quick snapshot of CPU and memory usage per node
While debugging CPU and memory issues, look for common errors like these (a worked example follows the list):
- OOMKilled
- Insufficient CPU
- Insufficient Memory
- Memory limits too low
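A minimal sketch of these checks (pod, deployment, and namespace names, as well as the request/limit values, are placeholders; kubectl top needs the metrics-server add-on to be installed):

```bash
# Requests and limits currently configured on the pod's containers
kubectl describe pod my-app-pod -n my-namespace | grep -E -A4 "Limits|Requests"

# Actual usage (requires metrics-server)
kubectl top pod my-app-pod -n my-namespace
kubectl top node

# One way to adjust requests/limits on a Deployment from the command line
kubectl set resources deployment my-app -n my-namespace \
  --requests=cpu=250m,memory=256Mi --limits=cpu=500m,memory=512Mi
```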
Step 6. Troubleshooting Node-Level Issues
When pods keep failing on specific nodes, always check the health of those nodes.
Commands to check node health (see the example after this list):
- kubectl get nodes
- kubectl describe node node-name
Common issues observed with Nodes:
- Kubelet stopped
- NotReady
- DiskPressure
- MemoryPressure
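For example (my-node is a placeholder for the actual node name):

```bash
# Node status; NotReady or SchedulingDisabled shows up here
kubectl get nodes -o wide

# Conditions such as Ready, MemoryPressure, and DiskPressure are in the describe output
kubectl describe node my-node | grep -A10 "Conditions:"

# Events recorded against the node (kubelet restarts, evictions, and so on)
kubectl get events -A --field-selector involvedObject.name=my-node
```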
Step 7. Troubleshooting Network Issues
Sometimes the pods are running fine, but the application they serve is unreachable.
Commands to troubleshoot networking issues (see the example after this list):
- kubectl get svc
- kubectl describe svc service-name
- kubectl get endpoints
Common Networking Mistakes Observed
- Wrong Service selectors <- compare the selector with the pod labels; a selector mismatch is a very common cause
- Mismatched Ports
- Failed DNS
- NetworkPolicy blocking traffic
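A sketch of these checks, assuming a Service called my-service in my-namespace (both placeholders); the last command spins up a temporary busybox pod to test DNS resolution:

```bash
# Does the Service have endpoints? An empty list usually means a selector/label mismatch
kubectl get endpoints my-service -n my-namespace

# Compare the Service selector with the labels on the pods it should match
kubectl get svc my-service -n my-namespace -o jsonpath='{.spec.selector}'
kubectl get pods -n my-namespace --show-labels

# Test DNS resolution from a throwaway pod
kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup my-service.my-namespace.svc.cluster.local
```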
Step 8. Troubleshooting Storage
The most common symptom of a storage problem is pods stuck in the Pending state.
Commands to check storage issues (see the example after this list):
- kubectl get pvc
- kubectl describe pvc pvc-name
Common storage errors observed:
- PVC pending
- Permission issue
- CSI driver issues <- the driver pods may not be running
- Mount failures
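A quick sketch (PVC and namespace names are placeholders; the CSI driver check depends on which driver and namespace your cluster uses):

```bash
# Is the claim Bound or stuck in Pending?
kubectl get pvc -n my-namespace
kubectl describe pvc my-data-pvc -n my-namespace

# Does the expected StorageClass exist, and is one marked as default?
kubectl get storageclass

# Are the CSI driver pods healthy?
kubectl get pods -n kube-system | grep -i csi
```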
Most Common Kubernetes Errors Observed
Across the steps above, the errors you will run into most often are ImagePullBackOff, CrashLoopBackOff, OOMKilled, pods stuck in Pending (scheduling or PVC problems), and NotReady nodes.
Troubleshooting Checklist
- Pod status
- Pod events
- Container logs
- Container limits
- Resource limits
- Node health
- Services and Endpoints
- PVC and storage
- Cluster events
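If you want to run through this checklist quickly, here is a minimal sketch that wraps the commands from the earlier steps into one script (the namespace argument and section headings are my own additions):

```bash
#!/usr/bin/env bash
# Quick pass over the troubleshooting checklist for one namespace.
# Usage: ./k8s-checklist.sh <namespace>   (defaults to "default")
NAMESPACE="${1:-default}"

echo "== Pod status and restarts ==" && kubectl get pods -n "$NAMESPACE" -o wide
echo "== Recent events =="           && kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -20
echo "== Node health =="             && kubectl get nodes
echo "== Services and endpoints ==exclamation" | sed 's/exclamation//' && kubectl get svc,endpoints -n "$NAMESPACE"
echo "== PVCs and storage =="        && kubectl get pvc -n "$NAMESPACE"
echo "== Resource usage (needs metrics-server) ==" && kubectl top pod -n "$NAMESPACE"
```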
Kubernetes Best Practices to Avoid Issues
- Setting resource requests and limits (see the sketch after this list)
- Using health probes
- Implementing centralized logging and monitoring
- Using GitOps for continuous change delivery
- Always testing changes in a staging environment before rolling them out to production
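As a rough sketch of the first two practices (the name, image, port, probe path, and resource values are placeholders you would adapt to your application), requests/limits and health probes look like this in a Deployment:

```bash
# Apply a Deployment that sets resource requests/limits and health probes
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                        # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      containers:
      - name: my-app
        image: my-registry/my-app:1.0 # placeholder image
        ports:
        - containerPort: 8080
        resources:
          requests: { cpu: 250m, memory: 256Mi }
          limits:   { cpu: 500m, memory: 512Mi }
        readinessProbe:               # only receive traffic when the app is ready
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:                # restart the container if it stops responding
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 15
          periodSeconds: 20
EOF
```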
Most Kubernetes issues are not platform failures or bugs; they are misconfigurations.
Final Thoughts
By following the structured approach above, you can solve around 90% of Kubernetes issues in minutes. This guide gives you the foundation, and the upcoming parts will make you production-ready.