Last Updated: January 2026
Introduction
Kubernetes is a powerful container orchestration platform, but when something breaks, troubleshooting can feel like a daunting task. Pods crash, services stop responding, deployments fail, and when you try to debug, the logs don’t always tell the full story.
If the platform breaks and you are the engineer managing it, you probably have questions like:
- Why is my Pod not up and running?
- Why is my application down even though the pods are healthy?
- What steps should I follow to troubleshoot this issue?
Then thank me later, because this is the exact guide for you.
This article, Kubernetes Troubleshooting Guide: A Complete Step-by-Step Approach, explains how to troubleshoot Kubernetes issues so that anyone reading it gets a clear picture of which resource to look at when something breaks in the cluster.
This article works for:
- Self-managed Kubernetes
- Managed Kubernetes services on cloud platforms (GKE/EKS/AKS)
- Production and non-production clusters
- Enterprise Kubernetes (OpenShift)
“This will be Part 1 of my multi-part Kubernetes troubleshooting series.”
Who Can Use This Guide
- DevOps engineers (from beginner to senior level)
- Anyone who has just started learning Kubernetes
- Developers who run their apps on Kubernetes
- SREs who need a structured troubleshooting approach
Why Kubernetes Troubleshooting Feels Hard
Kubernetes issues are often complex and hard to debug because the problems are distributed. Errors can surface at multiple layers, and Kubernetes may show you the symptoms of an error rather than its root cause.
Multiple Failure Layers
Kubernetes has multiple failure layers:
- Application
- Pod
- Container
- Network
- Storage
- Node
- Control Plane
The Golden Rule
Never jump to conclusions. Follow a systematic approach whenever an issue occurs; you can use the same top-to-bottom order listed in the failure layers above. Do not troubleshoot resources at random. By answering four simple questions, you can solve most Kubernetes issues.
- What resource is broken?
- Where is it broken?
- Why is it broken?
- How to prevent it?
A Step-by-Step Approach to Kubernetes Troubleshooting
Step 1. Identify the Scope of the Issue
Ask yourself:
- Is a single pod or container affected?
- Are one or more services affected?
- Is the entire cluster affected?
Commands to run to check the scope (see the example after this list):
- kubectl get pods -A
- kubectl get svc -A
- kubectl get nodes
If a single pod is failing, the issue is likely related to that pod or its containers, and you should focus on debugging at the pod/container level. If many pods across namespaces are failing, the issue is more likely at the node or cluster level.
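A minimal sketch of how this scope check looks in practice (the extra flags here are optional refinements on the commands above):

```bash
# List every pod that is not in the Running phase, across all namespaces
# (completed Job pods in the Succeeded phase will also show up here)
kubectl get pods -A --field-selector=status.phase!=Running

# Check whether any node is NotReady or under pressure
kubectl get nodes -o wide

# Recent cluster-wide events, sorted by time, often point to the failing layer
kubectl get events -A --sort-by=.lastTimestamp | tail -20
```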
Step 2. Checking Pod Status
Pods are usually the first place where a Kubernetes cluster surfaces errors, so start by checking their status and restart counts, as shown below.
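For example (the namespace name below is a placeholder), look at the STATUS and RESTARTS columns first; values like CrashLoopBackOff, ImagePullBackOff, or Pending already narrow down the problem:

```bash
# Pod status in a specific namespace; -o wide also shows which node each pod is scheduled on
kubectl get pods -n my-namespace -o wide

# Watch pods live to catch restart loops as they happen
kubectl get pods -n my-namespace -w
```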
Step 3. Describing the Pod (Most Problems Get Solved Here)
Run kubectl describe pod <pod-name> (see the example after this list) and look for:
- The Events section, for recent events related to pod health and errors
- Volume mount errors
- Image pull errors, such as ImagePullBackOff or missing image manifests
- Resource shortage errors and failed health probes
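A quick sketch, assuming a pod called my-app-pod in the namespace my-namespace (both placeholders):

```bash
# Describe the failing pod; the Events section at the bottom is usually the most useful part
kubectl describe pod my-app-pod -n my-namespace

# Pull only the events recorded against that pod
kubectl get events -n my-namespace --field-selector involvedObject.name=my-app-pod
```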
Step 4. Check Container Logs
Run kubectl logs <pod-name> (see the example after this list) and look for:
- Log lines about missing application configuration
- Log lines about failed database connections
- Permission-related errors
- OOMKilled messages
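A few log variations worth knowing (pod, namespace, and container names below are placeholders):

```bash
# Logs from the current container
kubectl logs my-app-pod -n my-namespace

# If the pod runs multiple containers, pick one explicitly
kubectl logs my-app-pod -n my-namespace -c my-container

# Logs from the previous (crashed) container instance -- essential for restart loops
kubectl logs my-app-pod -n my-namespace --previous

# Follow logs live while you reproduce the issue
kubectl logs my-app-pod -n my-namespace -f
```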
Step 5. Verify Resource Requests and Limits
Setting CPU/memory requests and limits higher or lower than what the application container actually needs can lead to silent failures.
- kubectl describe pod <- shows the CPU and memory requests/limits configured on the pod
- kubectl top pod <- gives a quick snapshot of CPU and memory usage per pod
- kubectl top node <- gives a quick snapshot of CPU and memory usage per node
While debugging CPU and memory issues, look for common errors like these (a worked example follows the list):
- OOMKilled
- Insufficient CPU
- Insufficient Memory
- Memory limits too low
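A minimal sketch of these checks (pod, deployment, and namespace names, as well as the request/limit values, are placeholders; kubectl top needs the metrics-server add-on to be installed):

```bash
# Requests and limits currently configured on the pod's containers
kubectl describe pod my-app-pod -n my-namespace | grep -E -A4 "Limits|Requests"

# Actual usage (requires metrics-server)
kubectl top pod my-app-pod -n my-namespace
kubectl top node

# One way to adjust requests/limits on a Deployment from the command line
kubectl set resources deployment my-app -n my-namespace \
  --requests=cpu=250m,memory=256Mi --limits=cpu=500m,memory=512Mi
```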
Step 6. Troubleshooting Node-Level Issues
When pods keep failing on specific nodes, always check the health of those nodes.
Commands to check node health (see the example after this list):
- kubectl get nodes
- kubectl describe node node-name
Common issues observed with Nodes:
- Kubelet stopped
- NotReady
- DiskPressure
- MemoryPressure
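For example (my-node is a placeholder for the actual node name):

```bash
# Node status; NotReady or SchedulingDisabled shows up here
kubectl get nodes -o wide

# Conditions such as Ready, MemoryPressure, and DiskPressure are in the describe output
kubectl describe node my-node | grep -A10 "Conditions:"

# Events recorded against the node (kubelet restarts, evictions, and so on)
kubectl get events -A --field-selector involvedObject.name=my-node
```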
Step 7. Troubleshooting Network Issues
Sometimes the pods are running fine, but the application they serve is unreachable.
Commands to troubleshoot networking issues (see the example after this list):
- kubectl get svc
- kubectl describe svc service-name
- kubectl get endpoints
Common Networking Mistakes Observed
- Wrong Service selectors <- compare the selector with the pod labels; a selector mismatch is a very common cause
- Mismatched Ports
- Failed DNS
- NetworkPolicy blocking traffic
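A sketch of these checks, assuming a Service called my-service in my-namespace (both placeholders); the last command spins up a temporary busybox pod to test DNS resolution:

```bash
# Does the Service have endpoints? An empty list usually means a selector/label mismatch
kubectl get endpoints my-service -n my-namespace

# Compare the Service selector with the labels on the pods it should match
kubectl get svc my-service -n my-namespace -o jsonpath='{.spec.selector}'
kubectl get pods -n my-namespace --show-labels

# Test DNS resolution from a throwaway pod
kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup my-service.my-namespace.svc.cluster.local
```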
Step 8. Troubleshooting Storage
The most common symptom of a storage problem is pods stuck in the Pending state.
Commands to check storage issues (see the example after this list):
- kubectl get pvc
- kubectl describe pvc pvc-name
Common storage errors observed:
- PVC pending
- Permission issue
- CSI driver issues <- the driver pods may not be running
- Mount failures
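A quick sketch (PVC and namespace names are placeholders; the CSI driver check depends on which driver and namespace your cluster uses):

```bash
# Is the claim Bound or stuck in Pending?
kubectl get pvc -n my-namespace
kubectl describe pvc my-data-pvc -n my-namespace

# Does the expected StorageClass exist, and is one marked as default?
kubectl get storageclass

# Are the CSI driver pods healthy?
kubectl get pods -n kube-system | grep -i csi
```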
Most Common Kubernetes Errors Observed
Across the steps above, the errors you will run into most often are ImagePullBackOff, CrashLoopBackOff, OOMKilled, pods stuck in Pending (scheduling or PVC problems), and NotReady nodes.
Troubleshooting Checklist
- Pod status
- Pod events
- Container logs
- Container limits
- Resource limits
- Node health
- Services and Endpoints
- PVC and storage
- Cluster events
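If you want to run through this checklist quickly, here is a minimal sketch that wraps the commands from the earlier steps into one script (the namespace argument and section headings are my own additions):

```bash
#!/usr/bin/env bash
# Quick pass over the troubleshooting checklist for one namespace.
# Usage: ./k8s-checklist.sh <namespace>   (defaults to "default")
NAMESPACE="${1:-default}"

echo "== Pod status and restarts ==" && kubectl get pods -n "$NAMESPACE" -o wide
echo "== Recent events =="           && kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -20
echo "== Node health =="             && kubectl get nodes
echo "== Services and endpoints ==exclamation" | sed 's/exclamation//' && kubectl get svc,endpoints -n "$NAMESPACE"
echo "== PVCs and storage =="        && kubectl get pvc -n "$NAMESPACE"
echo "== Resource usage (needs metrics-server) ==" && kubectl top pod -n "$NAMESPACE"
```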
Kubernetes Best Practices to Avoid Issues
- Setting resource requests and limits (see the sketch after this list)
- Using health probes
- Implementing centralized logging and monitoring
- Using GitOps for continuous change delivery
- Always testing changes in a staging environment before rolling them out to production
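As a rough sketch of the first two practices (the name, image, port, probe path, and resource values are placeholders you would adapt to your application), requests/limits and health probes look like this in a Deployment:

```bash
# Apply a Deployment that sets resource requests/limits and health probes
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                        # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      containers:
      - name: my-app
        image: my-registry/my-app:1.0 # placeholder image
        ports:
        - containerPort: 8080
        resources:
          requests: { cpu: 250m, memory: 256Mi }
          limits:   { cpu: 500m, memory: 512Mi }
        readinessProbe:               # only receive traffic when the app is ready
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:                # restart the container if it stops responding
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 15
          periodSeconds: 20
EOF
```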
Most Kubernetes issues are not platform failures or bugs; they are misconfigurations.
Final Thoughts
By following the structured approach above, you can solve around 90% of Kubernetes issues in minutes. This guide gives you the foundation, and the upcoming parts will make you production-ready.