Last Updated: January 2026
I’ll never forget the night our entire production cluster started evicting pods. 3 AM. Phone buzzing. Alerts screaming.
“Node disk full. Pods being evicted.”
I scrambled to my laptop, still half-asleep, to see 12 out of 15 nodes showing DiskPressure. Pods were getting killed left and right. Database connections dropping. API returning 503s. Full-blown outage.
The culprit? Docker images that nobody bothered to clean up. Over six months, we’d accumulated 180GB of unused container images on every node. Add application logs that weren’t being rotated, some stuck mounts from failed pods, and suddenly we went from 20% disk usage to 95% overnight.
Took me four hours to stabilize the cluster that night. Learned more about Kubernetes node storage than I ever wanted to know.
If you're seeing "DiskPressure," "eviction," or "disk full" errors, you're in the right place. In this article on Kubernetes node storage errors, let me show you how to fix these issues and make sure they never wake you up at 3 AM.
Understanding Kubernetes Node Storage Errors (The Basics)
Before we dive into errors, here’s what fills up your Kubernetes nodes:
Container Images – Every image you’ve ever pulled sits there forever (until cleaned)
Container Logs – stdout/stderr from your apps, can grow massive
Container Layers – Writable layers from running containers
Ephemeral Volumes – emptyDir, temp files, caches
Persistent Volumes – Attached storage (usually separate, but not always)
System Files – OS, kubelet logs, system temp files
When any of these fill up, bad things happen. Let’s fix them.
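Before anything breaks, it helps to know which of these categories is eating a given node's disk. This is a quick snapshot sketch; the paths are the common defaults and vary by runtime and distro (`/var/lib/containerd` will be `/var/lib/docker` on Docker-based nodes):

```shell
#!/bin/sh
# Per-category disk snapshot on a node. Paths are common defaults;
# adjust for your container runtime and distro.
for d in /var/lib/containerd /var/lib/docker /var/log/pods /var/lib/kubelet /var/log; do
  if [ -d "$d" ]; then du -sh "$d" 2>/dev/null; fi
done

# And the overall picture for the root filesystem
df -h /
```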
## 1. Node Has DiskPressure (The Most Common Kubernetes Node Storage Error)
What You’ll See
```
$ kubectl get nodes
NAME       STATUS                     ROLES    AGE
worker-1   Ready,SchedulingDisabled   <none>   45d
worker-2   Ready,DiskPressure         <none>   45d
worker-3   Ready                      <none>   45d
```
That “DiskPressure” status means the node is running out of disk space. Kubernetes won’t schedule new pods there, and it might start evicting existing ones.
Check What’s Happening
```bash
# Describe the node to see the details
kubectl describe node worker-2

# Look for this section:
Conditions:
  Type           Status  Reason                  Message
  ----           ------  ------                  -------
  DiskPressure   True    KubeletHasDiskPressure  disk usage exceeds threshold
```
Why DiskPressure Triggers
Kubernetes monitors two thresholds:
Soft eviction threshold (default: 90% disk)
- Warning state
- New pods won’t be scheduled
- Existing pods keep running
Hard eviction threshold (default: 95% disk)
- Critical state
- Kubernetes starts evicting pods
- Evicts lowest priority pods first
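Both thresholds boil down to a comparison against current disk usage. Here is a rough sketch of that decision logic; it's illustrative only (kubelet implements this internally), using the default percentages described above:

```shell
#!/bin/sh
# Illustrative sketch of kubelet's eviction decision.
# Thresholds are the defaults described above, as percent of disk used.
SOFT=90
HARD=95

check_pressure() {
  usage=$1  # percent of disk used
  if [ "$usage" -ge "$HARD" ]; then
    echo "hard: evicting pods, lowest priority first"
  elif [ "$usage" -ge "$SOFT" ]; then
    echo "soft: DiskPressure set, no new pods scheduled"
  else
    echo "ok"
  fi
}

# Check the root filesystem on this machine
check_pressure "$(df --output=pcent / | tail -1 | tr -d ' %')"
```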
The Quick Fix
```bash
# SSH to the node (or use kubectl debug)
kubectl debug node/worker-2 -it --image=ubuntu

# Check disk usage
df -h

# Common culprits:
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/xvda1      100G   94G  6.0G    95%  /
# ↑ This is your problem
```
Solution 1: Clean Up Docker/Containerd Images
```bash
# List all images and their sizes
crictl images | sort -k7 -h

# Remove unused images
crictl rmi --prune

# For Docker (older clusters)
docker system prune -a --volumes -f
```
This freed up 40GB for me that night.
Solution 2: Clean Up Old Container Logs
```bash
# Find large log files
find /var/log/pods -type f -size +100M -exec ls -lh {} \;

# Truncate huge logs (careful!)
find /var/log/pods -type f -size +500M -exec truncate -s 0 {} \;

# Or delete old pod logs
find /var/log/pods -type f -mtime +7 -delete
```
Solution 3: Configure Automatic Cleanup
Edit kubelet configuration:
```yaml
# /var/lib/kubelet/config.yaml
imageGCHighThresholdPercent: 85   # Start GC when disk at 85%
imageGCLowThresholdPercent: 80    # Stop GC when disk at 80%
imageMinimumGCAge: 2m             # Don't GC images newer than 2m

evictionHard:
  imagefs.available: "10%"        # Evict pods if <10% free
  nodefs.available: "10%"
evictionSoft:
  imagefs.available: "15%"        # Soft warning at 15%
  nodefs.available: "15%"
evictionSoftGracePeriod:
  imagefs.available: "2m"
  nodefs.available: "2m"
```
Restart kubelet:
```bash
systemctl restart kubelet
```
Solution 4: Increase Disk Size (Last Resort)
If you’ve cleaned everything and still hitting limits:
```bash
# For AWS EBS
aws ec2 modify-volume --volume-id vol-xyz --size 200

# Then expand the filesystem
sudo resize2fs /dev/xvda1

# For GCP
gcloud compute disks resize disk-name --size=200GB
```
## 2. Node Disk Full (No Space Left on Device)
The Error
```
$ kubectl describe pod app-xyz
Events:
  Warning  FailedCreatePodSandBox  Failed to create pod sandbox:
  rpc error: code = Unknown desc = failed to create containerd task:
  write /var/lib/containerd: no space left on device
```
This is worse than DiskPressure. The disk is actually full – 100% used. Nothing can write.
Immediate Triage
```bash
# Check which filesystem is full
kubectl debug node/worker-2 -it --image=ubuntu
df -h

# Find what's eating space
du -sh /* | sort -h | tail -20

# Common hogs:
# /var/lib/containerd (or /var/lib/docker)
# /var/log
# /tmp
```
My Emergency Cleanup Script
I keep this handy for emergencies:
```bash
#!/bin/bash
# emergency-cleanup.sh

echo "=== Disk Usage Before ==="
df -h /

echo "=== Cleaning container images ==="
crictl rmi --prune
# or: docker system prune -a -f

echo "=== Cleaning old logs ==="
journalctl --vacuum-time=2d
find /var/log/pods -type f -mtime +3 -delete
find /tmp -type f -mtime +1 -delete

echo "=== Cleaning dead containers ==="
crictl rm $(crictl ps -a -q --state=exited)

echo "=== Disk Usage After ==="
df -h /
```
Finding the Space Hogs
```bash
# Top 20 largest directories
du -ah /var | sort -h | tail -20

# If containerd/docker is huge:
du -sh /var/lib/containerd/*
# or
du -sh /var/lib/docker/*

# If logs are huge:
du -sh /var/log/*
```
Real Example from My 3 AM Incident
```
$ du -sh /var/lib/containerd/*
12G    /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
156G   /var/lib/containerd/io.containerd.content.v1.content
# ↑ 156GB of container images!

# Cleaned up unused images
$ crictl rmi --prune
Deleted: sha256:abc123... (45GB)
Deleted: sha256:def456... (32GB)
Deleted: sha256:ghi789... (28GB)

$ du -sh /var/lib/containerd/*
12G    /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
51G    /var/lib/containerd/io.containerd.content.v1.content
# ↑ Down to 51GB - freed 105GB!
```
## 3. Image Filesystem Full (ImageGCFailed)
The Error
```
$ kubectl describe node worker-2
Conditions:
  Message:  ImageGCFailed: wanted to free 12685876224 bytes, but freed 0 bytes
  Reason:   ImageGCFailed
```
The container image filesystem is full, and garbage collection isn’t helping.
Why This Happens
Problem 1: Images in use can’t be deleted
All your pods are using different images, so there’s nothing to GC.
```bash
# See what images are in use
crictl images | grep -v "SIZE"
# Every image here is being used by at least one pod
```
Problem 2: Many large images
```bash
crictl images | sort -k7 -hr | head -10

# You might see images like:
# tensorflow:2.9   15GB
# cuda-base:11.8   12GB
# ml-model:v3      8GB
```
The Fix
Option 1: Use smaller base images
```dockerfile
# Before
FROM python:3.9
# Image size: 915MB

# After
FROM python:3.9-slim
# Image size: 122MB

# Even better
FROM python:3.9-alpine
# Image size: 47MB
```
Option 2: Set image pull policy
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:v1.0.0            # Use specific tags!
    imagePullPolicy: IfNotPresent  # Don't pull if exists
```
Option 3: Use a separate partition for images
Mount a larger volume for container images:
```bash
# Create and mount a larger volume for /var/lib/containerd
# (device name is illustrative; stop containerd first and copy existing data over)
mount /dev/nvme1n1 /var/lib/containerd
# This way image storage doesn't compete with system storage
```
Option 4: Implement image pruning cronjob
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: image-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          hostNetwork: true
          hostIPC: true
          containers:
          - name: cleanup
            image: busybox
            command:
            - nsenter
            - --target
            - "1"
            - --mount
            - --uts
            - --ipc
            - --net
            - --pid
            - --
            - crictl
            - rmi
            - --prune
            securityContext:
              privileged: true
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/worker: "true"
```
## 4. Ephemeral Storage Exceeded (Why Your Pod Gets Evicted Suddenly)
The Error
```
$ kubectl describe pod myapp-xyz
Status:   Failed
Reason:   Evicted
Message:  Pod ephemeral local storage usage exceeds the total limit of containers 1Gi
```
Your pod was using too much ephemeral storage (emptyDir, logs, temp files) and got evicted.
What Is Ephemeral Storage?
Everything that’s not a persistent volume:
- emptyDir volumes
- Container logs (stdout/stderr)
- Container writable layer (files created inside container)
- /tmp and /var/tmp inside containers
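You can get a rough in-container view of the temp-directory portion with `du` (run it via `kubectl exec`). The paths below are the usual scratch locations; adjust them for wherever your app actually writes:

```shell
#!/bin/sh
# Rough per-directory ephemeral usage from inside a container.
# /tmp and /var/tmp are the usual suspects; add your app's scratch dirs.
for d in /tmp /var/tmp /var/cache; do
  if [ -d "$d" ]; then du -sh "$d" 2>/dev/null; fi
done
```

This won't show the writable-layer files scattered elsewhere, but it catches the most common offenders.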
Check Current Usage
```bash
# See pod's ephemeral storage usage
kubectl describe pod myapp-xyz | grep -A 10 "Ephemeral Storage"

# Example output:
Ephemeral Storage:  1.5Gi  # Used
Limits:             1Gi    # Allowed - EXCEEDED!
```
The Fix
Solution 1: Increase ephemeral storage limit
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        ephemeral-storage: "2Gi"  # Request
      limits:
        ephemeral-storage: "4Gi"  # Limit
```
Solution 2: Use persistent volume instead of emptyDir
```yaml
# Before - using emptyDir
volumes:
- name: cache
  emptyDir: {}

# After - using PVC
volumes:
- name: cache
  persistentVolumeClaim:
    claimName: cache-pvc
```
Solution 3: Clean up temp files regularly
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:latest
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "rm -rf /tmp/*"]
  - name: cleaner
    image: busybox
    command:
    - sh
    - -c
    - while true; do find /tmp -mtime +1 -delete; sleep 3600; done
    volumeMounts:
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}
```
Solution 4: Reduce log verbosity
```yaml
env:
- name: LOG_LEVEL
  value: "INFO"  # Instead of DEBUG
```
My Logging Lesson
We had a Java app that logged every SQL query in debug mode. In production. 50,000 requests per minute. Each request had 10+ queries. That’s 500,000 log lines per minute.
One pod filled up 10GB in ephemeral storage in under an hour. Got evicted. Restarted. Filled up again. Evicted. Loop.
Changed LOG_LEVEL to INFO. Problem solved. Felt stupid.
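The arithmetic from that incident is worth sketching, because it shows how fast debug logging burns through an ephemeral-storage limit. The request and query counts are from the story above; the bytes-per-line figure is an assumed average:

```shell
#!/bin/sh
# Back-of-envelope log growth. REQ_PER_MIN and QUERIES_PER_REQ are from
# the incident above; BYTES_PER_LINE is an assumed average.
REQ_PER_MIN=50000
QUERIES_PER_REQ=10
BYTES_PER_LINE=200

LINES_PER_MIN=$((REQ_PER_MIN * QUERIES_PER_REQ))
MB_PER_HOUR=$((LINES_PER_MIN * BYTES_PER_LINE * 60 / 1024 / 1024))
echo "${LINES_PER_MIN} lines/min, about ${MB_PER_HOUR} MB/hour"
```

Even at a modest 200 bytes per line, that's roughly 5.7GB an hour; our real SQL log lines were longer, which is how one pod filled 10GB so quickly.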
## 5. Kubelet Volume Cleanup Failed (The Volume Is Still There)
The Error
```
$ kubectl describe node worker-2
Events:
  Warning  VolumeCleanupFailed  Orphaned pod "abc-123" found, but error occurred during volume cleanup:
  error cleaning volume mounts: context deadline exceeded
```
Kubelet couldn’t clean up volumes from deleted pods. They’re stuck on the node.
Why This Happens
- Pod deleted but containers still running
- Volumes still mounted (busy)
- NFS server unreachable
- Device still attached to old process
Check for Orphaned Volumes
```bash
kubectl debug node/worker-2 -it --image=ubuntu

# Check kubelet volume directory
ls -la /var/lib/kubelet/pods/
# You'll see directories for pods that don't exist anymore

# Check what's mounted
mount | grep kubelet
# Look for orphaned mounts
```
The Fix
Step 1: Try graceful cleanup
```bash
# Restart kubelet (it will retry cleanup)
systemctl restart kubelet

# Wait a minute, check if cleaned up
ls /var/lib/kubelet/pods/
```
Step 2: Force unmount stuck volumes
```bash
# Find stuck mounts
mount | grep kubelet | grep "abc-123"

# Force unmount
umount -f /var/lib/kubelet/pods/abc-123-def-456/volumes/kubernetes.io~csi/pvc-xyz/mount

# If that fails, lazy unmount
umount -l /var/lib/kubelet/pods/abc-123-def-456/volumes/kubernetes.io~csi/pvc-xyz/mount
```
Step 3: Clean up orphaned directories
```bash
# Get list of current pods
kubectl get pods -A -o json | jq -r '.items[].metadata.uid' | sort > /tmp/active-pods

# Get list of pod directories
ls /var/lib/kubelet/pods/ | sort > /tmp/pod-dirs

# Find orphans
comm -13 /tmp/active-pods /tmp/pod-dirs > /tmp/orphans

# Remove orphaned directories (careful!)
while read pod_uid; do
  echo "Removing orphaned pod directory: $pod_uid"
  rm -rf /var/lib/kubelet/pods/$pod_uid
done < /tmp/orphans
```
Step 4: Prevent future issues
Set proper termination grace period:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  terminationGracePeriodSeconds: 30  # Give time to clean up
  containers:
  - name: app
    image: myapp:latest
```
## 6. Mount Propagation Failed (A Tricky Error)
The Error
```
$ kubectl describe pod myapp-xyz
Events:
  Warning  FailedMount  MountVolume.SetUp failed:
  rpc error: code = Internal desc = mount propagation not set correctly
```
This is a tricky one. Mount propagation controls how mounts are shared between host and containers.
What Mount Propagation Does
- None – No propagation (default, usually works)
- HostToContainer – Host mounts visible in container
- Bidirectional – Mounts shared both ways
Why It Fails
Usually happens when:
- kubelet started without proper mount propagation support
- Node OS doesn’t support mount propagation
- systemd configuration wrong
Check Mount Propagation
```bash
# On the node
cat /proc/self/mountinfo | grep kubelet
# Look for "shared:" tag
# If missing, propagation isn't working
```
The Fix
Step 1: Enable mount propagation in kubelet
```bash
# Edit kubelet systemd service
vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

# Make sure MountFlags is set correctly:
[Service]
MountFlags=shared

# Reload and restart
systemctl daemon-reload
systemctl restart kubelet
```
Step 2: Check Docker/containerd configuration
```bash
# For Docker
cat /etc/docker/daemon.json
{
  "storage-driver": "overlay2",
  "exec-opts": ["native.cgroupdriver=systemd"],
  "features": {
    "buildkit": true
  }
}

# Restart Docker
systemctl restart docker
```
Step 3: Use correct mount propagation in pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:latest
    volumeMounts:
    - name: data
      mountPath: /data
      mountPropagation: HostToContainer  # Explicitly set
```
When I Hit This
We were running Kubernetes on custom Linux distro. Mount propagation wasn’t compiled into the kernel. Took a whole day to figure out. Rebuilt kernel with proper flags. Never again.
## 7. Stale Mounts on Node (The Hidden Volume Killer)
The Symptoms
```
# Pod gets stuck
$ kubectl get pods
NAME        READY  STATUS             RESTARTS  AGE
myapp-xyz   0/1    ContainerCreating  0         5m

# Events show:
Warning  FailedMount  Unable to attach or mount volumes:
unmount failed: exit status 1
Unmounting arguments: /var/lib/kubelet/pods/.../volumes/...
Output: target is busy
```
Old mounts from deleted pods are still there, preventing new pods from starting.
How This Happens
- Pod deleted
- Volume unmount fails (maybe NFS was down)
- Mount point still exists
- New pod tries to use same volume
- Can’t mount (path busy)
Find Stale Mounts
```bash
kubectl debug node/worker-2 -it --image=ubuntu

# List all kubelet mounts
mount | grep kubelet

# Compare with actual running pods
kubectl get pods -o wide | grep worker-2

# Mounts for non-existent pods = stale
```
The Fix
Option 1: Unmount and clean up
```bash
# Find the stale mount
mount | grep "pvc-abc123"

# Kill any processes using it
lsof /var/lib/kubelet/pods/.../volumes/.../mount
kill -9 <PID>

# Unmount
umount /var/lib/kubelet/pods/.../volumes/.../mount

# If busy, force it
umount -f /var/lib/kubelet/pods/.../volumes/.../mount

# If still busy, lazy unmount (last resort)
umount -l /var/lib/kubelet/pods/.../volumes/.../mount
```
Option 2: Restart kubelet
```bash
# Kubelet restart will retry cleanup
systemctl restart kubelet

# Check if mounts cleaned up
mount | grep kubelet
```
Option 3: Node reboot (nuclear option)
```bash
# Cordon node first
kubectl cordon worker-2

# Drain pods
kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data

# Reboot
sudo reboot

# After reboot, uncordon
kubectl uncordon worker-2
```
Prevention
Set shorter finalizer timeouts:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  finalizers:
  - kubernetes.io/pvc-protection  # Ensures cleanup
spec:
  terminationGracePeriodSeconds: 30
```
Monitor for stale mounts:
```bash
#!/bin/bash
# check-stale-mounts.sh

ACTIVE_PODS=$(kubectl get pods -A -o json | jq -r '.items[].metadata.uid')

for mount in $(mount | grep kubelet | awk '{print $3}'); do
  POD_UID=$(echo $mount | grep -oP '/pods/\K[^/]+')
  if ! echo "$ACTIVE_PODS" | grep -q "$POD_UID"; then
    echo "Stale mount found: $mount"
    # Alert or cleanup
  fi
done
```
## 8. Node Reboot Causing Volume Attach Failure
The Problem
```
# After node reboots
$ kubectl get pods
NAME        READY  STATUS             RESTARTS  AGE
myapp-xyz   0/1    ContainerCreating  0         5m

$ kubectl describe pod myapp-xyz
Events:
  Warning  FailedAttachVolume  Multi-Attach error:
  volume is still attached to node "worker-2-old" but node is gone
```
Node rebooted (or crashed), but Kubernetes thinks volumes are still attached to the old node.
Why This Happens
- Node crashes or reboots unexpectedly
- Volumes attached to that node (EBS, PD, etc.)
- Cloud provider doesn’t know node is dead
- Volumes stuck in “attached” state to dead node
- Can’t attach to new node
Check Volume Attachments
```bash
# See volume attachments
kubectl get volumeattachment

NAME     ATTACHER         PV       NODE      ATTACHED
csi-abc  ebs.csi.aws.com  pvc-xyz  worker-2  true

# If worker-2 is dead but ATTACHED=true, that's the problem
```
The Fix
Option 1: Delete stale VolumeAttachment
```bash
# Delete the attachment
kubectl delete volumeattachment csi-abc

# Volume will detach from dead node
# Then reattach to new node automatically
```
Option 2: Force detach in cloud provider
For AWS:
```bash
# Get volume ID
kubectl get pv pvc-xyz -o jsonpath='{.spec.csi.volumeHandle}'

# Force detach
aws ec2 detach-volume --volume-id vol-abc123xyz --force

# Delete VolumeAttachment in Kubernetes
kubectl delete volumeattachment csi-abc
```
For GCP:
```bash
# Get disk name
kubectl get pv pvc-xyz -o jsonpath='{.spec.csi.volumeHandle}'

# Detach disk
gcloud compute instances detach-disk old-node-name --disk=disk-name

# Delete VolumeAttachment
kubectl delete volumeattachment csi-abc
```
Option 3: Delete and recreate the pod
```bash
# Force delete pod
kubectl delete pod myapp-xyz --force --grace-period=0

# Recreate (if part of deployment, it happens automatically)
```
Prevention
Set proper timeouts:
```
# In CSI driver configuration
--timeout=300s  # Wait 5 minutes before giving up
```
Use pod disruption budgets:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: myapp
```
Monitor node health:
```bash
# Alert on NotReady nodes
kubectl get nodes -o json | jq -r '.items[] |
  select(.status.conditions[] |
    select(.type=="Ready" and .status!="True")) |
  .metadata.name'
```
Quick Troubleshooting Checklist
When nodes have storage issues:
```bash
# 1. Check node conditions
kubectl describe node <node-name> | grep -A 10 Conditions

# 2. Check disk usage
kubectl debug node/<node-name> -it --image=ubuntu
df -h

# 3. Find what's using space
du -sh /* | sort -h | tail -10

# 4. Check for DiskPressure
kubectl get nodes | grep DiskPressure

# 5. Check ephemeral storage usage
kubectl describe pod <pod-name> | grep -i ephemeral

# 6. Check for stale mounts
mount | grep kubelet

# 7. Check volume attachments
kubectl get volumeattachment

# 8. Check for orphaned pods
ls /var/lib/kubelet/pods/ | wc -l
kubectl get pods -A | wc -l
# If first number way bigger = orphaned pods
```
My Node Health Monitoring Script
I run this on a cronjob to catch issues early:
```bash
#!/bin/bash
# node-health-check.sh

for node in $(kubectl get nodes -o name); do
  NODE_NAME=$(basename $node)

  # Check disk pressure
  if kubectl get node $NODE_NAME -o json | jq -r '.status.conditions[] | select(.type=="DiskPressure") | .status' | grep -q True; then
    echo "ALERT: $NODE_NAME has DiskPressure"
    # Send alert
  fi

  # Check disk usage (kubectl top doesn't report disk, so run df on the node;
  # node debug pods mount the host filesystem at /host)
  DISK_USAGE=$(kubectl debug node/$NODE_NAME -q --image=busybox -- df /host | tail -1 | awk '{print $5}' | tr -d '%')
  if [ "$DISK_USAGE" -gt 80 ]; then
    echo "WARNING: $NODE_NAME disk at ${DISK_USAGE}%"
    # Send warning
  fi

  # Check for orphaned pod directories: compare directories on disk
  # against pods actually scheduled to this node
  SCHEDULED=$(kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME --no-headers | wc -l)
  ON_DISK=$(kubectl debug node/$NODE_NAME -q --image=busybox -- ls /host/var/lib/kubelet/pods | wc -l)
  if [ "$ON_DISK" -gt $((SCHEDULED + 10)) ]; then
    echo "WARNING: $NODE_NAME has orphaned pod directories"
  fi
done
```
Best Practices I Follow Now
1. Set Resource Limits on Everything
```yaml
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"
```
2. Configure Kubelet Garbage Collection
```yaml
# /var/lib/kubelet/config.yaml
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "10%"
```
3. Use Log Rotation
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:latest
    # In Dockerfile or command:
    # - Configure log rotation
    # - Use structured logging
    # - Send logs to an external system
```
4. Monitor Disk Usage
```yaml
# Prometheus alerts
- alert: NodeDiskPressure
  expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
  for: 5m

- alert: NodeDiskUsageHigh
  expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.85
  for: 10m
```
5. Do Regular Node Maintenance
```bash
# Weekly cleanup script
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-cleanup
  namespace: kube-system
spec:
  schedule: "0 3 * * 0"  # Sunday 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          hostNetwork: true
          containers:
          - name: cleanup
            image: ubuntu
            command:
            - bash
            - -c
            - |
              crictl rmi --prune
              journalctl --vacuum-time=7d
              find /var/log/pods -mtime +7 -delete
            securityContext:
              privileged: true
          restartPolicy: OnFailure
EOF
```
Final Thoughts
Node storage errors taught me that Kubernetes isn't magic. It's running on real servers with real disks that fill up with real garbage.
That 3 AM wake-up call cost us 4 hours of downtime and taught me:
Prevention:
- Monitor disk usage before it’s a problem
- Set up garbage collection correctly
- Use ephemeral storage limits
- Clean up regularly
When it breaks:
- Check disk usage first (`df -h`)
- Find what's using space (`du -sh /*`)
- Clean up images (`crictl rmi --prune`)
- Check for orphaned mounts
- Restart kubelet if needed
The best fix: don't let node storage errors happen in the first place. Set up monitoring, configure kubelet properly, and keep cleanup scripts ready so you can spot problems before they page you.
I haven't had a node storage outage since implementing these practices. Knock on wood.
Have a node storage horror story? Share it in the comments – misery loves company!