Last Updated: January 2026
In this post I'm going to share my experience with a live Kubernetes certificate expiry incident that I faced in our production environment, and the steps I took to fix it.
So last Thursday started off pretty normal. Got my coffee, checked Slack, everything looked fine. Our production cluster was running smooth, no alerts, nothing weird in the logs. I was actually planning to leave early for once.
Then around 11:30 AM my phone exploded with alerts. PagerDuty, Slack, email – all at the same time. The kind of notification storm that makes your stomach drop because you know something is seriously wrong.
Everything Just… Stopped
I’m talking complete failure. kubectl commands timing out. Pods couldn’t talk to the API server. The dashboard was dead.
My first thought was “okay someone pushed something bad” but nobody had deployed anything all morning. Then I thought maybe AWS was having issues but their status page was all green.
The Error Message That Didn’t Help At All
The logs were throwing the classic Kubernetes error:
x509: certificate has expired or is not yet valid
And I just stared at it for like 30 seconds thinking “what certificate?” Because here’s the thing – I knew our external SSL certs were fine. We use Let’s Encrypt with auto-renewal and I’d literally checked them two days ago.
Took me way too long to realize this was about the INTERNAL Kubernetes certificates. You know, the ones that nobody tells you about when you’re learning Kubernetes. The ones that just quietly exist until they don’t.
How Did This Even Happen
Quick backstory: we set up this cluster like 13 months ago. Used kubeadm because that’s what the tutorials said to use. Everything worked great. We added nodes, deployed apps, felt pretty good about ourselves.
What we didn’t know (and honestly what nobody really emphasizes) is that kubeadm generates certificates that expire after ONE YEAR. Not five years, not ten years. One. Single. Year.
And sure, there’s probably documentation about this somewhere. But when you’re rushing to get a cluster up and running, you’re not exactly reading every line of the docs. You’re copy-pasting commands and hoping for the best.
The Panic Phase
My coworker Monika came over and asked what was wrong and I just pointed at the screen. She looked at the error, looked at me, and said “oh shit” which pretty much summed it up.
We tried the obvious stuff first. Restarted some pods. Didn’t help. Restarted the nodes. Really didn’t help – actually made things worse because now those nodes couldn’t rejoin the cluster at all.
I googled “kubernetes certificate expired” and got about a million results. Half of them were for completely different cert problems. The other half were GitHub issues from 2022 that ended with “did you fix this? no response in 6 months, closing.”
Super helpful.
Finding The Actual Problem
Eventually found a command to check all the cert expiration dates:
kubeadm certs check-expiration
And yes. Everything expired. API server cert, controller manager cert, scheduler cert – all of them expired the night before. We just hadn’t noticed until pods started restarting and couldn’t authenticate anymore.
The output looked something like:
CERTIFICATE                EXPIRES                  RESIDUAL TIME
admin.conf                 Dec 15, 2023 10:20 UTC   <invalid>
apiserver                  Dec 15, 2023 10:20 UTC   <invalid>
apiserver-kubelet-client   Dec 15, 2023 10:20 UTC   <invalid>
That “invalid” really hits different when it’s your production cluster.
The Fix That Kinda Worked
Found a kubeadm command that’s supposed to renew everything:
kubeadm certs renew all
Ran it on the control plane node. It said “certificates renewed successfully” which was encouraging. Then I tried kubectl again and… nothing. Still broken.
Turns out (and this is the part that cost me another hour) you have to restart the control plane components after renewing certs. And I don’t mean like a normal restart. These are static pods so you gotta do this whole dance:
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
Same thing for controller-manager and scheduler. It’s basically forcing the kubelet to recreate the pods. Feels super hacky but that’s what works.
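That manifest-move dance can be wrapped in a small function so you don't fat-finger a path at 1 AM. This is just a sketch of the same trick, not battle-tested tooling – the directories and the pause length are parameterized (the defaults match a kubeadm control plane) so the logic can be exercised anywhere:

```shell
#!/bin/sh
# Bounce a static pod: move its manifest out of the kubelet's watched
# directory, wait for the pod to be torn down, then move it back so the
# kubelet recreates it (picking up the renewed certs).
MANIFEST_DIR="${MANIFEST_DIR:-/etc/kubernetes/manifests}"
STAGING_DIR="${STAGING_DIR:-/tmp}"
PAUSE="${PAUSE:-20}"

bounce_static_pod() {
  name="$1"
  mv "$MANIFEST_DIR/$name.yaml" "$STAGING_DIR/" || return 1
  sleep "$PAUSE"    # give the kubelet time to notice and stop the pod
  mv "$STAGING_DIR/$name.yaml" "$MANIFEST_DIR/" || return 1
}

# On a real control plane node you'd run, as root:
#   for c in kube-apiserver kube-controller-manager kube-scheduler; do
#     bounce_static_pod "$c"
#   done
```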
But Wait There’s More Problems
Got the control plane back up. kubectl started working again. Felt like a genius for about five minutes.
Then realized all the worker nodes were still screwed. They couldn’t join the cluster because their certs were also expired. And here’s where it gets fun – you can’t just run kubeadm on the workers to renew their certs. That’s not how it works.
Had to generate new join tokens on the control plane:
kubeadm token create --print-join-command
Then SSH into every single worker node and run that join command. We have 8 workers in production. This took forever.
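If you have more than a couple of workers, a loop beats doing it by hand. Here's a rough sketch of what I mean – the node names are hypothetical, and the `RUN` variable is parameterized (it defaults to ssh) so you can dry-run the flow locally before pointing it at real machines:

```shell
#!/bin/sh
# Generate one fresh join command on the control plane, then replay it on
# every worker. Node names below are placeholders for your inventory.
RUN="${RUN:-ssh}"
CONTROL_PLANE="${CONTROL_PLANE:-control-plane-1}"
WORKERS="${WORKERS:-worker-1 worker-2 worker-3}"

rejoin_workers() {
  join_cmd=$($RUN "$CONTROL_PLANE" 'kubeadm token create --print-join-command') || return 1
  for node in $WORKERS; do
    echo "rejoining $node"
    $RUN "$node" "$join_cmd"
  done
}
```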
Oh and some nodes just refused to join even with the new token. Had to completely reset them:
kubeadm reset
# then run the join command
Lost all the pods on those nodes obviously. They rescheduled on other nodes but it was messy. Monika was handling the Slack channel where everyone was asking why their apps were down and she was NOT happy with me.
The Things That Broke In Weird Ways
Some stuff started working right away after the fix. Other stuff… didn’t.
Our monitoring stack completely died. Prometheus couldn’t scrape anything. Turned out it was using a service account token that was signed by the old CA cert. Had to delete the Prometheus pods and let them restart with new tokens.
Ingress controller was being weird too. Would work for like 2 minutes then fail. Eventually figured out it had cached the old certs somehow. Deleted those pods, problem solved.
The weirdest one was our CI/CD pipeline. Jenkins kept failing to deploy stuff even though kubectl was working fine from my laptop. Took us way too long to realize Jenkins had the old kubeconfig cached. Had to regenerate it:
rm ~/.kube/config
kubeadm kubeconfig user --client-name jenkins-agent > jenkins-config
Then updated Jenkins with the new config. Why this wasn’t automatic I’ll never understand.
What We Should Have Done Differently
Okay so in hindsight (and everyone’s got perfect hindsight when the fire’s already out), here’s what we should’ve done:
Set up monitoring for cert expiration BEFORE they expired. There’s a Prometheus exporter specifically for this. We added it at like 3 PM that day. Would’ve been more useful at 3 PM the day before.
Actually read the kubeadm docs about certificate management. They’re not exciting but turns out they’re pretty important.
Test the cert renewal process in our staging environment. We have staging for a reason but apparently not for this.
Put the cert renewal commands in a runbook. I basically had to figure this out from scratch while everything was on fire. Now we have step-by-step instructions for next time.
Maybe not use kubeadm certs for production? I’ve heard managed Kubernetes services handle this automatically. Might be worth looking into.
The Auto-Renewal Setup We Added
After getting everything back up, first thing I did was set up automatic cert renewal. Not trusting myself to remember to do this manually in 12 months.
Made a cronjob that runs monthly:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cert-renewer
  namespace: kube-system
spec:
  schedule: "0 0 1 * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: cert-renewer
            image: k8s.gcr.io/kubeadm:v1.27.0
            command:
            - /bin/sh
            - -c
            - |
              kubeadm certs renew all
              kill $(pidof kube-apiserver)
          restartPolicy: OnFailure
Is this elegant? No. Did it work on the first try? Also no – we spent another hour debugging permissions issues before it would run. But eventually we got something that actually works.
Also set up alerts in Grafana to yell at us when certs are within 30 days of expiring. Better to know early and do it during business hours than have another Thursday like this one.
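For the alert itself, here's roughly what the rule could look like if you manage it as a Prometheus alerting rule. This is a sketch: the `x509_cert_not_after` metric name assumes an exporter like x509-certificate-exporter that reads the files under /etc/kubernetes/pki, so check your exporter's docs for its actual metric and label names:

```yaml
groups:
- name: kubernetes-certificates
  rules:
  - alert: KubernetesCertExpiringSoon
    # x509_cert_not_after is a Unix timestamp; fire when under 30 days remain
    expr: (x509_cert_not_after - time()) / 86400 < 30
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "A Kubernetes certificate expires in less than 30 days"
```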
Stuff I Learned About Cert Management
Certificates in Kubernetes are way more complicated than they need to be. You’ve got:
- The CA cert that signs everything
- API server certs for the API
- Kubelet certs for each node
- Service account signing keys
- Front proxy certs if you’re using aggregation
- etcd certs if you care about your data
All of these can expire independently. All of them will absolutely break your cluster if they expire. Good times.
The CA cert itself lasts 10 years by default which is why this didn’t become a problem immediately. But all the certs signed BY the CA only last 1 year. Nobody mentions this until it’s a problem.
You can check individual cert files with openssl:
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text
Look for the “Not After” date in the output. If that’s in the past, you’re gonna have a bad time.
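If you'd rather get a number than squint at dates, here's a small helper built on that same openssl command. One assumption: it uses GNU `date -d` to parse the expiry string, so on macOS/BSD you'd need `date -j -f` instead:

```shell
#!/bin/sh
# Print the number of whole days until a certificate file expires
# (negative if it's already expired). Assumes GNU date for `date -d`.
cert_days_left() {
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2) || return 1
  end_s=$(date -d "$end" +%s) || return 1
  echo $(( (end_s - $(date +%s)) / 86400 ))
}

# Example sweep over the kubeadm PKI directory:
#   for crt in /etc/kubernetes/pki/*.crt; do
#     printf '%s: %s days\n' "$crt" "$(cert_days_left "$crt")"
#   done
```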
The Postmortem Nobody Wanted To Attend
Had to do a postmortem meeting the next day. Our CTO was there. Not fun.
The timeline looked really bad written down:
- 11:30 AM – cluster dies
- 11:45 AM – figured out it was certs
- 1:30 PM – control plane back up
- 3:00 PM – all workers rejoined
- 4:30 PM – monitoring and CI/CD working again
Five hours of downtime for something that should’ve been preventable. Got asked a lot of “why didn’t we know about this” questions that I didn’t have great answers for.
The action items were basically:
- Set up cert monitoring (done)
- Automate renewal (in progress)
- Document everything (you’re reading it)
- Test this in staging quarterly
- Consider managed Kubernetes
That last item sparked a whole separate debate about cloud costs versus operational overhead. That meeting went long.
Random Tips That Might Save You
If your certs are expired and kubectl doesn’t work AT ALL, you can still access the cluster by SSH’ing to the control plane node and using the local kubeconfig:
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
This works because that config file talks directly to the API server without going through the load balancer or whatever.
You can renew individual certs instead of all of them:
kubeadm certs renew apiserver
Useful if you just need to fix one thing quickly.
Don’t forget about etcd certs if you’re running external etcd. Those expire too and breaking etcd is even worse than breaking the API server.
Back up your PKI directory before messing with certs:
cp -r /etc/kubernetes/pki /etc/kubernetes/pki.backup
Trust me on this one.
Final Thoughts
Look, Kubernetes is powerful and all that but sometimes it feels like there are a thousand little things that can break in surprising ways. Certificates expiring is just one of them.
The frustrating part is this is a KNOWN thing. It happens to people all the time. There are blog posts about it dating back years. But somehow it still catches people by surprise because nobody talks about it when you’re getting started.
If you’re running a kubeadm cluster and you set it up more than 10 months ago, do yourself a favor and check those cert expiration dates right now. Seriously, stop reading this and go check. I’ll wait.
Better to spend 10 minutes checking than 5 hours fixing. And way better than explaining to your boss why production was down all afternoon because some certificates expired.
Anyway, that’s my story. Hope it helps someone avoid the same disaster. If you’ve been through this too, I feel your pain. If you haven’t yet… good luck. It’s coming.
FAQ: Kubernetes Certificate Expired
After I posted about this internally, I got flooded with questions from other teams. Here are the ones that came up most:
How do I know when my certs are expiring?
Run kubeadm certs check-expiration on your control plane node. Takes like 2 seconds. If you see anything under 30 days, deal with it now. If you see “invalid” anywhere, you’re already too late.
Can I renew certs before they expire?
Yes absolutely. In fact please do. You can renew them anytime. They’ll be valid for another year from whenever you renew them. Don’t wait until the last minute like we did (well, we didn’t wait at all, we just forgot).
Will renewing certs cause downtime?
Kinda yeah. You have to restart the control plane pods which means a few seconds where kubectl won’t work. In a HA setup with multiple control plane nodes you can do it one at a time with no downtime. We only have one control plane node so we just did it during a maintenance window.
What about worker node certs?
Those get renewed automatically by kubelet as long as the control plane certs are valid. BUT if everything expires at once like it did for us, you’ll need to rejoin the workers manually. It sucks.
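The kubelet side of that automatic renewal is controlled by its config file. As far as I know kubeadm enables client certificate rotation by default, so this is just showing what the relevant setting looks like, not something you normally need to change:

```yaml
# Excerpt from /var/lib/kubelet/config.yaml on a kubeadm node
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true
```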
Do managed Kubernetes services have this problem?
From what I’ve heard, no. EKS, GKE, AKS – they all handle cert rotation automatically. That’s like half the point of paying for managed Kubernetes. If I had known this would be such a pain I might’ve pushed harder for EKS from the start.
I’m getting the expired cert error but kubeadm says everything is valid?
Check if you have any old cached configs lying around. Jenkins, CI/CD tools, monitoring – they all might have old kubeconfigs that reference expired certs. Regenerate those configs and update them everywhere.
My pods are running but I can’t create new ones, is this certs?
Probably yeah. Existing pods keep running even with expired certs because they’re not talking to the API constantly. But creating new pods, getting logs, exec’ing into pods – all that needs valid certs.
Can I just extend the cert validity to like 10 years?
Technically yes but security team will murder you. The whole point of short-lived certs is that if one gets compromised, it’s only valid for a limited time. One year is already pushing it according to our security guy Dave.
What happens if the CA certificate expires?
Pray that it doesn’t? Renewing the CA is a much bigger deal. You basically have to regenerate every cert in the cluster and restart everything. The CA cert lasts 10 years though so you’ve got time to figure that out. Put a reminder in your calendar for 2033.
Should I automate the renewal?
Yes. We set up a cronjob but there are better ways. Some people use cert-manager. Some write scripts that run via cron. Some just set calendar reminders to do it manually every 6 months. Literally anything is better than forgetting about it completely.
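If you go the plain-cron route, the crontab entry itself is trivial; the script name here is hypothetical, standing in for a wrapper around kubeadm certs renew all plus the control plane restart from earlier:

```
# Root crontab: 03:00 on the 1st of every month.
# /usr/local/bin/renew-k8s-certs.sh is a hypothetical wrapper script.
0 3 1 * * /usr/local/bin/renew-k8s-certs.sh >> /var/log/k8s-cert-renew.log 2>&1
```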
How often should I check cert expiration?
We’re checking monthly now. Set up a monitoring alert that fires when any cert is within 30 days of expiring. Also have a runbook that documents exactly what to do when that alert fires.
I tried renewing and got permission errors?
You need to run kubeadm as root or with sudo. The cert files live in /etc/kubernetes/pki which is root-owned. If you’re getting permission errors that’s probably why.
Do I need to tell users about the restart?
Depends on your setup. For us, the API server restart caused like 30 seconds of “kubectl doesn’t work” which users noticed. If you’ve got multiple control plane nodes you can probably do it without anyone noticing. Either way probably good to give people a heads up.
Can this happen with RKE or K3s or other Kubernetes distributions?
Different tools handle certs differently. K3s uses certificates but manages them a bit differently than kubeadm. RKE has its own cert management. Check the docs for whatever you’re using. The underlying problem (certs expire) is universal though.