You know what's wild? I've been running Kubernetes clusters for almost three years now, and Kubernetes service account tokens still manage to catch me off guard. Last week was a perfect example.
Everything was humming along fine until around 2 PM when Jenkins started failing builds. Not just one or two – literally every pipeline that touched our k8s cluster. My Slack was blowing up. Fun times.
## Here's What Went Down
The error messages were useless, honestly. Something about forbidden access and service accounts not having permissions. I remember thinking “great, another one of these days” because this exact thing happened six months ago and I couldn’t remember how I fixed it back then.
Started digging through logs and found pods trying to talk to the API server but getting rejected. The weird part? Nothing had changed. No deployments, no updates, nothing. Or so I thought.
## Turns Out Kubernetes Changed Everything
This is gonna sound dumb but I didn't realize Kubernetes basically rewrote how service account tokens work starting in version 1.21. We upgraded our cluster last month and I completely missed the memo about tokens being different now.
Before the change, you'd create a service account and boom – permanent token, sits in a Secret, works forever. Easy. Now though? Tokens expire. They get projected into pods automatically and the kubelet refreshes them before they lapse, but they're not permanent anymore. And as of 1.24, Kubernetes stopped auto-creating those Secret-based tokens entirely.
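For the curious, the token that lands in your pods now comes from a projected volume rather than a Secret. Roughly like this (a sketch – Kubernetes generates this for you, and the real volume name and expirationSeconds will differ):

```yaml
# Sketch of the auto-generated token volume in a pod spec.
# Kubernetes creates this itself; the name and expirationSeconds
# here are illustrative, not something you write by hand.
volumes:
- name: kube-api-access
  projected:
    sources:
    - serviceAccountToken:
        path: token
        expirationSeconds: 3600  # kubelet refreshes before expiry
```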
Nobody bothered mentioning this in the upgrade notes we skimmed through. Or maybe they did and we just didn’t read carefully enough. Either way, our automation scripts were still looking for those old permanent tokens that don’t exist anymore.
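If you want to see the expiry for yourself, the mounted token is just a JWT, so you can decode its exp claim with a little shell helper (a sketch: `token_exp` is my own name for it, and it assumes a well-formed token and a standard base64):

```shell
# token_exp: print the "exp" (expiry, Unix time) claim of a JWT.
# Usage: token_exp "$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
token_exp() {
  # A JWT is header.payload.signature; the payload is base64url-encoded JSON.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Pad to a multiple of 4 so base64 -d doesn't complain.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d 2>/dev/null \
    | grep -o '"exp":[0-9]*' | cut -d: -f2
}
```

Compare what it prints against `date +%s`; if the number's in the past, that's your authentication error right there.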
## How I Actually Fixed the Service Account Token Issue
Okay so first thing I did was verify the service account wasn’t deleted somehow:
```shell
kubectl get sa jenkins-agent -n cicd
```
Still there. Then I looked at its details:

```shell
kubectl describe sa jenkins-agent -n cicd
```

(Heads up: `describe` only shows the account itself, not what it's allowed to do – for that you want `kubectl auth can-i --list --as=system:serviceaccount:cicd:jenkins-agent -n cicd`, which I only remembered later.)
Looked normal to me. But then my coworker Mike walked by and was like “did you check the role bindings?” and I realized I’m an idiot.
```shell
kubectl get rolebinding -n cicd
```
Yeah. The role binding was gone. Completely missing. Turns out someone ran a cleanup script last Friday (naming no names but it rhymes with “Mike”) and accidentally nuked some bindings we actually needed.
## Getting Things Working Again
Had to recreate everything from scratch. Here’s what ended up working:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jenkins-role
  namespace: cicd
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "pods/exec"]
  verbs: ["get", "list", "create", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-binding
  namespace: cicd
subjects:
- kind: ServiceAccount
  name: jenkins-agent
  namespace: cicd
roleRef:
  kind: Role
  name: jenkins-role
  apiGroup: rbac.authorization.k8s.io
```
Slapped that into a file and ran `kubectl apply -f fix.yaml`. Jenkins started working again immediately. Crisis averted.
## But Wait, There's More
Two days later our Prometheus instance started acting weird. Couldn’t scrape metrics from half the pods. Different problem, same root cause – tokens.
See, Prometheus runs outside our main cluster and uses a service account token to authenticate. That token had expired. The new token system in Kubernetes means tokens aren’t valid forever anymore, which broke our monitoring setup.
If you really need a token that doesn’t expire (I know, I know, security people hate this), you gotta make a secret manually:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-token
  namespace: monitoring
  annotations:
    kubernetes.io/service-account.name: prometheus
type: kubernetes.io/service-account-token
```
Is this the most secure thing ever? Probably not. But you know what’s also not secure? Not having monitoring because your tokens keep expiring every few hours. Pick your battles.
## Stuff I Learned the Hard Way
Next time this happens (and let’s be real, there will be a next time), I’m checking these things first:
- Does the service account actually exist in the right namespace? Wasted 20 minutes once looking in the wrong namespace.
- Is there a Role defining what the service account can do? If not, create one.
- Is the RoleBinding connecting them? This is the one that gets me every single time.
- Check which namespace everything's in. Roles only work in their own namespace unless you're using ClusterRoles.
- If stuff's running outside the cluster, the token might be expired.
Quick way to test a token:
```shell
kubectl --token=whatever_your_token_is get pods
```
Works? Good. Doesn’t work? Now you know.
## The Namespace Thing Is Annoying
I spent like three hours debugging an issue once where everything looked perfect. Service account existed, role existed, role binding existed. Nothing worked.
Know what the problem was? Service account was in production, role was in production, but the role binding was in default because I copy-pasted from an example and forgot to change the namespace. Three. Hours.
Now I triple-check namespaces every single time. Saves so much frustration.
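Part of the triple-checking is now a dumb helper that lists every namespace a manifest mentions (a sketch – `check_ns` is my own name for it, and it's plain text matching, not a real YAML parser):

```shell
# check_ns: count every "namespace:" value in a manifest file so a
# copy-paste mismatch (binding in default, everything else in
# production) jumps right out.
check_ns() {
  awk '/namespace:/ { print $2 }' "$1" | sort | uniq -c
}
```

If more than one namespace shows up and you didn't expect that, you've probably just saved yourself three hours.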
## Sometimes You Just Gotta Restart Stuff
Had this one issue where the API server was being weird about tokens. Everything looked correct but authentication kept failing. Stack Overflow had nothing useful. GitHub issues had nothing useful.
Eventually just restarted the API server pods:
```shell
kubectl delete pod -n kube-system -l component=kube-apiserver
```
Worked like a charm. Sometimes the old "turn it off and on again" really is the answer. One caveat: on kubeadm-style clusters the API server is a static pod, so deleting it mostly just recreates the mirror Pod object – for a true restart, move its manifest out of /etc/kubernetes/manifests and back. And don't do any of this in production without a good reason, obviously.
## What I'm Doing Different Now
Made a little script that checks all our service accounts every morning. Not fancy, just loops through namespaces and makes sure the bindings are there. Sends me an email if something’s missing.
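The heart of that script looks something like this (a sketch: `check_bindings` and the pluggable lookup command are my own idea – in the real version the lookup wraps `kubectl get rolebinding -o wide` and greps the SERVICEACCOUNTS column):

```shell
# check_bindings: read "namespace serviceaccount" pairs on stdin and
# report any pair that no RoleBinding subject points at. $1 is a
# command that, given a namespace, prints the "ns/sa" subjects bound
# there -- in real life something wrapping kubectl, pluggable here so
# it's easy to test without a cluster.
check_bindings() {
  lookup="$1"
  while read -r ns sa; do
    if [ -n "$ns" ] && ! "$lookup" "$ns" | grep -q "^$ns/$sa$"; then
      echo "MISSING: $ns/$sa has no role binding"
    fi
  done
}
```

Feed it a file of namespace/service-account pairs each morning and pipe anything it prints into your mail command of choice.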
Also started keeping notes in Confluence about which service accounts need which permissions. Future me will appreciate it. Hell, current me already appreciates it because I’ve referenced it twice this week.
## Bottom Line
Service account tokens in Kubernetes are more complicated than they should be. The old way was simpler but less secure. The new way is better for security but breaks stuff if you’re not paying attention.
Most problems come down to missing role bindings or expired tokens. Check those first, save yourself some time. And maybe don’t run cleanup scripts on Friday afternoons without checking what they’re actually going to delete.
Oh and document your stuff. Seriously. Your future self will thank you when this breaks again at midnight and you can’t remember anything.
You can also check my Blog Feed to learn about various Kubernetes issues and the fixes I've implemented for almost every major component of Kubernetes.
## Resources That Actually Helped Me
Look, the official Kubernetes docs are fine but they’re way too formal sometimes. Here’s what I actually used when I was stuck:
- The Kubernetes RBAC docs – yeah, they're dry, but eventually you need them
## Questions People Keep Asking Me (FAQ)
**Why can’t I just use the old token method?**
You technically can if you create the secret manually like I showed with Prometheus. But Kubernetes wants you to move away from that. Something about security and tokens living forever being bad. They’re not wrong but it’s annoying when you’re trying to get work done.
**How do I know if my token expired?**
Try using it. Seriously that’s the fastest way. Or look at the pod logs and you’ll see authentication errors. The new tokens in pods auto-refresh though so this is mainly a problem for external tools.
**Do I really need both a Role AND a RoleBinding?**
Yeah unfortunately. The Role says what actions are allowed. The RoleBinding says who gets those permissions. It’s like having a key (Role) and then saying who can use the key (RoleBinding). Annoying but that’s how it works.
**What’s the difference between Role and ClusterRole anyway?**
Role only works in one namespace. ClusterRole works everywhere. Most of the time you want Role unless you're doing something that needs access across the whole cluster. Fewer permissions is usually better.
**My service account can’t see pods in other namespaces, why?**
Because Roles are namespace-scoped. If you need cross-namespace stuff you need a ClusterRole and ClusterRoleBinding. But think real hard about whether you actually need that because it’s a security risk.
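If you decide you really do need it, the shape looks like this (a sketch – `pod-reader` and the binding name are made up, and the `prometheus` account matches my earlier example):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reader            # hypothetical name
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-pod-reader # hypothetical name
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```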
**Can I just give everything cluster-admin and call it a day?**
Please don’t. I mean you CAN but then when something gets compromised it can trash your entire cluster. Been there, not fun. Give the minimum permissions needed and add more later if you have to.
**How often do I need to rotate these tokens?**
The auto-mounted ones in pods handle themselves. The manual ones you created? That’s on you. We’re rotating ours every 90 days now but honestly it depends on your security requirements. Ask your security team, they’ll have opinions.
**My role binding is there but it's still not working. Help!**
Check the namespaces. I’m serious. Then check them again. Then check the service account name is spelled exactly right. Then check that the role actually has the permissions you think it does. 90% of the time it’s one of those three things.