Kubernetes: Container updates
Over the last week or so, I’ve gradually gone through all of my services and updated everything. The only things left this morning were my collectd container (which I build, so a bit more involved) and Traefik, my ingress, which I figured would be the easy one. Sure, I’ve been a bit slack and haven’t updated it in a while, but it’s only one minor version bump?
I’m not sure whose fault it is. The upgrade notes for 2.10 made it sound easy: they’re switching the API names, both are supported, but you gotta update the CRDs first. No worries.
Did that, applied the deployment with the new container version, and all my routes and middlewares disappeared. Fuck! Roll it back, everything came back, so I went to take the dog for a walk (she had fun, FYI).
Upon returning, I set about looking into it. I tried updating all the ingressroutes to use the new API and learned that kubectl caches things, so after you do that, kubectl get ingressroute won’t show anything. So I learned how to blow away the cache (rm -r ~/.kube/cache) and repeated. Still no luck.
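For reference, a rough sketch of that cache-clearing step as a reusable function. The function name is made up for illustration, and the group-qualified queries in the usage comments are an assumption about how to tell the old and new API groups apart, not something I ran at the time.

```shell
# Remove kubectl's on-disk discovery cache so newly installed CRDs are re-read.
# Defaults to ~/.kube/cache; pass another directory as $1 if yours differs.
clear_kubectl_cache() {
  cache_dir="${1:-$HOME/.kube/cache}"
  rm -rf -- "$cache_dir"
}

# Usage against a live cluster:
#   clear_kubectl_cache
#   kubectl api-resources --api-group=traefik.io
#   kubectl get ingressroutes.traefik.io --all-namespaces            # new group
#   kubectl get ingressroutes.traefik.containo.us --all-namespaces   # old group
```

Asking for the resource by its full name (resource.group) avoids any ambiguity about which of the two API groups a bare ingressroute resolves to.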
Nearly an hour into it, I had the genius thought to look at the logs…
It turns out that at some point either Traefik changed their examples or I did my own thing. Either way, my Traefik lives in kube-system for some reason, and my serviceaccount is traefik-ingress while theirs is traefik-ingress-controller. Between these two things, I was getting errors (which I did not save) saying the service account Traefik runs as doesn’t have access to the new traefik.io APIs. Because of that it was throwing an exception and none of the updates were happening.
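In hindsight, this kind of permission gap can be checked directly instead of inferred from disappearing routes. Below is a minimal sketch using kubectl auth can-i with impersonation. The helper names are hypothetical, the deployment name in the logs command is an assumption, and impersonating a service account only works if your own user is allowed to (a cluster admin usually is).

```shell
# Build the impersonation subject for a ServiceAccount: namespace, then name.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Ask the API server whether that ServiceAccount may run a verb on a resource
# across all namespaces. Arguments: namespace, account name, verb, resource.
sa_can_i() {
  kubectl auth can-i "$3" "$4" --all-namespaces --as="$(sa_subject "$1" "$2")"
}

# Usage against a live cluster:
#   kubectl logs -n kube-system deployment/traefik --tail=50   # deployment name assumed
#   sa_can_i kube-system traefik-ingress list ingressroutes.traefik.io
#   sa_can_i kube-system traefik-ingress watch middlewares.traefik.io
```

A "no" from either check would point at the roles and bindings rather than at the ingressroutes themselves.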
So that’s easy enough to fix: destroy all the Traefik-related roles, serviceaccounts, etc., fix a couple of mistakes in the rbac.yaml I downloaded from the guide (i.e. changing the SA name and putting it in the namespace I’m using), and off it goes. Little by little, re-deploying the ingresses brought all my services back.
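The two edits to the downloaded file could be scripted roughly like this. It is only a sketch: it assumes the upstream manifest uses the name traefik-ingress-controller and the namespace default, which is worth confirming against the actual file before trusting a blind search and replace. The function and output file names are placeholders.

```shell
# Rewrite the ServiceAccount name and namespace in an RBAC manifest on stdin.
# Assumed (check your copy): upstream uses the name "traefik-ingress-controller"
# and "namespace: default". Note that this also renames any role or binding
# that shares the upstream name, which keeps the references consistent.
patch_rbac() {
  sa_name="$1"
  namespace="$2"
  sed -e "s/traefik-ingress-controller/${sa_name}/g" \
      -e "s/namespace: default/namespace: ${namespace}/g"
}

# Usage (output file name is a placeholder):
#   patch_rbac traefik-ingress kube-system < rbac.yaml > rbac-local.yaml
#   kubectl apply -f rbac-local.yaml
```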
Have I made a mistake by not changing it to conform to the examples? Maybe, but I’m loath to break it again by trying to fix something that isn’t a problem yet, so I’ll kick the can down the road and hope I don’t fuck something else up in the move to 3.0. What I probably should do at that point is tear the whole cluster down and rebuild my manifests from scratch. What I probably really should do is redo the whole thing in Helm instead.
But I think everyone knows what I’m gonna do when that day arrives instead. Walk the fucking dog.
Update: 2023-07-16: While I was discussing it with folks on Discord, Kevin recommended putting Traefik into its own namespace after all. The discussion prompted a vague memory: I somewhat recall either deciding to, or being told to, put Traefik in kube-system to protect access to the cloudflare secret from the rest of the containers. Given that this is a single-user cluster, it’s a bit irrelevant, but regardless it was fairly painless (other than accidentally blowing away the secret) to do so.
I should probably split the other workloads up (ie all the other shit doesn’t require access to Mastodon’s secrets, for instance) but that can go on the backlog for now.