Kubernetes Readiness Probes: Apparently Too Difficult

I like to joke that burning yourself with lit cigarettes is too expensive in Australia, and sometimes you just need to feel, so I run Kubernetes at home. I jest, of course; for the most part it’s fairly pain-free now that I’m over the hump of figuring things out for the first time.

But one of the minor pain points is that when I update the containers for Mastodon, I spend several minutes getting either a 404 or a 502 from Traefik, every time… and in principle, this shouldn’t happen.

So what am I doing wrong? The first thing I figured I was doing wrong was running fewer than 2 replicas (or, as humans would describe it, no redundancy at all), but fixing that on its own didn’t help anything.

After chatting it over with some folks on Discord, lamenting that one day I should figure this out, they mentioned liveness probes, which sounds like the obvious solution, and it kinda worked - but the second time around the container got shot down because Rails takes forever to fucking start for some unholy reason. I tried increasing the initial delay, and that failed too, because Mastodon’s Rails server is just that fucking slow to start up (the CLI is patently unusable for me, it just takes days to spin up).
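For reference, the failed attempt looked roughly like this - a sketch from memory, and the delay value here is illustrative rather than my actual number:

          livenessProbe:
            httpGet:
              path: /health
              port: web
            # hoping a big enough delay would outlast Rails' boot... it didn't, reliably
            initialDelaySeconds: 60
            failureThreshold: 2
            periodSeconds: 10

The problem with this approach is that initialDelaySeconds is a single fixed guess: too short and the pod gets killed mid-boot, too long and every restart wastes that much time even when Rails comes up quickly.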

But what I didn’t realize at the time (because I have the attention span of a goldfish and didn’t read all the documentation before trying shit) was that liveness isn’t the only thing you can control: there are “startup” probes as well, which are similar but different. The startup probe runs while the container is booting, and only once it succeeds is the container considered “up”; after that, the liveness probe takes over to check that the pod is still healthy.

So in the end, this appears to have done the trick:

  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
...
          livenessProbe:
            httpGet:
              path: /health
              port: web
              httpHeaders:
              - name: server
                value: Mastodon
            failureThreshold: 2
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: web
              httpHeaders:
              - name: server
                value: Mastodon
            failureThreshold: 30
            periodSeconds: 10

Note the missing z compared to the /healthz convention in the Kubernetes documentation - Mastodon exposes a short, light /health endpoint, so that’s the one I use. I probably don’t need those headers (httpHeaders sets headers on the probe’s request, rather than checking the response), but I figured I might as well, and there may be some additional thing I could check to make sure it’s actually working, but for now this seems good enough.

This makes Kubernetes wait up to 5 minutes (30 failures × 10 seconds) for Rails to start (which should fucking be enough), after which 2 failures in 20 seconds is enough to shoot the pod down for misbehaving. I could look into using a readiness probe instead of liveness (which just stops sending traffic to the pod rather than killing and restarting it), but I’m not sure that matters for my use case.
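If I do go down that route, a readiness probe would look nearly identical to the other two - this is a sketch of the idea rather than something I’ve deployed:

          readinessProbe:
            httpGet:
              path: /health
              port: web
            failureThreshold: 2
            periodSeconds: 10

The difference is purely in the consequence: a failing readiness probe removes the pod from the Service’s endpoints (so Traefik stops routing to it) but leaves the process running, whereas a failing liveness probe gets the container restarted.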

But the result was that the update to 4.5.0 kept the whole shebang running without a 404 or 502, which was nice.

Horsham, VIC, Australia fwaggle
