Kubernetes: Pods restarting
As part of recovering from the power outage yesterday, this morning I noticed that one of my pods still has a huge number of restarts:
nginx-5dfd696766-pw928 0/1 CrashLoopBackOff 193 (13s ago) 20h
It’s much too late for there to still be a recurring problem, but I have had this issue before on one of the instances, foodin. I had previously drained, rebooted, then uncordoned this node, which seemed to have fixed it before, but not this time:
...
Node: foodin/192.88.99.65
Start Time: Tue, 02 Sep 2025 10:33:54 +1000
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Normal Killing 2m40s (x193 over 20h) kubelet Stopping container nginx
Normal SandboxChanged 2m39s (x193 over 20h) kubelet Pod sandbox changed, it will be killed and re-created.
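For spotting pods in this state without eyeballing the whole list, a small filter along these lines might help. It is only a sketch: pods_with_restarts is a made-up name, and it assumes the default column layout of kubectl get pods for a single namespace, where the restart count is the fourth field (with -A a namespace column comes first, so the field numbers would need shifting).

```shell
# pods_with_restarts MIN
# Reads `kubectl get pods` output on stdin and prints "NAME RESTARTS" for
# every pod whose restart count is at least MIN.
# Assumes the default single-namespace column order:
#   NAME READY STATUS RESTARTS AGE
pods_with_restarts() {
    # The regex skips the header line, whose fourth field is "RESTARTS".
    awk -v min="$1" '$4 ~ /^[0-9]+$/ && $4 + 0 >= min + 0 { print $1, $4 }'
}
```

It would be used as kubectl get pods | pods_with_restarts 100.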
I’m sure I’ve dealt with this issue in the past, but I didn’t write down what the solution was. I did a quick 6am Google search (because I’m wide awake for some reason), and someone mentioned that it can be caused by containerd not having the correct configuration. When I inspect foodin and compare it to the other two compute nodes, sure enough, they have an /etc/containerd/config.toml and foodin does not. So:
root@foodin:~# mkdir /etc/containerd
root@foodin:~# containerd config default > /etc/containerd/config.toml
root@foodin:~# systemctl restart containerd
and so far so good?
nginx-5dfd696766-pw928 1/1 Running 194 (12m ago) 20h
Thinking about it, I’m fairly sure that the reason that draining and rebooting the node fixed it last time was probably just because it evicted all the pods from that node, and they restarted on a different node, where they were happy?
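Since the fix was not written down last time, here is a rough sketch of the same steps as a reusable check. The function name and its two optional arguments are inventions for illustration; the defaults mirror the commands above. It only writes a config when none exists, so it should be safe to run on nodes that are already fine.

```shell
# ensure_containerd_config [CONF_DIR] [GENERATOR]
# Writes a default config.toml into CONF_DIR only if one is not already there.
#   CONF_DIR   defaults to /etc/containerd
#   GENERATOR  command that prints a config, defaults to "containerd config default"
ensure_containerd_config() {
    conf_dir="${1:-/etc/containerd}"
    generator="${2:-containerd config default}"
    conf_file="$conf_dir/config.toml"

    # A non-empty file means there is nothing to do.
    if [ -s "$conf_file" ]; then
        echo "present: $conf_file"
        return 0
    fi

    mkdir -p "$conf_dir" || return 1
    # $generator is left unquoted on purpose so it splits into command + arguments.
    # Write to a temporary file first so a failed run leaves no half-written config.
    $generator > "$conf_file.tmp" || { rm -f "$conf_file.tmp"; return 1; }
    mv "$conf_file.tmp" "$conf_file" || return 1
    echo "created: $conf_file"
}
```

Run as root with no arguments, and follow a "created" result with systemctl restart containerd as above.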
Power outage!
I had just finished making a coffee, and Sabriena was gearing up to trim the dog’s nails when right as she went to turn on an audiobook on her headphones, one of the speakers in our house made a loud noise similar to the bluetooth pairing noise. It thus took us both some seconds of confusion before we realized the power had gone out.
This is not actually - touch wood - a terribly common occurrence out here. We’ve had one large-scale outage in the 12 years or so we’ve lived in Horsham, and that was when the local substation caught fire on a very hot afternoon. We’ve had a couple of smaller ones, 20 minutes or so, usually involving either a storm or a car accident with a power line, all of which the UPSes coped well with and we went right on with our business.
Today’s outage however appeared to be one of the larger ones - the outage map looked rather huge and listed the location of the outage as the substation (not sure if they all do?). I was in the process of shutting down machines (it shows how infrequent these issues are that I haven’t automated that) when the rack UPS suddenly died. It looks like it’s definitely time for more batteries (I have been saying that for a while now though).
It did illustrate that when they swapped out our NTD, I neglected to plug it back into the UPS… so I did that, and we had internet for about 20 minutes before that UPS died also. It’s a 1500VA Cyberpower model, so it’s not surprising it doesn’t have a huge amount of juice in it, but that did seem premature as well, and it’s a few years old so it may want a new battery too?
The power came back about 30 minutes later. The disk server gave me just enough grief starting up that I had to dig out a 4:3 monitor in anger, but it actually started up without help if I was just a little more patient.
Getting all the services to come back online required a bit of work - for some reason my Kubernetes cluster will start them without the requisite iSCSI or NFS mount working, they’ll crash repeatedly, and I’ll have to delete and recreate the pod and sometimes the entire deployment before it’ll quit doing it and act right.
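One way to guard against services starting before their storage is up is to check for the mount first and wait for it. This is a node-side sketch rather than anything Kubernetes does for you; the function names are made up, and the optional mounts-file argument exists only so the check can be tried against a sample file instead of /proc/mounts.

```shell
# is_mounted PATH [MOUNTS_FILE]
# Succeeds if PATH appears as a mount point in MOUNTS_FILE (default /proc/mounts).
# Note: /proc/mounts escapes spaces in paths as \040, which this does not handle.
is_mounted() {
    awk -v t="$1" '$2 == t { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# wait_for_mount PATH [TRIES] [DELAY_SECONDS] [MOUNTS_FILE]
# Polls until PATH is mounted; gives up with a non-zero status after TRIES checks.
wait_for_mount() {
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        is_mounted "$1" "${4:-/proc/mounts}" && return 0
        tries=$((tries - 1))
        sleep "${3:-2}"
    done
    return 1
}
```

For example, wait_for_mount /mnt/nfs 30 2 (with /mnt/nfs standing in for whatever path the NFS or iSCSI volume really uses) could run before anything that depends on it.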
Unless we get a few more of these, I don’t think it’s quite worth dumping money into a house battery yet, but I am going to have to grit my teeth and order replacement batteries for the UPSes.
More blog tinkering…
Fooling around with the templates, I set about fixing some indentation issues. It’s a long-standing annoyance of mine that Hugo’s markdown to HTML renderer throws away all indentation - I’m pretty sure you can fix it with something like:
{{ .Content | strings.ReplaceRE `[\r\n]` "\n " | safeHTML }}
… however I wasn’t really happy with the results of this, so I elected to just wrap it in some HTML comments and be done with it. HTML comments are themselves stripped out by Hugo’s templating engine for some reason, so this was my solution:
{{ printf "<!-- this is a comment -->" | safeHTML }}
But for some reason, it was not happy with this inside the inline <script> tag? After quite a bit of experimentation, I finally Googled it… to learn that apparently HTML comments around inline JavaScript haven’t been necessary in like 20 years or some shit, and Hugo was actually Doing The Right Thing. Bah!
Anyway, I cleaned up a bunch of the indentation, threw away some more classes I didn’t need, and it’s actually not looking too bad right now. Did I do anything about the contrast ratio of light mode text? No, no I did not.
Oh, I also learned that SRI is useless for inline script elements, so I removed that. I’m sure some security scanner or another told me I needed it, but we’ll see. It seems to be fairly happy without it, in Firefox at least.
Update - 2025-08-14: I figured out a hackish way of making the main text black, and I put a black stroke around the logo for now so it’s visible. I also added some CSS which will make the top and bottom navigation links just icons if you’re on a mobile or really small screen, which stops it from having sideways scrolling for no damn reason.
I think that gets it most of the way to “good enough” - I am of course itching to redo a theme from scratch because I’m starting to hate this one, but for now it’s definitely good enough.
PicoCSS - merged
Quite some time ago, I looked at switching from Bootstrap to PicoCSS, but I didn’t end up going through with it. I don’t remember all the problems that I had, but I managed to, while not having any internet this weekend, solve a couple of them.
Getting rid of the dark backgrounds on the syntax-highlighted code blocks was easy enough, I just had to add this to my config:
pygmentsUseClasses: true
This has, of course, left me without syntax highlighting at all, as it just sets the classes and then expects CSS to actually pick the colours, and I do not have any CSS set up for that. But I’m actually okay with that, the defaults are mostly readable, and I can pick colours later on… I just need to find a theme (or potentially hack one together) which supports auto light/dark mode and remains readable on both. Easy, but a task for another day.
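For the auto light/dark part, one possible route is to generate two stylesheets with hugo gen chromastyles and wrap the dark one in a prefers-color-scheme media query. The helper below is a sketch of that wrapping step only; its name is made up, and the style names in the usage note are just examples, not a recommendation.

```shell
# wrap_for_dark_mode
# Reads CSS on stdin and prints it wrapped in a media query, so the rules
# only apply when the browser or OS asks for a dark colour scheme.
wrap_for_dark_mode() {
    echo '@media (prefers-color-scheme: dark) {'
    sed 's/^/  /'    # indent the wrapped rules for readability
    echo '}'
}
```

Usage might look like hugo gen chromastyles --style=monokailight > syntax.css for the light rules, followed by hugo gen chromastyles --style=monokai | wrap_for_dark_mode >> syntax.css for the dark ones, assuming both of those styles read well on the site.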
The only other thing I can think of that’s bothering me is the header image doesn’t really work properly for light mode… but I think I can fix that later when I get around to it.
In order to fully benefit from it, I need to subset font-awesome: since I only use about seven of the glyphs, I don’t need all 180KB of it. I could have them do this, but looking into it, they want $150/yr for the ability to do that… that’s a tough sell for someone like me with no design intentions. In fact I’m starting to wonder whether a PNG of sprites mightn’t be a better solution, abandoning font-awesome completely? Heck even if I did separate PNGs it’s still going to come in at less load time on a very slow link, I think - at the cost of losing perfect scaling. SVGs?
For now, I don’t know if I care enough. I’m still sub-megabyte first page load if there’s no images on the front page - around 500KB, and approximately 250KB or so of that is font-awesome.
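A free alternative for subsetting might be pyftsubset from the fonttools Python package, which can cut a font down to a list of codepoints. The sketch below only builds the --unicodes argument; the function name is made up, and the font file name and codepoints in the commented usage are placeholders rather than the glyphs actually used here.

```shell
# unicodes_arg CODEPOINT...
# Turns hex codepoints such as "f015" into the comma-separated "U+f015,U+f07c"
# form that pyftsubset accepts for --unicodes.
unicodes_arg() {
    out=""
    for cp in "$@"; do
        out="${out:+$out,}U+$cp"
    done
    printf '%s\n' "$out"
}

# Hypothetical usage (file names and codepoints are placeholders;
# --flavor=woff2 needs the brotli Python module installed):
# pyftsubset fa-solid-900.ttf \
#     --unicodes="$(unicodes_arg f015 f07c f09e)" \
#     --flavor=woff2 \
#     --output-file=fa-subset.woff2
```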
I think the first order of business is to sort out improving the contrast of the light-mode text, and any other associated annoyances with it.
No internet!
Woke up this morning to my phone being connected solely via its 5G connection, rather than our wifi. That’s weird. Also wireguard is down - doubly weird. Out of bed, and Sabriena goes “I think the internet is down” - yes it sure seems that way.
I look at the NTD, and there’s no optical light on it. I then remembered that we were supposed to have maintenance last night, so I checked and the window was from midnight to 6am, and given that it was about quarter past seven at this point, clearly something had gone very wrong.
Looking at the guides, I power cycled our router - even though it was clear it wouldn’t do any good. I then waited until 8 to call up tech support, where they informed me that yes, it was fine and probably even a good idea to power-cycle the NTD as well. Still no improvement, and at this point the tech asked me to confirm the sequence of lights: yes, power status light is green most of the time, the optical status light is dark, and the power status light periodically goes flashing which suggests it’s starting up. We did try a reset of the NTD, by holding the tiny reset button in with a paper clip.
No good, they’ll have to ask NBN to come out and check out the NTD, and the earliest appointment is Monday. We don’t have any confirmation or anything, but the speculation is:
- the rather lengthy maintenance window was them rolling out software patches for the optic hardware to support the forthcoming 2-gigabit connections they’ll offer.
- it seems quite likely that the software update was botched on our NTD, and it’s boot-looping as it reaches the point where it tries to enable the optical hardware.
Anyway, once I picked myself up out of the fetal position, I started thinking about what I’d do for work. I’d already tethered my phone to my work laptop which would get me through today. I looked to see if we could get a proper modem in town that would do 4g or 5g as a backup - I can, but at $250AUD I’m not sure I want to… our connection is generally fairly reliable and I can use my wireless hotspot for work.
Then around my lunch break I remembered that I have a very old TP-link WAP that can be configured as a wireless->ethernet bridge. I can turn off its WLAN, bridge it to my phone’s hotspot, and have everything on the network have access to the internet!
This worked, but with a few issues: first, the device is grossly out of date. It’ll get me through the weekend, but it’s not a permanent backup solution (entirely unsuitable for work). Second, it’ll be slow - for some definition of slow. Third, there’s the possibility I’ll chew up all the saved up transfer quota on my mobile account doing this - I have 215GB free… but we’ve only used 400-ish for this month (8 days) so it seems quite likely that if we limited our downloads we could get through the weekend using up our phones after all.
The final issue would be I can’t have my phone in my pocket or near me, so I decided I would not set it up until after the work day was finished.
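The quota guess can be sanity-checked with some quick arithmetic. This sketch assumes the "400-ish" figure means roughly 400GB of normal household usage over those 8 days, which is an interpretation rather than something stated outright; the function name is made up.

```shell
# days_of_quota FREE_GB USED_GB DAYS
# Whole days that FREE_GB would last at the average daily rate implied by
# USED_GB over DAYS (integer arithmetic, so the result rounds down).
days_of_quota() {
    free_gb=$1
    used_gb=$2
    days=$3
    echo $(( free_gb * days / used_gb ))
}

days_of_quota 215 400 8   # prints 4
```

At the usual rate of about 50GB a day, 215GB covers roughly four days without cutting back at all, which fits the idea that a weekend is doable.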
Update 2025-08-08: I set it up, it works, but keeps dropping off. I then tried Duncan’s iPhone, on the assumption that if we eat up his quota instead, it won’t matter as he never uses his phone. He approved this idea because I told him he can watch YouTube if he wishes - it’s his quota. Interestingly, his ancient iPhone XR performs much better, staying connected pretty much the entire time until it shuts off the hotspot after 8 hours or so? He did mention that the couple of Roblox games he tried were unplayable, but I did manage to stay connected to GTA Online without too many issues.
Also it’s interesting to note how well Mastodon handles sporadic connections. When we’re connected and the tailscale tunnel I set up for inbound connections (due to the fact that our mobile connections are CGNAT, so I can’t open an incoming port, much less 443) comes up, it immediately starts picking up new posts, sending the old ones, etc. The sidekiq queue on my end flushes out fairly quickly, maybe 45 minutes? I’m not sure what it looks like from another admin’s perspective, with my server throwing 502s for every request that isn’t in the cache, but I’m impressed.
Update 2025-08-11: Our technical visit window was from 8am til noon, so I skipped lunch and didn’t walk the dog today. Unfortunately, I got the SMS that the tech was on their way about 10 minutes before noon, and then a phone call shortly thereafter - he was busy helping someone out who had a medical connection, but he’d be there when he could, would I be around later? Sure, as much as I might rock back and forth and chant at the prospect of going additional days without a decent internet connection, someone with a “medical” connection likely has far greater needs than mine.
He finally made it out at about 4:30, and the visit was very quick. He brought in a tester to check the optical connection, disappeared out to his truck, grabbed another NTD, replaced it, put some numbers into the app on his phone, told us he’d wait in the driveway until it synced but if he drove off it’d be fine. He drove off, so I plugged the UDM-SE back in and we have internet again.
At one point he asked if I could throw the old NTD away for him, and I said sure… I was already thinking about taking it apart! Alas, he took it with him, so I don’t get to see what’s in them.