Kubernetes NFS issues fixed?

Way back at the start of last year, I started “scaling out” our K8s cluster, and used NFS where possible to provide a filesystem for containers. The plan is to switch most of these out for Longhorn, but I’ll still need NFS for some things… Plex, for instance: I am not buying enough SSDs to have Longhorn be the storage backend for the media (though I probably will put the SQLite database on it).
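For context, the NFS-backed storage here is just plain PersistentVolumes along these lines - a sketch only; the PV name, server address, and export path below are all made up for illustration:

```yaml
# Hypothetical NFS PersistentVolume for the Plex media library.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-media              # hypothetical name
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany             # NFS allows multiple pods to mount it
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.5         # placeholder disk-server address
    path: /tank/media           # placeholder export path
```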

This has presented a single nagging issue: every so often, I’ll try to schedule a pod (usually due to an upgrade or whatever) and it’ll just sit there “pending”, with an error message in the log to the effect of “connection refused”, which is really odd because the NFS server is up and functioning fine.

It finally did it the other morning before work, and this time I thought to look at syslog, where I found something slightly more helpful:

svc: failed to register lockdv1 RPC service (errno 111).

That’s still a pretty shit error message, but it’s a slightly more helpful Google-snack than the previous one. I spent a stupid amount of time trying to work out what was wrong with the RPC daemon on the NFS server to no avail, and I thought it was weird because in the past, if I drained+rebooted the client node, it would typically work fine after that.

It was only after a shameful amount of screwing around that I realized what I actually seem to need is the rpcbind service running on each of the client nodes. This raises questions like “why?” (I am not an expert in NFS by a long shot; I do not know how the different pieces interact with each other) and more importantly “how did it ever work without it?”

My best guess is that it works fine when there’s only one mounted directory for each NFS export, and what’s probably happening is that something isn’t cleaning up after itself and a directory is still mounted in another namespace, which would explain why rebooting the client node fixed it. Complete guess though.
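Since the nodes are built from an Ansible playbook anyway, the fix is easy to bake in. A minimal sketch (the task names are mine, and I’m assuming a Debian/Ubuntu node where both the package and the systemd unit are called rpcbind):

```yaml
# Ensure rpcbind is present and running on every NFS client node,
# so NFSv3 locking (lockd/statd) can register its RPC services.
- name: Install rpcbind
  ansible.builtin.apt:
    name: rpcbind
    state: present

- name: Enable and start rpcbind
  ansible.builtin.systemd:
    name: rpcbind
    state: started
    enabled: true
```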

Horsham, VIC, Australia fwaggle

Published:


Modified:


Filed under:


Location:

Horsham, VIC, Australia

Navigation: Older Entry

Hello Longhorn!

After the disk server upgrades (which also somewhat underscored the need for it), I decided to finally “shit or get off the pot” on looking at some durable storage for things, because at the moment when the disk server goes down, pretty much everything else does too. For some things, like Plex, this is unavoidable. For others, like Home Assistant, it’s very avoidable; I’m just lazy.

So I pointed the latest version of my “Put Kubernetes on the Thing” Ansible playbook at the first two compute nodes, then one at a time drained them, powered them off, fitted a 500GB SSD to each (one Samsung Evo 850 and one 860), and fired them back up.

I installed Longhorn to the cluster, deleted the default disks (I don’t want them on the boot SSDs), created new ext4 partitions, pointed the two compute nodes at them, and then for completeness I created a ZFS dataset on the disk server and pointed Longhorn at it too. My theory was that if I used taints to keep things on the compute nodes, but kept a third extra replica on the disk machine, I should be fine.

To test things out, I figured Traefik was probably a good candidate: its needs for persistent storage are just to stop it from needing to get a new ACME certificate on every start-up, so the consequences of shit going wrong are not particularly dire. So I reasoned that I’d add a second volumeMount to it and copy the acme.json file across - this was not to be, because I got the error message:

code = Unknown desc = file extent is unsupported: operation not supported

A quick check of old Googs showed a pretty obvious explanation: ZFS-backed data is not supported. Ahh well. So I deleted that disk, set the replicas to 2, and off it went.
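For reference, pinning the replica count is done through a Longhorn StorageClass, something roughly like this - a sketch, where the class name is my own invention (note that Longhorn’s parameters are strings, so the 2 is quoted):

```yaml
# Hypothetical StorageClass asking Longhorn for two replicas,
# one per compute-node SSD.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2r             # hypothetical name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"   # minutes before a dead replica is rebuilt
```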

I then struggled on a restart because my Traefik pod is rootless with an immutable root filesystem, and the permissions were coming up wrong. I can use fsGroup to set the mount point of the volume to the GID that the container runs as, and that lets it create acme.json once, before it fails to do anything further because it complains that the permissions are too wide - 0660 instead of 0600, which makes sense, as the file and directory are owned by root.
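The fsGroup bit lives in the pod-level securityContext, something like this sketch - the GID 65532 here is an assumption for a typical rootless image, not necessarily what my Traefik runs as:

```yaml
# Pod-level securityContext: Kubernetes chowns mounted volumes to
# the fsGroup GID, so the rootless container can write to them.
securityContext:
  runAsNonRoot: true
  runAsUser: 65532    # hypothetical non-root UID
  runAsGroup: 65532
  fsGroup: 65532      # volume mount point gets this group
```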

I ended up giving up for the night by relaxing the policy on that namespace and just running Traefik as root. I was able to point Longhorn at the former dataset on the disk server, via NFS, for backups though, so that’s nice.

So yeah, I think this will do the job, I just really need a third SSD for the other compute node. I have a 240GB one out of Sabriena’s desktop but I’d kinda like to try to find another 500GB… I just don’t want to pay the price of it.


Homeprod Upgrades: Part 2 - Supermicro X8DAH+-F -> MSI C236m

An MSI C236m motherboard with a Noctua cooler in a SuperMicro 3U chassis

Today, after walking the dog and collecting the groceries, I shut down all the services that might not like having their disks yanked out from under them and then shut down the disk server in preparation for replacing the dual Xeon 5650 board that’s in it with a single 1230v5 board, which will hopefully save some power and heat.

It ended up not being as cheap as I’d hoped, as I needed to buy a new cooler (opting for a Noctua NH-L9x65), and in order not to have the five server chassis fans blow up the two fan headers on the board, I bought a pair of Coolermaster “ARGB and PWM hubs”, which are apparently rated for 1.5A per fan and 4.5A total.

I plugged everything in, then noticed that Supermicro’s front chassis uses a ribbon cable connector, so I used some Arduino jumpers to re-pin it so I could plug the bits I needed into the board. I found someone who buzzed it out and recorded the pinout for an FP836, which appears to be equivalent for my purposes and which I’ve mirrored below:

Pin Purpose
1 Power Switch+
2 Power Switch-
3 Reset Switch+
4 Reset Switch-
5 Power Supply Fault LED+
6 Power Supply Fault LED-
7 Thermal Fault LED+
8 Thermal Fault LED-
9 Network Interface LED 1+
10 Network Interface LED 1-
11 Network Interface LED 2+
12 Network Interface LED 2-
13 Hard Disk Drive Activity LED+
14 Hard Disk Drive Activity LED-
15 Power On LED+
16 Power On LED-

The next thing I noticed is that it’s loud, and the fans are not calming down. They’re probably not in PWM mode, I figured, so I headed into the BIOS and took a look; near as I can tell PWM mode is called “Smart Fan Mode”, but nothing I did would make them shut the hell up.

Not wanting my garage to sound like a cryptocurrency mine, I threw the low-noise adaptors I bought ages ago back on it as a temporary measure and then reconfigured everything. Naturally the NIC has changed again, so I had to alter my netplan config, but otherwise everything mostly came straight up and I’m back in action.
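To stop the NIC-renaming dance biting on the next hardware swap, netplan can match the interface by MAC address rather than by kernel name - a sketch only, with a made-up MAC, name, and address:

```yaml
# Hypothetical /etc/netplan/01-lan.yaml - pin the NIC by MAC so a
# board swap's new interface name doesn't invalidate the config.
network:
  version: 2
  ethernets:
    lan0:
      match:
        macaddress: "aa:bb:cc:dd:ee:ff"   # placeholder MAC
      set-name: lan0
      dhcp4: false
      addresses: [192.168.1.10/24]        # placeholder address
```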

So I still need to work out what the fuck is going on with the PWM fans, but I gave the whole disk server a good clean and then slid it back into the rack for now, and we’re up and running again.

I don’t have proper power monitoring on this rack (the UPS will tell me “amps” and “load”, but as it’s severely underloaded and reports in integer values, it’s not precise enough), but last night our power usage dipped to well below 500W overnight, so I’m pretty sure there’s some improvement. The payoff time is likely in the three-year region though, and that’s not counting whether I have to spend even more money to fix the fans, so this hasn’t really been a good move, economically.


Homeprod Upgrades: Part 1 - Renumbering the server subnet

Today, I started work a couple hours early to catch an early planning meeting, which meant I would finish early as well. Since I had a couple of hours to kill (and potentially kill our internet connection) until Duncan came home, and Sabriena was reading a book and not using it, I decided what the hell, let’s kick this show off early.

In preparation for the weekend, on Thursday morning before work I’d already set the DHCP lease time to 300 seconds (5 minutes, if my maths is correct), so there was no reason not to. While I was at it, after reading a thread on a forum someplace, I turned off DHCP guarding as well.

I started out by preparing for the worst-case scenario, which was a reset of everything: I wrote down all the port configs for each switch. I also noted down all the static IPs for each server, since I figured they would reset if I changed the subnet of the network they were on. Finally, to ensure there were no issues with SQLite taking a shit because something evaporated underneath it, I shut down a bunch of services on the Kubernetes cluster by just deleting all the deployments.

I then unplugged the switch cable that went to the server rack, so the servers could talk amongst themselves while I broke everything, and then I hit the button to save the changes with the new subnet on the default VLAN and… nothing. It wouldn’t save; it gave me an error message:

Failed saving network “Default”. {modelType, select, profile {Profile} network {Network} portIpGroup {Port and IP Group} other {}} includes {type, select, User {a Client’s } FirewallRule {a Firewall Rule } other { }}"{name}" configuration. Please remove this first before deleting the {modelType, select, profile {Profile} network {Network} portIpGroup {Port and IP Group} other {}}.

This took a while to figure out, as I’d already removed all the firewall rules that referred to it (and no, the above is not me filling it in, that’s verbatim what the error message said)… I also had to remove the two static routes that pointed to one of the servers in that network, and then I was able to save it.

I then threw up the new network on a VLAN, and added the original subnet back to it, which meant that Unifi would be responding on the old “inform” host as well, and well… everything just worked. So I set about putting the firewall rules back, re-added the static routes for the old LXD server and the new (unused, as of right now) Incus server, plugged the rack switch back in, and then after a few minutes I re-applied all the deployments on the cluster and everything “just worked”.

Well, almost everything: I managed to leave out (trying to simplify) a rule which allowed the server cluster to speak to the IoT subnet, so the lights etc. did not connect… but once I put that back, everything came back as well (I didn’t even have to restart Home Assistant).

I’m really well impressed, it almost went too smoothly. Hopefully that’s an indication (rather than the calm before the storm) of how this weekend’s maintenance will go!


Office lights: almost done

I’ve found a million other, better things to do, but this weekend I finally got around to making some progress on finishing off the shelving in my office, by which I mean finishing up the stupid lighting idea I had.

The cheap adhesive on the back of the RGBW tape did not stick well at all to the underside of the plywood, so I decided to use a staple gun to attach it semi-permanently. I did some test fires into a piece of scrap plywood to get the staple depth right, so they’d stop prior to crushing the tape. I then started along, and got three of the five lighted cubbyholes done before it went wrong in the most predictable fashion ever: I pierced the tape with a staple somehow.

This was immediately evident by the “warm white” color I’d selected being cyan for one segment, and what followed was several hours of work to unsolder the ruined segment and solder a decent one in. At this point I decided to drive the staples in first, and then feed the tape through.

What this meant, however, was that I would have to feed each piece through, then solder and heat-shrink the joins upside down under the shelf, which made the entire job way worse. But by about lunchtime on Sunday I was done, and after cycling through all the colors of the rainbow in Home Assistant I was satisfied everything worked.

Next I used the little reciprocating “renovator” tool to notch out the bullnose trim strip we bought a good couple months ago, and stapled it into place. It almost perfectly hides the LED strip; there’s a couple of places where a join wasn’t quite straight so it bows down just enough to peek out at sitting height, but at standing height you cannot see it at all.

All that’s left now is to counter-sink the staples, fill the holes, and give it one more lick of paint to make it match, and then probably try to do something about matching the wood stain on the bits they did not bother to stain where the closet door was.

Another weekend, this one’s over.
