Un-fucking BGP again…
Over the weekend, I restarted our UDM-SE to apply a security fix that had been bugging me all week. What I didn’t realize at the time was that I’d managed to break some of the MetalLB load balancers on our production home network.
After an excruciating time trying to work out why some of the services didn’t work (data was making it to the LBs just fine, but it wasn’t being sent on to the services backing them) and poring over the iptables output, all of which looked correct, I’d missed the obvious issue: BGP on our Calico network was broken!
I’d missed this because the one machine that was causing the issue was actually the one where everything was working… packets routed to what used to be our disk server were fine; it was anything that went elsewhere that wasn’t. The reason, once again, was my hare-brained idea to run BGP servers for MetalLB/Calico and LXD on the same physical machine. Typically this manifests itself as LXD failing to start, but in this case the opposite happened.
I only noticed when, to rule it out, I stopped LXD on the machine and everything “just worked” immediately. Bugger! Now LXD won’t start, because it’s trying to bind to the BGP port and K8s is hogging it. I can’t turn off LXD’s BGP listener without starting LXD, and I didn’t want to stop BIRD on the K8s cluster just to make it work.
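For anyone chasing the same symptom, here’s a rough sketch of how to see who’s actually sitting on the BGP port. It assumes the standard TCP port 179 and the usual `ss -H -ltnp` column layout (local address in the fourth field); the function name is just something I made up for illustration.

```shell
# Keep only listening sockets bound to TCP port 179 (BGP).
# Expects `ss -H -ltnp` style lines on stdin, where field 4
# is "Local Address:Port" (assumed column layout).
bgp_listeners() {
    awk '$4 ~ /:179$/'
}

# Typical use (run as root so the owning process names are shown):
#   ss -H -ltnp | bgp_listeners
```

If that shows BIRD holding the port while LXD is refusing to start (or the other way around), you’re probably looking at the same fight between the two.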
The solution, as I’ve done before, is sqlite fuckery on the dqlite database:
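# Queue a patch for LXD's local (per-node) database; it gets applied on the next start
echo "DELETE FROM config WHERE key LIKE 'core.bgp_%';" > /var/snap/lxd/common/lxd/database/patch.local.sql
# Same again for the global (cluster) database
echo "DELETE FROM config WHERE key LIKE 'core.bgp_%';" > /var/snap/lxd/common/lxd/database/patch.global.sql
snap start lxd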
The documentation specifically says not to do this without asking LXD’s support team first, but we don’t have time for that.
Besides, I really should migrate the few containers/VMs still left in there to another machine, and if I did destroy it, I’m well-versed in plucking containers out of the wreckage anyway.
Anyway, now LXD is started, but I can’t route to the containers, because LXD is no longer announcing them over BGP… since I only have one LXD host at this stage, I can just add a static route like I probably should have to begin with. Add that in my router and job’s done: everything’s working, and more importantly it should all come back up after the next reboot.
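I added mine on the router itself, but on a plain Linux box the equivalent would look roughly like this. Both addresses are made-up placeholders, not my real network: substitute your own container subnet and the LAN address of the LXD host.

```shell
# Hypothetical values for illustration only.
LXD_SUBNET="10.10.10.0/24"   # subnet the LXD containers live on (assumed)
LXD_HOST="192.168.1.50"      # LAN address of the single LXD host (assumed)

# Build the command and print it first; run it as root once it looks right.
route_cmd="ip route add ${LXD_SUBNET} via ${LXD_HOST}"
echo "$route_cmd"
```

A route added this way with `ip route add` doesn’t persist across reboots on its own, which is why putting it in the router’s config is the better home for it.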