At work, we’ve had a minor grievance with ZFS for quite some time now: hung umount tasks with no apparent recovery short of a reboot. We had a pile of tweaks that minimized how often it happened, and we used our neato infrastructure to recover from it and minimize the impact on clients. That all changed with Ubuntu Bionic, which saw it come back with horrifying regularity, so this weekend, with nothing much better to do, I wondered if I could reproduce it in the lab.
I set up a VirtualBox VM on my desktop with two virtual disks: a sparse-allocated 50GB disk for the boot drive, and a fully-allocated 100GB disk to back the ZFS pool (so I could abuse it at full speed). I installed Ubuntu Bionic on it, installed ZFS + LXD, and wrote some nasty scripts to spin up and destroy containers and… nothing.
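The churn scripts were roughly this shape — a sketch, not the real thing; the container names and the `ubuntu:18.04` image alias are my assumptions here, not the exact rig:

```shell
#!/bin/bash
set -euo pipefail

# Launch N throwaway Bionic containers (image alias is an assumption)
spin_up() {
    local n=$1
    for i in $(seq 1 "$n"); do
        lxc launch ubuntu:18.04 "churn-$i"
    done
}

# Destroy them again; --force stops and deletes in one go
tear_down() {
    local n=$1
    for i in $(seq 1 "$n"); do
        lxc delete --force "churn-$i"
    done
}

# Usage: spin_up 12; tear_down 12
```

Looping the two in a tight cycle is enough to churn dataset creation and destruction on the backing pool, which is where the hang was suspected to live.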
I then realized we were running a different version of ZFS, so I installed that, rebooted, and had another crack at it — still nothing. Almost deterred, I took another look at our configuration and noticed we were using deduplication, which I suspected to be the culprit because frankly I’ve had really shitty luck with dedup (though I generally run ZFS on much worse hardware than we use at work!). I blew away all the containers, turned dedup on for the LXD pool, and spun up a bunch more containers.
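Turning dedup on is a one-liner; a sketch of what I mean, assuming the pool is named `lxd` (check `zpool list` or `lxc storage list` for the actual name on your setup):

```shell
# Enable dedup on the pool/dataset backing LXD. Note that dedup only
# applies to writes made after the property is set; existing blocks
# are left alone.
enable_dedup() {
    local pool=$1
    zfs set dedup=on "$pool"
    # Confirm the property took
    zfs get -H -o value dedup "$pool"
}

# Usage: sudo bash -c '. ./dedup.sh; enable_dedup lxd'
```

Since only new writes are deduplicated, blowing away the old containers first (as above) and spinning up fresh ones means everything in the test actually goes through the dedup table.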
I wrote another script that would write 20MB of random data to each container’s filesystem, take a snapshot, then overwrite the data, and sure enough, after about 12 containers LXD became completely unresponsive. Two minutes later I had the “hung task” messages in syslog, meaning there was a pretty good likelihood this is entirely reproducible.
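That reproducer script looked something like this — again a sketch under assumptions (the file path inside the container and the snapshot names are mine, not the real script’s):

```shell
#!/bin/bash
# Write 20MB of random data into a container, snapshot it, then
# overwrite the data. The snapshot pins the old blocks, so the rewrite
# allocates fresh blocks (and fresh dedup-table entries) each round.
hammer() {
    local ct=$1 rounds=$2
    for r in $(seq 1 "$rounds"); do
        # 20MB of random data in the container's filesystem
        lxc exec "$ct" -- dd if=/dev/urandom of=/root/junk bs=1M count=20
        # Snapshot it so the blocks stay referenced
        lxc snapshot "$ct" "hammer-$r"
        # Replace the data: new blocks, more DDT growth
        lxc exec "$ct" -- dd if=/dev/urandom of=/root/junk bs=1M count=20
    done
}

# Usage: for ct in $(lxc list -c n --format csv); do hammer "$ct" 1; done
```

Random data is the point here: it doesn’t dedup at all, so every write is a miss that grows the dedup table rather than a cheap hit.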
At this point I decided to go back to my video games and leave it till tomorrow, but I threw a quick message up in Slack, which seemed to thrill the boss… so Monday morning we’ll have to assemble a bug report and see if folks much smarter than I am can actually work out where the bug is and squash it. I’m still really excited at what appears to be progress, though!