Last week, I finally got around to ordering replacement disks for the NAS, which I've been meaning to do for months now. There were no new SMART errors when I shut the machine down, but upon reboot it took approximately 2 minutes for the problem drive to be detected correctly. Uh oh!
So on Thursday when the disks showed up, I yanked the old disks out and set them aside, configured the new drives one by one (I'll explain why in a moment), and then rolled out a RAIDZ pool on top of them. I ended up opting for 4TB WD Reds, which seem to be the sweet spot at the moment for $/TB, and with absolutely everything in place the pool will end up about 25% full. I'm optimistic that with a tiny bit of storage management we won't fill them before the warranty runs out. That's the plan, anyway.
So I put the disks in one at a time and labelled them with GEOM, using the numbers found on the end of each drive. I'm not sure if these are batch numbers, part of the encoded serial number, or what, but they're unique to each drive at the moment and serve to uniquely identify each disk for my purposes:
	NAME                 STATE     READ WRITE CKSUM
	tank                 ONLINE       0     0     0
	  raidz1-0           ONLINE       0     0     0
	    label/1AHAK86p3  ONLINE       0     0     0
	    label/5PDYKALp3  ONLINE       0     0     0
	    label/6STX1EZp3  ONLINE       0     0     0
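For anyone wanting to do the same, the process looks roughly like this. The device names (ada1 through ada3) are assumptions for illustration; yours will depend on how the kernel enumerates your drives:

```shell
# Label each disk with the number printed on its end (device names assumed):
glabel label 1AHAK86p3 /dev/ada1
glabel label 5PDYKALp3 /dev/ada2
glabel label 6STX1EZp3 /dev/ada3

# Then build the RAIDZ pool on the labels rather than on the raw devices:
zpool create tank raidz label/1AHAK86p3 label/5PDYKALp3 label/6STX1EZp3
```

The win here is that the pool members show up in `zpool status` under names that physically match a sticker on the drive, rather than device numbers that can shuffle between boots.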
The problem with this is that apparently ZFS isn't guaranteed to pick the drives up by their labels... I haven't run into the issue yet, but I thought I was being super clever and it turns out maybe not so much. The basic idea is that if one of the drives faults and needs replacing, there'll be no guesswork about which disk to pull, assuming ZFS doesn't try to outsmart me by grabbing them by GUID or something instead.
So next up was figuring out how to get the data across from the old pool. I could have (and in retrospect probably should have) added the old drives to the same machine and mounted the old pool, but instead I decided to whack them into an old machine and copy everything across the network. This was an unfortunate idea, and I was already deeply committed to the folly by the time I realized the machine I'd stuck the drives in only had a fast ethernet card. For the record, it takes nearly 36 hours to copy about 1.8TB of data across fast ethernet.
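The back-of-the-envelope math checks out, too. Fast ethernet tops out at 100 Mbit/s, or 12.5 MB/s before protocol overhead:

```shell
# 1.8 TB over fast ethernet at the theoretical 12.5 MB/s line rate:
# 1,800,000,000,000 bytes / 12,500,000 bytes-per-second / 3600 seconds-per-hour
echo "$(( 1800000000000 / 12500000 / 3600 )) hours"   # -> 40 hours
```

So ~36 hours means the link was running close to flat out the whole time.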
We have most of our important data already backed up elsewhere - my work stuff via Git, photos on a USB external drive, and so on. Still, I copied the important things across first with rsync. Then I copied the shows Sabriena really wanted right away, and spun up Plex again on the new machine while everything else copied.
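The nice thing about rsync for a job like this is that it's resumable. The hostname and paths below are made up for illustration, not the actual layout:

```shell
# Archive mode, keep partial files, show progress; safe to re-run after an
# interruption because rsync only transfers what's missing or changed:
rsync -avP oldnas:/tank/photos/ /tank/photos/
```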
Nearly 24 hours into the process, it suddenly stopped. I was working at the time, and heard a telltale sound behind me: a clunk, and a drive spinning down. Yep, the drive I'd been concerned about had stopped - talk about gliding onto the runway on fumes! Unfortunately it took out the entire machine, because a third of the swap lived on that drive, and FreeBSD really doesn't like it when its swap disappears out from under it.
So I rebooted the machine, and after a nervous couple of minutes the drive came back up again (it should be noted that the pool would have been fine running degraded on two drives). rsync picked right back up where it left off, and I got the rest of the data off the pool without any further issues. I did get to see what an automatic resilver looks like, and if I get time I might play with deliberately faulting and resilvering that disk (since none of the data on that drive matters now, it's the perfect opportunity to play) before I permanently decommission it.
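If I do get around to that experiment, the rough shape of it would be something like this (the label is the suspect disk from the pool above; hardware required, obviously, so treat this as a sketch):

```shell
# Take the suspect disk out of the pool on purpose:
zpool offline tank label/6STX1EZp3

# ...write some data to the pool while it's degraded, then bring it back.
# ZFS resilvers only the data that changed while the disk was offline:
zpool online tank label/6STX1EZp3

# Watch the resilver progress:
zpool status tank
```

A full wipe-and-replace would use `zpool replace` instead, which resilvers the entire disk's worth of data rather than just the delta.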