A while back, I built a NAS out of parts we had lying around as an experiment. We're using three 1TB hard drives (two identical ones and a third completely unrelated one) in a ZFS RAID array (essentially RAID5, based on my completely elementary understanding) on FreeBSD. That experiment was pretty much a resounding success: we use the thing almost daily and it's been super handy. However, there is one minor problem with it:
# for i in /dev/ada[0-2]; do smartctl -a $i; done | grep 'Pre-fail\|Old_age' | wc -l
69
The drives were ancient when I built the thing. I just didn't want to sink a bunch of money in if we weren't going to use it, or if it wound up being a pain in the ass, or anything like that. The SMART data wasn't fantastic when they were installed, but now there is absolutely nothing that isn't in either Pre-fail or Old_age state. So far there are no detected errors in the filesystem, but the SMART data is certainly worrying, so it's time to start thinking about disks.
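One caveat on that count, as far as I understand smartctl's attribute table: the TYPE column marks every attribute as either Pre-fail or Old_age, so the grep above counts attributes rather than failures. The WHEN_FAILED column is the one that says whether an attribute has actually crossed its threshold. A rough sketch of a filter for that, assuming the usual ten-column `smartctl -A` layout; the function name `smart_failed_attrs` is made up for illustration:

```shell
# Hypothetical helper (not part of smartmontools): reads `smartctl -A` output
# on stdin and prints "ATTRIBUTE_NAME WHEN_FAILED" for every attribute whose
# WHEN_FAILED column (field 9) is something other than "-".
# Assumes the standard column order:
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_failed_attrs() {
    awk '($7 == "Pre-fail" || $7 == "Old_age") && $9 != "-" { print $2, $9 }'
}

# Intended use on the NAS (left commented out, needs root and real disks):
#   for i in /dev/ada[0-2]; do smartctl -A "$i" | smart_failed_attrs; done
```

If that prints nothing, no attribute has tripped its threshold yet, which doesn't make decade-old disks any less worrying.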
I ummed and ahhed over disks and in the end got cold feet. It didn't make much sense to spend several hundred bucks on disks and then strap them to a 10 year old PSU in an ancient case with poor airflow, and the total cost quickly ballooned to more than I wanted to spend at once, so I compromised. I've ordered the parts necessary to replace the PSU and case, and I'll keep using the old disks for another month. I'll be working a fair bit of overtime over that month, so I'll be able to more readily drop the money on disks. At this stage I'm leaning towards Seagate or WD 4TB drives; both offerings from my preferred local vendor have pretty good failure stats so far (at least, they aren't in the batches that appear to fail regularly). In the meantime, it's time to address the elephant in the room: my woeful backup policy.
Once upon a time I was really good about backups. When my average hard disk size was 1-8GB, mostly filled with shit I didn't care about, I was regularly burning CD-Rs with all the important stuff on them. Then one day I went to spelunk through old data and realized the dye layer had failed on one of my Kodak Gold CD-Rs, which supposedly came with a 100-year guarantee but had failed after about a decade. I became disillusioned with optical backups, and my ballooning storage and the time it took to burn a DVD-R only served to cement that decision.
I moved to using cloud storage for backups - I'm a huge fan of Tarsnap and to a lesser extent Amazon S3 in general - but only for really important shit. There are still things like photos and so on that aren't backed up unless we've uploaded them to somewhere like Flickr. We should fix that.
I'm considering using something like S3 for them, but in the meantime we have a 2TB NAS and a 2TB external USB drive, so we could do something with this, right?
I used FUSE to mount the NTFS partition on the FreeBSD NAS, fired up rsync to copy some files across and left it running overnight. In the morning, I ran the script again, and it promptly started from scratch - what the fuck? Apparently this is a common problem, as NTFS is wacky about modification times, so I tried checksum mode... which was far too slow. Even if no changes have taken place, ~500GB of binary files is going to take a while to checksum, particularly on a low-spec machine. Finally, I switched to size-only verification, which works, but a file that changes without changing size won't be re-copied, which leaves the potential for corruption of things like Steam backups:
rsync -rav --size-only /Storage/Games/ /mnt/Games/
So it's not perfect, but it does work. It doesn't take too much effort to plug the hard disk in and run a short shell script every month or so to keep things backed up.
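For what it's worth, here's a rough sketch of what that short shell script could look like, with a cheap spot-check bolted on to partly cover the size-only blind spot. The function names, the idea of checksumming only the first few files, and the device name and mount point in the comments are all assumptions for illustration, not what's actually running on the NAS:

```shell
#!/bin/sh
# Sketch of a monthly backup script. Names and paths are illustrative.

# Copy src to dest, comparing files by size only, since modification times
# on the FUSE-mounted NTFS disk can't be trusted.
backup_dir() {
    rsync -a --size-only "$1" "$2"
}

# Checksum the first N files (in sorted order) under src and compare them
# with the same paths under dest. Prints one line per missing or differing
# file, and nothing if everything checked matches.
spot_check() {
    src=$1; dest=$2; n=$3
    (cd "$src" && find . -type f | sort | head -n "$n") | while IFS= read -r f; do
        if [ ! -f "$dest/$f" ]; then
            echo "MISSING: $f"
            continue
        fi
        a=$(cksum < "$src/$f")
        b=$(cksum < "$dest/$f")
        [ "$a" = "$b" ] || echo "MISMATCH: $f"
    done
}

# Intended use (commented out; /dev/da0s1 and /mnt are assumed names):
#   ntfs-3g /dev/da0s1 /mnt
#   backup_dir /Storage/Games/ /mnt/Games/
#   spot_check /Storage/Games /mnt/Games 20
#   umount /mnt
```

Checking the same first N files every run is obviously not a real verification pass; it's only meant as a cheap sanity check between the occasional full checksum run.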
I'm going to look at the cloud-based options next, because if we have a fire or a flood, for example, things like photos will be gone forever. At least for the time being, though, this will work for most cases.
I just need to pull the trigger on disks now!
Update - 2018-01-11: Disks are going to have to happen after this payday I think:
Error 41 occurred at disk power-on lifetime: 17292 hours (720 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 00 08 80 00  Error: UNC at LBA = 0x00800800 = 8390656

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 01 00 08 80 40 00      00:02:24.479  READ DMA
  2f 00 01 10 00 00 00 00      00:02:24.398  READ LOG EXT
  60 00 01 01 08 80 40 00      00:02:21.620  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      00:02:21.531  READ LOG EXT
  60 00 01 01 08 80 40 00      00:02:18.770  READ FPDMA QUEUED
Still no ZFS errors, but I feel like that drive is not long for this world. Thankfully the others are just "old age" and "prefail" still.