Saturday, February 20, 2016

I tried to set up remote backups, now I'm running a scientific experiment.

You know how programmers trot out this video to explain what dealing with computers is actually like?


This is one of those stories.

For a while now, I've had in mind to set up remote backups. I have duplicity set up, which successfully creates backups, but they're stored on an external hard drive. For better coverage, I wanted to also send them to S3 (duplicity makes encrypted archives with gpg, mind you). Well, I feel a bit uneasy about running the S3 uploader, which is the boto library, as root (I couldn't find any indication one way or the other as to whether duplicity drops privileges during the upload phase, but I'm assuming not). The reason I don't just run all of duplicity as another user is that this other user would need read access to everything on my system. Even if I were to only back up my main user's home directory, I don't like the idea of hoping I never accidentally give an important file permissions that make it unreadable to the backup user.

So I decided that, rather than run the built-in duplicity-to-S3 upload, I would just write the archives to the external drive, and then sync that to S3. That requires that I change the group ownership of all of these archives so that my special backup_uploads user can read them. I had run through many iterations of which folders to back up, how often to do incremental backups, how often to do full backups, and how often to make the backups remote. As it turns out, even with my reasonably fast Internet connection (Sonic, fiber to the node, albeit in a remote area in Berkeley), I can't dream of backing up anywhere close to all my files, all the time. So I came up with an alternate idea for the big, rarely changing stuff (photos and videos) - a scheme involving two external drives that I would swap out with a friend or family member. I would then shrink down what I was uploading to S3 to something manageable. So my job just got more complicated, and this is why backups are awful.

That's when I made a mistake in writing my shell scripts, which I had restructured to make it easy to have different parameters for each folder. Suffice it to say, a variable I was expecting didn't get passed into a bash script, and I ended up running chgrp -R backup_uploads on the root of the file system, for a second or two before I figured it out. For those unfamiliar, this means I screwed up the permissions on a lot of my system files. So, it was time to reinstall my system. Fortunately, I had been making backups! (I think this properly qualifies as ironic, right?) I had made backups of most of it, anyway. The backups were a few days old. I did a lot of rsyncing and diffing and chgrp'ing to get it all back in sync, and I have the data that I think I want to restore, ready to go.
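As a rough illustration of the kind of guard that catches this class of mistake, here is a minimal Python sketch. It is not the script described above: the function name `resolve_backup_root` and the variable name `BACKUP_DIR` are made up for the example. The idea is simply to refuse to continue when the expected variable is missing, empty, or resolves to the filesystem root, instead of letting a recursive command run against whatever path is left over.

```python
import os


def resolve_backup_root(env, var_name="BACKUP_DIR"):
    """Return the absolute directory named by env[var_name].

    Both the function and the BACKUP_DIR name are hypothetical. The point
    is to fail loudly instead of guessing when the variable is missing,
    empty, or points at the filesystem root.
    """
    raw = env.get(var_name, "")
    if not raw.strip():
        raise ValueError(f"{var_name} is unset or empty; refusing to continue")
    path = os.path.abspath(raw)
    # For a root directory, dirname(path) == path, so this catches "/".
    if os.path.dirname(path) == path:
        raise ValueError(f"{var_name} resolves to the filesystem root; refusing")
    return path
```

A caller would pass `os.environ` and only go on to change group ownership if no exception was raised. Taking the environment as a parameter, rather than reading it inside the function, keeps the check easy to exercise with a plain dictionary.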

This was a great opportunity to confirm that my backups do indeed work. It was also a great way to learn about safe shell scripting. The "don't" option in that link, at this point, is the advice that makes the most sense to me. I will redo all this in Python now. But if I must, now I know how to have bash safely fail if there's a missing variable (as was the case here), among other potential problems. But there was a third reason this was a fortunate mistake, though not in any way pleasant.

All data squared away, I was ready to install Ubuntu server again on my mini server. I downloaded the Ubuntu Server iso, the SHA256SUMS file, the SHA256SUMS signature file, and the relevant PGP keys. Everything checked out, and I found the fingerprints of said keys on Ubuntu's site, and on Stack Overflow. I'm as sure as I can be without meeting somebody in person.
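The checksum half of that verification can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper names are invented, and the checksum file is assumed to use the usual `<hex digest> *<filename>` layout. It does not check the PGP signature on the checksum file, which is a separate step done with gpg.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large image never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_digest(sums_text, filename):
    """Find the digest listed for filename, or None if it is not listed."""
    for line in sums_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue
        digest, name = parts
        # A leading "*" marks binary mode in checksum files; drop it.
        if name.strip().lstrip("*") == filename:
            return digest.lower()
    return None


def iso_matches(iso_path, sums_path, filename):
    """True only if filename is listed and the image hashes to that digest."""
    with open(sums_path, "r", encoding="utf-8") as handle:
        expected = expected_digest(handle.read(), filename)
    return expected is not None and sha256_of_file(iso_path) == expected
```

Returning False when the filename is not listed at all is deliberate: a missing entry should never be mistaken for a passing check.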

I flash the installer disk image onto a spare USB stick. I put it into my mini server. It boots up. I run the integrity check on the installer disk, because why not. FAIL. efi.img was corrupted. That's interesting. I went back to my laptop with my USB stick, mounted my iso locally, and diffed the hex dumps. Sure enough, a single bit had flipped. That could be a fluke, or a cosmic ray, but I'm not taking chances on defective hardware. I labeled the USB stick as "to zero out and throw away", and put it aside.
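Diffing hex dumps works, but the same comparison can be done directly on the bytes. The following is a minimal sketch, not the procedure actually used here; the function names are invented for the example. It reports every differing bit between two equal-sized files as a byte offset plus a bit index, which makes it easy to see whether two sticks failed at exactly the same place.

```python
def flipped_bits(original, copy):
    """Yield (byte_offset, bit_index) for each differing bit.

    bit_index 0 is the least significant bit of the byte.
    """
    if len(original) != len(copy):
        raise ValueError("inputs differ in length; compare equal-sized images")
    for offset, (a, b) in enumerate(zip(original, copy)):
        diff = a ^ b  # set bits mark the positions that differ
        bit = 0
        while diff:
            if diff & 1:
                yield offset, bit
            diff >>= 1
            bit += 1


def compare_files(path_a, path_b):
    """Read both files fully and list the differing bits.

    Reading everything at once is fine for a small file such as an EFI
    image; a multi-gigabyte image would want chunked reads instead.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return list(flipped_bits(fa.read(), fb.read()))
```

For instance, `list(flipped_bits(b"\x00\x10", b"\x00\x11"))` gives `[(1, 0)]`: one flipped bit, in the second byte, at the lowest bit position.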

The next day I went to Walgreens and picked up a brand new USB stick. I come home, flash the installer disk onto it. Plug it into my mini-server and.... integrity check passes. Okay, cool. It was late, so I went to sleep. The day after, before commencing with the install, I did a mem test, because why not. After a half hour, I realized it would be very slow, so I decided to cancel it and just do it overnight after installing. I ran the integrity check again, because why not. FAIL. Same file as before. I bring it to my laptop: the same bit has changed! Again, this was now on a separate, and brand new, USB stick.

I pulled out the original USB stick, and did a hex dump to confirm that it was indeed the same bit that had changed on both disks. However, while it was just sitting there, as best I can tell, more bits were now broken on the old USB stick. So, it seems as though the disks have somehow been made defective by this process.

At this point I'm sort of at a loss. I re-flashed the iso onto the new USB, and ran the mem test on the mini-server that night as originally planned, since its memory is a likely culprit at this point, but it came up fine with three passes. What I have remaining is to run a mem test on my laptop. I gave my laptop the "spill test" a couple months back, maybe that screwed something up, but it is a Lenovo and it's built to withstand such stupidity.

Beyond that, I've decided to buy six identical brand new USB sticks, to run an experiment. I will burn the ISO to 3 USB sticks right from my mini box, and likewise to 3 from my laptop. I will not plug them into anything else. Then I wait, and maybe restart my computers a few times. Hopefully one of them will break again, and I'll know which computer to blame. Otherwise I have no idea how to check what hardware of mine is busted.

So here I am, I started with wanting to run backups, now I'm conducting a scientific experiment. I don't understand how people handle computers.