When backups fail: A cautionary sysadmin tale
It was late summer in 2000 when things went terribly awry in my new job at EDS. The backups that we needed had failed. I traced the failure's root cause to numbers 4 and 7 in my article, 10 things I wish I'd known before becoming a Linux sysadmin. I discovered that we hadn't had a good backup of the systems in question in at least three years. I discussed this failure with the backup and restore (BUR) team lead, and he and my manager held the same opinion: it was my fault that the backups were bad. Here's the interesting part of the story: I'd been at this job for less than four months.
There were other people in the group with varying levels of technical expertise, but one person was praised as a "guru" and, much to my chagrin, no one called her out for not checking the failed backups. My manager actually told me that I "should have been checking those backups," and that it was my responsibility to do so. I had wrongly assumed that the BUR team would verify the backups.
And, yes, I did bring up the fact that the backups hadn't worked for three years and that three years was as far back as the backups went. So, basically, it's likely that there had never been good backups of those systems.
I took responsibility, albeit under protest, and then also took on the action item of getting backups working on the twenty or more systems that monitored our infrastructure. It took me a couple of weeks to get it all going, to test, and to verify that the backups were working. And although I considered this task to be significant, I never heard a "good job" or "thank you" for my work. I assume my lack of accolades for a successful backup implementation was because it had been deemed my fault that the backups had never worked.
My best advice to all system administrators is to verify backups for every system you touch or might have adjacent responsibility for, because someone will most likely eventually need to point a finger, and it could be at you.
Here's how I verify backups to ensure that they're working on my systems:
- On each system, create a restore_test.txt file buried deep in the filesystem.
- Create a script to scrape the backup logs for your restore_test.txt file.
- Select a random system once per week and restore the restore_test.txt file.
- Create a backup_restore_log.txt file and log your weekly progress.
- Prepare to share the backup_restore_log.txt file with your manager in case of a failure, disaster, accident, or neglect.
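The checklist above can be sketched in a few lines of Python. This is a minimal illustration, not the exact tooling I used: it assumes your backup software writes a plain-text log that lists the path of each file it backed up, and the function names, the system names, and the log line format are all made up for the example. The restore itself depends on your backup product, so it isn't shown; you record its outcome by hand.

```python
from __future__ import annotations

import random
from datetime import date
from pathlib import Path

MARKER = "restore_test.txt"  # the canary file from the checklist


def marker_in_log(log_text: str, marker: str = MARKER) -> bool:
    """Return True if any line of the backup log mentions the canary file.

    Assumes a plain-text log with one backed-up path per line.
    """
    return any(marker in line for line in log_text.splitlines())


def pick_system(systems: list[str], rng: random.Random | None = None) -> str:
    """Choose one system at random for this week's restore test."""
    return (rng or random).choice(systems)


def record_result(
    log_path: Path,
    system: str,
    backed_up: bool,
    restored: bool,
    when: date | None = None,
) -> str:
    """Append one dated line to backup_restore_log.txt and return it."""
    when = when or date.today()
    line = (
        f"{when.isoformat()} {system} "
        f"backed_up={'yes' if backed_up else 'NO'} "
        f"restored={'yes' if restored else 'NO'}"
    )
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

A weekly run might call pick_system() on your list of hosts, feed that host's backup log to marker_in_log(), attempt the restore with your backup tool, and then pass both outcomes to record_result(). Appending rather than overwriting keeps a dated history that you can hand to your manager when questions come up.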
Hopefully, your work environment isn't as dysfunctional as mine was. But, just in case any issues arise, be proactive in checking backups and verifying that you can restore a file from them. It's too important a task to leave to chance. Whether you have official responsibility or not, make it your job to verify that backups are being done and that they're working as expected.