Any sysadmin worth their salt knows backups are one of their major priorities. Maintaining a good backup is crucial for ensuring business continuity when the inevitable unforeseen catastrophe occurs.
But not all backup plans are equal. It’s a complex subject, especially when we’re dealing with production environments that are in a constant state of flux.
In an embarrassingly well-publicized mishap, KDE, developers of the popular open source desktop environment, saw how not thinking hard enough about backing up data can have serious consequences.
The KDE Git repository is kept on a pair of virtual machines. That master repo is mirrored to a large number of secondary repositories around the world. When people commit or clone the repos, they use the secondary repositories to spread the load.
Recently, the main repo was taken down for security updates, and when it was brought back up, they noticed filesystem corruption. Unfortunately, because the main repo was mirrored out to the secondary repos, every one of them contained a copy of the corrupted files. There were no clean repos, so it was impossible to sync back from a secondary repo to get a clean version onto the main server.
The KDE folks were able to solve the problem, but, as they acknowledged, they were very lucky. They narrowly avoided a catastrophe.
The KDE backup solution was eminently scalable, but it was neither reliable nor secure. They’d failed to properly account for all potential failure scenarios. For data to count as backed up, it’s not sufficient that it exist in many different places; it must also exist in sufficient historical versions. A hundred copies of broken data are no better than one.
There are various solutions that KDE could and should have implemented to make sure that their backups were reliable. One, which they are considering, is to use the ZFS file system, which is capable of making snapshots. But a more common solution would be to take regular backups and set up a cron job to copy them to an external storage device.
If they had backups available, it would have been fairly easy to take one of the production servers offline, sync it to the backup, and then put it back online.
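To make the idea of regular, versioned backups concrete, here is a minimal sketch in Python. It is not what KDE runs; the function name `make_backup`, the `keep` and `stamp` parameters, and the directory layout are all assumptions made for illustration. Each run writes a timestamped archive to a separate destination (for example, a mounted external drive) and keeps only the newest few, so that several historical versions are always on hand.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def make_backup(source: Path, dest_dir: Path, keep: int = 7,
                stamp: Optional[str] = None) -> Path:
    """Archive `source` into `dest_dir` under a timestamped name, then prune.

    Hypothetical helper for illustration only. `keep` is the number of
    historical archives to retain; `stamp` overrides the timestamp and
    exists mainly to make the function easy to test.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    source = source.resolve()
    dest_dir = dest_dir.resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)

    # A sortable timestamp means name order is also age order.
    stamp = stamp or time.strftime("%Y%m%dT%H%M%S")
    base = dest_dir / f"{source.name}-{stamp}"

    # Creates <base>.tar.gz containing the `source` directory.
    archive = shutil.make_archive(
        str(base), "gztar", root_dir=source.parent, base_dir=source.name
    )

    # Drop everything except the `keep` newest archives.
    archives = sorted(dest_dir.glob(f"{source.name}-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()

    return Path(archive)
```

A cron job could call a script like this once a night. Keeping several versions is the important part: if the source is already corrupted when a backup runs, that archive will be corrupted too, and only an older one will help. A file-level archive of a busy repository may also catch it mid-write, so a real setup would likely want to verify each backup before trusting it.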
How do you manage backups? If you were designing KDE’s new backup protocols, how would you go about it? Let us know what you think in the comments.