GitLab Data Loss: A Discussion

In case you missed the big news in the industry this week, a GitLab employee accidentally deleted a ton of production data and took their platform down for hours. It was only when everything was on fire and they were in deep trouble that they turned to their backup systems… only to find that none of them actually worked.

Backup Prod Data Regularly

Not exactly a groundbreaking statement, right? Everybody knows this. If there were a “working in corporate IT 101” manual, it would have a chapter on this concept. It’s common sense.

Even so, a lot of people and companies – like GitLab – tend to “set and forget” their backups. They probably created their backup mechanism years ago, tested it at the time, confirmed that it worked, and then scheduled it to run every night at 1am EST or something. Then, since it was out of sight and out of mind, they promptly forgot about it and moved on to other things. After all, they never had a need to check on it, right? Nothing had broken down. Until yesterday.

A Guide To A Good Backup Process

The secret to ensuring that your backup process is effective and functional is to integrate it into your daily work. One of the best ways to do this is to use it to set up each new dev’s local environment. Have them install and configure the IDE and related tools, and then have them pull down the most recent backup and restore from it to set up their local database. What’s that, you say? It contains PII and other sensitive data? You’re probably right, which is why your backup process should, where appropriate, create two copies: one with the sensitive data stripped out (for local dev environments) and one left untouched (for production restores).
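To make that concrete, here is a minimal sketch of what a nightly two-copy backup job could look like. It assumes a PostgreSQL database and the standard pg_dump/pg_restore/psql/createdb tools on the PATH; the database name, backup paths, and the specific table and columns being scrubbed are placeholders for illustration, not a prescription.

```python
"""Nightly backup sketch: one full dump for production restores and one
sanitized dump for local development (PostgreSQL assumed)."""
import datetime
import subprocess

DB = "app_production"          # assumed database name
STAMP = datetime.date.today().isoformat()

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

# 1. Full dump, sensitive data included: the copy you restore production from.
run(f"pg_dump --format=custom {DB} > /backups/{STAMP}-full.dump")

# 2. Restore that dump into a throwaway scratch database...
run("createdb scrub_scratch")
run(f"pg_restore --dbname=scrub_scratch /backups/{STAMP}-full.dump")

# 3. ...strip sensitive columns there (table and columns are hypothetical)...
sql = ("UPDATE users SET email = 'user' || id || '@example.com', "
       "full_name = 'Test User', phone = NULL;")
run(f'psql scrub_scratch -c "{sql}"')

# 4. ...and dump the sanitized copy that new hires use to seed local envs.
run(f"pg_dump --format=custom scrub_scratch > /backups/{STAMP}-sanitized.dump")
run("dropdb scrub_scratch")
```

The point isn’t the specific tools; it’s that both copies come out of the same pipeline, so every new-hire setup is also a test of your production restore path.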

Great, so you’ve confirmed that your backups work for a local environment, but what about production? The next step in a good process is simple too: artificially destroy your production environment regularly. Set up failover tests during off hours (and compensate your amazing site reliability / IT team appropriately for running these tests at odd hours too). I recommend once per quarter as a starting point: at 2am on a Sunday, take your production database offline (don’t actually delete it, so you can bring it right back if you discover that your backup system isn’t working). Let your staff work to restore a recent backup and bring the site back online. Announce the outage to your users in advance, and update people on social media or via email when it begins and ends.
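A sketch of what the drill itself could look like, under the same PostgreSQL assumption as above: restore the latest full dump into a standby database, repoint the application, and time the whole thing so you know your real recovery window. The backup path, database names, deploy command, and health-check URL are all hypothetical stand-ins for whatever your stack actually uses.

```python
"""Quarterly restore-drill sketch: restore the newest backup, repoint the
app, and measure how long recovery actually takes."""
import glob
import subprocess
import time
import urllib.request

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

start = time.monotonic()

# Newest full (non-sanitized) dump produced by the nightly backup job.
latest = sorted(glob.glob("/backups/*-full.dump"))[-1]

# Restore into a fresh standby database rather than over the offline prod
# copy, so the original data is untouched if the drill goes sideways.
run("createdb app_standby")
run(f"pg_restore --dbname=app_standby {latest}")

# Repoint the application at the standby (hypothetical deploy command).
run("deployctl set-database --name app_standby")

# Poll a health endpoint (hypothetical URL) until the site answers again.
while True:
    try:
        if urllib.request.urlopen("https://example.com/health").status == 200:
            break
    except Exception:
        pass
    time.sleep(10)

minutes = (time.monotonic() - start) / 60
print(f"Drill complete: site restored from backup in {minutes:.1f} minutes")
```

That final number is the one you want to track quarter over quarter: it tells you, with evidence, how long a real outage will last.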

There is much to be learned and gained from this intrusive and destructive process. For one, you will force your dev team to create a good “the site is down” experience since your customers will otherwise see infinitely spinning web pages or terrible error dumps. Another is that you can time the average outage and thus discern how long you’ll be down if your production database ever actually takes a spill. Finally, your disaster recovery staff will be fresh on their skills and able to fix your real outages quickly and predictably. There are many tangible and hidden benefits derived from just a few hours of planned outage per year.
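As for that “the site is down” experience, the idea is simply to fail to something deliberate. A minimal sketch, using Python’s standard http.server purely for illustration: when the database probe fails, serve a friendly 503 maintenance page with a Retry-After header instead of a spinner or a raw stack trace. The check_database() probe and the copy on the page are placeholders.

```python
"""Minimal maintenance-mode sketch: return a deliberate 503 page when the
database is unreachable, instead of an error dump."""
from http.server import BaseHTTPRequestHandler, HTTPServer

MAINTENANCE_PAGE = b"""<html><body>
<h1>We'll be right back</h1>
<p>We're restoring service and expect to be back shortly.
Check our status page or social media for updates.</p>
</body></html>"""

def check_database():
    # Stand-in probe; replace with a real connection test.
    return False

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not check_database():
            # 503 + Retry-After tells browsers and crawlers this is temporary.
            self.send_response(503)
            self.send_header("Retry-After", "3600")
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(MAINTENANCE_PAGE)
            return
        # Normal request handling would go here.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")

HTTPServer(("", 8080), Handler).serve_forever()
```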

GitLab Did One Thing Right

The final step in your solid, functional backup process, which you test quarterly and use to spin up new dev hires, is to document the hell out of everything. When you run these planned outages, have the disaster recovery staff document, step by step, the actions they took to restore service. When you have real, live outages, document those too and share the knowledge with the public.

GitLab got this part right, and is being held up as a great example and learning experience for the industry instead of being resented for mysterious downtime and silence. I promise you that this week, many disaster recovery people are doing extra backup tests that they wouldn’t have thought to do otherwise – all as a direct result of the GitLab incident. Making your disasters and their recoveries public creates goodwill in the community, provides a learning experience, and shows people that you can be trusted.

GitLab took a bad situation and created the best possible outcome, both for themselves and the entire community. For that they should be thanked, not mocked. After all, we are all human and we all make mistakes. Knowing this, you’ll be really glad that you practice making mistakes every quarter when your production database actually goes down in flames.
