In case you missed the big news in the industry this week, a GitLab employee accidentally deleted a ton of production data and took their platform down for hours. It was only when everything was on fire and they were in deep trouble that they turned to their backup systems… only to find that none of them actually worked.
Backup Prod Data Regularly
Not exactly a groundbreaking statement, right? Everybody knows this. If there were a “working in corporate IT 101” manual, it would have a chapter on this concept. It’s common sense.
Even so, a lot of people and companies – like GitLab – tend to “set and forget” their backups. They probably created their backup mechanism years ago, tested it at the time, confirmed that it worked, and scheduled it to run every night at 1am EST or something. Then, since it was out of sight and out of mind, they promptly forgot about it and moved on to other things. After all, they never had a need to check on it, right? Nothing had broken down. Until yesterday.
A Guide To Good Backup Process
The secret to ensuring that your backup process is effective and functional is to integrate it into your daily work. One of the best ways to do this is to use it to set up each new dev’s local environment. Have them install and configure the IDE and related tools, then have them pull down the most recent backup and restore from it to set up their local database. What’s that, you say? It has PII and sensitive data? You’re probably right, which is why your backup process should, as appropriate, create two copies: one that strips the sensitive data (for local dev environments) and one that doesn’t (for prod restores).
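To make the two-copy idea concrete, here’s a minimal sketch in Python. The field names (`email`, `ssn`, `phone`) and the JSON-records format are hypothetical stand-ins – substitute whatever your actual schema and dump tooling use.

```python
import json

# Hypothetical PII fields to strip from the dev copy; adjust to your schema.
PII_FIELDS = {"email", "ssn", "phone"}

def scrub_record(record):
    """Return a copy of one record with PII fields masked."""
    clean = dict(record)
    for field in PII_FIELDS & clean.keys():
        clean[field] = "REDACTED"
    return clean

def write_backups(records, full_path, scrubbed_path):
    """Write two copies of the backup: a full dump (prod restores only)
    and a scrubbed dump that is safe to hand to a new dev."""
    with open(full_path, "w") as f:
        json.dump(records, f)
    with open(scrubbed_path, "w") as f:
        json.dump([scrub_record(r) for r in records], f)
```

The point isn’t the masking strategy – swap in whatever anonymization your compliance folks require – it’s that both copies come out of the same nightly run, so the dev copy constantly proves the backup pipeline still works.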
Great, so you’ve confirmed that your backups work for a local environment, but what about production? The next step in a good process is simple too: artificially destroy your production environment regularly. Set up fail-over tests at off hours (and compensate your amazing site reliability / IT team appropriately for working them). I recommend once per quarter as a starting point: at 2am on a Sunday, take your production database offline (don’t delete it – you want to be able to bring it back if it turns out your backup system isn’t working). Let your staff work to restore a recent backup and bring the site back online. Announce the outage to your users in advance, and update people on social media or via email when it begins and ends.
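One useful by-product of each drill is hard data on how long a restore actually takes. A minimal sketch of instrumenting that step, assuming the restore can be run as a single command (the `pg_restore` invocation in the comment is a placeholder, not a prescription):

```python
import subprocess
import time

def timed_restore(restore_cmd):
    """Run the restore command, fail loudly if it errors, and return
    the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(restore_cmd, check=True)
    return time.monotonic() - start

# Hypothetical usage – replace with your actual restore command:
# elapsed = timed_restore(["pg_restore", "-d", "mydb", "latest.dump"])
```

Track the number across quarters; if restore time starts creeping up, you’ll find out during a drill rather than during a real outage.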
There is much to be learned and gained from this intrusive and destructive process. For one, you will force your dev team to create a good “the site is down” experience since your customers will otherwise see infinitely spinning web pages or terrible error dumps. Another is that you can time the average outage and thus discern how long you’ll be down if your production database ever actually takes a spill. Finally, your disaster recovery staff will be fresh on their skills and able to fix your real outages quickly and predictably. There are many tangible and hidden benefits derived from just a few hours of planned outage per year.
GitLab Did One Thing Right
The final step in your solid, functional backup process – the one you test quarterly and use to spin up new dev hires – is to document the hell out of everything. When you run these planned outages, have the disaster recovery staff document, step by step, the actions taken to fix things. When you have real live outages, document those too and share the knowledge with the public.
GitLab got this part right, and is being heralded as a great example and learning experience for the industry instead of being resented for mysterious downtime and silence. I promise you that this week, many disaster recovery people are running extra backup tests they wouldn’t have thought to run otherwise – all as a direct result of the GitLab incident. Making your disasters and their recoveries public creates goodwill in the community, provides a learning experience, and shows people that you can be trusted.
GitLab took a bad situation and created the best possible outcome, both for themselves and the entire community. For that they should be thanked, not mocked. After all, we are all human and we all make mistakes. Knowing this, you’ll be really glad that you practice making mistakes every quarter when your production database actually goes down in flames.
I was working on my fireplace this past weekend. Specifically, I had just finished ripping the old surface down to the red brick and preparing the brick with a layer of thinset for tiling. I spent all of Saturday cutting tiles and placing them on the fireplace surround and hearth. Even with help it took 11 hours, and about 8 of those were spent measuring and cutting tiles.
While I was doing this work – which is just mindless enough that your mind wanders, but requires just enough attention that it doesn’t wander freely – I began to recite a common trades mantra: measure twice, cut once.
This practical saying saturates the construction industry. Whether you’re a DIYer like me or a professional tradesperson, it’s important to measure everything twice and do the work once. It saves you a lot of pain and time down the road, since you can double-check your angles and distances and get everything right the first time.
The reason this practice matters is as simple as a single tile. Say I need a tile 3/4″ wide, but I measure incorrectly and cut it to 1/2″. There’s no way to turn that 1/2″ piece back into a 3/4″ piece, so I’ve wasted the tile: I have to toss it out (if it can’t be used elsewhere) and cut a new one to the correct measurement. In short, measuring twice saves you time and money.
As I stood over my trusty wet saw, cutting tile after tile after tile, my mind began to wander into the realm of programming, and I realized something interesting. In my opinion, many IT departments have a policy of measuring twice and cutting once, with the supposed benefit of cost and time savings. One might even call this sort of approach waterfall or agile, where estimates are gathered in detail (measured) long before the work is done (cut).
I believe that this is a fallacy that ironically leads to even more work. Not a single developer I’ve ever met in my career, myself included, can accurately estimate anything. We sometimes get close, because we can relate the task at hand to a similar task we accomplished previously, but in general I find that a new task is very much an unknown, and the time spent gathering an estimate is pointless since it’s wrong anyway. By measuring twice and cutting once, we waste a ton of time.
I believe that developers should measure once, quickly, for a rough estimate, and then cut. I believe this because of a fundamental difference between programming and the other kinds of work that are managed with processes and estimates.
Code is not a tile or piece of wood. It is a highly flexible, malleable, mutable, digital thing. If a developer cuts a feature short, they can add on to it later, expanding it seamlessly to the required size. If they overestimate a feature’s length, they can easily chop off the excess and move on to the next feature. There is no significant cost in quick, roughly estimated measurements for programming work.
Immediately your team will regain a ton of time in which they can do their development work. They won’t have to attend hours of planning meetings or requirements gathering sessions. They will just work to get things done as fast and accurately as they can.
The only tradeoff is the lack of estimates that management types can cite and depend on. I would counter that the estimates derived are very commonly wrong and useless regardless. More to the point, if you do not trust your developers to do the right thing and use their time effectively, why do you keep them employed?
To me, a lot of the popular process models around development (waterfall, agile) are derived from the measure-twice-cut-once methodology. That approach is eminently practical for physical goods, since inaccurate measurements are expensive, but it does not apply to development work. The meetings held to gather estimates in the hope of controlling costs ironically bloat budgets, deliver less code, and push out goal dates and deadlines. You take people who were hired to code and tie them up in meetings where they have to justify, hour by hour, what they’re going to code. They don’t know how long it will take – but they’d have a much better idea after a few hours of coding, if you’d just give them a few meeting-free hours to code.
If you’re working on tiling your fireplace, measure twice and cut once. If you’re working on code, take a rough guess at the measurement and get to work!