Software is Difficult, Very Difficult

We all witnessed or heard about Facebook going offline for 5 hours, which was caused by human error. Human error is the typical cause of outages like this, and it is something all software companies encounter. Even the best tools and the best audit processes cannot rule out significant failures.

We witnessed this ourselves last week when a third-party vendor pushed a change that effectively broke our software integration with them for about thirty minutes. Our team scrambled to figure out what had happened and created a patch to work around the vendor’s mistake, as we were not sure how long it would take them to fix it.

Turns out we were about 15 minutes faster than they were at rolling back their mistake.

Preparedness is paramount when humans continue to make mistakes. It is how you handle them that separates companies from each other.

We treat any major issue as a fire.

We have a fire team that is triggered with a simple Slack command of ‘/fire’ and we have a playbook to follow.

This is what keeps everything consistent, from our customer-facing messaging to our internal messaging and the assembly of the teams.

From a customer perspective, the communication is almost as crucial as the fix. As a customer, nothing is worse than having something broken on your website that you can’t fix yourself, especially when you have no line of sight to the time to resolution. That is why we update our status page every 15 minutes during a fire – information and transparency are vital.

The actual troubleshooting and fixing is only half of the process. While our customers just want things back to normal as fast as possible, what happens after the fire is more important.

We review everything: how we were alerted to the problem, what the fix was, and how we can prevent it from happening again. We talk it all through and make plans for any changes we need to make.

We never lay blame for a human error; that accomplishes nothing except creating a fear of retribution. Instead, we focus only on prevention and mitigation for the future. It is part of our Fail Fast and Publicly value.

You can think of the process as continuous improvement: the goal is never to repeat the same mistake or have the same issue twice.

This is just one short synopsis of how we react. The engineering team has started sharing more information via their newly launched tech blog. If you are technically inclined or interested you can find the new blog here.