Joshua Paling

Breaking Prod

When production apps break, stakeholders immediately ask: “What can we do to ensure this never happens again?”

The relationship between development speed and bugs looks something like this:

[Figure: Bugs vs speed]

A rushed pace and poor testing mean many bugs. Slow down (i.e., be more thorough), and at first you’ll drastically reduce bugs. But slow down more, and the gains begin to bottom out. You can never reach zero.

Breaking prod less means using time, money, and bureaucracy to slide right on an axis of diminishing returns.
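To make the diminishing returns concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that bugs decay exponentially toward a nonzero floor as thoroughness increases. The function name `expected_bugs` and every number in it are hypothetical, not measurements.

```python
import math


def expected_bugs(thoroughness, rushed_bugs=100.0, floor=2.0, decay=1.0):
    """Hypothetical model: bug count decays toward a floor it never goes below.

    thoroughness: how far right you have slid (0 = fully rushed).
    rushed_bugs:  assumed bug count at thoroughness 0.
    floor:        assumed bug count you can approach but never beat.
    decay:        assumed rate at which extra thoroughness pays off.
    """
    return floor + (rushed_bugs - floor) * math.exp(-decay * thoroughness)


if __name__ == "__main__":
    # Each extra unit of thoroughness buys a smaller reduction than the last.
    for step in range(6):
        saved = expected_bugs(step) - expected_bugs(step + 1)
        print(f"thoroughness {step} -> {step + 1}: {saved:.1f} fewer bugs")
```

Under these assumptions, each step to the right costs the same but removes fewer bugs than the step before it, and the count never drops below the floor.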

Finding the sweet spot requires ongoing collaboration between business and tech. Since sliding RIGHT is expensive, you want to be as far LEFT as your appetite for risk allows.

Writing pacemaker firmware? Slide very far right to ensure no one dies.

Selling shoes online? You have a higher tolerance for risk. Slide left and take advantage of it.

Coming back to the original question, the answer is “we can slide as far right as you want, but it’s not free… how much do you want to pay for it?”


Update: What matters is looking for opportunities to efficiently decrease the risk of bugs. People's knee-jerk reaction to "prod is broken!" is to ramp up bureaucracy with generic solutions like:

These make you feel like you've "done something about it," but you haven't. They slow the team down and don't really reduce risk. Instead, look for specific, low-bureaucracy process changes that target high-risk scenarios. Example: