Root Causes
Searching for a root cause is often misguided, because failures in complex systems rarely have a single cause. This article discusses some of the problems with the 'single root cause' idea:
Again, this tendency to look for a single root cause for fundamentally surprising (and usually negative) events like outages is ubiquitous, and hard to shake. When we’re stressed for technical, cultural, or even organizationally political reasons, we can feel pressure to get to resolution on an outage quickly. And when there’s pressure to understand and resolve a (perceived) negative event quickly, we reach for oversimplification. Some typical reasons for this are:
- Management wants an answer to why it happened quickly, and they might even look for a reason to punish someone for it. When there’s a single root cause, it’s straightforward to pin it on “the guy who wasn’t paying attention” or “is incompetent”.
- The engineers involved with designing/building/operating/maintaining the infrastructure touching the outage are uncomfortable with the topic of failure or mistakes, so the reaction is to get the investigation over with. This encourages oversimplification of the causes and remediation.
- The failure is just too damn complex to keep in one’s head. Hindsight bias encourages counter-factual thinking (“…if only we paid attention, we could have seen this coming!” or “…we should have known better!”) which pushes us into thinking the cause is simple.