
When you’re building and operating in high-volume, chaotic environments, you need to plan for failure. The Stoics were known to practice this as premeditatio malorum, a premeditation of evils: the deliberate consideration of what will go wrong. Modern systems science uses the term failure modes: the specific ways in which a system breaks down. The more you can map out your failure modes in advance, the better you can design your system to get past, over, or around them. Of course, not all failures are equal, and not all failures point in the same direction. In this post, we’ll discuss three axes for differentiating between failures and how each can inform design choices for complex systems operating in high-performance environments.
Failing to vs. failing away
Fail-to systems fall back on something outside their normal operating model: they have defined a secondary point of stability. These secondary points are not ideal, but they are sufficient to keep activity going at a reduced level, and failing to them increases the chance of achieving the goal even when the first approach fails. Human-AI teams can use this “fail-to” approach. When operating at full capacity, these teams balance complex inputs, execute complex operations, and distribute decision authority between human and artificial elements. Human-only decision-making is a natural secondary point of stability, so a well-designed human-AI team can be set up to fail to human control, with the AI component failing first and the human elements continuing to move towards the mission.
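The handoff described above, where the AI component fails first and the human keeps moving toward the mission, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference design: `decide`, `flaky_ai`, and `human` are hypothetical names invented here, and a real team would need a far richer definition of what counts as the AI “failing.”

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    action: str
    source: str  # "ai" or "human"


def decide(
    situation: str,
    ai_policy: Callable[[str], Optional[str]],
    human_policy: Callable[[str], str],
) -> Decision:
    """Try the AI component first; fail to human-only decision-making."""
    try:
        action = ai_policy(situation)
    except Exception:
        action = None  # any AI error counts as failure of the primary mode
    if action is not None:
        return Decision(action, "ai")
    # Secondary point of stability: the human keeps the mission moving.
    return Decision(human_policy(situation), "human")


def flaky_ai(situation: str) -> Optional[str]:
    # Hypothetical AI component that only handles familiar situations.
    if situation == "routine":
        return "proceed"
    raise RuntimeError("unfamiliar situation")


def human(situation: str) -> str:
    # Hypothetical human operator with a conservative default.
    return "hold and reassess"


print(decide("routine", flaky_ai, human))
print(decide("novel", flaky_ai, human))
```

The point of the sketch is that the fallback is chosen in advance: the caller never has to improvise where control lands when the primary mode breaks.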
Fail-away systems recognize a specific threat and are designed to move directly away from it. As a very simple model, imagine walking along a cliff. If you’re going to fall, it’s definitely better to fall away from the edge, so most people instinctively lean inward a bit when walking. Airplanes that are about to land also use the fail-away approach. The main risk is uncontrolled contact with the ground, so when the landing sequence is not going properly, the preferred reaction is often to go around: fly over the runway, regain altitude, and return. The crew isn’t activating a specific backup plan at that moment, just avoiding catastrophic failure. The more you understand your system’s secondary points of stability and its areas of concentrated risk, the more deliberately you can plan where it will fail.
Fail early and fail late
Early failure allows systems to abandon a course of action before irreversible consequences are reached. Because the system theoretically still has the ability to continue, early failure can feel like giving up prematurely, so care must be taken to proactively identify and act on early signals that a course of action is inevitably and irreversibly leading to failure. A classic example comes from the startup world. Startups have limited resources and may lack the structural stability of more established companies, so they are regularly advised to run cost-effective experiments, test ideas quickly, abandon what doesn’t work as soon as possible, and keep scarce resources for what does. In the book Great by Choice, Jim Collins called this firing bullets, then cannonballs: using low-cost failures (“bullets”) to triangulate the right direction before committing serious resources (“cannonballs”).
Late-failure systems, in contrast, run existing plans to the end of their useful life and extend operations as long as possible. These systems typically have scarce resources that are already committed to a fixed path: in Collins’s terms, the cannonballs have already been fired. This is how oxygen tanks work in hospitals. We want them to deliver as much oxygen as possible to the patients who need it. Importantly, failing late never means failing without warning: oxygen tanks have gauges and alarms that indicate depletion before the tank is empty, allowing teams to get maximum value from the stock while planning ahead.
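A fail-late resource with an early warning can be modeled as a simple threshold check. The sketch below is illustrative only: `tank_status` is a hypothetical function, and the 500 psi alarm level is an assumed placeholder for the example, not clinical or engineering guidance.

```python
def tank_status(pressure_psi: float, alarm_psi: float = 500.0) -> str:
    """Classify a tank reading (alarm threshold is an assumed placeholder)."""
    if pressure_psi <= 0:
        return "empty"
    if pressure_psi <= alarm_psi:
        # Still delivering, but the team is told to line up a replacement.
        return "alarm: plan replacement"
    return "in service"


for reading in (2000.0, 450.0, 0.0):
    print(reading, tank_status(reading))
```

The tank stays in use in the alarm state. The warning exists so that running the resource to the end of its useful life never becomes a surprise.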
Partial failure and total failure
Partial failure is useful when the failed system retains significant operational capability, even if at reduced or different capacity. In the emergency department, many video laryngoscopes have partial failure modes. Laryngoscopes are instruments for placing a breathing tube in the patient’s airway, with a curved blade and (for video models) a small camera near the tip. By mimicking the design of the classic non-video scope, many manufacturers of video laryngoscopes have built in a partial-failure capability: if the video fails, the laryngoscope functions as a standard non-video model and allows the procedure to continue.
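In software terms this is graceful degradation: one capability is lost while the core function stays available. The toy model below is a sketch under that assumption. The `Laryngoscope` class and `ViewMode` enum are hypothetical names made up for illustration and do not describe any real device’s interface.

```python
from enum import Enum


class ViewMode(Enum):
    VIDEO = "video"
    DIRECT = "direct"  # line-of-sight view, as with a non-video scope


class Laryngoscope:
    """Toy model: the blade keeps working even when the camera does not."""

    def __init__(self) -> None:
        self.camera_ok = True

    def view_mode(self) -> ViewMode:
        # Partial failure: losing the camera downgrades the view, not the tool.
        return ViewMode.VIDEO if self.camera_ok else ViewMode.DIRECT

    def can_continue_procedure(self) -> bool:
        # The blade is mechanical, so in this model it is always available.
        return True


scope = Laryngoscope()
print(scope.view_mode().value)
scope.camera_ok = False  # simulate a video failure mid-procedure
print(scope.view_mode().value, scope.can_continue_procedure())
```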
Total failure is useful when the failed system appears to be operational but in fact is not, or when continued operation poses ongoing risks or catastrophic consequences. Imagine a bridge that was severely damaged in an earthquake but is still standing. It looks like it can still handle traffic, and maybe a few cars actually cross, but the city closes it entirely because the risk of collapse is too high to justify keeping it open. In this case, closing the bridge completely to all traffic prevents further harm: injured people and destroyed property.
For all of these failure modes, the key is to build a shared mental model of how and why the system is designed to fail in a particular direction. A team that hasn’t agreed, with one member pushing to abandon the course and another pushing to finish it, will hit serious friction at the worst possible time. Talking this through before a crisis and updating the plan as circumstances change is not pessimism. It is preparation. This is especially true for human-AI teams, where the human and artificial elements generally do not fail in the same ways, so agreeing on the direction of failure is not only useful but essential.




