Chaos Engineering and Why It Matters

If you’ve ever gone through the pain and anxiety of responding to an unexpected failure in your production system, then intentionally breaking things in production is probably nowhere on your current “to do” list. However, chaos engineering does exactly that: it intentionally breaks parts of a production system to test its resilience. The intended outcomes of such experiments are not at all the same as the unplanned outage you may have experienced. The expectation is that you are testing in such a way that, if a failure does occur, its actual impact is constrained to an acceptable level.

Where did Chaos Engineering Come From?

The idea of fault injection, the practice of testing both hardware and software dependability by intentionally introducing errors into the system, dates back to the 1970s. In its current form, the concept of chaos engineering was conceived and implemented at Netflix in response to a major outage in 2008 of their then-monolithic on-premises system. The resolution Netflix chose was migration to an Amazon Web Services cloud system, which eliminated the single point of failure that their on-prem system represented, but introduced the cloud’s inherent complexity and uncertainty. Because cloud systems operate with little or no guarantee of hardware reliability, fault tolerance is a fundamental operational concept. Netflix created the first chaos engineering tool, Chaos Monkey, to validate that the unexpected loss of virtual machines and containers would not impact the service they provided to their customers.

How does Chaos Engineering Work?

One thing to realize early on about chaos engineering is that there is very little that is chaotic about the approach. These experiments in production are intended to test the resilience of the system against failure. This starts with the assumption that the infrastructure you are testing already has processes in place to provide resilience in the face of failure. The experiments that chaos engineers run on systems are well thought out, generally test a specific perturbation, and are run with an awareness of the possible consequences and with preparation to mitigate them rapidly. This means that an experiment will have a clearly defined scope, or “blast radius,” that needs to be understood and acceptable if an actual failure is induced.
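The shape of such an experiment can be sketched in a few lines of Python. This is a minimal illustration, not the API of Chaos Monkey or any other tool: the Experiment class, the run_experiment function, and the simulated fleet of five instances named web-1 through web-5 are all hypothetical. The sketch shows the three ideas from the paragraph above: an explicit list of targets (the blast radius), a health check that must hold before and during the experiment, and a rollback that always runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical description of one chaos experiment."""
    name: str
    targets: List[str]                 # blast radius: only these are touched
    inject: Callable[[str], None]      # introduces the fault on one target
    rollback: Callable[[str], None]    # undoes the fault on one target
    steady_state: Callable[[], bool]   # True while the system looks healthy


def run_experiment(exp: Experiment) -> str:
    # Never start an experiment against a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped"
    injected: List[str] = []
    try:
        for target in exp.targets:
            exp.inject(target)
            injected.append(target)
            # Stop as soon as the health check no longer holds.
            if not exp.steady_state():
                return "aborted"
        return "passed"
    finally:
        # Roll back every injected fault, whatever the outcome.
        for target in reversed(injected):
            exp.rollback(target)


# Simulated fleet (an assumption for illustration): five healthy instances,
# and the service counts as healthy while at least three remain.
healthy = {"web-1", "web-2", "web-3", "web-4", "web-5"}

experiment = Experiment(
    name="terminate-one-instance",
    targets=["web-1"],
    inject=healthy.discard,
    rollback=healthy.add,
    steady_state=lambda: len(healthy) >= 3,
)

result = run_experiment(experiment)
print(experiment.name, result)
```

In a real system the inject and rollback callables would talk to your infrastructure, and the steady-state check would query your monitoring. The point of the structure is that the scope and the abort condition are written down before anything is broken.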

Chaos engineering fits firmly into the SRE (site reliability engineering) subdiscipline of DevOps, with its associated focus on keeping a system within agreed-upon service level objectives (SLOs). This means that a chaos experiment should never cause an SLO to be breached. Doing this requires not only a clear understanding of the range of possible negative effects but also a multidisciplinary approach in which the people able to mitigate those effects are on hand during the experiment. Testing Kubernetes service or network resilience when the engineers with the knowledge to fix those systems are unavailable would not be a best practice.
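One way to make the SLO constraint concrete is to check, before an experiment starts, whether its worst-case impact still fits within what the SLO allows. The sketch below assumes an availability SLO expressed as a fraction of successful requests and uses the common SRE notion of an error budget, which this article does not itself discuss. The function names and the numbers in the usage lines are hypothetical.

```python
def remaining_error_budget(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Failures still allowed in this window before the SLO is breached.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    return allowed_failures - failed_requests


def safe_to_run(slo_target: float,
                total_requests: int,
                failed_requests: int,
                worst_case_failures: int) -> bool:
    """True if the experiment's worst case still fits inside the budget."""
    budget = remaining_error_budget(slo_target, total_requests, failed_requests)
    return worst_case_failures <= budget


# Hypothetical window: 1,000,000 requests, 400 already failed, 99.9% SLO.
# Roughly 1,000 failures are allowed, so about 600 remain.
small_experiment_ok = safe_to_run(0.999, 1_000_000, 400, worst_case_failures=500)
large_experiment_ok = safe_to_run(0.999, 1_000_000, 400, worst_case_failures=700)
print(small_experiment_ok, large_experiment_ok)
```

A check like this does not replace having the right engineers on hand. It only gives a quick, numeric reason to postpone an experiment whose blast radius is too large for the current window.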

Much of the literature on chaos engineering focuses on applying its tools and experiments to production systems. While this certainly provides the greatest value in understanding your actual system’s ability to withstand disruption, there can also be value in running chaos experiments in development environments. Note that the data you get out of such experiments will only be as useful as your dev environment, and the load profiles you test it under, are realistic. Still, the information gleaned from such experiments can identify potential performance issues before you release your code into the wild.

Chaos Engineering is Not Just a Pass or Fail Exercise

The goal of chaos engineering is not simply to validate whether your production system is resilient. An ideal experiment will provide information about how to improve the system regardless of the outcome. If something truly did break, there is an obvious task: fix the issue and retest for resilience. However, even if the system performed as expected, comparison against a baseline may reveal some level of degradation in the function being tested that deserves further investigation or proactive mitigation. In other cases, unexpected behavior might be revealed that, while not necessarily impacting your SLO, may warrant further exploration.
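The difference between a pass, a fail, and a pass that still deserves a closer look can be illustrated with a small comparison against baseline. This is a sketch under assumptions: the metric (95th percentile latency in milliseconds), the 10% tolerance, the 200 ms SLO, and the classify function are all chosen for illustration and are not prescribed by any particular tool.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling division
    return ordered[max(rank, 1) - 1]


def classify(baseline: Sequence[float],
             during_experiment: Sequence[float],
             slo_ms: float,
             tolerance: float = 0.10) -> str:
    """Classify an experiment by its p95 latency.

    "failed":   the SLO itself was breached.
    "degraded": within the SLO, but noticeably worse than baseline.
    "passed":   within the SLO and within tolerance of baseline.
    """
    base_p95 = percentile(baseline, 95)
    exp_p95 = percentile(during_experiment, 95)
    if exp_p95 > slo_ms:
        return "failed"
    if exp_p95 > base_p95 * (1.0 + tolerance):
        return "degraded"
    return "passed"


# Hypothetical latency samples in milliseconds.
baseline = [100.0] * 20
outcome_ok = classify(baseline, [105.0] * 20, slo_ms=200.0)
outcome_slow = classify(baseline, [130.0] * 20, slo_ms=200.0)
outcome_broken = classify(baseline, [250.0] * 20, slo_ms=200.0)
print(outcome_ok, outcome_slow, outcome_broken)
```

The middle outcome is the interesting one: the SLO held, yet latency rose by roughly a third, which is exactly the kind of result that warrants further investigation.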

Conclusion

In the end, chaos engineering is less about chaos than about understanding the edges of where your control starts to degrade and intentionally expanding those boundaries. It is a way of more deeply understanding your production system’s behavior and validating your confidence in the systems you have in place to guard against the uncertainty inherent in distributed systems. Done thoughtfully, experiments in production don’t need to be scary; rather, they can assure you that your system is ready and resilient in the face of the types of disturbances that the real world provides.