It’s helpful to think of a vaccine or a flu shot. It seems counterintuitive, but you inject yourself with something harmful in order to prevent a future issue. The same technique applies to complex systems in the cloud. Chaos engineering means carefully injecting this kind of harm into our systems to test their ability to respond to it. It is an effective way to practice, prepare, and prevent or minimize downtime and outages before they occur.
It’s a misconception to think of chaos engineering as chaotic. In reality, chaos engineering involves thoughtful, planned experiments that teach us how our systems behave in the face of failure. We want to apply engineering discipline to the approach, hence ‘thoughtful.’ In these experiments we start with a hypothesis about how things will behave, measure the impact in order to improve the system, and decide up front when it’s time to abort an experiment in order to prevent any impact to the user.
An essential principle of chaos engineering is minimizing the blast radius, that is, the potential impact of the experiment you’re running. Instead of testing on every host in our production environment, we can start with a single host in our test environment. This allows us to grow our confidence in the system, and our understanding of it, in lockstep with the scale and potential risk of the experiments.
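To make the idea concrete, here is a minimal sketch of such an experiment in Python. It is an illustration only, not the API of any real chaos engineering tool: `run_experiment`, `StageResult`, and the callbacks `inject_failure`, `rollback`, and `measure_error_rate` are hypothetical names, and the stage sizes and abort threshold are assumed values. The hypothesis is that the error rate stays at or below the threshold while hosts are failing; the blast radius starts at a single host and only grows while that hypothesis holds.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class StageResult:
    hosts_targeted: int   # blast radius of this stage
    error_rate: float     # impact measured while the failure was active
    aborted: bool         # True if the abort condition was hit


def run_experiment(
    hosts: List[str],
    inject_failure: Callable[[List[str]], None],   # hypothetical hook
    rollback: Callable[[List[str]], None],         # hypothetical hook
    measure_error_rate: Callable[[], float],       # hypothetical hook
    abort_threshold: float,
    stages: Sequence[int] = (1, 2, 4),             # assumed stage sizes
) -> List[StageResult]:
    """Widen the blast radius stage by stage, stopping at the first abort."""
    results: List[StageResult] = []
    for size in stages:
        targets = hosts[:size]
        inject_failure(targets)
        try:
            error_rate = measure_error_rate()
        finally:
            rollback(targets)  # always undo the injected failure
        aborted = error_rate > abort_threshold
        results.append(StageResult(len(targets), error_rate, aborted))
        if aborted:
            break  # hypothesis disproved: do not widen the blast radius
    return results


if __name__ == "__main__":
    # Toy stand-in for a test environment: the service is assumed to
    # tolerate two failed hosts before users start seeing errors.
    fleet = [f"test-host-{i}" for i in range(8)]
    down = set()

    def measure() -> float:
        return max(0, len(down) - 2) / len(fleet)

    for stage in run_experiment(
        fleet, down.update, down.difference_update, measure, abort_threshold=0.05
    ):
        print(stage)
```

In this toy setup the first two stages pass and the third is aborted, which is the point of growing the blast radius gradually: the weakness shows up in a small, controlled test rather than across the whole environment.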
This question originally appeared on Quora.