Despite our best efforts to avoid them, IT incidents are an inevitable part of the job – and staying ahead of business-impacting downtime is only getting trickier. Systems today are tightly coupled and increasingly complex, and with more moving parts come more opportunities for things to go wrong.

This is one reason why more and more organizations are turning to microservices for increased service availability and better resilience to failure. But while these are good reasons for breaking up monolithic applications, microservices can also compound the risk of failure – unless they are designed expressly with resilience in mind.

Preparing for Failure

Given the inherently chaotic nature of distributed systems, services should be developed not only to anticipate failure, but also to recover automatically when it happens. This means inducing failures on a regular basis to ensure your systems can handle chaos without disrupting service to end customers. To achieve this, you need the ability to simulate production-like traffic in test environments.

Of course, it’s a good idea to test resilience before changes make it to production. If you don’t, you won’t be able to verify that your services can support both average and peak loads. In fact, the safest bet is to ensure your product can handle twice the peak load without having to scale up.
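The "twice the peak" rule of thumb can be written down as a simple check. The following is a minimal sketch in Python; the function name and the request rates are hypothetical and only illustrate the arithmetic.

```python
def has_headroom(capacity_rps: float, peak_rps: float, factor: float = 2.0) -> bool:
    """Return True if the measured capacity covers `factor` times the peak load."""
    if capacity_rps < 0 or peak_rps < 0:
        raise ValueError("rates must be non-negative")
    return capacity_rps >= factor * peak_rps


# Hypothetical numbers: a load test sustained 900 requests/s without scaling up,
# and production traffic peaks at 400 requests/s.
print(has_headroom(capacity_rps=900, peak_rps=400))  # 900 >= 2 * 400
print(has_headroom(capacity_rps=700, peak_rps=400))  # 700 <  2 * 400
```

The capacity figure should come from a load test against the unscaled deployment, since the point is to have headroom before any scale-up kicks in.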

When it comes to resilience testing, the right tools should be less concerned with how requests are handled than with whether they have the correct effect in the end. Remember that under certain conditions, the input service can fail to hand off a request to the rest of the system without reporting the failure. Keep such issues from flying under the radar of monitoring by making sure that end-to-end validation is, in fact, occurring. (For more, see Tech Failures: Can We Live With Them?)
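One way to picture end-to-end validation is to tag every test request with a unique ID and then check that the ID actually shows up at the far end of the system. The sketch below simulates this in memory; the FlakyIntake class, the dictionary standing in for the downstream system, and the drop pattern are all assumptions made for illustration, not a real service.

```python
import uuid


class FlakyIntake:
    """Hypothetical input service that always acknowledges, even when handoff fails."""

    def __init__(self, downstream: dict, drop_every: int = 0):
        self.downstream = downstream  # stands in for the rest of the system
        self.drop_every = drop_every  # silently drop every Nth request (0 = never)
        self._count = 0

    def submit(self, request_id: str, payload: str) -> int:
        self._count += 1
        dropped = self.drop_every and self._count % self.drop_every == 0
        if not dropped:
            self.downstream[request_id] = payload
        return 202  # reports success either way


def find_lost_requests(intake: FlakyIntake, downstream: dict, n: int) -> list:
    """Send n tagged requests and return the IDs that never reached downstream."""
    sent = []
    for i in range(n):
        request_id = str(uuid.uuid4())
        status = intake.submit(request_id, f"payload-{i}")
        assert status == 202  # the intake-level check passes every time
        sent.append(request_id)
    return [rid for rid in sent if rid not in downstream]


store = {}
lost = find_lost_requests(FlakyIntake(store, drop_every=5), store, n=20)
print(f"{len(lost)} of 20 acknowledged requests never arrived")
```

A check that only looks at the intake's response code would report 20 successes here; only the comparison against the downstream store reveals the dropped requests.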

The Next Steps

After understanding how services behave under load, it’s time to start introducing the failure events. As with all software testing, it’s best to have automated tools that allow you to easily and rapidly reproduce scenarios, so that you can coordinate complex events that impact different infrastructure technologies. And beyond the ability to verify fixes and changes to the services, this allows you to run random failure scenarios in any environment and on a schedule.
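A reproducible random scenario runner might look something like the sketch below. The Scenario structure and the inject, revert and observe callbacks are hypothetical placeholders for whatever tooling actually stops services or blocks traffic; the seed is what makes a "random" run repeatable when a fix needs to be verified later.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    name: str
    inject: Callable[[], None]  # starts the failure
    revert: Callable[[], None]  # restores normal operation


def run_random_scenario(scenarios: List[Scenario], seed: int,
                        observe: Callable[[], None]) -> str:
    """Pick one scenario reproducibly from the seed, inject it, observe, then always revert."""
    chosen = random.Random(seed).choice(scenarios)
    chosen.inject()
    try:
        observe()
    finally:
        chosen.revert()  # runs even if the observation step raises
    return chosen.name


# Placeholder actions that only record what would have happened.
log = []
scenarios = [
    Scenario("stop-database", lambda: log.append("db down"), lambda: log.append("db up")),
    Scenario("block-network", lambda: log.append("net blocked"), lambda: log.append("net restored")),
]
first = run_random_scenario(scenarios, seed=7, observe=lambda: log.append("checks ran"))
again = run_random_scenario(scenarios, seed=7, observe=lambda: log.append("checks ran"))
print(first == again)  # the same seed selects the same scenario
```

Running this from a scheduler with a fresh seed each time, and recording the seed alongside the results, gives both the randomness and the ability to replay a run.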

Meaningful failure events depend largely on the layout of your services, and you can formulate them by asking specific questions that are relevant to you. For instance, what is the impact for people using the front-end when a database becomes unreachable for a certain period of time? Can those users still navigate the web UI? Can they still issue updates to their information, and will those updates be processed correctly when the database becomes reachable again?

If you run multiple microservices, you can ask whether there will be a global outage if any individual service crashes. Or if you have a queuing mechanism to buffer the communication between services, what happens when the consumer service (or services) stops working? Will users still be able to work with your application? And given an average load, how long do you have before the queues overflow and you start losing messages?
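The queue question comes down to simple arithmetic: the remaining space in the queue divided by the net rate at which it fills. Here is a rough sketch, assuming a bounded queue and steady rates; the function name and all the numbers are made up for illustration.

```python
import math


def seconds_until_overflow(capacity: int, backlog: int,
                           arrival_rate: float, drain_rate: float = 0.0) -> float:
    """Time until a bounded queue fills, given steady arrival and drain rates (messages/s)."""
    if not 0 <= backlog <= capacity:
        raise ValueError("backlog must be between 0 and capacity")
    net_rate = arrival_rate - drain_rate
    if net_rate <= 0:
        return math.inf  # the queue never fills at these rates
    return (capacity - backlog) / net_rate


# Hypothetical queue: room for 100,000 messages, 10,000 already waiting,
# producers sending 50 messages/s, and every consumer stopped (drain rate 0).
minutes = seconds_until_overflow(100_000, 10_000, arrival_rate=50.0) / 60
print(minutes)  # 90,000 messages / 50 per second = 1,800 s = 30.0 minutes
```

In this example the team would have half an hour to notice the stopped consumers and recover them before messages start being lost, which is a useful number to compare against alerting and response times.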

Once you’ve defined a few key questions about your infrastructure, you can then start listing different ways to simulate those failures. It might be enough to stop a particular service or a database server. You might want to block the main thread of a service to simulate a deadlock while its container is still responsive and running. You might decide to introduce rules in your network to block traffic between specific services. On Linux environments, you can use tools like ‘tc’ to emulate network conditions such as high latency or dropped, corrupted or duplicated packets. (It can be important to involve users in testing. Read more in 4 Reasons Why End Users Need to Participate in Testing Before UAT.)
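As an illustration of the ‘tc’ approach, the sketch below builds command lines that use the netem queueing discipline for each of the network conditions mentioned above. It only prints the commands rather than running them; the interface name eth0 and the delay and percentage values are assumptions, and actually applying the commands requires root privileges on the target host.

```python
import shlex

# netem options for each emulated condition; the values are illustrative only.
NETEM_FAULTS = {
    "latency": ["delay", "200ms"],
    "drop": ["loss", "5%"],
    "corrupt": ["corrupt", "1%"],
    "duplicate": ["duplicate", "1%"],
}


def netem_add_command(interface: str, fault: str) -> list:
    """Build the argv for attaching a netem qdisc that emulates the given fault."""
    if fault not in NETEM_FAULTS:
        raise ValueError(f"unknown fault: {fault}")
    return ["tc", "qdisc", "add", "dev", interface, "root", "netem", *NETEM_FAULTS[fault]]


def netem_clear_command(interface: str) -> list:
    """Build the argv for removing the root qdisc, which ends the emulation."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


# Print the commands instead of executing them, so they can be reviewed first.
for fault in NETEM_FAULTS:
    print(shlex.join(netem_add_command("eth0", fault)))
print(shlex.join(netem_clear_command("eth0")))
```

Pairing every "add" with the matching "del" matters in practice, because a forgotten netem rule keeps degrading traffic long after the test has finished.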

Learning and Improving Through Drills

One of the most valuable aspects of creating failure scenarios is that they can expose many of the ways the system can fail, thereby paving the way for self-healing logic. Your team will go through the steps to recover services manually – a great drill, by the way, for confirming they are able to do this within SLAs. Automating this recovery process can follow, but in the meantime, you can rest easy knowing your team has walked through the process of getting services back on track. By making failure scenarios random and regular and not disclosing the full details of the run, you can also add discovery and diagnosis to the drill – which is, after all, a critical part of meeting SLAs.

At its core, chaos engineering takes the complexity of the system as a given, tests it by simulating new and wacky conditions, and observes how the system responds. This is the data that engineering teams need in order to redesign and reconfigure the system for higher resilience. There are so many opportunities to learn new and useful things. For instance, you may find cases where services aren’t getting updates when downstream services have changed, or areas where monitoring is missing completely. There’s no shortage of exciting ways to make your product more resilient and robust!