Chaos Engineering: why is it important? Steps to perform it in your company and a useful tool set.
For most people chaos is a state of complete confusion and lack of order, and this definition exists in dictionaries all around the world. So, does Chaos Engineering stand for a set of unexpected engineering tasks carried out without any order? There is only one answer to that question – No.
In our technical world, where different businesses work and cooperate closely with each other, we could never do a thing like that. For me, Chaos Engineering means proactively triggering and analyzing any point of failure or undesirable action in order to learn how a service behaves. It leads to a more stable product, not through root cause analysis made after incidents, but through experiments run in controlled conditions with a reduced blast radius.
It’s not hard to imagine that many important systems sit behind our everyday routines. Let’s use a Seats Service as a first example. It decides which airplane seats are free, and which fares and passengers are assigned to those and to the remaining seats.
So, what may happen if this service freezes for a while?
Will the passengers manage to buy their tickets at all? Will they get assigned to the wrong seats on the plane? Will the service assign air fares to some percent of free seats?
Any of the above could happen, but does not have to. If you do not perform any chaos tests, you won’t be able to determine the real impact of system or service unavailability on your business for quite a while, not to mention the real costs it entails. Incidents can generate colossal revenue losses for companies, though not necessarily at the exact moment they happen. This is due to a long or unmeasured MTTR (Mean Time To Repair).
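To make MTTR concrete, here is a minimal sketch of how it could be computed from an incident log, together with a rough downtime cost. The incident timestamps, the `revenue_per_minute` figure, and both function names are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, resolved_at). Illustrative values only.
INCIDENTS = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),
    (datetime(2024, 1, 17, 22, 30), datetime(2024, 1, 18, 0, 30)),
    (datetime(2024, 2, 2, 8, 15), datetime(2024, 2, 2, 8, 30)),
]


def mean_time_to_repair(incidents):
    """Return the average repair duration across incidents as a timedelta."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


def estimated_loss(incidents, revenue_per_minute):
    """Rough downtime cost: total minutes of outage times revenue per minute."""
    minutes = sum((r - d).total_seconds() / 60 for d, r in incidents)
    return minutes * revenue_per_minute


if __name__ == "__main__":
    print("MTTR:", mean_time_to_repair(INCIDENTS))
    print("Estimated loss:", estimated_loss(INCIDENTS, revenue_per_minute=100.0))
```

If detection or resolution times are never recorded, neither number can be computed, which is what makes an unmeasured MTTR so costly.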
This example shows the importance of identifying different behaviors. Without it, we may lose not only money, but also lives.
All of these tools and tool sets have advantages and drawbacks: most often high cost, a high barrier to entry in common situations, or some other limitation. I will tell you about all of them.
Chaos Engineering was started by Netflix Engineers. Knowing this, let’s review the set of principles drawn up by Netflix:
You can see a connection with my 5 steps to perform Chaos Engineering in your company the mature way, right?
Now, the basic toolset created by Netflix is called Chaos Monkey. So, let’s start with it.
Chaos Monkey is a piece of software that was created in 2011 by Netflix, and later became part of a larger suite of programs called Simian Army, a collection of software tools designed to test the AWS infrastructure. The software is open source, allowing other cloud service users to adapt it for their own use.
More tools have been added to test different security and configuration issues.
The main and basic rule of Chaos Monkey is "the best way to elude major failures is to fail constantly…"
Unlike unexpected failures, which by definition occur randomly, often without warning, and seemingly at the worst possible times, the software is opt-out by default. It can also be configured for opt-in.
Chaos Monkey allows for simulated disturbances and failures to occur, so they can be analyzed and monitored.
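The difference between opt-out and opt-in can be sketched in a few lines. This is not Chaos Monkey's actual code or configuration format: the `chaos_enabled` key, the `groups` structure, and both function names are assumptions made for this example.

```python
import random


def eligible_groups(groups, mode="opt-out"):
    """Return the names of groups that chaos may target.

    groups maps a group name to its config dict. In "opt-out" mode every
    group is eligible unless it sets chaos_enabled to False; in "opt-in"
    mode a group is eligible only if it sets chaos_enabled to True.
    """
    if mode not in ("opt-out", "opt-in"):
        raise ValueError(f"unknown mode: {mode}")
    default = mode == "opt-out"
    return [name for name, cfg in groups.items() if cfg.get("chaos_enabled", default)]


def pick_victim(groups, mode="opt-out", rng=random):
    """Pick one random instance from one random eligible group, or None."""
    candidates = [g for g in eligible_groups(groups, mode) if groups[g]["instances"]]
    if not candidates:
        return None
    group = rng.choice(candidates)
    return group, rng.choice(groups[group]["instances"])


if __name__ == "__main__":
    demo = {
        "web": {"instances": ["i-1", "i-2"]},
        "db": {"instances": ["i-3"], "chaos_enabled": False},
    }
    print(pick_victim(demo))  # always a "web" instance, because "db" opted out
```

In opt-out mode a newly created group is exposed to failures from day one, which is the behavior the default is meant to encourage.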
Netflix engineers plan to add more monkeys to Simian Army. They are open to community suggestions.
Right, this tool set is adequate for some situations, but in others it still needs an additional EC2 instance or server. From my perspective, this always generates extra costs for a company. Nevertheless, it is a good tool for those who are starting their adventure with Chaos Engineering.
Another Netflix tool is, in fact, a set of several “Monkeys” called the Netflix Simian Army.
Fault tolerance is key in cloud computing because 100% uptime is never guaranteed (anything can break at any time). Cloud architecture has to be designed with individual components in mind, so that when one of them fails, it won't drag the whole system down with it.
Our weakest part should not dictate the performance of the whole infrastructure.
We can use techniques like graceful degradation on dependency failures, as well as redundancy at the node, rack, data center/availability zone, and even region level, with servers located in different parts of the world.
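Graceful degradation on a dependency failure can be as small as a fallback wrapper. The sketch below reuses the seats example; `DependencyError`, `with_fallback`, and both seat-map functions are hypothetical names, and the cached data is made up.

```python
class DependencyError(Exception):
    """Raised when a downstream dependency cannot be reached."""


def with_fallback(primary, fallback):
    """Call primary(); if the dependency fails, degrade to fallback()."""
    try:
        return primary()
    except DependencyError:
        return fallback()


def live_seat_map():
    # Stand-in for a remote call that is currently failing.
    raise DependencyError("seats service timed out")


def cached_seat_map():
    # Degraded answer: possibly stale, but the caller still gets a response.
    return {"source": "cache", "free_seats": ["12A", "14C"]}


result = with_fallback(live_seat_map, cached_seat_map)
print(result["source"])
```

Only `DependencyError` is caught on purpose: programming errors should still surface instead of being hidden behind the degraded path.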
Designing a fault tolerant architecture is only part of the process:
Okay, at this point we have a great tool set for many fault-injection scenarios, but we still need an additional EC2 instance or server to perform the task. That doesn’t change the fact that Netflix engineers have done great engineering work. Right now, only Kubernetes and containers are not supported. Having said that, let’s move on to the latest Netflix tool set: ChAP.
At a high level, the platform inspects the deployment pipeline for a user-specified service. After that, it launches experiment and control clusters of that service and sends traffic to each. A FIT scenario is executed on the experimental group, and the results are reported to the service owner.
“The best experiments do not disturb the customer experience!”
In ChAP, we take a small subset of traffic and distribute it evenly between the experimental clusters and the control clusters.
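One way to picture this split is a stable hash over a request identifier. This is only an illustration of the idea, not how ChAP routes traffic; `assign_cluster`, the bucket count, and the `sample_percent` parameter are assumptions.

```python
import hashlib


def assign_cluster(request_id, sample_percent=1.0):
    """Route a request to 'experiment', 'control', or 'production'.

    A stable hash places each request in one of 10,000 buckets. The first
    sample_percent of buckets form the test population, which is split
    evenly: even buckets go to control, odd buckets to experiment.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    if bucket >= sample_percent * 100:
        return "production"
    return "control" if bucket % 2 == 0 else "experiment"


if __name__ == "__main__":
    counts = {"production": 0, "control": 0, "experiment": 0}
    for n in range(100_000):
        counts[assign_cluster(f"request-{n}")] += 1
    print(counts)
```

Because the hash is deterministic, the same request identifier always lands in the same cluster, which keeps the control and experiment groups comparable for the duration of a run.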
Some failure modes are only visible when the ratio of failures to total requests in a system crosses certain thresholds.
Load balancing and request routing for FIT requests are evenly spread throughout our production capacity.
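The effect of the failure ratio can be shown with a toy circuit breaker. In this sketch the same 60 injected failures either stay invisible or trip the breaker, depending on whether they are spread over all traffic or concentrated on a small group. `RatioBreaker`, `run`, the window size, and the threshold are hypothetical choices for the example.

```python
from collections import deque


class RatioBreaker:
    """Opens once the failure ratio over the last `window` calls reaches `threshold`."""

    def __init__(self, window=100, threshold=0.5):
        self.results = deque(maxlen=window)
        self.threshold = threshold
        self.open = False

    def record(self, failed):
        self.results.append(failed)
        window_full = len(self.results) == self.results.maxlen
        ratio = sum(self.results) / len(self.results)
        if window_full and ratio >= self.threshold:
            self.open = True


def run(total_requests, failing_indices, **kwargs):
    """Feed a request stream through a fresh breaker and report whether it opened."""
    breaker = RatioBreaker(**kwargs)
    for i in range(total_requests):
        breaker.record(i in failing_indices)
    return breaker.open


# 60 injected failures spread evenly over 6,000 requests: about 1% per window.
spread = run(6000, set(range(0, 6000, 100)))
# The same 60 failures concentrated in one small group's 100 requests: 60%.
concentrated = run(100, set(range(60)))
print(spread, concentrated)
```

With evenly spread injection the breaker never opens, so the failure mode it guards against would never be exercised.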
Some instances where critical thresholds are reached because of failing requests:
Okay, we have a great tool set for practically every imaginable fault scenario or injection, but all of these tools need an additional EC2 instance, server, or something else to perform the task.
And do we really need this? No. This is only one way of performing tests, as we can always opt for the serverless…
Why Chaos Lambda?
There is one very important reason: Chaos Lambda is Chaos Monkey within Lambda, but without the need for any additional EC2 instance!
“Tools such as chaos-lambda by Shoreditch Ops look to replicate Netflix’s Chaos Monkey, but execute from inside a Lambda function instead of an EC2 instance - hence bringing you the cost saving and convenience Lambda offers.”
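To show the general shape of the serverless approach, here is a minimal sketch of a scheduled Lambda handler that terminates at most one instance per Auto Scaling group. It is not the chaos-lambda source: the `chaos-probability` tag name, the opt-in behavior for untagged groups, and the injectable `autoscaling` and `ec2` parameters are assumptions made for this example. The boto3 calls used are `describe_auto_scaling_groups` and `terminate_instances`.

```python
import random

# Hypothetical tag name for this sketch; not chaos-lambda's real configuration.
PROBABILITY_TAG = "chaos-probability"


def choose_targets(groups, rng=random):
    """Pick at most one instance ID per Auto Scaling group, based on its tag."""
    targets = []
    for group in groups:
        tags = {t["Key"]: t["Value"] for t in group.get("Tags", [])}
        probability = float(tags.get(PROBABILITY_TAG, 0.0))  # untagged: never touched
        instances = [i["InstanceId"] for i in group.get("Instances", [])]
        if instances and rng.random() < probability:
            targets.append(rng.choice(instances))
    return targets


def handler(event, context, autoscaling=None, ec2=None):
    """Lambda entry point, meant to be invoked on a schedule."""
    if autoscaling is None or ec2 is None:
        import boto3  # available in the AWS Lambda Python runtime

        autoscaling = autoscaling or boto3.client("autoscaling")
        ec2 = ec2 or boto3.client("ec2")
    # Only the first page of groups is read here; real code should paginate.
    groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
    targets = choose_targets(groups)
    if targets:
        ec2.terminate_instances(InstanceIds=targets)
    return {"terminated": targets}
```

Passing the clients in as optional arguments keeps the selection logic testable with fakes, so the function can be exercised without touching a real AWS account.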
* Problem origins:
* Failure modules:
Okay, and what about other solutions, for example the SaaS tool Gremlin? You can read a comparison with serverless (Lambda), with code examples, in the next part. But before that, I will show you my final guidelines for Chaos Engineering and Chaos Testing in your company:
Do you want to be a customer of such broken software? Thank you for reading my introduction to Chaos Engineering. Stay tuned for more.