All software systems fail. Do you want outages to be short-lived? Better yet, do you want to avoid hysteria while they happen? The answer is no mystery at all: train hard. These are the principles of Resilience Engineering, or fault-tolerant engineering, and in this article we explain how massive service providers such as Amazon, Google, and Netflix put them into practice.
Fault-tolerant engineering is a relatively recent discipline (dating to 2001 and Erik Hollnagel) that has consolidated as the necessary evolution towards adaptive safety ("Safety-II"), even for environments as demanding as EUROCONTROL's ATM (air traffic management) systems.
In essence, this "engineering for resilient behavior" holds that the increasing complexity of systems leads to functional resonance effects: small changes in the behavior of a system can produce disproportionate, non-linear, and unpredictable consequences. In short, it is not feasible to know the exact behavior of a complex system under variable demand.
The focus therefore shifts to stabilizing performance: resilience is considered an intrinsic capacity of systems to adjust their operation before, during, and/or after changes and malfunctions, so that they can sustain the required operations under both expected and unexpected operating conditions (from the book Resilience Engineering in Practice, 2010).
the more likely something is to work properly,
the less likely it is to go wrong.
So, train hard.
Cultivating fault tolerance
This mindset inspired Jesse Robbins, a former firefighter who was responsible for website availability and held the title of «Master of Disaster» at Amazon.com. To instill a culture of "be prepared", in 2001 he promoted GameDays: exercises mixing real and simulated incidents, because he considered that training was necessary to achieve a solvent response to problems.
Similarly, since 2005 Google has run its Disaster Recovery Testing (DiRT) event, led by Kripa Krishnan: global in scope, several days long, and demanding.
«However, to get the maximum benefit from these types of recovery events,
an organization also needs to invest in
continuously testing its services.»
Taking chaos by the hand as an advantage
Netflix climbed the next rung decisively, starting with its 2010 patent "Validating the resilience of networked applications", continuing with articles on the subject published on the Netflix technical blog, and culminating in the 2012 release of the Chaos Monkey platform for Amazon through the Netflix Open Source Software Center (OSS), where it remains in continuous evolution.
Thus, starting from the idea of emulating the effects of unleashing a "wild monkey with guns" in their data centers, they created Chaos Monkey, a tool that randomly disables production instances, but lets them trigger failures when engineers are ready to observe and learn from them.
Objective: ensure that in the face of incidents
Netflix is able to continue offering
a sufficient quality of service.
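The Chaos Monkey idea described above can be sketched in a few lines. What follows is a minimal, hypothetical Python sketch of random instance termination; the inventory format, function names, and callback are illustrative assumptions, not Netflix's actual tooling:

```python
import random

def pick_victim(instances, seed=None):
    """Pick one running instance at random, Chaos Monkey style.

    `instances` is a list of dicts with 'id' and 'state' keys --
    a hypothetical inventory format for illustration only.
    """
    rng = random.Random(seed)
    running = [i for i in instances if i["state"] == "running"]
    if not running:
        return None
    return rng.choice(running)

def unleash(instances, terminate, seed=None):
    """Terminate one random running instance via the supplied callback."""
    victim = pick_victim(instances, seed)
    if victim is not None:
        terminate(victim["id"])
    return victim
```

The key design point, as in the real tool, is that the victim is chosen at random: services must survive the loss of any single instance, not just the ones engineers expect to fail.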
The success of Chaos Monkey inspired them to assemble the Simian Army, a collection of specialized virtual primates:
- Latency Monkey: introduces latencies into the RESTful client-server communication layer to simulate service degradations and even outages.
- Conformity Monkey: checks that established good practices are being applied.
- Doctor Monkey: checks the health of systems and removes the sick ones.
- Security Monkey: looks for security policy violations, vulnerabilities, or certificates about to expire, for example.
- 10-18 Monkey: the name refers to the Localization and Internationalization (l10n and i18n) of applications; it looks for problems in instances serving various languages or character sets.
- Janitor Monkey: frees up idle resources on Amazon Web Services (AWS).
- Chaos Gorilla: has a philosophy similar to Chaos Monkey's, but at the level of Amazon Availability Zones.
- Chaos Kong: can you imagine it? It simulates the failure of an entire Amazon region, which comprises several Availability Zones. Who said fear?
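The Latency Monkey idea, in particular, can be emulated by wrapping a remote call with a random delay and an optional injected failure. This is a minimal sketch with illustrative parameter names, not code from Netflix's tool:

```python
import random
import time

def with_latency(func, min_delay=0.05, max_delay=0.5,
                 failure_rate=0.0, rng=None):
    """Wrap `func` so each call first waits a random delay and may
    fail outright, emulating Latency Monkey-style degradation.

    All parameter names are illustrative assumptions.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # Inject a random delay before the real call runs.
        time.sleep(rng.uniform(min_delay, max_delay))
        # Optionally turn the degradation into a hard failure.
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)

    return wrapper
```

Raising `failure_rate` gradually lets a team observe how clients behave as a dependency degrades from slow to entirely unavailable.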
We have only mentioned it briefly, but all of this makes little sense if we cannot tell when the system fails, or whether it is performing according to its benchmarks (Service Level Agreements, SLAs).
That is, you have to continuously verify that the applications in production are working. Our automation solution Zahorí can help you with that, of course.
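As a rough illustration of checking production behavior against such benchmarks, here is a minimal sketch that evaluates a batch of request samples against example SLA thresholds; the thresholds, the sample format, and the field names are assumptions for illustration, not any real provider's SLA:

```python
def sla_report(samples, max_error_rate=0.01, p99_latency_ms=500):
    """Check request samples against illustrative SLA thresholds.

    `samples` is a list of (latency_ms, ok) tuples; thresholds are
    example values, not a real SLA.
    """
    if not samples:
        return {"ok": True, "error_rate": 0.0, "p99_ms": 0.0}
    latencies = sorted(s[0] for s in samples)
    errors = sum(1 for s in samples if not s[1])
    error_rate = errors / len(samples)
    # Nearest-rank style 99th percentile of observed latencies.
    idx = min(len(latencies) - 1, int(0.99 * len(latencies)))
    p99 = latencies[idx]
    return {
        "ok": error_rate <= max_error_rate and p99 <= p99_latency_ms,
        "error_rate": error_rate,
        "p99_ms": p99,
    }
```

A check like this, run continuously against production traffic, is what turns a chaos experiment into a measurable result rather than an anecdote.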
Where are we going with so much Chaos Monkey?
From Netflix, the consolidation of Chaos Engineering is being driven, starting with its founding document: the Principles of Chaos Engineering.
Underlying all this is the thesis of Nassim Nicholas Taleb, who in 2012 published "Antifragile: Things That Gain from Disorder", defining antifragile systems as those that improve when they suffer incidents.
This is the next level: building systems that not only withstand problems but are malleable and improve their performance as they face them.
«Antifragility is beyond resilience or robustness.
The resilient resists shocks and stays the same;
the antifragile gets better.»
Nassim Nicholas Taleb
If you are interested in continuing with the topic, we recommend:
- Session "Doing the (Chaos) Monkey" by Alejandro Guirao Rodríguez from the BBVA Innovation Center. Very didactic.
- Visit Netflix Open Source Software Center on Github.
- The view from Google: Weathering the Unexpected
- Eurocontrol White Paper: From Safety-I to Safety-II
There are two types of software systems:
those that have failed and
those that will fail.
Are you ready?