The Netflix Simian Army: reliability, security, resiliency and recoverability
- Along with price and scalability, redundancy and fault tolerance are possibly the most important drivers of cloud migration.
- The cloud architecture should tolerate the failure of individual components without affecting the availability of the entire system.
- We want to be able to test the failure scenarios.
Chaos Monkey
- Randomly disables production instances.
- Tests the ability to survive such failures without overall impact on the service.
- Leads to building automatic recovery mechanisms to deal with system failures.
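The random-termination idea above can be illustrated with a small simulation. This is a minimal sketch, not Netflix's implementation: the fleet is an in-memory dictionary, the names `chaos_round` and `service_available` are hypothetical, and removing an ID from a list stands in for a real terminate call.

```python
import random
from typing import Dict, List, Optional


def chaos_round(groups: Dict[str, List[str]], probability: float,
                rng: Optional[random.Random] = None) -> List[str]:
    """Terminate at most one randomly chosen instance per group.

    Each group is hit with the given probability. The terminated IDs are
    returned and removed from `groups` in place.
    """
    rng = rng or random.Random()
    terminated = []
    for instances in groups.values():
        if instances and rng.random() < probability:
            victim = rng.choice(instances)
            instances.remove(victim)  # stand-in for a real terminate call
            terminated.append(victim)
    return terminated


def service_available(groups: Dict[str, List[str]]) -> bool:
    """The service survives if every group still has at least one instance."""
    return all(len(instances) > 0 for instances in groups.values())


if __name__ == "__main__":
    fleet = {"api": ["api-1", "api-2", "api-3"], "web": ["web-1", "web-2"]}
    print("terminated:", chaos_round(fleet, probability=0.5))
    print("still available:", service_available(fleet))
```

Running such a round regularly against groups that hold only one instance makes the missing redundancy visible immediately, which is the point of the exercise.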
Latency Monkey
- Induces artificial delays in the RESTful client-server communication layer to simulate service degradation.
- Measures whether upstream services respond appropriately.
- With very large delays, simulates the downtime of a node or an entire service without physically bringing the instances down.
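Delay injection and the caller's reaction to it can be sketched as follows. This is an illustrative example under stated assumptions: `with_injected_latency` and `call_with_fallback` are hypothetical helpers, the "service" is a plain function rather than a REST endpoint, and the caller's "appropriate response" is assumed to be falling back to a default value after a timeout.

```python
import concurrent.futures
import random
import time
from typing import Any, Callable, Optional


def with_injected_latency(handler: Callable[[], Any], delay_s: float,
                          probability: float = 1.0,
                          rng: Optional[random.Random] = None) -> Callable[[], Any]:
    """Wrap a handler so that calls are delayed with the given probability."""
    rng = rng or random.Random()

    def wrapped() -> Any:
        if rng.random() < probability:
            time.sleep(delay_s)  # the artificial delay
        return handler()

    return wrapped


def call_with_fallback(func: Callable[[], Any], timeout_s: float,
                       fallback: Any) -> Any:
    """Call func in a worker thread; return fallback if it takes too long."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not block the caller on the slow worker.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    degraded = with_injected_latency(lambda: "fresh data", delay_s=0.3)
    print(call_with_fallback(degraded, timeout_s=0.05, fallback="cached data"))
```

Setting `delay_s` far above every caller's timeout makes the wrapped service look unavailable while its process keeps running.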
Conformity Monkey
- Finds instances that don't adhere to best practices and shuts them down.
Doctor Monkey
- Detects unhealthy instances using health checks and other external signs of health.
- Removes unhealthy instances from service.
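A minimal sketch of the detect-and-remove step is shown below. The `Instance` fields, the CPU-load signal, and the 0.9 threshold are assumptions chosen for illustration; the source only says that health checks and other external signs of health are used.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    instance_id: str
    health_check_ok: bool  # result of the instance's own health check
    cpu_load: float        # assumed external signal, 0.0 to 1.0


def is_healthy(inst: Instance, max_cpu_load: float = 0.9) -> bool:
    """Healthy means the health check passes and CPU load is within bounds."""
    return inst.health_check_ok and inst.cpu_load <= max_cpu_load


def split_by_health(instances: List[Instance],
                    max_cpu_load: float = 0.9) -> Tuple[List[Instance], List[Instance]]:
    """Return (instances kept in service, instances removed from service)."""
    in_service = [i for i in instances if is_healthy(i, max_cpu_load)]
    removed = [i for i in instances if not is_healthy(i, max_cpu_load)]
    return in_service, removed


if __name__ == "__main__":
    fleet = [
        Instance("i-1", health_check_ok=True, cpu_load=0.35),
        Instance("i-2", health_check_ok=False, cpu_load=0.20),
        Instance("i-3", health_check_ok=True, cpu_load=0.99),
    ]
    kept, removed = split_by_health(fleet)
    print("in service:", [i.instance_id for i in kept])
    print("removed:", [i.instance_id for i in removed])
```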
Janitor Monkey
- Searches for unused resources and disposes of them.
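The search for unused resources can be sketched as a simple filter. The rule used here (not attached to anything and idle for more than 30 days) and the `Resource` fields are assumptions for illustration, not the actual cleanup rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Resource:
    resource_id: str
    attached: bool        # still referenced by a running instance?
    last_used: datetime


def find_unused(resources: List[Resource], now: datetime,
                max_idle: timedelta = timedelta(days=30)) -> List[Resource]:
    """Return resources that are unattached and idle longer than max_idle."""
    return [
        r for r in resources
        if not r.attached and now - r.last_used > max_idle
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    inventory = [
        Resource("vol-1", attached=False, last_used=datetime(2023, 11, 1)),
        Resource("vol-2", attached=True, last_used=datetime(2023, 11, 1)),
        Resource("vol-3", attached=False, last_used=datetime(2024, 1, 20)),
    ]
    for candidate in find_unused(inventory, now):
        print("cleanup candidate:", candidate.resource_id)
```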
Security Monkey
- Finds security violations and vulnerabilities and terminates the offending instances.
10-18 Monkey (Localization / Internationalization)
- Detects configuration and runtime problems in instances serving customers in multiple geographic regions.
Chaos Gorilla
- Simulates an outage of an entire Amazon availability zone.
- Services should re-balance to the functional AZs without user-visible impact or manual intervention.
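The rebalancing expectation can be checked with a back-of-the-envelope model. This is a sketch under stated assumptions: load is spread across the surviving zones in proportion to their capacity, the function name `rebalance` is hypothetical, and the zone names and numbers are purely illustrative.

```python
from typing import Dict


def rebalance(capacity_by_zone: Dict[str, float], failed_zone: str,
              total_load: float) -> Dict[str, float]:
    """Spread total_load over the surviving zones, proportional to capacity.

    Raises RuntimeError if the survivors cannot absorb the load, i.e. the
    outage would be visible to users.
    """
    survivors = {z: c for z, c in capacity_by_zone.items() if z != failed_zone}
    total_capacity = sum(survivors.values())
    if total_capacity < total_load:
        raise RuntimeError("surviving zones cannot absorb the load")
    return {z: total_load * c / total_capacity for z, c in survivors.items()}


if __name__ == "__main__":
    # Illustrative capacities in requests per second.
    zones = {"us-east-1a": 100.0, "us-east-1b": 100.0, "us-east-1c": 100.0}
    print(rebalance(zones, failed_zone="us-east-1a", total_load=150.0))
```

With three equally sized zones, the model makes the headroom requirement explicit: the fleet survives the loss of one zone only if normal load stays at or below two thirds of total capacity.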
The Simian Army project on GitHub has been retired and its functionality has been moved to other Netflix projects. Check the Simian Army GitHub page for more details about the new projects.