Rent A Chaos Monkey From Netflix

Video rental company Netflix has used its extensive consumption of the Amazon Web Services cloud to give something back to the open source community. The company’s Chaos Monkey system was developed to ensure that its operations were capable of self-healing (or at least continuing to run) should instances in the AWS cloud fail. This month sees the firm open source its code.

The firm’s Cory Bennett and Ariel Tseitlin write on the Netflix tech blog that over the last year, “Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey, which allows us to isolate and resolve them so they don’t happen again.”

Chaos Monkey is perhaps not that “chaotic”; it is in fact completely configurable, and although it targets AWS today, its design leaves room for other clouds. “The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support,” write the pair.
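For readers who want a feel for the idea, the following is a minimal sketch of configurable, random instance termination. It is a simulation only, not Netflix's code, and it makes no cloud API calls. The `InstanceGroup` class, its `enabled` and `probability` settings, and both function names are assumptions invented for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class InstanceGroup:
    """A named set of instance IDs (hypothetical model, not a real cloud API)."""
    name: str
    instances: list
    enabled: bool = True        # assumed per-group opt-in switch
    probability: float = 1.0    # assumed chance that a run picks a victim


def pick_victims(groups, rng):
    """Choose at most one instance per enabled, non-empty group."""
    victims = {}
    for group in groups:
        if not group.enabled or not group.instances:
            continue
        if rng.random() < group.probability:
            victims[group.name] = rng.choice(group.instances)
    return victims


def terminate(groups, victims):
    """Simulate termination by removing each chosen instance from its group."""
    for group in groups:
        victim = victims.get(group.name)
        if victim is not None:
            group.instances.remove(victim)
```

In this sketch a group that has opted out is never touched, which mirrors the article's point that the tool is configurable rather than truly chaotic. A real deployment would replace `terminate` with a call to the cloud provider.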

Justifying this open source release and clarifying that their company’s technical prowess is fit for public consumption, Bennett and Tseitlin explain that the cloud is all about redundancy and fault tolerance, because no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails).

This means that Netflix had to design a cloud architecture in which individual components can fail without affecting the availability of the entire system, and the company has now shared the resulting tool with the open source community.

“In effect, we have to be stronger than our weakest link. We can use techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter/availability-zone, and even regionally-redundant deployments. But just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these ‘once in a blue moon’ failures,” write the pair.
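The “graceful degradation on dependency failures” the pair mention can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not a description of how Netflix implements it: the `with_fallback` helper and both title functions are hypothetical names made up for the example.

```python
def with_fallback(primary, fallback):
    """Wrap a dependency call so that a failure yields a degraded answer."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # The dependency failed; serve the simpler result instead of an error.
            return fallback(*args, **kwargs)
    return call


def personalized_titles(user_id):
    """Hypothetical dependency that is currently down."""
    raise ConnectionError("recommendation service unavailable")


def popular_titles(user_id):
    """Hypothetical fallback that needs no remote call."""
    return ["Title A", "Title B"]


get_titles = with_fallback(personalized_titles, popular_titles)
```

Here a call such as `get_titles("user-1")` would return the generic list rather than fail, so the user sees a less tailored page instead of an outage. Terminating instances at random is one way to check that fallbacks of this kind actually engage.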

Netflix says it has plans to open up more of its architecture and may move to release Janitor Monkey next, a cloud “clutter and waste” management tool. The firm also sports Latency Monkey, Conformity Monkey, Doctor Monkey, and Security Monkey among its “Simian Family” of technologies.