Netflix Availability Tools
Hystrix is a library that provides control over interactions among distributed services and enables efficient handling of latency and failure. It helps to:
- Isolate points of access to remote systems, services and third parties.
- Prevent cascading failure.
- Enable resilience in complex distributed systems where failure is more likely.
Netflix developed Hystrix during its resilience-engineering work in 2011. At Netflix, it now executes tens of billions of thread-isolated and hundreds of billions of semaphore-isolated calls per day.
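To make the isolation and cascading-failure ideas concrete, here is a minimal sketch of a semaphore-bounded circuit breaker with a fallback. It is not Hystrix's API: the class name `CircuitBreaker`, its parameters and its thresholds are all assumptions chosen for illustration.

```python
import threading
import time


class CircuitBreaker:
    """Toy breaker (hypothetical, not Hystrix): bounded calls plus fallback."""

    def __init__(self, max_concurrent=2, failure_threshold=3, reset_after=5.0,
                 clock=time.monotonic):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed
        self._lock = threading.Lock()

    def is_open(self):
        with self._lock:
            if self._opened_at is None:
                return False
            # After the cool-down, let a trial call through (half-open).
            return self._clock() - self._opened_at < self._reset_after

    def call(self, remote, fallback):
        # Short-circuit: skip the remote call while the circuit is open.
        if self.is_open():
            return fallback()
        # Semaphore isolation: reject instead of queueing when saturated.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            result = remote()
        except Exception:
            self._record_failure()
            return fallback()
        else:
            self._record_success()
            return result
        finally:
            self._slots.release()

    def _record_failure(self):
        with self._lock:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()

    def _record_success(self):
        with self._lock:
            self._failures = 0
            self._opened_at = None
```

Once the failure threshold is reached, callers get the fallback immediately instead of waiting on a dependency that is already struggling, which is one way a failure is kept from cascading upstream.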
SimianArmy is a set of tools and services, called monkeys, that generate failures and detect abnormal conditions in the cloud. Its purpose is to keep the cloud safe and secure. It currently includes the following monkeys:
- Chaos Monkey randomly shuts down instances within a group of systems in order to continuously test the system's disaster readiness.
- Janitor Monkey helps clean up unused resources in the AWS cloud, and can also be used with other cloud providers.
- Conformity Monkey identifies AWS cloud instances that don’t conform to predefined best-practice rules.
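The Chaos Monkey idea can be sketched in a few lines. This is an illustrative simplification, not SimianArmy code: the function names, the `probability` parameter and the `terminate` callback are assumptions, and instance groups are modeled as a plain dictionary.

```python
import random


def pick_victims(groups, probability, rng=None):
    """Pick at most one random instance per group, with the given probability."""
    rng = rng or random.Random()
    victims = {}
    for group_name, instances in groups.items():
        # Empty groups are skipped; others are hit only some of the time.
        if instances and rng.random() < probability:
            victims[group_name] = rng.choice(instances)
    return victims


def unleash(groups, probability, terminate, rng=None):
    """Terminate the selected instances via the caller-supplied callback."""
    victims = pick_victims(groups, probability, rng)
    for group_name, instance_id in victims.items():
        terminate(group_name, instance_id)
    return victims
```

Passing the `terminate` callback in from outside keeps the selection logic testable without touching any real infrastructure.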
Turbine is a stream-processing engine with two significant features: low latency and high throughput. It provides real-time insights into distributed systems, even those comprising thousands of servers. Its distinguishing feature is speed: it delivers real-time data within seconds. Turbine is used primarily for monitoring Hystrix metrics across key systems.
Netflix Cloud Management Tools
Ice is an AWS usage tool that provides a complete view of the cloud landscape’s usage and cost. It shows usage patterns across the globe, broken down by region, availability zone and/or service team. Ice is a Grails project that includes three parts:
- Processor converts detailed AWS billing files into data in a format the reader can easily consume.
- Reader reads the processor's output and serves it to the UI.
- UI queries the reader and presents the results in interactive graphs.
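The processor-to-reader flow can be pictured with a small sketch. The CSV columns (`region`, `zone`, `team`, `cost`) and both function names are assumptions made for this example. Real AWS billing files and Ice's internal formats are considerably more detailed.

```python
import csv
import io
from collections import defaultdict


def process_billing(csv_text):
    """Processor step: turn raw billing rows into typed records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "region": row["region"],
            "zone": row["zone"],
            "team": row["team"],
            "cost": float(row["cost"]),  # hypothetical column, parsed to a number
        })
    return records


def summarize(records, key):
    """Reader step: total cost grouped by one dimension (region, zone or team)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost"]
    return dict(totals)
```

A UI layer would then call `summarize` with whichever dimension the user selects and chart the resulting totals.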
Asgard is a web interface that aids application deployment and cloud management by:
- Limiting the set of permitted characters for application names.
- Deploying a new version of an application in a way that can be rolled back quickly in times of failure.
- Handling deployment-process automation.
- Providing graphical controls for modifying Auto Scaling groups (ASGs) and helping to set up metrics-driven autoscaling.
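As an illustration of the first point, a name check might look like the sketch below. The permitted character set (ASCII letters and digits only) is an assumption made for this example, not a statement of Asgard's exact rule, and both function names are hypothetical.

```python
import re

# Assumed rule for illustration: letters and digits only, so that a
# delimiter such as "-" can never be confused with part of the name.
APP_NAME_PATTERN = re.compile(r"[A-Za-z0-9]+")


def is_valid_app_name(name):
    """True when the whole name consists of permitted characters."""
    return APP_NAME_PATTERN.fullmatch(name) is not None


def invalid_characters(name):
    """Return the offending characters in order of first appearance."""
    seen = []
    for ch in name:
        if not re.fullmatch(r"[A-Za-z0-9]", ch) and ch not in seen:
            seen.append(ch)
    return seen
```

Restricting names up front means that tooling which later derives resource names from the application name never has to guess where one component ends and the next begins.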
Netflix Big Data Tools
Aegisthus is a MapReduce program for reading data from Cassandra’s SSTables and making it easily available for Big Data analysis. It handles up to 100 datasets per day, totaling up to 20 TB of data. Aegisthus works in conjunction with two other tools that make Big Data workflow handling easier:
- Frankie provides a standard metadata set for data across various sources.
- Genie serves as the building block for applications that need to launch Hadoop jobs.
Genie aids Hadoop-ecosystem job and resource management in the cloud. It provides two types of services:
- RESTful Execution Service helps to abstract physical details of the Hadoop ecosystem and provides the execution service to allow Hadoop, Hive and Pig job submissions without the need for a Hadoop client.
- Configuration Service serves as a registry for clusters and their related Hive and Pig configurations.
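How the two services relate can be sketched with an in-memory model. This is not Genie's REST API: the class names, method names, job fields and the `SUBMITTED` status are all hypothetical and only show how a registry lets a caller submit a job without knowing cluster details.

```python
class ConfigurationService:
    """Registry of clusters and the configuration each one uses (hypothetical)."""

    def __init__(self):
        self._clusters = {}

    def register(self, name, job_types, config):
        self._clusters[name] = {"job_types": set(job_types), "config": dict(config)}

    def find_cluster(self, job_type):
        # Sorted iteration keeps the choice deterministic for this sketch.
        for name in sorted(self._clusters):
            if job_type in self._clusters[name]["job_types"]:
                return name, self._clusters[name]["config"]
        raise LookupError(f"no registered cluster accepts {job_type!r} jobs")


class ExecutionService:
    """Accepts job requests and resolves where they would run (hypothetical)."""

    def __init__(self, registry):
        self._registry = registry
        self._next_id = 1

    def submit(self, job_type, script):
        cluster, config = self._registry.find_cluster(job_type)
        job = {"id": self._next_id, "type": job_type, "script": script,
               "cluster": cluster, "config": config, "status": "SUBMITTED"}
        self._next_id += 1
        return job
```

The caller names only a job type and a script. Which cluster runs the job, and with what configuration, is resolved from the registry.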
Lipstick is a Pig workflow visualization tool. Given the high-level abstraction of Pig workflows, it’s necessary to understand the logical flow in execution of Pig scripts, a task that can become difficult to handle in complex scenarios. Lipstick comes to the rescue by providing Pig script visualization and monitoring features.
PigPen is map-reduce for Clojure: it provides the ability to write map-reduce queries as programs instead of as scripts. PigPen also aids unit testing and iterative development.
S3mper provides an additional layer of consistency to Amazon S3’s index by using a consistent secondary index. It detects when an S3 list operation returns inconsistent results and helps recover from them. S3mper features:
- Recovery: delays a listing and retries it when an inconsistency is detected.
- Notification: raises an alert when a consistent listing cannot be achieved.
- Reporting: provides events to track the number of recoveries, files missed, jobs affected and so on.
- Administration: provides utilities to inspect the metastore and to resolve conflicts between the secondary index and the S3 index.
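The recovery and notification behavior can be sketched as a retry loop that compares a listing against the keys recorded in the secondary index. This is a simplified illustration, not S3mper's implementation: the function name, the `list_s3` callable, and the `retries`, `delay`, `sleep` and `notify` parameters are all assumptions.

```python
import time


def consistent_listing(list_s3, expected_keys, retries=3, delay=1.0,
                       sleep=time.sleep, notify=print):
    """Re-list until every key in the secondary index shows up, or give up."""
    expected = set(expected_keys)
    missing = expected
    for attempt in range(retries + 1):
        listed = set(list_s3())
        missing = expected - listed
        if not missing:
            return sorted(listed), attempt  # attempt == number of recoveries
        if attempt < retries:
            sleep(delay)  # recovery: wait for the listing to catch up, then retry
    # Notification: the listing never became consistent.
    notify(f"listing still missing {len(missing)} key(s): {sorted(missing)}")
    raise RuntimeError("S3 listing inconsistent with secondary index")
```

Returning the attempt count alongside the listing gives a reporting layer something to record, such as how many recoveries a job needed.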