Hadoop Distributions – A Detailed Comparative Study | Prelude to a free White Paper
"Big Data" is a term that has been buzzing around a lot for the last few years, and wherever you hear that buzz, you'll hear "Hadoop" as well. In the last 2-3 years, many big players in the industry, including Intel, Microsoft, IBM, and EMC, have come up with their own distributions of Apache Hadoop. Some startups that focus only on Hadoop, such as Cloudera and Hortonworks, have also become big players in this area.
Each Hadoop distributor claims that its distribution is the best one out there. Each distribution has some unique features that may be really useful for one set of users and of little use to another.
With so many distributors to choose from, finding the one that matches your requirements can be difficult, especially when you are spending money on a distribution and its support.
Update: The free white paper comparing the Hadoop Distributions is ready for download! Click here or check the resources section on the sidebar to download the whitepaper for free.
Several kinds of users may need to deploy Hadoop. Some of them are listed below:
1. Senior management at a company that wants to move to Big Data solutions using Hadoop.
2. A developer building a tool in the Hadoop ecosystem.
3. A newbie learning Hadoop and looking for a temporary/non-serious Hadoop deployment.
Keeping these users in mind, we have completed a thorough study of the following distributions, which will be covered in a 6-part series.
1. Intel Distribution for Apache Hadoop
2. Cloudera Distribution Including Apache Hadoop
3. Hortonworks Data Platform
Through this series, we'll share our experience with each of these distributions and provide subjective as well as objective results of the feature and performance comparisons we did. This will help you shortlist distributions based on your requirements.
AWS EC2 instances (5-node cluster) – We installed each of these distributions on the cluster and studied them for feature comparisons.
Intel's HiBench benchmarking utility – For performance comparisons
HiBench is a suite for benchmarking Hadoop deployments, developed and open sourced by Intel. [You can read more about Intel HiBench and each of its benchmark tests here.]
We ran the following benchmarks from the HiBench suite:
Sort
This workload sorts its input data, which is generated using the Apache Hadoop RandomTextWriter example. It is representative of real-world MapReduce jobs that transform data from one format to another.
WordCount
This workload counts the occurrences of each word in the input data, which is generated using Apache Hadoop RandomTextWriter. It is representative of real-world MapReduce jobs that extract a small amount of interesting data from a large data set.
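To make the idea concrete, here is a minimal sketch of word counting written as a map step and a reduce step. It is plain in-memory Python rather than the actual HiBench or Hadoop code, and the function names `map_words` and `reduce_counts` and the sample lines are made up for illustration.

```python
from collections import Counter
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Illustrative input only; the real benchmark uses RandomTextWriter output.
lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_counts(chain.from_iterable(map_words(line) for line in lines))
print(counts)
```

On a real cluster the map step runs in parallel over input splits and the framework groups the pairs by word before the reduce step; the sketch collapses all of that into a single process.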
PageRank
This workload is an open-source implementation of the PageRank algorithm, a link-analysis algorithm used widely in web search engines.
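For readers unfamiliar with the algorithm, the following is a small single-machine sketch of PageRank using repeated iteration. It is not the implementation HiBench runs; the damping factor of 0.85, the iteration count, and the three-page graph are assumptions chosen only for illustration, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a graph given as {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among the pages it links to.
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

In this toy graph, page c ends up ranked above page b because it receives links from both a and b. A MapReduce version expresses each iteration as one job, which is why the workload stresses the cluster with many passes over the link data.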
Mahout Bayesian Classification
This workload tests the naive Bayesian trainer (naive Bayes is a well-known classification algorithm for knowledge discovery and data mining) in the Mahout open-source machine-learning library from Apache. It represents a typical application area of MapReduce: large-scale data mining and machine learning (for example, in Google and Facebook platforms).
Hive Aggregation
This workload models complex analytic queries of structured (relational) tables by computing the sum of each group over a single read-only table.
Hive Join
This workload models complex analytic queries of structured (relational) tables by computing both the average and sum for each group by joining two different tables.
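The two query shapes described above (a per-group sum over one table, and a per-group average and sum over two joined tables) can be sketched in plain Python as follows. The table contents, column names, and helper functions here are hypothetical and are not the schema or queries that HiBench actually uses.

```python
from collections import defaultdict

# Hypothetical tables, for illustration only.
visits = [
    {"ip": "10.0.0.1", "url": "/a", "revenue": 2.0},
    {"ip": "10.0.0.1", "url": "/b", "revenue": 4.0},
    {"ip": "10.0.0.2", "url": "/a", "revenue": 6.0},
]
rankings = {"/a": 10, "/b": 30}  # url -> page rank


def sum_by_group(rows, key, value):
    """Aggregation: sum one column for each distinct value of the group key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def join_avg_and_sum(rows, lookup):
    """Join: match rows to the lookup table by url, then aggregate per ip."""
    grouped = defaultdict(list)
    for row in rows:
        if row["url"] in lookup:  # inner join: skip rows without a match
            grouped[row["ip"]].append((lookup[row["url"]], row["revenue"]))
    return {
        ip: {
            "avg_rank": sum(rank for rank, _ in pairs) / len(pairs),
            "total_revenue": sum(revenue for _, revenue in pairs),
        }
        for ip, pairs in grouped.items()
    }


revenue_per_ip = sum_by_group(visits, "ip", "revenue")
joined = join_avg_and_sum(visits, rankings)
print(revenue_per_ip)
print(joined)
```

The join variant is the heavier of the two on a cluster, since matching rows from two tables requires shuffling data between nodes before any group can be summarized.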
Enhanced DFS IO
This workload tests the HDFS throughput of a Hadoop cluster. It computes the aggregated bandwidth by sampling the number of bytes read or written at fixed time intervals in each map task.
Each of the above benchmarks was run three times with each distribution. The average values of these results were used to prepare the performance comparison reports and graphs.
Please feel free to send requests for any particular case studies you would like us to conduct during our experiments.
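As a rough illustration of the sampling idea, the sketch below takes cumulative byte counts recorded at a fixed interval for each map task and adds up the per-interval throughput across tasks. This is a simplified assumption about how such an aggregate could be computed, not HiBench's exact method, and the function name and sample numbers are invented.

```python
def aggregated_throughput(samples_per_task, interval_seconds):
    """Return cluster-wide bytes per second for each sampling interval.

    samples_per_task holds one list per map task; each list contains the
    cumulative bytes read or written, sampled every interval_seconds.
    """
    # Only use intervals for which every task has a sample.
    n_intervals = min(len(samples) for samples in samples_per_task) - 1
    throughput = []
    for i in range(n_intervals):
        bytes_moved = sum(s[i + 1] - s[i] for s in samples_per_task)
        throughput.append(bytes_moved / interval_seconds)
    return throughput


# Made-up cumulative byte counts for two map tasks, sampled once per second.
task_a = [0, 100, 250, 400]
task_b = [0, 50, 150, 300]
rates = aggregated_throughput([task_a, task_b], interval_seconds=1.0)
print(rates)  # [150.0, 250.0, 300.0]
```

Summing per interval, rather than dividing total bytes by total time, shows how the cluster's bandwidth changes while the tasks run concurrently.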
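The averaging step amounts to something like the sketch below. The run times shown are placeholder numbers invented for the example, not measurements from our study, and the dictionary layout is just one convenient way to hold them.

```python
from statistics import mean

# Placeholder durations in seconds for three runs of each benchmark.
# These are NOT measured results; they only illustrate the calculation.
run_times = {
    "sort": [10.0, 12.0, 11.0],
    "wordcount": [20.0, 22.0, 21.0],
}

# Average the three runs of every benchmark.
averages = {name: mean(times) for name, times in run_times.items()}
print(averages)  # {'sort': 11.0, 'wordcount': 21.0}
```

Averaging several runs smooths out one-off effects such as a noisy neighbour on a shared cloud host, although three runs is still a small sample.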