Introduction to the Mainstream Hadoop Distributions

Source: Internet
Author: User
1. Apache Hadoop: The community version. As of release 2.0 it consists of the following modules:
Hadoop Common, the common utilities that support the other Hadoop modules;
Hadoop Distributed File System (HDFS), a distributed file system that provides high-throughput access to application data;
Hadoop YARN, a framework for job scheduling and cluster resource management;
Hadoop MapReduce, a YARN-based system for parallel processing of large data sets.
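To make the MapReduce model behind the last module more concrete, here is a minimal single-process sketch of a word count written in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions that only mimic the map, shuffle, and reduce stages that a real cluster would distribute across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence inside one process."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["Hadoop stores data in HDFS", "YARN schedules Hadoop jobs"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; this sketch only shows the shape of the computation.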
In addition to the community version, Hadoop currently has distributions from numerous vendors.

2. Cloudera: The most popular distribution, with the largest number of deployments. It provides powerful deployment, management, and monitoring tools. Cloudera also developed and contributed the Impala project, which can process big data in real time.

3. Hortonworks: The only vendor whose distribution is 100% open source Apache Hadoop. Hortonworks was the first vendor to use the metadata services of Apache HCatalog, and its Stinger initiative greatly optimized the Hive project. Hortonworks offers a very good, easy-to-use sandbox. It has also developed a number of enhancements and contributed them to the core trunk, which enable Apache Hadoop to run natively on Microsoft Windows platforms, including Windows Server and Windows Azure.

4. MapR: Compared with its competitors, MapR takes a number of different approaches. In particular, it supports a native UNIX file system in place of HDFS (built on non-open-source components) for better performance and ease of use, so native UNIX commands can be used instead of Hadoop commands. MapR also sets itself apart from competitors with high-availability features such as snapshots, mirroring, and stateful failover. The company also leads the Apache Drill project, an open source re-implementation of Google's Dremel that runs SQL-like queries on Hadoop data to provide real-time processing.
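The practical effect of a UNIX-style file system can be sketched with ordinary file operations. The example below is a hedged illustration, not MapR's API: it assumes the cluster file system is mounted at some local directory, and it uses a temporary directory as a stand-in for that mount point. The helper names `list_data_files` and `copy_into_cluster` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def list_data_files(root):
    """List file names under root with a normal directory listing (like `ls`)."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_file())


def copy_into_cluster(local_file, cluster_dir):
    """Copy a local file into the mounted directory with a plain copy (like `cp`)."""
    target = Path(cluster_dir) / Path(local_file).name
    shutil.copy(local_file, target)
    return target


if __name__ == "__main__":
    # A temporary directory stands in for the (assumed) cluster mount point.
    with tempfile.TemporaryDirectory() as tmp:
        cluster_dir = Path(tmp) / "cluster"
        cluster_dir.mkdir()
        source = Path(tmp) / "events.log"
        source.write_text("one\ntwo\n")
        copy_into_cluster(source, cluster_dir)
        print(list_data_files(cluster_dir))
```

With plain HDFS, the same steps would instead go through Hadoop-specific commands or client libraries, which is the difference in ease of use that the paragraph above describes.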

5. Amazon Elastic MapReduce (EMR): Unlike the other providers, this is a hosted solution that runs on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). In addition to Amazon's own distribution, MapR can also be used on EMR. The primary use case is one-time or infrequent big data processing, where EMR can save you a lot of money. However, EMR also has disadvantages. By default it includes only the Pig and Hive projects from the Hadoop ecosystem and omits many others. Also, EMR is highly optimized to work with data in S3, which has higher latency and does not place the data on your compute nodes. As a result, file I/O on EMR is much slower and has more latency than on your own Hadoop cluster or a private EC2 cluster.