First, a brief description
In addition to the open source Apache release, current Hadoop distributions include the Huawei distribution, the Intel distribution, the Cloudera distribution (CDH), the Hortonworks distribution (HDP), MapR, and others. All of them are based on Apache Hadoop. They exist because Apache Hadoop's open source license allows anyone to modify the code and publish the result as an open source or commercial product. Most domestic vendors' distributions, such as the Intel and Huawei distributions, are paid. Four Hadoop versions from abroad are free of charge: Apache Foundation Hadoop, the Cloudera version (CDH), the Hortonworks version (HDP), and the MapR version.
Second, the advantages and disadvantages of the Apache community version
Advantages:
Fully open source and free
Active community
Detailed documentation and reference material
Disadvantages:
Complex version management. Versioning is confusing, and the many versions in circulation leave users unsure which one to choose.
Complex cluster deployment, installation, and configuration. A large number of configuration files usually has to be written for the specific cluster and distributed to every node, which is error-prone and inefficient.
Complex cluster operation and maintenance. Monitoring and operating the cluster requires installing third-party software such as Ganglia or Nagios, which makes operation and maintenance difficult.
Complex ecosystem. Selecting and using Hadoop ecosystem components such as Hive, Mahout, Sqoop, Flume, Spark, and Oozie requires careful attention to version compatibility, component conflicts, and whether the components compile at all. A lot of time is often wasted compiling components and resolving version conflicts.
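To make the configuration burden described above more concrete, here is a minimal Python sketch of what hand-maintained deployment scripts often end up doing: rendering a settings dictionary into Hadoop's `*-site.xml` layout and pairing it with every node that needs a copy. The function names, host names, and port are hypothetical and chosen only for illustration; the actual copy step (scp, rsync, or a configuration management tool) is deliberately left out.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of settings in the <configuration>/<property> layout
    used by Hadoop *-site.xml files."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def plan_distribution(nodes, filename, properties):
    """Return one (node, filename, xml_text) tuple per node.
    Copying the file to each node is left to the operator's tooling."""
    xml_text = render_site_xml(properties)
    return [(node, filename, xml_text) for node in nodes]


# Hypothetical host names and port, for illustration only.
nodes = ["node1.example.com", "node2.example.com"]
plan = plan_distribution(
    nodes,
    "core-site.xml",
    {"fs.defaultFS": "hdfs://node1.example.com:8020"},
)
```

Even in this small form, every node must receive an identical, correct file, and each additional component adds more files of the same kind. This is the manual work that the deployment tools of third-party distributions aim to remove.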
Advantages and disadvantages of third-party distributions (such as CDH, HDP, and MapR)
Advantages:
Based on the Apache license, 100% open source.
Clear version management. Cloudera, for example, numbers its releases CDH1, CDH2, CDH3, CDH4, CDH5, and so on, followed by a patch level. CDH4.1.0 patch level 923.142, for instance, means that 1,065 patches (923 + 142) have been added on top of the original Apache Hadoop 0.20.2.
Better compatibility, security, and stability than Apache Hadoop. Third-party distributions are usually tested and verified, and they have numerous deployment cases and are widely used in production environments.
Fast release cadence. CDH typically ships an update every quarter and a new release every year.
Built on a stable version of Apache Hadoop, with the latest bug fixes and feature patches applied.
Deployment, installation, and configuration tools are provided, which greatly improves the efficiency of cluster deployment; a cluster can be deployed within a few hours.
Simple operation and maintenance. Tools for management, monitoring, diagnosis, and configuration changes are provided, so clusters are easy to manage and configure. Problems can be located quickly and accurately, which makes operation and maintenance simple and effective.
Disadvantages:
Risk of vendor lock-in (this can be mitigated by technical means).
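The patch-level arithmetic in the CDH versioning example can be checked with a short sketch. It assumes, as that example implies, that the patch level is a dot-separated pair of patch counts whose sum is the total number of patches; the function name is hypothetical.

```python
def total_patches(patch_level):
    """Sum the dot-separated counts in a patch level string such as "923.142"."""
    return sum(int(part) for part in patch_level.split("."))


print(total_patches("923.142"))  # prints 1065
```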
Third, a comparison of third-party distributions
Cloudera: The most established distribution, with the most deployment cases. It provides powerful deployment, management, and monitoring tools. Cloudera developed and contributed the Impala project, which can process big data in real time.
Hortonworks: A provider that uses 100% open source Apache Hadoop with no proprietary (non-open source) modifications. Hortonworks was the first provider to use the metadata service features of Apache HCatalog, and its Stinger initiative substantially optimized the Hive project. Hortonworks provides a very nice, easy-to-use sandbox for getting started. It has also developed a number of enhancements and submitted them to the core trunk, which enable Apache Hadoop to run natively on Microsoft Windows platforms, including Windows Server and Windows Azure.
MapR: Compared with its competitors, it takes a number of different approaches, most notably supporting a native Unix file system instead of HDFS (using non-open source components) for better performance and ease of use. Native Unix commands can be used instead of Hadoop commands. MapR also sets itself apart from competitors with high-availability features such as snapshots, mirroring, and stateful failover. The company also leads the Apache Drill project, an open source re-implementation of Google's Dremel that runs SQL-like queries on Hadoop data to provide real-time processing.
Amazon Elastic MapReduce (EMR): Unlike the other providers, this is a hosted solution that runs on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). In addition to Amazon's own distribution, you can also use MapR on EMR. Temporary clusters are the primary use case: if you need one-time or infrequent big data processing, EMR can save you a lot of money. It also has disadvantages. By default it includes only the Pig and Hive projects from the Hadoop ecosystem and omits many others. EMR is also highly optimized for working with data in S3, which has higher latency and does not co-locate data with your compute nodes, so file I/O on EMR is slower and has more latency than on your own Hadoop cluster or a private EC2 cluster.
The above are representative third-party distributions, and other distributions are not listed.
Fourth, making the decision
When deciding whether to adopt a piece of software in an open source environment, we usually consider several factors:
(1) Whether it is open source software, that is, whether it is free.
(2) Whether a stable version exists; the software's official website usually states this.
(3) Whether it has been proven in practice; this can be checked by seeing whether large companies already use it in production.
(4) Whether it has strong community support, so that when a problem arises, a solution can be found quickly through online resources such as communities and forums.
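The four factors above can be written down as a simple checklist. The sketch below is only one way to record them; the class name, field names, and the example candidate are hypothetical, and the values shown are placeholders rather than an assessment of any real distribution.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    open_source: bool           # factor (1): open source and free
    has_stable_release: bool    # factor (2): a stable version exists
    proven_in_production: bool  # factor (3): used in production by large companies
    strong_community: bool      # factor (4): active community and forums


def unmet_factors(candidate):
    """Return the names of the factors the candidate does not satisfy."""
    checks = {
        "open_source": candidate.open_source,
        "has_stable_release": candidate.has_stable_release,
        "proven_in_production": candidate.proven_in_production,
        "strong_community": candidate.strong_community,
    }
    return [factor for factor, satisfied in checks.items() if not satisfied]


# Placeholder values for a made-up candidate, not a real evaluation.
example = Candidate("ExampleDistro", True, True, False, True)
print(unmet_factors(example))  # prints ['proven_in_production']
```

Listing the unmet factors, rather than reducing everything to a single score, keeps the reason for rejecting a candidate visible.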
To sum up
Considering efficient deployment and installation of the big data platform, centralized configuration management, stability, compatibility, and scalability in use, simpler and more efficient operation and maintenance later on, and a lower cost of resolving problems when they arise, a third-party distribution is recommended.