How to choose the right solution for your Hadoop platform

Source: Internet
Author: User
Tags: hortonworks, mapr, hadoop, mapreduce, hadoop ecosystem

Multiple options for the Hadoop platform

There are a variety of options for the Hadoop platform. You can install only the plain Apache release, choose one of several distributions offered by different providers, or decide to use a big data suite. It is important to understand that every distribution contains Apache Hadoop, and almost every big data suite contains or uses a distribution.


Let's take a look at each of these options, starting with Apache Hadoop.

Apache Hadoop: The current version of the Apache Hadoop project (version 2.0) contains the following modules:
Hadoop Common: a set of common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
It is very easy to install Apache Hadoop standalone on your local system (just unzip it, set a few environment variables, and start using it). But this is only appropriate for getting started and working through some basic tutorials.


If you want to install Apache Hadoop on one or more “true nodes”, things get a lot more complicated.

Issue 1: Complex cluster setup

You can use pseudo-distributed mode to simulate a multi-node installation on a single node: everything runs on one server as if it were distributed across several. Even in this mode, you have to do a lot of configuration work. If you want to set up a cluster of several nodes, the process undoubtedly becomes more complex. If you are a novice administrator, you will struggle with issues such as user permissions, access rights, and so on.

Issue 2: Use of the Hadoop ecosystem
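As a minimal sketch of the configuration work involved, a pseudo-distributed setup typically needs at least the following entries in `core-site.xml` and `hdfs-site.xml` (the port and replication values shown are common defaults, not requirements):

```xml
<!-- core-site.xml: point the default file system at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can only hold one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

On top of this come SSH setup, formatting the NameNode, and starting the daemons, which is exactly where novice administrators run into the permission problems mentioned above.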

In Apache, all projects are independent of each other. That is a good thing! But the Hadoop ecosystem includes many other Apache projects in addition to Hadoop:

Pig: A platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs.
Hive: A data warehouse system for Hadoop that provides a SQL-like query language, making it easy to summarize data, run ad-hoc queries, and analyze big data stored in Hadoop-compatible file systems.
HBase: A distributed, scalable big data store that supports random, real-time read/write access.
Sqoop: A tool designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases.
Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large volumes of log data.
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

You need to install these projects and manually integrate them into Hadoop.

Issue 3: Business Support

Apache Hadoop is just an open source project. That certainly has many benefits. You can access and change the source code. In fact, some companies have built on the underlying code, extending it and adding new features. A wealth of information is available in discussions, articles, blogs, and mailing lists.

The real question, however, is how to get commercial support for an open source project like Apache Hadoop. Companies usually only support their own products, not open source projects (this is not unique to Hadoop; all open source projects face this problem).

When to use Apache Hadoop

Apache Hadoop is a good fit for a first attempt because it can be installed on a local system in about 10 minutes. You can try the WordCount example (the “hello world” of Hadoop) and explore some MapReduce Java code. If you do not want to use a “real” Hadoop distribution (see the next section), Apache Hadoop is also the right choice. However, I see no reason not to use a distribution, because they are also available in free, non-commercial versions.
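To give a feel for what that “hello world” involves, here is a hedged sketch of the WordCount logic in Python in the style of Hadoop Streaming (not the official Java example); the shuffle phase is simulated locally with an in-memory sort:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one "word<TAB>1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Reduce phase: records arrive sorted by key; sum the counts per word."""
    def key(record):
        return record.split("\t", 1)[0]
    for word, group in groupby(sorted_records, key=key):
        total = sum(int(r.split("\t", 1)[1]) for r in group)
        yield f"{word}\t{total}"

# Simulate the whole job locally: map, sort (Hadoop's shuffle), reduce.
text = ["hello hadoop", "hello world"]
for record in reducer(sorted(mapper(text))):
    print(record)
```

On a real cluster, the mapper and reducer would run as separate processes reading stdin and writing stdout, launched through the Hadoop Streaming jar rather than called directly.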

So, for a real Hadoop project, I highly recommend using a Hadoop distribution instead of Apache Hadoop. The following section explains the advantages of this choice.


Hadoop distributions

A Hadoop distribution resolves the issues mentioned in the previous section. The business model of a distribution provider relies on its own distribution. Providers offer packaging, tooling, and commercial support, which greatly simplify not only development but also operations. A Hadoop distribution packages together the different projects of the Hadoop ecosystem. This ensures that all the versions used work together smoothly. Distributions are published periodically and contain version updates for the different projects.

On top of the packaging, distribution providers also offer graphical tools for deploying, managing, and monitoring Hadoop clusters. This makes it much easier to set up, manage, and monitor complex clusters, and saves a lot of work. As mentioned in the previous section, it is hard to get commercial support for plain Apache Hadoop, while providers do offer commercial support for their own Hadoop distributions.

Hadoop distribution providers

Currently, beyond Apache Hadoop itself, the troika of Hortonworks, Cloudera, and MapR lead the field with their distributions. Other Hadoop distributions have also appeared in the meantime, such as EMC's Pivotal HD and IBM's InfoSphere BigInsights. With Amazon Elastic MapReduce (EMR), Amazon even provides a hosted, preconfigured solution in its cloud.

Many other software providers do not develop their own Hadoop distributions but instead work with one of the distribution providers. For example, Microsoft works with Hortonworks, particularly to bring Apache Hadoop to the Windows Server operating system and the Windows Azure cloud service. Another example is Oracle, which offers a big data appliance combining its own hardware and software with Cloudera's Hadoop distribution. Software providers like SAP and Talend support several different distributions at the same time.

How do I choose the right Hadoop distribution?

This article does not evaluate each Hadoop distribution. However, here is a brief introduction to the major distribution providers. There are generally only minor differences between distributions, and providers treat these differences as their secret sauce and as what sets their products apart. The following list explains these differences:

Cloudera: The most mature distribution, with the most deployments. Provides powerful deployment, management, and monitoring tools. Cloudera developed and contributed the Impala project, which processes big data in real time.
Hortonworks: The only provider that uses 100% open source Apache Hadoop without any proprietary (non-open source) modifications. Hortonworks was the first provider to use the Apache HCatalog metadata service feature, and its Stinger initiative dramatically optimizes the Hive project. Hortonworks provides a very good, easy-to-use sandbox for getting started. Hortonworks has developed many enhancements and contributed them to the core trunk, enabling Apache Hadoop to run natively on the Microsoft Windows platform, including Windows Server and Windows Azure.
MapR: Uses a number of concepts different from its competitors, especially in pursuit of better performance and ease of use, supporting a native UNIX file system rather than HDFS (using non-open source components). You can use native UNIX commands instead of Hadoop commands. In addition, MapR distinguishes itself from its competitors with high-availability features such as snapshots, mirroring, and stateful failover. The company also leads the Apache Drill project, an open source re-implementation of Google's Dremel, to execute SQL-like queries on Hadoop data and provide real-time processing.

To make the right choice, learn the concepts of each distribution and try them out. Verify the tools provided and analyze the total cost of the enterprise edition plus commercial support. After that, you can decide which distribution is right for you.

When do I use a Hadoop distribution?

Because distributions offer the benefits of packaging, tooling, and commercial support, a Hadoop distribution should be used in most cases. It is rare to use the plain Apache Hadoop release and build your own distribution on top of it: you would have to test your own packaging, build your own tools, and write your own patches. Others have already run into the same problems you would encounter. So make sure you have a good reason not to use a Hadoop distribution.

However, even a Hadoop distribution requires a lot of effort. You still need to write a lot of code for your MapReduce jobs and integrate all of your disparate data sources into Hadoop. And this is where the big data suites come in.


Conclusion

There are several options for installing Hadoop. You can use only the Apache Hadoop project and create your own distribution from the Hadoop ecosystem. Hadoop distribution providers like Cloudera, Hortonworks, or MapR add features such as tooling and commercial support to Apache Hadoop to reduce the effort required of users. On top of a Hadoop distribution, you can use a big data suite for additional features such as modeling, code generation, big data job scheduling, and integration of all kinds of data sources. Be sure to evaluate the different choices to make the right decision for your big data project.
