How do I pick the right big data or Hadoop platform?
This year, big data has become a topic in many companies. While there is no standard definition of what "big data" is, Hadoop has become the de facto standard for processing big data. Almost all large software providers, including IBM, Oracle, SAP, and even Microsoft, use Hadoop. However, once you have decided to use Hadoop for big data processing, the first question is how to start and which product to choose. There are several ways to install a version of Hadoop and implement big data processing. This article discusses the different options and recommends when to use each one.
Multiple options for the Hadoop platform
The following figure shows the options for a Hadoop platform. You can install just the Apache release, select one of several distributions offered by different providers, or decide to use a big data suite. It is important to understand that every distribution contains Apache Hadoop, and that almost every big data suite contains or uses a distribution.
Below, we take a closer look at each option, starting with Apache Hadoop.
The current version of the Apache Hadoop project (version 2.0) contains the following modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
It is easy to install Apache Hadoop standalone on a local system (just unzip it, set some environment variables, and start using it). However, this is only suitable for getting started and working through some basic tutorials.
If you want to install Apache Hadoop on one or more "real" nodes, things get much more complicated.
Problem 1: Complex cluster setup
You can use pseudo-distributed mode to simulate a multi-node installation on a single node. In other words, one server behaves as if the installation were spread across several different servers. Even in this mode, you have to do a lot of configuration work. If you want to set up a cluster of several nodes, the process becomes even more complex. If you are a novice administrator, you will have to struggle with user rights, access rights, and so on.
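As a rough illustration of the configuration work involved, the sketch below generates the two small files that are typically edited for pseudo-distributed mode. The property names `fs.defaultFS` and `dfs.replication` are standard Hadoop 2 settings. The host, the port 9000, and the helper `hadoop_site_xml` are assumptions chosen for illustration and do not come from the article.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of property names and values as Hadoop *-site.xml content."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# core-site.xml: point the default file system at an HDFS NameNode on this machine.
# Host and port are assumptions; adjust them to your own setup.
core_site = hadoop_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})

# hdfs-site.xml: keep a single copy of each block, because there is only one DataNode.
hdfs_site = hadoop_site_xml({"dfs.replication": 1})

print(core_site)
print(hdfs_site)
```

A real multi-node cluster needs considerably more than this (host lists, SSH access, YARN settings, user permissions), which is the complexity described above.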
Problem 2: Using the Hadoop ecosystem
At Apache, all projects are independent of each other. This is a good thing! However, the Hadoop ecosystem contains many other Apache projects besides Hadoop:
Pig: A platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs and an infrastructure for evaluating these programs.
Hive: A data warehouse system for Hadoop that provides a SQL-like query language for easy data aggregation, ad-hoc queries, and analysis of big data stored in Hadoop-compatible file systems.
HBase: A distributed, scalable big data store that supports random, real-time read/write access.
Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
There are other projects as well.
You need to install these projects and integrate them with Hadoop manually.
You also need to pay attention to the different versions and releases. Unfortunately, not all versions work perfectly together. You have to compare the release notes and find a combination that works. Hadoop comes in a multitude of different versions, branches, features, and so on. Unlike the simple version numbers 1.0, 1.1, and 2.0 that you know from other projects, Hadoop versioning is far less straightforward. If you would like to learn more about the Hadoop "version hell", read the article "The Elephant's family tree (Genealogy of elephants)".
Problem 3: Commercial support
Apache Hadoop is "just" an open source project. Of course, this has many benefits. You can access and change the source code. In fact, some companies have used and extended the underlying code and added new features. A lot of information is available in discussions, articles, blogs, and mailing lists.
However, the real question is how to get commercial support for an open source project like Apache Hadoop. Companies usually support only their own products, not open source projects (this applies to all open source projects, not just Hadoop).
When to use Apache Hadoop?
Because a standalone installation on a local system takes only about 10 minutes, Apache Hadoop is ideal for a first attempt. You can try the WordCount example (the "Hello World" of Hadoop) and browse through some of the MapReduce Java code.
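To give a feel for what the WordCount example does before you open the Java sources, here is a minimal sketch of the same map, shuffle, and reduce steps in plain Python. It runs in a single process and does not use Hadoop at all. The function names are hypothetical and only mirror the phases of the MapReduce model.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


result = word_count(["Hello Hadoop", "hello big data"])
print(result)
```

In a real Hadoop job, the mapper and reducer run in parallel on different nodes and the framework performs the shuffle. That distribution is what the sketch leaves out.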
Apache Hadoop is also the right choice if you do not want to use a "real" Hadoop distribution (see the next section). However, I see no reason not to use a Hadoop distribution, because free, non-commercial editions are available too.
So, for real Hadoop projects, I highly recommend using a Hadoop distribution instead of plain Apache Hadoop. The following section explains the advantages of this choice.
A Hadoop distribution addresses the problems mentioned in the previous section. The business model of a distribution provider relies entirely on its own distribution. Providers offer packaging, tooling, and commercial support, which greatly simplifies not only development but also operations.
A Hadoop distribution packages the different projects of the Hadoop ecosystem together. This ensures that all included versions work together smoothly. Distributions are released regularly and contain version updates for the different projects.
Distribution providers also offer graphical tools for deploying, managing, and monitoring Hadoop clusters. This makes it much easier to set up, manage, and monitor complex clusters, and saves a lot of work.
As mentioned in the previous section, it is hard to get commercial support for the plain Apache Hadoop project, whereas providers offer commercial support for their own Hadoop distributions.
Hadoop distribution providers
Currently, besides Apache Hadoop, the three leading distributions, Hortonworks, Cloudera, and MapR, are roughly on par with each other. Other Hadoop distributions have also appeared in the meantime, for example Pivotal HD from EMC and IBM's InfoSphere BigInsights. With Amazon Elastic MapReduce (EMR), Amazon even offers a hosted, pre-configured solution on its cloud.
Although many other software providers do not develop their own Hadoop distributions, they work with one of the distribution providers. Microsoft and Hortonworks, for example, collaborate with each other, especially on bringing Apache Hadoop to the Windows Server operating system and the Windows Azure cloud service. Another example is Oracle, which offers a big data appliance that combines its own hardware and software with Cloudera's Hadoop distribution. Software providers such as SAP and Talend support several different distributions at the same time.
How do I choose the right Hadoop distribution?
This article does not evaluate the individual Hadoop distributions. However, the following is a brief introduction to the major distribution providers. There are usually only a few subtle differences between distributions, and providers treat these differences as the "secret sauce" of their products. The following list explains these differences:
Cloudera: The most established distribution, with the largest number of deployments. It provides powerful tools for deployment, management, and monitoring. Cloudera develops and contributes the Impala project, which can process big data in real time.
Hortonworks: The only provider that uses 100% open source Apache Hadoop without any proprietary (non-open-source) modifications. Hortonworks was the first provider to use the metadata service features of Apache HCatalog. Its Stinger initiative has also greatly optimized the Hive project. Hortonworks offers a very good, easy-to-use sandbox for getting started. Hortonworks developed a number of enhancements and committed them to the core trunk, which allows Apache Hadoop to run natively on Microsoft Windows platforms, including Windows Server and Windows Azure.
MapR: It uses some different concepts than its competitors, in particular support for a native UNIX file system instead of HDFS (using non-open-source components) for better performance and ease of use. You can use native UNIX commands instead of Hadoop commands. In addition, MapR differentiates itself from its competitors with high-availability features such as snapshots, mirroring, and stateful failover. The company also leads the Apache Drill project, an open source project inspired by Google's Dremel that is designed to run SQL-like queries on Hadoop data for real-time processing.
Amazon Elastic MapReduce (EMR): It differs from the other providers in that it is a hosted solution running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). In addition to Amazon's own distribution, you can also use MapR on EMR. Temporary clusters are the primary use case. If you need one-time or infrequent big data processing, EMR may save you a lot of money. However, there are disadvantages. By default, it includes only the Pig and Hive projects from the Hadoop ecosystem and leaves out many others. Also, EMR is highly optimized for working with data in S3, which has higher latency and does not place the data on your compute nodes. File I/O on EMR is therefore slower and has higher latency than on your own Hadoop cluster or your private EC2 cluster.
The distributions above can be used flexibly, on their own or in combination with different big data suites. Some other distributions that have appeared in the meantime are less flexible and bind you to a specific software and/or hardware stack. For example, EMC's Pivotal HD natively integrates the Greenplum analytic database to offer real-time SQL queries and high performance on top of Hadoop, and Intel's distribution of Apache Hadoop is optimized for solid-state drives, something other Hadoop companies are not currently doing.
So if your business already has a specific vendor stack, be sure to check which Hadoop distributions it supports. For example, if you use a Greenplum database, Pivotal may be a perfect choice, while in other cases a more flexible approach might be more appropriate. For example, if you already use Talend ESB and want to start your big data project with Talend Big Data, you can choose whichever Hadoop distribution you like, because Talend does not depend on a particular distribution provider.
To make the right choice, learn the concepts behind each distribution and try it out. Check the tools provided and analyze the total cost of the enterprise edition plus commercial support. After that, you can decide which distribution is right for you.
When to use a Hadoop distribution?
Because a distribution offers packaging, tooling, and commercial support, you should use a Hadoop distribution in most use cases. It is rare to take the plain Apache Hadoop release and build your own distribution on top of it. You would have to test your own packaging, build your own tools, and write your own patches. Others have already run into the same problems you would encounter. So make sure you have good reasons not to use a Hadoop distribution.
However, even a Hadoop distribution requires a lot of effort. You still need to write a lot of code for your MapReduce jobs and integrate all your different data sources into Hadoop. This is where big data suites come in.
Big data suite
You can use a big data suite on top of Apache Hadoop or a Hadoop distribution. Big data suites typically support several different Hadoop distributions. However, some providers have implemented their own Hadoop solution. Either way, a big data suite adds several further features on top of a distribution for processing big data:
Tooling: Typically, a big data suite is built on top of an IDE such as Eclipse. Additional plug-ins make it easier to develop big data applications. You can create, build, and deploy big data services within your familiar development environment.
Modeling: Apache Hadoop or a Hadoop distribution provides the infrastructure for Hadoop clusters. However, you still have to write a lot of complex code to build your own MapReduce programs. You can write this code in plain Java, or you can use optimized languages such as Pig Latin or Hive Query Language (HQL), which generate MapReduce code. A big data suite provides graphical tools for modeling your big data services. All required code is generated automatically. You only have to configure your jobs (that is, define certain parameters). This makes it easier and more efficient to implement big data jobs.
Code generation: All code is generated. You do not have to write, debug, analyze, and optimize your MapReduce code.
Scheduling: The execution of big data jobs has to be scheduled and monitored. You do not need to write cron jobs or other code for scheduling. You can simply use the big data suite to define and manage execution plans.
Integration: Hadoop needs to integrate data from all kinds of technologies and products. Besides files and SQL databases, you also have to integrate NoSQL databases, social media such as Twitter or Facebook, messages from messaging middleware, or data from B2B products such as Salesforce or SAP. A big data suite helps a lot with integration by offering many connectors from different interfaces to Hadoop and back. You do not have to write the glue code by hand; you just use graphical tools to integrate and map all of this data. Integration capabilities often also include data quality features, such as data cleansing to improve the quality of imported data.
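The idea behind modeling and code generation can be sketched in a few lines: the user only supplies parameters, and a generator turns them into a query. The example below is a toy illustration, not how any particular suite works. The function `generate_query`, the keys of the `job` dictionary, and the table `web_logs` are all hypothetical.

```python
def generate_query(job):
    """Turn a declarative job description into a HiveQL-style aggregation query."""
    columns = ", ".join(job["group_by"])
    return (
        f"SELECT {columns}, {job['aggregate']} AS {job['alias']} "
        f"FROM {job['source']} "
        f"GROUP BY {columns}"
    )


# The "model": parameters only, no hand-written query or MapReduce code.
job = {
    "source": "web_logs",       # hypothetical table name
    "group_by": ["country"],
    "aggregate": "COUNT(*)",
    "alias": "hits",
}

query = generate_query(job)
print(query)
```

A real suite generates far more than one query, but the division of labor is the same: you configure, the tool writes the code.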
Big data suite providers
The number of big data suites continues to grow. You can choose between several open source and proprietary providers. Most large software providers, such as IBM, Oracle, and Microsoft, integrate some kind of big data suite into their software portfolio. The vast majority of these vendors support only one Hadoop distribution, either their own or that of a partner distribution provider.
On the other hand, you can also choose from vendors that focus on data processing. They offer products for data integration, data quality, enterprise service bus, business process management, and further integration components. There are proprietary providers such as Informatica and open source providers such as Talend or Pentaho. Some providers support not just one Hadoop distribution but many at the same time. For example, at the time of writing, Talend can be used with Apache Hadoop, Cloudera, Hortonworks, MapR, Amazon Elastic MapReduce, or a customized distribution such as EMC's Pivotal HD.
How do I choose the right big data suite?
This article does not evaluate individual big data suites. When you choose a big data suite, consider several aspects. The following points should help you make the right choice for your big data problem:
Simplicity: Try the big data suite yourself. This means: install it, connect it to your Hadoop installation, integrate your different interfaces (files, databases, B2B, and so on), and finally model, deploy, and execute some big data jobs. Find out for yourself how easy the big data suite is to use. It is not enough to let a provider's consultant show you how it works. Do a proof of concept yourself.
Breadth: Does the big data suite support widely used open source standards, not just Hadoop and its ecosystem, but also data integration through SOAP and REST web services, and so on? Is it open source, and can it easily be changed or extended for your specific problem? Is there a large community with documentation, forums, blogs, and events?
Features: Are all required features supported? Your Hadoop distribution (if you already use one)? All parts of the Hadoop ecosystem that you want to use? All interfaces, technologies, and products that you want to integrate? Note that too many features can significantly increase complexity and cost. So check whether you really need a very heavyweight solution. Do you really need all of its features?
Pitfalls: Watch out for certain pitfalls. Some big data suites use data-driven billing (a "data tax"), that is, you have to pay for each row of data you process. Because we are talking about big data, this can become very expensive. Not all big data suites generate native Apache Hadoop code; often a proprietary engine has to be installed on every server of the Hadoop cluster, which takes away your independence from the software provider. Also consider what you really want to do with the big data suite. Some solutions only support using Hadoop for ETL to populate data warehouses, while others also provide further processing, such as post-processing, transformation, or big data analysis on the Hadoop cluster. ETL is only one use case for Apache Hadoop and its ecosystem.
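A quick back-of-the-envelope calculation shows why per-row billing deserves attention. The sketch below assumes a price per million processed rows. Both the price and the daily volume are made-up numbers for illustration only, not figures from any real vendor.

```python
def data_tax_cost(rows_processed, price_per_million_rows):
    """Cost of processing a number of rows under a per-row pricing model."""
    return rows_processed / 1_000_000 * price_per_million_rows


daily_rows = 2_000_000_000          # hypothetical: 2 billion rows per day
price_per_million_rows = 0.50       # hypothetical price per million rows

yearly_cost = data_tax_cost(daily_rows * 365, price_per_million_rows)
print(f"Yearly cost: {yearly_cost:,.2f}")
```

Because the cost grows linearly with the number of rows, a price that looks negligible per row can dominate the budget once volumes reach big data scale.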
Decision tree: Framework vs. distribution vs. suite
Now you know the differences between the Hadoop options. Finally, let's summarize and discuss when to choose the Apache Hadoop framework, a Hadoop distribution, or a big data suite.
The "decision tree" below will help you choose the right one:
Apache Hadoop framework:
Learn and understand the underlying details?
Expert? Choose and configure everything yourself?
Hadoop distribution:
Need commercial support?
Big data suite:
Integration of different data sources?
Need commercial support?
Graphical scheduling of big data jobs?
Implement big data processing (integration, manipulation, analysis)?
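The questions above can be read as a small decision function. The sketch below is one possible encoding. The parameter names, the order in which the questions are checked, and the fallback to a distribution (based on the earlier advice that a distribution fits most use cases) are assumptions, not a rule stated in the article.

```python
def recommend_platform(wants_low_level_control=False,
                       needs_commercial_support=False,
                       needs_suite_features=False):
    """Suggest a Hadoop option from three yes/no answers.

    needs_suite_features stands for data source integration, graphical
    scheduling, or generated code for big data processing.
    """
    if needs_suite_features:
        return "big data suite"
    if needs_commercial_support:
        return "Hadoop distribution"
    if wants_low_level_control:
        return "Apache Hadoop"
    # Assumed default: a distribution suits most use cases.
    return "Hadoop distribution"


print(recommend_platform(wants_low_level_control=True))
print(recommend_platform(needs_commercial_support=True))
print(recommend_platform(needs_commercial_support=True, needs_suite_features=True))
```

Treat the result as a starting point for evaluation rather than a verdict; real projects often combine a distribution with a suite.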
There are several options for a Hadoop installation. You can use just the Apache Hadoop project and build your own distribution from the Hadoop ecosystem. Hadoop distribution providers such as Cloudera, Hortonworks, or MapR add features such as tooling and commercial support to reduce the effort required from users. On top of a Hadoop distribution, you can use a big data suite for additional features such as modeling, code generation, scheduling of big data jobs, and integration of all kinds of data sources. Be sure to evaluate the different options to make the right decision for your big data project.
This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or
reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or
complaint, to email@example.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.