Big data has become a hot topic in many companies this year. While there is no standard definition of what "big data" is, Hadoop has become the de facto standard for processing big data. Almost all of the large software vendors, including IBM, Oracle, SAP, and even Microsoft, use Hadoop. However, once you have decided to use Hadoop, the first question is how to get started and which product to choose. You have several options for installing a version of Hadoop and processing big data. This article discusses the different options and recommends when each one applies.
Multiple options for the Hadoop platform
The following illustration shows the variety of options for the Hadoop platform. You can install just the Apache release, choose one of the distributions offered by different vendors, or decide to use a big data suite. It is important to understand that every distribution contains Apache Hadoop, and that almost every big data suite contains or uses a distribution.
Let's take a closer look at each of these options, starting with Apache Hadoop.
Apache Hadoop
The current version of the Apache Hadoop project (version 2.0) contains the following modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
It is easy to install Apache Hadoop standalone on a local system (just unpack it, set a few environment variables, and start using it). But this is only appropriate for getting started and working through some basic tutorials, such as running the classic WordCount job sketched below.
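As a concrete illustration, here is a minimal sketch of the classic WordCount MapReduce job, the usual first experiment on a fresh standalone installation. It is an illustrative example, not tied to any particular distribution, and assumes the Hadoop 2.x client libraries are on the classpath:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in its input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sums the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged as a JAR, this can be submitted with the hadoop jar command against a local input directory, which is exactly the kind of experiment a standalone installation is good for.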
If you want to install Apache Hadoop on one or more "real" nodes, that's a lot more complicated.
Problem 1: Complex cluster setup
You can use pseudo-distributed mode to simulate a multi-node installation on a single server. Even in this mode, you have to do a lot of configuration work. If you want to set up a cluster with several nodes, the process becomes more complex still. If you are a novice administrator, you will have to struggle with user permissions, access rights, and so on.
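To give an idea of the configuration work involved: even a pseudo-distributed setup requires editing several XML configuration files by hand. A minimal sketch for Hadoop 2.x might look like the following (localhost, port 9000, and a replication factor of 1 are typical tutorial defaults, not mandated values):

    <!-- core-site.xml: where the default file system lives -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: a single node can only hold one replica -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

On top of that come yarn-site.xml and mapred-site.xml, passwordless SSH between daemons, formatting the NameNode, and so on, and every additional real node multiplies the bookkeeping.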
Problem 2: Using the Hadoop ecosystem
Within Apache, all projects are independent of one another. That is a good thing! However, the Hadoop ecosystem contains many other Apache projects besides Hadoop itself:
Pig: A platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs.
Hive: A data warehouse system for Hadoop that offers a SQL-like query language for easy data aggregation, ad hoc queries, and analysis of large data sets stored in Hadoop-compatible file systems.
HBase: A distributed, scalable big data store that supports random, real-time read/write access (see the Java sketch after this list).
Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
Flume: A distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data.
ZooKeeper: A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
There are a number of other projects.
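To give a feel for what working with one of these projects looks like, here is a minimal sketch that writes and reads a single row through the HBase Java client API. The table name demo_table, column family cf, and column col1 are made-up examples, and a recent HBase client library plus a reachable HBase installation are assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to find the cluster.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("demo_table"))) {

          // Write one row with a single cell in column family "cf".
          Put put = new Put(Bytes.toBytes("row1"));
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"),
              Bytes.toBytes("hello"));
          table.put(put);

          // Read the same row back.
          Result result = table.get(new Get(Bytes.toBytes("row1")));
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
          System.out.println(Bytes.toString(value));
        }
      }
    }

Each of these projects brings its own API or language on top of this (Pig Latin, HiveQL, the Sqoop command line, and so on).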
You need to install these projects and integrate them manually into Hadoop.
You also need to pay attention to the different versions and releases. Unfortunately, not all versions work together perfectly; you have to compare the release notes and figure out which combinations do. Hadoop offers a multitude of different versions, branches, features, and so on. Unlike the simple 1.0, 1.1, 2.0 version numbering you know from other projects, Hadoop's versioning is far less straightforward. If you would like to learn more about this "Hadoop versioning hell", read the article "Genealogy of Elephants".
Problem 3: Commercial support
Apache Hadoop is "just" an open source project. That certainly has many benefits: you can access and change the source code, and some companies have in fact extended the code base and added new features. Plenty of information is available in discussions, articles, blogs, and mailing lists.
The real question, however, is how to get commercial support for an open source project such as Apache Hadoop. Vendors typically support only their own products, not open source projects (this is not specific to Hadoop; it affects all open source projects).