When we visit the official Apache Hadoop website (as of July 2014), we can see that it currently offers three recommended versions, which is bound to confuse beginners like me:
1.2.X: current stable version, 1.2 release
2.4.X: current stable 2.x version
0.23.X: similar to 2.X.X but missing NN HA
1. Why are there such strange versions?
First-generation Hadoop: during Hadoop's early development, versions 0.20, 0.21, 0.22, and 0.23 appeared. The first generation comprises three major lines: 0.20.x, 0.21.x, and 0.22.x. Of these, 0.20.x eventually evolved into 1.0.x and became the stable line of the first generation; this is the line behind the 1.2.x stable release recommended on the official website. 0.21.x and 0.22.x are sometimes associated with next-generation Hadoop, but their MapReduce still relies on JobTracker for resource management rather than YARN, so they belong to the first generation.
Second-generation Hadoop: starting with 0.23, Hadoop is completely different from the first generation. It is a brand-new architecture that includes HDFS Federation and YARN. The second-generation 2.0.x series was later developed from 0.23. As for the difference between the two, 2.x adds NameNode HA and wire compatibility on top of 0.23.x.
With this background, the three download links on the official Hadoop website are easy to understand:
1.2.x is the first-generation Hadoop framework; 2.4.x is the second-generation framework; 0.23.x is also second generation, but lacks the NN HA feature.
What is NN HA? It stands for NameNode High Availability, that is, high availability of the NameNode. Here are some introductions to the HA solution:
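To make the mapping above concrete, here is a minimal Python sketch that sorts a version string into a generation and says whether NN HA is included. The function name classify_hadoop_version is my own invention for illustration and is not part of any Hadoop tool. The rules simply restate what this article says; where the article does not say whether a line has NN HA, the sketch returns None rather than guessing.

```python
def classify_hadoop_version(version):
    """Return (generation, has_nn_ha) for a version string such as '2.4.1'.

    Hypothetical helper: the rules below only restate this article.
    has_nn_ha is None when the article does not state it.
    """
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])

    if major == 0 and minor in (20, 21, 22):
        return (1, None)    # first generation, still uses JobTracker
    if major == 0 and minor == 23:
        return (2, False)   # second generation, but without NN HA
    if major == 1:
        return (1, None)    # stable line that grew out of 0.20.x
    if major == 2:
        return (2, True)    # second generation with NN HA
    raise ValueError("version not covered by this article: " + version)


for v in ("1.2.1", "2.4.1", "0.23.10"):
    print(v, classify_hadoop_version(v))
```

The three sample versions correspond to the three download lines on the official website.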
http://wenku.baidu.com/link?url=aPnXLQjY3rXyxSwgn_9u4-7fuvmuW6WNmLDLr3YFQ7_RZjFR7YODjPK-pSbgyHBX2AZ9bzB5EYeiR09LO_ecSa6KmDNJn4R-3ImzUVGWjL_
http://www.infoq.com/cn/articles/hadoop-2-0-namenode-ha-federation-practice-zh
http://blog.csdn.net/wf1982/article/details/7793166
2. Which version should we download?
Since my work in this lab was my first contact with this framework and we needed many of its features, I chose the second-generation framework. As users, we should also pick a stable version, and the 2.4.x version offered on the official website is indeed stable. When I went to download it, the stable directory on the H3C mirror server contained 2.4.1, so we use version 2.4.1 in this project.
For the most complete list of Hadoop versions, see: http://svn.apache.org/repos/asf/hadoop/common/branches/
Strictly speaking, there are only two generations of Hadoop: Hadoop 1.0 and Hadoop 2.0 (I think "generation" is a better word here than "version", and the two terms should not be confused). Hadoop 1.0 consists of a distributed file system (HDFS) and an offline computing framework (MapReduce). Hadoop 2.0 consists of an HDFS that supports horizontal scaling of the NameNode, the resource management system YARN, and the offline computing framework MapReduce running on YARN. Compared with Hadoop 1.0, Hadoop 2.0 is more capable, scales and performs better, and supports multiple computing frameworks.
When deciding whether to adopt a piece of open-source software, we usually need to consider the following factors:
(1) Whether it is open-source software, that is, whether it is free.
(2) Whether there is a stable version. The software's official website usually states this.
(3) Whether it has been proven in practice, that is, whether large companies already use it in production.
(4) Whether it has strong community support, so that when a problem occurs a solution can be found quickly through communities, forums, and other online resources.
3. Another distribution based on open-source Hadoop
When following Hadoop, we often see versions such as CDH3 and CDH4. They are released by a company called Cloudera. Hadoop is an open-source Apache project, and Cloudera builds its own distribution on top of it, much as Red Hat does for the Linux operating system. CDH is an optimized distribution based on Apache Hadoop. Cloudera is growing strongly and looks likely to become the next Red Hat.
You can learn more on the official website: http://www.cloudera.com/content/support/en/downloads.html
4. Concepts related to the Hadoop 2 generation
(1) Hadoop 1.0
First-generation Hadoop consists of the distributed storage system HDFS and the distributed computing framework MapReduce. HDFS consists of one NameNode and multiple DataNodes; MapReduce consists of one JobTracker and multiple TaskTrackers. The corresponding Hadoop versions are 1.x, 0.21.x, and 0.22.x.
(2) Hadoop 2.0
Second-generation Hadoop was proposed to overcome the problems of HDFS and MapReduce in Hadoop 1.0. Because a single NameNode limits the scalability of HDFS in Hadoop 1.0, HDFS Federation was proposed: it lets multiple NameNodes each manage different directories, which provides access isolation and horizontal scaling. To address the weaknesses of Hadoop 1.0 MapReduce in scalability and multi-framework support, a new resource management framework, YARN (Yet Another Resource Negotiator), was proposed. It separates the resource management and job control functions of JobTracker and implements them in two components, ResourceManager and ApplicationMaster, respectively. ResourceManager allocates resources for all applications, while each ApplicationMaster manages only one application. The corresponding Hadoop versions are 0.23.x and 2.x.
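The division of labor between ResourceManager and ApplicationMaster can be sketched in a few lines of Python. This is only a toy model that I wrote to illustrate the idea, not YARN's real API: the class and method names mirror the component names, but "containers" are plain counters and "running a task" is just building a string.

```python
class ResourceManager:
    """Toy cluster-wide allocator: it sees every application."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.allocated = {}  # app_id -> containers currently held

    def allocate(self, app_id, requested):
        # Grant as many containers as are free, up to the request.
        granted = min(requested, self.free)
        self.free -= granted
        self.allocated[app_id] = self.allocated.get(app_id, 0) + granted
        return granted

    def release(self, app_id):
        # Return everything the application holds to the pool.
        self.free += self.allocated.pop(app_id, 0)


class ApplicationMaster:
    """Toy per-application controller: it knows one application only."""

    def __init__(self, app_id, tasks, rm):
        self.app_id = app_id
        self.tasks = tasks
        self.rm = rm

    def run(self):
        done = []
        pending = list(self.tasks)
        while pending:
            granted = self.rm.allocate(self.app_id, len(pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch, pending = pending[:granted], pending[granted:]
            done.extend(f"{self.app_id}:{task}:done" for task in batch)
            self.rm.release(self.app_id)
        return done


rm = ResourceManager(total_containers=2)
am = ApplicationMaster("app-1", ["t1", "t2", "t3"], rm)
print(am.run())
```

With two containers and three tasks, the toy ApplicationMaster needs two rounds of allocation. The point is only that resource bookkeeping lives in one place and job control in another.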
(3) MapReduce 1.0 or MRv1 (MapReduce version 1)
The first-generation MapReduce computing framework consists of two parts: a programming model and a runtime environment. The programming model abstracts a problem into two stages, Map and Reduce. In the Map stage, the input data is parsed into key/value pairs, the map() function is called on each pair in turn, and the output is written to local disk, again as key/value pairs. In the Reduce stage, the values that share the same key are combined and the final result is written to HDFS. The runtime environment consists of two types of services, JobTracker and TaskTracker. JobTracker is responsible for resource management and for controlling all jobs, while TaskTracker receives and executes commands from JobTracker.
(4) MapReduce 2.0, MRv2 (MapReduce version 2), or NextGen MapReduce
MapReduce 2.0, or MRv2, has the same programming model as MRv1; the only difference is the runtime environment. MRv2 is MRv1 reworked to run on the resource management framework YARN. Its runtime no longer consists of JobTracker and TaskTracker. Instead, there is a job control process called ApplicationMaster, which manages a single job only, while YARN manages resources.
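The Map and Reduce stages can be imitated in plain Python with the classic word count. This is a single-process sketch of the programming model only: there is no JobTracker, no local disk, and no HDFS here, and the names map_fn, reduce_fn, and run_job are my own, not Hadoop's API.

```python
from collections import defaultdict


def map_fn(_key, line):
    # Map stage: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    # Reduce stage: combine all values that share the same key.
    yield word, sum(counts)


def run_job(records, map_fn, reduce_fn):
    # Map: call map_fn on every input key/value pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: call reduce_fn once per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


records = [(0, "Hadoop runs MapReduce"), (1, "MapReduce runs on Hadoop")]
result = run_job(records, map_fn, reduce_fn)
print(result)
# [('hadoop', 2), ('mapreduce', 2), ('on', 1), ('runs', 2)]
```

In real Hadoop the grouping step happens across machines between the two stages; here it is just a dictionary.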
In short, MRv1 is an independent offline computing framework, while MRv2 is MRv1 running on YARN.
(5) YARN
YARN is the resource management framework in Hadoop 2.0. It is a general framework manager that allocates resources to various computing frameworks and provides their runtime environment. MRv2 is the first computing framework to run on YARN, and others, such as Spark and Storm, are being ported to it. YARN is similar to the resource management system Mesos, which appeared a few years ago, and to the earlier Torque. Official YARN introduction: http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html
(6) HDFS Federation
Hadoop 2.0 improves HDFS so that the NameNode can scale horizontally into multiple NameNodes, each in charge of a subset of directories. This not only improves the scalability of HDFS but also gives it isolation.
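The idea that each NameNode is in charge of some directories can be pictured as a lookup table from directory prefixes to NameNodes. The Python sketch below is only an illustration under my own assumptions: the table contents, the host names namenode-1 to namenode-3, and the function resolve_namenode are all made up, and real clusters configure this mapping through Hadoop itself rather than through code like this.

```python
# Hypothetical mapping from directory prefix to the NameNode that owns it.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/logs": "namenode-2",
    "/user/archive": "namenode-3",
}


def resolve_namenode(path, mount_table):
    """Return the NameNode owning path, using the longest matching prefix."""
    best = None
    for prefix in mount_table:
        # Match the directory itself or anything underneath it,
        # so that "/userdata" does not match the "/user" entry.
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError("no NameNode is in charge of " + path)
    return mount_table[best]


print(resolve_namenode("/user/alice/data.txt", MOUNT_TABLE))
print(resolve_namenode("/user/archive/2013", MOUNT_TABLE))
```

Because each path resolves to exactly one NameNode, adding a NameNode for a new directory adds capacity without touching the others, which is the scaling and isolation described above.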
5. Other concepts related to distributed computing
Spark: Spark is an efficient distributed computing system that originated at the AMPLab at UC Berkeley. Sometimes called Hadoop's Swiss Army knife, it is known for its speed and ease of use. Spark is based on in-memory computing and is claimed to be up to 100 times faster than Hadoop MapReduce. It also provides higher-level APIs than Hadoop, so the same algorithm often takes only 1/10 or even 1/100 as much code in Spark. In the project's own words: Apache Spark™ is a fast and general engine for large-scale data processing.
Storm: a distributed real-time computing system. According to Storm's author, Storm means for real-time computing what Hadoop means for batch processing. Hadoop, which is modeled on Google's MapReduce, gives us the map and reduce primitives that make batch programs simple and elegant; Storm likewise provides simple and elegant primitives for real-time computing. Here is a blog post about Storm: http://www.searchtb.com/2012/09/introduction-to-storm.html
------------ Sources of this article -----------
http://dongxicheng.org/mapreduce-nextgen/how-to-select-hadoop-versions/
http://dongxicheng.org/mapreduce-nextgen/hadoop-2-0-terms-explained/
http://dongxicheng.org/mapreduce-nextgen/hadoop-2-2-0/