Hadoop's version history is tangled and constantly changing, so choosing a version has long troubled newcomers. This article summarizes how the Apache Hadoop and Cloudera Hadoop versions evolved and offers some suggestions for choosing a Hadoop version.
1. Apache Hadoop
1.1 Evolution of Apache Hadoop versions
As of this writing (December 23, 2012), Apache Hadoop versions fall into two generations. We call the first generation Hadoop 1.0 and the second generation Hadoop 2.0. The first generation contains three major version lines: 0.20.x, 0.21.x, and 0.22.x. Of these, 0.20.x eventually evolved into 1.0.x and became the stable line, while 0.21.x and 0.22.x added major new features such as NameNode HA. The second generation contains two version lines, 0.23.x and 2.x. Both are completely different from Hadoop 1.0: they are a brand-new architecture that includes HDFS Federation and YARN. Compared with 0.23.x, 2.x adds two major features: NameNode HA and wire-compatibility.
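The version-to-generation mapping described above can be sketched in a few lines of Python. This is only an illustration of the scheme in this article: the function name `hadoop_generation` is made up for the example and is not part of any Hadoop tooling, and it covers only the version lines mentioned here.

```python
def hadoop_generation(version):
    """Return 1 or 2 for an Apache Hadoop version string such as "0.20.2" or "2.x".

    Hypothetical helper: it encodes only the version lines named in this
    article and raises ValueError for anything else.
    """
    fields = version.strip().split(".")
    major = int(fields[0])

    if major == 1:
        return 1  # 1.0.x is the stable continuation of 0.20.x
    if major == 2:
        return 2  # 2.x is the second generation

    if major == 0 and len(fields) >= 2:
        minor = int(fields[1])
        if minor in (20, 21, 22):
            return 1  # first-generation lines
        if minor == 23:
            return 2  # 0.23.x already uses the new architecture

    raise ValueError("version line not covered by this article: %r" % version)


if __name__ == "__main__":
    for v in ("0.20.2", "1.0.4", "0.23.1", "2.2.0"):
        print(v, "-> Hadoop %d.0" % hadoop_generation(v))
```

Only the first two fields of the version string are inspected, so placeholders such as "0.20.x" or "2.x" are accepted as well.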
From this overview you can see that Hadoop versions are distinguished by their major features. In summary, the features that differentiate Hadoop versions are the following:
(1) Append: supports appending to files. If you want to use HBase, you need this feature.
(2) RAID: introduces parity codes to reduce the number of data blocks while still ensuring data reliability. See https://issues.apache.org/jira/browse/HDFS/component/12313080
(3) Symlink: supports file links in HDFS. See https://issues.apache.org/jira/browse/HDFS-245
(4) Security: Hadoop security. See https://issues.apache.org/jira/browse/HADOOP-4487
(5) NameNode HA: see https://issues.apache.org/jira/browse/HDFS-1064
(6) HDFS Federation and YARN
Note that Hadoop 2.0 is developed mainly by Hortonworks, a company spun off from Yahoo.
1.2 Downloading Apache Hadoop
(1) Release list: http://hadoop.apache.org/releases.html
(2) Stable release: choose a mirror and download the version in the stable folder.
(3) The most complete set of Hadoop versions: http://svn.apache.org/repos/asf/hadoop/common/branches/. These branches can be imported directly into Eclipse.
2. Cloudera Hadoop
2.1 Evolution of CDH versions
Apache's version management is currently chaotic, with new versions appearing one after another, which leaves many beginners confused. By contrast, Cloudera manages its Hadoop versions much more cleanly.
Hadoop is released under the Apache open-source license, so anyone can use and modify it free of charge. As a result, many Hadoop distributions have appeared, and one of the best known is Cloudera's, which we call CDH (Cloudera Distribution Hadoop). To date, CDH has had five versions. The first two are no longer updated. CDH3 (based on Apache Hadoop 0.20.2) and CDH4 (based on Apache Hadoop 2.0.0) correspond to Apache Hadoop 1.0 and Hadoop 2.0 respectively, and both receive periodic updates. Cloudera recently released CDH5 (based on Apache Hadoop 2.2.0; available for download as CDH5-beta-1), which comes with a YARN HA implementation. Although this release is still in beta, its HA solution uses the HA framework implemented in Hadoop 2.0 (HDFS HA and MapReduce HA both use this framework), so it is general-purpose.
Cloudera distinguishes minor versions by patch level. For example, a patch level of 923.142 means that 1065 patches have been added on top of the original Apache Hadoop 0.20.2 (these patches were contributed by various companies and individuals and are all recorded in the Hadoop JIRA). Of these, 923 are the patches included as of the last beta release, and 142 are new patches added after the stable release. In general, the higher the patch level, the more complete the functionality and the more bugs have been fixed.
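The arithmetic behind a patch level is easy to check in code. The sketch below assumes the patch level is given as a string of the form "beta.stable", as in the example above. The function name `parse_patch_level` and the dictionary keys are hypothetical and chosen only for this illustration.

```python
def parse_patch_level(patch_level):
    """Split a CDH patch level such as "923.142" into its two patch counts.

    Hypothetical helper for illustration. The input must be a string:
    parsing it as a float would drop trailing zeros (e.g. "923.140").
    """
    beta_part, stable_part = patch_level.split(".")
    beta_patches = int(beta_part)            # patches as of the last beta release
    post_stable_patches = int(stable_part)   # patches added after the stable release
    return {
        "beta_patches": beta_patches,
        "post_stable_patches": post_stable_patches,
        "total": beta_patches + post_stable_patches,
    }


if __name__ == "__main__":
    info = parse_patch_level("923.142")
    print(info["total"])  # prints 1065
```

Keeping the two counts separate, rather than only the total, preserves the distinction between patches carried over from the beta and patches added after the stable release.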
Cloudera's releases follow a clearer hierarchy, and Cloudera provides Hadoop installation packages for a variety of operating systems. You can install them directly with apt-get or yum, which is more convenient.
2.2 Downloading CDH
(1) Version description: https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information
(2) Features of each version: https://ccp.cloudera.com/display/DOC/CDH+Packaging+Information+for+Previous+Releases
(3) Downloads:
CDH3: http://archive.cloudera.com/cdh/3/
CDH4: http://archive.cloudera.com/cdh4/cdh/4/
Note: the Hadoop tarball sits directly in the top-level directory of each of the two links above, not inside a subfolder. Many people cannot find the installation package after opening the link!
3. How to select a Hadoop version
Hadoop's current versioning is chaotic and confuses many users. In fact, there are only two generations of Hadoop: Hadoop 1.0 and Hadoop 2.0. Hadoop 1.0 consists of the distributed file system HDFS and the offline computing framework MapReduce. Hadoop 2.0 consists of an HDFS whose NameNode can scale horizontally, the resource management system YARN, and the offline computing framework MapReduce running on YARN. Compared with Hadoop 1.0, Hadoop 2.0 is more powerful, scales and performs better, and supports multiple computing frameworks.
When deciding whether to adopt a piece of software in an open-source environment, we usually consider the following factors:
(1) Whether it is open-source software, that is, whether it is free.
(2) Whether there is a stable version. The software's official website usually states this.
(3) Whether it has been proven in practice. Check whether large companies are already using it in production.
(4) Whether it has strong community support. When a problem arises, you should be able to find a solution quickly through communities, forums, and other online resources.
With these factors in mind, let's look at Hadoop. As of December 23, 2012, Hadoop 2.0 was not yet stable and could not be used in production, so anyone preparing to deploy Hadoop had to choose from the Hadoop 1.0 line, where the latest stable Apache and Cloudera versions were Hadoop 1.0.4 and CDH3u4 respectively. Since then, Hadoop 2.0 has released a stable version, 2.2.0, and we recommend using it. For details, see "Analysis of New Features of Hadoop 2.0 Stable Version 2.2.0". For the upgrade procedure, see "Hadoop Upgrade Solution (2): Upgrading from Hadoop 1.0 to 2.0 (1)".
Reposted from Dong's blog: Hadoop version description