Preface
Since 2011, China has entered the age of big data. The family of software represented by Hadoop dominates big-data processing: open-source communities, vendors, and data-software projects of all kinds have been gravitating toward Hadoop, and Hadoop itself has grown from a niche high-tech field into the de facto standard for big-data development. On top of the original Hadoop technology, a whole family of Hadoop products has emerged, and the concept of "big data" keeps being reinvented as the technology advances.
Directory
- Hadoop Development History
- Selection and Introduction of Hadoop Release Versions
1. Hadoop Development History
1.1 Hadoop Background
Hadoop originated from Nutch, an open-source web search engine created in 2002 by Doug Cutting. Nutch was designed to be a large, whole-web search engine, covering web-page crawling, indexing, and querying. However, as the number of crawled pages grew, it ran into a serious scalability problem: it could not solve the storage and indexing of billions of web pages.
Google then published two papers that offered a feasible solution to this problem. The first, published in 2003, described the Google File System (GFS), the storage architecture for the web-page data behind Google's search engine, which could meet the ultra-large-file storage requirements arising from crawling and indexing. Since Google open-sourced only the ideas and not the code, the Nutch project team produced an open-source implementation based on the paper: the Nutch Distributed File System (NDFS).
The second paper, published in 2004, described the design of Google's most important distributed computing framework, MapReduce, which could be used to tackle the indexing of massive numbers of web pages. Again, because Google did not release source code, the Nutch developers produced an open-source implementation. Since NDFS and MapReduce were useful well beyond the search field, in early 2006 the developers moved them out of Nutch into a subproject of Lucene named Hadoop.
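The programming model the MapReduce paper describes can be sketched in a few lines of plain Python. This is an illustrative word-count only, not Hadoop code: the map step emits (word, 1) pairs, the pairs are grouped by key (the "shuffle"), and the reduce step sums each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key, then sum each group's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase("to be or not to be"))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Hadoop, the map and reduce functions run on many machines in parallel and the shuffle moves data across the network; the sequential sketch above only shows the shape of the computation.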
Doug Cutting joined Yahoo! at about the same time, and Yahoo! agreed to organize a dedicated team to continue developing Hadoop. In February of the same year, the Apache Hadoop project was officially launched to support the independent development of MapReduce and HDFS. In January 2008, Hadoop became a top-level Apache project and entered a period of rapid development.
2. Selection and Introduction of Hadoop Release Versions
2.1 Introduction to Hadoop Release Versions
At present there are many Hadoop distributions, including those from Huawei, Intel, and Cloudera (CDH). All of them derive from Apache Hadoop, and the proliferation of versions is entirely a consequence of Apache Hadoop's open-source license: anyone may modify it and release or sell it as an open-source or commercial product (http://www.apache.org/licenses/LICENSE-2.0).
Most vendor distributions, such as those from Intel and Huawei, are paid products. Although they add features that the open-source version lacks, most companies treat whether a distribution is free as an important criterion when choosing a Hadoop release.
Currently there are three major free Hadoop distributions (all from foreign vendors): Apache (the original version, on which all other distributions are based), Cloudera's CDH (Cloudera's Distribution Including Apache Hadoop), and Hortonworks' HDP (Hortonworks Data Platform).
2.2 Introduction to the Apache Hadoop Release Versions
Apache Hadoop currently has many versions. This section sorts out the features of each version and the relationships between them. Before discussing the Hadoop versions, you first need to understand how Apache releases software.
For any Apache open-source project, all basic features are developed on a main code line called the trunk. When an important feature needs to be developed, a branch is forked from the trunk; this branch is called a candidate release. The branch focuses on developing that one feature and accepts no other new features. After the major bugs are fixed and the relevant committers approve it by vote, the branch becomes a public release, and the feature is merged back into the trunk. Note that multiple branches may be developed in parallel, so a branch with a higher version number may be released before one with a lower version number.
Because Apache forks new branches feature by feature, before introducing the Apache Hadoop versions we first list the major features that new Apache Hadoop versions introduced:
- Append
- HDFS RAID
- Symlink
- Security
- MRv1
- YARN/MRv2
- NameNode Federation
- NameNode HA
2.2.1 Hadoop Version Evolution
By May 2012, Apache Hadoop had four major branches, as shown in the figure. These four branches constitute four series of Hadoop versions.
- 0.20.x series:
After version 0.20.2 was released, several important features were developed not on the trunk but on top of 0.20.2. Two are worth noting: Append and Security. The branch carrying the Security feature was released as 0.20.203, and the later 0.20.205 integrated both features. Note that version 1.0.0 is merely 0.20.205 renamed. The 0.20.x series is the most confusing for users: it has some features that the trunk lacks, and conversely it lacks some features that the trunk has.
- 0.21.0/0.22.x series: these versions split the whole Hadoop project into three independent modules: Common, HDFS, and MapReduce. HDFS and MapReduce both depend on the Common module, but MapReduce does not depend on HDFS, so MapReduce can run more easily on other distributed file systems, and the modules can be developed independently. The improvements to each module are as follows:
Common module: the biggest new feature is the addition of a large-scale automated test framework and a fault-injection framework for testing;
HDFS module: new features include support for the append operation and symbolic links, SecondaryNameNode improvements (the SecondaryNameNode was removed and replaced by a Checkpoint Node, and a Backup Node role was added as a cold standby for the NameNode), and allowing users to customize the block-placement algorithm;
MapReduce module: on the job-API side, the new MapReduce API was introduced while remaining compatible with the old API. Version 0.22.0 fixed some bugs and made some optimizations on top of 0.21.0.
- 0.23.x series:
0.23.x was proposed to overcome Hadoop's shortcomings in scalability and framework generality. It is effectively a brand-new platform, comprising the distributed file system HDFS Federation and the resource-management framework YARN, which can centrally manage various computing frameworks (such as MapReduce and Spark). Its releases ship with a MapReduce library that integrates all MapReduce features to date.
- 2.x series:
Like the 0.23.x series, the 2.x series belongs to the next generation of Hadoop. Compared with 0.23.x, the 2.x series adds features such as NameNode HA and wire compatibility.
2.3 Introduction to the Cloudera[1] Hadoop Release Version
Cloudera's open-source Apache Hadoop distribution, CDH (Cloudera's Distribution Including Apache Hadoop), is designed for enterprise-class Hadoop deployment. Cloudera states that more than half of its engineering output is donated, under the Apache license, to the various open-source projects closely tied to Hadoop (Apache Hive, Apache Avro, Apache HBase, and so on). Cloudera is also a sponsor of the Apache Software Foundation.
2.3.1 Reasons for Choosing CDH
- CDH divides Hadoop versions clearly. There are only three series (CDH3, CDH4, and CDH5), corresponding respectively to the first generation of Hadoop (Hadoop 1.0) and the second generation (Hadoop 2.0), whereas Apache's versioning is much more chaotic; CDH is also more compatible, secure, and stable than Apache Hadoop;
- The CDH documentation is clear. Many Apache users consult the documents provided by CDH, including the installation and upgrade guides;
- Security: CDH supports Kerberos security authentication, while Apache Hadoop defaults to simple username-based authentication;
- CDH supports four installation methods: yum/apt packages, tarball, RPM packages, and Cloudera Manager; Apache Hadoop supports only tarball installation.
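To illustrate the security point above: the authentication mode is switched in Hadoop's core-site.xml. The snippet below is a minimal sketch (not taken from the original article); `hadoop.security.authentication` and `hadoop.security.authorization` are the standard Hadoop configuration keys, and a full Kerberos setup additionally requires keytabs and principal configuration not shown here.

```xml
<!-- core-site.xml (sketch): switch from the default "simple" auth to Kerberos -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <!-- "simple" = trust the client-supplied username; "kerberos" = require tickets -->
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <!-- also enable service-level authorization checks -->
    <value>true</value>
  </property>
</configuration>
```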
2.3.2 Correspondence Between CDH and Apache Hadoop
The CDH3 version is based on Apache Hadoop 0.20.2 with the latest patches merged in. CDH3u6 corresponds to the latest Apache Hadoop 1.x version, while the mapping between CDH3u1~CDH3u5 and Apache Hadoop versions is unclear, because CDH continually pulls in the newest patches and is released earlier than the functionally equivalent Apache Hadoop version. In general, Apache and CDH offer the same functionality.
The CDH4 version is based on Apache Hadoop 2.x. CDH always applies the latest bug fixes and feature patches, and is released earlier than the functionally equivalent Apache Hadoop version, so it updates faster than Apache.
CDH5 was developed on top of Apache Hadoop 2.2.0. It packages the various software in the Hadoop ecosystem together, so there are no software-compatibility problems (for example, running Pig and Hive on Hadoop 2.0), and it is very easy to use.
Download address: http://archive.cloudera.com/cdh5/cdh/5/; documentation: http://www.cloudera.com/content/support/en/documentation.html. It is also worth noting that CDH5 includes Spark, which makes it an even more enhanced and comprehensive release.
2.3.4 Download Path
Based on the above considerations, we recommend using the latest CDH version (equivalent to the stable Apache Hadoop release), available at http://archive.cloudera.com/cdh5/cdh/5/.
2.4 Introduction to the Hortonworks Hadoop Release Version
HDP is a relatively new distribution and currently tracks Apache closely, because most Hortonworks employees are Apache code contributors, especially to Hadoop 2.0.
Note:
[1] Cloudera (Cloudera, Inc.) is an American software company that provides Apache Hadoop-based software, support, services, and training to enterprise customers. A New York Times blog reported on Cloudera in March 2009. Three engineers from Google, Yahoo!, and Facebook (Christophe Bisciglia, Amr Awadallah, and Jeff Hammerbacher) founded the company together with a former Oracle executive, Mike Olson. Olson had previously served as chief executive officer of Sleepycat Software, the maker of the open-source embedded database engine Berkeley DB (acquired by Oracle in 2006). Hammerbacher had used Hadoop at Facebook to build analytics over massive amounts of user data.