Apache Hadoop
The Apache Hadoop versions evolved quickly, so let me walk through the process. Apache Hadoop is divided into two generations: we call the first generation Hadoop 1.0 and the second generation Hadoop 2.0. The first generation consists of three major release lines, 0.20.x, 0.21.x, and 0.22.x; 0.20.x eventually evolved into the stable 1.0.x line, while 0.21.x and 0.22.x added major new features such as NameNode HA. The second generation consists of two release lines, 0.23.x and 2.x, which are completely different from Hadoop 1.0: a brand-new architecture that includes both HDFS Federation and the YARN system. Compared with 0.23.x, the 2.x line adds two major features: NameNode HA and wire compatibility. As the above suggests, Hadoop distinguishes its versions by significant features. The features used to differentiate Hadoop versions can be summarized as: (1) Append: support for appending to files, required if you want to use HBase; (2) RAID: keeps data reliable while reducing the number of data block replicas by introducing checksum codes; (3) Symlink: support for HDFS file links; (4) Security: security support for Hadoop. It is worth noting that Hadoop 2.0 was developed primarily by Hortonworks, a company spun off from Yahoo!. In October 2013, Hadoop 2.2.0, the first GA release of Hadoop 2.0, was released. Key features include:
a) YARN. YARN is short for "Yet Another Resource Negotiator", a new general-purpose resource management system introduced in Hadoop 2.0 that can run a variety of applications and frameworks, such as MapReduce, Tez, and Storm. Its introduction makes it possible for many different applications to run in a single cluster. YARN evolved from MRv1 and is the natural product of MapReduce's development; its emergence moved Hadoop computing into the platform era. My blog contains a large number of articles about YARN; interested readers can read: http://dongxicheng.org/category/mapreduce-nextgen/
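To make the idea of YARN as a generic platform concrete, here is a minimal sketch that uses the YarnClient API to connect to a cluster and list the applications it is running. The ResourceManager address is a placeholder; the surrounding setup is my own illustration, not taken from the original text.

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();
            // Placeholder address: point the client at the ResourceManager.
            conf.set("yarn.resourcemanager.address", "rm.example.com:8032");

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Any framework (MapReduce, Tez, Storm on YARN, ...) shows up here,
            // because YARN manages them all through the same application model.
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + " " + app.getApplicationType());
            }
            yarnClient.stop();
        }
    }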
b) HDFS single point of failure resolved. Hadoop 2.2.0 also solves the NameNode single point of failure problem and the NameNode memory-constrained problem. The single point of failure is handled through active/standby NameNode failover, a time-honored solution to single points of failure in services: the active and standby NameNodes synchronize metadata through a shared storage system, so the choice of shared storage system matters. Hadoop offers three optional shared storage systems: NFS, QJM, and Bookkeeper. For details, read my article: Hadoop 2.0 single point of failure scenario summary.
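As a rough illustration of what an HA setup looks like from the client side, here is a minimal sketch of the usual HA configuration keys, set programmatically on a Hadoop Configuration object. The nameservice name, host names, and QJM journal URI are placeholders; in practice these values normally live in hdfs-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;

    public class HaClientConfig {
        public static Configuration haConf() {
            Configuration conf = new Configuration();
            // One logical nameservice backed by two NameNodes (placeholder names).
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            // QJM as the shared edits storage (NFS and Bookkeeper are the alternatives).
            conf.set("dfs.namenode.shared.edits.dir",
                    "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
            // Client-side proxy that fails over between the two NameNodes.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            conf.set("fs.defaultFS", "hdfs://mycluster");
            return conf;
        }
    }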
c) HDFS Federation. The NameNode memory-constrained problem mentioned earlier is also resolved in version 2.2.0, through HDFS Federation, which allows a single HDFS cluster to have multiple NameNodes, each managing a subset of the directory tree. The NameNodes are independent of each other but share all DataNode storage resources. Note that each NameNode in the federation is still a single point of failure, and each needs its own standby to address that.
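A minimal sketch of what a two-nameservice federation configuration could look like, again with placeholder names and set in code purely for illustration (these keys normally belong in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class FederationConfig {
        public static Configuration federatedConf() {
            Configuration conf = new Configuration();
            // Two independent NameNodes, each owning part of the namespace.
            conf.set("dfs.nameservices", "ns1,ns2");
            conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");
            // DataNodes register with every nameservice, so both NameNodes
            // share the same pool of DataNode storage.
            return conf;
        }
    }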
d) HDFS Snapshots. An HDFS snapshot is a read-only image of the HDFS file system (or a subtree of it) at a given moment. It enables an administrator to snapshot important files or directories at regular intervals as protection against data being mistakenly deleted or lost (see the sketch after item e) below). For details, read: Snapshots for HDFS (instructions for use) and support for RW/RO snapshots in HDFS.
e) NFSv3 access to HDFS. NFS lets users access a remote file system as if it were local. With NFS support introduced into HDFS, users can read and write files on HDFS as if they were local files, which greatly simplifies the use of HDFS. This is implemented through an NFS Gateway service that translates the NFS protocol into HDFS access. Interested readers can read: support for the NFSv3 interface to HDFS, and the related design document: HDFS NFS Gateway.
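Here is a hedged sketch of the snapshot feature as seen by a Java client: an admin first allows snapshots on a directory, then a snapshot is created with FileSystem.createSnapshot. The path and snapshot name are placeholders, and fs.defaultFS is assumed to point at an HDFS cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class SnapshotExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf); // assumes fs.defaultFS is HDFS
            Path important = new Path("/user/data/important"); // placeholder path

            // Allowing snapshots on a directory is an admin operation
            // (shell equivalent: hdfs dfsadmin -allowSnapshot <path>).
            ((DistributedFileSystem) fs).allowSnapshot(important);

            // Create a read-only point-in-time image of the directory.
            Path snapshot = fs.createSnapshot(important, "daily-backup");
            System.out.println("Snapshot created at: " + snapshot);
        }
    }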
f) Support for the Windows operating system. Prior to version 2.2.0, Hadoop supported only Linux, and Windows served merely as an experimental platform. Starting with 2.2.0, Hadoop supports Windows. You can read one of my earlier articles: Hadoop for Windows.
g) Binary compatibility with MapReduce applications built against Hadoop 1.x.
h) Integration with the other systems of the Hadoop ecosystem. In addition to the three core systems of HDFS, MapReduce, and YARN, the Hadoop ecosystem includes HBase, Hive, Pig, and other systems. These systems are built on the Hadoop kernel, and the biggest changes from Hadoop 1.0 to Hadoop 2.0 occur in that kernel (HDFS, MapReduce, and YARN), so integration testing with the other ecosystem systems was required.
In addition to the above features, the Apache project gave two special notes: (1) HDFS change: HDFS symlinks (similar to soft links in Linux) were moved to the 2.3.0 release; (2) YARN/MapReduce consideration: when an administrator configures the ShuffleHandler service on a NodeManager, the new property value is "mapreduce_shuffle" rather than "mapreduce.shuffle" (see the configuration sketch after the bug-fix list below).
In February 2014, Hadoop 2.3.0 was released. It not only enhances the functionality of the core platform but also fixes a number of bugs. The release brings two very important enhancements to HDFS: (1) support for heterogeneous storage tiers, and (2) in-memory caching of HDFS data through the DataNodes. With heterogeneous storage support, different storage types can coexist in the same Hadoop cluster, so different media, such as commodity disks, enterprise-class disks, SSDs, or memory, can be mixed to better balance cost and benefit. Likewise, the new release lets us use the memory available in the cluster to centrally cache and manage datasets in DataNode memory. Applications such as MapReduce, Hive, and Pig can request memory for caching and then read the content directly from the DataNode's address space, greatly improving scan efficiency by avoiding disk operations entirely. Hive, for instance, is implementing a very efficient zero-copy read path for ORC files using this new facility.
On the YARN side, what excites us is that automatic failover of the ResourceManager is nearing completion; although the feature did not ship in 2.3.0, it is most likely to be included in Hadoop 2.4. The 2.3.0 release also provides some key operational enhancements to YARN, such as better logging, error handling, and diagnostics. A key enhancement for MapReduce is MAPREDUCE-4421: with it, we no longer need to install the MapReduce binaries on every machine; it suffices to copy a MapReduce package into HDFS and distribute it via the YARN distributed cache. Of course, the new release also contains many bug fixes and other enhancements.
For example: (1) the asynchronous polling operation in the YarnClientImpl class got a timeout; (2) fixed an issue where RMFatalEventDispatcher did not log the event cause; (3) the HA configuration no longer affects the RPC address of the NodeManager; (4) the RM web UI and REST APIs uniformly use YarnApplicationState; (5) the RPC error message is included in the RpcResponseHeader instead of being sent separately; (6) a request log was added to the Jetty/HttpServer; (7) fixed an issue where writing files and calling hflush would throw java.lang.ArrayIndexOutOfBoundsException when dfs.checksum.type was set to NULL.
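Here is a minimal sketch of the ShuffleHandler configuration mentioned above, written against a YarnConfiguration purely for illustration; in practice these two properties are set in yarn-site.xml on each NodeManager.

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ShuffleConfig {
        public static YarnConfiguration shuffleConf() {
            YarnConfiguration conf = new YarnConfiguration();
            // New property value: "mapreduce_shuffle" (the old "mapreduce.shuffle"
            // is no longer accepted as the aux-service name).
            conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle");
            conf.set("yarn.nodemanager.aux-services.mapreduce_shuffle.class",
                    "org.apache.hadoop.mapred.ShuffleHandler");
            return conf;
        }
    }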
In April 2014, Hadoop 2.4.0 was released. Key features include: (1) HDFS support for Access Control Lists (ACLs); (2) native support for HDFS rolling upgrades; (3) the HDFS fsimage now uses protocol buffers, enabling smooth upgrades; (4) HDFS fully supports HTTPS; (5) the YARN ResourceManager supports automatic failover, resolving the ResourceManager single point of failure; (6) enhanced support for new applications via the YARN Application History Server and Application Timeline Server; (7) the YARN Capacity Scheduler supports strong SLAs through preemption. Security is critical to Hadoop, so in 2.4.0 all access to HDFS (including WebHDFS, hsftp, and even the web interfaces) supports HTTPS. The ResourceManager single point of failure was addressed in Hadoop 2.4.0: the cluster runs two ResourceManagers, one active and one standby, and when the active one fails, Hadoop automatically and smoothly switches to the other, which automatically restarts the applications that had been submitted. In a later phase, Hadoop will add a hot standby that can continue running applications from the point of failure, preserving work already done. (A small ACL sketch follows below.)
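A minimal sketch of the ACL feature from a Java client; the path and user name are placeholders, and the cluster is assumed to have dfs.namenode.acls.enabled set to true.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.AclEntry;
    import org.apache.hadoop.fs.permission.AclEntryScope;
    import org.apache.hadoop.fs.permission.AclEntryType;
    import org.apache.hadoop.fs.permission.FsAction;

    public class AclExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // assumes HDFS default FS
            Path path = new Path("/user/data/report.csv"); // placeholder path

            // Grant read access to an extra user beyond the owner/group/other bits
            // (shell equivalent: hdfs dfs -setfacl -m user:alice:r-- <path>).
            AclEntry entry = new AclEntry.Builder()
                    .setScope(AclEntryScope.ACCESS)
                    .setType(AclEntryType.USER)
                    .setName("alice") // placeholder user
                    .setPermission(FsAction.READ)
                    .build();
            fs.modifyAclEntries(path, Arrays.asList(entry));
        }
    }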
In August 2014, Hadoop 2.5.0 was released. Key features include:
1. Common: a) authentication improvements when using an HTTP proxy server, useful when accessing WebHDFS through a proxy; b) a new Hadoop metrics sink that allows writing directly to Graphite; c) specification work related to the Hadoop Compatible File System effort.
2. HDFS: a) support for POSIX-style extended file attributes (see the Extended Attributes in HDFS documentation for details, and the sketch below); b) support for offline image viewing: clients can now browse an fsimage through the WebHDFS API; c) the NFS gateway received many supportability improvements and bug fixes; the Hadoop portmapper is no longer required to run the gateway, and the gateway can now reject connections from unprivileged ports; d) the web UIs of the SecondaryNameNode, JournalNode, and DataNode have been modernized with HTML5 and JavaScript.
3. YARN: a) YARN's REST API now supports write/modify operations; users can submit and kill applications through it; b) the timeline store in YARN, used to store generic and per-framework information for applications, supports Kerberos authentication; c) the Fair Scheduler supports dynamic hierarchical user queues; user queues are created dynamically at runtime under any specified parent queue.
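A minimal sketch of the extended-attribute API from a Java client; the attribute name and path are placeholders (shell equivalents: hdfs dfs -setfattr and -getfattr).

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class XAttrExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // assumes HDFS default FS
            Path path = new Path("/user/data/report.csv"); // placeholder

            // User-namespace attributes must be prefixed with "user.".
            fs.setXAttr(path, "user.origin", "nightly-etl".getBytes(StandardCharsets.UTF_8));
            byte[] value = fs.getXAttr(path, "user.origin");
            System.out.println(new String(value, StandardCharsets.UTF_8));
        }
    }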
In November 2014, Hadoop 2.6.0 was released. (Recommended.) It is the version most widely used by enterprises in the market and the one the best distribution releases build on, so I recommend that you use it. Key features include:
1. Common: the Hadoop Key Management Server (KMS) is a key management server built on the Hadoop KeyProvider API. It provides a client and a server component that communicate over a REST API using HTTP. The client is a KeyProvider implementation that talks to KMS through the KMS HTTP REST API. KMS and its client have built-in security mechanisms, supporting HTTP SPNEGO Kerberos authentication and HTTPS secure transport. KMS is a Java web application that runs on a preconfigured Tomcat server bundled with the Hadoop release.
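To illustrate the client side, here is a hedged sketch that points a Hadoop client at a KMS instance through the KeyProvider API and lists the key names it manages; the KMS URI is a placeholder.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.crypto.key.KeyProvider;
    import org.apache.hadoop.crypto.key.KeyProviderFactory;

    public class KmsClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder KMS endpoint; kms:// URIs resolve to the KMS REST client.
            conf.set("hadoop.security.key.provider.path",
                    "kms://http@kms.example.com:16000/kms");

            List<KeyProvider> providers = KeyProviderFactory.getProviders(conf);
            for (KeyProvider provider : providers) {
                System.out.println("Provider: " + provider);
                for (String keyName : provider.getKeys()) {
                    System.out.println("  key: " + keyName);
                }
            }
        }
    }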
2. Tracing: HDFS-5274 added the ability to trace requests through HDFS, using HTrace, an open source library. HTrace is worth a look; it is quite powerful and was open-sourced by Cloudera.
3. HDFS: a) Transparent encryption. HDFS implements a transparent, end-to-end encryption scheme. Once configured, decryption when reading from HDFS and encryption when writing happen transparently, with no changes to user application code. The encryption is end-to-end, meaning data can only be decrypted by the client; HDFS never stores and never has access to unencrypted data or the data encryption keys. This satisfies the two typical requirements for encryption: at-rest encryption (data persisted on media such as disks) and in-transit encryption (for example, when data travels over the network). (A small sketch follows below.) b) Storage: SSD and memory tiers. Archival storage decouples growing storage capacity from compute capacity: high-density, low-cost storage nodes with weaker compute become usable as cold storage in the cluster. Adding more such nodes grows the storage capacity of the cluster independently of its compute capacity.
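A hedged sketch of how a client could create an encryption zone over a KMS-managed key and, separately, pin a directory to the cold storage tier. The paths, key name, and NameNode URI are placeholders, and the key is assumed to already exist in the KMS.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;

    public class EncryptionAndTiers {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            URI uri = URI.create("hdfs://nn.example.com:8020"); // placeholder

            // Files written under /secure are encrypted with "myKey" transparently
            // (shell equivalent: hdfs crypto -createZone -keyName myKey -path /secure).
            HdfsAdmin admin = new HdfsAdmin(uri, conf);
            admin.createEncryptionZone(new Path("/secure"), "myKey");

            // Route rarely-read data to the archival (cold) storage tier.
            DistributedFileSystem fs = (DistributedFileSystem) FileSystem.get(uri, conf);
            fs.setStoragePolicy(new Path("/archive"), "COLD");
        }
    }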
4. MapReduce: this section is mainly bug fixes and improvements. Two notable additions, already described under 2.5.2, are worth a quick look: a) ResourceManager restart support; b) allowing the AM to send historical event information to the Timeline Server.
5. YARN: a) NodeManager restart: this feature enables the NodeManager to restart without losing the active containers running on the node. b) DockerContainerExecutor: the DockerContainerExecutor (DCE) allows the YARN NodeManager to launch YARN containers inside Docker containers. Users can specify the Docker image in which they want their YARN containers to run. These containers provide a customizable software environment in which the user's code runs, isolated from the environment of the NodeManager itself. They can bundle the specific libraries an application needs and can carry different versions of Perl, Python, or even Java than the NodeManager; in fact, they can run a different flavor of Linux than the NodeManager's OS. The YARN container must define all the environments and libraries required to run the job; nothing from the NodeManager is shared. Docker thus offers YARN both consistency (all YARN containers can have exactly the same software environment) and isolation (no interference from whatever is installed on the physical machine). (A minimal configuration sketch follows below.)
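A minimal sketch of the configuration behind these two features, written against YarnConfiguration for illustration; these keys would normally live in yarn-site.xml, and the recovery directory is a placeholder.

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class NmRestartAndDocker {
        public static YarnConfiguration nmConf() {
            YarnConfiguration conf = new YarnConfiguration();
            // Work-preserving NodeManager restart: state is kept in a local
            // recovery store so running containers survive an NM restart.
            conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
            conf.set("yarn.nodemanager.recovery.dir", "/var/hadoop/yarn-nm-recovery");

            // Launch containers through the DockerContainerExecutor.
            conf.set("yarn.nodemanager.container-executor.class",
                    "org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor");
            return conf;
        }
    }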
In April 2015, Hadoop 2.7.0 was released. Key features include:
1. Common: support for Windows Azure Storage (Blob) as a file system in Hadoop.
2. HDFS: a) support for file truncation (truncate), as sketched below; b) support for quotas per storage type; c) support for files with variable-length blocks.
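A minimal sketch of the truncate API from a Java client; the path and length are placeholders. FileSystem.truncate returns true if the file reached the target length immediately and false if recovery of the last block is still in progress.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TruncateExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // assumes HDFS default FS
            Path log = new Path("/user/data/events.log"); // placeholder

            // Cut the file back to its first 1 MB; false means the truncate
            // is still being finalized on the last block.
            boolean done = fs.truncate(log, 1024L * 1024L);
            System.out.println("truncate completed immediately: " + done);
        }
    }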
3. YARN: a) a pluggable YARN authorization module; b) automatic sharing of localized resources via a global shared cache (beta).
4. MapReduce: a) the ability to limit the number of concurrently running map and reduce tasks of a job (see the sketch below); b) speeding up the FileOutputCommitter for very large jobs with many output files.
In July 2015, Hadoop 2.7.1 was released. This is a stable release, the first stable release since Hadoop 2.6.0 and the first in the Hadoop 2.7.x line. As a maintenance release of the 2.7 line, the changes are not large; it mainly fixes some of the more serious bugs (131 bug fixes and patches in total).
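A hedged sketch of the per-job task limits mentioned in a). The property names mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit are my reading of this 2.7.0 feature, where a value of 0 means no limit; the numbers below are placeholders.

    import org.apache.hadoop.conf.Configuration;

    public class TaskLimitConfig {
        public static Configuration limitedJobConf() {
            Configuration conf = new Configuration();
            // Cap how many map/reduce tasks of this job may run at once,
            // so one huge job cannot monopolize the cluster.
            conf.setInt("mapreduce.job.running.map.limit", 200);
            conf.setInt("mapreduce.job.running.reduce.limit", 50);
            return conf;
        }
    }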
Recommendations on using the latest stable version of Hadoop