An Analysis of the New Features in Hadoop 2.3.0

Hadoop 2.3.0 was released on February 20, 2014. It was the first Hadoop release from Apache in 2014 and kicked off Hadoop development for the year.

This version introduces many long-awaited features, including heterogeneous tiered storage and DataNode caching in HDFS, a solution to YARN's single point of failure, and automated deployment of MapReduce. This article analyzes these features and provides some pointers for deeper study.

New HDFS features. Hadoop 2.3.0 introduces two major HDFS features: heterogeneous tiered storage and DataNode caching.

The first is heterogeneous tiered storage. In earlier versions, HDFS assumed that every storage medium on a DataNode was a disk, so all user data, hot or cold, was stored on disk. In recent years, however, newer media such as SSD and flash have matured rapidly, and HDFS has begun to support heterogeneous media: several kinds of storage media can coexist in one Hadoop cluster, and users can place different kinds of data on different media as needed. For example, hot data can be stored on SSD while massive volumes of web page data stay on disk. Heterogeneous tiered storage widens the range of workloads HDFS can serve.

The second is DataNode caching. Earlier versions of the HDFS DataNode did not consider data caching; after all, HDFS is positioned as a distributed disk storage system. But a variety of computing frameworks have since emerged on top of HDFS, such as the stream computing framework Storm, the in-memory computing framework Spark, and the DAG computing framework Tez. Hadoop is no longer limited to offline processing and analysis and now supports online processing as well. DataNode caching was introduced to better support online processing by reducing the latency of online applications and improving performance. (It is worth mentioning that Tachyon, the storage system in the Spark ecosystem, is an in-memory system built on top of HDFS.)

Both features are a natural result of Hadoop's evolution into a full-featured system: HDFS is no longer limited to storing offline batch-processing data and is starting to store online data as well. For the design documents of these two features, see:

https://issues.apache.org/jira/browse/HDFS-2832

https://issues.apache.org/jira/browse/HDFS-4949
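To make the idea of DataNode caching more concrete, here is a minimal sketch in Python. It is a toy model, not HDFS code: the names `DataNode`, `apply_cache_directive`, and `pick_replica` are hypothetical, and the real design is the one described in HDFS-4949. The sketch only illustrates the principle that some blocks are pinned in DataNode memory and that a reader prefers an in-memory replica over an on-disk one.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy DataNode: replicas live on disk, and some are also pinned in memory."""
    name: str
    cache_capacity: int                          # max number of blocks held in memory
    on_disk: set = field(default_factory=set)    # block ids stored on this node
    cached: set = field(default_factory=set)     # block ids pinned in memory

    def cache_block(self, block_id: str) -> bool:
        """Pin a locally stored block in memory if there is room for it."""
        if block_id in self.cached:
            return True
        if block_id not in self.on_disk or len(self.cached) >= self.cache_capacity:
            return False
        self.cached.add(block_id)
        return True


def apply_cache_directive(block_ids, datanodes):
    """Ask one DataNode per block to cache it.

    Returns {block_id: name of the caching node, or None if no node could cache it}.
    """
    result = {}
    for block_id in block_ids:
        result[block_id] = None
        for node in datanodes:
            if node.cache_block(block_id):
                result[block_id] = node.name
                break
    return result


def pick_replica(block_id, datanodes):
    """Prefer a node holding the block in memory, else any node holding it on disk."""
    holders = [n for n in datanodes if block_id in n.on_disk]
    if not holders:
        raise LookupError(f"no replica of {block_id}")
    for node in holders:
        if block_id in node.cached:
            return node.name, "memory"
    return holders[0].name, "disk"


if __name__ == "__main__":
    nodes = [
        DataNode("dn1", cache_capacity=1, on_disk={"blk_1", "blk_2"}),
        DataNode("dn2", cache_capacity=1, on_disk={"blk_1", "blk_2"}),
    ]
    print(apply_cache_directive(["blk_1", "blk_2"], nodes))
    print(pick_replica("blk_2", nodes))
```

In this model a latency-sensitive online application would be routed to the node that already holds the block in memory, which is the benefit the feature is aiming for.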

It should be noted that both features are still at an early stage of development. Despite the ambitious vision, only the most basic functionality has been implemented, and much remains to be done. For example, heterogeneous tiered storage cannot yet store one of a block's three replicas on SSD and the other two on disk.
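The "one replica on SSD, two on disk" policy mentioned above can be sketched as follows. This is only an illustration of what such a policy might look like, written in Python under stated assumptions: `place_replicas`, the `(datanode, storage_type)` tuples, and the fall-back-to-disk rule are all hypothetical and do not come from HDFS, which does not implement this policy in 2.3.0.

```python
FALLBACK_TYPE = "DISK"  # assumption: disk is the medium every cluster has


def _first_free(volumes, storage_type, used_nodes):
    """Return the first volume of the given type on a DataNode not used yet."""
    for node, kind in volumes:
        if kind == storage_type and node not in used_nodes:
            return (node, kind)
    return None


def place_replicas(volumes, wanted):
    """Choose one volume per replica, each on a distinct DataNode.

    volumes: list of (datanode, storage_type) pairs available in the cluster.
    wanted:  storage type requested for each replica, e.g. ["SSD", "DISK", "DISK"].
    If the requested type is unavailable, the replica falls back to DISK.
    """
    chosen = []
    used_nodes = set()
    for storage_type in wanted:
        target = _first_free(volumes, storage_type, used_nodes)
        if target is None and storage_type != FALLBACK_TYPE:
            target = _first_free(volumes, FALLBACK_TYPE, used_nodes)
        if target is None:
            raise RuntimeError("not enough DataNodes for the requested replicas")
        used_nodes.add(target[0])
        chosen.append(target)
    return chosen


if __name__ == "__main__":
    cluster = [("dn1", "DISK"), ("dn1", "SSD"), ("dn2", "DISK"), ("dn3", "DISK")]
    print(place_replicas(cluster, ["SSD", "DISK", "DISK"]))
```

Falling back to disk rather than failing the write is a design choice made for this sketch; a real policy could just as well reject the request or queue it.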

New YARN features. YARN's biggest problem so far has been that the ResourceManager is a single point of failure (SPOF), which is one of the most urgent issues to fix: until it is solved, YARN cannot serve as the resource management system for a wider range of applications. Version 2.3.0 largely solves this problem. The solution is similar to NameNode HA and is implemented with ZooKeeper. However, we do not recommend using the HA solution in this version; use the one in the next version, 2.4.0, instead.

In addition to HA, two important features will be released in the next version: Generic Application Timeline and Generic Application History. The first provides a shared storage module in which applications running on YARN can store their own data, such as runtime events and logs. The second solves the problem of managing application history logs on YARN. Currently, history logs can be viewed and managed only for MapReduce; for other applications, such as Spark, they cannot, and each framework or application has to solve the problem on its own. To avoid reinventing the wheel, YARN provides a general-purpose history log management module. For the design documents of these two features, see:

https://issues.apache.org/jira/browse/YARN-1530

https://issues.apache.org/jira/browse/YARN-321
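The ZooKeeper-based ResourceManager HA described above follows the usual active/standby pattern, which can be sketched as below. This is a simplified Python model, not YARN code: `ToyCoordinator` merely stands in for ZooKeeper's ephemeral lock, and the class and method names (`campaign`, `crash`, `find_active`) are hypothetical.

```python
class ToyCoordinator:
    """Stand-in for ZooKeeper: one lock that is released when its owner's session ends."""

    def __init__(self):
        self.owner = None

    def try_acquire(self, candidate):
        """Grant the lock if it is free; report whether the candidate holds it."""
        if self.owner is None:
            self.owner = candidate
        return self.owner == candidate

    def session_lost(self, candidate):
        """Release the lock when the owner's session disappears."""
        if self.owner == candidate:
            self.owner = None


class ResourceManager:
    """Toy ResourceManager that is active only while it holds the lock."""

    def __init__(self, rm_id, coordinator):
        self.rm_id = rm_id
        self.coordinator = coordinator
        self.state = "standby"

    def campaign(self):
        """Try to become active; stay standby if another RM holds the lock."""
        held = self.coordinator.try_acquire(self.rm_id)
        self.state = "active" if held else "standby"
        return self.state

    def crash(self):
        """Simulate a failure: the RM goes down and its session expires."""
        self.state = "down"
        self.coordinator.session_lost(self.rm_id)


def find_active(rms):
    """Client-side failover: probe the configured RMs and return the active one's id."""
    for rm in rms:
        if rm.state == "active":
            return rm.rm_id
    return None


if __name__ == "__main__":
    zk = ToyCoordinator()
    rm1, rm2 = ResourceManager("rm1", zk), ResourceManager("rm2", zk)
    rm1.campaign()
    rm2.campaign()
    print(find_active([rm1, rm2]))
    rm1.crash()
    rm2.campaign()
    print(find_active([rm1, rm2]))
```

The key property is that at most one ResourceManager is active at any time, and a standby takes over only after the failed instance's lock has been released.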

New MapReduce features. In Hadoop 2.0, the MapReduce JARs are packaged together with the YARN and HDFS JARs, and all of them are distributed to every node when Hadoop is deployed. This actually violates YARN's original design intent. YARN is a resource management system, and the applications running on it should not need to be deployed on every node in advance: a JAR on the client should be enough, and YARN distributes it to the nodes automatically. Hadoop 2.3.0 fixes this issue. It is worth mentioning that Spark and Storm programs never had this problem, which is why different versions of Spark and Storm can run in the same cluster. For details, see:

https://issues.apache.org/jira/browse/MAPREDUCE-4421
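The deployment model described above, in which the framework is fetched onto a node on demand instead of being installed everywhere in advance, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `NodeManager` class, the dictionary standing in for shared storage such as HDFS, and the archive paths are hypothetical and are not taken from MAPREDUCE-4421.

```python
class NodeManager:
    """Toy NodeManager that localizes a framework archive the first time a job needs it."""

    def __init__(self, name, shared_store):
        self.name = name
        self.shared_store = shared_store   # stands in for shared storage such as HDFS
        self.local_cache = {}              # archive path -> localized contents
        self.downloads = 0                 # how many archives were actually fetched

    def localize(self, archive_path):
        """Fetch the archive once per node, then reuse the local copy."""
        if archive_path not in self.local_cache:
            self.local_cache[archive_path] = self.shared_store[archive_path]
            self.downloads += 1
        return self.local_cache[archive_path]

    def launch_task(self, job):
        """Run a task with whichever framework version the job asks for."""
        framework = self.localize(job["framework_path"])
        return f"{job['name']} on {self.name} using {framework}"


if __name__ == "__main__":
    # Hypothetical archive paths; the client uploads these, not the cluster admin.
    store = {
        "/apps/mr-2.3.0.tar.gz": "MapReduce 2.3.0",
        "/apps/mr-2.4.0.tar.gz": "MapReduce 2.4.0",
    }
    nm = NodeManager("nm1", store)
    print(nm.launch_task({"name": "wordcount", "framework_path": "/apps/mr-2.3.0.tar.gz"}))
    print(nm.launch_task({"name": "sort", "framework_path": "/apps/mr-2.4.0.tar.gz"}))
```

Because each job names the archive it wants, two jobs using different framework versions can run on the same node, which is the property the article attributes to Spark and Storm.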

This is an original article. When reposting, please credit the source: reposted from Dong's blog

Link: http://dongxicheng.org/mapreduce-nextgen/hadoop-2-3-0-new-features/

About the author, Dong: http://dongxicheng.org/about/

A collection of articles from this blog: http://dongxicheng.org/recommend/
