Next-generation Hadoop YARN: advantages over MRv1


Recently I have often seen people on Weibo say things like, "Many companies will not use YARN for now, because their cluster is nowhere near the size of Yahoo's or Facebook's and will not reach even tens of thousands of machines in the future." This is a mistaken view, and in an era when Hadoop is developing rapidly it needs to be corrected.

In fact, the view above touches only on YARN's scalability, and scalability is a feature you want to have in reserve even before you need it. When small and medium-sized enterprises deploy YARN on a small cluster (by IBM's definition, a cluster of fewer than 200 machines counts as small or medium-sized, and more than 90% of companies fall into this category), they may not benefit from the scalability, but they still gain at least the following:

(1) Faster MapReduce computation

MapReduce is still the most widely used computing framework. YARN reimplements some key internal structures of the MapReduce framework (such as JobInProgress and TaskInProgress) using an asynchronous model, which makes it faster than MRv1. YARN is also backward compatible: jobs that run on MRv1 can run on YARN without any modification.
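As a rough illustration of that compatibility claim, here is a minimal, hedged sketch of a client-side job driver written against the standard org.apache.hadoop.mapreduce API. Nothing in it is YARN-specific; whether the job runs on MRv1 or on YARN is decided by cluster-side configuration (for example mapreduce.framework.name), not by the job code. The class name and paths are illustrative, the Job constructor used here is the MRv1-era one (merely deprecated on newer releases), and no mapper or reducer is set so the example stays self-contained (the framework falls back to the identity Mapper and Reducer).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PassThroughJob {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / mapred-site.xml / yarn-site.xml from the client's classpath;
            // the same driver works whether that configuration points at MRv1 or at YARN.
            Configuration conf = new Configuration();

            Job job = new Job(conf, "pass-through");
            job.setJarByClass(PassThroughJob.class);
            // No mapper/reducer set: the identity Mapper and Reducer are used, so the job
            // simply writes each input line back out, keyed by its byte offset.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }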

(2) Support for multiple frameworks

Compared with MRv1, YARN is no longer a single computing framework but a framework manager: all kinds of computing frameworks can be ported onto YARN, which then manages them and allocates resources to them in a unified way. Porting an existing framework to YARN does take a certain amount of work; currently, YARN runs only the MapReduce offline computing framework.
We know that no single computing framework suits every application scenario; in other words, it is impossible to design one framework that handles offline computing, online computing, stream computing, and in-memory computing all efficiently. Since there is no all-purpose computing framework, why not build a framework management platform (in essence, a resource management platform) that can host and manage all kinds of computing frameworks? That is exactly what YARN does.
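To make the idea of "porting a framework to YARN" more concrete, here is a hedged sketch (not the author's code, simplified, and with API details that vary a little across 2.x releases) of the client side of a YARN application: the client asks the ResourceManager for an application ID, then submits a container specification that launches the framework's own ApplicationMaster, which from then on negotiates resources on the framework's behalf. The application name, resource sizes, and the "MyApplicationMaster" launch command are placeholders; a real port also has to ship jars and set up the container environment.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.LocalResource;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // 1. Ask the ResourceManager for a new application ID.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("my-framework");   // placeholder name

            // 2. Describe the container that will run the framework's ApplicationMaster.
            Map<String, LocalResource> localResources = new HashMap<String, LocalResource>(); // jars, configs, etc.
            Map<String, String> environment = new HashMap<String, String>();
            List<String> commands = new ArrayList<String>();
            commands.add("java MyApplicationMaster 1>stdout 2>stderr");   // placeholder launch command
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    localResources, environment, commands, null, null, null);
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(1024, 1));   // 1 GB of memory, 1 virtual core for the AM

            // 3. Submit. From here on, requesting and releasing containers for the actual work
            //    is the ApplicationMaster's job, not the client's.
            yarnClient.submitApplication(ctx);
            yarnClient.stop();
        }
    }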

YARN is essentially a unified resource management system, conceptually close to Mesos (http://www.mesosproject.org/) of a few years ago and to the even earlier Torque (http://www.adaptivecomputing.com/products/open-source/torque). Running various frameworks on YARN allows their resources to be managed and allocated in a unified way, so that they share one cluster instead of following the "one framework, one cluster" pattern, which greatly reduces operations and hardware costs.
If you have not yet realized that we have entered an era of multiple computing frameworks, take a look at the well-known frameworks that have already emerged:

1) MapReduce: an offline (batch) computing framework that abstracts an algorithm into two stages, map and reduce, and is very well suited to data-intensive computation.

2) Spark: the MapReduce framework is not well suited (not incapable, just inefficient) to iterative computing (common in machine learning, e.g. PageRank) or interactive computing (data mining, e.g. SQL queries). MapReduce is a disk-based computing framework, whereas Spark is an in-memory computing framework that keeps data in memory as much as possible to improve the efficiency of iterative and interactive applications; a minimal sketch of this appears after the list. Official homepage: http://spark-project.org/

3) Storm: MapReduce is not suited to stream computing and real-time analysis, such as counting ad clicks. Storm excels at this kind of workload and far outperforms MapReduce in real-time performance. Official homepage: http://storm-project.net/

4) S4: a stream computing framework developed by Yahoo, similar to Storm. Official homepage: http://incubator.apache.org/s4/

5) Open MPI: a classic message-passing framework, very well suited to high-performance computing and still widely used.

6) Hama: a distributed computing framework based on the BSP (bulk synchronous parallel) model, similar to Google's Pregel; it can be used for large-scale scientific computation such as matrix, graph, and network algorithms. Official homepage: http://hama.apache.org/

7) Cloudera Impala / Apache Drill: Hadoop-based SQL query engines that are much faster than Hive, modeled on Google's Dremel. Cloudera Impala homepage: https://github.com/cloudera/impala; Apache Drill homepage: http://incubator.apache.org/drill/

8) Giraph: a graph-processing framework that uses the BSP model to run iterative graph algorithms such as PageRank, shared connections, and personalization-based popularity. Official homepage: http://giraph.apache.org/

Many of the frameworks above have already been, or are being, ported to YARN; see http://wiki.apache.org/hadoop/PoweredByYarn/
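To make item 2) concrete, here is a minimal sketch of why an in-memory framework helps iterative workloads, assuming a Spark release whose Java API accepts Java 8 lambdas (JavaSparkContext, cache(), filter(), count() are the real API; the dataset and the iterative rule are toy placeholders): the data is materialized in memory once and re-scanned on every pass, instead of being re-read from disk by a fresh MapReduce job per iteration.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterativeSketch {
        public static void main(String[] args) {
            // Local mode so the sketch runs standalone; a real job would get its master from spark-submit.
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("iterative-sketch").setMaster("local[2]"));

            // Toy dataset, cached so that every iteration below scans it from memory.
            JavaRDD<Double> data =
                    sc.parallelize(Arrays.asList(0.5, 1.5, 2.5, 3.5, 4.5)).cache();

            double threshold = 0.0;
            for (int i = 0; i < 10; i++) {
                final double t = threshold;
                // Each pass reuses the cached RDD; a MapReduce implementation would
                // re-read the input from HDFS for every one of these iterations.
                long above = data.filter(x -> x > t).count();
                threshold += 0.1 * above;
            }

            System.out.println("final threshold = " + threshold);
            sc.stop();
        }
    }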

(3) Easier framework upgrades

In YARN, computing frameworks are no longer deployed as services on every node of the cluster (for example, the MapReduce framework no longer needs to deploy JobTracker and TaskTracker services). Instead, a framework is packaged as a user library (lib) kept on the client, so upgrading a computing framework only means upgrading that library. How easy is that!
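As a hedged illustration of the "framework as a client-side library" idea: in later Hadoop 2.x releases the MapReduce runtime can be referenced as an archive that client configuration points jobs at, so rolling out a new framework build means uploading a new tarball rather than upgrading daemons on every node. The property names below are from those later releases and may not exist in every version, and the HDFS path is made up.

    import org.apache.hadoop.conf.Configuration;

    public class FrameworkAsLibrary {
        public static Configuration withFrameworkArchive() {
            Configuration conf = new Configuration();
            // Archive in HDFS containing the MapReduce runtime; "#mrframework" is the symlink
            // name under which task containers see it. (Illustrative path.)
            conf.set("mapreduce.application.framework.path",
                     "hdfs:///apps/mapreduce/mr-framework-2.x.tar.gz#mrframework");
            // Classpath that launched tasks use to find the framework classes inside the archive.
            conf.set("mapreduce.application.classpath",
                     "mrframework/share/hadoop/mapreduce/*,mrframework/share/hadoop/mapreduce/lib/*");
            return conf;
        }
    }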

Summary

YARN grew out of Hadoop 1.0 and embodies more advanced design ideas: it retains the strengths of Hadoop 1.0 while adding many new features and improvements. Even if you do not use YARN, studying it will be of great help in getting more out of your current Hadoop version.

It should be noted that YARN is a brand-new system, completely different from Hadoop 1.0. For an ordinary company, migrating an old Hadoop deployment to YARN is very difficult, because most companies have made modifications of their own to Hadoop, and those modifications may no longer be compatible with mainstream Hadoop versions. When upgrading to YARN, you need to test compatibility thoroughly to make sure the jobs currently in production still run correctly on the migrated system. Be aware, too, that YARN will create considerable extra work for operations staff; it is, after all, a new system.

Of course, YARN is still maturing, and it is too early to discuss how to use it in production. But for an ambitious and forward-looking Hadooper, it is well worth starting to research and learn it now.

From: http://dongxicheng.org/mapreduce-nextgen/what-can-we-benifit-from-yarn/

 

 

Basic glossary for Hadoop 2.0

When reading Hadoop 2.0 documentation, many people confuse certain concepts. This article gives a comprehensive introduction to the terms used in Hadoop 2.0.

(1) Hadoop 1.0

The first generation of Hadoop, composed of the distributed storage system HDFS and the distributed computing framework MapReduce. HDFS consists of one NameNode and multiple DataNodes; MapReduce consists of one JobTracker and multiple TaskTrackers. The corresponding Hadoop releases are 1.x, 0.21.x, and 0.22.x.

(2) Hadoop 2.0

The second generation of Hadoop, proposed to overcome the various problems of HDFS and MapReduce in Hadoop 1.0. To address the fact that HDFS scalability in Hadoop 1.0 is limited by a single NameNode, HDFS Federation was introduced: it allows multiple NameNodes to manage different directories, providing access isolation and horizontal scaling. To address MapReduce's shortcomings in scalability and multi-framework support, a new resource management framework, YARN (Yet Another Resource Negotiator), was introduced; it splits the resource management and job control functions of the JobTracker into two components, the ResourceManager and the ApplicationMaster. The ResourceManager allocates resources to all applications, while each ApplicationMaster manages only a single application. The corresponding Hadoop releases are 0.23.x and 2.x.

(3) MapReduce 1.0 or MRv1 (MapReduce version 1)

The first-generation MapReduce computing framework, consisting of two parts: the programming model and the runtime environment. The basic programming model abstracts a problem into two stages, map and reduce. In the map stage, the input data is parsed into key/value pairs, the map() function is invoked on them iteratively, and the output is written as key/value pairs to a local directory; in the reduce stage, the values sharing the same key are aggregated and the final result is written to HDFS. The runtime environment consists of two kinds of services, the JobTracker and the TaskTracker: the JobTracker is responsible for resource management and for controlling all jobs, while the TaskTrackers receive commands from the JobTracker and execute them.
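As a concrete (and standard) example of the programming model just described, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API: the map stage turns each input line into (word, 1) pairs, and the reduce stage sums the counts for each word before the framework writes the result to HDFS. The driver/submission code is omitted; the class names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map stage: parse each input line into (word, 1) pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce stage: sum the counts for each word; the framework writes the output to HDFS.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }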

(4) MRv2 (MapReduce version 2)

MapReduce 2.0, or MRv2, has exactly the same programming model as MRv1; the only difference is the runtime environment. MRv2 is MRv1 reworked to run on top of the resource management framework YARN. Its runtime environment no longer consists of a JobTracker and TaskTrackers; instead, each job has its own control process, the ApplicationMaster, which manages only that job, while YARN is responsible for resource management.

In short, MRv1 is a standalone offline computing framework, while MRv2 is MRv1 running on YARN.

(5) MapReduce 2.0, YARN, or NextGen MapReduce

YARN, the resource management framework in Hadoop 2.0, is a framework manager that allocates resources to various computing frameworks and provides their runtime environment. MRv2 is the first computing framework to run on YARN; other frameworks, such as Spark and Storm, are being ported to it. YARN is similar to the resource management system Mesos of a few years ago and to the even earlier Torque.

(6) HDFS Federation

In Hadoop 2.0, HDFS is improved so that the NameNode layer can be scaled out horizontally: multiple NameNodes each manage part of the directory tree. This not only improves the scalability of HDFS but also provides isolation between the parts of the namespace managed by different NameNodes.

[References]

Cloudera blog: http://blog.cloudera.com/blog/2012/10/mr2-and-yarn-briefly-explained/

From: http://dongxicheng.org/mapreduce-nextgen/hadoop-2-0-terms-explained/
