The history and detailed analysis of Hadoop YARN


"Editor's note" Mature, universal let Hadoop won large data players love, even before the advent of yarn, in the flow-processing framework, the many institutions are still widely used in the offline processing. Using Mesos,mapreduce for new life, yarn provides a better resource manager, allowing the storm stream-processing framework to run on the Hadoop cluster, but don't forget that Hadoop has a far more mature community than Mesos. From the rise to the rise of the singing decline, the elephant moving large numbers has become more mature, stable, and we also believe that in the future container and other attributes to join, the Hadoop ecosystem will flourish.


The full text of the article follows.

Apache Hadoop with MapReduce is the backbone of distributed data processing. With its horizontally scalable physical cluster architecture and its elegant processing framework originally developed at Google, Hadoop has seen explosive adoption in the new field of big data processing. Hadoop has also developed a rich application ecosystem, including Apache Pig (a powerful scripting language) and Apache Hive (a data warehouse solution with a SQL-like interface).

Unfortunately, this ecosystem is built on top of a programming model that does not solve all problems in big data. MapReduce provides a particular programming model that, although simplified through tools such as Pig and Hive, is not a panacea for big data. Let's begin our introduction to MapReduce 2.0 (MRv2), also known as Yet Another Resource Negotiator (YARN), with a quick review of the pre-YARN Hadoop architecture.

A brief introduction to Hadoop and MRv1

A Hadoop cluster can scale from a single node, in which all Hadoop entities run on the same node, to thousands of nodes, in which functionality is spread across nodes to increase parallel processing. Figure 1 illustrates the high-level components of a Hadoop cluster.

Figure 1. A simple illustration of the Hadoop cluster architecture

A Hadoop cluster can be decomposed into two abstract entities: the MapReduce engine and the distributed file system. The MapReduce engine executes Map and Reduce tasks across the cluster and reports results, while the distributed file system provides storage that replicates data across nodes for processing. The Hadoop Distributed File System (HDFS) was designed to support large files, where each file is typically many multiples of a megabyte in size.
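
As a concrete illustration, the following minimal sketch (assuming a reachable HDFS deployment and the standard org.apache.hadoop.fs client API; the /tmp path is only an example) writes a small file into HDFS and reads it back, while HDFS transparently handles block placement and replication across DataNodes:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);     // connects to the configured NameNode
            Path path = new Path("/tmp/yarn-demo.txt"); // hypothetical path for this demo

            // Write: HDFS splits the stream into blocks and replicates them across DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read: the client gets block locations from the NameNode, data from DataNodes.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
                System.out.println(in.readLine());
            }
        }
    }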

When a client makes a request to a Hadoop cluster, the request is managed by the JobTracker. The JobTracker works with the NameNode to distribute work as close as possible to the data it operates on. The NameNode is the master of the file system, providing metadata services for data distribution and replication. The JobTracker schedules Map and Reduce tasks into available slots on one or more TaskTrackers. The TaskTrackers execute the Map and Reduce tasks alongside the DataNodes (the distributed file system daemons) on data drawn from those DataNodes. When the Map and Reduce tasks are complete, the TaskTrackers notify the JobTracker, which determines when all tasks are finished and eventually informs the client that the job is complete.
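
The classic word count job shows this flow end to end. The driver below is a minimal sketch against the old org.apache.hadoop.mapred (MRv1) API, using Hadoop's stock TokenCountMapper and LongSumReducer classes; the input and output paths come from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // JobClient submits the job to the JobTracker, which schedules
            // Map and Reduce tasks into available TaskTracker slots.
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            conf.setMapperClass(TokenCountMapper.class);  // stock tokenizing mapper
            conf.setReducerClass(LongSumReducer.class);   // stock summing reducer
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf); // blocks until the JobTracker reports completion
        }
    }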

As you can see in Figure 1, MRv1 implements a relatively simple cluster manager for MapReduce processing. MRv1 provides a hierarchical cluster management model in which big data jobs enter the cluster as individual Map and Reduce tasks and are eventually aggregated back into a job that is reported to the user. But this simplicity carries some problems, and they are hardly a secret.

The defects of MRv1

The first version of MapReduce has both strengths and weaknesses. MRv1 is the standard big data processing system in use today. However, the architecture falls short, mainly in large clusters: once a cluster exceeds roughly 4,000 nodes (where each node may be multi-core), its behavior becomes unpredictable. One of the biggest problems is cascading failure, in which attempts to replicate data and overloaded nodes let a single failure severely degrade the entire cluster through a pattern of network flooding.

But the biggest problem with MRv1 is multi-tenancy. As cluster sizes increase, it becomes desirable to apply a variety of models to those clusters. MRv1's nodes are dedicated to Hadoop, so their use cannot be changed to suit other applications and workloads. Multi-tenancy grows even more important as big data and Hadoop become key usage models in cloud deployments, because it allows Hadoop to run on physical servers without virtualization and without adding management, compute, or input/output overhead.

Let's look at the new architecture of YARN to see how it supports MRv2 as well as other applications that use different processing models.

Introduction to YARN (MRv2)

To make a Hadoop cluster shareable, scalable, and reliable, its designers adopted a layered cluster framework approach. Specifically, the MapReduce-specific functionality has been replaced by a new set of daemons that open the framework to new processing models.

Recall that the MRv1 JobTracker and TaskTracker approach was an important flaw, given the limits on scaling and the failure modes induced by network overhead. These daemons were also specific to the MapReduce processing model. To eliminate those limitations, the JobTracker and TaskTracker have been removed from YARN, replaced by a new set of daemons that are agnostic to the application.

Figure 2. New architecture for YARN

The essence of YARN's layered structure is the ResourceManager. This entity controls the entire cluster and manages the assignment of applications to the underlying compute resources. The ResourceManager carefully arbitrates each resource part (compute, memory, bandwidth, and so on) among the underlying NodeManagers (YARN's per-node agents). The ResourceManager also works with ApplicationMasters to allocate resources and with NodeManagers to launch and monitor their underlying applications. In this architecture, the ApplicationMaster has taken over some of the role of the former TaskTracker, and the ResourceManager has taken over the role of the JobTracker.

An ApplicationMaster manages one instance of an application running within YARN. The ApplicationMaster is responsible for negotiating resources from the ResourceManager and, through the NodeManagers, for monitoring the execution of containers and their resource usage (allocations of CPU, memory, and so on). Note that although today's resources are quite traditional (CPU cores, memory), the future will bring new resource types based on the task at hand (such as graphics processing units or other dedicated processing devices). From YARN's point of view, ApplicationMasters are user code, which raises potential security problems: YARN assumes that ApplicationMasters may be buggy or even malicious and therefore treats them as unprivileged code.

The NodeManager manages each node within a YARN cluster. It provides per-node services across the cluster, from overseeing the lifetime management of containers to monitoring resources and tracking node health. Whereas MRv1 managed the execution of Map and Reduce tasks through slots, the NodeManager manages abstract containers, which represent the per-node resources available to a particular application. YARN continues to use the HDFS layer: its master NameNode provides metadata services, while the DataNodes provide replicated storage services dispersed across the cluster.

To use a YARN cluster, you first need a request from a client containing an application. The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application. Using a resource-request protocol, the ApplicationMaster negotiates resource containers on each node on behalf of the application. While the application executes, the ApplicationMaster monitors its containers until they complete. When the application finishes, the ApplicationMaster deregisters its containers with the ResourceManager, and the execution cycle is complete.
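
In code, the client's half of this cycle looks roughly like the following minimal sketch against the org.apache.hadoop.yarn.client.api.YarnClient API; the placeholder shell command standing in for a real ApplicationMaster launch and the 256 MB/1-vcore container size are illustrative assumptions.

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration()); // reads yarn-site.xml for the RM address
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("demo-app");

            // Describe the container that will run our ApplicationMaster.
            ContainerLaunchContext amContainer =
                    Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                    "echo hello-from-am"));  // placeholder for a real AM launch command
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(256, 1)); // 256 MB, 1 vcore (illustrative)

            // Hand the application to the ResourceManager, which schedules the AM container.
            ApplicationId appId = yarnClient.submitApplication(ctx);
            System.out.println("Submitted " + appId);
            yarnClient.stop();
        }
    }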

From this discussion, it should be clear that the old Hadoop architecture was highly constrained by the JobTracker, which was responsible for resource management and job scheduling across the cluster. The new YARN architecture breaks this model, letting a new ResourceManager manage resource usage across applications while ApplicationMasters take responsibility for managing the execution of jobs. This change removes a bottleneck and also improves the ability to scale Hadoop clusters to much larger configurations than previously possible. In addition, beyond traditional MapReduce, YARN permits standard communication patterns such as the Message Passing Interface and supports a variety of programming models, including graph processing, iterative processing, machine learning, and general cluster computing.

What you need to know

With the advent of YARN, you are no longer constrained by the simpler MapReduce development pattern but can create more complex distributed applications. In fact, you can view the MapReduce model as just one of the applications the YARN architecture can run, which simply exposes more of the underlying framework for custom development. This capability is very powerful because YARN's usage model is almost unlimited and no longer needs to be isolated from other, more complex distributed application frameworks that may coexist on a cluster, as MRv1 did. One could even argue that as YARN becomes more robust, it will be able to replace some of those other distributed processing frameworks, eliminating the resource overhead dedicated to them and simplifying the entire system.

To demonstrate YARN's efficiency relative to MRv1, consider the parallel problem of brute-forcing old LAN Manager hashes, the method legacy Windows® systems used for password hashing. In this scenario, a MapReduce approach doesn't make much sense, because too much overhead goes into the map and reduce phases. Instead, it is more reasonable to abstract the job distribution so that each container owns a portion of the password search space, enumerates over it, and notifies you if the correct password is found. The point here is that each candidate password is determined dynamically through a function (which is indeed a bit tricky), rather than all the possibilities being mapped into a data structure, which makes the MapReduce style unnecessary and impractical.
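
To make the idea concrete, here is a minimal sketch (pure illustration; the uppercase-only alphabet and the checkCandidate stub are hypothetical) of how each container could enumerate its slice of the search space directly from two integer indexes, with no intermediate data structure to map over:

    public class KeyspaceSlice {
        private static final char[] ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".toCharArray();
        private static final int LENGTH = 7; // LM hashes cover at most 7 characters per half

        // Decode candidate #index in base-26; no precomputed list of candidates is needed.
        static String candidate(long index) {
            char[] out = new char[LENGTH];
            for (int i = LENGTH - 1; i >= 0; i--) {
                out[i] = ALPHABET[(int) (index % ALPHABET.length)];
                index /= ALPHABET.length;
            }
            return new String(out);
        }

        // Hypothetical stub: a real container would hash the candidate and compare.
        static boolean checkCandidate(String candidate, String targetHash) {
            return false;
        }

        // Each container is handed [start, end) -- its private slice of the keyspace.
        public static void main(String[] args) {
            long start = Long.parseLong(args[0]);
            long end = Long.parseLong(args[1]);
            String targetHash = args[2];
            for (long i = start; i < end; i++) {
                if (checkCandidate(candidate(i), targetHash)) {
                    System.out.println("Found: " + candidate(i));
                    return;
                }
            }
        }
    }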

In truth, the only problems that fit the MRv1 framework well were those that merely needed an associative array, and such problems tended to evolve specifically toward big data operations. However, problems need not stay confined to that paradigm, because you can now abstract them more easily, writing custom clients, ApplicationMasters, and applications that match any design you want.

Developing YARN applications

With the powerful new features YARN provides and the ability to build custom application frameworks on top of Hadoop comes new complexity. Building an application for YARN is considerably more complicated than building a traditional MapReduce application on pre-YARN Hadoop, because you must develop an ApplicationMaster, which is what the ResourceManager launches when a client request arrives. The ApplicationMaster has several requirements, including implementing the protocols needed to communicate with the ResourceManager (for requesting resources) and with the NodeManagers (for allocating containers). For existing MapReduce users, a MapReduce ApplicationMaster minimizes any new work required, making the effort of deploying MapReduce jobs similar to that of pre-YARN Hadoop.
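
The heart of those two protocols looks roughly like the following synchronous sketch, built on the org.apache.hadoop.yarn.client.api.AMRMClient and NMClient helper libraries; the container command, sizes, and empty tracking URL are illustrative assumptions, not a production ApplicationMaster.

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.util.Records;

    public class MinimalAppMaster {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Protocol 1: register with the ResourceManager and request resources.
            AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
            rm.init(conf);
            rm.start();
            rm.registerApplicationMaster("", 0, ""); // no tracking host/port/URL here

            NMClient nm = NMClient.createNMClient();
            nm.init(conf);
            nm.start();

            Resource capability = Resource.newInstance(128, 1); // 128 MB, 1 vcore
            rm.addContainerRequest(new ContainerRequest(capability, null, null,
                    Priority.newInstance(0)));

            // Heartbeat until the RM grants a container, then use
            // protocol 2 (the NodeManager) to launch work inside it.
            int launched = 0;
            while (launched < 1) {
                AllocateResponse response = rm.allocate(0.1f);
                for (Container container : response.getAllocatedContainers()) {
                    ContainerLaunchContext ctx =
                            Records.newRecord(ContainerLaunchContext.class);
                    ctx.setCommands(Collections.singletonList("sleep 10")); // placeholder
                    nm.startContainer(container, ctx);
                    launched++;
                }
                Thread.sleep(1000);
            }

            rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        }
    }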

In many cases, the lifecycle of an application in YARN resembles that of an MRv1 application. YARN allocates resources in the cluster, performs its processing, exposes contact points for monitoring application progress, and finally releases its resources and performs general cleanup when the application completes. A boilerplate implementation of this lifecycle is available in a project named Kitten (see Resources). Kitten is a set of tools and code that simplifies application development in YARN, letting you focus on your application's logic while initially ignoring the details of negotiating with and handling the various entities in a YARN cluster. If you want to dig deeper, however, Kitten provides a set of services that you can use to handle interactions with other cluster entities, such as the ResourceManager. Kitten provides its own ApplicationMaster, which works well but is supplied only as an example. Kitten relies heavily on Lua scripts for its configuration.

Next steps

Although Hadoop continues to grow in the big data market, it has begun an evolution to address large-scale data workloads yet to be defined. YARN is still under active development and may not be suitable for production environments, but it offers important advantages over traditional MapReduce. It permits the development of new distributed applications beyond MapReduce and allows them to coexist simultaneously in the same cluster. YARN builds on existing elements of current Hadoop clusters but also improves on elements such as the JobTracker, increasing scalability and enabling a cluster to be shared among many different applications. YARN will soon be coming to a Hadoop cluster near you, bringing both new capabilities and new complexities.

Resources

Learn

  • For the latest news about Hadoop and the other elements of its ecosystem, check out the Apache Hadoop project site. Beyond Hadoop itself, you'll learn how Hadoop scales horizontally (with new technologies such as YARN) and grows upward (with technologies such as Pig, Hive, and many others).

  • As YARN matures, you'll want to learn the early approaches to writing applications with the YARN model. A useful reference is Writing YARN Applications, which covers a number of the new complexities YARN introduces as well as the various protocols used for communication between entities in a YARN deployment. Apache's distributed shell source is also worth studying.

  • View free courses on a variety of topics from Big Data University, including the basics of Hadoop and text analytics, as well as SQL access for Hadoop and real-time stream computing.

  • MRv2 in Apache Hadoop 0.23 is a good introduction to the important technical details of a YARN cluster.

  • Kitten: For Developers Who Like Playing with YARN offers a useful introduction to Kitten's abstractions for YARN application development.

  • Learn more about big data in the developerWorks big data content area, which offers technical documentation, how-to articles, education, downloads, product information, and more.

