Hadoop's New MapReduce Framework YARN, Explained

Source: Internet
Author: User
Tags: hadoop, ecosystem




Hadoop's New MapReduce Framework YARN, Explained: http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-yarn/

Launched in 2005, Apache Hadoop provides the core MapReduce processing engine to support distributed processing of large-scale data workloads. Seven years later, Hadoop is undergoing a thorough overhaul that lets it support not only MapReduce but other distributed processing models as well.

[Editor's note] Maturity and generality have made Hadoop a favorite among big data players. Even before YARN appeared, and despite the rise of stream processing frameworks, Hadoop remained widely used by many organizations for offline processing. Drawing on ideas from Mesos, YARN gives MapReduce new life: it provides a better resource manager, so that stream processing frameworks such as Storm can also run on top of a Hadoop cluster; and don't forget that Hadoop has a far more mature community than Mesos. Through its rise, decline, and resurgence, the big data elephant has grown more mature and stable, and as containers and other features are added, the Hadoop ecosystem should continue to flourish. CSDN recommendation: subscribe for free to the Hadoop and Big Data weekly for more Hadoop technical articles and ecosystem news.

The article follows.

MapReduce in Apache Hadoop is the backbone of distributed data processing. With its unique scale-out physical cluster architecture and its fine-grained processing framework originally developed at Google, Hadoop is expanding explosively into new areas of big data processing. Hadoop has also developed a rich ecosystem of applications, including Apache Pig (a powerful scripting language) and Apache Hive (a data warehousing solution with an SQL-like interface). Unfortunately, this ecosystem is built on a programming paradigm that cannot solve every problem in big data. MapReduce provides a particular programming model which, although simplified by tools such as Pig and Hive, is not a panacea for big data.
Let's first introduce MapReduce 2.0 (MRv2), also known as Yet Another Resource Negotiator (YARN), and quickly review the Hadoop architecture before YARN.

A brief introduction to Hadoop and MRv1

Hadoop clusters can scale from a single node, where all Hadoop entities run on the same node, to thousands of nodes, where functionality is spread across nodes to increase parallel processing activity. Figure 1 illustrates the high-level components of a Hadoop cluster.

Figure 1. A simple illustration of the Hadoop cluster architecture

A Hadoop cluster can be decomposed into two abstract entities: a MapReduce engine and a distributed file system. The MapReduce engine executes Map and Reduce tasks across the cluster and reports results, while the distributed file system provides a storage scheme that replicates data across nodes for processing. The Hadoop Distributed File System (HDFS) was defined to support large files, where each file is typically a multiple of the block size (64 MB by default in these releases).

When a client makes a request to a Hadoop cluster, the request is managed by the JobTracker. The JobTracker works with the NameNode to distribute work as close as possible to the data it operates on. The NameNode is the master of the file system, providing metadata services for data distribution and replication. The JobTracker schedules Map and Reduce tasks into available slots on one or more TaskTrackers. The TaskTracker executes Map and Reduce tasks alongside a DataNode (of the distributed file system), pulling data from that DataNode. When the Map and Reduce tasks complete, the TaskTracker notifies the JobTracker, which determines when all tasks are complete and ultimately informs the client that the job is done.

As you can see from Figure 1, MRv1 implements a relatively simple cluster manager for MapReduce processing. MRv1 provides a hierarchical cluster management model in which big data jobs flow into the cluster as individual Map and Reduce tasks and are finally aggregated into a job to report back to the user.
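The MRv1 flow just described, map tasks feeding a shuffle that groups values by key for reduce tasks, can be sketched in a few lines of plain Python. This is a toy single-process simulation of the programming model, not Hadoop code; the function names and the word-count example are illustrative only.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map task: emit (word, 1) pairs for one input split."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(key, values):
    """Reduce task: aggregate all counts for one key."""
    return key, sum(values)

def run_job(splits):
    """Toy 'JobTracker': run map tasks, shuffle by key, run reduce tasks."""
    shuffled = defaultdict(list)
    for split in splits:                  # each split would go to a TaskTracker slot
        for key, value in map_phase(split):
            shuffled[key].append(value)   # shuffle/sort groups values by key
    return dict(reduce_phase(k, v) for k, v in shuffled.items())

splits = ["big data big cluster", "big data"]
print(run_job(splits))  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster the splits live in HDFS blocks and the map tasks are scheduled near those blocks; here everything runs in one process purely to show the shape of the paradigm.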
But this simplicity conceals problems, and they are not very well hidden.

The shortcomings of MRv1

The first version of MapReduce has both advantages and disadvantages. MRv1 is the standard big data processing system in use today. However, this architecture falls short, mainly on large clusters. When a cluster exceeds about 4,000 nodes (each of which may be multi-core), it exhibits a certain unpredictability. One of the worst problems is cascading failures: because of attempts to replicate data and reload active nodes, a single failure can cause severe deterioration of the entire cluster through a pattern of network flooding.

But the biggest problem with MRv1 is multi-tenancy. As cluster sizes grow, it is desirable to employ these clusters for a variety of different models. MRv1 nodes are dedicated to Hadoop, so they cannot be repurposed for other applications and workloads. This capability matters even more as big data and Hadoop become an important usage model in cloud deployments, because it would allow Hadoop to run on physical servers without virtualization and without added management, compute, and input/output overhead.

Let's look at YARN's new architecture and see how it supports MRv2 and other applications that use different processing models.

--------------------------------------------------------------------------------

Introducing YARN (MRv2)

To achieve cluster sharing, scalability, and reliability for a Hadoop cluster, the designers adopted a layered approach to the cluster framework. Specifically, the MapReduce-specific functionality has been replaced by a new set of daemons that open the framework to new processing models. Recall that the MRv1 JobTracker and TaskTracker approach was an important flaw: it was limited by failure modes arising from scale and by network overhead. These daemons were also unique to the MapReduce processing model.
To remove this limitation, the JobTracker and TaskTracker have been eliminated from YARN, replaced by a new set of daemons that are agnostic to the application.

Figure 2. YARN's new architecture

At the root of YARN's layered structure is the ResourceManager. This entity controls the entire cluster and manages the assignment of applications to underlying compute resources. The ResourceManager carves out the various resources (compute, memory, bandwidth, and so on) to the underlying NodeManagers (YARN's per-node agents). The ResourceManager also allocates resources to ApplicationMasters, which work with the NodeManagers to launch and monitor their underlying applications. In this context, the ApplicationMaster has taken over some of the role of the former TaskTracker, and the ResourceManager has taken over the role of the JobTracker.

An ApplicationMaster manages each instance of an application running in YARN. The ApplicationMaster is responsible for negotiating resources from the ResourceManager and, through the NodeManagers, monitoring the execution of containers and their resource consumption (CPU, memory, and so on). Note that although today's resources are fairly traditional (CPU cores, memory), the future will bring new resource types based on the task at hand (for example, a specific processing unit or a dedicated processing device). From YARN's perspective, an ApplicationMaster is user code, which raises a potential security issue: YARN assumes ApplicationMasters may be buggy or even malicious, and therefore treats them as unprivileged code.

The NodeManager manages each node in a YARN cluster. The NodeManager provides per-node services in the cluster, from overseeing the lifetime management of containers to monitoring resources and tracking node health. Whereas MRv1 managed the execution of Map and Reduce tasks through slots, the NodeManager manages abstract containers, which represent per-node resources available to a particular application. YARN continues to use the HDFS layer.
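To make the container abstraction concrete, here is a minimal sketch, in plain Python rather than YARN's real API, of a ResourceManager granting fixed-size containers out of the free capacity tracked per node. The `NodeManager` record, field names, and numbers are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class NodeManager:
    """Toy per-node agent: tracks the resources still free on one node."""
    node_id: str
    free_vcores: int
    free_mem_mb: int

def allocate(nodes, vcores, mem_mb, count):
    """Toy ResourceManager: grant up to `count` containers of the
    requested size from whichever nodes still have capacity."""
    granted = []
    for node in nodes:
        while (len(granted) < count
               and node.free_vcores >= vcores
               and node.free_mem_mb >= mem_mb):
            node.free_vcores -= vcores
            node.free_mem_mb -= mem_mb
            granted.append((node.node_id, vcores, mem_mb))
    return granted

cluster = [NodeManager("n1", 4, 8192), NodeManager("n2", 2, 4096)]
print(allocate(cluster, vcores=2, mem_mb=2048, count=4))
# [('n1', 2, 2048), ('n1', 2, 2048), ('n2', 2, 2048)]
```

Note the request for 4 containers is only partially satisfied (3 grants): the scheduler hands out what the nodes can hold, which is exactly the negotiation an ApplicationMaster must cope with through the resource-request protocol.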
Its master NameNode provides metadata services, while DataNodes provide replicated storage services scattered across the cluster.

To use a YARN cluster, you first need a request from a client that includes an application. The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application. Using a resource-request protocol, the ApplicationMaster negotiates resource containers on each node for the application's use. On execution of the application, the ApplicationMaster monitors the containers until completion. When the application completes, the ApplicationMaster unregisters its containers with the ResourceManager, and the execution cycle is finished.

From this discussion it should be clear that the old Hadoop architecture was highly constrained by the JobTracker, which was responsible for resource management and job scheduling across the entire cluster. The new YARN architecture breaks this model, allowing a new ResourceManager to manage resource usage across applications, with ApplicationMasters taking responsibility for managing job execution. This change removes a bottleneck and improves the ability to scale Hadoop clusters to much larger configurations than before. In addition, unlike traditional MapReduce, YARN allows standard communication patterns such as the Message Passing Interface (MPI) to be used, while supporting a variety of programming models, including graph processing, iterative processing, machine learning, and general cluster computing.

--------------------------------------------------------------------------------

What you need to know

With the advent of YARN, you are no longer constrained by the simpler MapReduce development pattern, but can create more complex distributed applications. In fact, you can think of the MapReduce model as just one of the many applications the YARN architecture can run; it simply exposes more of the underlying framework for custom development.
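The application life cycle just described can be summarized as a small state machine. The state names below are illustrative, not YARN's actual internal enums; this is only a sketch of the sequence: submit, launch the ApplicationMaster, negotiate containers and run, finish, unregister.

```python
# Allowed transitions in a toy application state machine, mirroring the
# life cycle described above (names are illustrative, not YARN's real states).
TRANSITIONS = {
    "NEW": "SUBMITTED",            # client sends the application request
    "SUBMITTED": "AM_LAUNCHED",    # RM grants a container and starts the AM
    "AM_LAUNCHED": "RUNNING",      # AM negotiates worker containers
    "RUNNING": "FINISHED",         # AM monitors containers to completion
    "FINISHED": "UNREGISTERED",    # AM unregisters; RM reclaims resources
}

def advance(state):
    """Step the toy application forward one life-cycle stage."""
    if state not in TRANSITIONS:
        raise ValueError(f"terminal or unknown state: {state}")
    return TRANSITIONS[state]

state, history = "NEW", ["NEW"]
while state != "UNREGISTERED":
    state = advance(state)
    history.append(state)
print(" -> ".join(history))
# NEW -> SUBMITTED -> AM_LAUNCHED -> RUNNING -> FINISHED -> UNREGISTERED
```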
This capability is very powerful, because YARN's usage model is virtually unlimited and no longer needs to be isolated from other, more complex distributed application frameworks that may exist on a cluster, as MRv1 was. It could even be said that as YARN becomes more robust, it will be able to replace some of these other distributed processing frameworks, completely eliminating the resource overhead dedicated to them while also simplifying the entire system.

To demonstrate the efficiency of YARN relative to MRv1, consider brute-force parallel cracking of older LAN Manager hashes, the scheme legacy Windows® systems used for password hashing. In this scenario, the MapReduce approach makes little sense, because the mapping/reducing phases involve too much overhead. Instead, it is more sensible to abstract the job distribution so that each container owns a portion of the password search space, enumerates it, and notifies you if the correct password is found. The point here is that the candidate passwords are determined dynamically by a function (which is indeed a bit tricky) rather than by mapping every possibility into a data structure, which makes the MapReduce style unnecessary and impractical.

In fact, problems suited to the MRv1 framework are those that need an associative array, and those problems have a natural tendency to evolve toward big data operations. However, problems need not always be confined to this paradigm, because you can now abstract them more simply and write custom clients, ApplicationMasters, and applications that match any design you want.

--------------------------------------------------------------------------------

Developing YARN applications

Along with the powerful new capabilities YARN provides and the ability to build custom application frameworks on top of Hadoop, you also face new complexity.
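The container-per-shard idea can be sketched as follows: each worker enumerates only its slice of the candidate space, determined here by a simple modulo rule, and reports a hit. Everything in this sketch (the shard function, the stand-in `check`, the tiny 3-letter space) is an illustrative assumption, not a real LM-hash cracker.

```python
from itertools import product
from string import ascii_lowercase

def search_partition(alphabet, length, shard, num_shards, target_check):
    """One 'container': enumerate only the candidates whose index falls
    in this shard of the search space, and return a hit if found."""
    for i, chars in enumerate(product(alphabet, repeat=length)):
        if i % num_shards != shard:
            continue  # this candidate belongs to another container
        candidate = "".join(chars)
        if target_check(candidate):
            return candidate
    return None

# Stand-in for a real hash comparison (e.g. against a captured LM hash).
check = lambda pw: pw == "cab"

# An ApplicationMaster would launch one container per shard; here we
# simply loop over the shards sequentially.
hits = [search_partition(ascii_lowercase, 3, s, 4, check) for s in range(4)]
print([h for h in hits if h])  # ['cab']
```

Nothing here is an associative array or a shuffle: each shard is pure enumeration, which is why a generic YARN application fits this problem better than a MapReduce job.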
Building an application for YARN is considerably more complex than building a traditional MapReduce application on pre-YARN Hadoop, because you need to develop an ApplicationMaster, which the ResourceManager launches when a client request arrives. The ApplicationMaster has several requirements, including implementing the required protocols to communicate with the ResourceManager (to request resources) and with the NodeManagers (to allocate containers). For existing MapReduce users, a MapReduce ApplicationMaster minimizes the new work required, so the effort needed to deploy a MapReduce job is similar to pre-YARN Hadoop.

In many cases, an application's life cycle in YARN resembles that of an MRv1 application: YARN allocates resources in a cluster, performs the processing, exposes touchpoints for monitoring application progress, and finally releases resources and performs general cleanup when the application completes. A boilerplate implementation of this life cycle is available in a project called Kitten (see Resources). Kitten is a set of tools and code that simplifies application development in YARN, allowing you to focus on your application's logic and initially ignore the details of negotiating with and handling the limitations of the various entities in a YARN cluster. If you want to go deeper, however, Kitten provides a set of services that can be used to handle interactions with other cluster entities, such as the ResourceManager. Kitten ships with its own ApplicationMaster, which works well but is provided only as an example. Kitten makes heavy use of Lua scripts as its configuration service.

--------------------------------------------------------------------------------

What's next

While Hadoop continues to grow in the big data market, it has begun an evolution to address the large-scale data workloads yet to be defined.
YARN is still actively evolving and may not yet be suitable for production environments, but it offers important advantages over traditional MapReduce. It allows new distributed applications to be developed beyond MapReduce and allows them to coexist simultaneously in the same cluster. YARN builds on existing elements of current Hadoop clusters but also improves on the JobTracker and other elements, increasing scalability and enhancing the ability of many different applications to share a cluster. YARN will soon come to a Hadoop cluster near you, bringing new capabilities, and new complexities, with it.

--------------------------------------------------------------------------------

Resources

• For the latest news on Hadoop and other elements of its ecosystem, check out the Apache Hadoop project site. Beyond Hadoop itself, you will learn how Hadoop scales out (with new technologies such as YARN) and grows vertically with technologies such as Pig, Hive, and more.
• As YARN matures, you will want to learn the early ways of writing applications to the YARN model. A useful reference is Writing YARN Applications, which covers some of the new complexities YARN introduces as well as the various protocols used for inter-entity communication in a YARN deployment.
• Study Apache's distributed shell source.
• View free courses at Big Data University on a wide range of topics, including Hadoop fundamentals and text analytics essentials, SQL access for Hadoop, and real-time stream computing.
• Apache Hadoop 0.23 introduces MRv2 and is a good introduction to the important technical details of a YARN cluster.
• Kitten: For Developers Who Like Playing with YARN provides a useful introduction to Kitten's abstractions for YARN application development.
• Learn more about big data in the developerWorks Big Data content area, where you will find technical documentation, how-to articles, education, downloads, product information, and more.

