Introduction to Yet Another Resource Negotiator (YARN)
Apache Hadoop with MapReduce is the backbone of distributed data processing. With its scale-out physical cluster architecture and the elegant processing framework originally developed by Google, Hadoop has grown explosively in the new field of big data processing. Hadoop has also developed a rich ecosystem of applications, including Apache Pig (a powerful scripting language) and Apache Hive (a data warehouse solution with an SQL-like interface).
Unfortunately, this ecosystem is built on a programming model that does not solve every problem in big data. MapReduce provides one specific programming model and, although tools such as Pig and Hive simplify it, it is not a panacea. Let's begin with an introduction to MapReduce 2.0 (MRv2), also known as Yet Another Resource Negotiator (YARN), and a quick review of the Hadoop architecture before YARN.
A brief introduction to Hadoop and MRv1
A Hadoop cluster can scale from a single node (where all Hadoop entities run on the same node) to thousands of nodes (where functionality is distributed across nodes to increase parallel processing). Figure 1 illustrates the high-level components of a Hadoop cluster.
Figure 1. A simplified view of the Hadoop cluster architecture
A Hadoop cluster can be decomposed into two abstract entities: the MapReduce engine and the distributed file system. The MapReduce engine executes Map and Reduce tasks across the cluster and reports the results, while the distributed file system provides a storage scheme that replicates data across nodes for processing. The Hadoop Distributed File System (HDFS) is designed to support large files (where each file is typically a multiple of the block size, which is measured in megabytes).
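To make the Map and Reduce phases concrete, here is a minimal single-process sketch of the programming model, using word counting as the job. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and chosen only for illustration, and the grouping step that a real cluster performs across nodes is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group intermediate values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the whole job in one process; a cluster would spread records over Map tasks."""
    intermediate = []
    for record in records:
        intermediate.extend(map_phase(record))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase depends only on its own key, both phases can run in parallel on different nodes. That independence is what lets the model scale out.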
When a client makes a request to a Hadoop cluster, the request is managed by the JobTracker. The JobTracker works with the NameNode to distribute the work as close as possible to the data it operates on. The NameNode is the master of the file system and provides the metadata services that handle data distribution and replication. The JobTracker schedules Map and Reduce tasks into available slots on one or more TaskTrackers. Each TaskTracker works with the DataNode (the distributed file system) to execute Map and Reduce tasks on data from that DataNode. When the Map and Reduce tasks are complete, the TaskTracker notifies the JobTracker, which determines when all tasks are done and eventually tells the client that the job is complete.
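The idea of placing work close to its data can be sketched as a small slot-assignment routine. This is a simplified illustration, not the JobTracker's actual algorithm: the function assign_tasks, the node names, and the greedy first-fit policy are all assumptions made for the example. Each task lists the nodes that hold a replica of its input block, and each node advertises a number of free slots.

```python
def assign_tasks(tasks, free_slots):
    """Greedily place each task on a node with a free slot, preferring data-local nodes.

    tasks: dict mapping task id -> set of nodes that hold a replica of its input
    free_slots: dict mapping node name -> number of free task slots
    Returns a dict mapping task id -> (node, is_local). Tasks that find no
    free slot anywhere are left out, that is, they keep waiting.
    """
    slots = dict(free_slots)  # work on a copy so the caller's view is unchanged
    assignment = {}
    for task_id, replica_nodes in tasks.items():
        # First choice: a node that stores the input data and has a free slot.
        local = [node for node in sorted(replica_nodes) if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with a free slot; the input must cross the network.
            remote = [node for node in sorted(slots) if slots[node] > 0]
            if not remote:
                continue
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[task_id] = (node, is_local)
    return assignment


if __name__ == "__main__":
    tasks = {"map-0": {"node1", "node2"}, "map-1": {"node1"}, "map-2": {"node1"}}
    free_slots = {"node1": 1, "node2": 1, "node3": 1}
    for task_id, placement in assign_tasks(tasks, free_slots).items():
        print(task_id, placement)
```

In this example only map-0 runs data-local; map-1 and map-2 fall back to remote nodes because node1 has a single slot. A fixed number of slots per node is exactly the kind of rigid resource model that the rest of this article revisits.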
InfoSphere BigInsights Quick Start Edition
InfoSphere BigInsights Quick Start Edition is a free, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based product. With the Quick Start Edition, you can try the features IBM has built to extend the value of open source Hadoop, such as Big SQL, text analytics, and BigSheets. Guided learning, including step-by-step, self-paced tutorials and videos, helps you get started with Hadoop as smoothly as possible. With no time or data limits, you can experiment with large amounts of data on your own schedule.
As Figure 1 shows, MRv1 implements a relatively simple cluster manager for MapReduce processing. MRv1 provides a hierarchical cluster management model in which big data jobs filter into the cluster as individual Map and Reduce tasks and are eventually aggregated back into a job that reports to the user. But this simplicity hides some problems, and they are not very well-kept secrets.
MRv1's flaws
The first version of MapReduce has both strengths and weaknesses. MRv1 is the standard big data processing system in use today. However, the architecture has shortcomings that show up mainly in large clusters. When a cluster grows beyond 4,000 nodes (where each node may be multi-core), its behavior can become unpredictable. One of the biggest problems is cascading failure: as the cluster tries to re-replicate data and overloads the surviving nodes, a single failure can flood the network and severely degrade the entire cluster.
But MRv1's biggest problem is multi-tenancy. As clusters grow, it becomes desirable to use them for a variety of processing models. MRv1 dedicates its nodes to Hadoop, whereas it would be preferable to be able to repurpose them for other applications and workloads. This capability becomes more valuable as big data and Hadoop grow into an important usage model for cloud deployments, because it allows Hadoop to run directly on physical servers rather than requiring virtualization with its added management, computational, and input/output overhead.
Let's look at the new architecture of YARN and see how it supports MRv2 and other applications that use different processing models.
Introducing YARN (MRv2)
To enable cluster sharing, scalability, and reliability in a Hadoop cluster, the designers adopted a layered approach to the cluster framework. Specifically, the MapReduce-specific functionality has been replaced with a new set of daemons that opens the framework to new processing models.
Where can I find YARN?
YARN was introduced into Hadoop in the hadoop-0.23 release. As the overhaul progresses, you'll find that the framework is continually being updated.
Recall that the MRv1 JobTracker and TaskTracker approach was a critical flaw because it limited scaling and because network overhead led to certain failure modes. These daemons were also specific to the MapReduce processing model. To remove this limitation, the JobTracker and TaskTracker have been removed from YARN and replaced with a new set of daemons that are agnostic to the application.
Figure 2. The new architecture of YARN