3. What do you think of Hadoop 2.x YARN?
This question came up in one of my interviews: what is the difference between Hadoop 1.x and Hadoop 2.x? At first many points were vague to me. I hope this article gives you a basic understanding of Hadoop.
Reprinted from Http://www.aboutyun.com/forum.php?mod=viewthread&tid=19771&extra=page%3D1
Most people today work with Hadoop 2.x, so Hadoop 1.x is relatively poorly understood.
Many people ask whether they can learn Hadoop 2.x without a background in 1.x. The answer is yes, but knowing Hadoop 1.x helps us understand Hadoop 2.x.
Let's look at what is wrong with Hadoop 1.x.
Hadoop 1.x has a JobTracker and TaskTrackers. If you are new to them, these two components can seem abstract. You may have come across the names many times and still have only a vague idea of what they do.
Here is a metaphor: an organization has both managers and workers. The JobTracker and the TaskTrackers are the managers, and the workers are the map tasks and reduce tasks.
A TaskTracker is like a middle manager that monitors its workers, the map tasks and reduce tasks. When a map task or reduce task changes state, it reports to the TaskTracker through a heartbeat (typically every 3 seconds), and the TaskTracker in turn reports to the JobTracker through its own heartbeat (generally at least every 5 seconds, because this heartbeat is relatively expensive).
The JobTracker is the top-level manager: it receives the TaskTrackers' heartbeats and is responsible for resource management and job scheduling.
If you think this through carefully, you can see that if the top-level manager, the JobTracker, goes down, the entire cluster is paralyzed:
1. Jobs cannot be submitted.
2. Resources cannot be allocated.
3. Jobs cannot be scheduled.
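The three failures listed above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyJobTracker, its methods, and the slot counts are hypothetical names invented for illustration and are not the real Hadoop 1.x API. The point is simply that every operation goes through one process, so when that process is down, all three fail together.

```python
class JobTrackerDown(Exception):
    """Raised when the single coordinator is unavailable."""


class ToyJobTracker:
    """Hypothetical stand-in for the Hadoop 1.x JobTracker (not the real API)."""

    def __init__(self, total_slots):
        self.alive = True
        self.free_slots = total_slots
        self.queue = []

    def _check(self):
        # Every operation depends on this one process being alive.
        if not self.alive:
            raise JobTrackerDown("JobTracker is down")

    def submit(self, job_name):
        self._check()
        self.queue.append(job_name)

    def allocate(self, slots):
        self._check()
        if slots > self.free_slots:
            return False
        self.free_slots -= slots
        return True

    def schedule(self):
        self._check()
        return self.queue.pop(0) if self.queue else None


def try_all(tracker):
    """Attempt the three operations and record which ones succeed."""
    results = {}
    for name, call in [
        ("submit", lambda: tracker.submit("wordcount")),
        ("allocate", lambda: tracker.allocate(2)),
        ("schedule", lambda: tracker.schedule()),
    ]:
        try:
            call()
            results[name] = True
        except JobTrackerDown:
            results[name] = False
    return results


tracker = ToyJobTracker(total_slots=4)
healthy = try_all(tracker)   # all three operations succeed
tracker.alive = False        # simulate the JobTracker going down
failed = try_all(tracker)    # all three operations now fail
print(healthy)
print(failed)
```

Because submission, allocation, and scheduling all share the same liveness check, there is no partial degradation in this model: the cluster either works or it does not.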
This is a bit like a country losing its leader: who is then responsible for running the country? Once you understand how the system operates, you can plan for this, and the plan is the familiar high-availability solution. Hadoop 1.x has no such plan: if the JobTracker goes down, the Hadoop cluster is effectively dead. So Hadoop 1.x is clearly flawed.
If it is flawed, how can we make up for it? One option is to stay within the original framework and run two JobTrackers. That is certainly a workable scheme. But Hadoop is ambitious, too: it was the pioneer of big data, and newer frameworks such as Spark and Storm are now very active.
So we can list the following issues for Hadoop to address:
1. Hadoop has a single point of failure.
2. Hadoop should be able to unify Spark and Storm under one platform.
From the above we see that Hadoop itself has problems that need fixing, and that it also wants to unify Spark and Storm. So Hadoop badly needs an upgrade.
Several solutions come to mind here.
Scenario 1: Two JobTrackers
Since Hadoop has a single point of failure, could we run two JobTrackers? Yes: if one goes down, we can switch to the other. But that leaves the other problem, which is how to unify Spark and Storm. Can two JobTrackers run Spark and Storm? No, because the JobTracker is still confined to its own framework and can only run Hadoop map and reduce tasks. Spark's DAGs and Storm's topologies still cannot run. You might suggest adding that support to the JobTracker as well, but that would be quite troublesome: the JobTracker already has too many responsibilities and would be overloaded. A separation of duties is clearly required.
Since two JobTrackers will not do, the JobTracker's functions have to be separated so that the following existing problems are solved:
1. Performance issues
2. The single point of failure
3. Running MapReduce, Spark, and Storm
This is where YARN comes in.
Some background on Hadoop's history
(1) Hadoop 1.0
Hadoop 1.0, the first generation of Hadoop, consists of the distributed storage system HDFS and the distributed computing framework MapReduce. HDFS consists of one NameNode and multiple DataNodes, and MapReduce consists of one JobTracker and multiple TaskTrackers. Hadoop 1.0 corresponds to Apache Hadoop 0.20.x, 1.x, 0.21.x, and 0.22.x, and to CDH3.
(2) Hadoop 2.0
Hadoop 2.0, the second generation of Hadoop, was designed to overcome various problems with HDFS and MapReduce in Hadoop 1.0. For HDFS, the single NameNode in Hadoop 1.0 limits scalability, so HDFS Federation was proposed: it lets multiple NameNodes each manage separate directories, which provides access isolation and horizontal scaling. Hadoop 2.0 also addresses the NameNode single point of failure, through NameNode high availability rather than through Federation itself. For MapReduce, Hadoop 1.0 falls short in scalability and multi-framework support, so Hadoop 2.0 separates the JobTracker's resource management and job control functions and implements them in two components, the ResourceManager and the ApplicationMaster. The ResourceManager is responsible for allocating resources to all applications, while each ApplicationMaster manages only one application. This gave birth to a new general-purpose resource management framework, YARN. On YARN, users can run many types of applications (no longer only MapReduce applications, as in 1.0), from offline computing with MapReduce to online (stream) computing with Storm. Hadoop 2.0 corresponds to Apache Hadoop 0.23.x and 2.x, and to CDH4.
(3) MapReduce 1.0 or MRv1
The MapReduce 1.0 computing framework consists of three parts: the programming model, the data processing engine, and the runtime environment. Its programming model abstracts a problem into two stages, map and reduce. The map phase parses the input data into key/value pairs, calls the map() function on each pair in turn, and writes the output to a local directory as key/value pairs. The reduce phase processes together all values that share the same key and writes the final result to HDFS. Its data processing engine consists of MapTask and ReduceTask, which handle the map phase logic and the reduce phase logic respectively. Its runtime environment consists of two types of services, one JobTracker and several TaskTrackers: the JobTracker is responsible for resource management and for controlling all jobs, and each TaskTracker receives commands from the JobTracker and executes them. The framework falls short in scalability, fault tolerance, and multi-framework support, which is what prompted MRv2.
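The map and reduce stages described above can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop code: the function names map_phase, shuffle, and reduce_phase are assumptions made for this example, and the grouping step stands in for the work the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: parse one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share the same key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))


counts = word_count(["hadoop yarn hadoop", "yarn storm"])
print(counts)  # {'hadoop': 2, 'storm': 1, 'yarn': 2}
```

In a real cluster the map calls run in parallel on different nodes and the grouped values are moved over the network, but the contract between the two user-supplied functions is the same.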
(4) MapReduce 2.0 or MRv2
MRv2 has the same programming model and data processing engine as MRv1. The only difference is the runtime environment. MRv2 is the MapReduce computing framework reworked from MRv1 so that it runs on the resource management framework YARN. Its runtime environment no longer consists of services such as the JobTracker and TaskTrackers. Instead it consists of the general-purpose resource management system YARN and a job control process, the ApplicationMaster: YARN is responsible for resource management and scheduling, and each ApplicationMaster manages only one job. In short, MRv1 is a standalone offline computing framework, while MRv2 is MapReduce running on YARN.
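The division of labor described above, with one cluster-wide resource manager and one controller per job, can be sketched as a toy model. Everything here is hypothetical: ToyResourceManager, ToyApplicationMaster, and the container counts are invented for illustration and do not correspond to the real YARN API. For simplicity each granted container finishes exactly one task and is released within the same round.

```python
class ToyResourceManager:
    """Hypothetical cluster-wide allocator; it knows nothing about job logic."""

    def __init__(self, total_containers):
        self.free = total_containers

    def request(self, n):
        # Grant as many containers as are currently free, up to n.
        granted = min(n, self.free)
        self.free -= granted
        return granted

    def release(self, n):
        self.free += n


class ToyApplicationMaster:
    """Hypothetical per-application controller; it manages exactly one job."""

    def __init__(self, name, tasks, rm):
        self.name = name
        self.pending = tasks
        self.done = 0
        self.rm = rm

    def run_round(self):
        granted = self.rm.request(self.pending)
        # Assumption: each granted container completes one task per round.
        self.pending -= granted
        self.done += granted
        self.rm.release(granted)
        return granted


rm = ToyResourceManager(total_containers=3)
mr_job = ToyApplicationMaster("mapreduce-job", tasks=5, rm=rm)
stream_job = ToyApplicationMaster("streaming-job", tasks=2, rm=rm)

rounds = 0
while mr_job.pending or stream_job.pending:
    mr_job.run_round()
    stream_job.run_round()
    rounds += 1

print(rounds, mr_job.done, stream_job.done, rm.free)
```

The resource manager in this sketch never inspects what a job does, which is why two different kinds of application can share it. Job-specific logic lives only in the per-application controller.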
(5) YARN
YARN is the resource management system in Hadoop 2.0, a general-purpose module that handles resource management and scheduling for a variety of applications. YARN is not limited to MapReduce: other frameworks can use it as well, such as Tez (introduced in Chapter 9), Spark, and Storm (introduced in Chapter 10). YARN is similar to the resource management system Mesos (introduced in Chapter 12) of a few years ago and to the earlier Torque (introduced in Chapter 6). Because YARN is general-purpose, the core of next-generation MapReduce has shifted from a simple computing framework that supports a single kind of application to the general-purpose resource management system YARN.
(6) HDFS Federation
In Hadoop 2.0, HDFS was improved so that the NameNode can scale horizontally into multiple NameNodes, each managing part of the directory tree. The result is HDFS Federation. This mechanism not only improves the scalability of HDFS but also gives it isolation.
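The idea that each NameNode manages part of the directory tree can be illustrated with a small lookup table. This is a sketch only: the directory names, the NameNode names, and the resolve_namenode function are assumptions made for this example, and real deployments handle the mapping through their own client-side configuration rather than code like this.

```python
# Hypothetical mapping from directory prefixes to the NameNode that owns them.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/data/logs": "namenode-2",
    "/data": "namenode-3",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode whose directory is the longest prefix of path."""
    best = None
    for prefix, namenode in mount_table.items():
        # Match whole path components only, so "/database" is not under "/data".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, namenode)
    if best is None:
        raise KeyError(f"no NameNode is responsible for {path}")
    return best[1]


print(resolve_namenode("/data/logs/2016/app.log"))  # namenode-2
print(resolve_namenode("/data/raw/part-0"))         # namenode-3
print(resolve_namenode("/user/alice"))              # namenode-1
```

Because each prefix belongs to exactly one NameNode, load and failures in one part of the namespace do not affect the others, which is the isolation property mentioned above.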