A big-picture talk on Hadoop 1.0, Hadoop 2.0, and the YARN platform


December 14, 2016 21:37:29
Author: Zhangmingyang
Blog Link: http://blog.csdn.net/a2011480169/article/details/53647012

These past few days I have been busy with HBase experiments and have had little quiet time to collect my thoughts, so today I plan to write a blog post about Hadoop 1.0, Hadoop 2.0, and YARN, to capture the connections among the three from a high level. If anything in this post is wrong, please leave a comment! OK, on to the topic...
When Hadoop comes up, most people probably have the following picture of it: Hadoop is made up of two parts. One is HDFS, the distributed file system responsible for storing and managing files; the other is MapReduce, the computing framework responsible for processing and computation. That is, Hadoop can store massive amounts of data through HDFS and perform distributed computation through MapReduce. To summarize Hadoop in one sentence: Hadoop is a platform for distributed storage and computation of big data.

Here, let's look at the architecture of HDFS:

The HDFS architecture is a master-slave structure: there is only one master node, the NameNode, while there are many slave nodes, the DataNodes. Note that the NameNode and the DataNodes are actually different physical machines; that is, one machine runs the NameNode process while many machines run the DataNode process. In other words, the role of a server is determined by the process running on it; otherwise they are all just a pile of physical machines. It is important to keep this concept clear.
Next we discuss the roles that the NameNode, SecondaryNameNode, and DataNode play in an HDFS cluster:
The role of the NameNode:
1> The NameNode manages the entire file system and is responsible for receiving users' operation requests.
2> The NameNode manages the directory structure of the entire file system; this directory structure is similar to the directory tree in the Windows operating system.
3> The NameNode manages the metadata of the entire file system; metadata means the information about a file other than the file's data itself.
4> The NameNode maintains the mapping from each file to its sequence of blocks, and the mapping from each block to the DataNodes that store it.
To summarize the NameNode in one sentence: the NameNode is responsible for the management work in HDFS.
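The two mappings in point 4> can be pictured with a toy Python sketch. This is an illustration only, not Hadoop's actual data structures; the paths, block IDs, and host names are invented.

```python
# Toy sketch of the two mappings a NameNode maintains (illustration only):
# file -> ordered block sequence, and block -> DataNodes holding replicas.
file_to_blocks = {
    "/logs/access.log": ["blk_001", "blk_002"],
}
block_to_datanodes = {
    "blk_001": ["datanode-1", "datanode-2", "datanode-3"],  # 3 replicas
    "blk_002": ["datanode-2", "datanode-3", "datanode-4"],
}

def locate(path):
    """For each block of the file, return the DataNodes holding a replica."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]
```

A client reading a file first asks the NameNode for exactly this kind of block-location list, then fetches the block data directly from the DataNodes.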
The role of the DataNode:
1> A DataNode does only one thing in HDFS: store data. Files in HDFS are cut into blocks for storage, unlike in Windows, where files are stored whole. This is also for ease of maintenance and management.
Pay special attention: in HDFS, the real data is stored by the DataNodes, but the metadata, such as which DataNode a block is stored on, is managed by the NameNode.
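The block-splitting rule itself is simple arithmetic, sketched below. The 128 MB default block size is that of Hadoop 2.x (Hadoop 1.x defaulted to 64 MB); only the last block of a file may be smaller than the block size.

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the sizes of the blocks a file of file_size bytes is cut into.

    128 MB is the Hadoop 2.x default block size; only the final block
    may be smaller than block_size.
    """
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks
```

For example, a 300 MB file becomes two full 128 MB blocks plus one 44 MB block; each block is then replicated across DataNodes independently.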
The role of the SecondaryNameNode:
The SecondaryNameNode does only one thing in HDFS: merge the fsimage and edits files of the NameNode. The merge process is shown below:

First, the SecondaryNameNode copies the fsimage and edits files from the NameNode into its own process, then merges the edits into the fsimage to generate a new fsimage, pushes the newly generated fsimage back to the NameNode, and empties the contents of the edits file on the NameNode.
Note here: the NameNode does not merge fsimage and edits itself, so that it can respond more quickly to users' operation requests.
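Conceptually, the merge replays the logged operations against a copy of the namespace image. The following is a toy sketch under simplifying assumptions (the namespace is a flat dict, edits are `(op, path)` tuples); real fsimage/edits files are binary formats with many more operation types.

```python
def checkpoint(fsimage, edits):
    """Toy checkpoint: replay the edit log against a copy of the
    namespace image, producing a new image and an emptied edit log."""
    new_image = dict(fsimage)          # work on a copy, like the 2NN does
    for op, path in edits:
        if op == "create":
            new_image[path] = {}       # record a new (empty) file entry
        elif op == "delete":
            new_image.pop(path, None)  # remove the entry if present
    return new_image, []               # new fsimage, emptied edits
```

This is why the edits file can be emptied after the merge: every operation it recorded is now baked into the new fsimage.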
Having written this far, let's think: what are the defects of HDFS in Hadoop 1.0? We can summarize a few points:
1> The NameNode is a single point of failure.
2> Because the NameNode holds all the metadata for the files that users store, once the NameNode can no longer fit all the metadata in memory, the life of the cluster ends. We can generalize this as the problem of insufficient NameNode memory capacity.
3> The permission design in HDFS is not thorough; that is, data isolation in HDFS is not very good.
4> If HDFS stores small files in large quantities, the NameNode's memory pressure surges.
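Point 4> can be made concrete with a back-of-the-envelope estimate, using the often-quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block); the exact figure varies by version and is an assumption here.

```python
def namenode_heap_estimate(num_files, blocks_per_file, bytes_per_object=150):
    """Rough NameNode heap estimate using the common ~150 bytes per
    namespace object (file or block) rule of thumb (assumption)."""
    objects = num_files * (1 + blocks_per_file)  # one file object + its blocks
    return objects * bytes_per_object
```

Ten million one-block small files cost about 3 GB of heap for only a tiny amount of actual data, whereas the same data packed into large multi-block files would need far fewer namespace objects.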
All four of the above defects are resolved in Hadoop 2.0; we will discuss how in a moment. First let's look at MapReduce in Hadoop 1.0.
MapReduce is the distributed computing framework in Hadoop 1.0, consisting of two phases: the mapper phase and the reducer phase. The user simply implements a map function and a reduce function to get distributed computation, which is very simple. Next let's look at the architecture of MapReduce:

The MapReduce architecture is also a master-slave structure: there is only one master node, the JobTracker, while there are many slave nodes, the TaskTrackers. The roles of the JobTracker and the TaskTrackers in MapReduce are like those of a project manager and developers. The JobTracker's specific responsibilities are as follows:
1> The JobTracker is responsible for receiving the computing jobs submitted by users.
2> It assigns the computing tasks to the TaskTrackers for execution.
3> It tracks the execution status of tasks in order to monitor the TaskTrackers.
The role of a TaskTracker, of course, is to execute the computing tasks that the JobTracker assigns to it.
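The "user simply implements map and reduce" programming model can be sketched in miniature. This is a single-process toy word count, not Hadoop's Java API: the `map_fn`/`reduce_fn` names are invented, and the shuffle here is just grouping by key in a dict.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return (word, sum(counts))

def run_mapreduce(lines):
    """Toy driver: map every line, shuffle (group by key), then reduce."""
    grouped = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in map_fn(line):
            grouped[key].append(value)      # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())  # reduce phase
```

On a real cluster the map calls run in parallel on many TaskTrackers, the shuffle moves data across the network, and the reduce calls run on yet other nodes, but the user-visible contract is exactly these two functions.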
Now let's think about the drawbacks of MapReduce in Hadoop 1.0:
1> The JobTracker in MapReduce has too many responsibilities: it must both allocate resources and monitor the running of the tasks under every job, which often results in a huge waste of memory and resources.
2> In Hadoop 1.0, real-time jobs and batch jobs require different cluster environments, with each cluster running a different job type, which often leads to low cluster resource utilization. In actual business, the main workloads MapReduce handles are delay-tolerant batch operations; that is, because of the design of MapReduce in 1.0, cluster resource utilization is not high.
OK, with the defects of HDFS and MapReduce in 1.0 in mind, we now enter Hadoop 2.0, starting with HDFS in 2.0.
In Hadoop 2.0, the following improvements were made to address the insufficient memory capacity and the single point of failure of the NameNode in HDFS 1.0:
1> In 1.0, having a single NameNode leads to insufficient memory capacity, so 2.0 introduces two NameNodes to form HDFS Federation, which doubles the amount of metadata that can be stored. HDFS Federation means multiple HDFS namespaces working simultaneously: the data stored on the DataNodes serves both HDFS file systems, as shown in the architecture:

2> For the single point of failure of the NameNode in 1.0, a new HA mechanism was introduced in 2.0: if the active NameNode goes down, the standby NameNode takes over and continues working. The following diagram should make this easier to understand:

Here we must pay attention: HDFS Federation in 2.0 also has two NameNode nodes, and HA also has two NameNode nodes. But the two NameNodes in Federation use different namespaces, so the metadata they store is different, while the two NameNodes in HA share the same namespace, so the metadata they store is the same.
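The Federation-versus-HA contrast can be shown with a tiny sketch (illustration only; paths and block IDs are invented): Federation NameNodes own disjoint slices of the namespace, while HA NameNodes are mirror images of one another.

```python
# Federation: each NameNode owns a different namespace -> different metadata.
federation = {
    "nn1": {"/user/alice": "blk_1"},   # namespace served by NameNode 1
    "nn2": {"/data/logs": "blk_2"},    # namespace served by NameNode 2
}

# HA: active and standby share ONE namespace -> identical metadata.
ha = {
    "active":  {"/user/alice": "blk_1"},
    "standby": {"/user/alice": "blk_1"},  # mirror of the active NameNode
}

assert federation["nn1"] != federation["nn2"]  # Federation: metadata differs
assert ha["active"] == ha["standby"]           # HA: metadata is identical
```

That identical copy is exactly what lets the standby take over instantly in HA, and the disjoint namespaces are what let Federation scale metadata capacity.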
OK, after introducing Federation and HA in 2.0, we move on to MapReduce in 2.0, that is, YARN. In Hadoop 2.0, the YARN platform is the resource management system, with the architecture shown below:

YARN is the resource management system in Hadoop 2.0. Its basic design idea is to split the JobTracker of MRv1 into two separate services: a global resource manager, the ResourceManager, and a per-application ApplicationMaster.
On the YARN platform, the detailed functions of each component are as follows:
1> The ResourceManager is a global resource manager responsible for the resource management and allocation of the entire system. The ResourceManager takes over one part of the JobTracker's functions in Hadoop 1.0: resource allocation.
2> The ApplicationMaster is responsible for managing a single application; it handles all the work in one job's life cycle, requesting resources for tasks and restarting a task when it fails. The ApplicationMaster resembles the other part of the old JobTracker's functions: task assignment and task monitoring.
Special note: each job (not each job type) has a corresponding ApplicationMaster, and the ApplicationMaster can run on a machine other than the ResourceManager node; in Hadoop 1.0, by contrast, the location of the JobTracker was fixed.
3> The NodeManager is the resource and task manager on each node. On the one hand, it periodically reports to the ResourceManager on the node's resource usage and the running state of each container; on the other hand, it accepts and handles requests from ApplicationMasters to start and stop containers.
Here, let's talk about how a user's application (taking a MapReduce program as an example) runs on the YARN platform:

1> First, the user submits the application to the YARN platform through the YARN client program.
2> After the ResourceManager of the YARN platform accepts the application submitted by the client, it hands the application to a NodeManager and starts a new process on it: the ApplicationMaster.
3> The ApplicationMaster first registers the application with the ResourceManager, so that the user can view the application's progress through the ResourceManager.
4> After registering, the ApplicationMaster requests and receives resources from the ResourceManager over the RPC protocol.
5> Once the resources are acquired, the ApplicationMaster communicates with the corresponding NodeManager nodes, asking them to launch the corresponding tasks.
6> During execution, each task reports its progress and status to the ApplicationMaster over the RPC protocol, so that the ApplicationMaster can keep track of every task and restart a task when it fails.
7> After the mapper and reducer tasks finish, the ApplicationMaster unregisters from the ResourceManager and shuts itself down; at this point the resources are reclaimed and the application is finished.
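The seven steps above can be replayed as a toy event trace. This is an illustration only, not YARN's real protocol or API; the function and event strings are invented.

```python
def run_application(tasks):
    """Toy replay of the YARN application life cycle (steps 1-7 above)."""
    events = [
        "client: submit application to ResourceManager",          # step 1
        "ResourceManager: launch ApplicationMaster on a NodeManager",  # step 2
        "ApplicationMaster: register with ResourceManager",       # step 3
    ]
    for task in tasks:
        events.append(f"ApplicationMaster: request container for {task}")  # step 4
        events.append(f"NodeManager: launch {task}")                       # step 5
        events.append(f"{task}: report progress to ApplicationMaster")     # step 6
    events.append("ApplicationMaster: unregister and shut down")  # step 7
    return events
```

Note how only the first and last events involve the ResourceManager's job-level bookkeeping; everything task-related happens between the ApplicationMaster and the NodeManagers, which is precisely the load that was lifted off the old JobTracker.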
In the end, let's summarize the four advantages of the YARN framework over the old MapReduce framework:
1> In Hadoop 1.0, the JobTracker had too many responsibilities: it had to allocate resources and also track the execution of the tasks under every job, which often wasted a great deal of memory and resources. In Hadoop 2.0, the YARN design greatly reduces the resource consumption of the JobTracker's successor, the ResourceManager, by making the corresponding ApplicationMaster responsible for tracking and monitoring the tasks under each job. Inside the ResourceManager there is a module called the ApplicationsManager (note: not ApplicationMaster) that is only responsible for monitoring the health of each ApplicationMaster; if an ApplicationMaster fails, it restarts it on another machine.
2> On the new YARN platform, the ApplicationMaster is a pluggable part: users can write ApplicationMasters for different types of applications, so the new YARN framework allows more programming models to run on a Hadoop cluster. YARN solves the problem of running different types of applications on the same platform: after the ResourceManager receives the various types of applications submitted by clients, YARN runs each of them as an ordinary application, so different job types can run on one YARN platform.
3> On the new YARN platform, resources are expressed in terms of memory, which is more reasonable than the old framework's count of remaining slots.
4> The Container is a new component that YARN introduces with a view to resource isolation in the future.
OK, that's it for this post. If you have any questions, feel free to leave a comment!

