Hadoop 2.0 Working Principle Learning


1 About HDFS

1.1 Hadoop 2.0 Introduction

Hadoop is an Apache distributed system infrastructure that provides storage and computing for massive amounts of data. Hadoop 2.0, the second-generation Hadoop system, has three core components: HDFS, MapReduce, and YARN. HDFS provides storage for massive amounts of data, MapReduce performs distributed computing, and YARN handles resource management.

Comparison of the structure of Hadoop 1.0 and Hadoop 2.0:


The major improvements to Hadoop 2.0 are:

1. Resource scheduling and management are handled by YARN, which allows Hadoop 2.0 to run more kinds of computing frameworks, such as Spark.

2. A NameNode HA scheme is implemented: there are two NameNodes (one active, one standby). If the active NameNode fails, the standby NameNode is switched to the active state and continues to provide service, guaranteeing high availability for the entire cluster.

3. HDFS Federation is implemented. Because the metadata lives in NameNode memory, that memory limits the size of the entire cluster; with HDFS Federation, several NameNodes form a federation that jointly manages the DataNodes, which allows the cluster to grow larger.

4. Hadoop RPC serialization is more extensible: the data type module has been separated from RPC and turned into an independent, pluggable module.

1.2 HDFS Overview

HDFS is a distributed file system with high fault tolerance. It can be deployed on inexpensive, general-purpose hardware and provides high-throughput data access for applications that process large datasets.

Main Features:

1. Support for large files: files of terabyte scale are supported.

2. Detection of and rapid response to hardware failure: HDFS's failure detection and redundancy mechanisms cope well with the frequent hardware failures of commodity hardware platforms.

3. High throughput: Batch processing of data.

4. Simplified consistency model: a write-once, read-many file access model, which helps throughput.

Scenarios HDFS is not suited for: low-latency data access, large numbers of small files, multiple concurrent writers, and in-place file modification.

The components of HDFS: the NameNode holds the HDFS namespace and records every modification to the file system metadata; DataNodes store HDFS data as files in their local file systems and know nothing about the HDFS files themselves.

Data block: the data block is the unit in which HDFS stores and processes files. In Hadoop 2.0 the default size is 128 MB, and it can be configured to suit the workload. Data blocks allow HDFS to hold files larger than any single disk of a storage node, and they simplify storage management, ease fault tolerance, and facilitate data replication.
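As a hedged illustration of how the block size can be configured, the sketch below uses the standard Hadoop Configuration and FileSystem APIs; the property name dfs.blocksize and the 128 MB default are those of Hadoop 2.x, while the 256 MB value and the file path are made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for 256 MB blocks instead of the 128 MB default; this only affects
        // files created by this client, existing files keep their block size.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Inspect the block size recorded for an existing file (hypothetical path).
        FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        fs.close();
    }
}
```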

1.3 HDFS Read and Write Processes

The process of reading a file:

1. The client opens the file with the open() call.

2. DistributedFileSystem calls the metadata node (NameNode) over RPC to obtain the block information for the file.

3. For each block, the metadata node returns the addresses of the data nodes that hold that block.

4. DistributedFileSystem returns an FSDataInputStream to the client, which is used to read the data.

5. The client calls the read() method of FSDataInputStream to start reading data.

6. FSDataInputStream connects to the nearest data node that holds the first block of the file.

7. Data is read from that data node to the client.

8. When the block has been read, FSDataInputStream closes the connection to that data node and then connects to the nearest data node that holds the next block of the file.

9. When the client has finished reading, it calls the close() method of FSDataInputStream.

10. If the client encounters an error while communicating with a data node, it tries the next data node that holds the block. The failed data node is recorded and not contacted again.

The HDFS read flow is shown in the following figure:
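In code, the client side of this read path can be sketched with the public FileSystem API; FileSystem.get() returns a DistributedFileSystem when fs.defaultFS points at HDFS, while the file path here is only an illustrative placeholder.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // DistributedFileSystem for an HDFS URI

        Path src = new Path("/tmp/hdfs-example.txt"); // hypothetical path
        // open() asks the NameNode for block locations and returns the stream (steps 2-4).
        try (FSDataInputStream in = fs.open(src);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            // Reading pulls data from the nearest DataNode of each block (steps 5-8).
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // close() ends the read (step 9)

        fs.close();
    }
}
```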


The process of writing a file:

1. The client calls the create() function to create the file.

2. DistributedFileSystem calls the metadata node (NameNode) over RPC to create a new file in the file system namespace.

3. The metadata node first checks that the file does not already exist and that the client has permission to create it, and then creates the new file.

4. DistributedFileSystem returns an FSDataOutputStream to the client for writing data.

5. The client starts writing data. FSDataOutputStream splits the data into packets and writes them to a data queue.

6. The data queue is read by the DataStreamer, which asks the metadata node to allocate data nodes for storing the blocks (each block is replicated 3 times by default). The allocated data nodes are arranged into a pipeline.

7. The DataStreamer writes each packet to the first data node in the pipeline; the first data node forwards the packet to the second data node, and the second forwards it to the third.

8. FSDataOutputStream keeps an ack queue for the packets it has sent and waits for the data nodes in the pipeline to confirm that the data has been written successfully.

9. If a data node fails during the write, the following happens: the pipeline is closed, and the packets in the ack queue are put back at the head of the data queue; the block currently being written on the healthy nodes is given a new identity by the metadata node, so that the partial block on the failed node is discarded when that node recovers; the failed data node is removed from the pipeline, and the rest of the block is written to the remaining two data nodes in the pipeline; the metadata node then notices that the block has too few replicas and arranges for a third copy to be created.

10. When the client finishes writing data, it calls close(), which flushes all remaining packets to the data nodes in the pipeline, waits for the acknowledgements in the ack queue, and finally notifies the metadata node that the write is complete.

The HDFS write flow is shown in the following figure:
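A matching sketch of the client side of the write path, again using the public FileSystem API; the destination path and the payload are hypothetical.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dst = new Path("/tmp/hdfs-example.txt"); // hypothetical path
        // create() goes through the NameNode (steps 2-3) and returns the stream
        // that feeds the DataNode pipeline (steps 4-8).
        try (FSDataOutputStream out = fs.create(dst, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } // close() waits for the pipeline acks and notifies the NameNode (step 10)

        fs.close();
    }
}
```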


2 YARN Principle Introduction

2.1 Background of YARN

The drawbacks of Hadoop 1.0 include:

1. Poor scalability: the JobTracker performs both resource management and job control, which is the biggest bottleneck of the whole system and severely restricts cluster expansion.

2. Poor reliability: the JobTracker is a single point of failure; a JobTracker problem makes the entire cluster unavailable.

3. Low resource utilization: resources cannot be shared among multiple tasks or allocated sensibly, so the various resources cannot be used effectively.

4. No support for multiple computing frameworks: Hadoop 1 supports only the MapReduce offline batch model and cannot support in-memory computing, stream computing, iterative computing, and so on.

Because of these drawbacks of Hadoop 1, Hadoop 2.0 introduced the resource manager YARN, which solves these problems effectively.

2.2 YARN Basic Architecture

YARN is the resource manager of Hadoop 2.0. It is a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications, bringing major benefits to the cluster in resource utilization, unified resource management, and data sharing.

YARN's basic design idea is to split the JobTracker of Hadoop 1.0 into two separate services: a global resource manager, the ResourceManager, and a per-application ApplicationMaster. The ResourceManager is responsible for resource management and allocation for the entire system, while each ApplicationMaster is responsible for managing a single application. The basic architecture is as follows:


Overall, YARN still has a master/slave structure. Within the resource management framework, the ResourceManager is the master and the NodeManagers are the slaves, and the ResourceManager can be made highly available through an HA scheme. The ResourceManager is responsible for the unified management and scheduling of the resources on each NodeManager. When a user submits an application, an ApplicationMaster must be provided to track and manage that program; it requests resources from the ResourceManager and asks the NodeManagers to start tasks that occupy some of those resources. Because different ApplicationMasters are distributed across different nodes, they do not affect one another.

ResourceManager: a global resource manager responsible for resource management and allocation for the whole system. It consists mainly of two components: the scheduler and the applications manager.

Scheduler: allocates the system's resources to the running applications according to constraints such as capacities and queues. The scheduler allocates resources purely based on the applications' resource requirements, and the unit of allocation is represented by an abstract "resource container" (Container). A Container is a dynamic resource allocation unit that encapsulates memory, CPU, disk, network, and other resources, and thereby limits the amount of resources each task can use.

Applications manager: responsible for managing all applications in the system, including application submission, negotiating resources with the scheduler to start the ApplicationMaster, monitoring the ApplicationMaster's running state, and restarting it on failure.

ApplicationMaster: each application submitted by a user has one ApplicationMaster. Its main functions are to negotiate with the ResourceManager's scheduler to obtain resources, to assign the obtained resources to the application's internal tasks, to communicate with NodeManagers to start or stop tasks, to monitor the running state of all tasks, and to re-request resources and restart a task when it fails.

NodeManager: the resource and task manager on each node. It periodically reports the node's resource usage and the running state of each Container to the ResourceManager, and it also receives and handles requests from ApplicationMasters, such as starting or stopping Containers.

Container: the resource abstraction in YARN. It encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk, and network. When an ApplicationMaster requests resources from the ResourceManager, the resources returned are represented as Containers. YARN assigns a Container to each task, and the task can only use the resources described by that Container.
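To make the resource model more concrete, the sketch below shows how an ApplicationMaster could describe and request a Container through YARN's public AMRMClient API; the memory and vcore numbers and the priority are arbitrary illustrative values, and a real ApplicationMaster would also register with the ResourceManager and collect the granted Containers in an allocate() heartbeat loop.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
    public static void main(String[] args) throws Exception {
        // Client used by an ApplicationMaster to talk to the ResourceManager scheduler.
        AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
        amrmClient.init(new YarnConfiguration());
        amrmClient.start();
        // A real ApplicationMaster calls registerApplicationMaster(...) here.

        // A Container is described by a Resource: 2048 MB of memory and 2 vcores
        // in this example (illustrative values).
        Resource capability = Resource.newInstance(2048, 2);
        Priority priority = Priority.newInstance(1);

        // Ask the ResourceManager for one container anywhere in the cluster
        // (no node or rack constraints).
        amrmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // The granted containers would be collected via allocate() in a heartbeat
        // loop and then started through an NMClient; omitted in this sketch.
        amrmClient.stop();
    }
}
```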

2.3 YARN Workflow

The YARN workflow is as follows:


Step 1: The user submits an application to YARN, including the user program, the ApplicationMaster program, the ApplicationMaster startup command, and so on.

Step 2: The ResourceManager allocates the first Container for the application and communicates with the corresponding NodeManager, asking it to launch the application's ApplicationMaster in that Container.

Step 3: The ApplicationMaster first registers with the ResourceManager so that the user can view the application's running state directly through the ResourceManager. The ApplicationMaster then requests resources for each task and monitors their running state until the run finishes, i.e. it repeats steps 4-7.

Step 4: The ApplicationMaster polls the ResourceManager over the RPC protocol to request and receive resources.

Step 5: Once the ApplicationMaster has obtained resources, it communicates with the corresponding NodeManager and asks it to start the task.

Step 6: After the NodeManager has set up the environment for the task (environment variables, jar packages, binaries, and so on), it writes the task launch command into a script and starts the task by running that script.

Step 7: Each task reports its status and progress to the ApplicationMaster over an RPC protocol, so that the ApplicationMaster always knows the running state of every task and can restart a task when it fails. At any time while the application is running, the user can query the ApplicationMaster over RPC for the application's current state.

Step 8: When the application finishes, the ApplicationMaster unregisters from the ResourceManager over the RPC protocol and shuts itself down.
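A hedged sketch of how a client performs step 1 with YARN's public YarnClient API; the application name, the ApplicationMaster class, and the resource numbers are made-up placeholders, and a real submission would also have to set up the ApplicationMaster's local resources (jars) and environment.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application (step 1).
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app"); // placeholder name

        // Describe how the NodeManager should launch the ApplicationMaster (step 2).
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java com.example.MyApplicationMaster")); // hypothetical AM class
        appContext.setAMContainerSpec(amContainer);

        // Resources for the ApplicationMaster's own Container (illustrative values).
        appContext.setResource(Resource.newInstance(1024, 1));

        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}
```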

3 Introduction to MapReduce

3.1 Introduction to the Principle of MapReduce

MapReduce is a parallel computing model and method for large-scale data processing originally developed by Google, and it is the computational model, framework, and platform for parallel processing in Hadoop.

The MapReduce execution flow consists of five stages in total: input, map, shuffle, reduce, and output, as shown in the following figure:
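The map and reduce stages correspond directly to user code. The classic word-count example below is a minimal sketch; the class names are illustrative, and in a project each public class would live in its own file.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: split each input line into words and emit (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce stage: the shuffle has grouped all counts for a word together; sum them.
// (In a project this public class goes in its own WordCountReducer.java file.)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```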


3.2 MapReduce 2 Operating Principle

The MapReduce workflow under the YARN framework is as follows:


Step 1: The client submits the job to the cluster.

Step 2: The job obtains a new application ID from the ResourceManager.

Step 3: The client checks the job's output specification, computes the input splits, and copies the job jar, configuration, split information, and so on to HDFS.

Step 4: The job is submitted to the ResourceManager.

Step 5: After the ResourceManager receives the job, it passes the request to the scheduler. The scheduler allocates a Container based on the job information, and the ResourceManager then starts an ApplicationMaster process in that Container, under the management of the NodeManager.

Step 6: The ApplicationMaster initializes the job, keeps track of it, and determines whether it has completed.

Step 7: The ApplicationMaster determines the number of map and reduce tasks based on the split information stored in HDFS.

Step 8: The ApplicationMaster polls the ResourceManager to request Containers for the job's map and reduce tasks.

Step 9: After the ApplicationMaster obtains the Containers, it communicates with the NodeManagers to start them.

Step 10: The Container localizes the resources the task needs by fetching the job jar, configuration, and distributed cache files from HDFS.

Step 11: The Container starts the map or reduce task.
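The client side of steps 1-4 is typically written against the Job API. A minimal driver sketch is shown below, reusing the hypothetical word-count Mapper and Reducer from section 3.1; the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // gets a new application ID (step 2)

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits are computed from this path, and the job jar, configuration,
        // and split information are copied to HDFS on submission (step 3).
        FileInputFormat.addInputPath(job, new Path("/tmp/wordcount/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/tmp/wordcount/output")); // placeholder

        // Submit to the ResourceManager and wait for completion (step 4 onwards).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```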

3.3 Shuffle and Sorting

The map-side output of MapReduce is passed as input to the reduce side, and this intermediate process of partitioning and sorting by key is called shuffle. The literal meaning of shuffle is to mix and redistribute: the data produced by the maps is routed to the different reduce tasks through partitioning, sorting, and related steps. The MapReduce data processing flow is as follows:


Map phase:

1. Each input split is processed by one map task; by default, the size of one HDFS block (64 MB in older releases, 128 MB in Hadoop 2.0, and configurable) is used as the split size. The map output is first placed in a circular in-memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default at 80% of its size, controlled by the io.sort.spill.percent property), a spill file is created in the local file system and the buffered data is written to that file.

2. Before writing to disk, the thread first divides the data into as many partitions as there are reduce tasks, so that each partition's data goes to one reduce task. This avoids the awkward situation where some reduce tasks are assigned large amounts of data while others receive little or none. The data within each partition is then sorted, and if a combiner is configured it is applied to the sorted output, which can effectively reduce disk I/O and network I/O (see the configuration sketch after this list).

3. By the time the map task writes its last record, there may be many spill files, and these must be merged. During the merge, sorting and combining are performed repeatedly, both to minimize the amount of data written to disk each time and to minimize the amount of data transferred over the network in the following copy phase. The result is a single partitioned and sorted file. To further reduce the data sent over the network, the map output can be compressed by setting mapred.compress.map.output to true.

4. The data in each partition is copied to the corresponding reduce task. How does a partition know which reduce task it corresponds to? The ApplicationMaster keeps the global information for the whole job; a reduce task simply asks the ApplicationMaster for the locations of the map outputs that belong to it.
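As a hedged sketch of the knobs mentioned in items 2 and 3, the code below sets a combiner, a custom partitioner, and map-output compression through the Job and Configuration APIs, reusing the hypothetical word-count classes from section 3.1. The property names in this article are the legacy Hadoop 1.x names; in Hadoop 2.x, io.sort.mb, io.sort.spill.percent, and mapred.compress.map.output correspond to mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, and mapreduce.map.output.compress. All values shown are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each map output key (map phase, item 2).
// In a real project this would be a public top-level class in its own file.
class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        // Route keys by their first character; the mask keeps the index non-negative.
        return (key.charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
    }
}

public class ShuffleTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map-side sort buffer size in MB (io.sort.mb in the legacy name used above).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Compress the map output before it is shuffled to the reducers (item 3).
        conf.setBoolean("mapreduce.map.output.compress", true);

        Job job = Job.getInstance(conf, "shuffle tuning sketch"); // illustrative name
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Run the reducer logic as a combiner over each map's sorted output (item 2).
        job.setCombinerClass(WordCountReducer.class);
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setNumReduceTasks(4); // illustrative value
        // ... input/output paths and key/value classes as in the driver in section 3.2 ...
    }
}
```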

Reduce phase:

1. The reduce task receives data from the different map tasks, and the data coming from each map is sorted. If the amount of data the reduce task receives is fairly small, it is kept directly in memory; if the amount exceeds a certain proportion of the buffer size, the data is merged and spilled to disk.

2. As the spill files accumulate, a background thread merges them into a larger sorted file, which saves time in later merges. In fact, on both the map side and the reduce side, MapReduce performs sorting and merging over and over again, which is why sorting is often called the soul of Hadoop.

3. Many intermediate files are produced (and written to disk) during merging, but MapReduce tries to write to disk as little as possible, and the result of the final merge is not written to disk at all; it is fed directly into the reduce function.
