Hadoop Data Flow (lifecycle)


To learn Hadoop, you first need to understand its data flow, just as you would learn the lifecycle of a servlet. Hadoop is a distributed storage system (HDFS) plus a distributed computing framework (MapReduce), but it also has an important property: Hadoop moves the MapReduce computation to the machines that store the data, rather than the other way around.


The term MapReduce job denotes a unit of work that a client wants performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
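To make those three ingredients concrete, here is a minimal sketch of a client-side driver built on Hadoop's Java API (the org.apache.hadoop.mapreduce interfaces). The class names WordCountMapper and WordCountReducer and the /input and /output paths are placeholders, not anything from the original article; the two classes are sketched further below.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();         // configuration information
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);         // the MapReduce program
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/input"));     // the input data
            FileOutputFormat.setOutputPath(job, new Path("/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }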





Hadoop has two types of nodes that control the execution process: one jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on the tasktrackers. While a task is running, its tasktracker sends progress reports to the jobtracker, which uses them to keep a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker node.


Hadoop divides the input to a MapReduce job into fixed-length pieces called input splits (or fragments). Hadoop creates one map task per split, and that task runs the user-defined map function on every record in the split.
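As an illustration, a user-defined map function for the classic word count could look like the following sketch, again assuming the org.apache.hadoop.mapreduce API; the class name matches the placeholder used in the driver above.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Invoked once for every record (here: every line) in the input split.
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit the intermediate pair (word, 1)
            }
        }
    }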


For most jobs, a reasonable split size is the size of one HDFS block, 64 MB by default, although this default can be changed per cluster. The split size should be chosen with the job in mind: if splits are too small, the overhead of managing them and of constructing map tasks starts to dominate the total execution time of the job.
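For an individual job, the split size can also be steered through the FileInputFormat bounds, as in this minimal sketch; the 128 MB figure is purely illustrative.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfig {
        // Pin both bounds so that (except for a file's final split)
        // every split is exactly `bytes` long.
        static void forceSplitSize(Job job, long bytes) {
            FileInputFormat.setMinInputSplitSize(job, bytes);
            FileInputFormat.setMaxInputSplitSize(job, bytes);
        }
    }

    // usage, e.g. in the driver: SplitSizeConfig.forceSplitSize(job, 128L * 1024 * 1024);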





Hadoop gets the best performance by running each map task on a node that stores that task's input data; this is called the data locality optimization. A block is the smallest unit of data HDFS stores: each block exists on several nodes at once (as replicas), and the blocks of a file are scattered across many nodes. So if the input split of a map task spanned multiple blocks, essentially no single node would happen to hold all of those blocks at the same time; the map task would first have to copy the missing blocks over the network to its own node before running the map function, which is clearly inefficient. This is why the optimal split size is the block size.

Map tasks write their output to local disk, not to HDFS. This is because the map output is an intermediate result: it is processed by the reduce tasks to produce the final result (which is saved in HDFS), and once the job completes, the map output can be deleted.

Reduce tasks do not have the data locality advantage: the input to a single reduce task normally comes from the output of all the map tasks. The output of a reduce task is usually stored in HDFS for reliability.
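A matching reduce function for the word count sketch might look as follows; its output pairs are what get written to HDFS.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        // Invoked once per key, with every value the map tasks emitted for it.
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            result.set(sum);
            context.write(key, result);   // final (word, total) pair, stored in HDFS
        }
    }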


The data flow of a job varies with the number of reduce tasks configured, but the overall pattern stays similar. The number of reduce tasks is not determined by the size of the input data; it is specified independently in the job configuration.
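In the driver sketched earlier, that configuration is a single call; the value 4 here is an arbitrary illustration.

    job.setNumReduceTasks(4);   // explicit reducer count, set by the client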


Single reduce task

With a single reduce task, the output of every map task is transferred across the network to the one node running the reduce.


Multiple reduce tasks

When there are multiple reduce tasks, each map task partitions its output, creating one partition for each reduce task. Partitioning is controlled by a user-definable partition function (the Partitioner); the default partitioner buckets records by hashing the key.
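A custom Partitioner can replace that default; the sketch below simply mirrors the hash scheme the article describes (mask off the sign bit, then take the key's hash modulo the reducer count) and would be registered with job.setPartitionerClass(...) in the driver.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashingPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // One partition per reduce task: bucket by the key's hash.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }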

The data flow between the map tasks and the reduce tasks is known as the shuffle.


No reduce tasks

Finally, there are situations where no reduce task is needed at all, because the processing can be carried out entirely in parallel by the map tasks.
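In that case the job is configured as map-only; with zero reducers there is no shuffle, and each map task writes its output directly to HDFS.

    job.setNumReduceTasks(0);   // map-only job: no shuffle, no reduce phase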




Combiner (merge function)

As an aside, Hadoop allows the user to specify a combiner, a merge function that is run on the output of each map task; the combiner's output then becomes the input to the reduce function. The combiner is really an optimization: by executing the merge function (often a copy of the reduce function) right after the map task runs, Hadoop reduces the amount of data that has to travel over the network.
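For word count, summing is associative and commutative, so the reducer class itself can serve as the combiner; in the driver sketched earlier this is one extra line.

    job.setCombinerClass(WordCountReducer.class);   // pre-aggregate map output locally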

Original link: http://blog.csdn.net/chaofanwei/article/details/39695743
