Hadoop Data Flow (lifecycle)


To learn Hadoop, you first need to understand its data flow, just as you would learn the lifecycle of a servlet. Hadoop is a distributed storage system (HDFS) plus a distributed computing framework (MapReduce), but it also has an important property: Hadoop moves the MapReduce computation to the machines that store the data, rather than the other way around.


The term MapReduce job denotes a unit of work that a client wants performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
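To make those three ingredients concrete, here is a minimal sketch of a client-side driver built on Hadoop's Java API (the org.apache.hadoop.mapreduce interfaces). The class names WordCountMapper and WordCountReducer and the /input and /output paths are placeholders, not anything from the original article; the two classes are sketched further below.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();         // configuration information
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);         // the MapReduce program
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/input"));     // the input data
            FileOutputFormat.setOutputPath(job, new Path("/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }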





Hadoop has two types of nodes that control the execution process: one jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on the tasktrackers. While a task is running, its tasktracker sends progress reports to the jobtracker, which uses them to keep a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker node.


Hadoop divides the input to a MapReduce job into fixed-length pieces called input splits (or fragments). Hadoop creates one map task per split, and that task runs the user-defined map function on every record in the split.
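As an illustration, a user-defined map function for the classic word count could look like the following sketch, again assuming the org.apache.hadoop.mapreduce API; the class name matches the placeholder used in the driver above.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Invoked once for every record (here: every line) in the input split.
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit the intermediate pair (word, 1)
            }
        }
    }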


For most jobs, a reasonable split size is the size of one HDFS block, 64 MB by default, although this default can be changed per cluster. The split size should be chosen with the job in mind: if splits are too small, the overhead of managing them and of constructing map tasks starts to dominate the total execution time of the job.
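For an individual job, the split size can also be steered through the FileInputFormat bounds, as in this minimal sketch; the 128 MB figure is purely illustrative.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfig {
        // Pin both bounds so that (except for a file's final split)
        // every split is exactly `bytes` long.
        static void forceSplitSize(Job job, long bytes) {
            FileInputFormat.setMinInputSplitSize(job, bytes);
            FileInputFormat.setMaxInputSplitSize(job, bytes);
        }
    }

    // usage, e.g. in the driver: SplitSizeConfig.forceSplitSize(job, 128L * 1024 * 1024);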





Hadoop gets the best performance by running each map task on a node that stores that task's input data; this is called the data locality optimization. A block is the smallest unit of data HDFS stores: each block exists on several nodes at once (as replicas), and the blocks of a file are scattered across many nodes. So if the input split of a map task spanned multiple blocks, essentially no single node would happen to hold all of those blocks at the same time; the map task would first have to copy the missing blocks over the network to its own node before running the map function, which is clearly inefficient. This is why the optimal split size is the block size.

Map tasks write their output to local disk, not to HDFS. This is because the map output is an intermediate result: it is processed by the reduce tasks to produce the final result (which is saved in HDFS), and once the job completes, the map output can be deleted.

Reduce tasks do not have the data locality advantage: the input to a single reduce task normally comes from the output of all the map tasks. The output of a reduce task is usually stored in HDFS for reliability.
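A matching reduce function for the word count sketch might look as follows; its output pairs are what get written to HDFS.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        // Invoked once per key, with every value the map tasks emitted for it.
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            result.set(sum);
            context.write(key, result);   // final (word, total) pair, stored in HDFS
        }
    }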


The data flow of a job varies with the number of reduce tasks configured, but the overall pattern stays similar. The number of reduce tasks is not determined by the size of the input data; it is specified independently in the job configuration.
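In the driver sketched earlier, that configuration is a single call; the value 4 here is an arbitrary illustration.

    job.setNumReduceTasks(4);   // explicit reducer count, set by the client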


Single reduce task

With a single reduce task, the output of every map task is transferred across the network to the one node running the reduce.


Multiple reduce tasks

When there are multiple reduce tasks, each map task partitions its output, creating one partition for each reduce task. Partitioning is controlled by a user-definable partition function (the Partitioner); the default partitioner buckets records by hashing the key.
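A custom Partitioner can replace that default; the sketch below simply mirrors the hash scheme the article describes (mask off the sign bit, then take the key's hash modulo the reducer count) and would be registered with job.setPartitionerClass(...) in the driver.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashingPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // One partition per reduce task: bucket by the key's hash.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }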

The data flow between the map tasks and the reduce tasks is known as the shuffle.


No reduce tasks

Finally, there are situations where no reduce task is needed at all, because the processing can be carried out entirely in parallel by the map tasks.
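In that case the job is configured as map-only; with zero reducers there is no shuffle, and each map task writes its output directly to HDFS.

    job.setNumReduceTasks(0);   // map-only job: no shuffle, no reduce phase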




Combiner (merge function)

As an aside, Hadoop allows the user to specify a combiner, a merge function that is run on the output of each map task; the combiner's output then becomes the input to the reduce function. The combiner is really an optimization: by executing the merge function (often a copy of the reduce function) right after the map task runs, Hadoop reduces the amount of data that has to travel over the network.
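For word count, summing is associative and commutative, so the reducer class itself can serve as the combiner; in the driver sketched earlier this is one extra line.

    job.setCombinerClass(WordCountReducer.class);   // pre-aggregate map output locally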

Original link: http://blog.csdn.net/chaofanwei/article/details/39695743
