Hadoop in the Big Data Era (iii): Hadoop Data Flow (life cycle)


To understand Hadoop, you first need to understand its data flow, much as you would learn the life cycle of a servlet. Hadoop combines distributed storage (HDFS) with a distributed computing framework (MapReduce), and it has one very important characteristic: Hadoop moves the MapReduce computation to the machines that store the relevant pieces of data, rather than moving the data to the computation.


Terminology. A MapReduce job is the unit of work that the client wants performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into smaller tasks, of which there are two types: map tasks and reduce tasks.
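To make these three ingredients concrete, here is a minimal job driver sketch using the newer org.apache.hadoop.mapreduce API. The word-count mapper and reducer (Hadoop's built-in TokenCounterMapper and IntSumReducer) and the class name WordCountDriver are illustrative choices, not something prescribed by this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                 // configuration information
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenCounterMapper.class);             // the MapReduce program:
        job.setReducerClass(IntSumReducer.class);                 // map and reduce classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // final output goes to HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```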


There are two types of nodes that control the job execution process: a JobTracker and a number of TaskTrackers. The JobTracker coordinates all the jobs running on the system by scheduling tasks to run on TaskTrackers. TaskTrackers run the tasks and send progress reports to the JobTracker, which keeps a record of the overall progress of each job. If a task fails, the JobTracker can reschedule it on a different TaskTracker node.
Input. Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or simply splits. Hadoop creates one map task for each split, and that task runs the user-defined map function over every record in the split.
For most jobs, a reasonable split size is the size of an HDFS block, 64 MB by default, although this default can be changed for the cluster. The split size has to be weighed against the tasks being run: if the splits are too small, the overhead of managing the splits and of creating the map tasks starts to dominate the total job execution time.
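As an illustration, the split size can also be constrained per job through the input format. The helper class below is hypothetical, and the 64 MB / 128 MB figures are only examples; FileInputFormat.setMinInputSplitSize and setMaxInputSplitSize are the relevant knobs in the newer API.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    // Hypothetical helper: bound the input split size for a job.
    public static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // no split smaller than 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // no split larger than 128 MB
    }
}
```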

Hadoop gets the best performance by running a map task on the node where its input data is stored; this is known as the data locality optimization. A block is the smallest unit of storage in HDFS, each block is replicated on several nodes, and a file is divided into blocks that are spread across many nodes. If the input split of a map task spanned multiple blocks, it is very unlikely that any single node would hold all of those blocks at once, so the map task would first have to copy the missing blocks to its node over the network before running the map function, which is clearly inefficient. This is why the optimal split size is the same as the block size.
Output. The map task writes its output to the local disk, not to HDFS. This is because the map output is an intermediate result: it is processed by reduce tasks to produce the final output, which is stored in HDFS. Once the job completes, the map output can be thrown away.
A reduce task does not have the data locality advantage: the input to a single reduce task normally comes from the output of all the map tasks. The output of a reduce task is normally stored in HDFS for reliability.
Data flow. The data flow of a job varies with the number of reduce tasks that are configured, though the overall shape is similar in each case. The number of reduce tasks is not governed by the size of the input; it is specified explicitly by the user.
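As a small illustration (reusing the job object from the driver sketch above, which is an assumption), the reduce count is simply a job setting:

```java
// The number of reduce tasks is chosen by the user, not derived from the input size.
job.setNumReduceTasks(1);    // single reduce task: one partition, one output file
// job.setNumReduceTasks(4); // multiple reduce tasks: map output split into 4 partitions
```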
 Single reduce task
Multiple reduce tasks. When there are multiple reduce tasks, each map task partitions its output, creating one partition for each reduce task. Partitioning can be controlled by a user-defined partitioning function, but the default partitioner (Partitioner) buckets keys with a hash function, as sketched below.
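For illustration, here is a minimal custom partitioner in the spirit of Hadoop's default HashPartitioner; the class name and the Text/IntWritable key and value types are assumptions made for the sketch.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each map output record to one of numPartitions buckets, one bucket per
// reduce task. This mirrors what the default hash-based partitioner does.
public class MyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative, then bucket by key hash.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered with job.setPartitionerClass(MyHashPartitioner.class); records whose keys hash to the same bucket always end up at the same reducer.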
The data flow between the map tasks and the reduce tasks is known as the shuffle.
 
No reduce task. Of course, there are also situations where no reduce task is needed at all, because the processing is completely parallel; a sketch follows.
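A minimal sketch of that map-only case, again assuming the job object from the driver above:

```java
// Map-only job: with zero reduce tasks there is no shuffle, and each map task
// writes its output directly to HDFS through the configured output format.
job.setNumReduceTasks(0);
// No reducer (and no combiner) is set; only the map function runs.
```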
 
Combiner (merge function). While we are at it, a word about the combiner. Hadoop allows the user to specify a combiner function to be run on the map output; the output of the combiner then becomes the input to the reduce function. The combiner is purely an optimization: by running the combiner function (often the same code as the reduce function) locally on the map node after the map task finishes, the amount of data transferred over the network to the reducers is reduced.
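A sketch of wiring this up, assuming a sum-style reducer whose logic is safe to reuse as the combiner (such as the built-in IntSumReducer from the driver sketch above):

```java
// The combiner runs on the map side to pre-aggregate map output before it is
// sent across the network to the reducers. Reusing the reducer as the combiner
// is only valid when the reduce function is commutative and associative, as a sum is.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
```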
