Understanding MapReduce Data Flow

First, understand the composition of a MapReduce job

A complete MapReduce job (a Job) consists of three parts:
      1. Input data
      2. MapReduce Program
      3. Configuration information
When Hadoop runs a job, it divides the job into a number of tasks: map tasks and reduce tasks. Two types of nodes control the process of job execution: the JobTracker and the TaskTracker.
      • JobTracker: records the overall progress of the job and schedules the TaskTrackers
      • TaskTracker: runs the tasks and reports progress to the JobTracker
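
To make these three parts concrete, here is a minimal sketch of a driver using the standard Hadoop MapReduce API. The class names (MaxValueJob, MaxMapper, MaxReducer), the tab-separated input format, and the command-line paths are illustrative assumptions, not code from the original article; the job computes a per-key maximum, which parallels the combiner example later in the article.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxValueJob {

    // Part 2: the MapReduce program. The mapper parses lines of the assumed
    // form "key<TAB>value" and emits (key, value) pairs.
    public static class MaxMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
            }
        }
    }

    // The reducer keeps the maximum value seen for each key.
    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        // Part 3: configuration information.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max value");
        job.setJarByClass(MaxValueJob.class);
        job.setMapperClass(MaxMapper.class);
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Part 1: the input data, plus the output directory for the reduce results in HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The input and output paths are passed on the command line, for example `hadoop jar myjob.jar MaxValueJob /input /output` (jar name and paths hypothetical).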
Second, large chunks of data flow to map first

Hadoop divides the input data into equal-length chunks called input splits (data shards) and builds one map task for each shard. Processing the shards in parallel certainly takes less time than processing the entire data set in one piece, but because node performance and job conditions differ, each shard may take a different amount of time to process, so slicing the data more finely gives better load balancing. On the other hand, if the shards are too small, the time spent managing the shards and creating the map tasks grows. There is therefore a tradeoff between shard size and processing time. For most jobs a shard size of 64 MB works well; in fact, Hadoop's default block size is also 64 MB. Because the block size matches the ideal shard size, a shard does not easily span data blocks, so the input split of a map task can be read directly from the local data block. This avoids having to fetch shard data from other nodes and saves network overhead.

The output of a map task is written to the local disk rather than to HDFS. Why? Because the map output is only an intermediate result that is deleted once the job completes; storing it in HDFS with replicated, fault-tolerant copies would be overkill. If a map task fails, Hadoop simply reruns it on another node.

Third, data flows from map to reduce

The reduce task does not have the advantage of data locality: the input of a single reduce task usually comes from the output of all the mappers. The sorted map outputs must be sent across the network to the node running the reduce task and merged on the reduce side. The output of reduce is typically stored in HDFS for reliability: the first replica of each reduce-output HDFS block is stored on the local node, and the other replicas are stored on other nodes, so writing the reduce output also consumes network bandwidth.
For example: MapReduce data flow with a single reduce task
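
Returning to the shard (split) size tradeoff from the second section, here is a minimal sketch, assuming the standard FileInputFormat API, of how the split size is derived and can be adjusted. The 64 MB figure matches the default block size mentioned above; the class and job names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split size demo");

        // FileInputFormat computes each split size roughly as:
        //   max(minSplitSize, min(maxSplitSize, blockSize))
        // so by default one split corresponds to one HDFS block (e.g. 64 MB here),
        // which keeps a map task's input local to a single data block.
        long sixtyFourMb = 64L * 1024 * 1024;
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, sixtyFourMb);
    }
}
```

In practice the defaults already give one split per HDFS block, so these calls are only needed when you deliberately want smaller or larger splits.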

The number of reduce tasks is not determined by the size of the input data; it is specified explicitly. When there are multiple reduce tasks, each map task partitions its output, creating one partition per reduce task. Records with the same key are always sent to the same partition. The partitioning is controlled by the partition function, which normally hashes the key. The data flow between the map tasks and the reduce tasks is called the shuffle. Because the input of each reduce task comes from many map tasks, this stage is more complex, and the parameter settings of the shuffle have a very large impact on total job run time; general MapReduce tuning is mostly about adjusting the parameters of the shuffle stage.

For example: data flow with multiple reduce tasks
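As a minimal sketch of the hash partitioning just described (the class name KeyHashPartitioner is a placeholder; Hadoop's built-in HashPartitioner behaves essentially the same way):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Records with the same key hash to the same partition, and each partition
// is consumed by exactly one reduce task.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

In the driver, the number of reduce tasks and the partitioner are set explicitly, for example with `job.setNumReduceTasks(4)` and `job.setPartitionerClass(KeyHashPartitioner.class)`.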

Fourth, how to reduce the amount of data flowing from map to reduce

The available bandwidth on the cluster limits the number of MapReduce jobs it can run, because the intermediate map results are transferred to reduce over the network. The most important optimization is therefore to minimize the amount of data transferred between the map and reduce tasks. Hadoop allows the user to specify a combiner (merge function) for the output of a map task, and the combiner's output becomes the input of the reduce function. Note, however, that using a combiner must not change the result of the reduce function. For example, suppose the outputs of two maps are map1 = {0, 20, 10} and map2 = {15, 25}, and we want the maximum. We can first combine the data of each map locally and pass only the combined result to the reducer:

      map1 = {0, 20, 10} -> combiner -> {20}
      map2 = {15, 25}    -> combiner -> {25}
      reducer -> {25}

That is, max(0, 20, 10, 15, 25) = max(max(0, 20, 10), max(15, 25)) = 25.
For example: the combiner output as the input of the reducer
     However, note that not every operation can use a combiner. For example, if the example above is changed to compute the mean:
      • Result with a combiner: avg(avg(0, 20, 10), avg(15, 25)) = avg(10, 20) = 15
      • Result without a combiner: avg(0, 20, 10, 15, 25) = 14
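
A minimal sketch of wiring up the combiner, assuming the MaxValueJob mapper and reducer from the earlier sketch (themselves illustrations, not code from the article): because max is associative and commutative, the reducer itself can be reused as the combiner.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxValueWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max value with combiner");
        job.setJarByClass(MaxValueWithCombiner.class);
        job.setMapperClass(MaxValueJob.MaxMapper.class);

        // Because max is associative and commutative, the reducer can double as the
        // combiner: max(max(0,20,10), max(15,25)) == max(0,20,10,15,25) == 25.
        job.setCombinerClass(MaxValueJob.MaxReducer.class);
        job.setReducerClass(MaxValueJob.MaxReducer.class);

        // This shortcut is NOT valid for a plain averaging reducer:
        // avg(avg(0,20,10), avg(15,25)) = 15, but avg(0,20,10,15,25) = 14.
        // An average job would need a combiner that emits partial (sum, count) pairs.

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```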