First, understand the composition of a MapReduce job
A complete MapReduce job is called a Job, and it consists of three parts:
- Input data
- MapReduce Program
- Configuration information
When Hadoop runs a job, it divides the job into a number of tasks: map tasks and reduce tasks. Two types of nodes control the job execution process: the JobTracker and the TaskTracker.
- JobTracker: records the overall progress of the job and schedules the TaskTrackers
- TaskTracker: executes tasks and reports progress back to the JobTracker
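The three parts listed above come together in a driver program. Below is a minimal sketch, assuming the newer org.apache.hadoop.mapreduce API and a simple word-count job; all class names are illustrative, not taken from this article:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Part 2, the MapReduce program: a minimal mapper emitting (word, 1)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Part 2 continued: a minimal reducer summing the counts for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // part 3: configuration information
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);           // part 2: the MapReduce program
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // part 1: the input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```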
Second, large chunks of data flow first to map
Hadoop divides the input data into equal-length chunks called input splits (data shards) and builds one map task per split. Processing the splits in parallel certainly takes less time than processing the entire data set as a single block, but because node performance and the other jobs running on each node differ, the processing time of each split can vary, so slicing the data more finely gives better load balancing. On the other hand, if the splits are too small, the overhead of managing the splits and creating the map tasks grows. There is therefore a tradeoff between split size and the time spent processing the splits. For most jobs a split size of 64MB works well, and in fact Hadoop's default block size is also 64MB.
Because the block size matches the optimal split size, a split rarely spans multiple data blocks. The input split of a map task can then be read directly from a local data block, which avoids pulling split data from other nodes and saves network bandwidth.
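The split size can be tuned if the defaults do not fit. A hedged sketch of the relevant knobs, assuming the org.apache.hadoop.mapreduce.lib.input.FileInputFormat helpers and the dfs.blocksize property of Hadoop 2.x and later (the property values shown are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Block size of HDFS files written with this configuration; recent Hadoop
        // defaults to 128MB, while the versions described in this article used 64MB.
        conf.set("dfs.blocksize", String.valueOf(64L * 1024 * 1024));

        Job job = Job.getInstance(conf, "split size demo");
        // Lower/upper bounds on the input split size. By default the split size
        // works out to the block size, which keeps each map task's input local.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
```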
The map task's output is written to the local disk rather than to HDFS. Why? Because the map output is only an intermediate result: once the job completes it is deleted, and storing it in HDFS with replication for fault tolerance would be overkill. If a map task fails, Hadoop simply reruns that map task on another node to recreate the output.
Third, data flows from map to reduce
The reduce task does not have the advantage of data localization: the input of a single reduce task usually comes from the outputs of all the mappers. The sorted map outputs must be sent over the network to the node running the reduce task and merged on the reduce side. The output of reduce is normally stored in HDFS for reliability; the first replica of each HDFS block of the reduce output is stored on the local node, while the other replicas are stored on other nodes, so writing the reduce output also consumes network bandwidth.
Figure: MapReduce data flow with a single reduce task
The number of reduce tasks is not determined by the size of the input data; it is specified explicitly. When there are multiple reduce tasks, each map task partitions its output, creating one partition per reduce task. Records with the same key always end up in the same partition. The partitioning is controlled by the partition function, which by default hashes the key. The data flow between the map tasks and the reduce tasks is called the shuffle. Because the input of each reduce task comes from many map tasks, this stage is the most complex part of the job, and the shuffle parameters have a large impact on total job run time; MapReduce tuning is mostly about adjusting shuffle-stage parameters.
Figure: MapReduce data flow with multiple reduce tasks
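The default partition function behaves like a hash of the key modulo the number of reduce tasks. A minimal sketch of an equivalent custom Partitioner (the class name and key/value types are illustrative assumptions):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Behaves like Hadoop's default hash partitioning: records with the same key
// always land in the same partition, i.e. go to the same reduce task.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

In the driver, the number of reduce tasks and the partitioner would be wired up with job.setNumReduceTasks(4) and job.setPartitionerClass(KeyHashPartitioner.class).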
Fourth, how to reduce the amount of data flowing from map to reduce
The available bandwidth on the cluster limits the number of MapReduce jobs, because the intermediate map results are transferred to reduce over the network, so the most important point is to minimize the amount of data transferred between the map and reduce tasks. Hadoop lets the user specify a combiner (merge function) for the map output; the combiner's output becomes the input of the reduce function. Be aware that using a combiner must not change the result of the reduce function. For example, suppose two maps output map1={0,20,10} and map2={15,25} and we want the maximum. We can first combine each map's output and pass only the combined results to the reducer: map1={0,20,10}->combiner->{20}; map2={15,25}->combiner->{25}; reducer->{25}.
That is, max(0,20,10,15,25) = max(max(0,20,10), max(15,25)) = 25.
Figure: the combiner output used as the reducer input
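Because max is associative, the same reducer class can also be used as the combiner. A hedged sketch (the class name is an illustrative assumption):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits the maximum value seen for each key; safe to use as both combiner
// and reducer because max(max(a), max(b)) == max(a, b).
public class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }
}
```

In the driver this would be registered with job.setCombinerClass(MaxReducer.class) and job.setReducerClass(MaxReducer.class); the combiner then runs on the map side and only the per-map maxima cross the network.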
However, it is important to note that not every operation can use a combiner this way. For example, change the example above to compute the mean:
- Result with the combiner: avg(avg(0,20,10), avg(15,25)) = avg(10,20) = 15
- Result without the combiner: avg(0,20,10,15,25) = 14