[Hadoop] MapReduce Principles in Brief

Source: Internet
Author: User
Tags: shuffle

1. Map input: the input data is first cut into equal-sized splits, and one map task is created for each split. The split size cannot be set arbitrarily; it is generally the same as the HDFS block size (64 MB by default). The largest contiguous slice of input data guaranteed to reside on a single node is one HDFS block, so if the split size is larger than the HDFS block size, records must be transferred between nodes, which consumes network bandwidth.
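The split-size rule described above can be sketched as follows. The clamping formula mirrors the one used by Hadoop's FileInputFormat (max of the minimum split size and the min of the maximum split size and the block size); the function name and defaults here are illustrative, not Hadoop's actual API.

```python
def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    """Choose a split size, clamped toward the HDFS block size.

    Modeled on Hadoop's rule: max(min_size, min(max_size, block_size)).
    With default settings the split size equals the block size.
    """
    return max(min_size, min(max_size, block_size))

# With a 64 MB block and default limits, each split is exactly one block.
print(compute_split_size(64 * 1024 * 1024))  # 67108864
```

Raising `min_size` above the block size forces splits to span blocks, which is exactly the cross-node transfer the text warns about.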

2. Each map task calls the user-written map function to process its split and writes the results to local disk (not HDFS).
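A minimal sketch of such a map function, using the maximum-temperature example discussed below. The `"year temperature"` line format is an assumption made for illustration; real weather records in the book's example are fixed-width NCDC lines.

```python
def max_temp_map(line):
    """Map function: parse a record like '1950 20' into a (year, temp) pair."""
    year, temp = line.split()
    return (year, int(temp))

records = ["1950 0", "1950 20", "1950 10"]
print([max_temp_map(r) for r in records])
# [('1950', 0), ('1950', 20), ('1950', 10)]
```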

3. The map output may optionally go through a combiner. The combiner mainly reduces the volume of data transferred between map and reduce; it is not a required step. Consider the maximum-temperature example from "Hadoop: The Definitive Guide".

First map output:

(1950, 0)

(1950, 20)

(1950, 10)

Second map output:

(1950, 25)

(1950, 15)

Without a combiner, all of the map output is transferred to reduce, which receives the following as input:

(1950, [0, 20, 10, 25, 15])

With a combiner, the output is first processed locally on each map (the maximum temperature seen by that map is computed) and only then sent to reduce, as follows:

First map output after the combiner:

(1950, 20)

Second map output after the combiner:

(1950, 25)

Reduce then receives the following input, so the amount of data transferred between map and reduce is reduced:

(1950, [20, 25])

4. The combiner output (or the raw map output, if no combiner is used) then goes through the shuffle. The shuffle routes each record from a map to a reduce via a partition operation, so that every record with the same key is processed by the same reduce. The partition function can be customized, or the default can be used; the default partitioner hashes each key so that records with the same key map to the same reduce.
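The default hash partitioning can be sketched in a few lines. Hadoop's actual HashPartitioner computes `key.hashCode() % numReduceTasks` in Java; the Python version below only illustrates the idea that hashing makes the assignment deterministic per key.

```python
def default_partition(key, num_reducers):
    """Default-style partitioner: hash the key so that equal keys
    always land on the same reduce task."""
    return hash(key) % num_reducers

# Every record with key '1950' is routed to the same reducer index.
p = default_partition("1950", 4)
assert all(default_partition("1950", 4) == p for _ in range(100))
assert 0 <= p < 4
```

This per-key determinism is what guarantees that a reducer sees the complete list of values for each of its keys.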

5. Reduce calls the user-defined reduce function to process the data, and the output is stored in HDFS.
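Continuing the maximum-temperature example, the reduce function is a one-liner over the grouped values produced by the shuffle; the function name is illustrative.

```python
def max_temp_reduce(key, values):
    """Reduce function: emit the maximum temperature observed for a year."""
    return (key, max(values))

# Input grouped by the shuffle, after the combiner ran on each map:
print(max_temp_reduce("1950", [20, 25]))  # ('1950', 25)
# The same answer results without the combiner, just with more data moved:
print(max_temp_reduce("1950", [0, 20, 10, 25, 15]))  # ('1950', 25)
```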
