A Brief Introduction to MapReduce and Its Workflow

Source: Internet
Author: User
The execution steps of the MR programming model:
1. Prepare the input data for map processing
2. Mapper processing
3. Shuffle
4. Reduce processing
5. Result output

(input) <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3> (output)
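
The key/value flow above can be illustrated with a word count. The following is a minimal sketch in plain Java with no Hadoop dependency; the class name WordCountFlow and its methods are hypothetical, and each inner list simply stands in for the lines of one Split. It is meant to show how the pairs change shape at each stage, not how Hadoop implements them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of <k1,v1> -> map -> combine -> reduce -> <k3,v3>.
class WordCountFlow {

    // map: <k1 = line offset, v1 = line text> -> list of <k2 = word, v2 = 1>
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // combine and reduce use the same logic here: sum the values of each key.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Runs the whole flow; each inner list stands for the lines of one Split.
    static Map<String, Integer> run(List<List<String>> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
            long offset = 0;
            for (String line : split) {
                mapped.addAll(map(offset, line));
                offset += line.length() + 1; // +1 for the line terminator
            }
            // combine: local pre-aggregation of one mapper's output
            shuffled.addAll(sumByKey(mapped).entrySet());
        }
        // reduce: final aggregation into <k3 = word, v3 = total count>
        return sumByKey(shuffled);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                List.of("hello hadoop", "hello mapreduce"),
                List.of("hadoop mapreduce hadoop"))));
    }
}
```

Because summing is associative, the same function can serve as both combiner and reducer in this sketch; that is not true of every reduce function.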

Process flow:

1. The input text is read through InputFormat -> FileInputFormat -> TextInputFormat. The getSplits method returns the array of Splits, and the getRecordReader method then reads each Split; each line is passed to one map call.

2. The output of all map tasks on a node is handled by the Partitioner on that node (the Shuffling process), which decides by key whether each record is sent to another node or stays on the same node for further processing.

3. Sort: the records are sorted by key.

4. The sorted results are processed by reduce.

5. After processing, the output is written to the local file system or to Hadoop by OutputFormat -> FileOutputFormat -> TextOutputFormat.
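
Steps 2 and 3 above can be sketched as follows. This is a plain-Java illustration, not Hadoop code: the class ShuffleSketch and its methods are hypothetical names. The partition formula mirrors the one used by Hadoop's default HashPartitioner (non-negative hash code modulo the number of reduce tasks); the sort is modelled simply by using a TreeMap per partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative shuffle: partition map output by key, then sort and group within each partition.
class ShuffleSketch {

    // Non-negative hash of the key, modulo the number of reducers.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Returns one sorted map per reducer: key -> all values emitted for that key.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            int p = partition(pair.getKey(), numReduceTasks);
            partitions.get(p)
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return partitions;
    }
}
```

All records with the same key land in the same partition, which is what lets a single reducer see every value for that key.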

Split: the chunk of data processed by MR and the smallest unit of computation in MR. By default it corresponds one-to-one with a Block in HDFS (the smallest storage unit in HDFS, 128 MB by default); the size can also be set manually (not recommended).
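
To make the Split/Block relationship concrete, here is a small sketch of how a split size and split count might be derived for one file. It is loosely modelled on the sizing rule used by Hadoop's FileInputFormat (clamp the block size between a minimum and a maximum); the class name SplitSizeSketch, the method names, and the 1.1 slack factor are assumptions to check against your Hadoop version.

```java
// Illustrative split sizing: with default limits the split size equals the block size.
class SplitSizeSketch {

    // Assumed slack: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Clamp the block size between the configured minimum and maximum split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts how many splits a file of the given length would produce.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // the final, possibly shorter (or slightly longer) split
        }
        return splits;
    }
}
```

Under these assumptions a 300 MB file with 128 MB blocks yields three splits, while a 130 MB file yields only one, because the small tail is folded into the last split.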

InputFormat: divides the input data into Splits through the method InputSplit[] getSplits(JobConf var1, int var2)

TextInputFormat: used to process data in text format
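
With TextInputFormat, each record handed to a mapper is a pair of the line's starting byte offset (the key) and the line's text (the value). The sketch below imitates that behaviour in plain Java; LineRecordSketch and readRecords are hypothetical names, and the sketch assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative line reader: turns text into <byte offset, line> records.
class LineRecordSketch {

    static Map<Long, String> readRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>(); // keeps file order
        if (text.isEmpty()) {
            return records;
        }
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // advance by the line's byte length plus one byte for '\n'
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readRecords("hello hadoop\nhello mapreduce\n"));
    }
}
```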

OutputFormat: writes the output data

[Figure: blocks, Splits, Mappers, and Reducers]
The diagram above shows:

In general, one Split corresponds to one block, but in the picture above the mapping has been set manually.

A file is divided into n blocks, which correspond to 2n Splits. After InputFormat processing, each Split is processed by one Mapper. After the Shuffle phase groups and sorts the map output, it is passed to multiple Reducers, and each Reducer generates one output file.

MapReduce 1.x architecture: one JobTracker + multiple TaskTrackers

JobTracker: responsible for resource management and job scheduling

TaskTracker: regularly reports the health, resources, and job status of its node to the JobTracker, and receives commands from the JobTracker, such as starting or killing tasks.
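
The report-and-command exchange between TaskTrackers and the JobTracker can be pictured as a heartbeat loop. The following is a toy sketch only: the class HeartbeatSketch, the command strings, and the 10-minute expiry value are all illustrative assumptions, not Hadoop's actual classes or protocol.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy JobTracker side of a heartbeat exchange (all names are hypothetical).
class HeartbeatSketch {

    // Assumed expiry: a tracker silent for longer than this is treated as lost.
    static final long EXPIRY_MS = 10 * 60 * 1000L;

    private final Map<String, Long> lastSeen = new HashMap<>();

    // Called when a TaskTracker reports in; the return value is the command sent back.
    String heartbeat(String trackerName, int freeSlots, long nowMs) {
        lastSeen.put(trackerName, nowMs);
        return freeSlots > 0 ? "LAUNCH_TASK" : "NO_ACTION";
    }

    // Trackers whose last heartbeat is older than the expiry interval.
    Set<String> lostTrackers(long nowMs) {
        Set<String> lost = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMs - e.getValue() > EXPIRY_MS) {
                lost.add(e.getKey());
            }
        }
        return lost;
    }
}
```

In this sketch the JobTracker both schedules work (by answering heartbeats) and tracks node liveness, which reflects the combined resource-management and job-scheduling role described above.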

MapReduce 2.x:
[Figure: MapReduce 2.x architecture]
