MapReduce Execution Process

Source: Internet
Author: User

Execution process of a Mapper task:
  • The first stage divides the input files into input splits (InputSplit) according to a fixed rule, with a fixed maximum size per split. By default, the size of an input split equals the size of a data block (block). Suppose the block size is 64MB, the default assumed here, and there are two input files, one of 32MB and one of 72MB. The small file becomes one input split. The large file is divided into two data blocks and therefore two input splits (64MB and 8MB). Three input splits are produced in total. Each input split is processed by one Mapper process, so here three Mapper processes do the work.
  • The second stage parses the records in each input split into key-value pairs according to a fixed rule. The default rule parses each line of text into one key-value pair: the key is the starting position of the line (in bytes), and the value is the text content of the line.
  • The third stage calls the map method of the Mapper class. The map method is called once for each key-value pair parsed in the second stage, so 1000 key-value pairs result in 1000 calls. Each call outputs zero or more key-value pairs.
  • The fourth stage partitions the key-value pairs output by the third stage according to a fixed rule. Partitioning is based on the key. For example, if the keys are provinces (such as Beijing, Shanghai, Shandong), the pairs can be partitioned by province, with all pairs for the same province placed in the same partition. The number of partitions equals the number of Reducer tasks. By default there is only one Reducer task, and therefore only one partition.
  • The fifth stage sorts the key-value pairs within each partition. Pairs are sorted by key first, and in this description pairs with the same key are then sorted by value. (Hadoop's default sort compares keys only, so the order of values within a key is guaranteed only when a secondary sort is configured.) For example, given the three pairs <2,2>, <1,3>, <2,1>, where keys and values are integers, the sorted result is <1,3>, <2,1>, <2,2>. If there is a sixth stage, the data moves on to it. If not, the output is written directly to a file on the local Linux file system.
  • The sixth stage performs local processing of the data, that is, reduce processing on the Mapper side (in Hadoop this step is usually implemented as a combiner). The reduce method is called once for each group of key-value pairs with equal keys. After this stage the amount of data is reduced. The result is written to a file on the local Linux file system. This stage does not run by default; the user must add the code for it.
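
The six stages above can be sketched in plain Python. This is a simplified simulation, not Hadoop code: the function names, the word-count map function, and the byte-sum partitioner are all hypothetical stand-ins chosen for illustration, and real Hadoop applies additional rules when computing splits.

```python
from collections import defaultdict

MB = 1024 * 1024

def input_splits(file_sizes, block_size=64 * MB):
    """Stage 1: cut each file into splits no larger than one block."""
    splits = []
    for name, size in file_sizes.items():
        offset = 0
        while offset < size:
            length = min(block_size, size - offset)
            splits.append((name, offset, length))
            offset += length
    return splits

def parse_records(data):
    """Stage 2: yield (byte offset of the line start, line text) pairs."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode("utf-8")
        offset += len(line)

def map_fn(offset, line):
    """Stage 3: hypothetical word-count map; emits 0 or more pairs per call."""
    for word in line.split():
        yield word, 1

def partition(key, num_reducers):
    """Stage 4: choose a partition from the key (stand-in for a hash partitioner)."""
    return sum(key.encode("utf-8")) % num_reducers

def combine(sorted_pairs):
    """Stage 6 (optional): local reduce that sums the values of equal keys."""
    out = []
    for key, value in sorted_pairs:
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + value)
        else:
            out.append((key, value))
    return out

def run_mapper(data, num_reducers=1, use_combiner=False):
    partitions = defaultdict(list)
    for offset, line in parse_records(data):
        for key, value in map_fn(offset, line):
            partitions[partition(key, num_reducers)].append((key, value))
    # Stage 5: sort the pairs inside each partition (tuples sort by key, then value).
    result = {p: sorted(pairs) for p, pairs in partitions.items()}
    if use_combiner:
        result = {p: combine(pairs) for p, pairs in result.items()}
    return result
```

Calling `input_splits({"small": 32 * MB, "large": 72 * MB})` yields three splits, matching the example in the first stage, and `sorted([(2, 2), (1, 3), (2, 1)])` reproduces the ordering shown in the fifth stage.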

Execution process of a Reducer task:

    • In the first stage, the Reducer task actively copies the key-value pairs output by the Mapper tasks. There may be many Mapper tasks, so the Reducer copies the output of multiple Mappers.
    • The second stage merges all the data copied to the Reducer, combining the scattered pieces into one large data set, and then sorts the merged data.
    • The third stage calls the reduce method on the sorted key-value pairs. The reduce method is called once for each group of key-value pairs with equal keys, and each call produces zero or more key-value pairs. Finally, these output key-value pairs are written to a file in HDFS.
    • Throughout the development of a MapReduce program, most of the effort goes into overriding the map function and the reduce function.
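
The Reducer stages above can be sketched the same way. This is a minimal simulation in plain Python, assuming each Mapper's output arrives as a list of (key, value) pairs already sorted by key; `reduce_fn` and `run_reducer` are hypothetical names, and the summing reduce is only an example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_fn(key, values):
    """Hypothetical reduce: sum the values of one key; emits 0 or more pairs."""
    yield key, sum(values)

def run_reducer(mapper_outputs):
    # Stages 1-2: take the sorted output copied from every mapper
    # and merge it into a single sorted stream.
    merged = heapq.merge(*mapper_outputs)
    # Stage 3: call reduce once per group of pairs with equal keys.
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        results.extend(reduce_fn(key, (value for _, value in group)))
    return results
```

For example, `run_reducer([[("a", 1), ("b", 2)], [("b", 1), ("c", 1)]])` merges the two Mapper outputs and returns `[("a", 1), ("b", 3), ("c", 1)]`. In a real job the final pairs would be written to HDFS rather than returned as a list.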
