Original article: http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
Keyword:
FileSplit: a section of an input file; the unit of input handed to a single map task.
Introduction:
This document describes how map and reduce operations are carried out in Hadoop. If you are not familiar with Google's MapReduce model, read the MapReduce paper first: http://labs.google.com/papers/mapreduce.html
Map
Since map operates on the input file set in parallel, its first step is to split the file set into FileSplits. Even a single file, if it is large enough to hurt processing efficiency, is divided into several splits. Note that the splitting step knows nothing about the internal logical structure of the input files; for example, text files organized by line boundaries are split at arbitrary byte boundaries. You can therefore define your own input format to control how splits are handled (a sketch follows below), or use the standard ones Hadoop provides. Each FileSplit is processed by a new map task.
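As one sketch of taking control of splitting (the class name here is invented for illustration), an input format in the old org.apache.hadoop.mapred API can refuse to split files at all by overriding isSplitable, so that each file becomes exactly one FileSplit and therefore one map task:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Each input file is handed to a single map task, however large it is.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false; // never split a file into multiple FileSplits
        }
    }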
When an individual map task starts, it opens a new output writer for each configured reduce task. It then reads its FileSplit using the RecordReader obtained from the specified InputFormat. The InputFormat parses the input and generates key-value pairs; it is also responsible for handling records that straddle the FileSplit boundary. For example, TextInputFormat reads the last line of a FileSplit past the split boundary, and, when it is not reading the first split of a file, it ignores the content up to the first newline, since that partial line belongs to the previous split.
The InputFormat class is not required to generate both meaningful keys and meaningful values. For example, the default TextInputFormat uses the content of each line of the input text as the value and the byte offset of that line as the key; most applications use only the line and ignore the offset.
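As a concrete illustration (the file contents here are made up), suppose a split begins with the two lines:

    hello world
    goodbye world

TextInputFormat delivers them to the Mapper as the pairs (0, "hello world") and (12, "goodbye world"): 12 is the byte offset of the second line, that is, the 11 bytes of the first line plus its newline.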
The key-value pairs read from the RecordReader are passed to the Mapper configured by the user. The user-supplied Mapper class may perform any operation on each pair and then call the OutputCollector.collect method to emit key-value pairs of its own choosing. The output it generates must use one fixed key class and one fixed value class, because the map output is written to disk as a SequenceFile, a format that stores the type information once per file and requires all records to be of the same type (if you want to output different data structures, subclass a common base type). The map input and output key-value types need not be related to each other.
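As an illustration, here is a sketch of the Mapper from the classic word count example, in the old org.apache.hadoop.mapred API that matches the OutputCollector interface described above; it takes the (offset, line) pairs produced by TextInputFormat and emits a (word, 1) pair for each token:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        // Note that the input types (LongWritable, Text) are unrelated
        // to the output types (Text, IntWritable).
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one); // emit (word, 1)
            }
        }
    }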
As the Mapper's outputs are collected, they are partitioned among the output files by the Partitioner class. By default this is HashPartitioner, which distributes pairs by the hashCode produced by the key class's hash function (so a good hash function is needed to balance the load across the reduce tasks); see the MapTask class for details. N input files generate M map tasks to be run, and each map task produces one output file per configured reduce task. Each output file is targeted at a specific reduce task, and the map output pairs from all of the map tasks are routed so that all the key-value pairs for a given key end up in files destined for the same reduce task. Consequently, all the key-value pairs for a given key are processed by one specific reduce task.
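When the default hash routing is not appropriate, a custom Partitioner can be supplied. The following is a minimal sketch (the class name and routing rule are invented for illustration) that sends all words starting with the same character to the same reduce task; it would be enabled with JobConf.setPartitionerClass:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical example: route words by their first character so that
    // related words appear together in one reduce task's output file.
    public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) {
            // no configuration needed
        }

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            int first = key.getLength() > 0 ? key.charAt(0) : 0;
            return (first & Integer.MAX_VALUE) % numPartitions;
        }
    }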
Combine
When the map operation outputs its key-value pairs, they are initially held in memory. For reasons of performance and efficiency, it is sometimes advantageous to take advantage of this by supplying a combiner, a class that performs a reduce-type function on the map side. If a combiner is configured, the map key-value pairs are not immediately written to the output; instead they are collected in lists, one list per key. When a certain number of key-value pairs has been written, the buffer is flushed by passing all the values of each key to the combiner's reduce method, which outputs key-value pairs just as if they had been produced by the original map operation.
For example, the word count program among the Hadoop examples has a map operation that outputs a (word, 1) key-value pair for each word, and counting the words in the input can be sped up substantially by a combiner. The combiner collects and processes the lists in memory, one list per word; when a certain number of pairs has been output, its reduce method is called with each unique word as the key and the list of ones as the value iterator. The combiner then outputs (word, count-in-this-part-of-the-input) key-value pairs. From the viewpoint of the reduce operation these pairs contain the same information as the raw map output, but far less data is written to and read from disk. A sketch of this reduce function follows.
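Here is a sketch of the word count reduce function, again in the old org.apache.hadoop.mapred API; because summing counts is associative, the same class can be registered both as the combiner (with JobConf.setCombinerClass) and as the final reducer:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once per unique key, with an iterator over all of
        // that key's values.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get(); // add up the partial counts
            }
            output.collect(key, new IntWritable(sum)); // emit (word, total)
        }
    }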
Reduce
When a reduce task starts, its input is scattered across the nodes in the map output files. In distributed mode these files first need to be copied to the local file system in a copy phase; see the ReduceTaskRunner class for details.
Once all of the data is available locally, it is appended into one file in an append phase. That file is then merge-sorted so that the key-value pairs for a given key are contiguous (the sort phase). This makes the actual reduce operation simple: the file is read sequentially, and the values are passed to the reduce method with an iterator that reads the input file until the next key is reached. See the ReduceTask class for details.
Finally, the output consists of one output file per executed reduce task. The format of these files can be specified with JobConf.setOutputFormat; if a SequenceFile output format is used, the output key class and value class must both be specified.
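Putting the pieces together, a minimal driver sketch might look like the following (WordCountMapper and WordCountReducer are the classes sketched above; the job name and the use of command-line arguments for paths are illustrative assumptions):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // One fixed key class and value class for the job's output.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(WordCountMapper.class);
            conf.setCombinerClass(WordCountReducer.class); // map-side combiner
            conf.setReducerClass(WordCountReducer.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class); // JobConf.setOutputFormat

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }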