MapReduce format and type

Last Update: 2016-06-05 Source: Internet

Author: User

MapReduce Types

MapReduce is a simple data processing model in which the inputs and outputs of both the map and reduce functions are key-value pairs.

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

In general, the map input key and value types (K1, V1) differ from the map output types (K2, V2). The reduce input types must match the map output types, and the reduce output types (K3, V3) may differ again. Here is the Java API:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    // ...
  }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends ReducerContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void reduce(KEYIN key, Iterable<VALUEIN> values,
                        Context context) throws IOException, InterruptedException {
    // ...
  }
}

The output key-value pairs are emitted by calling the write() method on the context:

public void write(KEYOUT key, VALUEOUT value)
    throws IOException, InterruptedException

Mapper and Reducer are separate classes, each with its own type parameters, so a type parameter of the mapper can be bound to a different type than the parameter of the same name in the reducer. For example, KEYIN may be LongWritable for the mapper and Text for the reducer.
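The type flow described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under assumptions: the class name TypeFlowSketch, the run() helper that stands in for the framework, and the word-count logic are all hypothetical, and the returned lists stand in for calls to context.write(). Here K1 = Long, V1 = String, K2 = String, V2 = Integer, K3 = String, V3 = Long.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the MapReduce type flow; no Hadoop classes are used.
class TypeFlowSketch {

    // map: (K1, V1) -> list(K2, V2), here (Long offset, String line) -> list(word, 1)
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: (K2, list(V2)) -> list(K3, V3), here (word, counts) -> list(word, total)
    static List<Map.Entry<String, Long>> reduce(String word, List<Integer> counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(word, total));
        return out;
    }

    // Hypothetical stand-in for the framework: map every record,
    // group the intermediate pairs by K2, then reduce each group.
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            for (Map.Entry<String, Long> kv : reduce(group.getKey(), group.getValue())) {
                result.put(kv.getKey(), kv.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

Note that the map input key (a Long offset) and the reduce input key (a String word) have different types, which mirrors the LongWritable and Text case mentioned above.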

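Note that if a combine function is used in the map phase, it has the same form as the reduce function, except that its output types are the intermediate key and value types (K2, V2), so that its output can feed the reduce function: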

map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
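As a rough illustration of why the combine output must have the intermediate types, the following plain-Java sketch runs a "maximum value per key" job with and without a combiner. Everything here is an assumption made for the example: the class name CombineSketch, the helper methods, and the sample values. No Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombineSketch {

    // combine: (K2, list(V2)) -> list(K2, V2). A single (key, max) pair is emitted,
    // so only its V2 part is returned here.
    static Integer combine(String year, List<Integer> temps) {
        return max(temps);
    }

    // reduce: (K2, list(V2)) -> list(K3, V3). In this example K3 = K2 and V3 = V2.
    static Integer reduce(String year, List<Integer> temps) {
        return max(temps);
    }

    static Integer max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    // All intermediate values for the key reach the reducer unchanged.
    static Integer runWithoutCombiner(String year, List<List<Integer>> mapOutputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> out : mapOutputs) {
            all.addAll(out);
        }
        return reduce(year, all);
    }

    // Each map task's output is combined first; the reducer sees one value per map task.
    static Integer runWithCombiner(String year, List<List<Integer>> mapOutputs) {
        List<Integer> combined = new ArrayList<>();
        for (List<Integer> out : mapOutputs) {
            combined.add(combine(year, out));
        }
        return reduce(year, combined);
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two map tasks for the same key.
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(0, 20, 10),
                Arrays.asList(25, 15));
        System.out.println(runWithoutCombiner("1950", mapOutputs)); // prints 25
        System.out.println(runWithCombiner("1950", mapOutputs));    // prints 25
    }
}
```

Both runs give the same result because taking a maximum is commutative and associative. A function without those properties, such as a mean, cannot be used as a combiner in this direct way.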

The partition function operates on the intermediate key and value types (K2, V2) and returns the partition index. In practice, the partition is determined solely by the key; the value is ignored.

public interface Partitioner<K2, V2> extends JobConfigurable {
  int getPartition(K2 key, V2 value, int numPartitions);
}

The default partitioner is HashPartitioner, which hashes the key to determine which partition it belongs to. Each partition is processed by one reduce task, so the number of partitions equals the number of reduce tasks.

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
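The arithmetic in getPartition can be tried out on ordinary Java objects. The sketch below is an assumption-laden illustration rather than Hadoop code: PartitionSketch is a hypothetical class, and it uses String and Integer keys, whose hashCode() implementations differ from those of Hadoop's Writable types.

```java
class PartitionSketch {

    // Same arithmetic as HashPartitioner.getPartition, applied to a plain Java key.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;

        // The same key always maps to the same partition.
        for (String key : new String[] {"apple", "banana", "apple"}) {
            System.out.println(key + " -> partition " + getPartition(key, numReduceTasks));
        }

        // Integer.hashCode() is the value itself, so this key has a negative hash code.
        Integer negativeKey = -7;
        System.out.println("plain modulo:  " + (negativeKey.hashCode() % numReduceTasks));
        System.out.println("masked modulo: " + getPartition(negativeKey, numReduceTasks));
    }
}
```

Without the mask, a negative hash code would give a negative remainder in Java, which is not a valid partition index.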

HashPartitioner matters when a job has multiple reduce tasks. The map output is spread across the reducers: all records with the same key go to the same reduce task, while different keys are distributed across the tasks, which can greatly improve the efficiency of the job. The number of reducers therefore determines the parallelism of the reduce phase. As for the number of map tasks, it is determined by the number of input splits, which by default corresponds to the number of blocks in the input files.

Choosing the number of reducers is something of an art: increasing the number of reducers increases the degree of parallelism.

Small files and CombineFileInputFormat

Hadoop works better with a small number of large files than with a large number of small ones, because FileInputFormat produces splits that are either a whole file or part of a single file. If the files are very small (smaller than an HDFS block) and there are many of them, every map task processes very little input and the cost of opening each file adds up. A large number of small files also increases the amount of metadata the NameNode has to store. CombineFileInputFormat is designed for this case: it packs many files into each split.
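The effect of small files on the number of splits can be estimated with some simple arithmetic. The sketch below is a deliberately simplified model, not Hadoop's actual split computation: the class SplitCountSketch, its method names, and the file sizes are hypothetical, files are assumed to be non-empty, and locality constraints and split-size tuning are ignored.

```java
class SplitCountSketch {

    // Simplified per-file model: a split never spans files,
    // so each file yields one split per block-sized chunk (at least one).
    static long splitsPerFile(long[] fileSizes, long blockSize) {
        long splits = 0;
        for (long size : fileSizes) {
            splits += Math.max(1, (size + blockSize - 1) / blockSize);
        }
        return splits;
    }

    // Simplified packing model: files are packed together until a split is full,
    // so the count depends only on the total size.
    static long splitsWhenCombined(long[] fileSizes, long maxSplitSize) {
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        return Math.max(1, (total + maxSplitSize - 1) / maxSplitSize);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;

        // Hypothetical input: 10,000 files of 1 MB each.
        long[] smallFiles = new long[10_000];
        java.util.Arrays.fill(smallFiles, mb);

        // Hypothetical input: the same amount of data in a single file.
        long[] oneLargeFile = {10_000 * mb};

        System.out.println(splitsPerFile(smallFiles, blockSize));      // prints 10000
        System.out.println(splitsPerFile(oneLargeFile, blockSize));    // prints 79
        System.out.println(splitsWhenCombined(smallFiles, blockSize)); // prints 79
    }
}
```

Under this model the small files produce one map task each, whereas the same data in one large file, or packed into combined splits, needs far fewer.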

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
