MapReduce Formats and Types

Source: Internet
Author: User

MapReduce Types

MapReduce is a simple data processing model in which the inputs and outputs of both the map and reduce functions are key-value pairs.

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

In general, the map input key and value types (K1, V1) differ from the map output types (K2, V2). The reduce input types must match the map output types, and the reduce output types (K3, V3) may be different again. Here is the Java API:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // ...
  }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    // ...
  }
}

Both context objects provide a write() method, which is what ultimately emits the output key-value pairs:

public void write(KEYOUT key, VALUEOUT value)
    throws IOException, InterruptedException

Mapper and Reducer are separate classes, each with its own type parameters, so the input types of the mapper can differ from those of the reducer. For example, the mapper's input key may be a LongWritable while the reducer's input key is a Text.
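To make the type flow concrete, here is a minimal sketch in plain Java that does not use Hadoop at all. It assumes a word-count job, with Long and String standing in for LongWritable and Text; the class name TypeFlowSketch and its methods are hypothetical and only model what the framework does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the MapReduce type flow; no Hadoop classes are used.
class TypeFlowSketch {

    // map: (K1 = Long offset, V1 = String line) -> list of (K2 = String word, V2 = Integer 1)
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: (K2 = String word, list of V2) -> (K3 = String word, V3 = Integer total)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return Map.entry(word, total);
    }

    // Stand-in for the framework: run map, group values by key (the shuffle), then reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // next line start, assuming 1-byte chars and '\n'
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> kv = reduce(e.getKey(), e.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

Note that the map input key (the offset) is never used by the word-count logic; it is only there because the input format supplies it, which is why K1 and K2 can be unrelated types.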

One thing to note: if a combiner is used in the map phase, its input has the same form as the reduce input, and its output types are the same as the map output types (K2, V2):

map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
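The effect of a combiner can be sketched in plain Java without Hadoop. The example below assumes a word-count job in which each input line is handled by its own map task; CombinerSketch and its method names are hypothetical. It shows that the combiner leaves the (K2, V2) pair type unchanged and only shrinks the number of pairs handed to the reducers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a combiner; no Hadoop classes are used.
class CombinerSketch {

    // Used as both combine and reduce: folds list(V2) for one key into a single count.
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    // Groups (key, value) pairs into key -> list of values, sorted by key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // One map task per line: emits (word, 1), then optionally combines locally.
    // The combiner takes (K2, list(V2)) and emits (K2, V2), so the pair type is unchanged.
    static List<Map.Entry<String, Integer>> mapTask(String line, boolean useCombiner) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        if (!useCombiner) return pairs;
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : group(pairs).entrySet()) {
            combined.add(Map.entry(e.getKey(), sum(e.getValue())));
        }
        return combined;
    }

    // All pairs that would be sent from the map tasks to the reducers.
    static List<Map.Entry<String, Integer>> shuffle(List<String> lines, boolean useCombiner) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) all.addAll(mapTask(line, useCombiner));
        return all;
    }

    // Reduce phase: (K2, list(V2)) -> (K3, V3).
    static Map<String, Integer> reduceAll(List<Map.Entry<String, Integer>> shuffled) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(shuffled).entrySet()) {
            result.put(e.getKey(), sum(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a a a b", "a b b");
        System.out.println(shuffle(lines, false).size() + " pairs without a combiner"); // 7
        System.out.println(shuffle(lines, true).size() + " pairs with a combiner");     // 4
        System.out.println(reduceAll(shuffle(lines, true)));                            // {a=4, b=3}
    }
}
```

Reusing the reduce logic as the combiner works here only because summation is associative and commutative; a function such as a mean cannot be combined this way without changing the result.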

The partition function operates on the intermediate key and value types (K2, V2) and returns the index of the partition. In practice the partition is determined by the key alone; the value is ignored.

public interface Partitioner<K2, V2> extends JobConfigurable {
  int getPartition(K2 key, V2 value, int numPartitions);
}

The default partitioner is HashPartitioner, which hashes a record's key to determine which partition the record belongs to. Each partition is processed by one reduce task, so the number of partitions equals the number of reduce tasks.

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
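The bitwise AND with Integer.MAX_VALUE in the code above clears the sign bit, so the result of the modulo is never negative. The following standalone sketch (the class PartitionSketch and the method naive are hypothetical, introduced only for contrast) shows what would go wrong without the mask.

```java
// Standalone illustration of the HashPartitioner arithmetic; no Hadoop classes are used.
class PartitionSketch {

    // Same arithmetic as HashPartitioner.getPartition: mask the sign bit, then take the modulo.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Naive variant without the mask, shown only for contrast.
    static int naive(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the int value itself, so this key has a negative hash code.
        Integer awkward = Integer.MIN_VALUE;
        System.out.println(naive(awkward, 3));     // prints -2, not a valid partition index
        System.out.println(partition(awkward, 3)); // prints 0, within [0, 3)
        // The same key always maps to the same partition.
        System.out.println(partition("hadoop", 3) == partition("hadoop", 3)); // prints true
    }
}
```

Math.abs would not be a safe substitute for the mask, because Math.abs(Integer.MIN_VALUE) is still negative.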

When a job needs multiple reduce tasks, HashPartitioner matters because the map output is spread across several reducers: all records with the same key go to the same reduce task, while different keys are distributed among the tasks, which can greatly improve the efficiency of the job. The number of reducers therefore determines the degree of parallelism of the reduce phase. As for the number of map tasks, it is determined by the number of input splits, which usually corresponds to the number of blocks in the input files.

Choosing the number of reducers is something of an art: increasing the number of reducers increases the degree of parallelism.

Small Files and CombineFileInputFormat

Hadoop works better with a small number of large files than with many small ones, because FileInputFormat generates splits that are either a whole file or part of a single file. If the files are small (here meaning smaller than an HDFS block) and there are many of them, each map task processes very little input and the overhead of opening files grows. A large number of small files also increases the cost of storing metadata on the NameNode.
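A rough calculation illustrates the difference. The sketch below is a simplified model and not Hadoop's actual split logic: it assumes one split per block per file for the per-file case, and a greedy packing of whole files up to a maximum split size for the combined case, ignoring the node and rack locality that CombineFileInputFormat takes into account. The class SmallFilesSketch and its methods are hypothetical.

```java
// Simplified model of split counts; not Hadoop's actual split computation.
class SmallFilesSketch {

    // Per-file splitting: ceil(size / blockSize) splits for each non-empty file.
    static long perFileSplits(long[] fileSizes, long blockSize) {
        long splits = 0;
        for (long size : fileSizes) {
            if (size > 0) splits += (size + blockSize - 1) / blockSize;
        }
        return splits;
    }

    // Combined splitting: pack whole files greedily into splits of at most maxSplitSize bytes.
    // Assumes every file is smaller than maxSplitSize.
    static long combinedSplits(long[] fileSizes, long maxSplitSize) {
        long splits = 0;
        long current = 0; // bytes already placed in the split being filled
        for (long size : fileSizes) {
            if (size <= 0) continue;
            if (current > 0 && current + size > maxSplitSize) {
                splits++;     // close the full split and start a new one
                current = 0;
            }
            current += size;
        }
        if (current > 0) splits++; // count the last, partially filled split
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = new long[1000];
        java.util.Arrays.fill(files, mb); // 1000 files of 1 MB each
        System.out.println(perFileSplits(files, 128 * mb));  // prints 1000
        System.out.println(combinedSplits(files, 128 * mb)); // prints 8
    }
}
```

Under these assumptions, 1000 one-megabyte files need 1000 map tasks when split per file, but only 8 when packed into 128 MB splits.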
