The "Hadoop authoritative Guide" notes chapter I & Chapter II



Using MapReduce


import java.io.IOException;

// Hadoop's own types, optimized for network serialization
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// the base class we inherit from
import org.apache.hadoop.mapred.MapReduceBase;
// the interface we implement
import org.apache.hadoop.mapred.Mapper;
// processed data is collected by it
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Although I have not formally studied Java syntax yet, I take it that extends names the base class
// being inherited and implements names the interface being implemented; Java separates the two syntactically.
public class MaxTemperatureMapper extends MapReduceBase
        // Mapper is a generic interface
        implements Mapper<LongWritable, Text, Text, IntWritable> {

Mapper is a generic interface:

Mapper<LongWritable, Text, Text, IntWritable>

It has four type parameters, which specify the map function's input key, input value, output key, and output value types.

In this example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is a temperature (an integer).

Hadoop provides its own set of basic types, optimized for network serialization, instead of using Java's built-in types directly. Here, LongWritable corresponds to Long, IntWritable to Integer, and Text to String.
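A small sketch (not from the notes, just an assumed illustration) of how these wrapper types behave like their Java counterparts:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        LongWritable offset = new LongWritable(106L);   // wraps a long, like Long
        IntWritable temperature = new IntWritable(-11); // wraps an int, like Integer
        Text year = new Text("1950");                   // wraps a string, like String

        System.out.println(offset.get());        // 106
        System.out.println(temperature.get());   // -11
        System.out.println(year.toString());     // 1950
    }
}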

The input to the map() method is a key and a value.

map() is also passed an OutputCollector instance, which is used to write the output.
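The body of the map() method did not survive in this copy of the notes. The following sketch completes the class declared above, modelled on the book's weather example; the fixed column offsets (year in columns 15-19, temperature in columns 87-92), the 9999 missing-value sentinel, and the quality-code check belong to the NCDC record format the book uses, so treat them as assumptions.

    private static final int MISSING = 9999; // NCDC records use 9999 for a missing temperature

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);   // the year sits in a fixed column range
        int airTemperature;
        if (line.charAt(87) == '+') {           // parseInt does not accept a leading plus sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            output.collect(new Text(year), new IntWritable(airTemperature));
        }
    }
}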


The input key and value types of the reduce function must match the output key and value types of the map function.
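The reducer code also did not survive here. A minimal sketch consistent with those types (Text key, IntWritable values), using the same old mapred API and modelled on the book's max-temperature example:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // keep the maximum temperature seen for this year
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}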

The third part of the code is the application that runs the MapReduce job.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }
}


The JobConf object specifies the job's execution specification. The constructor's argument is the class the job lives in; Hadoop uses that class to locate the JAR file containing it.

After constructing the JobConf object, set the paths of the input and output data. Here the input path is set with FileInputFormat's static method addInputPath(); it can be a single file, a directory (meaning all files under that directory), or a file pattern matching a set of files. As the name addInputPath() suggests, it can be called more than once, as sketched below.
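A short illustration (the paths are made up) of accumulating several inputs on the same conf object used in the driver above:

// hypothetical input paths, just to show that addInputPath() accumulates inputs
FileInputFormat.addInputPath(conf, new Path("/ncdc/1901"));
FileInputFormat.addInputPath(conf, new Path("/ncdc/1902"));
FileInputFormat.addInputPath(conf, new Path("/ncdc/19*")); // a file pattern is also accepted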

Similarly, FileOutputFormat.setOutputPath() specifies the output path, the directory the reduce output is written to. This directory should not exist before the job runs; if it does, Hadoop refuses to run the job and reports an error. The design is meant to prevent data from being accidentally overwritten and lost; Hadoop jobs can run for a long time, and losing their output is very annoying.

conf.setMapperClass() and conf.setReducerClass() specify the map and reduce types.

Next, setOutputKeyClass() and setOutputValueClass() specify the output types of the map and reduce functions; the two functions' output types are often the same. If they differ, the map output types are specified with setMapOutputKeyClass() and setMapOutputValueClass(), as in the snippet below.
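A hypothetical snippet (the type choices are invented purely for illustration) for a job whose map output value type differs from the final reduce output value type:

// map emits (Text, IntWritable) while the final output is (Text, Text)
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);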

The input type is set with an InputFormat. It is not specified in this example, so the default TextInputFormat (text input format) is used.

Finally, JobClient.runJob() submits the job and waits for it to finish, writing progress to the console.


Differences between the new Java MapReduce API and the old API:

  • The new API favors abstract classes over interfaces, because abstract classes are easier to evolve.

  • The new API lives in the org.apache.hadoop.mapreduce package; the old API lives in org.apache.hadoop.mapred.

  • The new API makes extensive use of context objects that let user code communicate with the MapReduce system; for example, MapContext essentially unifies the roles of JobConf, OutputCollector, and Reporter.

  • The new API supports both "push" and "pull" styles of iteration. In both APIs key-value pairs are pushed to the mapper, but the new API also lets the mapper pull records from inside the map() method, which makes it possible to process records in batches rather than one at a time.

  • Configuration is unified in the new API. A job is no longer configured through a JobConf object (an extension of Hadoop's Configuration object) but through a Configuration directly.

  • In the new API, a job is controlled by the Job class rather than the JobClient class, which has been removed (see the sketch after this list).

  • Output files are named slightly differently: map outputs are part-m-nnnnn and reduce outputs are part-r-nnnnn (where nnnnn is the part number, an integer starting from 0).
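To make these differences concrete, here is a condensed sketch of the same max-temperature example rewritten against the new API. It assumes a Hadoop 2.x-era mapreduce package, drops the quality-code check for brevity, and the enclosing class name NewApiMaxTemperature is invented for this note, so treat it as an illustration rather than the book's code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiMaxTemperature {

    // new-API mapper: Mapper is a class, and a Context object replaces
    // OutputCollector and Reporter
    public static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19); // same fixed-width parsing as the old-API example
            int airTemperature = Integer.parseInt(line.substring(87, 92).replace("+", ""));
            if (airTemperature != 9999) {         // skip the missing-value sentinel
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // new-API reducer: values arrive as an Iterable instead of an Iterator
    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: NewApiMaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        // unified configuration: a plain Configuration plus a Job, no JobConf/JobClient
        Job job = Job.getInstance(new Configuration(), "Max temperature (new API)");
        job.setJarByClass(NewApiMaxTemperature.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}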


The "Hadoop authoritative Guide" notes chapter I & Chapter II
