The "Hadoop authoritative Guide" notes chapter I & Chapter II



Using MapReduce


import java.io.IOException;

// Hadoop's own types, optimized for network serialization
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// the base class we inherit from
import org.apache.hadoop.mapred.MapReduceBase;
// the interface we implement
import org.apache.hadoop.mapred.Mapper;
// processed data is collected by it
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Although I have not formally studied Java syntax yet, I take it that extends names the base class
// being inherited and implements names the interface being implemented; Java separates the two syntactically.
public class MaxTemperatureMapper extends MapReduceBase
        // Mapper is a generic interface
        implements Mapper<LongWritable, Text, Text, IntWritable> {

Mapper is a generic interface:

Mapper<LongWritable, Text, Text, IntWritable>

It has four type parameters, which specify the map function's input key, input value, output key, and output value types.

In this example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is a temperature (an integer).

Hadoop provides its own set of basic types, optimized for network serialization, instead of using Java's built-in types directly. Here, LongWritable corresponds to Long, IntWritable to Integer, and Text to String.
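A small sketch (not from the notes, just an assumed illustration) of how these wrapper types behave like their Java counterparts:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        LongWritable offset = new LongWritable(106L);   // wraps a long, like Long
        IntWritable temperature = new IntWritable(-11); // wraps an int, like Integer
        Text year = new Text("1950");                   // wraps a string, like String

        System.out.println(offset.get());        // 106
        System.out.println(temperature.get());   // -11
        System.out.println(year.toString());     // 1950
    }
}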

The input to the map() method is a key and a value.

map() is also passed an OutputCollector instance, which is used to write the output.
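The body of the map() method did not survive in this copy of the notes. The following sketch completes the class declared above, modelled on the book's weather example; the fixed column offsets (year in columns 15-19, temperature in columns 87-92), the 9999 missing-value sentinel, and the quality-code check belong to the NCDC record format the book uses, so treat them as assumptions.

    private static final int MISSING = 9999; // NCDC records use 9999 for a missing temperature

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);   // the year sits in a fixed column range
        int airTemperature;
        if (line.charAt(87) == '+') {           // parseInt does not accept a leading plus sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            output.collect(new Text(year), new IntWritable(airTemperature));
        }
    }
}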


The input key and value types of the reduce function must match the output key and value types of the map function.
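The reducer code also did not survive here. A minimal sketch consistent with those types (Text key, IntWritable values), using the same old mapred API and modelled on the book's max-temperature example:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // keep the maximum temperature seen for this year
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}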

The third part of the code is the application that runs the MapReduce job.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }
}


The JobConf object specifies the job's execution specification. The constructor's argument is the class the job lives in; Hadoop uses that class to locate the JAR file containing it.

After constructing the JobConf object, set the paths of the input and output data. Here the input path is set with FileInputFormat's static method addInputPath(); it can be a single file, a directory (meaning all files under that directory), or a file pattern matching a set of files. As the name addInputPath() suggests, it can be called more than once, as sketched below.
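A short illustration (the paths are made up) of accumulating several inputs on the same conf object used in the driver above:

// hypothetical input paths, just to show that addInputPath() accumulates inputs
FileInputFormat.addInputPath(conf, new Path("/ncdc/1901"));
FileInputFormat.addInputPath(conf, new Path("/ncdc/1902"));
FileInputFormat.addInputPath(conf, new Path("/ncdc/19*")); // a file pattern is also accepted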

Similarly, FileOutputFormat.setOutputPath() specifies the output path, the directory the reduce output is written to. This directory should not exist before the job runs; if it does, Hadoop refuses to run the job and reports an error. The design is meant to prevent data from being accidentally overwritten and lost; Hadoop jobs can run for a long time, and losing their output is very annoying.

conf.setMapperClass() and conf.setReducerClass() specify the map and reduce types.

Next, setOutputKeyClass() and setOutputValueClass() specify the output types of the map and reduce functions; the two functions' output types are often the same. If they differ, the map output types are specified with setMapOutputKeyClass() and setMapOutputValueClass(), as in the snippet below.
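A hypothetical snippet (the type choices are invented purely for illustration) for a job whose map output value type differs from the final reduce output value type:

// map emits (Text, IntWritable) while the final output is (Text, Text)
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);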

The input type is set with an InputFormat. It is not specified in this example, so the default TextInputFormat (text input format) is used.

Finally, JobClient.runJob() submits the job and waits for it to finish, writing progress to the console.


Differences between the new Java MapReduce API and the old API:

  • The new API favors abstract classes over interfaces, because abstract classes are easier to evolve.

  • The new API lives in the org.apache.hadoop.mapreduce package; the old API lives in org.apache.hadoop.mapred.

  • The new API makes extensive use of context objects that let user code communicate with the MapReduce system; for example, MapContext essentially unifies the roles of JobConf, OutputCollector, and Reporter.

  • The new API supports both "push" and "pull" styles of iteration. In both APIs key-value pairs are pushed to the mapper, but the new API also lets the mapper pull records from inside the map() method, which makes it possible to process records in batches rather than one at a time.

  • Configuration is unified in the new API. A job is no longer configured through a JobConf object (an extension of Hadoop's Configuration object) but through a Configuration directly.

  • In the new API, a job is controlled by the Job class rather than the JobClient class, which has been removed (see the sketch after this list).

  • Output files are named slightly differently: map outputs are part-m-nnnnn and reduce outputs are part-r-nnnnn (where nnnnn is the part number, an integer starting from 0).
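To make these differences concrete, here is a condensed sketch of the same max-temperature example rewritten against the new API. It assumes a Hadoop 2.x-era mapreduce package, drops the quality-code check for brevity, and the enclosing class name NewApiMaxTemperature is invented for this note, so treat it as an illustration rather than the book's code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiMaxTemperature {

    // new-API mapper: Mapper is a class, and a Context object replaces
    // OutputCollector and Reporter
    public static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19); // same fixed-width parsing as the old-API example
            int airTemperature = Integer.parseInt(line.substring(87, 92).replace("+", ""));
            if (airTemperature != 9999) {         // skip the missing-value sentinel
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // new-API reducer: values arrive as an Iterable instead of an Iterator
    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: NewApiMaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        // unified configuration: a plain Configuration plus a Job, no JobConf/JobClient
        Job job = Job.getInstance(new Configuration(), "Max temperature (new API)");
        job.setJarByClass(NewApiMaxTemperature.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}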


The "Hadoop authoritative Guide" notes chapter I & Chapter II
