Hadoop (4) -- Programming Core MapReduce (Part 1)


    The previous article described HDFS, one of Hadoop's core components and the foundation of the distributed platform. This article covers MapReduce, the programming model designed to make the best use of HDFS's distributed storage and to improve computational efficiency. Its two main phases, Map (mapping) and Reduce (reduction), both take <key,value> pairs as input and output; all we need to do is process those <key,value> pairs the way we want. It looks simple, but it can be troublesome in practice because the model is so flexible.
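As a quick illustration (this flow is not from the original figures, and the byte offsets are made up), counting words in two lines of text would move through MapReduce roughly like this:

map input:      <0, "hello world">, <12, "hello hadoop">
map output:     <"hello", 1>, <"world", 1>, <"hello", 1>, <"hadoop", 1>
after shuffle:  <"hadoop", [1]>, <"hello", [1, 1]>, <"world", [1]>
reduce output:  <"hadoop", 1>, <"hello", 2>, <"world", 1>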

First, let's look at the two diagrams below to see how MapReduce executes inside Hadoop, and what the execution flow within MapReduce looks like:


Take the analysis of meteorological data as an example:
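The classic version of this example finds the highest temperature per year. As a minimal sketch (not the original article's code, and assuming a simplified record format of "year,temperature" per line rather than the raw weather data), the map function emits <year, temperature> pairs and the reduce function keeps the maximum for each year:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: each input line is assumed to look like "1950,22".
public class MaxTemperature {

    public static class TemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 2) {
                // emit <year, temperature>
                context.write(new Text(fields[0].trim()),
                              new IntWritable(Integer.parseInt(fields[1].trim())));
            }
        }
    }

    public static class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable val : values) {
                max = Math.max(max, val.get()); // keep the highest temperature for the year
            }
            context.write(key, new IntWritable(max));
        }
    }
}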



Two, the steps of MapReduce execution:


Map task handling:

1. Read the input file content and parse it into key-value pairs (key/value). Each line of the input file is parsed into one key-value pair, and the map function is called once for each pair.

2. Write your own logic to process the input key-value pair (key/value) and convert it into new key-value pairs (key/value) for output.

3. Partition the output key-value pairs (key/value) (Partition); a Partitioner sketch follows this list.

4. Sort the data of the different partitions by key and group it; values with the same key are put into one collection (Shuffle).

5. Optionally reduce the grouped data locally (Combiner), i.e. part of the reducer's work can already be done in the mapper, lightening the load on the reduce phase.
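For step 3, the default HashPartitioner distributes keys by their hash code; a custom partitioner can replace it. Below is a minimal, hypothetical sketch (not from the original article) that, assuming the job runs with two reduce tasks, sends words beginning with a-m to one reducer and the rest to the other; it would be registered with job.setPartitionerClass(FirstLetterPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: splits words into two partitions by first letter.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2) {
            return 0; // with a single reducer everything goes to partition 0
        }
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        return (first <= 'm') ? 0 : 1;
    }
}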


Reduce task handling:

1. The output of the multiple map tasks is copied over the network to the different reduce nodes according to its partition.

2. Merge and sort the outputs of the multiple map tasks. Write the logic of the reduce function yourself to process the input key/value pairs and convert them into new key/value output.

3. Save the output of reduce to a file (written to HDFS).

Three, task execution optimization:


1. Speculative execution: if the JobTracker finds a task that is lagging behind, it starts an identical backup task, and whichever copy finishes first kills the other. This is why you can often see killed tasks on the monitoring web page even for jobs that executed normally.

2. Speculative execution is turned on by default, but if the slowness comes from a code problem, it does not solve anything and only makes the cluster slower. In the mapred-site.xml configuration file, mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution turn speculative execution on or off for map tasks and reduce tasks respectively (see the configuration sketch after this list).

3. Reusing a JVM saves the time it takes to start a new JVM. In the mapred-site.xml configuration file, mapred.job.reuse.jvm.num.tasks sets the maximum number of tasks to run in a single JVM (1, a value greater than 1, or -1, which means no limit).

4. Skipping mode: after a task fails twice while reading a piece of data, it reports the location of that data to the JobTracker. The JobTracker restarts the task, which then skips the recorded bad data directly when it encounters it again (this behavior is controlled through SkipBadRecords).
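The properties above normally live in mapred-site.xml, but the same settings can also be applied per job in code. A minimal sketch follows (the property names are the ones mentioned above; the SkipBadRecords call is from the older mapred API and is an assumption here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

public class TuningSketch {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // 1-2. turn speculative execution off for map and reduce tasks
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // 3. let a single JVM run an unlimited number of tasks (-1 means no limit)
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        // 4. begin skipping bad records after two failed attempts (assumed API)
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        return conf;
    }
}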


Four, error handling mechanisms:


1. Hardware failure, i.e. JobTracker or TaskTracker failure:

A. The JobTracker is a single point of failure; if it fails, Hadoop currently cannot recover from it, so the most reliable hardware available should be chosen for the JobTracker.

B. The JobTracker uses the heartbeat (with a period of one minute) to determine whether a TaskTracker has failed or is too heavily loaded.

C. The JobTracker removes the failed TaskTracker from the task node list.

D. If the failed node was running a map task that had not yet completed, the JobTracker asks another node to re-execute that map task.

E. If the failed node was running a reduce task that had not yet completed, the JobTracker asks another node to continue executing the unfinished reduce tasks.


2. Task failure: a task fails because of a code problem or a process crash:

A. The JVM exits automatically and sends an error message to its TaskTracker parent process; the error message is also written to the log.

B. The TaskTracker's listener notices that the process has exited, or that the process has not reported progress for a long time, and marks the task as failed.

C. After the task is marked as failed, the task counter is decremented by 1 so that a new task can be accepted, and the task failure is reported to the JobTracker through the heartbeat signal.

D. When the JobTracker learns that the task failed, it puts the task back into the scheduling queue, reassigns it, and executes it again.

E. If a task fails more than 4 times (the limit is configurable; a sketch follows this list), it is no longer executed and the job is declared failed.
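The retry limit in item E corresponds to configurable properties. A hypothetical per-job sketch (the property names mapred.map.max.attempts and mapred.reduce.max.attempts are from the classic MRv1 configuration and are an assumption here):

import org.apache.hadoop.conf.Configuration;

public class RetryLimitSketch {
    public static Configuration withRetryLimits() {
        Configuration conf = new Configuration();
        // allow each map or reduce task up to 4 attempts before the job is declared failed
        conf.setInt("mapred.map.max.attempts", 4);
        conf.setInt("mapred.reduce.max.attempts", 4);
        return conf;
    }
}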


Five, finally, a WordCount example:

package job;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Hadoop's first MapReduce example, WordCount, counts the occurrences of words.
 *
 * @author Administrator
 */
public class WordCount {

    /*
     * Extends Mapper; the map input type is <Object, Text>
     * and the output type is <Text, IntWritable>.
     */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        // "one" indicates that the word appeared once
        private final static IntWritable one = new IntWritable(1);
        // "word" stores the current word after splitting
        private Text word = new Text();

        // map splits the line and writes each word out in the form <word, 1>
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // split the line into words
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreElements()) {
                word.set(itr.nextToken()); // put the current token into word
                context.write(word, one);
            }
        }
    }

    /**
     * Reducer implementation.
     *
     * @author Administrator
     */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        // "result" records the frequency of the word
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        String[] otherArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // configure the job name
        Job job = new Job(configuration, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
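To try the example, the class is typically packaged into a jar and submitted with a command along the lines of hadoop jar wordcount.jar job.WordCount <in> <out>, where <in> and <out> are HDFS paths and the jar name here is only illustrative.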

For MapReduce, the key is to understand the execution process and the APIs that correspond to each property, and then to train your own modeling thinking with algorithm-related practice.


Copyright notice: This is the blogger's original article and may not be reproduced without the blogger's permission.

