Chaining MapReduce Jobs


A simple parsing task can be done with a single MapReduce job, but more complex analyses may need multiple jobs, or multiple map or reduce stages, working together. Let's look at the programming forms for combining multiple jobs or multiple MapReduce stages.

Combined MapReduce jobs mainly take the following programming forms.

1. Iterative MapReduce

In the iterative MapReduce approach, the output of one MapReduce job typically serves as the input to the next. Only the final job's result needs to be preserved; the intermediate results can be deleted or retained as needed.

The sample code for iterative MapReduce is shown below:

/**
 * @ProjectName mapreducelinkjob
 * @PackageName com.buaa
 * @ClassName IterativeJob
 * @Description TODO
 * @Author Liu Jishu
 * @Date 2016-06-11 11:01:57
 */
public class IterativeJob extends Configured implements Tool {
    // Only the main code is given here; the rest is omitted
    ......

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The first MapReduce job
        Job job1 = new Job(conf, "Job1");
        ...
        // job1's input
        FileInputFormat.addInputPath(job1, input);
        // job1's output
        FileOutputFormat.setOutputPath(job1, out1);
        job1.waitForCompletion(true);

        // The second MapReduce job
        Job job2 = new Job(conf, "Job2");
        ...
        // job1's output becomes job2's input
        FileInputFormat.addInputPath(job2, out1);
        // job2's output
        FileOutputFormat.setOutputPath(job2, out2);
        job2.waitForCompletion(true);

        // The third MapReduce job
        Job job3 = new Job(conf, "Job3");
        ...
        // job2's output becomes job3's input
        FileInputFormat.addInputPath(job3, out2);
        // job3's output
        FileOutputFormat.setOutputPath(job3, out3);
        job3.waitForCompletion(true);
        ....
    }
    ......
}
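
Since the class implements Tool, it is normally launched through ToolRunner, which parses the generic Hadoop options before handing the remaining arguments to run(). The main() method is omitted above; a minimal sketch of what it might look like:

// A minimal entry point for a Tool implementation (assumed, since main() is omitted above).
// ToolRunner.run() parses generic Hadoop options (-D, -files, ...) before calling run().
public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new IterativeJob(), args);
    System.exit(exitCode);
}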

Although iterative MapReduce can implement multi-job workflows, it has the following two drawbacks:

1. In each iteration, all job objects are created anew, which incurs significant overhead.

2. In each iteration, intermediate data must be written out to storage and then read back, so the I/O and network-transfer costs are high.

2. Dependent MapReduce

Dependent MapReduce is implemented with the JobControl class (together with ControlledJob, in the org.apache.hadoop.mapreduce.lib.jobcontrol package). An instance of JobControl represents a dependency graph of jobs: you add job configurations to it and declare the dependencies between jobs. When you run a JobControl instance in a thread, it executes the jobs in dependency order. You can monitor its progress, and when the jobs have finished you can query the status of every job and the error messages of any failures. If a job fails, JobControl does not run the jobs that depend on it.

The sample code for dependent MapReduce is shown below:

/**
 * @ProjectName mapreducelinkjob
 * @PackageName com.buaa
 * @ClassName DependentJob
 * @Description TODO
 * @Author Liu Jishu
 * @Date 2016-06-11 11:12:45
 */
public class DependentJob extends Configured implements Tool {
    // Only the main code is given here; the rest is omitted
    ......

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf1 = new Configuration();
        Job job1 = new Job(conf1, "Job1");
        ...

        Configuration conf2 = new Configuration();
        Job job2 = new Job(conf2, "Job2");
        ....

        Configuration conf3 = new Configuration();
        Job job3 = new Job(conf3, "Job3");
        ....

        // Construct a ControlledJob for each job
        ControlledJob cJob1 = new ControlledJob(conf1);
        cJob1.setJob(job1);

        ControlledJob cJob2 = new ControlledJob(conf2);
        cJob2.setJob(job2);

        ControlledJob cJob3 = new ControlledJob(conf3);
        cJob3.setJob(job3);

        // job3 depends on job1
        cJob3.addDependingJob(cJob1);
        // job3 depends on job2
        cJob3.addDependingJob(cJob2);

        JobControl jc = new JobControl("DependentJob");
        // Add the three ControlledJobs to the JobControl
        jc.addJob(cJob1);
        jc.addJob(cJob2);
        jc.addJob(cJob3);

        Thread t = new Thread(jc);
        t.start();

        while (true) {
            if (jc.allFinished()) {
                jc.stop();
                break;
            }
        }
        return 0;
    }
    ......
}

Note: Hadoop's JobControl class implements the Runnable interface, so we need to wrap it in a Thread instance to start it. If you call JobControl's run() method directly, the thread never ends.
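
Rather than spinning in a tight loop as in the example above, the driver can sleep between polls. A minimal sketch of such a polling loop (the 500 ms interval is an arbitrary choice for illustration):

Thread t = new Thread(jc);
t.start();
// Poll at a modest interval instead of busy-waiting
while (!jc.allFinished()) {
    Thread.sleep(500);
}
jc.stop();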

3. Chained MapReduce

A large number of data-processing tasks involve preprocessing and post-processing of records. For example, when processing documents for information retrieval, you may first need to remove stop words (words such as a, the, and is that occur frequently but carry little meaning), and then do stemming (converting different forms of a word into the same form, for example converting finishing and finished to finish).
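
As a concrete illustration, the stop-word-removal step could itself be written as a small mapper that can later be plugged into a chain. A minimal sketch, assuming the org.apache.hadoop.mapreduce API; both the class name StopWordMapper and the tiny hard-coded stop list are illustrative:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative preprocessing mapper: drops stop words from each input line
public class StopWordMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "the", "is"));

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringBuilder kept = new StringBuilder();
        for (String word : value.toString().split("\\s+")) {
            if (!STOP_WORDS.contains(word.toLowerCase())) {
                kept.append(word).append(' ');
            }
        }
        // Emit the filtered line; downstream stages would do stemming, etc.
        context.write(new Text(kept.toString().trim()), NullWritable.get());
    }
}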

We could write a separate MapReduce job for each preprocessing and post-processing step and link them together, using IdentityReducer (or a completely different reducer) in those jobs. This approach is inefficient, because each job's intermediate results consume I/O and storage resources during execution. Another way is to write a single mapper that invokes all the preprocessing steps itself and a reducer that invokes all the post-processing steps. This forces us to architect the preprocessing and post-processing steps in a modular and composable way. Hadoop therefore introduced the ChainMapper and ChainReducer classes to simplify this kind of composition.

Hadoop provides the dedicated ChainMapper and ChainReducer classes to handle chained MapReduce jobs. There can be multiple mappers in the map or reduce phase. Like a Linux pipeline, the output of each mapper is fed directly into the input of the next, forming a pipeline.

Its invocation form is as follows:

...
ChainMapper.addMapper(...);
ChainReducer.setReducer(...);
ChainReducer.addMapper(...);
...

The addMapper() method is declared as follows:

public static void addMapper(Job job,
    Class<? extends Mapper> mclass,
    Class<?> inputKeyClass,
    Class<?> inputValueClass,
    Class<?> outputKeyClass,
    Class<?> outputValueClass,
    Configuration conf)

The addMapper() method takes seven parameters. The first and last are the global Job object and a local Configuration object, respectively. The second parameter is the Mapper class responsible for the data processing. The remaining four parameters, inputKeyClass, inputValueClass, outputKeyClass, and outputValueClass, are the input and output key/value types of that Mapper class. ChainReducer additionally provides a setReducer() method to set the job's single reducer; its signature is similar to addMapper()'s.
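
To see how the type parameters line up, consider a hypothetical Map1 that reads standard TextInputFormat records (LongWritable offsets and Text lines) and emits Text/Text pairs; the four class arguments of addMapper() must mirror the mapper's own generic signature:

// Hypothetical mapper: input key/value LongWritable/Text, output key/value Text/Text
public class Map1 extends Mapper<LongWritable, Text, Text, Text> { ... }

// The class arguments mirror Map1's generics, in the same order:
ChainMapper.addMapper(job, Map1.class,
        LongWritable.class, Text.class,   // Map1's input key/value types
        Text.class, Text.class,           // Map1's output key/value types
        new Configuration(false));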

The sample code for chained MapReduce is shown below:

/**
 * @ProjectName mapreducelinkjob
 * @PackageName com.buaa
 * @ClassName ChainJob
 * @Description TODO
 * @Author Liu Jishu
 * @Date 2016-06-11 11:16:55
 */
public class ChainJob extends Configured implements Tool {
    // Only the main code is given here; the rest is omitted
    ......

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);

        job.setJobName("ChainJob");
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);

        // Add the Map1 stage to the job
        Configuration map1Conf = new Configuration(false);
        ChainMapper.addMapper(job, Map1.class, LongWritable.class, Text.class,
                Text.class, Text.class, map1Conf);

        // Add the Map2 stage to the job
        Configuration map2Conf = new Configuration(false);
        ChainMapper.addMapper(job, Map2.class, Text.class, Text.class,
                LongWritable.class, Text.class, map2Conf);

        // Add the Reduce stage to the job
        Configuration reduceConf = new Configuration(false);
        ChainReducer.setReducer(job, Reduce.class, LongWritable.class, Text.class,
                Text.class, Text.class, reduceConf);

        // Add the Map3 stage to the job
        Configuration map3Conf = new Configuration(false);
        ChainReducer.addMapper(job, Map3.class, Text.class, Text.class,
                LongWritable.class, Text.class, map3Conf);

        // Add the Map4 stage to the job
        Configuration map4Conf = new Configuration(false);
        ChainReducer.addMapper(job, Map4.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, map4Conf);

        job.waitForCompletion(true);
        return 0;
    }
    ......
}

Note: In a chained MapReduce job, the map and reduce phases can contain any number of mappers, but there can be only one reducer. Therefore, a job that needs multiple reducers cannot be implemented with ChainMapper/ChainReducer.

