<1>. Chaining MapReduce jobs
1.1 Chaining MapReduce jobs in a sequence
A MapReduce program can perform fairly complex data processing, typically by splitting the task into smaller subtasks, running each subtask as a job in Hadoop, and then collecting the subtask results to complete the overall task.
The simplest case is sequential execution, and the programming model is correspondingly simple. In MapReduce programming a job is started with JobClient.runJob(), which does not return until the job has completed, so to execute jobs in sequence you only need to start the next job after the previous one finishes.
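A minimal sketch of this pattern using the old mapred API that JobClient.runJob() belongs to; the job names and the elided configuration are placeholders, not taken from the text above:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SequentialJobs {
    public static void main(String[] args) throws Exception {
        // First job: runJob() blocks until the job finishes (or throws on failure).
        JobConf job1 = new JobConf(SequentialJobs.class);
        job1.setJobName("job1");
        // ... configure mapper, reducer and input/output paths for job1 ...
        JobClient.runJob(job1);

        // The second job is started only after the first one has completed.
        JobConf job2 = new JobConf(SequentialJobs.class);
        job2.setJobName("job2");
        // ... configure mapper, reducer and input/output paths for job2 ...
        JobClient.runJob(job2);
    }
}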
1.2 Chaining MapReduce jobs with complex dependency
In many cases, however, simple sequential execution is not enough, and Hadoop provides the JobControl class to encapsulate a series of jobs and the dependencies between those jobs.
JobControl provides an addJob method for adding a job to its collection of jobs;
at the same time, each Job provides an addDependingJob method for declaring the jobs it depends on.
For example, suppose we need to run five jobs: job1, job2, job3, job4 and job5, where job2 must run after job1 completes, job4 must run after job3 completes, and job5 must run after both job2 and job4 complete. The dependency model can then be built like this:
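Here is a sketch of how that model could be expressed with JobControl (old mapred jobcontrol API); the conf1 through conf5 JobConf objects are placeholders standing in for fully configured jobs:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class DependentJobs {
    public static void main(String[] args) throws Exception {
        // Placeholders: in a real program each JobConf would be fully configured.
        JobConf conf1 = new JobConf(DependentJobs.class);
        JobConf conf2 = new JobConf(DependentJobs.class);
        JobConf conf3 = new JobConf(DependentJobs.class);
        JobConf conf4 = new JobConf(DependentJobs.class);
        JobConf conf5 = new JobConf(DependentJobs.class);

        Job job1 = new Job(conf1);
        Job job2 = new Job(conf2);
        Job job3 = new Job(conf3);
        Job job4 = new Job(conf4);
        Job job5 = new Job(conf5);

        job2.addDependingJob(job1);   // job2 runs only after job1 completes
        job4.addDependingJob(job3);   // job4 runs only after job3 completes
        job5.addDependingJob(job2);   // job5 waits for both job2 ...
        job5.addDependingJob(job4);   // ... and job4

        JobControl control = new JobControl("dependency-example");
        control.addJob(job1);
        control.addJob(job2);
        control.addJob(job3);
        control.addJob(job4);
        control.addJob(job5);

        // JobControl is a Runnable: run it in its own thread and wait for all jobs.
        Thread controller = new Thread(control);
        controller.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}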
1.3 Chaining preprocessing and postprocessing steps
A single Hadoop job can also contain multiple mappers and a reducer chained together, and they execute in the following order:
The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper; the output of the last Mapper will be written to the task's output.
This is like a pipeline command under Linux: Mapper1 | Mapper2 | Reducer1 | Mapper3 | Mapper4.
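Such a chain is assembled with Hadoop's ChainMapper and ChainReducer classes (old mapred API). The sketch below is self-contained but uses identity mappers and an identity reducer as stand-ins for Mapper1 through Mapper4 and Reducer1, since the real processing steps are not defined in this text:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainExample {
    // Identity stand-ins for the real processing steps in the pipeline.
    public static class Mapper1 extends IdentityMapper<LongWritable, Text> {}
    public static class Mapper2 extends IdentityMapper<LongWritable, Text> {}
    public static class Reducer1 extends IdentityReducer<LongWritable, Text> {}
    public static class Mapper3 extends IdentityMapper<LongWritable, Text> {}
    public static class Mapper4 extends IdentityMapper<LongWritable, Text> {}

    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ChainExample.class);
        job.setJobName("chain-example");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Mappers that run before the reducer: Mapper1 | Mapper2
        ChainMapper.addMapper(job, Mapper1.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(job, Mapper2.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));

        // The single reducer in the chain: Reducer1
        ChainReducer.setReducer(job, Reducer1.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));

        // Mappers that run after the reducer: Mapper3 | Mapper4
        ChainReducer.addMapper(job, Mapper3.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));
        ChainReducer.addMapper(job, Mapper4.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, true, new JobConf(false));

        JobClient.runJob(job);
    }
}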
<2>. Joining data from different sources
Suppose we have two data sources, such as customers and orders.
If we want to analyze these data together, we need Hadoop's datajoin feature. First, let's look at how a data join in Hadoop proceeds.
First, the map function reads records from customers and orders and, for each record, emits a key/value pair whose key is the group (join) key and whose value is the record tagged with the data source it came from;
then Hadoop partitions and shuffles this output; just as in an ordinary job, all records that share the same group key are delivered to the same reduce call;
the reduce function therefore receives the complete set of records for one group key;
inside the reduce function the join is performed by forming the cross product of the records from the different sources, and each resulting combination is passed to the combine() function, which outputs one result record.
Hadoop provides the DataJoinMapperBase and DataJoinReducerBase base classes for implementing data joins. The mapper for a data join must inherit from DataJoinMapperBase and implement three methods: generateInputTag, generateGroupKey and generateTaggedMapOutput.
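A sketch of such a mapper, modeled on the customers/orders example above; the TaggedWritable wrapper class, the Text payload, and the comma-separated record layout with the join key in the first field are assumptions of this sketch, not given in the text:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class JoinMapper extends DataJoinMapperBase {

    // A simple tagged record wrapper; here the payload is always a Text line.
    public static class TaggedWritable extends TaggedMapOutput {
        private Text data;

        public TaggedWritable() {              // needed for Writable deserialization
            this.tag = new Text("");
            this.data = new Text("");
        }

        public TaggedWritable(Text data) {
            this.tag = new Text("");
            this.data = data;
        }

        @Override
        public Writable getData() {
            return data;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            tag.write(out);
            data.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            tag.readFields(in);
            data.readFields(in);
        }
    }

    // Tag each record with the file it came from, e.g. "customers" or "orders".
    @Override
    protected Text generateInputTag(String inputFile) {
        return new Text(inputFile);
    }

    // The group key is the join key; here we assume it is the first
    // comma-separated field of the record (e.g. the customer ID).
    @Override
    protected Text generateGroupKey(TaggedMapOutput aRecord) {
        String line = ((Text) aRecord.getData()).toString();
        return new Text(line.split(",")[0]);
    }

    // Wrap the raw record and attach the tag of the source it came from.
    @Override
    protected TaggedMapOutput generateTaggedMapOutput(Object value) {
        TaggedWritable retv = new TaggedWritable((Text) value);
        retv.setTag(this.inputTag);
        return retv;
    }
}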
The reducer for a data join must inherit from DataJoinReducerBase and override the combine method. The first thing to be clear about is what combine actually operates on:
For each tuple in the cross product, it calls the following method, which is expected to be implemented in a subclass: protected abstract TaggedMapOutput combine(Object[] tags, Object[] values); The above method is expected to produce one output value from an array of records of different sources.
So combine is applied to one combination from that cross product, i.e. an array of records drawn from the different sources; its only responsibility is to merge such a combination into a record in the final output format.
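As a sketch, continuing the assumptions from the mapper above (including its TaggedWritable helper class), an inner-join combine() that concatenates one record from each source could look like this:

import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;

public class JoinReducer extends DataJoinReducerBase {

    // Called once for every combination in the cross product; tags[i]
    // identifies the source of values[i]. Returning null drops the combination.
    @Override
    protected TaggedMapOutput combine(Object[] tags, Object[] values) {
        // Inner join: keep only combinations with a record from each source.
        if (tags.length < 2) {
            return null;
        }
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                joined.append(",");
            }
            TaggedMapOutput record = (TaggedMapOutput) values[i];
            joined.append(((Text) record.getData()).toString());
        }
        JoinMapper.TaggedWritable retv =
                new JoinMapper.TaggedWritable(new Text(joined.toString()));
        retv.setTag((Text) tags[0]);
        return retv;
    }
}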