Abstract: A MapReduce program that processes a patent citation data set.
Keywords: MapReduce, patent data set
Data Source: the patent citation data set cite75_99.txt (the data set can be downloaded from http://www.nber.org/patents/).
Problem Description:
Read the patent citation data set and invert it: for each patent, find all the patents that cite it and merge them into one record. The first five output records are as follows:
1 3964859, 4647229
10000 4539112
100000 5031388
1000006 4714284
1000007 4766693
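Each line of cite75_99.txt lists a citing patent followed by the patent it cites, separated by a comma. The map step swaps the two fields so that the cited patent becomes the key, and the reduce step concatenates all citing patents for each key. The sample pairs below are reconstructed from the output shown above and are only illustrative:
Input (citing,cited):    Map output (cited, citing):    Reduce output:
3964859,1                1  3964859                     1  3964859,4647229
4647229,1                1  4647229                     10000  4539112
4539112,10000            10000  4539112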
Solution:
1. Development environment: VMware 10 + Ubuntu 12.04 + Hadoop 1.1.2 + Eclipse
2. Create a project in Eclipse and add a Java class to it.
The program listing is as follows:
package com.wangluqing;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob1 extends Configured implements Tool {

    public static class MapClass extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
        @Override
        public void map(Text key, Text value, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
            // key = citing patent, value = cited patent; emit (cited, citing) to invert the relation
            output.collect(value, key);
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Join all citing patents of one cited patent into a comma-separated list
            String csv = "";
            while (values.hasNext()) {
                if (csv.length() > 0)
                    csv += ",";
                csv += values.next().toString();
            }
            output.collect(key, new Text(csv));
        }
    }

    public static void main(String[] args) throws Exception {
        String[] arg = { "hdfs://hadoop:9000/user/root/input/cite75_99.txt",
                "hdfs://hadoop:9000/user/root/output" };
        int res = ToolRunner.run(new Configuration(), new MyJob1(), arg);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob1.class);
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // KeyValueTextInputFormat splits each input line into key/value at this separator
        job.set("key.value.separator.in.input.line", ",");
        JobClient.runJob(job);
        return 0;
    }
}
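For a quick sanity check of the logic without a cluster, the inversion and join can be simulated with plain Java collections. The sketch below is not part of the original listing, and the sample pairs are taken from the output shown above:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CitationInvertDemo {
    public static void main(String[] args) {
        // (citing, cited) pairs in the same format as cite75_99.txt
        String[][] citations = { { "3964859", "1" }, { "4647229", "1" }, { "4539112", "10000" } };
        // "Map" step: invert each pair to (cited, citing) and group by the cited patent
        Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (String[] c : citations) {
            List<String> citing = grouped.get(c[1]);
            if (citing == null) {
                citing = new ArrayList<String>();
                grouped.put(c[1], citing);
            }
            citing.add(c[0]);
        }
        // "Reduce" step: join all citing patents of one cited patent into a CSV string
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            StringBuilder csv = new StringBuilder();
            for (String s : e.getValue()) {
                if (csv.length() > 0) {
                    csv.append(",");
                }
                csv.append(s);
            }
            System.out.println(e.getKey() + "\t" + csv.toString());
        }
    }
}

Running this prints "1	3964859,4647229" and "10000	4539112", matching the first two records of the expected output.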
Run the job on Hadoop, then execute the following command under Ubuntu
hadoop fs -cat /user/root/output/part-00000 | head
to view the results after the MapReduce job has finished.
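If the job is packaged as a jar and launched from the command line instead of from Eclipse, the steps might look like the following (the jar name MyJob1.jar is an assumption; the input and output paths are hard-coded in main()):
hadoop jar MyJob1.jar com.wangluqing.MyJob1
hadoop fs -ls /user/root/output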
Summary:
First, you can develop MapReduce programs in the Eclipse IDE together with the plugin that matches your Hadoop version.
Second, design and write MapReduce programs based on the data flow and the problem domain.
Resources:
1. http://www.wangluqing.com/2014/03/hadoop-mapreduce-programapp1/
2. Hadoop in Action, Chapter 4 (writing basic MapReduce programs)