The Hadoop version used here is 2.2.0.
An inverted index can be understood, in simple terms, as full-text retrieval by word.
For example: search the two files a.txt and b.txt for the word hello and count how many times it appears in each; the more occurrences a file has, the more relevant it is to the keyword.
The content of a.txt is as follows:
Hello Tom
Hello Jerry
Hello Kitty
Hello World
Hello Tom
The content of b.txt is as follows:
Hello Jerry
Hello Tom
Hello World
Task: write MapReduce code on the Hadoop platform to count the number of occurrences of each word in the two files.
In fact, this is just a small revision of the WordCount program ~
Upload the two files to the ii folder under the HDFS root (when MapReduce reads the ii folder directly, it reads all files whose names do not begin with an underscore _), for example:
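A minimal way to do the upload from the shell (assuming a.txt and b.txt are in the current local directory and the cluster's default file system is the HDFS instance used below):

hadoop fs -mkdir /ii
hadoop fs -put a.txt b.txt /ii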
Writing the MapReduce code
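All of the snippets below are static inner classes (plus a main method) of a single driver class. A minimal skeleton with the imports they rely on could look like this; the class name InverseIndex simply matches the setJarByClass call in the main method further down:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class InverseIndex {
    // MyMapper, MyCombiner, MyReducer and main() shown below go inside this class
}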
First, analyze the map input format:
the key is the byte offset of the line, and the value is the text of the line.
For example:
0	Hello Tom
We know that the map output will be grouped by key.
A word is not unique to one file: it may appear in both, so using the word alone as the key cannot tell which file each occurrence came from.
Using the file name as the key would not achieve our goal either, because the map output would then be grouped as a.txt -> word, word ... word.
This is obviously not the result we want.
So the format of the map output should be
word->file name as the key, and 1 as the value
For example:
Hello->a.txt 1
Here "->" is just a separator between the word and the file it appears in,
and it does not affect the grouping by key.
The map code is as follows:
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Text k = new Text();
    private Text v = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] data = value.toString().split(" ");
        // The FileSplit is obtained from the context and holds the path of the file
        // currently being read, e.g. hdfs://hadoop:9000/ii/a.txt
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        // The current file name is the last segment of that path
        String[] fileNames = fileSplit.getPath().toString().split("/");
        String fileName = fileNames[fileNames.length - 1];
        for (String d : data) {
            k.set(d + "->" + fileName); // key: word->file name
            v.set("1");                 // value: one occurrence
            context.write(k, v);
        }
    }
}
After the map phase completes,
we need a combiner to do part of the work.
Note that the combiner's input and output key/value types must both be the same as the map output types, otherwise an error occurs.
As before, the combiner's input has already been grouped by key, with the values for the same key collected together:
(hello->a.txt,{1,1,1,1,1})
What the combiner does is sum up the values,
then split the key into word and file name, and combine the file name with the computed count to form a new value.
For example:
(hello,a.txt->5)
Why do this?
Because after the combiner finishes,
the output is again grouped by key, just as after map,
collecting the values that share the same key into one collection.
As a result, after combiner execution, the input to reduce becomes
(Hello,{a.txt->5,b.txt->3})
With this format, we just loop over the values in reduce and write them out, and that is exactly the result we want ~
The combiner code is as follows:
public static class MyCombiner extends Reducer<Text, Text, Text, Text> {

    private Text k = new Text();
    private Text v = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Split the key into word and file name
        String[] wordAndPath = key.toString().split("->");
        // Sum up the occurrence counts
        int counts = 0;
        for (Text t : values) {
            counts += Integer.parseInt(t.toString());
        }
        // Emit the new key/value pair: word as the key, "fileName->count" as the value
        k.set(wordAndPath[0]);
        v.set(wordAndPath[1] + "->" + counts);
        context.write(k, v);
    }
}
And then the job of reduce is simple.
The code is as follows:
public static class MyReducer extends Reducer<Text, Text, Text, Text> {

    private Text v = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate all "fileName->count" entries for this word
        String res = "";
        for (Text text : values) {
            res += text.toString() + "\r";
        }
        v.set(res);
        context.write(key, v);
    }
}
Main Method Code:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inPath = new Path("hdfs://hadoop:9000" + args[0]);
    Path outPath = new Path("hdfs://hadoop:9000" + args[1]);
    // Delete the output directory if it already exists, so the job can be rerun
    if (fs.exists(outPath)) {
        fs.delete(outPath, true);
    }

    Job job = Job.getInstance(conf);
    job.setJarByClass(InverseIndex.class);

    FileInputFormat.setInputPaths(job, inPath);
    job.setInputFormatClass(TextInputFormat.class);

    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    job.setCombinerClass(MyCombiner.class);

    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileOutputFormat.setOutputPath(job, outPath);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.waitForCompletion(true);
}
Package the code into a jar and run it on Hadoop to check the results, for example:
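A typical invocation might look like the following; the jar name ii.jar and the output path /iiout are only examples, and args[0]/args[1] are the HDFS input and output paths the main method expects:

hadoop jar ii.jar InverseIndex /ii /iiout
hadoop fs -cat /iiout/part-r-00000

With the sample files above, each output line should contain a word followed by its per-file counts (joined by the "\r" the reducer appends), for example hello with a.txt->5 and b.txt->3.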
I have just started learning Hadoop and this is only a note; if there are any mistakes, please let me know ^ ^
MapReduce implementation of a simple search-engine inverted index