Inverted index:
Before (a forward index): given a file, find the words it contains.
Now (an inverted index): given a word, return which files it appears in and how often.
This is like Baidu search: you enter a keyword, the engine quickly finds the files on its servers that contain that keyword, and ranks the results by frequency and other policies (such as page click-through rate). In this process the inverted index plays the key role: it takes the words of multiple texts, breaks them down, counts them, records their locations, and merges the results.
The MapReduce implementation is divided into three processes: the map, combiner, and reduce processes.
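Before walking through the MapReduce version, here is a minimal plain-Java sketch of what an inverted index is: a map from each word to the files it appears in and its frequency in each. No Hadoop is required; the file names and contents are made up for illustration.

```java
import java.util.*;

public class InvertedIndexDemo {
    // Builds word -> (file -> frequency) from a set of small in-memory documents.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum); // count occurrences per file
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1.txt", "hello world");
        docs.put("file2.txt", "hello hadoop hello");
        System.out.println(build(docs));
        // {hadoop={file2.txt=1}, hello={file1.txt=1, file2.txt=2}, world={file1.txt=1}}
    }
}
```

The MapReduce job below computes exactly this structure, but distributed: the map, combiner, and reduce phases split the counting and merging across machines.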
One, the map process:
1. Obtain the path of the file being processed from the FileSplit.
2. Split the text of each file into tokens with a StringTokenizer; by default it splits on whitespace.
3. Loop over the StringTokenizer: set the key to word + path, set the value (the word frequency) to 1 by default, and write the key/value pair through the context. Note that even when the same word appears in different files, the pairs are not merged into one iterable, because the path part of the key differs.
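The three map steps can be simulated in plain Java, with the emitted pairs collected into a list instead of written to a Hadoop context (the file name is made up for illustration):

```java
import java.util.*;

public class MapStepSketch {
    // Simulates the map phase: each token becomes the key "word:path"
    // and the string "1" becomes its value.
    static List<String[]> map(String path, String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line); // splits on whitespace by default
        while (itr.hasMoreTokens()) {
            pairs.add(new String[]{itr.nextToken() + ":" + path, "1"});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] kv : map("file1.txt", "hello hello world")) {
            System.out.println(kv[0] + " -> " + kv[1]);
        }
        // hello:file1.txt -> 1
        // hello:file1.txt -> 1
        // world:file1.txt -> 1
    }
}
```

Note that the two "hello" pairs stay separate here; merging them is the combiner's job.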
Second, the combiner process:
Before this process runs, all values (frequencies) that share the same key (the same word + path, i.e. the same word in the same file) are grouped into one value iterable (Iterable) under that key.
1. Count the frequency of the word in each file by iterating over the values.
2. Split the input key: set the extracted word as the new key, and set the path plus the word frequency as the new value.
In the next reduce process, all the values (path + frequency) that share the same key (the word) will then form one iterable, which makes the final merge easy.
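The combiner's two steps, sketched in plain Java for a single key and its grouped values (the key and counts are made up for illustration):

```java
import java.util.*;

public class CombinerStepSketch {
    // Simulates the combiner: sums the "1" values for one "word:path" key,
    // then splits the key at the first ':' so the word alone becomes the new key
    // and "path:frequency" becomes the new value.
    static String[] combine(String key, List<String> values) {
        int sum = 0;
        for (String v : values) {
            sum += Integer.parseInt(v); // count occurrences of the word in this file
        }
        int splitIndex = key.indexOf(":");
        String word = key.substring(0, splitIndex);
        String pathAndFreq = key.substring(splitIndex + 1) + ":" + sum;
        return new String[]{word, pathAndFreq};
    }

    public static void main(String[] args) {
        String[] out = combine("hello:file1.txt", Arrays.asList("1", "1", "1"));
        System.out.println(out[0] + " -> " + out[1]); // hello -> file1.txt:3
    }
}
```

Splitting at the first ':' works even for a real HDFS path such as hdfs://host:9000/..., because the word itself contains no colon.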
Three, the reduce process:
This process is really a merge: iterate over the values for each key, concatenate the path + frequency pairs into a single value inside the loop, and write it out. The key written out is the original key (the word).
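The reduce merge, sketched in plain Java for one word and its grouped path + frequency values (the inputs are made up for illustration):

```java
import java.util.*;

public class ReduceStepSketch {
    // Simulates the reduce phase: concatenates all "path:frequency" values
    // for one word into a single semicolon-separated file list.
    static String reduce(String word, List<String> values) {
        StringBuilder fileList = new StringBuilder();
        for (String v : values) {
            fileList.append(v).append(";");
        }
        return fileList.toString();
    }

    public static void main(String[] args) {
        String fileList = reduce("hello", Arrays.asList("file1.txt:3", "file2.txt:1"));
        System.out.println("hello -> " + fileList); // hello -> file1.txt:3;file2.txt:1;
    }
}
```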
Code analysis:
1. Input: four files, each containing a sentence:
2. Output result:
Code:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {

        private Text keyInfo = new Text();   // stores the word + URI combination
        private Text valueInfo = new Text(); // stores the word frequency
        private FileSplit split;             // stores the split object

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the FileSplit object this <key, value> pair belongs to.
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // The key consists of the word and the URI.
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                // The word frequency is initially 1.
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

        private Text info = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Count the word frequency.
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            // Reset the value to the URI plus the word frequency.
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // Reset the key to the word alone.
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Generate the document list.
            String fileList = new String();
            for (Text value : values) {
                fileList += value.toString() + ";";
            }
            result.set(fileList);
            context.write(key, result);
        }
    }

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "InvertedIndex");
            job.setJarByClass(InvertedIndex.class);
            // The map function generates intermediate results from the input <key, value>.
            job.setMapperClass(InvertedIndexMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setCombinerClass(InvertedIndexCombiner.class);
            job.setReducerClass(InvertedIndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("hdfs://192.168.61.128:9000/daopai2/"));
            FileOutputFormat.setOutputPath(job, new Path(
                    "hdfs://192.168.61.128:9000/outdaopai1/" + System.currentTimeMillis() + "/"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IllegalStateException e) {
            e.printStackTrace();
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}