Inverted index:
Before (a forward index): given a file, find the words it contains.
Now (an inverted index): given a word, return which files it appears in and how often.
This is like Baidu search: you enter a keyword, the engine quickly finds the files on its servers that contain that keyword, and ranks the results by frequency and other policies (such as page click-through rate). In this process the inverted index plays the key role: it takes the words of multiple texts, breaks them down, counts them, records their locations, and merges the results.
The MapReduce implementation is divided into three processes: the map, combiner, and reduce processes.
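Before walking through the MapReduce version, here is a minimal plain-Java sketch of what an inverted index is: a map from each word to the files it appears in and its frequency in each. No Hadoop is required; the file names and contents are made up for illustration.

```java
import java.util.*;

public class InvertedIndexDemo {
    // Builds word -> (file -> frequency) from a set of small in-memory documents.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum); // count occurrences per file
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1.txt", "hello world");
        docs.put("file2.txt", "hello hadoop hello");
        System.out.println(build(docs));
        // {hadoop={file2.txt=1}, hello={file1.txt=1, file2.txt=2}, world={file1.txt=1}}
    }
}
```

The MapReduce job below computes exactly this structure, but distributed: the map, combiner, and reduce phases split the counting and merging across machines.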
One, the map process:
1. Obtain the path of the file being processed from the FileSplit.
2. Split the text of each file into tokens with a StringTokenizer; by default it splits on whitespace.
3. Loop over the StringTokenizer: set the key to word + path, set the value (the word frequency) to 1 by default, and write the key/value pair through the context. Note that even when the same word appears in different files, the pairs are not merged into one iterable, because the path part of the key differs.
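The three map steps can be simulated in plain Java, with the emitted pairs collected into a list instead of written to a Hadoop context (the file name is made up for illustration):

```java
import java.util.*;

public class MapStepSketch {
    // Simulates the map phase: each token becomes the key "word:path"
    // and the string "1" becomes its value.
    static List<String[]> map(String path, String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line); // splits on whitespace by default
        while (itr.hasMoreTokens()) {
            pairs.add(new String[]{itr.nextToken() + ":" + path, "1"});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] kv : map("file1.txt", "hello hello world")) {
            System.out.println(kv[0] + " -> " + kv[1]);
        }
        // hello:file1.txt -> 1
        // hello:file1.txt -> 1
        // world:file1.txt -> 1
    }
}
```

Note that the two "hello" pairs stay separate here; merging them is the combiner's job.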
Second, the combiner process:
Before this process runs, all values (frequencies) that share the same key (the same word + path, i.e. the same word in the same file) are grouped into one value iterable (Iterable) under that key.
1. Count the frequency of the word in each file by iterating over the values.
2. Split the input key: set the extracted word as the new key, and set the path plus the word frequency as the new value.
In the next reduce process, all the values (path + frequency) that share the same key (the word) will then form one iterable, which makes the final merge easy.
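The combiner's two steps, sketched in plain Java for a single key and its grouped values (the key and counts are made up for illustration):

```java
import java.util.*;

public class CombinerStepSketch {
    // Simulates the combiner: sums the "1" values for one "word:path" key,
    // then splits the key at the first ':' so the word alone becomes the new key
    // and "path:frequency" becomes the new value.
    static String[] combine(String key, List<String> values) {
        int sum = 0;
        for (String v : values) {
            sum += Integer.parseInt(v); // count occurrences of the word in this file
        }
        int splitIndex = key.indexOf(":");
        String word = key.substring(0, splitIndex);
        String pathAndFreq = key.substring(splitIndex + 1) + ":" + sum;
        return new String[]{word, pathAndFreq};
    }

    public static void main(String[] args) {
        String[] out = combine("hello:file1.txt", Arrays.asList("1", "1", "1"));
        System.out.println(out[0] + " -> " + out[1]); // hello -> file1.txt:3
    }
}
```

Splitting at the first ':' works even for a real HDFS path such as hdfs://host:9000/..., because the word itself contains no colon.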
Three, the reduce process:
This process is really a merge: iterate over the values for each key, concatenate the path + frequency pairs into a single value inside the loop, and write it out. The key written out is the original key (the word).
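The reduce merge, sketched in plain Java for one word and its grouped path + frequency values (the inputs are made up for illustration):

```java
import java.util.*;

public class ReduceStepSketch {
    // Simulates the reduce phase: concatenates all "path:frequency" values
    // for one word into a single semicolon-separated file list.
    static String reduce(String word, List<String> values) {
        StringBuilder fileList = new StringBuilder();
        for (String v : values) {
            fileList.append(v).append(";");
        }
        return fileList.toString();
    }

    public static void main(String[] args) {
        String fileList = reduce("hello", Arrays.asList("file1.txt:3", "file2.txt:1"));
        System.out.println("hello -> " + fileList); // hello -> file1.txt:3;file2.txt:1;
    }
}
```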
Code analysis:
1. Input: four files, each containing a sentence:
2. Output result:
Code:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {

        private Text keyInfo = new Text();   // stores the word + URI combination
        private Text valueInfo = new Text(); // stores the word frequency
        private FileSplit split;             // stores the split object

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the FileSplit object this <key, value> pair belongs to.
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // The key consists of the word and the URI.
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                // The word frequency is initially 1.
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

        private Text info = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Count the word frequency.
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            // Reset the value to the URI plus the word frequency.
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // Reset the key to the word alone.
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Generate the document list.
            String fileList = new String();
            for (Text value : values) {
                fileList += value.toString() + ";";
            }
            result.set(fileList);
            context.write(key, result);
        }
    }

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "InvertedIndex");
            job.setJarByClass(InvertedIndex.class);
            // The map function generates intermediate results from the input <key, value>.
            job.setMapperClass(InvertedIndexMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setCombinerClass(InvertedIndexCombiner.class);
            job.setReducerClass(InvertedIndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("hdfs://192.168.61.128:9000/daopai2/"));
            FileOutputFormat.setOutputPath(job, new Path(
                    "hdfs://192.168.61.128:9000/outdaopai1/" + System.currentTimeMillis() + "/"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IllegalStateException e) {
            e.printStackTrace();
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}