The inverted index of Hadoop

Source: Internet
Author: User
Tags: Iterable

Inverted index:
A forward index goes from a file to the words it contains; an inverted index goes the other way:
given a word, it returns which files the word appears in and how often it appears in each.
This is like Baidu Search: you enter a keyword, and the Baidu engine quickly finds the files
on its servers that contain that keyword, then returns results ranked by word frequency and
other policies (such as page click-through rate). In this process, the inverted index plays the key role.
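To make the idea concrete, here is a minimal in-memory sketch in plain Java (no Hadoop); the file names and sentences are made up for illustration:

    import java.util.HashMap;
    import java.util.Map;

    public class TinyInvertedIndex {
        public static void main(String[] args) {
            // Hypothetical corpus: file name -> file contents.
            Map<String, String> files = new HashMap<>();
            files.put("file1.txt", "hello world hello");
            files.put("file2.txt", "hello hadoop");

            // The inverted index: word -> (file name -> frequency in that file).
            Map<String, Map<String, Integer>> index = new HashMap<>();
            for (Map.Entry<String, String> file : files.entrySet()) {
                for (String word : file.getValue().split("\\s+")) {
                    index.computeIfAbsent(word, w -> new HashMap<>())
                         .merge(file.getKey(), 1, Integer::sum);
                }
            }

            // Query: which files contain "hello", and how often in each?
            System.out.println(index.get("hello")); // e.g. {file1.txt=2, file2.txt=1}
        }
    }

The MapReduce job below builds the same structure, but distributed across many files and machines.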

Building one means taking multiple text files, breaking them into words, counting occurrences, recording locations, and merging the results.
The job is divided into three phases: the Map, Combiner, and Reduce processes.
One, the Map process:
1. The path of the file being processed is obtained from the FileSplit.
2. A StringTokenizer breaks each line of the file into words; by default it splits on whitespace.
3. Looping over the StringTokenizer, word + path is set as the key, and the word frequency, initially 1, as the value;
each pair is then written through the context. Note that even when the same word appears in different files, the pairs
are not merged into one iteration, because the path in the key differs. A sample of the mapper output is shown below.
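For illustration, suppose a hypothetical input file hdfs://localhost:9000/input/file1.txt contains the line "hello world hello" (both the path and the text are made up). The mapper would emit:

    hello:hdfs://localhost:9000/input/file1.txt    1
    world:hdfs://localhost:9000/input/file1.txt    1
    hello:hdfs://localhost:9000/input/file1.txt    1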

Two, the Combiner process:
Before this process runs, the framework groups identical keys (the same word + path, i.e. the same word in the
same file), so their values (frequencies) arrive collected in one value iteration (Iterable) under that key.
1. The frequency of the word in each file is counted by iterating over these values.
2. The input key is cut at the separator: the word part becomes the new key, and the path plus the word frequency
become the new value.
In the following Reduce process, all the values (path + frequency) belonging to the same key (the word) will then
form one iteration, which makes the final merge straightforward. A sample of the combiner output is shown below.
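Continuing the hypothetical example, the combiner sums the 1s for each (word, path) key and moves the path into the value, producing:

    hello    hdfs://localhost:9000/input/file1.txt:2
    world    hdfs://localhost:9000/input/file1.txt:1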

Three, the Reduce process:
This process is really a merge: iterate over the values for each key, and within the loop concatenate each
path + word frequency into one value string, then write it. The key written out is the original key, i.e. the word.
The final output looks like the sample below.
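If a second hypothetical file hdfs://localhost:9000/input/file2.txt contained "hello hadoop", the reducer would merge the per-file values into one semicolon-separated list per word, for example:

    hadoop    hdfs://localhost:9000/input/file2.txt:1;
    hello     hdfs://localhost:9000/input/file1.txt:2;hdfs://localhost:9000/input/file2.txt:1;
    world     hdfs://localhost:9000/input/file1.txt:1;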

Code Analysis:
1. Input: four files, each with a sentence written in it.
2. Output result: one line per word, listing the files it appears in and its frequency in each.

Code Show:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {

        private Text keyInfo = new Text();   // Stores the combination of word and URI.
        private Text valueInfo = new Text(); // Stores the word frequency.
        private FileSplit split;             // Stores the split object.

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the FileSplit object that this <key, value> belongs to.
            split = (FileSplit) context.getInputSplit();
            System.out.println("******split======" + split);
            System.out.println("-------value====" + value.toString());
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // The key consists of a word and a URI.
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                System.out.println("***********map---keyInfo====" + keyInfo);
                // The word frequency is initially 1.
                valueInfo.set("1");
                System.out.println("+++++++valueInfo======" + valueInfo);
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

        private Text info = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            System.out.println("***combiner***values====" + values.toString());
            // Count the word frequency.
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            // System.out.println("--combiner----sum=====" + sum);
            int splitIndex = key.toString().indexOf(":");
            // Reset the value to be composed of the URI and the word frequency.
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // Reset the key to the word alone.
            key.set(key.toString().substring(0, splitIndex));
            System.out.println("*****************************");
            System.out.println("combiner-----key====" + key.toString());
            System.out.println("-----------------------");
            System.out.println("combiner------info===" + info.toString());
            context.write(key, info);
        }
    }

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Generate the document list.
            String fileList = new String();
            for (Text value : values) {
                fileList += value.toString() + ";";
            }
            result.set(fileList);
            context.write(key, result);
        }
    }

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "InvertedIndex");
            job.setJarByClass(InvertedIndex.class);
            // The map function generates intermediate results from the input <key, value>.
            job.setMapperClass(InvertedIndexMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setCombinerClass(InvertedIndexCombiner.class);
            job.setReducerClass(InvertedIndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("hdfs://192.168.61.128:9000/daopai2/"));
            FileOutputFormat.setOutputPath(job, new Path(
                    "hdfs://192.168.61.128:9000/outdaopai1/" + System.currentTimeMillis() + "/"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IllegalStateException e) {
            e.printStackTrace();
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
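After packaging the class into a jar, the job can be submitted with the standard hadoop jar command (the jar name here is illustrative; the input and output HDFS paths are hard-coded in main()):

    hadoop jar inverted-index.jar InvertedIndex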
