Original article: http://www.cnblogs.com/archimedes/p/mapreduce-inverted-index.html. Please credit the source when reprinting.
1. Introduction to Inverted Index
An inverted index (also commonly called a postings file or inverted file) is an index data structure that stores, for full-text search, a mapping from words to their locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.
There are two main variants of inverted indexes:
- A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that contain it.
- A word-level inverted index (or full inverted index) additionally records the position of each occurrence of the word within a document.
The latter form offers more functionality (such as phrase search), but needs more time and space to build.
Example:
Taking English as an example, here are the texts to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We obtain the following inverted file index:

"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}
A conjunctive search for the terms "what", "is" and "it" then corresponds to the set intersection {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.
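The conjunctive (AND) query above can be sketched as a set intersection in plain Java (an illustration added here, not part of the original article; the posting sets are hard-coded from the example):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BooleanAndQuery {
    // record-level inverted index from the example above
    static Map<String, Set<Integer>> index = new HashMap<>();
    static {
        index.put("a", new HashSet<>(Arrays.asList(2)));
        index.put("banana", new HashSet<>(Arrays.asList(2)));
        index.put("is", new HashSet<>(Arrays.asList(0, 1, 2)));
        index.put("it", new HashSet<>(Arrays.asList(0, 1, 2)));
        index.put("what", new HashSet<>(Arrays.asList(0, 1)));
    }

    // intersect the posting sets of all query words
    static Set<Integer> andQuery(String... words) {
        Set<Integer> result = new TreeSet<>(index.get(words[0]));
        for (int i = 1; i < words.length; i++) {
            result.retainAll(index.get(words[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(andQuery("what", "is", "it")); // [0, 1]
    }
}
```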
For the same texts, the full inverted index is shown below; each posting is a pair of (document number, word position within that document), with both counted from zero. For example, "banana": {(2, 3)} means that "banana" occurs in the third document (T2), where it is the fourth word (position 3).

"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
"what":   {(0, 2), (1, 0)}
If we run a phrase search for "what is it", every word of the phrase occurs in both document 0 and document 1; however, the words appear consecutively, as the phrase requires, only in document 1.
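The consecutive-position check that a word-level index makes possible can be sketched in plain Java (a toy illustration added here, not code from the original article; the in-memory index structure and the `phraseInDoc` helper are made up for demonstration):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PhraseSearch {
    // word -> (docId -> set of positions of the word in that document)
    static Map<String, Map<Integer, Set<Integer>>> index = new HashMap<>();

    static {
        // index the three example texts T0..T2 (word positions start at 0)
        String[] docs = { "it is what it is", "what is it", "it is a banana" };
        for (int d = 0; d < docs.length; d++) {
            String[] tokens = docs[d].split(" ");
            for (int p = 0; p < tokens.length; p++) {
                index.computeIfAbsent(tokens[p], w -> new HashMap<>())
                     .computeIfAbsent(d, k -> new HashSet<>())
                     .add(p);
            }
        }
    }

    // true if the words occur consecutively (as a phrase) in the document
    static boolean phraseInDoc(int doc, String... words) {
        Map<Integer, Set<Integer>> first = index.get(words[0]);
        if (first == null || !first.containsKey(doc)) return false;
        for (int start : first.get(doc)) {
            boolean match = true;
            for (int i = 1; i < words.length; i++) {
                Map<Integer, Set<Integer>> postings = index.get(words[i]);
                if (postings == null || !postings.containsKey(doc)
                        || !postings.get(doc).contains(start + i)) {
                    match = false;
                    break;
                }
            }
            if (match) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // all three words occur in documents 0 and 1, but only document 1
        // contains them consecutively
        System.out.println(phraseInDoc(0, "what", "is", "it")); // false
        System.out.println(phraseInDoc(1, "what", "is", "it")); // true
    }
}
```

A record-level index cannot answer this query by itself, because it discards the positions needed to distinguish document 0 from document 1.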
2. Analysis and Design
(1) Map process
First, the input file is processed with the default TextInputFormat class, which supplies the byte offset of each line as the key and the line's content as the value. The map stage must then parse each input <key, value> pair to extract the three pieces of information the inverted index needs: the word, the document URI, and the word frequency.
Two problems arise here. First, a <key, value> pair can carry only two values; unless a custom Hadoop data type is defined, two of the three fields must be packed together into either the key or the value, as appropriate. Second, word-frequency counting and document-list generation cannot both be completed in a single reduce pass, so a combine stage must be added to finish the word-frequency statistics.
public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {

    private Text keyInfo = new Text();    // stores the word:URI combination
    private Text valueInfo = new Text();  // stores the word frequency
    private FileSplit split;              // stores the split object

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // get the FileSplit object this <key, value> pair belongs to
        split = (FileSplit) context.getInputSplit();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // the key consists of a word and a URI, e.g. "MapReduce:1.txt"
            keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
            // the initial word frequency is 1
            valueInfo.set("1");
            context.write(keyInfo, valueInfo);
        }
    }
}
(2) Combine process
The combiner sums the values that share the same key, yielding the frequency of each word in each document.
public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

    private Text info = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // sum the word frequency
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        int splitIndex = key.toString().indexOf(":");
        // reset the value to "URI:frequency"
        info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
        // reset the key to the word alone
        key.set(key.toString().substring(0, splitIndex));
        context.write(key, info);
    }
}
(3) Reduce process
After the two stages above, the reduce stage only needs to concatenate the values belonging to the same key into the inverted-index file format; everything else can be left to the MapReduce framework.
public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    // note: the method must be named reduce to override Reducer#reduce;
    // otherwise the framework silently falls back to the identity reducer
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // generate the document list
        StringBuilder fileList = new StringBuilder();
        for (Text value : values) {
            fileList.append(value.toString()).append(";");
        }
        result.set(fileList.toString());
        context.write(key, result);
    }
}
The complete code is as follows:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {

        private Text keyInfo = new Text();
        private Text valueInfo = new Text();
        private FileSplit split;

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

        private Text info = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            result.set(fileList.toString());
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: invertedindex <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
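To preview the output format without a Hadoop cluster, the map, combine, and reduce stages can be simulated in plain Java (a rough sketch for illustration only; the document names `0.txt`, `1.txt`, `2.txt` are hypothetical stand-ins for the file URIs, and the real job of course runs through the Hadoop classes above):

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalSimulation {
    // returns word -> "URI:freq;URI:freq;..." in the reducer's output format
    static Map<String, String> buildIndex(String[] docs) {
        // map + combine: count word frequency per (word, document)
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (int d = 0; d < docs.length; d++) {
            String uri = d + ".txt"; // hypothetical document name
            for (String word : docs[d].split(" ")) {
                counts.computeIfAbsent(word, w -> new TreeMap<>())
                      .merge(uri, 1, Integer::sum);
            }
        }
        // reduce: concatenate the "URI:freq;" entries for each word
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            StringBuilder fileList = new StringBuilder();
            for (Map.Entry<String, Integer> f : e.getValue().entrySet()) {
                fileList.append(f.getKey()).append(":").append(f.getValue()).append(";");
            }
            out.put(e.getKey(), fileList.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] docs = { "it is what it is", "what is it", "it is a banana" };
        buildIndex(docs).forEach((word, list) -> System.out.println(word + "\t" + list));
    }
}
```

For the three example texts this prints lines such as `is<TAB>0.txt:2;1.txt:1;2.txt:1;`, matching the `word<TAB>URI:freq;URI:freq;...` records that InvertedIndexReducer emits.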
Resources
Http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95
"Hadoop in Action: Opening the Road to Cloud Computing", Liu Peng