Original article: http://www.cnblogs.com/archimedes/p/mapreduce-inverted-index.html. Please credit the source when reprinting.
1. Introduction to Inverted Index
An inverted index (also commonly called a postings file or inverted file) is an index data structure that stores, for full-text search, a mapping from words to their locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.
There are two main variants of inverted indexes:
- A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that contain it.
- A word-level inverted index (or full inverted index) additionally records the position of each occurrence of the word within a document.
The latter form offers more functionality (such as phrase search), but needs more time and space to build.
Example:
Taking English as an example, here are the texts to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We obtain the following inverted file index:

"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}
A conjunctive search for the terms "what", "is" and "it" then corresponds to the set intersection {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.
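The conjunctive (AND) query above can be sketched as a set intersection in plain Java (an illustration added here, not part of the original article; the posting sets are hard-coded from the example):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BooleanAndQuery {
    // record-level inverted index from the example above
    static Map<String, Set<Integer>> index = new HashMap<>();
    static {
        index.put("a", new HashSet<>(Arrays.asList(2)));
        index.put("banana", new HashSet<>(Arrays.asList(2)));
        index.put("is", new HashSet<>(Arrays.asList(0, 1, 2)));
        index.put("it", new HashSet<>(Arrays.asList(0, 1, 2)));
        index.put("what", new HashSet<>(Arrays.asList(0, 1)));
    }

    // intersect the posting sets of all query words
    static Set<Integer> andQuery(String... words) {
        Set<Integer> result = new TreeSet<>(index.get(words[0]));
        for (int i = 1; i < words.length; i++) {
            result.retainAll(index.get(words[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(andQuery("what", "is", "it")); // [0, 1]
    }
}
```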
For the same texts, the full inverted index is shown below; each posting is a pair of (document number, word position within that document), with both counted from zero. For example, "banana": {(2, 3)} means that "banana" occurs in the third document (T2), where it is the fourth word (position 3).

"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
"what":   {(0, 2), (1, 0)}
If we run a phrase search for "what is it", every word of the phrase occurs in both document 0 and document 1; however, the words appear consecutively, as the phrase requires, only in document 1.
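The consecutive-position check that a word-level index makes possible can be sketched in plain Java (a toy illustration added here, not code from the original article; the in-memory index structure and the `phraseInDoc` helper are made up for demonstration):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PhraseSearch {
    // word -> (docId -> set of positions of the word in that document)
    static Map<String, Map<Integer, Set<Integer>>> index = new HashMap<>();

    static {
        // index the three example texts T0..T2 (word positions start at 0)
        String[] docs = { "it is what it is", "what is it", "it is a banana" };
        for (int d = 0; d < docs.length; d++) {
            String[] tokens = docs[d].split(" ");
            for (int p = 0; p < tokens.length; p++) {
                index.computeIfAbsent(tokens[p], w -> new HashMap<>())
                     .computeIfAbsent(d, k -> new HashSet<>())
                     .add(p);
            }
        }
    }

    // true if the words occur consecutively (as a phrase) in the document
    static boolean phraseInDoc(int doc, String... words) {
        Map<Integer, Set<Integer>> first = index.get(words[0]);
        if (first == null || !first.containsKey(doc)) return false;
        for (int start : first.get(doc)) {
            boolean match = true;
            for (int i = 1; i < words.length; i++) {
                Map<Integer, Set<Integer>> postings = index.get(words[i]);
                if (postings == null || !postings.containsKey(doc)
                        || !postings.get(doc).contains(start + i)) {
                    match = false;
                    break;
                }
            }
            if (match) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // all three words occur in documents 0 and 1, but only document 1
        // contains them consecutively
        System.out.println(phraseInDoc(0, "what", "is", "it")); // false
        System.out.println(phraseInDoc(1, "what", "is", "it")); // true
    }
}
```

A record-level index cannot answer this query by itself, because it discards the positions needed to distinguish document 0 from document 1.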
2. Analysis and Design
(1) Map process
First, the input file is processed with the default TextInputFormat class, which supplies the byte offset of each line as the key and the line's content as the value. The map stage must then parse each input <key, value> pair to extract the three pieces of information the inverted index needs: the word, the document URI, and the word frequency.
Two problems arise here. First, a <key, value> pair can carry only two values; unless a custom Hadoop data type is defined, two of the three fields must be packed together into either the key or the value, as appropriate. Second, word-frequency counting and document-list generation cannot both be completed in a single reduce pass, so a combine stage must be added to finish the word-frequency statistics.
public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {

    private Text keyInfo = new Text();    // stores the word:URI combination
    private Text valueInfo = new Text();  // stores the word frequency
    private FileSplit split;              // stores the split object

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // get the FileSplit object this <key, value> pair belongs to
        split = (FileSplit) context.getInputSplit();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // the key consists of a word and a URI, e.g. "MapReduce:1.txt"
            keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
            // the initial word frequency is 1
            valueInfo.set("1");
            context.write(keyInfo, valueInfo);
        }
    }
}
(2) Combine process
The combiner sums the values that share the same key, yielding the frequency of each word in each document.
public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

    private Text info = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // sum the word frequency
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        int splitIndex = key.toString().indexOf(":");
        // reset the value to "URI:frequency"
        info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
        // reset the key to the word alone
        key.set(key.toString().substring(0, splitIndex));
        context.write(key, info);
    }
}
(3) Reduce process
After the two stages above, the reduce stage only needs to concatenate the values belonging to the same key into the inverted-index file format; everything else can be left to the MapReduce framework.
public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    // note: the method must be named reduce to override Reducer#reduce;
    // otherwise the framework silently falls back to the identity reducer
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // generate the document list
        StringBuilder fileList = new StringBuilder();
        for (Text value : values) {
            fileList.append(value.toString()).append(";");
        }
        result.set(fileList.toString());
        context.write(key, result);
    }
}
The complete code is as follows:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {

        private Text keyInfo = new Text();
        private Text valueInfo = new Text();
        private FileSplit split;

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

        private Text info = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            result.set(fileList.toString());
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: invertedindex <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
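To preview the output format without a Hadoop cluster, the map, combine, and reduce stages can be simulated in plain Java (a rough sketch for illustration only; the document names `0.txt`, `1.txt`, `2.txt` are hypothetical stand-ins for the file URIs, and the real job of course runs through the Hadoop classes above):

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalSimulation {
    // returns word -> "URI:freq;URI:freq;..." in the reducer's output format
    static Map<String, String> buildIndex(String[] docs) {
        // map + combine: count word frequency per (word, document)
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (int d = 0; d < docs.length; d++) {
            String uri = d + ".txt"; // hypothetical document name
            for (String word : docs[d].split(" ")) {
                counts.computeIfAbsent(word, w -> new TreeMap<>())
                      .merge(uri, 1, Integer::sum);
            }
        }
        // reduce: concatenate the "URI:freq;" entries for each word
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            StringBuilder fileList = new StringBuilder();
            for (Map.Entry<String, Integer> f : e.getValue().entrySet()) {
                fileList.append(f.getKey()).append(":").append(f.getValue()).append(";");
            }
            out.put(e.getKey(), fileList.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] docs = { "it is what it is", "what is it", "it is a banana" };
        buildIndex(docs).forEach((word, list) -> System.out.println(word + "\t" + list));
    }
}
```

For the three example texts this prints lines such as `is<TAB>0.txt:2;1.txt:1;2.txt:1;`, matching the `word<TAB>URI:freq;URI:freq;...` records that InvertedIndexReducer emits.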
Resources
Http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95
"Hadoop in Action: Opening the Road to Cloud Computing", Liu Peng