MapReduce in Practice -- Inverted Index

Address of this article: http://www.cnblogs.com/archimedes/p/mapreduce-inverted-index.html. Please cite the source address when reprinting.

1. Introduction to Inverted Index

An inverted index, also commonly called a reverse index, postings file, or inverted file, is an indexing method that stores a mapping from each word to its locations in a document or a set of documents, for use in full-text search. It is the most commonly used data structure in document retrieval systems.

There are two different types of inverted index:

    • A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that reference it.
    • A word-level inverted index (or full inverted index) additionally contains the position of each word within a document.

The latter form provides more functionality (such as phrase search), but requires more time and space to create.

Example:

Take the following English texts as the documents to be indexed:

    • T0 = "it is what it is"
    • T1 = "what is it"
    • T2 = "it is a banana"

We get the following inverted file index:

    "a":      {2}
    "banana": {2}
    "is":     {0, 1, 2}
    "it":     {0, 1, 2}
    "what":   {0, 1}

A query for the terms "what", "is" and "it" corresponds to the intersection of their document sets: {0,1}∩{0,1,2}∩{0,1,2}={0,1}.
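
The record-level index and the intersection query can be reproduced in a few lines of plain Java. This is only a minimal in-memory sketch, not Hadoop code; the class name RecordLevelIndex and the methods build and query are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class RecordLevelIndex {
    // Build word -> sorted set of document numbers (documents are numbered from 0).
    public static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String word : docs.get(docId).split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // Intersect the document sets of all query words.
    public static Set<Integer> query(Map<String, Set<Integer>> index, String... words) {
        Set<Integer> result = null;
        for (String word : words) {
            Set<Integer> docs = index.getOrDefault(word, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy, so the index is not modified
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, Set<Integer>> index = build(docs);
        System.out.println(index);                             // {a=[2], banana=[2], is=[0, 1, 2], it=[0, 1, 2], what=[0, 1]}
        System.out.println(query(index, "what", "is", "it"));  // [0, 1]
    }
}
```

Sorted sets keep the document numbers in order, which makes the printed index match the listing above.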

For the same texts, we get the following full inverted index, in which each entry is a pair made up of a document number and the position of the word within that document. Both document numbers and positions start from zero.

So "banana": {(2, 3)} means that "banana" occurs in the third document (T2), where it is the fourth word (position 3).

    "a":      {(2, 2)}
    "banana": {(2, 3)}
    "is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
    "it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
    "what":   {(0, 2), (1, 0)}

If we perform a phrase search for "what is it", every word of the phrase is found in both document 0 and document 1. However, the words appear consecutively, in phrase order, only in document 1.
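
A full inverted index and the phrase search over it can be sketched in plain Java as follows. This is an illustrative in-memory version under simple assumptions (whitespace tokenization, exact word match); the class name FullInvertedIndex and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class FullInvertedIndex {
    // Build word -> list of {document number, position} pairs, both zero-based.
    public static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] words = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < words.length; pos++) {
                index.computeIfAbsent(words[pos], w -> new ArrayList<>())
                     .add(new int[] {docId, pos});
            }
        }
        return index;
    }

    // True if the index records the word at exactly (docId, pos).
    private static boolean occursAt(Map<String, List<int[]>> index,
                                    String word, int docId, int pos) {
        for (int[] p : index.getOrDefault(word, new ArrayList<>())) {
            if (p[0] == docId && p[1] == pos) {
                return true;
            }
        }
        return false;
    }

    // Return the documents in which the phrase's words occur at consecutive positions.
    public static Set<Integer> phraseSearch(Map<String, List<int[]>> index, String phrase) {
        String[] words = phrase.split("\\s+");
        Set<Integer> hits = new TreeSet<>();
        for (int[] start : index.getOrDefault(words[0], new ArrayList<>())) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                match = occursAt(index, words[i], start[0], start[1] + i);
            }
            if (match) {
                hits.add(start[0]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> index =
            build(Arrays.asList("it is what it is", "what is it", "it is a banana"));
        System.out.println(phraseSearch(index, "what is it")); // [1]
    }
}
```

The search starts from every occurrence of the first word and checks that each following word sits exactly one position further along in the same document, which is why document 0 is rejected even though it contains all three words.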

2. Analysis and Design

(1) Map process

First, the default TextInputFormat class processes the input files and yields, for each line of text, its offset and its content. The map process must then parse each input <key, value> pair to obtain the three pieces of information an inverted index needs: the word, the document URI, and the word frequency.

This raises two problems. First, a <key, value> pair can hold only two values; without a custom Hadoop data type, two of the three values have to be combined into either the key or the value, as appropriate.

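Second, a single reduce pass cannot both count the word frequency and generate the document list, so a combine process is added to do the word frequency count. Note that Hadoop treats a combiner as an optional optimization and may run it zero or more times, so this design relies on the combiner running exactly once over each map task's output.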

    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
        private Text keyInfo = new Text();   // Stores the combination of word and URI
        private Text valueInfo = new Text(); // Stores the word frequency
        private FileSplit split;             // Stores the split object

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the FileSplit that this <key, value> pair belongs to
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // The key consists of a word and a URI, such as "MapReduce:1.txt"
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                // The word frequency is initially 1
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

(2) Combine process

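The combine process sums the values that share the same key, which gives the frequency of each word within each document. It then moves the URI from the key into the value, so that the key is the word alone.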

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
        private Text info = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Sum the word frequency
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            // Reset the value to the URI plus the word frequency
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // Reset the key to the word alone
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }
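
The combiner splits the composite key at the first colon, which matters because a URI usually contains colons of its own. The following plain-Java sketch shows the packing and splitting in isolation. The class name CompositeKey, its methods, and the HDFS path are hypothetical, and the sketch assumes that the word itself never contains a colon.

```java
public class CompositeKey {
    // Pack a word and a document URI into one key, as the map step does.
    public static String pack(String word, String uri) {
        return word + ":" + uri;
    }

    // The word is everything before the first colon.
    public static String word(String key) {
        return key.substring(0, key.indexOf(":"));
    }

    // The URI is everything after the first colon, including any later colons.
    public static String uri(String key) {
        return key.substring(key.indexOf(":") + 1);
    }

    public static void main(String[] args) {
        // Hypothetical path used only for illustration.
        String key = pack("MapReduce", "hdfs://localhost:9000/input/1.txt");
        System.out.println(word(key)); // MapReduce
        System.out.println(uri(key));  // hdfs://localhost:9000/input/1.txt
    }
}
```

If a token could contain a colon, a separator that cannot occur in either part (or a custom Writable key type) would be a safer choice than ":".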

(3) Reduce process

After these two processes, the reduce process only has to join the values that share the same key into the format required for the inverted index file; everything else can be left to the MapReduce framework.

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        // The method must be named reduce to override Reducer.reduce;
        // under any other name the framework never calls it.
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Generate the document list
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            result.set(fileList.toString());
            context.write(key, result);
        }
    }
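
The data flow through the three phases can be traced without a cluster. The sketch below simulates map, combine and reduce on in-memory strings in plain Java; it does not use the Hadoop API. The class name LocalInvertedIndex, its methods, and the file names and contents are made up for illustration, and the sketch assumes the combine step runs exactly once over all map output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalInvertedIndex {
    // Map: emit ("word:uri", "1") for every token of every file.
    public static List<String[]> map(Map<String, String> files) {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().split("\\s+")) {
                out.add(new String[] {word + ":" + file.getKey(), "1"});
            }
        }
        return out;
    }

    // Combine: sum the counts per "word:uri" key, then emit (word, "uri:count").
    public static List<String[]> combine(List<String[]> mapped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String[] kv : mapped) {
            sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            int split = e.getKey().indexOf(":");
            out.add(new String[] {
                e.getKey().substring(0, split),
                e.getKey().substring(split + 1) + ":" + e.getValue()});
        }
        return out;
    }

    // Reduce: concatenate all "uri:count" values of a word, each followed by ";".
    public static Map<String, String> reduce(List<String[]> combined) {
        Map<String, String> index = new TreeMap<>();
        for (String[] kv : combined) {
            index.merge(kv[0], kv[1] + ";", String::concat);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("1.txt", "MapReduce is simple");
        files.put("2.txt", "MapReduce is powerful is simple");
        // {MapReduce=1.txt:1;2.txt:1;, is=1.txt:1;2.txt:2;, powerful=2.txt:1;, simple=1.txt:1;2.txt:1;}
        System.out.println(reduce(combine(map(files))));
    }
}
```

Each output entry has the shape word → "uri:frequency;uri:frequency;", which is the format the reduce process above writes to the index file.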

The complete code is as follows:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class InvertedIndex {

        public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
            private Text keyInfo = new Text();
            private Text valueInfo = new Text();
            private FileSplit split;

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                split = (FileSplit) context.getInputSplit();
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    // Key: "word:URI", value: initial frequency "1"
                    keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                    valueInfo.set("1");
                    context.write(keyInfo, valueInfo);
                }
            }
        }

        public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
            private Text info = new Text();

            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (Text value : values) {
                    sum += Integer.parseInt(value.toString());
                }
                // Move the URI from the key into the value: (word, "URI:frequency")
                int splitIndex = key.toString().indexOf(":");
                info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
                key.set(key.toString().substring(0, splitIndex));
                context.write(key, info);
            }
        }

        public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
            private Text result = new Text();

            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // Join all "URI:frequency" entries of a word into one document list
                StringBuilder fileList = new StringBuilder();
                for (Text value : values) {
                    fileList.append(value.toString()).append(";");
                }
                result.set(fileList.toString());
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: InvertedIndex <in> <out>");
                System.exit(2);
            }
            Job job = new Job(conf, "InvertedIndex");
            job.setJarByClass(InvertedIndex.class);
            job.setMapperClass(InvertedIndexMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setCombinerClass(InvertedIndexCombiner.class);
            job.setReducerClass(InvertedIndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Resources

http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95

Liu Peng, "Combat Hadoop: Open the way to cloud computing."
