The inverted index of Hadoop


Objective:
As the world moves from IT to DT, data grows at a massive rate every day. How can a search engine work well in the face of such huge volumes of data? This article, the second in the Hadoop series, introduces a basic building block of a distributed search engine: the "inverted index".

1. Description of the problem
We want to store the keywords appearing in a set of files and retrieve them quickly. Assume there are 3 files with the following contents:

file1.txt: MapReduce is simple
file2.txt: MapReduce is powerful is simple
file3.txt: Hello MapReduce bye MapReduce

You should eventually produce the following index results:

Hello        file3.txt:1
MapReduce    file3.txt:2; file2.txt:1; file1.txt:1
bye          file3.txt:1
is           file2.txt:2; file1.txt:1
powerful     file2.txt:1
simple       file2.txt:1; file1.txt:1
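Before looking at the MapReduce version, the idea can be sketched in plain Java without Hadoop. The class and method names below (SimpleInvertedIndex, build) are illustrative, not part of any Hadoop API: the sketch simply builds a word → {file → count} mapping from the three sample files above.

```java
import java.util.*;

public class SimpleInvertedIndex {
    // Builds word -> {file -> count} from a map of file name -> file contents.
    public static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            // tokenize on whitespace, count each (word, file) occurrence
            for (String word : e.getValue().split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeMap<>())
                     .merge(e.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file1.txt", "MapReduce is simple");
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");
        build(files).forEach((word, postings) ->
                System.out.println(word + "  " + postings));
    }
}
```

This single-machine version is trivially small; the point of the MapReduce design below is to produce the same index when the files no longer fit on one machine.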

--------------------------------------------------------

2. Design
First, the map operation preprocesses the data it reads: for every word in a line, it emits a key of the form "word:filename" with the value "1".

Compared with the earlier word count example (WordCount.java), the inverted index clearly cannot be completed with the map and reduce operations alone, because the file name has to move from the key into the value between the two phases. We therefore add a "Combine" (merge) step: the combiner sums the counts for each "word:filename" key, then splits that key so that the word alone becomes the key and "filename:frequency" becomes the value.
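The record shapes at each stage can be simulated in plain Java, outside Hadoop. The class StageDemo and its combine/reduce methods are illustrative names for this sketch, not the Hadoop API; they only show the string splitting and concatenation that each phase performs.

```java
import java.util.*;

// Simulates the record shapes of the inverted-index job without Hadoop:
// map emits ("word:file", "1"), the combiner folds this into
// ("word", "file:freq"), and reduce concatenates the postings list.
public class StageDemo {
    // Combine step: key "word:file" plus a list of counts
    // -> new key "word", new value "file:sum".
    public static String[] combine(String key, List<String> counts) {
        int sum = 0;
        for (String c : counts) sum += Integer.parseInt(c);
        int i = key.indexOf(":");
        return new String[] { key.substring(0, i), key.substring(i + 1) + ":" + sum };
    }

    // Reduce step: all "file:freq" values for one word
    // -> a single "file1:f1;file2:f2;" postings string.
    public static String reduce(List<String> postings) {
        StringBuilder sb = new StringBuilder();
        for (String p : postings) sb.append(p).append(";");
        return sb.toString();
    }

    public static void main(String[] args) {
        // file3.txt ("Hello MapReduce bye MapReduce") makes the map side
        // emit ("MapReduce:file3.txt", "1") twice, among other pairs.
        String[] kv = combine("MapReduce:file3.txt", Arrays.asList("1", "1"));
        System.out.println(kv[0] + " -> " + kv[1]); // MapReduce -> file3.txt:2
        System.out.println(reduce(Arrays.asList("file3.txt:2", "file1.txt:1")));
    }
}
```

Note that the combiner is the step that turns "word:file" back into a plain word key, which is exactly what lets the reducer group postings by word.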

--------------------------------------------------------------

3. Code implementation

package pro;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {
    final static String INPUT_PATH = "hdfs://hadoop0:9000/index_in";
    final static String OUTPUT_PATH = "hdfs://hadoop0:9000/index_out";

    public static class Map extends Mapper<Object, Text, Text, Text> {
        private Text keyInfo = new Text();   // stores the word:file combination
        private Text valueInfo = new Text(); // stores the word frequency
        private FileSplit split;             // stores the split object

        // implements the map function
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // get the FileSplit this <key, value> pair belongs to
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // keep only the file name part of the path
                int splitIndex = split.getPath().toString().indexOf("file");
                keyInfo.set(itr.nextToken() + ":"
                        + split.getPath().toString().substring(splitIndex));
                // word frequency initialized to 1
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class Combine extends Reducer<Text, Text, Text, Text> {
        private Text info = new Text();

        // implements the combine (local reduce) function
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // sum the word frequency
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }

            int splitIndex = key.toString().indexOf(":");
            // reset the value to "file:frequency"
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // reset the key to the word alone
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        // implements the reduce function
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // generate the document list
            String fileList = new String();
            for (Text value : values) {
                fileList += value.toString() + ";";
            }
            result.set(fileList);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "Inverted Index");
        job.setJarByClass(InvertedIndex.class);

        // set the map, combine, and reduce processing classes
        job.setMapperClass(Map.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        // set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // set the reduce output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4. Test results

Hello        file3.txt:1;
MapReduce    file3.txt:2; file1.txt:1; file2.txt:1;
bye          file3.txt:1;
is           file1.txt:1; file2.txt:2;
powerful     file2.txt:1;
simple       file2.txt:1; file1.txt:1;


--------------

Conclusion:

From the map -> combine -> reduce flow above, we can see that building an inverted index is essentially a process of splitting and concatenating strings, which illustrates how MapReduce parallel computation works in Hadoop. In many enterprises today, one of Hadoop's main applications is log processing, so readers who want to enter the big-data field should practice Hadoop's map/reduce model hands-on to deepen their understanding of it. This article is only an introduction; I am still exploring Hadoop's deeper applications myself.

