The inverted index of Hadoop


Objective:
As the world moves from IT to DT, data grows at a massive rate every day. How can a search engine work well in the face of such huge volumes of data? This article, the second in the Hadoop series, introduces a basic building block of a distributed search engine: the "inverted index".

1. Description of the problem
We want to store the keywords appearing in a set of files and retrieve them quickly. Assume there are 3 files with the following contents:

file1.txt: MapReduce is simple
file2.txt: MapReduce is powerful is simple
file3.txt: Hello MapReduce bye MapReduce

You should eventually produce the following index results:

Hello        file3.txt:1
MapReduce    file3.txt:2; file2.txt:1; file1.txt:1
bye          file3.txt:1
is           file2.txt:2; file1.txt:1
powerful     file2.txt:1
simple       file2.txt:1; file1.txt:1
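Before looking at the MapReduce version, the idea can be sketched in plain Java without Hadoop. The class and method names below (SimpleInvertedIndex, build) are illustrative, not part of any Hadoop API: the sketch simply builds a word → {file → count} mapping from the three sample files above.

```java
import java.util.*;

public class SimpleInvertedIndex {
    // Builds word -> {file -> count} from a map of file name -> file contents.
    public static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            // tokenize on whitespace, count each (word, file) occurrence
            for (String word : e.getValue().split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeMap<>())
                     .merge(e.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file1.txt", "MapReduce is simple");
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");
        build(files).forEach((word, postings) ->
                System.out.println(word + "  " + postings));
    }
}
```

This single-machine version is trivially small; the point of the MapReduce design below is to produce the same index when the files no longer fit on one machine.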

--------------------------------------------------------

2. Design
First, the map operation preprocesses the data it reads: for every word in a line, it emits a key of the form "word:filename" with the value "1".

Compared with the earlier word count example (WordCount.java), the inverted index clearly cannot be completed with the map and reduce operations alone, because the file name has to move from the key into the value between the two phases. We therefore add a "Combine" (merge) step: the combiner sums the counts for each "word:filename" key, then splits that key so that the word alone becomes the key and "filename:frequency" becomes the value.
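The record shapes at each stage can be simulated in plain Java, outside Hadoop. The class StageDemo and its combine/reduce methods are illustrative names for this sketch, not the Hadoop API; they only show the string splitting and concatenation that each phase performs.

```java
import java.util.*;

// Simulates the record shapes of the inverted-index job without Hadoop:
// map emits ("word:file", "1"), the combiner folds this into
// ("word", "file:freq"), and reduce concatenates the postings list.
public class StageDemo {
    // Combine step: key "word:file" plus a list of counts
    // -> new key "word", new value "file:sum".
    public static String[] combine(String key, List<String> counts) {
        int sum = 0;
        for (String c : counts) sum += Integer.parseInt(c);
        int i = key.indexOf(":");
        return new String[] { key.substring(0, i), key.substring(i + 1) + ":" + sum };
    }

    // Reduce step: all "file:freq" values for one word
    // -> a single "file1:f1;file2:f2;" postings string.
    public static String reduce(List<String> postings) {
        StringBuilder sb = new StringBuilder();
        for (String p : postings) sb.append(p).append(";");
        return sb.toString();
    }

    public static void main(String[] args) {
        // file3.txt ("Hello MapReduce bye MapReduce") makes the map side
        // emit ("MapReduce:file3.txt", "1") twice, among other pairs.
        String[] kv = combine("MapReduce:file3.txt", Arrays.asList("1", "1"));
        System.out.println(kv[0] + " -> " + kv[1]); // MapReduce -> file3.txt:2
        System.out.println(reduce(Arrays.asList("file3.txt:2", "file1.txt:1")));
    }
}
```

Note that the combiner is the step that turns "word:file" back into a plain word key, which is exactly what lets the reducer group postings by word.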

--------------------------------------------------------------

3. Code implementation

package pro;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {
    final static String INPUT_PATH = "hdfs://hadoop0:9000/index_in";
    final static String OUTPUT_PATH = "hdfs://hadoop0:9000/index_out";

    public static class Map extends Mapper<Object, Text, Text, Text> {
        private Text keyInfo = new Text();   // stores the word:file combination
        private Text valueInfo = new Text(); // stores the word frequency
        private FileSplit split;             // stores the split object

        // implements the map function
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // get the FileSplit this <key, value> pair belongs to
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // keep only the file name part of the path
                int splitIndex = split.getPath().toString().indexOf("file");
                keyInfo.set(itr.nextToken() + ":"
                        + split.getPath().toString().substring(splitIndex));
                // word frequency initialized to 1
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class Combine extends Reducer<Text, Text, Text, Text> {
        private Text info = new Text();

        // implements the combine (local reduce) function
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // sum the word frequency
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }

            int splitIndex = key.toString().indexOf(":");
            // reset the value to "file:frequency"
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // reset the key to the word alone
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        // implements the reduce function
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // generate the document list
            String fileList = new String();
            for (Text value : values) {
                fileList += value.toString() + ";";
            }
            result.set(fileList);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "Inverted Index");
        job.setJarByClass(InvertedIndex.class);

        // set the map, combine, and reduce processing classes
        job.setMapperClass(Map.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        // set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // set the reduce output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4. Test results

Hello        file3.txt:1;
MapReduce    file3.txt:2; file1.txt:1; file2.txt:1;
bye          file3.txt:1;
is           file1.txt:1; file2.txt:2;
powerful     file2.txt:1;
simple       file2.txt:1; file1.txt:1;


--------------

Conclusion:

From the map -> combine -> reduce flow above, we can see that building an inverted index is essentially a process of splitting and concatenating strings, which illustrates how MapReduce parallel computation works in Hadoop. In many enterprises today, one of Hadoop's main applications is log processing, so readers who want to enter the big-data field should practice Hadoop's map/reduce model hands-on to deepen their understanding of it. This article is only an introduction; I am still exploring Hadoop's deeper applications myself.

