Objective:
From the IT era to the DT (data technology) era, the volume of data grows enormously every day. How can a search engine keep working well in the face of so much data? This article, the second in the Hadoop series, introduces the basic technique behind search engines in a distributed setting: the inverted index.
1. Description of the problem
Index the keywords that appear in a set of files so that they can be retrieved quickly. Suppose there are three files with the following contents:
file1.txt: MapReduce is simple
file2.txt: MapReduce is powerful is simple
file3.txt: Hello MapReduce bye MapReduce
You should eventually produce the following index results:
Hello      file3.txt:1
MapReduce  file3.txt:2; file2.txt:1; file1.txt:1
bye        file3.txt:1
is         file2.txt:2; file1.txt:1
powerful   file2.txt:1
simple     file2.txt:1; file1.txt:1
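To make the target structure concrete before bringing in Hadoop, here is a minimal single-machine sketch in plain Java. The class name InMemoryInvertedIndex and the sample file contents are assumptions for illustration (the contents are chosen only so that the counts match the index above); it is not the distributed implementation this article builds.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: builds the index in memory.
final class InMemoryInvertedIndex {

    // Returns word -> (file name -> number of occurrences in that file).
    static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            StringTokenizer tokens = new StringTokenizer(file.getValue());
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                // Create the posting map on first sight, then bump the count.
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed sample contents, consistent with the counts in the expected index.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file1.txt", "MapReduce is simple");
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");
        build(files).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

The nested map (word, then file, then count) is the same information the MapReduce job writes out as text, one word per line.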
--------------------------------------------------------
2. Design
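First, the map operation preprocesses the data we read in: for every word it emits a pair whose key is "word:filename" and whose value is 1 (Figure 1).
Compared with the earlier word count example (WordCount.java), an inverted index cannot be completed with map and reduce alone, so we add a "Combine" (merge) step between them. It sums the counts for each "word:filename" key and then moves the file name into the value, leaving the word as the key (Figure 2).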
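The way keys and values change from stage to stage can be followed in plain Java without a cluster. This is only a sketch of the data flow under stated assumptions: StageSketch and its three methods are hypothetical names, key/value pairs are modelled as two-element String arrays, and a sorted map stands in for Hadoop's shuffle and sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process simulation of map -> combine -> reduce.
final class StageSketch {

    // Map stage: emit one ("word:file", "1") pair per token.
    static List<String[]> map(String fileName, String line) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new String[] {tokens.nextToken() + ":" + fileName, "1"});
        }
        return out;
    }

    // Combine stage: sum the counts per "word:file" key,
    // then re-key each result as word -> "file:count".
    static List<String[]> combine(List<String[]> mapped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String[] pair : mapped) {
            sums.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            int split = e.getKey().indexOf(':');
            out.add(new String[] {
                e.getKey().substring(0, split),
                e.getKey().substring(split + 1) + ":" + e.getValue()});
        }
        return out;
    }

    // Reduce stage: join all "file:count" values of one word with ';'.
    static Map<String, String> reduce(List<String[]> combined) {
        Map<String, String> out = new TreeMap<>();
        for (String[] pair : combined) {
            out.merge(pair[0], pair[1] + ";", String::concat);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> mapped = new ArrayList<>();
        mapped.addAll(map("file1.txt", "MapReduce is simple"));
        mapped.addAll(map("file2.txt", "MapReduce is powerful is simple"));
        reduce(combine(mapped)).forEach((word, files) -> System.out.println(word + "\t" + files));
    }
}
```

The point to notice is the re-keying in the combine step: the file name travels in the key while counts are being summed, and moves into the value once the per-file count is final.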
--------------------------------------------------------------
3. Code implementation
package pro;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {
    static final String INPUT_PATH = "hdfs://hadoop0:9000/index_in";
    static final String OUTPUT_PATH = "hdfs://hadoop0:9000/index_out";

    public static class Map extends Mapper<Object, Text, Text, Text> {

        private Text keyInfo = new Text();   // stores the "word:filename" combination
        private Text valueInfo = new Text(); // stores the word frequency
        private FileSplit split;             // stores the split object

        // implements the map function
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // get the FileSplit that this <key, value> pair belongs to
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // keep only the file name; this assumes the name starts with "file"
                int splitIndex = split.getPath().toString().indexOf("file");
                keyInfo.set(itr.nextToken() + ":"
                        + split.getPath().toString().substring(splitIndex));
                // word frequency is initialized to 1
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    // Note: this combiner rewrites the key. Hadoop chooses the partition from the
    // map output key ("word:filename") before the combiner runs, so the job as
    // written relies on a single reducer and on the combiner running exactly once.
    public static class Combine extends Reducer<Text, Text, Text, Text> {
        private Text info = new Text();

        // implements the reduce function of the combine step
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // sum the word frequency
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }

            int splitIndex = key.toString().indexOf(":");
            // the new value is made up of the file name and the word frequency
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // the new key is the word alone
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        // implements the reduce function
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // generate the document list
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            result.set(fileList.toString());

            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        Job job = new Job(conf, "inverted index");
        job.setJarByClass(InvertedIndex.class);

        // set the map, combine, and reduce processing classes
        job.setMapperClass(Map.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        // set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // set the reduce output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
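The mapper above obtains the file name by cutting the path string at the first occurrence of "file", which only works when the input files are named that way. The sketch below contrasts that with cutting after the last "/". FileNameSketch and both method names are hypothetical; inside a real mapper, Hadoop's Path.getName() returns the final path component and would probably be the more natural choice.

```java
// Hypothetical comparison of two ways to extract a file name from a path string.
final class FileNameSketch {

    // Mirrors the listing: assumes the file name starts with "file".
    // Throws StringIndexOutOfBoundsException when "file" is not in the path.
    static String byMarker(String path) {
        return path.substring(path.indexOf("file"));
    }

    // Alternative: keep everything after the last '/', whatever the file is called.
    static String byLastSlash(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String path = "hdfs://hadoop0:9000/index_in/file1.txt";
        System.out.println(byMarker(path));
        System.out.println(byLastSlash(path));
    }
}
```

For the three sample files both methods give the same answer; they differ only for inputs whose names do not contain "file".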
4. Test results
Hello      file3.txt:1;
MapReduce  file3.txt:2;file1.txt:1;file2.txt:1;
bye        file3.txt:1;
is         file1.txt:1;file2.txt:2;
powerful   file2.txt:1;
simple     file2.txt:1;file1.txt:1;
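Once the job has written lines like these, answering a query means looking a word up and reading its posting list. The following is a minimal sketch of that read side, assuming each output line is a word, some whitespace, and then "file:count;" entries as shown above. IndexLookupSketch, parsePostings, and load are hypothetical names, not part of Hadoop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the job's text output.
final class IndexLookupSketch {

    // Parses "file1.txt:1;file2.txt:2;" into file name -> count.
    static Map<String, Integer> parsePostings(String postings) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String entry : postings.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blanks between separators
            }
            int split = trimmed.lastIndexOf(':');
            result.put(trimmed.substring(0, split),
                       Integer.parseInt(trimmed.substring(split + 1).trim()));
        }
        return result;
    }

    // Loads output lines of the form "word<whitespace>postings" into a lookup table.
    static Map<String, Map<String, Integer>> load(List<String> lines) {
        Map<String, Map<String, Integer>> index = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2) {
                index.put(parts[0], parsePostings(parts[1]));
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "is\tfile1.txt:1;file2.txt:2;",
            "powerful\tfile2.txt:1;");
        System.out.println(load(lines).get("is"));
    }
}
```

A hash map gives constant-time lookup per word, which is the "retrieve quickly" half of the problem statement; the MapReduce job covers the "build" half.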
--------------
Conclusion:
From the map --> combine --> reduce flow above, we can see that building an inverted index is essentially a process of combining and splitting strings, which illustrates MapReduce's parallel computation in Hadoop. In most companies today, one of Hadoop's main uses is log processing, so readers who want to enter the big data field can deepen their understanding of how Hadoop implements map/reduce through more hands-on practice. This article is only a starting point; I am still gradually exploring the deeper applications of Hadoop myself.