Using Hadoop to implement document inverted indexes

Last Update:2015-04-16 Source: Internet

Author: User


A document inverted index records how often each word appears in each document. The word is the key, and the value is a list of pairs, each giving a document and the word's frequency in that document. The output data therefore has the following format:

<word1, [doc1,3] [doc2,4] ...> : the word word1 appears 3 times in document doc1 and 4 times in document doc2.
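To make the target format concrete, here is a minimal in-memory sketch in plain Java, without Hadoop. The class name PostingFormatSketch and its methods are hypothetical and exist only for illustration; the real program produces the same shape of output with MapReduce.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of the output format; not part of the Hadoop program. */
final class PostingFormatSketch {

    /** Builds word -> (document -> count) from a map of document name -> text. */
    static Map<String, Map<String, Integer>> index(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // split can yield an empty leading token
                }
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    /** Renders one entry as "word [doc1,3] [doc2,4]". */
    static String format(String word, Map<String, Integer> postings) {
        StringBuilder out = new StringBuilder(word);
        for (Map.Entry<String, Integer> p : postings.entrySet()) {
            out.append(" [").append(p.getKey()).append(',').append(p.getValue()).append(']');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "hadoop hadoop hadoop index");
        docs.put("doc2", "hadoop hadoop hadoop hadoop");
        Map<String, Map<String, Integer>> idx = index(docs);
        System.out.println(format("hadoop", idx.get("hadoop"))); // hadoop [doc1,3] [doc2,4]
    }
}
```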

The input to the program is a set of files such as File01.txt, File02.txt, File03.txt, and so on. First upload these files to Hadoop HDFS as the program's input. For the upload procedure and for compiling the Java classes, see the blog post "Running the Hadoop sample program WordCount"; they are not described again here. The source code of the program is at the end of this article.

1. General idea of the program

An inverted index describes the relationship between words and documents. The system default LineRecordReader uses the offset of each line as the map input key and the content of the line as the value, but the line offset is of no use to us. Instead we use the file name as the key and the content of each line as the value, which is easier to handle: the map input has the form <filename, a line>. This is implemented with a custom RecordReader class, described below. The data processing flow of the whole program is as follows:

The Map class processes the program input, which has the form <filename, a line>: the key is a file name such as File01.txt and the value is one line of that file. The map task splits the line into words and emits each one in the form shown in the first stage of the flow diagram, that is, <word#filename, 1>.

The Combine class merges (adds up) the values that share a key in the map output. Combining happens on the node where the map ran, which reduces the amount of data transferred.

The partitioner divides the combine output so that records with the same key are sent to the same node. The reduce operation then has all the data it needs locally and does not have to fetch it from other nodes over the network. The note "Partition by word1 rather than word1#doc1" in the diagram means that word1, not word1#doc1, is used as the key when partitioning. Because the keys emitted earlier have the form word1#doc1, the system would by default partition on word1#doc1, whereas we want to partition on word1, so we need a custom Partitioner class.

The reduce operation aggregates the results and puts them into the output format we want.

2. Design of the program and its classes

This section describes the design and function of each class in the order in which the classes are executed. Some classes inherit methods from a parent class without overriding them; those methods are not described in detail here.

2.1 FileNameRecordReader class

The FileNameRecordReader class extends RecordReader and is a custom implementation of it. Its main function is to use the name of the file that a record comes from as the key, rather than the offset of the record's line within the file. The statement used to get the file name is:
fileName = ((FileSplit) arg0).getPath().getName();
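The difference between the two kinds of key can be sketched in plain Java without Hadoop. The class RecordKeySketch and both of its methods are hypothetical names used only for this illustration; the offset computation assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical comparison of offset-keyed and file-name-keyed records. */
final class RecordKeySketch {

    /** Default-style records: the key is the byte offset at which each line starts. */
    static List<String> offsetKeyed(String content) {
        List<String> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add("<" + offset + ", " + line + ">");
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return records;
    }

    /** Custom-style records: the key is the last component of the file path. */
    static List<String> fileNameKeyed(String path, String content) {
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        List<String> records = new ArrayList<>();
        for (String line : content.split("\n")) {
            records.add("<" + fileName + ", " + line + ">");
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "hello world\nbye world";
        System.out.println(offsetKeyed(content));
        // [<0, hello world>, <12, bye world>]
        System.out.println(fileNameKeyed("input/File01.txt", content));
        // [<File01.txt, hello world>, <File01.txt, bye world>]
    }
}
```

With file-name keys, every record already carries the document it belongs to, so the mapper does not need to look it up.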

2.2 FileNameInputFormat class

Because we wrote a custom RecordReader, we also extend the FileInputFormat class so that the job uses our FileNameRecordReader. The main function of this class is to return an instance of FileNameRecordReader.

2.3 InvertedIndexMapper class

This class extends Mapper; its main methods are setup and map. The setup method initializes a list of stop words before map runs. When map processes an input word that is in the stop word list, the word is skipped. The stop words are stored in HDFS as a text file. When the program starts, the Hadoop configuration registers this file as a cache file shared by every node, and the stop word list is built from it before map executes.
The main operation of InvertedIndexMapper is map. This method splits the line it reads into words and writes out key-value pairs of the form <key: word1#doc1, value: 1>; the value written by map is always 1. The class diagram for the InvertedIndexMapper class is shown in Figure 2.
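The tokenizing and stop word filtering done by map can be tried out in isolation. The sketch below is plain Java with a hypothetical class MapStepSketch; it returns the keys that would be emitted, each of which implicitly carries the value 1. It assumes that non-word characters are replaced by a space before tokenizing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

/** Hypothetical stand-alone version of the map step. */
final class MapStepSketch {

    /** Returns one "word#fileName" key per non-stop-word token in the line. */
    static List<String> map(String fileName, String line, Set<String> stopWords) {
        List<String> keys = new ArrayList<>();
        // Lower-case the line and turn every non-word character into a space.
        String cleaned = line.toLowerCase().replaceAll("[^\\w]", " ");
        StringTokenizer itr = new StringTokenizer(cleaned);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            if (!stopWords.contains(word)) {
                keys.add(word + "#" + fileName);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(map("File01.txt", "The quick fox, the lazy dog!", stopWords));
        // [quick#File01.txt, fox#File01.txt, lazy#File01.txt, dog#File01.txt]
    }
}
```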

2.4 SumCombiner class

This class merges the output of InvertedIndexMapper: if a word appears more than once in a document, the value is set to the sum of its occurrences.
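As a rough illustration of what combining does on one node, the following plain Java sketch sums the implicit 1 values for identical keys. CombineStepSketch is a hypothetical name, and the input list stands in for the map output of a single map task.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-alone version of the combine step. */
final class CombineStepSketch {

    /** Sums the occurrences of each "word#doc" key; every input key counts as 1. */
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String key : mapOutputKeys) {
            sums.merge(key, 1, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("fish#doc1", "cat#doc1", "fish#doc1");
        System.out.println(combine(keys)); // {cat#doc1=1, fish#doc1=2}
    }
}
```

Three records go in and two come out, which is the reduction in transferred data that the combiner is meant to achieve.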

2.5 NewPartitioner class

A partitioner divides the preceding output among the reduce nodes, normally based on the whole key. Here the key is word1#doc1, but we want all records for the same word to end up on the same node, so NewPartitioner partitions on the word alone.
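The effect of partitioning on the word part only can be checked with a small plain Java sketch. WordPartitionSketch is a hypothetical name. The formula mirrors the usual hash-partitioning rule (hash code masked to a non-negative value, modulo the number of reduce tasks), but it uses String.hashCode as a stand-in, so the partition numbers will not match what Hadoop computes for Text keys.

```java
/** Hypothetical stand-alone version of partitioning by word. */
final class WordPartitionSketch {

    /** Chooses a partition from the word part of a "word#doc" key. */
    static int partition(String compositeKey, int numReduceTasks) {
        String word = compositeKey.split("#")[0];
        // Mask off the sign bit so the result is never negative.
        return (word.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int p1 = partition("word1#doc1", 4);
        int p2 = partition("word1#doc2", 4);
        System.out.println(p1 == p2); // true: same word, same partition
    }
}
```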

2.6 InvertedIndexReducer class

The input to InvertedIndexReducer's reduce method has the form <key: word1#doc1, value: 2> <key: word1#doc2, value: 1> <key: word2#doc1, value: 1>. As the first figure shows, the same word reaches reduce as several separate inputs, but the final result should contain each word only once, with the documents (doc1, doc2, and so on) output as the value for that word.

To implement this, the reducer keeps two variables, CurrentItem and postingList. CurrentItem holds the word from the most recently read key and is initially empty. postingList holds, for that word, the documents in which it occurs and the number of occurrences in each.

Each time a key is read, its word is compared with CurrentItem. If they are the same, the document from the new key is appended to postingList. If they differ, all keys for the previous word have been read, so we compute the total number of occurrences of CurrentItem and the number of documents that contain it. Both can be obtained by traversing postingList, after which the result is written out and CurrentItem and postingList are reset. See the code for details. The class diagram is as shown.
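The bookkeeping with a current word and a posting list can be sketched outside Hadoop as follows. PostingMergeSketch is a hypothetical name, and the sorted map stands in for the key-ordered stream of <word#doc, count> pairs that one reducer receives. The output format is simplified: it appends only a total, not the average.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical stand-alone version of the reduce-side merge. */
final class PostingMergeSketch {

    /** Walks sorted "word#doc" keys and emits one line per word. */
    static List<String> merge(SortedMap<String, Integer> counts) {
        List<String> out = new ArrayList<>();
        String currentItem = "";
        List<String> postingList = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] parts = e.getKey().split("#");
            if (!currentItem.isEmpty() && !currentItem.equals(parts[0])) {
                out.add(flush(currentItem, postingList)); // previous word is complete
                postingList = new ArrayList<>();
            }
            currentItem = parts[0];
            postingList.add("[" + parts[1] + "," + e.getValue() + "]");
        }
        if (!postingList.isEmpty()) {
            out.add(flush(currentItem, postingList)); // last word, like cleanup()
        }
        return out;
    }

    /** Joins the postings of one word and appends the total count. */
    static String flush(String word, List<String> postings) {
        long total = 0;
        StringBuilder sb = new StringBuilder(word);
        for (String p : postings) {
            sb.append(' ').append(p);
            total += Long.parseLong(p.substring(p.indexOf(',') + 1, p.indexOf(']')));
        }
        return sb.append(" [total,").append(total).append(']').toString();
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        counts.put("word1#doc1", 2);
        counts.put("word1#doc2", 1);
        counts.put("word2#doc1", 1);
        for (String line : merge(counts)) {
            System.out.println(line);
        }
        // word1 [doc1,2] [doc2,1] [total,3]
        // word2 [doc1,1] [total,1]
    }
}
```

The final flush after the loop matters: without it the last word would never be written, which is why the real reducer needs a cleanup method.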

3. Running the program

I compiled and ran the program with the following commands; adjust the paths to match your own directories.

  javac -classpath ~/hadoop-1.2.1/hadoop-core-1.2.1.jar -d ./ InvertedIndexer.java
  jar -cfv inverted.jar -C ./ .
  hadoop jar ./inverted.jar InvertedIndexer input output
  # After the job finishes, display the result
  hadoop fs -cat output/part-r-00000

Results:

4. Source program

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class InvertedIndexer {

    // Input format that hands out FileNameRecordReader instances.
    public static class FileNameInputFormat extends FileInputFormat<Text, Text> {
        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            FileNameRecordReader fnrr = new FileNameRecordReader();
            fnrr.initialize(split, context);
            return fnrr;
        }
    }

    // Record reader whose key is the file name and whose value is one line.
    public static class FileNameRecordReader extends RecordReader<Text, Text> {
        String fileName;
        LineRecordReader lrr = new LineRecordReader();

        @Override
        public Text getCurrentKey() throws IOException, InterruptedException {
            return new Text(fileName);
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return lrr.getCurrentValue();
        }

        @Override
        public void initialize(InputSplit arg0, TaskAttemptContext arg1) throws IOException, InterruptedException {
            lrr.initialize(arg0, arg1);
            fileName = ((FileSplit) arg0).getPath().getName();
        }

        @Override
        public void close() throws IOException {
            lrr.close();
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return lrr.nextKeyValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lrr.getProgress();
        }
    }

    public static class InvertedIndexMapper extends Mapper<Text, Text, Text, IntWritable> {
        private Set<String> stopwords;
        private Path[] localFiles;
        private String pattern = "[^\\w]";

        // Load the stop words from the cached files before any map call.
        public void setup(Context context) throws IOException, InterruptedException {
            stopwords = new TreeSet<String>();
            Configuration conf = context.getConfiguration();
            localFiles = DistributedCache.getLocalCacheFiles(conf);
            for (int i = 0; i < localFiles.length; i++) {
                String line;
                BufferedReader br = new BufferedReader(new FileReader(localFiles[i].toString()));
                while ((line = br.readLine()) != null) {
                    StringTokenizer itr = new StringTokenizer(line);
                    while (itr.hasMoreTokens()) {
                        stopwords.add(itr.nextToken());
                    }
                }
                br.close();
            }
        }

        // Emit <word#filename, 1> for every word that is not a stop word.
        protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
            String temp = new String();
            String line = value.toString().toLowerCase();
            // Replace non-word characters with a space so that words stay separated.
            line = line.replaceAll(pattern, " ");
            StringTokenizer itr = new StringTokenizer(line);
            for (; itr.hasMoreTokens();) {
                temp = itr.nextToken();
                if (!stopwords.contains(temp)) {
                    Text word = new Text();
                    word.set(temp + "#" + key);
                    context.write(word, new IntWritable(1));
                }
            }
        }
    }

    public static class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static class NewPartitioner extends HashPartitioner<Text, IntWritable> {
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            String term = key.toString().split("#")[0]; // <term#docid> => term
            return super.getPartition(new Text(term), value, numReduceTasks);
        }
    }

    public static class InvertedIndexReducer extends Reducer<Text, IntWritable, Text, Text> {
        private Text word1 = new Text();
        private Text word2 = new Text();
        String temp = new String();
        static Text CurrentItem = new Text("");
        static List<String> postingList = new ArrayList<String>();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            String keyWord = key.toString().split("#")[0];
            // Pad the word with blanks to 15 characters so the output lines up.
            int needBlank = 15 - keyWord.length();
            for (int i = 0; i < needBlank; i++) {
                keyWord += " ";
            }
            word1.set(keyWord);
            temp = key.toString().split("#")[1]; // key has the form word1#doc1, so temp is doc1
            for (IntWritable val : values) { // total count of the word in one file
                sum += val.get();
            }
            word2.set("[" + temp + "," + sum + "]"); // word2 has the form [doc1,3]
            // A new word has started: write out the previous word and its postings.
            if (!CurrentItem.equals(word1) && CurrentItem.getLength() > 0) {
                StringBuilder out = new StringBuilder();
                long count = 0;
                double fileCount = 0;
                for (String p : postingList) {
                    out.append(p);
                    out.append(" ");
                    count = count + Long.parseLong(p.substring(p.indexOf(",") + 1, p.indexOf("]")));
                    fileCount++;
                }
                out.append("[Total," + count + "]");
                double average = count / fileCount;
                out.append("[Average," + String.format("%.3f", average) + "].");
                if (count > 0) {
                    context.write(CurrentItem, new Text(out.toString()));
                }
                postingList = new ArrayList<String>();
            }
            CurrentItem = new Text(word1);
            postingList.add(word2.toString());
        }

        // Write out the last word, which reduce() never flushes on its own.
        public void cleanup(Context context) throws IOException, InterruptedException {
            StringBuilder out = new StringBuilder();
            long count = 0;
            for (String p : postingList) {
                out.append(p);
                out.append(" ");
                count = count + Long.parseLong(p.substring(p.indexOf(",") + 1, p.indexOf("]")));
            }
            out.append("[Total," + count + "].");
            if (count > 0) {
                context.write(CurrentItem, new Text(out.toString()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/user/hadoop/stop_word/stop_word.txt"), conf);
        Job job = new Job(conf, "inverted index");
        job.setJarByClass(InvertedIndexer.class);
        job.setInputFormatClass(FileNameInputFormat.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setCombinerClass(SumCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setPartitionerClass(NewPartitioner.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

5. References

"In-depth Understanding of Big Data: Big Data Processing and Programming Practice", edited by Huang Yihua (Nanjing University)

This article is an English version of an article originally published in Chinese on aliyun.com.