MapReduce: Inverted Index

Source: Internet
Author: User
Introduction to the inverted index algorithm

An inverted index is the data structure that almost all search engines supporting full-text search depend on. Given a word (term), the index returns the list of documents that contain that term (the postings list).
Web search breaks down into three main parts: crawling (gathering web content with a crawler), indexing (building the inverted index over the collected data), and retrieval (ranking the documents that match a query, for example by term frequency).
Crawling and indexing are offline; retrieval is online and must be real-time.
This raises the question of how the index structure should be stored so that, given a word, the result can be fetched quickly.
Two common storage methods are the hash table and the B/B+ tree.
Speaking of storage, it reminds me of a question a classmate was asked in an interview: how to store words over the 26 letters a–z for fast lookup.
(⊙﹏⊙) The expected answer was, surprisingly, a 26-way tree (a trie). Quite something.
Basic inverted index structure
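Before the MapReduce version, the basic structure can be illustrated with a minimal in-memory sketch mapping each term to its postings (the class and method names here are illustrative, not from the original post):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the basic inverted index structure described above:
// term -> (document id -> frequency of the term in that document).
public class SimpleInvertedIndex {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Tokenize a document on whitespace and update the postings lists.
    public void addDocument(String docId, String text) {
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Given a term, return the documents containing it with their frequencies.
    public Map<String, Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyMap());
    }
}
```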

Experimental Tasks

Please implement the "document inverted index with word-frequency attribute" algorithm introduced in class. In addition to outputting the inverted index with per-document term frequencies, compute each word's "average number of mentions" (defined below) and output it.
The "average number of mentions" is defined as:
average number of mentions = (sum of the word's frequencies over all documents) / (number of documents containing the word)
For example, suppose the document collection contains four novels: A, B, C, and D. The word "lake" appears 100 times in document A, 200 times in document B, 300 times in document C, and does not appear in document D. The average number of mentions of "lake" in this document collection is (100 + 200 + 300) / 3 = 200.
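The definition can be sanity-checked with a few lines of code. The helper below (hypothetical, not part of the assignment code) computes the average from a docid-to-frequency map:

```java
import java.util.Map;

public class AverageMentions {
    // Average number of mentions = (sum of the term's frequencies over the
    // documents that contain it) / (number of documents containing it).
    // Documents where the term does not appear are simply absent from the map.
    public static double averageMentions(Map<String, Integer> docFreqs) {
        int sum = 0;
        for (int f : docFreqs.values()) {
            sum += f;
        }
        return (double) sum / docFreqs.size();
    }
}
```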

Output format
For each word, output two key-value pairs, in the following format:
[word] \t doc1:freq1, doc2:freq2, doc3:freq3, ...
[word] \t average number of mentions
(The original post included a screenshot showing a fragment of the output file as a format example; the figure is omitted here.)
Design

The inverted index can be seen as an extension of WordCount: it needs to count the occurrences of each word across multiple files. How should the mapper and reducer be designed? The natural first attempt is:
Mapper: for each word in a file, emit key = word, value = fileName + "1".

Reducer: for an input key and its Iterable<Text> values, scan the values and record, for each file name, the number of occurrences.

There is a problem with this design: every key-value pair for the same word, from every document, is funneled into a single reduce call, which forces that one reducer call to do all of the per-document bookkeeping. The standard design trick for this is value-to-key conversion.

Value-to-key conversion
For example, the original pair (term, (docid, tf)) can move the docid from the value into the key, giving a new key-value pair ((term, docid), tf).
This reduces the number of values that share the same key, so each group handled by a combine/reduce call stays small.
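A minimal sketch of the conversion as it is used later in this post, where the composite key is encoded as the string "term--docid" (the helper names are my own, not from the original code):

```java
// Value-to-key conversion: the docid moves from the value into a composite
// key "term--docid", so that the frequency of each (term, docid) pair can be
// aggregated independently, e.g. by a combiner.
public class ValueToKey {
    public static String compositeKey(String term, String docId) {
        return term + "--" + docId;
    }

    // limit = 2 keeps any further "--" occurrences inside the docid part.
    public static String[] splitCompositeKey(String key) {
        return key.split("--", 2);
    }
}
```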

About the code inside the mapper, I have two questions:
1. Using a space " " for word segmentation, blank "words" appear in the final output, which is strange. Even though I added a check that skips tokens equal to "" or "\t" before writing, the final result still contains the blank word. Inconceivable.
2. If the mapper output uses the form ((term:docid), tf), with ":" separating the term and the docid, then when I split the key on ":" in the combiner (i.e. the wrong mapper approach below), the number of strings I get back is sometimes < 2. So I use "--" to separate them instead.
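One plausible cause of the "< 2 strings" result in question 2 is that Java's String.split drops trailing empty strings by default, so a key that is blank or that ends with the separator yields fewer parts than expected. A small illustration (hypothetical helper, not from the original code):

```java
// String.split(regex) removes trailing empty strings from the result, so a
// blank key, or a key ending in the separator, splits into fewer than the
// expected number of parts.
public class SplitPitfall {
    public static int parts(String key, String sep) {
        return key.split(sep).length;
    }
}
```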

// Imports needed by this mapper (the classes further below additionally need
// the corresponding Reducer / HashPartitioner / Job imports):
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public static class InverseIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] strTokens = line.split(" ");
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            Path path = inputSplit.getPath();
            String pathStr = path.toString();
            int index = pathStr.lastIndexOf("/");
            String strFileName = pathStr.substring(index + 1);
            for (String token : strTokens) {
                // Compare strings with equals()/isEmpty(), not "!=": the "!="
                // operator compares references, which is why filtering with
                // token != "" still let blank words through (question 1 above).
                if (!token.isEmpty() && !token.equals("\t")) {
                    context.write(new Text(token + "--" + strFileName), new Text("1"));
                }
            }
        }
    }
    
Use of a combiner

To reduce the mapper's output, and with it the transfer and storage overhead between the mappers and the reducers, a combiner is a good option: it pre-aggregates each mapper's results before they are sent to the reducers.
Here the combiner sums the frequency of the same term in the same document.
For the reducer's convenience, I had first seen (and written) the combiner in a way that changes the mapper's output key, which turns out to be wrong:

public static class InverseIndexCombiner extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String strKey = key.toString();
            String[] tokens = strKey.split("--");
            int freq = 0;
            for (Text value : values) {
                freq++;
            }
            // Wrong: a combiner must not change the key. Rewriting the key
            // from "term--docid" to "term" here breaks the contract between
            // the partitioner and the reducer.
            context.write(new Text(tokens[0]), new Text(tokens[1] + ":" + freq));
        }
    }

The correct version is the following:

public static class InverseIndexCombiner extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int freq = 0;
            // Sum the values rather than counting them: Hadoop may run the
            // combiner zero, one, or several times, so a value may already be
            // an aggregated count rather than a literal "1".
            for (Text value : values) {
                freq += Integer.parseInt(value.toString());
            }
            context.write(key, new Text("" + freq));
        }
    }
The design of the Partitioner

Because of the value-to-key conversion, the key in the mapper's output becomes (term, docid). With the default partitioner, pairs that share the same term but have different docids are likely to be sent to different reducers, which defeats the purpose. So a custom partitioner is needed that partitions on the term part of the key only.
There is a small question here: if I had used the wrong combiner above, which changes the combiner's output to (term, (docid, tf)), would a custom partitioner still be needed?
The answer is yes: the partitioner is applied to the mapper's output, not the combiner's.

public static class InverseIndexPartitioner extends HashPartitioner<Text, Text> {

        @Override
        public int getPartition(Text key, Text value, int numReduceTasks) {
            String strKey = key.toString();
            String[] tokens = strKey.split("--");
            // Partition on the term only, so that all (term, docid) pairs
            // with the same term land on the same reducer.
            return super.getPartition(new Text(tokens[0]), value, numReduceTasks);
        }
    }
The design of the Reducer

The reducer's input is of the form ((term, docid), tf). It has to collect the key-value pairs that share the same term and merge them into the output form (term, (docid1:tf1, docid2:tf2, ...)).
A static variable strWord records the term from the previous reduce call, and a static map records the docid:tf pairs accumulated for strWord. Each reduce call first splits the key into term and file name and compares the term with strWord. If they are equal, the values are summed to get the term frequency, and the resulting (docid, tf) pair is added to the map. Otherwise, strWord and the contents of the map are written out, the map is cleared, strWord is set to the current term, and the current (docid, tf) pair is processed and added to the map.
Because the last group never triggers the "term changed" branch, the cleanup method must be overridden to flush the remaining data at the end.
Note that string equality must be tested with equals(); "==" does not work. What would happen if "==" were used instead of equals()?
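Regarding the closing question: "==" compares object references while equals() compares contents, so terms recovered at runtime from split() would almost never compare equal under "==", and every reduce call would take the flush branch. A small demonstration (hypothetical helper, not from the original code):

```java
// "==" on Strings compares references; equals() compares contents. Strings
// produced at runtime (e.g. by split()) are distinct objects even when their
// contents match.
public class StringEquality {
    public static boolean sameRef(String a, String b) {
        return a == b;
    }

    public static boolean sameContent(String a, String b) {
        return a.equals(b);
    }
}
```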

public static class InverseIndexReducer extends Reducer<Text, Text, Text, Text> {

        // Static state carried across reduce() calls within one reducer task:
        // the term currently being accumulated and its docid -> tf map.
        static Map<String, Integer> map = new HashMap<String, Integer>();
        static String strWord = null;

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String[] tokens = key.toString().split("--");
            if (strWord == null) {
                strWord = tokens[0];
            }
            if (strWord.equals(tokens[0])) {
                String strFileName = tokens[1];
                int freq = 0;
                for (Text value : values) {
                    freq += Integer.parseInt(value.toString());
                }
                map.put(strFileName, freq);
            } else {
                // The term changed: flush the accumulated postings and the
                // average number of mentions for the previous term.
                String strNewValue = "";
                double aveFreq = 0;
                for (Map.Entry<String, Integer> entry : map.entrySet()) {
                    strNewValue += entry.getKey() + ":" + entry.getValue() + ",";
                    aveFreq += (double) entry.getValue();
                }
                aveFreq /= (double) map.size();
                Text newKey = new Text(strWord);
                map.clear();
                context.write(newKey, new Text(strNewValue));
                context.write(newKey, new Text("" + aveFreq));

                // Start accumulating the new term.
                strWord = tokens[0];
                String strFileName = tokens[1];
                int freq = 0;
                for (Text value : values) {
                    freq += Integer.parseInt(value.toString());
                }
                map.put(strFileName, freq);
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Flush the last term, which never triggers the "term changed"
            // branch inside reduce().
            String strNewValue = "";
            double aveFreq = 0;
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                strNewValue += entry.getKey() + ":" + entry.getValue() + ",";
                aveFreq += (double) entry.getValue();
            }
            aveFreq /= (double) map.size();
            Text newKey = new Text(strWord);
            map.clear();
            context.write(newKey, new Text(strNewValue));
            context.write(newKey, new Text("" + aveFreq));
            super.cleanup(context);
        }
    }
Main function
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "InverseIndex");

        job.setJarByClass(InverseIndex.class);

        job.setNumReduceTasks(4);
        job.setMapperClass(InverseIndexMapper.class);
        job.setCombinerClass(InverseIndexCombiner.class);
        job.setPartitionerClass(InverseIndexPartitioner.class);
        job.setReducerClass(InverseIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);

        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
Run Results

(The original post showed a screenshot of the run results here.)