Implementing a simple search-engine inverted index with MapReduce


This example uses Hadoop 2.2.0.

An inverted index can be understood, simply, as full-text retrieval by word: for each word, record which documents it appears in and how often.

For example: search the two articles a.txt and b.txt for the word "hello" and count its occurrences in each; the more occurrences, the more relevant that document is to the keyword.

a.txt contains the following:

Hello, Tom.

Hello Jerry

Hello Kitty

Hello World

Hello, Tom.

b.txt contains the following:

Hello Jerry

Hello, Tom.

Hello World

We will write MapReduce code to count, on the Hadoop platform, the number of occurrences of each word in the two texts.

In fact, it is just a small revision of the classic WordCount program.


Upload the two texts to the ii folder under the HDFS root (when MapReduce reads the ii folder directly, it will read all files whose names do not begin with an underscore _).
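A minimal sketch of the upload, assuming the two files sit in the local working directory and the NameNode is reachable at hdfs://hadoop:9000 (the address used in the code below):

hadoop fs -mkdir /ii
hadoop fs -put a.txt /ii/
hadoop fs -put b.txt /ii/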

Writing the MapReduce code

First, analyze the map input. With TextInputFormat, each input record arrives as

(line offset, line text)

For example:

0	Hello, Tom.
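To make the offsets concrete, here is what TextInputFormat would hand the mapper for a.txt, assuming single-byte characters and Unix line endings (each line is 11 characters plus a newline):

0	Hello, Tom.
12	Hello Jerry
24	Hello Kitty
36	Hello World
48	Hello, Tom.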


We know that the output of map is merged (grouped) by key.

A word is not unique to one file: it may appear in both texts, so using the word alone as the key cannot tell us which text a given occurrence belongs to.

Using the text name alone as the key does not achieve our goal either, because the map output would then become a.txt -> word, word, ..., word.

This is obviously not the result we want.


So the format of the map output should be

word->filename	1

For example:

Hello->a.txt	1

The -> here serves merely as a separator between the word and the text it resides in.

This will not affect our results when merging by key.


The map code is as follows:

// Imports needed by the code in this post (Hadoop 2.2.0):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text k = new Text();
    private Text v = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words on spaces
        String[] words = value.toString().split(" ");
        // The FileSplit obtained from the context gives the path of the file
        // currently being read, e.g. hdfs://hadoop:9000/ii/a.txt;
        // the file name is the last path segment
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String[] fileNames = fileSplit.getPath().toString().split("/");
        String fileName = fileNames[fileNames.length - 1];
        for (String word : words) {
            // Emit (word->fileName, 1)
            k.set(word + "->" + fileName);
            v.set("1");
            context.write(k, v);
        }
    }
}

After map execution is complete,

we need a combiner to help finish part of the work.

Note that the combiner's input and output formats must be identical, and both must match the map output format; otherwise an error occurs.

As before, after merging by key, the key-value pairs look like this:

(hello->a.txt,{1,1,1,1,1})

What the combiner does is accumulate these values into a total,

then split the word and the text name apart in the key, combining the text name with the accumulated count to form a new value.

Such as:

(hello,a.txt->5)

Why do this?

Because after the combiner finishes executing,

its output is again merged by key, just as after map:

values with the same key are gathered into one values collection.

As a result, after the combiner runs, the input to reduce becomes

(hello, {a.txt->5, b.txt->3})

In this format, looping through the values in reduce and writing them out gives exactly the result we want.

The combiner code is as follows:

public static class MyCombiner extends Reducer<Text, Text, Text, Text> {
    private Text k = new Text();
    private Text v = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Split the key into the word and the file name
        String[] wordAndPath = key.toString().split("->");
        // Sum up the occurrence counts
        int counts = 0;
        for (Text t : values) {
            counts += Integer.parseInt(t.toString());
        }
        // Emit the new key-value pair: (word, fileName->count)
        k.set(wordAndPath[0]);
        v.set(wordAndPath[1] + "->" + counts);
        context.write(k, v);
    }
}

And then the job of reduce is simple.

The code is as follows:

public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    private Text v = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate all "fileName->count" entries for this word,
        // separated by "\r" as in the original
        String res = "";
        for (Text text : values) {
            res += text.toString() + "\r";
        }
        v.set(res);
        context.write(key, v);
    }
}

Main Method Code:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inPath = new Path("hdfs://hadoop:9000" + args[0]);
    Path outPath = new Path("hdfs://hadoop:9000" + args[1]);
    // Delete the output path if it already exists, or the job will fail
    if (fs.exists(outPath)) {
        fs.delete(outPath, true);
    }

    Job job = Job.getInstance(conf);
    job.setJarByClass(InverseIndex.class);

    FileInputFormat.setInputPaths(job, inPath);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    job.setCombinerClass(MyCombiner.class);

    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, outPath);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.waitForCompletion(true);
}


Package the program as a jar and run it on Hadoop to check the results.
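As a sketch, the run and the result check might look as follows; the jar name ii.jar and the output path /iiout are assumptions here (the main class InverseIndex and the input folder /ii come from the code above):

hadoop jar ii.jar InverseIndex /ii /iiout
hadoop fs -cat /iiout/part-r-00000

In the spirit of the example above, each output line should contain a word followed by its per-file counts, such as hello with a.txt->5 and b.txt->3. Note that because the mapper splits on spaces only, punctuation stays attached to tokens ("Hello," and "Hello" count separately), so the real output will be slightly noisier.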




I am still a Hadoop beginner and these are only my notes; if anything here is wrong, please let me know ^^



