Implementing a simple search-engine inverted index with MapReduce


This example uses Hadoop 2.2.0.

An inverted index can be understood, simply, as full-text retrieval by word: for each word, record which documents it appears in and how often.

For example: search the two articles a.txt and b.txt for the word "hello" and count its occurrences in each; the more occurrences, the more relevant that document is to the keyword.

a.txt contains the following:

Hello, Tom.

Hello Jerry

Hello Kitty

Hello World

Hello, Tom.

b.txt contains the following:

Hello Jerry

Hello, Tom.

Hello World

We will write MapReduce code to count, on the Hadoop platform, the number of occurrences of each word in the two texts.

In fact, it is just a small revision of the classic WordCount program.


Upload the two texts to the ii folder under the HDFS root (when MapReduce reads the ii folder directly, it will read all files whose names do not begin with an underscore _).
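A minimal sketch of the upload, assuming the two files sit in the local working directory and the NameNode is reachable at hdfs://hadoop:9000 (the address used in the code below):

hadoop fs -mkdir /ii
hadoop fs -put a.txt /ii/
hadoop fs -put b.txt /ii/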

Writing the MapReduce code

First, analyze the map input. With TextInputFormat, each input record arrives as

(line offset, line text)

For example:

0	Hello, Tom.
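To make the offsets concrete, here is what TextInputFormat would hand the mapper for a.txt, assuming single-byte characters and Unix line endings (each line is 11 characters plus a newline):

0	Hello, Tom.
12	Hello Jerry
24	Hello Kitty
36	Hello World
48	Hello, Tom.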


We know that the output of map is merged (grouped) by key.

A word is not unique to one file: it may appear in both texts, so using the word alone as the key cannot tell us which text a given occurrence belongs to.

Using the text name alone as the key does not achieve our goal either, because the map output would then become a.txt -> word, word, ..., word.

This is obviously not the result we want.


So the format of the map output should be

word->filename	1

For example:

Hello->a.txt	1

The -> here serves merely as a separator between the word and the text it resides in.

This will not affect our results when merging by key.


The map code is as follows:

// Imports needed by the code in this post (Hadoop 2.2.0):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text k = new Text();
    private Text v = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words on spaces
        String[] words = value.toString().split(" ");
        // The FileSplit obtained from the context gives the path of the file
        // currently being read, e.g. hdfs://hadoop:9000/ii/a.txt;
        // the file name is the last path segment
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String[] fileNames = fileSplit.getPath().toString().split("/");
        String fileName = fileNames[fileNames.length - 1];
        for (String word : words) {
            // Emit (word->fileName, 1)
            k.set(word + "->" + fileName);
            v.set("1");
            context.write(k, v);
        }
    }
}

After map execution is complete,

we need a combiner to help finish part of the work.

Note that the combiner's input and output formats must be identical, and both must match the map output format; otherwise an error occurs.

As before, after merging by key, the key-value pairs look like this:

(hello->a.txt,{1,1,1,1,1})

What the combiner does is accumulate these values into a total,

then split the word and the text name apart in the key, combining the text name with the accumulated count to form a new value.

Such as:

(hello,a.txt->5)

Why do this?

Because after the combiner finishes executing,

its output is again merged by key, just as after map:

values with the same key are gathered into one values collection.

As a result, after the combiner runs, the input to reduce becomes

(hello, {a.txt->5, b.txt->3})

In this format, looping through the values in reduce and writing them out gives exactly the result we want.

The combiner code is as follows:

public static class MyCombiner extends Reducer<Text, Text, Text, Text> {
    private Text k = new Text();
    private Text v = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Split the key into the word and the file name
        String[] wordAndPath = key.toString().split("->");
        // Sum up the occurrence counts
        int counts = 0;
        for (Text t : values) {
            counts += Integer.parseInt(t.toString());
        }
        // Emit the new key-value pair: (word, fileName->count)
        k.set(wordAndPath[0]);
        v.set(wordAndPath[1] + "->" + counts);
        context.write(k, v);
    }
}

And then the job of reduce is simple.

The code is as follows:

public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    private Text v = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate all "fileName->count" entries for this word,
        // separated by "\r" as in the original
        String res = "";
        for (Text text : values) {
            res += text.toString() + "\r";
        }
        v.set(res);
        context.write(key, v);
    }
}

Main Method Code:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inPath = new Path("hdfs://hadoop:9000" + args[0]);
    Path outPath = new Path("hdfs://hadoop:9000" + args[1]);
    // Delete the output path if it already exists, or the job will fail
    if (fs.exists(outPath)) {
        fs.delete(outPath, true);
    }

    Job job = Job.getInstance(conf);
    job.setJarByClass(InverseIndex.class);

    FileInputFormat.setInputPaths(job, inPath);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    job.setCombinerClass(MyCombiner.class);

    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, outPath);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.waitForCompletion(true);
}


Package the program as a jar and run it on Hadoop to check the results.
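As a sketch, the run and the result check might look as follows; the jar name ii.jar and the output path /iiout are assumptions here (the main class InverseIndex and the input folder /ii come from the code above):

hadoop jar ii.jar InverseIndex /ii /iiout
hadoop fs -cat /iiout/part-r-00000

In the spirit of the example above, each output line should contain a word followed by its per-file counts, such as hello with a.txt->5 and b.txt->3. Note that because the mapper splits on spaces only, punctuation stays attached to tokens ("Hello," and "Hello" count separately), so the real output will be slightly noisier.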




I am still a Hadoop beginner and these are only my notes; if anything here is wrong, please let me know ^^



