MapReduce Application: TF-IDF Distributed Implementation
Overview
This article builds a distributed implementation of TF-IDF that draws on many of the MapReduce topics covered in my earlier posts. It is a small but complete application of MapReduce.
Copyright notice
Copyright belongs to the author.
For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
Author: Q-whai
Published: June 24, 2016
Original link: http://blog.csdn.net/lemon_tree12138/article/details/51747801
Source: CSDN
Pre-reading guide
This article will not spend much time on TF-IDF concepts or on what TF-IDF can be used for. If you are not familiar with these, please see my other post, "Data Mining: Data Set Selection and Optimization Based on the TF-IDF Algorithm."
Because my explanation may not always be simple and clear, if you run into anything hard to follow while reading, you can click the related links below to catch up. They are the basis and prerequisites of this article; of course, you are also welcome to leave a comment and discuss with me.
- "Data Mining: Data Set Selection and Optimization Based on the TF-IDF Algorithm"
- "From WordCount to the MapReduce Computation Model"
- "MapReduce Advanced: Chained Mode with Multiple MapReduce Jobs"
- "MapReduce Advanced: Multi-Path Input and Output"
- "MapReduce Advanced: Partitioner Components"
Algorithmic framework
First, let's look at the framework diagram of the distributed TF-IDF algorithm:
In the figure there are three large modules, and these three modules are the three MapReduce jobs.
When learning TF-IDF, we saw that its calculation can be divided into three stages. Stage one: compute the TF value of each word in each document. Stage two: compute the IDF value of every word across all documents. Stage three: compute the TF-IDF value of each word in each document. These calculations are easy to implement on a single machine, but how do we do them in a distributed environment? Based on these three stages, I designed the architecture diagram above.
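Written to match the code later in this post (note the +1 smoothing in the IDF denominator), the three quantities are

$$\mathrm{tf}(w,d)=\frac{n_{w,d}}{|d|},\qquad \mathrm{idf}(w)=\log_{10}\frac{N}{\mathrm{df}(w)+1},\qquad \text{tf-idf}(w,d)=\mathrm{tf}(w,d)\cdot\mathrm{idf}(w),$$

where n_{w,d} is the number of occurrences of word w in document d, |d| is the total number of words in d, N is the total number of documents, and df(w) is the number of documents containing w.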
The TFMapReduceCore class contains the core classes that compute TF, IDFMapReduceCore contains the core classes for IDF, and IntegrateCore integrates the TF and IDF results to compute the final TF-IDF values. There are also two intermediate output directories, and these two directories are the input directories of the third stage, which requires MapReduce's multi-path input. I have a separate article describing that topic.
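The original post does not include the driver that wires the three jobs together, so the following is only a minimal sketch of what it might look like. The class name TFIDFRunner, the intermediate directory names, and the single-reducer setting on the third job are my assumptions; only the Mapper, Combiner, Reducer, and Partitioner classes come from the sections below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TFIDFRunner {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input  = new Path(args[0]);            // e.g. /input
        Path tfDir  = new Path(args[1], "tf");      // intermediate TF output (name assumed)
        Path idfDir = new Path(args[1], "idf");     // intermediate IDF output (name assumed)
        Path output = new Path(args[1], "tfidf");   // final TF-IDF output (name assumed)

        // Job 1: per-document TF.
        Job tfJob = Job.getInstance(conf, "tf");
        tfJob.setJarByClass(TFIDFRunner.class);
        tfJob.setMapperClass(TFMapReduceCore.TFMapper.class);
        tfJob.setCombinerClass(TFMapReduceCore.TFCombiner.class);
        tfJob.setReducerClass(TFMapReduceCore.TFReducer.class);
        tfJob.setPartitionerClass(TFMapReduceCore.TFPartitioner.class);
        tfJob.setOutputKeyClass(Text.class);
        tfJob.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(tfJob, input);
        FileOutputFormat.setOutputPath(tfJob, tfDir);
        if (!tfJob.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: IDF, reading the TF job's output.
        Job idfJob = Job.getInstance(conf, "idf");
        idfJob.setJarByClass(TFIDFRunner.class);
        idfJob.setMapperClass(IDFMapReduceCore.IDFMapper.class);
        idfJob.setReducerClass(IDFMapReduceCore.IDFReducer.class);
        idfJob.setOutputKeyClass(Text.class);
        idfJob.setOutputValueClass(Text.class);
        // The IDF reducer also needs the total number of documents; see the
        // setProfileParams() sketch in the IDFReducer section below.
        FileInputFormat.addInputPath(idfJob, tfDir);
        FileOutputFormat.setOutputPath(idfJob, idfDir);
        if (!idfJob.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 3: integrate TF and IDF through multi-path input.
        Job integrateJob = Job.getInstance(conf, "tf-idf");
        integrateJob.setJarByClass(TFIDFRunner.class);
        MultipleInputs.addInputPath(integrateJob, tfDir, TextInputFormat.class,
                IntegrateCore.IntegrateMapper.class);
        MultipleInputs.addInputPath(integrateJob, idfDir, TextInputFormat.class,
                IntegrateCore.IntegrateMapper.class);
        integrateJob.setReducerClass(IntegrateCore.IntegrateReducer.class);
        integrateJob.setOutputKeyClass(Text.class);
        integrateJob.setOutputValueClass(Text.class);
        // One reducer so that each word's "word:!" IDF record is sorted in front of its
        // "word:fileName" TF records (an assumption; the post does not show this setting).
        integrateJob.setNumReduceTasks(1);
        FileOutputFormat.setOutputPath(integrateJob, output);
        System.exit(integrateJob.waitForCompletion(true) ? 0 : 1);
    }
}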
Code implementation
TFMapReduceCore
Here I encapsulate the code related to computing TF in the same TFMapReduceCore class; TFMapper, TFReducer, and so on are inner classes of TFMapReduceCore.
TFMapper
public static class TFMapper extends Mapper<Object, Text, Text, Text> {
    // Inside TFMapReduceCore; requires java.io.IOException, java.util.StringTokenizer,
    // org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.*, and
    // org.apache.hadoop.mapreduce.lib.input.FileSplit.

    private final Text one = new Text("1");
    private Text label = new Text();
    private int allWordCount = 0;
    private String fileName = "";

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Resolve the input file name once per split instead of once per record.
        fileName = getInputSplitFileName(context.getInputSplit());
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            allWordCount++;
            // Key format: "word:fileName", value: 1.
            label.set(String.join(":", tokenizer.nextToken(), fileName));
            context.write(label, one);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the document's total word count under the special "!" word.
        context.write(new Text("!:" + fileName), new Text(String.valueOf(allWordCount)));
    }

    private String getInputSplitFileName(InputSplit inputSplit) {
        String fileFullName = ((FileSplit) inputSplit).getPath().toString();
        String[] nameSegments = fileFullName.split("/");
        return nameSegments[nameSegments.length - 1];
    }
}
Each source file we feed in represents one category; if you divide your data by some other rule, you do not have to follow the logic of this article. I obtain the file name in setup() first, so that it does not have to be fetched again on every map() call, which improves efficiency. In cleanup(), the file name (that is, the category), together with the document's total word count, is written to the Mapper output.
You may have noticed that when writing out this record I used a small trick: the character "!" acts as the word. Because its ASCII code is smaller than that of any letter or digit, this record is seen before all other records (where "all other records" means all records of the same category, since the Mapper output is partitioned with a Partitioner).
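For example, take the Hadoop document from the test data further below, whose content is "mapreduce ssh map reduce". TFMapper would emit records like the following; the "!" record is written last, in cleanup(), but sorts in front of the others after the shuffle:

mapreduce:Hadoop	1
ssh:Hadoop	1
map:Hadoop	1
reduce:Hadoop	1
!:Hadoop	4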
TFCombiner & TFReducer
From the Mapper above you can see that its output key has the format "word:fileName", so we only need to parse the word back out of the key. The file-level information was written in the Mapper's cleanup() method as a "!:fileName" record whose value is allWordCount, so this record lets us distinguish one file from another. The principle was mentioned before: the ASCII code of "!" is the smallest, so this record arrives first.
public static class TFCombiner extends Reducer<Text, Text, Text, Text> {

    private int allWordCount = 0;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        // The "!:fileName" record sorts first and carries the document's total word count.
        if (key.toString().startsWith("!")) {
            allWordCount = Integer.parseInt(values.iterator().next().toString());
            return;
        }

        int sumCount = 0;
        for (Text value : values) {
            sumCount += Integer.parseInt(value.toString());
        }

        double tf = 1.0 * sumCount / allWordCount;
        context.write(key, new Text(String.valueOf(tf)));
    }
}
After the combiner's reduce step above, the TF values of all words have already been calculated; all that remains is to pass them through a Reducer. The Reducer code is as follows:
public static class TFReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        // The TF values were already computed in the combiner; just pass them through.
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
TFPartitioner
For the Partitioner, we simply use a custom hash-based Partitioner as the partition class. If you have stricter requirements, you can refer to my earlier post, "MapReduce Advanced: Partitioner Components".
public static class TFPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition by the file (category) name carried in the "word:fileName" key, so a
        // document's "!" record and its word records go to the same reducer.
        String fileName = key.toString().split(":")[1];
        // Hash of the file name (reconstruction; the original expression was partially lost).
        return Math.abs((fileName.hashCode() * 127) % numPartitions);
    }
}
IDFMapReduceCore
Here I encapsulate the code related to computing IDF in the same IDFMapReduceCore class; IDFMapper and IDFReducer are inner classes of IDFMapReduceCore.
IDFMapper
Because IDF is computed over all documents, we can simply write WordCount-style logic directly in IDFMapper. And since the IDF calculation does not care how many times a word occurs within a single document, we uniformly use 1 as the Mapper's output value.
public static class IDFMapper extends Mapper<Object, Text, Text, Text> {

    private final Text one = new Text("1");
    private Text label = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input lines look like "word:fileName <TAB> tf"; keep only the word and emit 1.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        label.set(tokenizer.nextToken().split(":")[0]);
        context.write(label, one);
    }
}
IDFReducer
Earlier we recorded, for each document (category), that a word appears in it, that is, that word w occurs in document d. From this we can count how many documents the word w has appeared in, and the idea is again the WordCount logic, so the code is easy to write. Hold on, though: we also need the total number of documents. Yes, the formula for IDF needs to know how many documents there are in total. However, we cannot obtain this value in the current situation, because we are inside the Reducer. Although the total document count cannot be computed in the Reducer, it can be computed outside of it. That part is plain Java logic and very simple, so I will not say much about it.
Once we know the total number of training documents, we can pass this information to the Reducer through the Job. Note that here we do not call job.setNumReduceTasks(n); instead we call the job.setProfileParams(msg) method.
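The post does not show that driver-side code, so here is only a sketch of one way to do it: count the files directly under the input directory with the HDFS FileSystem API and store the count on the IDF job with setProfileParams(). The helper class name is mine, and the +1 merely mirrors the fact that IDFReducer below subtracts 1 from the value it reads back.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public final class DocumentCounting {

    // Counts the training documents under the input directory (one file per category)
    // and stores the number on the IDF job. IDFReducer reads it back with
    // context.getProfileParams() and subtracts 1, hence the +1 here (assumption).
    public static void passDocumentCount(Job idfJob, Path input) throws Exception {
        FileSystem fs = FileSystem.get(idfJob.getConfiguration());
        int totalFileCount = fs.listStatus(input).length;
        idfJob.setProfileParams(String.valueOf(totalFileCount + 1));
    }
}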
public static class IDFReducer extends Reducer<Text, Text, Text, Text> {

    private Text label = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        // Number of documents in which this word appears.
        int fileCount = 0;
        for (Text value : values) {
            fileCount += Integer.parseInt(value.toString());
        }

        label.set(String.join(":", key.toString(), "!"));
        // Total document count passed in from the driver via setProfileParams().
        int totalFileCount = Integer.parseInt(context.getProfileParams()) - 1;
        double idfValue = Math.log10(1.0 * totalFileCount / (fileCount + 1));
        context.write(label, new Text(String.valueOf(idfValue)));
    }
}
IntegrateCore
Here I encapsulate the code related to computing TF-IDF in the same IntegrateCore class; IntegrateMapper and IntegrateReducer are inner classes of IntegrateCore. There is not much to explain about this final step, except that the intermediate output files produced by the TF and IDF stages do not share a uniform format, so the two kinds of file content have to be handled differently.
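To make the two formats concrete: with the sample data further below, the TF directory contains lines such as "mapreduce:Hadoop<TAB>0.25", while the IDF directory contains lines such as "mapreduce:!<TAB>0.3979400086720376". IntegrateMapper simply re-emits each line as a (key, value) pair, and IntegrateReducer remembers the IDF from the "word:!" record and multiplies each following TF value for that word by it.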
IntegrateMapper
public static class IntegrateMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Re-emit every "key <TAB> value" line from the TF and IDF outputs as (key, value).
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        context.write(new Text(tokenizer.nextToken()), new Text(tokenizer.nextToken()));
    }
}
IntegrateReducer
public static class IntegrateReducer extends Reducer<Text, Text, Text, Text> {

    private double keywordIDF = 0.0d;
    private Text value = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        // "word:!" records (from the IDF output) sort first; remember the word's IDF.
        if (key.toString().split(":")[1].startsWith("!")) {
            keywordIDF = Double.parseDouble(values.iterator().next().toString());
            return;
        }
        // "word:fileName" records (from the TF output): TF-IDF = TF * IDF.
        value.set(String.valueOf(
                Double.parseDouble(values.iterator().next().toString()) * keywordIDF));
        context.write(key, value);
    }
}
Test run
Data source
Android
android java activity map
Hadoop
mapreduce ssh map reduce
iOS
ios iphone jobs
Java
java code eclipse java map
Python
python pycharm
Execute command
Before you execute this command, upload the test data to the /input directory in HDFS.
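For example, assuming the five sample files are saved locally under the names shown above, the upload could look like this (paths are illustrative):
$ hdfs dfs -mkdir -p /input
$ hdfs dfs -put Android Hadoop iOS Java Python /input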
$ hadoop jar temp/run.jar /input /output
Execution results
Activity: Android	0.0994850021680094
Android: Android	0.0994850021680094
Code: Java	0.07958800173440753
Eclipse: Java	0.07958800173440753
iOS: iOS	0.13264666955734586
iphone: iOS	0.13264666955734586
Java: Android	0.0554621874040891
Java: Java	0.08873949984654256
Jobs: iOS	0.13264666955734586
Map: Android	0.024227503252014105
Map: Hadoop	0.024227503252014105
Map: Java	0.019382002601611284
MapReduce: Hadoop	0.0994850021680094
Pycharm: Python	0.1989700043360188
Python: Python	0.1989700043360188
Reduce: Hadoop	0.0994850021680094
SSH: Hadoop	0.0994850021680094
Seeing this result, you may suspect that it is not necessarily reliable. If you doubt these numbers, you can write a single-machine Java version yourself to verify them. Of course, I have already verified them.
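As a quick sanity check on one line of the output: "mapreduce" occurs once among the four words of the Hadoop document, so its TF is 1/4 = 0.25; it appears in 1 of the 5 documents, so its IDF is log10(5 / (1 + 1)) ≈ 0.39794; the product 0.25 × 0.39794 ≈ 0.099485, which matches the "MapReduce: Hadoop 0.0994850021680094" line above.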
Job
Below is the Cluster Metrics page shown after logging into the cluster's web UI in a browser. It displays the state after the program has completed, and you can see the three jobs that took part in computing TF-IDF.