MapReduce Application: TF-IDF Distributed Implementation


Overview

This article presents a distributed implementation of TF-IDF, drawing on many of the MapReduce concepts covered in earlier posts. It is a small but complete application of MapReduce.

Copyright notice

Copyright belongs to the author.
For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
Author: Q-whai
Published: June 24, 2016
This article link: http://blog.csdn.net/lemon_tree12138/article/details/51747801
Source: CSDN

Pre-reading guide

This article does not spend much time on the concepts behind TF-IDF or on what TF-IDF can do. If you are not familiar with those, you can read my other post, "Data Mining: Data Set Selection and Optimization Based on the TF-IDF Algorithm".
Since my writing may not always be simple and clear, if you run into anything hard to follow while reading, you can click through the related links below to catch up; they are the basis and prerequisites of this article. You are of course also welcome to leave a comment and discuss with me.
- "Data Mining: Data Set Selection and Optimization Based on the TF-IDF Algorithm"
- "From WordCount to the MapReduce Computation Model"
- "MapReduce Advanced: Chained Mode with Multiple MapReduce Jobs"
- "MapReduce Advanced: Multi-path Input and Output"
- "MapReduce Advanced: Partitioner Components"

Algorithmic framework

First, let's look at the framework diagram of the distributed TF-IDF algorithm:

The figure contains three large modules, and these three modules correspond to the three MapReduce jobs.
From studying TF-IDF we know that its calculation can be divided into three stages. First stage: compute the TF value of each word in each document. Second stage: compute the IDF value of every word over all documents. Third stage: compute the TF-IDF value of each word in each document. These calculations are easy to implement on a single machine, but how do we carry them out in a distributed environment? Based on these three stages, I designed the architecture diagram above.
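For reference, the three stages implement the standard formulas (written here in LaTeX; they match the code below, which uses a base-10 logarithm and adds one to the document count in the denominator):

\mathrm{tf}(w,d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}}, \qquad
\mathrm{idf}(w) = \log_{10}\frac{N}{1 + \lvert\{d' : w \in d'\}\rvert}, \qquad
\text{tf-idf}(w,d) = \mathrm{tf}(w,d)\cdot\mathrm{idf}(w)

where n_{w,d} is the number of occurrences of word w in document d and N is the total number of documents.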
The TFMapReduceCore class contains the core classes for computing TF, IDFMapReduceCore contains the core classes for IDF, and IntegrateCore contains the classes that combine the TF and IDF results into the final TF-IDF values. There are also two intermediate output directories, and these two directories serve as the input directories of the third stage, which therefore relies on MapReduce's multi-path input. There is a separate article (linked above) describing that topic.
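The original post does not reproduce its driver code, but to make the chaining of the three jobs concrete, here is a minimal sketch of what such a driver could look like. The path handling, job names, and the single-reducer setting for the third job are my assumptions, not the author's code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TFIDFDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);               // e.g. /input
        Path tfOut = new Path(args[1] + "/tf");       // first intermediate directory
        Path idfOut = new Path(args[1] + "/idf");     // second intermediate directory
        Path finalOut = new Path(args[1] + "/tf-idf");

        // Job 1: TF of every word in every document.
        Job tfJob = Job.getInstance(conf, "tf");
        tfJob.setJarByClass(TFIDFDriver.class);
        tfJob.setMapperClass(TFMapReduceCore.TFMapper.class);
        tfJob.setCombinerClass(TFMapReduceCore.TFCombiner.class);
        tfJob.setReducerClass(TFMapReduceCore.TFReducer.class);
        tfJob.setPartitionerClass(TFMapReduceCore.TFPartitioner.class);
        tfJob.setOutputKeyClass(Text.class);
        tfJob.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(tfJob, input);
        FileOutputFormat.setOutputPath(tfJob, tfOut);
        tfJob.waitForCompletion(true);

        // Job 2: IDF of every word, computed from the TF job's output.
        // (Passing the total document count to the IDF job is discussed in the IDFReducer section below.)
        Job idfJob = Job.getInstance(conf, "idf");
        idfJob.setJarByClass(TFIDFDriver.class);
        idfJob.setMapperClass(IDFMapReduceCore.IDFMapper.class);
        idfJob.setReducerClass(IDFMapReduceCore.IDFReducer.class);
        idfJob.setOutputKeyClass(Text.class);
        idfJob.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(idfJob, tfOut);
        FileOutputFormat.setOutputPath(idfJob, idfOut);
        idfJob.waitForCompletion(true);

        // Job 3: read both intermediate directories via multi-path input and merge them.
        Job integrateJob = Job.getInstance(conf, "integrate");
        integrateJob.setJarByClass(TFIDFDriver.class);
        integrateJob.setReducerClass(IntegrateCore.IntegrateReducer.class);
        integrateJob.setOutputKeyClass(Text.class);
        integrateJob.setOutputValueClass(Text.class);
        // One reducer, so that a word's IDF record and its TF records meet in the same reduce task.
        integrateJob.setNumReduceTasks(1);
        MultipleInputs.addInputPath(integrateJob, tfOut, TextInputFormat.class, IntegrateCore.IntegrateMapper.class);
        MultipleInputs.addInputPath(integrateJob, idfOut, TextInputFormat.class, IntegrateCore.IntegrateMapper.class);
        FileOutputFormat.setOutputPath(integrateJob, finalOut);
        integrateJob.waitForCompletion(true);
    }
}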

Code implementation

TFMapReduceCore

Here I encapsulate the code related to the TF calculation in a single TFMapReduceCore class, where TFMapper, TFReducer, and so on are static nested classes of TFMapReduceCore.

TFMapper
public static class TFMapper extends Mapper<Object, Text, Text, Text> {

    private final Text one = new Text("1");
    private Text label = new Text();
    private int allWordCount = 0;
    private String fileName = "";

    @Override
    protected void setup(Mapper<Object, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        fileName = getInputSplitFileName(context.getInputSplit());
    }

    @Override
    protected void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            allWordCount++;
            label.set(String.join(":", tokenizer.nextToken(), fileName));
            context.write(label, one);
        }
    }

    @Override
    protected void cleanup(Mapper<Object, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        context.write(new Text("!:" + fileName), new Text(String.valueOf(allWordCount)));
    }

    private String getInputSplitFileName(InputSplit inputSplit) {
        String fileFullName = ((FileSplit) inputSplit).getPath().toString();
        String[] nameSegments = fileFullName.split("/");
        return nameSegments[nameSegments.length - 1];
    }
}

Each of our input files represents one category; if your data is divided by some other rule, you do not have to follow the exact logic of this article. I obtain the file name once in setup(), so it does not have to be recomputed on every call to map(), which improves efficiency. Then, in cleanup(), a record carrying the file name (that is, the category) and its total word count is written to the Mapper output.
You may have noticed a small trick here: when writing that record, I use "!" in place of a word. Because the ASCII code of this character is smaller than that of any letter or digit appearing in a word, the record is guaranteed to reach the reducer before all other records ("all other records" here means all records in the same category, because the Mapper output is partitioned by category with the Partitioner below).

TFCombiner & TFReducer

From the Mapper above, you can see that its output key has the format word:fileName, so the word can be recovered simply by parsing the key. The per-file total is written in the Mapper's cleanup() method under the special key !:fileName, with allWordCount as its value; this is what lets us tell each file's total apart from the ordinary word counts. As mentioned before, the record arrives first because the ASCII code of "!" is the smallest among the key characters.

public static class TFCombiner extends Reducer<Text, Text, Text, Text> {

    private int allWordCount = 0;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        if (key.toString().startsWith("!")) {
            allWordCount = Integer.parseInt(values.iterator().next().toString());
            return;
        }
        int sumCount = 0;
        for (Text value : values) {
            sumCount += Integer.parseInt(value.toString());
        }
        double tf = 1.0 * sumCount / allWordCount;
        context.write(key, new Text(String.valueOf(tf)));
    }
}

After the combiner's reduce operation above, the TF value of every word has already been computed. All that remains is to pass the results through a Reducer. The Reducer code is as follows:

public static class TFReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
TFPartitioner

For the Partitioner stage, we simply use a custom hash partitioner as the partition class. If you have stricter requirements, you can refer to my earlier post "MapReduce Advanced: Partitioner Components".

public static class TFPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition by the file (category) name carried in the key "word:fileName",
        // so a file's "!" record and all of its word records go to the same reducer.
        String fileName = key.toString().split(":")[1];
        return Math.abs((fileName.hashCode() * 127) % numPartitions);
    }
}
IDFMapReduceCore

Here I encapsulate the code related to the IDF calculation in a single IDFMapReduceCore class, where IDFMapper and IDFReducer are static nested classes of IDFMapReduceCore.

IDFMapper

Because the IDF calculation runs over all documents, the plain WordCount logic can be written directly into IDFMapper. And since the frequency of a word within a single document does not matter for IDF, we uniformly use 1 as the Mapper's output value here.

public static class IDFMapper extends Mapper<Object, Text, Text, Text> {

    private final Text one = new Text("1");
    private Text label = new Text();

    @Override
    protected void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        label.set(tokenizer.nextToken().split(":")[0]);
        context.write(label, one);
    }
}
IDFReducer

In the previous stage we recorded, for each word, the document (category) it appears in; in other words, each record says that word W occurs in document D. From those records we can count how many documents each word W appears in, which is exactly the WordCount idea again, so the code is easy to write. We also need the total number of documents, because the IDF formula requires it. That total cannot be obtained inside the Reducer itself, but it can easily be computed outside the Reducer; that part is plain Java logic, so I will not go into it.
Once we know the total number of training documents, we can pass it to the Reducer through the job. Note that the method used here is not job.setNumReduceTasks(n), but job.setProfileParams(msg), whose value the Reducer reads back with context.getProfileParams().
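As a rough illustration of that idea (the article does not show its driver, so the method name and the exact value convention here are assumptions), the driver could count the files in the input directory and hand the count to the IDF job like this:

// Sketch only: count the documents and pass the total to the IDF job.
// Requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
// org.apache.hadoop.fs.Path and org.apache.hadoop.mapreduce.Job.
private static void passTotalDocumentCount(Job idfJob, Configuration conf, Path inputDir) throws IOException {
    FileSystem fs = inputDir.getFileSystem(conf);
    int totalFileCount = fs.listStatus(inputDir).length;   // one file per category
    // IDFReducer above subtracts 1 when it parses the value back, so offset by one here.
    idfJob.setProfileParams(String.valueOf(totalFileCount + 1));
}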

public static class IDFReducer extends Reducer<Text, Text, Text, Text> {

    private Text label = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        int fileCount = 0;
        for (Text value : values) {
            fileCount += Integer.parseInt(value.toString());
        }
        label.set(String.join(":", key.toString(), "!"));
        int totalFileCount = Integer.parseInt(context.getProfileParams()) - 1;
        double idfValue = Math.log10(1.0 * totalFileCount / (fileCount + 1));
        context.write(label, new Text(String.valueOf(idfValue)));
    }
}
IntegrateCore

Here I enclose the code related to the final TF-IDF calculation in a single IntegrateCore class, where IntegrateMapper and IntegrateReducer are static nested classes of IntegrateCore. There is nothing special to explain about this last step, except that the intermediate output files produced by the TF and IDF stages do not share a uniform format, so the two kinds of file content have to be handled differently.
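For example (the numbers here are only illustrative), a line in the TF output directory looks like

java:Android	0.25

while a line in the IDF output directory looks like

java:!	0.2218

IntegrateReducer below tells the two apart by checking whether the part of the key after ":" starts with "!".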
IntegrateMapper

public static class IntegrateMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    protected void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        context.write(new Text(tokenizer.nextToken()), new Text(tokenizer.nextToken()));
    }
}

IntegrateReducer

public static class IntegrateReducer extends Reducer<Text, Text, Text, Text> {

    private double keywordIDF = 0.0d;
    private Text value = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        if (key.toString().split(":")[1].startsWith("!")) {
            keywordIDF = Double.parseDouble(values.iterator().next().toString());
            return;
        }
        value.set(String.valueOf(Double.parseDouble(values.iterator().next().toString()) * keywordIDF));
        context.write(key, value);
    }
}
Test run data source

Android

android java activity map

Hadoop

map reduce ssh mapreduce

iOS

ios iphone jobs

Java

java code eclipse java map

Python

python pycharm
Execute command

Before executing the command below, upload the test data to the /input directory in HDFS.
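For example, if the five test files above sit in a local directory named data/ (the local path is just an assumption), the upload could look like this:

$ hadoop fs -mkdir -p /input
$ hadoop fs -put data/* /input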

$ hadoop jar temp/run.jar /input /output
Execution results
activity:Android	0.0994850021680094
android:Android	0.0994850021680094
code:Java	0.07958800173440753
eclipse:Java	0.07958800173440753
ios:iOS	0.13264666955734586
iphone:iOS	0.13264666955734586
java:Android	0.0554621874040891
java:Java	0.08873949984654256
jobs:iOS	0.13264666955734586
map:Android	0.024227503252014105
map:Hadoop	0.024227503252014105
map:Java	0.019382002601611284
mapreduce:Hadoop	0.0994850021680094
pycharm:Python	0.1989700043360188
python:Python	0.1989700043360188
reduce:Hadoop	0.0994850021680094
ssh:Hadoop	0.0994850021680094

Seeing this result, you may wonder whether it is reliable. If you doubt these numbers, you can write your own single-machine Java program to verify them. Of course, I have already verified them myself.
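If you want to run the same check yourself, a minimal single-machine sketch (using the five test files listed above; this is not the author's verification program) could look like this:

import java.util.*;

public class TFIDFCheck {
    public static void main(String[] args) {
        Map<String, String[]> docs = new LinkedHashMap<>();
        docs.put("Android", "android java activity map".split(" "));
        docs.put("Hadoop", "map reduce ssh mapreduce".split(" "));
        docs.put("iOS", "ios iphone jobs".split(" "));
        docs.put("Java", "java code eclipse java map".split(" "));
        docs.put("Python", "python pycharm".split(" "));

        // Document frequency of every word.
        Map<String, Integer> df = new HashMap<>();
        for (String[] words : docs.values()) {
            for (String w : new HashSet<>(Arrays.asList(words))) {
                df.merge(w, 1, Integer::sum);
            }
        }

        int totalDocs = docs.size();
        for (Map.Entry<String, String[]> doc : docs.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String w : doc.getValue()) {
                counts.merge(w, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> c : counts.entrySet()) {
                double tf = 1.0 * c.getValue() / doc.getValue().length;
                double idf = Math.log10(1.0 * totalDocs / (df.get(c.getKey()) + 1));
                System.out.println(c.getKey() + ":" + doc.getKey() + "\t" + tf * idf);
            }
        }
    }
}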

Job

Below is the Cluster Metrics view shown after logging in to the cluster's web UI. It displays the state of the program after completion; you can see that three jobs took part in the TF-IDF calculation.
