MapReduce Application: TF-IDF Distributed Implementation
Overview
This article builds a distributed implementation of TF-IDF that draws on many of the MapReduce topics covered in my earlier posts. It is a small but complete application of MapReduce.
Copyright notice
Copyright belongs to the author.
For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
Author: Q-whai
Published: June 24, 2016
Original link: http://blog.csdn.net/lemon_tree12138/article/details/51747801
Source: CSDN
Pre-reading guide
This article will not spend much time on TF-IDF concepts or on what TF-IDF can be used for. If you are not familiar with these, please see my other post, "Data Mining: Data Set Selection and Optimization Based on the TF-IDF Algorithm."
Because my explanation may not always be simple and clear, if you run into anything hard to follow while reading, you can click the related links below to catch up. They are the basis and prerequisites of this article; of course, you are also welcome to leave a comment and discuss with me.
- "Data Mining: Data Set Selection and Optimization Based on the TF-IDF Algorithm"
- "From WordCount to the MapReduce Computation Model"
- "MapReduce Advanced: Chained Mode with Multiple MapReduce Jobs"
- "MapReduce Advanced: Multi-Path Input and Output"
- "MapReduce Advanced: Partitioner Components"
Algorithmic framework
First, let's look at the framework diagram of the distributed TF-IDF algorithm:
In the figure there are three large modules, and these three modules are the three MapReduce jobs.
When learning TF-IDF, we saw that its calculation can be divided into three stages. Stage one: compute the TF value of each word in each document. Stage two: compute the IDF value of every word across all documents. Stage three: compute the TF-IDF value of each word in each document. These calculations are easy to implement on a single machine, but how do we do them in a distributed environment? Based on these three stages, I designed the architecture diagram above.
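Written to match the code later in this post (note the +1 smoothing in the IDF denominator), the three quantities are

$$\mathrm{tf}(w,d)=\frac{n_{w,d}}{|d|},\qquad \mathrm{idf}(w)=\log_{10}\frac{N}{\mathrm{df}(w)+1},\qquad \text{tf-idf}(w,d)=\mathrm{tf}(w,d)\cdot\mathrm{idf}(w),$$

where n_{w,d} is the number of occurrences of word w in document d, |d| is the total number of words in d, N is the total number of documents, and df(w) is the number of documents containing w.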
The TFMapReduceCore class contains the core classes that compute TF, IDFMapReduceCore contains the core classes for IDF, and IntegrateCore integrates the TF and IDF results to compute the final TF-IDF values. There are also two intermediate output directories, and these two directories are the input directories of the third stage, which requires MapReduce's multi-path input. I have a separate article describing that topic.
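The original post does not include the driver that wires the three jobs together, so the following is only a minimal sketch of what it might look like. The class name TFIDFRunner, the intermediate directory names, and the single-reducer setting on the third job are my assumptions; only the Mapper, Combiner, Reducer, and Partitioner classes come from the sections below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TFIDFRunner {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input  = new Path(args[0]);            // e.g. /input
        Path tfDir  = new Path(args[1], "tf");      // intermediate TF output (name assumed)
        Path idfDir = new Path(args[1], "idf");     // intermediate IDF output (name assumed)
        Path output = new Path(args[1], "tfidf");   // final TF-IDF output (name assumed)

        // Job 1: per-document TF.
        Job tfJob = Job.getInstance(conf, "tf");
        tfJob.setJarByClass(TFIDFRunner.class);
        tfJob.setMapperClass(TFMapReduceCore.TFMapper.class);
        tfJob.setCombinerClass(TFMapReduceCore.TFCombiner.class);
        tfJob.setReducerClass(TFMapReduceCore.TFReducer.class);
        tfJob.setPartitionerClass(TFMapReduceCore.TFPartitioner.class);
        tfJob.setOutputKeyClass(Text.class);
        tfJob.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(tfJob, input);
        FileOutputFormat.setOutputPath(tfJob, tfDir);
        if (!tfJob.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: IDF, reading the TF job's output.
        Job idfJob = Job.getInstance(conf, "idf");
        idfJob.setJarByClass(TFIDFRunner.class);
        idfJob.setMapperClass(IDFMapReduceCore.IDFMapper.class);
        idfJob.setReducerClass(IDFMapReduceCore.IDFReducer.class);
        idfJob.setOutputKeyClass(Text.class);
        idfJob.setOutputValueClass(Text.class);
        // The IDF reducer also needs the total number of documents; see the
        // setProfileParams() sketch in the IDFReducer section below.
        FileInputFormat.addInputPath(idfJob, tfDir);
        FileOutputFormat.setOutputPath(idfJob, idfDir);
        if (!idfJob.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 3: integrate TF and IDF through multi-path input.
        Job integrateJob = Job.getInstance(conf, "tf-idf");
        integrateJob.setJarByClass(TFIDFRunner.class);
        MultipleInputs.addInputPath(integrateJob, tfDir, TextInputFormat.class,
                IntegrateCore.IntegrateMapper.class);
        MultipleInputs.addInputPath(integrateJob, idfDir, TextInputFormat.class,
                IntegrateCore.IntegrateMapper.class);
        integrateJob.setReducerClass(IntegrateCore.IntegrateReducer.class);
        integrateJob.setOutputKeyClass(Text.class);
        integrateJob.setOutputValueClass(Text.class);
        // One reducer so that each word's "word:!" IDF record is sorted in front of its
        // "word:fileName" TF records (an assumption; the post does not show this setting).
        integrateJob.setNumReduceTasks(1);
        FileOutputFormat.setOutputPath(integrateJob, output);
        System.exit(integrateJob.waitForCompletion(true) ? 0 : 1);
    }
}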
Code implementation
TFMapReduceCore
Here I encapsulate the code related to computing TF in the same TFMapReduceCore class; TFMapper, TFReducer, and so on are inner classes of TFMapReduceCore.
TFMapper
public static class TFMapper extends Mapper<Object, Text, Text, Text> {
    // Inside TFMapReduceCore; requires java.io.IOException, java.util.StringTokenizer,
    // org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.*, and
    // org.apache.hadoop.mapreduce.lib.input.FileSplit.

    private final Text one = new Text("1");
    private Text label = new Text();
    private int allWordCount = 0;
    private String fileName = "";

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Resolve the input file name once per split instead of once per record.
        fileName = getInputSplitFileName(context.getInputSplit());
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            allWordCount++;
            // Key format: "word:fileName", value: 1.
            label.set(String.join(":", tokenizer.nextToken(), fileName));
            context.write(label, one);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the document's total word count under the special "!" word.
        context.write(new Text("!:" + fileName), new Text(String.valueOf(allWordCount)));
    }

    private String getInputSplitFileName(InputSplit inputSplit) {
        String fileFullName = ((FileSplit) inputSplit).getPath().toString();
        String[] nameSegments = fileFullName.split("/");
        return nameSegments[nameSegments.length - 1];
    }
}
Each source file we feed in represents one category; if you divide your data by some other rule, you do not have to follow the logic of this article. I obtain the file name in setup() first, so that it does not have to be fetched again on every map() call, which improves efficiency. In cleanup(), the file name (that is, the category), together with the document's total word count, is written to the Mapper output.
You may have noticed that when writing out this record I used a small trick: the character "!" acts as the word. Because its ASCII code is smaller than that of any letter or digit, this record is seen before all other records (where "all other records" means all records of the same category, since the Mapper output is partitioned with a Partitioner).
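For example, take the Hadoop document from the test data further below, whose content is "mapreduce ssh map reduce". TFMapper would emit records like the following; the "!" record is written last, in cleanup(), but sorts in front of the others after the shuffle:

mapreduce:Hadoop	1
ssh:Hadoop	1
map:Hadoop	1
reduce:Hadoop	1
!:Hadoop	4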
TFCombiner & TFReducer
From the Mapper above you can see that its output key has the format "word:fileName", so we only need to parse the word back out of the key. The file-level information was written in the Mapper's cleanup() method as a "!:fileName" record whose value is allWordCount, so this record lets us distinguish one file from another. The principle was mentioned before: the ASCII code of "!" is the smallest, so this record arrives first.
public static class TFCombiner extends Reducer<Text, Text, Text, Text> {

    private int allWordCount = 0;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        // The "!:fileName" record sorts first and carries the document's total word count.
        if (key.toString().startsWith("!")) {
            allWordCount = Integer.parseInt(values.iterator().next().toString());
            return;
        }

        int sumCount = 0;
        for (Text value : values) {
            sumCount += Integer.parseInt(value.toString());
        }

        double tf = 1.0 * sumCount / allWordCount;
        context.write(key, new Text(String.valueOf(tf)));
    }
}
After the combiner's reduce step above, the TF values of all words have already been calculated; all that remains is to pass them through a Reducer. The Reducer code is as follows:
public static class TFReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        // The TF values were already computed in the combiner; just pass them through.
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
TFPartitioner
For the Partitioner, we simply use a custom hash-based Partitioner as the partition class. If you have stricter requirements, you can refer to my earlier post, "MapReduce Advanced: Partitioner Components".
public static class TFPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition by the file (category) name carried in the "word:fileName" key, so a
        // document's "!" record and its word records go to the same reducer.
        String fileName = key.toString().split(":")[1];
        // Hash of the file name (reconstruction; the original expression was partially lost).
        return Math.abs((fileName.hashCode() * 127) % numPartitions);
    }
}
IDFMapReduceCore
Here I encapsulate the code related to computing IDF in the same IDFMapReduceCore class; IDFMapper and IDFReducer are inner classes of IDFMapReduceCore.
IDFMapper
Because IDF is computed over all documents, we can simply write WordCount-style logic directly in IDFMapper. And since the IDF calculation does not care how many times a word occurs within a single document, we uniformly use 1 as the Mapper's output value.
public static class IDFMapper extends Mapper<Object, Text, Text, Text> {

    private final Text one = new Text("1");
    private Text label = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input lines look like "word:fileName <TAB> tf"; keep only the word and emit 1.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        label.set(tokenizer.nextToken().split(":")[0]);
        context.write(label, one);
    }
}
IDFReducer
Earlier we recorded, for each document (category), that a word appears in it, that is, that word w occurs in document d. From this we can count how many documents the word w has appeared in, and the idea is again the WordCount logic, so the code is easy to write. Hold on, though: we also need the total number of documents. Yes, the formula for IDF needs to know how many documents there are in total. However, we cannot obtain this value in the current situation, because we are inside the Reducer. Although the total document count cannot be computed in the Reducer, it can be computed outside of it. That part is plain Java logic and very simple, so I will not say much about it.
Once we know the total number of training documents, we can pass this information to the Reducer through the Job. Note that here we do not call job.setNumReduceTasks(n); instead we call the job.setProfileParams(msg) method.
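The post does not show that driver-side code, so here is only a sketch of one way to do it: count the files directly under the input directory with the HDFS FileSystem API and store the count on the IDF job with setProfileParams(). The helper class name is mine, and the +1 merely mirrors the fact that IDFReducer below subtracts 1 from the value it reads back.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public final class DocumentCounting {

    // Counts the training documents under the input directory (one file per category)
    // and stores the number on the IDF job. IDFReducer reads it back with
    // context.getProfileParams() and subtracts 1, hence the +1 here (assumption).
    public static void passDocumentCount(Job idfJob, Path input) throws Exception {
        FileSystem fs = FileSystem.get(idfJob.getConfiguration());
        int totalFileCount = fs.listStatus(input).length;
        idfJob.setProfileParams(String.valueOf(totalFileCount + 1));
    }
}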
public static class IDFReducer extends Reducer<Text, Text, Text, Text> {

    private Text label = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        // Number of documents in which this word appears.
        int fileCount = 0;
        for (Text value : values) {
            fileCount += Integer.parseInt(value.toString());
        }

        label.set(String.join(":", key.toString(), "!"));
        // Total document count passed in from the driver via setProfileParams().
        int totalFileCount = Integer.parseInt(context.getProfileParams()) - 1;
        double idfValue = Math.log10(1.0 * totalFileCount / (fileCount + 1));
        context.write(label, new Text(String.valueOf(idfValue)));
    }
}
IntegrateCore
Here I encapsulate the code related to computing TF-IDF in the same IntegrateCore class; IntegrateMapper and IntegrateReducer are inner classes of IntegrateCore. There is not much to explain about this final step, except that the intermediate output files produced by the TF and IDF stages do not share a uniform format, so the two kinds of file content have to be handled differently.
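To make the two formats concrete: with the sample data further below, the TF directory contains lines such as "mapreduce:Hadoop<TAB>0.25", while the IDF directory contains lines such as "mapreduce:!<TAB>0.3979400086720376". IntegrateMapper simply re-emits each line as a (key, value) pair, and IntegrateReducer remembers the IDF from the "word:!" record and multiplies each following TF value for that word by it.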
IntegrateMapper
public static class IntegrateMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Re-emit every "key <TAB> value" line from the TF and IDF outputs as (key, value).
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        context.write(new Text(tokenizer.nextToken()), new Text(tokenizer.nextToken()));
    }
}
IntegrateReducer
public static class IntegrateReducer extends Reducer<Text, Text, Text, Text> {

    private double keywordIDF = 0.0d;
    private Text value = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        if (values == null) {
            return;
        }
        // "word:!" records (from the IDF output) sort first; remember the word's IDF.
        if (key.toString().split(":")[1].startsWith("!")) {
            keywordIDF = Double.parseDouble(values.iterator().next().toString());
            return;
        }
        // "word:fileName" records (from the TF output): TF-IDF = TF * IDF.
        value.set(String.valueOf(
                Double.parseDouble(values.iterator().next().toString()) * keywordIDF));
        context.write(key, value);
    }
}
Test run
Data source
Android
android java activity map
Hadoop
mapreduce ssh map reduce
iOS
ios iphone jobs
Java
java code eclipse java map
Python
python pycharm
Execute command
Before you execute this command, upload the test data to the /input directory in HDFS.
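For example, assuming the five sample files are saved locally under the names shown above, the upload could look like this (paths are illustrative):
$ hdfs dfs -mkdir -p /input
$ hdfs dfs -put Android Hadoop iOS Java Python /input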
$ hadoop jar temp/run.jar /input /output
Execution results
Activity: Android	0.0994850021680094
Android: Android	0.0994850021680094
Code: Java	0.07958800173440753
Eclipse: Java	0.07958800173440753
iOS: iOS	0.13264666955734586
iphone: iOS	0.13264666955734586
Java: Android	0.0554621874040891
Java: Java	0.08873949984654256
Jobs: iOS	0.13264666955734586
Map: Android	0.024227503252014105
Map: Hadoop	0.024227503252014105
Map: Java	0.019382002601611284
MapReduce: Hadoop	0.0994850021680094
Pycharm: Python	0.1989700043360188
Python: Python	0.1989700043360188
Reduce: Hadoop	0.0994850021680094
SSH: Hadoop	0.0994850021680094
Seeing this result, you may suspect that it is not necessarily reliable. If you doubt these numbers, you can write a single-machine Java version yourself to verify them. Of course, I have already verified them.
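As a quick sanity check on one line of the output: "mapreduce" occurs once among the four words of the Hadoop document, so its TF is 1/4 = 0.25; it appears in 1 of the 5 documents, so its IDF is log10(5 / (1 + 1)) ≈ 0.39794; the product 0.25 × 0.39794 ≈ 0.099485, which matches the "MapReduce: Hadoop 0.0994850021680094" line above.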
Job
Below is the Cluster Metrics page shown after logging into the cluster's web UI in a browser. It displays the state after the program has completed, and you can see the three jobs that took part in computing TF-IDF.