TF-IDF
1. Concept
2. Principle
3. Java Code Implementation Ideas
Data set: micro-blog (Weibo) posts, one record per line in the form ID \t content.
The computation uses three MapReduce jobs.
First MapReduce (each post, i.e. the content field of one record, is split into words with the IK tokenizer). Results of the first MapReduce run:
1. The total number of micro-blogs in the data set.
2. The TF value of each word in each micro-blog.
Mapper end: key: LongWritable (offset), value: 3823890314914825 \t "The weather was fine today, and the sisters were about to go shopping together."
Step one: split the line on '\t' to get the ID and the content.
Step two: segment the content with the IK tokenizer (today, weather, sisters, ...); traverse the segmentation results and, for each word w, output (w_id, 1).
Step three: once the content has been traversed, record the current micro-blog by outputting (count, 1).
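Here is a minimal sketch of such a first-job mapper. The class name FirstMapper is an assumption, and the tokenization relies on the IK Analyzer classes IKSegmenter and Lexeme being on the classpath; the original code may differ in details:

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Illustrative sketch of the first job's mapper.
public class FirstMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] line = value.toString().trim().split("\t");
        if (line.length >= 2) {
            String id = line[0].trim();       // Weibo ID
            String content = line[1].trim();  // post content
            // Tokenize the content with the IK Analyzer (assumed API).
            IKSegmenter iks = new IKSegmenter(new StringReader(content), true);
            Lexeme word;
            while ((word = iks.next()) != null) {
                String w = word.getLexemeText();
                // One (word_id, 1) pair per token; the sums become the TF values.
                context.write(new Text(w + "_" + id), new IntWritable(1));
            }
            // One (count, 1) pair per post; the sum becomes the total number of posts.
            context.write(new Text("count"), new IntWritable(1));
        }
    }
}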
Custom partitioner of the first MR: extends HashPartitioner<Text, IntWritable> and overrides getPartition. The default partition rule is key.hashCode() % (number of reducers). Here the key itself is inspected: if key.equals("count"), the record goes to the last reducer; otherwise it is spread over the first reduceCount - 1 reducers (a code sketch of this partitioner follows the note below).
Reducer end:
First case -- key: w_id, value: {1,1,1,...}
Second case -- key: count, value: {1,1,1,...}
Step one: consolidate the data after the shuffle (values with the same key form one group; traverse the values in the iterator and sum them).
Step two: write out the reduce result: context.write(key, new IntWritable(sum)).
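Here is a minimal sketch of that summing reducer, assuming the class name FirstReducer (it handles the w_id keys and the count key the same way):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: sums the 1s emitted by the mapper for each key.
public class FirstReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // key is either "word_id" (sum = its TF value) or "count" (sum = total number of posts)
        context.write(key, new IntWritable(sum));
    }
}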
Note: because the number of reducers of the first job is set with job.setNumReduceTasks(4), four output files are produced, and the key "count" is routed to its own reducer. As a result,
key: count, value: {1,1,1,...} ends up in the last file, while
key: w_id, value: {1,1,1,...} ends up in the first three files.
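Below is a minimal sketch of the custom partitioner described above; the class name FirstPartition is an assumption, and numReduceTasks - 1 is the index of the last reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Illustrative sketch: send the "count" key to the last reducer,
// and spread all word_id keys over the remaining reducers.
public class FirstPartition extends HashPartitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.toString().equals("count")) {
            return numReduceTasks - 1;
        }
        // hash partitioning over the first numReduceTasks - 1 reducers
        return super.getPartition(key, value, numReduceTasks - 1);
    }
}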
Second MapReduce: reads the output of the first MapReduce as its input.
Result of the second MapReduce run:
1. For each word, how many micro-blogs in the data set it appears in, i.e. its DF value.
Mapper end: key: LongWritable (offset), value: today_3823890314914825 \t 2
Step one: get the input split (FileSplit) of the current mapper task and check its file name, making sure it is not the last file (the last file only contains count 1075).
Step two: the mapper input value is today_3823890314914825 \t 2; split it on "\t", then split the first field on "_", and output context.write(today, 1). // Note: here we only count how many micro-blogs contain the word "today", so the Weibo ID is ignored. A sketch of this mapper follows below.
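Here is a minimal sketch of that second-job mapper. The class name SecondMapper is an assumption, and the check for the last file relies on the first job's output file name part-r-00003 (the same path that is later added to the distributed cache):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative sketch: emit (word, 1) once per micro-blog that contains the word.
public class SecondMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Skip the last file of the first job, which only holds the "count" total.
        FileSplit split = (FileSplit) context.getInputSplit();
        if (!split.getPath().getName().contains("part-r-00003")) {
            String[] line = value.toString().trim().split("\t"); // e.g. "today_3823890314914825" and "2"
            if (line.length >= 2) {
                String[] wordAndId = line[0].split("_");
                if (wordAndId.length >= 2) {
                    // Only the word matters here; the Weibo ID is ignored.
                    context.write(new Text(wordAndId[0]), new IntWritable(1));
                }
            }
        }
    }
}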
Reducer end: key: w, value: {1,1,1,...}. Data sample: key = today, value = {1,1,1,1,1} // each 1 means one micro-blog in the data set contains the word "today".
Step one: consolidate the data after the shuffle (values with the same key form one group; traverse the values in the iterator and sum them).
Step two: write out the reduce result: context.write(key, new IntWritable(sum)).
After the second MapReduce, the DF (document frequency) value of each word has been obtained.
Third MapReduce: calculates the TF*IDF values.
Result of the third MapReduce run:
1. The TF-IDF value of each word in each micro-blog. Example result: {3823890314914825 today: 2.78834 shopping: 3.98071 sisters: 1.98712}
Tips:
The fourth output file of the first MapReduce (count 1075) is needed when calculating each word's TF-IDF value, so it is loaded into memory before the job's map tasks run, to improve efficiency.
The second MapReduce's output file -- how many micro-blogs in the data set each word appears in, i.e. the DF value (e.g. today 5) -- only contains the vocabulary, so unlike the full data set it is not very large and can also be loaded into memory to improve execution efficiency.
// Load the total number of micro-blogs into memory
job.addCacheFile(new Path("/user/tfidf/output/weibo1/part-r-00003")
        .toUri());
// Load the DF values into memory
job.addCacheFile(new Path("/user/tfidf/output/weibo2/part-r-00000")
        .toUri());
Mapper end: key: LongWritable (offset), value: today_3823890314914825 \t 2
Step one: before the map method is executed, the setup(context) method runs first. Purpose: load the total number of micro-blogs and the DF values from the cached files into Map objects (cmap, df) so the map method can use them directly.
Step two: run the map logic. Because the mapper input is the output of the first MapReduce, it must first check that the current split is not the last file (count 1075). Split the value on "\t" to get the TF value --> v[1] = 2, and split v[0] on "_" to get the word (today) and the Weibo ID (3823890314914825). Take "count" from cmap and the word's DF value from df, then compute the word's TF*IDF value from its TF value:
double s = tf * Math.log(cmap.get("count") / df.get(w));
Step three: output the data with key = Weibo ID and value = (w:tf*idf value). A sketch of this mapper, including the setup method, follows below.
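Below is a minimal sketch of the third job's mapper with its setup method. It assumes the two cached files can be opened by their plain file names in the task's working directory (the usual behaviour of job.addCacheFile); class and variable names such as LastMapper, cmap and df are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative sketch of the third job's mapper.
public class LastMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Map<String, Integer> cmap = new HashMap<>(); // "count" -> total number of posts
    private Map<String, Integer> df = new HashMap<>();   // word -> DF value

    @Override
    protected void setup(Context context) throws IOException {
        // Load the cached files (total post count and DF values) into memory.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null) {
            for (URI uri : cacheFiles) {
                String name = new Path(uri.getPath()).getName();
                Map<String, Integer> target = uri.getPath().contains("weibo1") ? cmap : df;
                try (BufferedReader br = new BufferedReader(new FileReader(name))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        String[] kv = line.split("\t"); // e.g. "count\t1075" or "today\t5"
                        if (kv.length >= 2) {
                            target.put(kv[0], Integer.parseInt(kv[1].trim()));
                        }
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Skip the "count" file of the first job's output.
        FileSplit split = (FileSplit) context.getInputSplit();
        if (split.getPath().getName().contains("part-r-00003")) {
            return;
        }
        String[] v = value.toString().trim().split("\t"); // "today_3823890314914825" and "2"
        if (v.length >= 2) {
            int tf = Integer.parseInt(v[1].trim());
            String[] wordAndId = v[0].split("_");
            String w = wordAndId[0];
            String id = wordAndId[1];
            // TF * IDF, with IDF = log(total number of posts / DF of the word)
            double s = tf * Math.log(cmap.get("count") * 1.0 / df.get(w));
            context.write(new Text(id), new Text(w + ":" + s));
        }
    }
}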
Reducer end: key = Weibo ID, value = (w:tf*idf value). Sample data: key = 3823890314914825, value = {today: 2.89101, shopping: 3.08092}
Step one: consolidate the data after the shuffle (values with the same key form one group; traverse the values in the iterator, define a StringBuffer, and append each word from the iterator together with its TF*IDF value).
Step two: write out the reduce result: context.write(key, new Text(sb.toString())).
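A minimal sketch of that concatenating reducer, with an assumed class name LastReducer:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: concatenate all "word:tfidf" pairs of one micro-blog.
public class LastReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuffer sb = new StringBuffer();
        for (Text v : values) {
            sb.append(v.toString()).append("\t");
        }
        // key = Weibo ID, value = "today:2.89101   shopping:3.08092 ..."
        context.write(key, new Text(sb.toString()));
    }
}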
4. How can businesses achieve precision marketing? After the above process, the final data looks like 3823890314914825 {today: 2.89101, shopping: 3.08092}, i.e. the TF*IDF value of each word in each micro-blog. For example, suppose a Korean restaurant wants to promote its big bone soup. It only needs to sort the words of each micro-blog in the data set in descending order of their TF-IDF values and keep the top 3; traversing the whole data set, every user whose top-three words include "big bone soup" is a target the business should push the promotion to (a small sorting sketch follows).
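As a rough, non-MapReduce illustration of this last step, the sketch below picks the top-3 TF-IDF words of a single micro-blog and checks whether the promoted keyword is among them; the class name and the example values are made up:

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopWords {
    // Return the n words with the highest TF-IDF values for one micro-blog.
    static List<String> topN(Map<String, Double> tfidf, int n) {
        return tfidf.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy example values for a single post (Java 9+ Map.of).
        Map<String, Double> post = Map.of("today", 2.89101, "shopping", 3.08092, "big bone soup", 4.12);
        List<String> top3 = topN(post, 3);
        // If the promoted keyword is among the top-3 words, push the promotion to this user.
        System.out.println(top3.contains("big bone soup"));
    }
}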