TF-IDF: MapReduce Java Code Implementation Ideas
Thursday, February 16, 2017
TF-IDF
1. Concept
2. Principles
3. Java code implementation ideas
Dataset: Weibo posts, one record per line in the form id \t content.
The implementation consists of three MapReduce jobs.
First MapReduce (use the IK tokenizer to segment the content of each Weibo post, i.e., the content field of a record).
Results of the first MapReduce job: 1. the total number of Weibo posts in the dataset; 2. the TF value of each word in each post.
Mapper side: key: LongWritable (offset); value: 3823890314914825 \t Today the weather is fine, the sisters have a date to go shopping together.
Step 1: split the record on \t to get the id and the content.
Step 2: segment the content with the IK tokenizer (today, weather, sisters, ...); traverse the segmentation result and, for each word w, output (w_id, 1).
Step 3: after the content has been traversed, record the current post by outputting (count, 1).
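The mapper steps above can be sketched in plain Java. This is a minimal illustration, not the real Hadoop Mapper: a simple whitespace split stands in for the IK tokenizer, and the class and method names are my own.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstMapperSketch {
    // Sketch of the first job's map logic on one "id \t content" record.
    // A whitespace split stands in for the IK segmenter (assumption).
    public static List<String[]> map(String record) {
        List<String[]> out = new ArrayList<>();
        String[] parts = record.split("\t", 2);     // Step 1: id and content
        String id = parts[0];
        for (String w : parts[1].split("\\s+")) {   // Step 2: tokenize content
            if (!w.isEmpty()) {
                out.add(new String[] { w + "_" + id, "1" });  // (w_id, 1)
            }
        }
        out.add(new String[] { "count", "1" });     // Step 3: one post seen
        return out;
    }
}
```

Summing the (w_id, 1) pairs later in the reducer yields the word's TF within that post.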
Custom partitioner for the first job: extends HashPartitioner<Text, IntWritable> and overrides getPartition (the default rule is key.hashCode() % numReduceTasks). The key is inspected here:
if key.equals("count"), hand the record to the last reducer; otherwise hash it across the first numReduceTasks - 1 reducers.
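A minimal, Hadoop-free sketch of this partition rule (the class and method names are illustrative):

```java
public class CountPartitionRule {
    // Sketch of the overridden getPartition: the special "count" key always
    // goes to the last reducer; every w_id key is hashed across the rest.
    public static int getPartition(String key, int numReduceTasks) {
        if (key.equals("count")) {
            return numReduceTasks - 1;
        }
        // Mask the sign bit so the hash is non-negative before the modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
}
```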
Reducer side: the incoming groups are either --
Key: w_id, value: {1, 1, ...}
or --
Key: count, value: {1, 1, ...}
Step 1: integrate the data after the shuffle (values with the same key form one group; traverse the values in the iterator).
Step 2: write the reduced result: context.write(key, new IntWritable(sum)).
Note: because FirstJob sets the number of reducers to 4 (job.setNumReduceTasks(4)), four files are output at the end, and key = count goes to a dedicated reducer. The (count, {1, ...}) group therefore lands in the last file, while the (w_id, {1, 1, ...}) groups land in the first three files.
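The summing reduce step used here (and again in the second job) can be sketched without the Hadoop API (names illustrative):

```java
public class SumReducerSketch {
    // Sketch of the reduce step: values with the same key arrive as one
    // group, and the reducer simply sums them.
    public static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }
}
```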
Second MapReduce: reads the output of the first MapReduce as its input.
Result of the second MapReduce job: the number of posts in the dataset containing each word, i.e., the DF value.
Mapper side: key: LongWritable (offset); value: today_3823890314914825 \t 2.
Step 1: get the split of the current mapper task and, from the FileSplit's file name, check that it is not the last file (because the last file contains count 1075).
Step 2: the mapper's input value is today_3823890314914825 \t 2. Process the data by cutting on \t and then cutting on _, and output context.write(today, 1).
// Note: what is counted here is how many posts in the whole dataset contain "today", so the post id can be ignored.
Reducer side: key: w; value: {1, 1, ...}. Data sample: key = today, value = {1, 1, ...}
// Each 1 indicates that one post in the dataset contains the word "today".
Step 1: integrate the data after the shuffle (values with the same key form one group; traverse the values in the iterator).
Step 2: write the reduced result: context.write(key, new IntWritable(sum)).
After the second MapReduce job, the DF (document frequency) value of each word is obtained.
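The DF counting can be sketched in plain Java, assuming the first job's output lines look like w_id<TAB>tf as shown above (class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DfSketch {
    // Sketch of the second job: each input line "w_id<TAB>tf" means one post
    // contains the word, so DF is simply the number of lines per word.
    public static Map<String, Integer> df(List<String> lines) {
        Map<String, Integer> df = new HashMap<>();
        for (String line : lines) {
            String key = line.split("\t")[0];
            if (key.equals("count")) {
                continue;                       // skip the total-count record
            }
            String word = key.split("_")[0];    // drop the post id
            df.merge(word, 1, Integer::sum);
        }
        return df;
    }
}
```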
Third MapReduce:
Purpose --> calculate the TF * IDF value.
Result of the third MapReduce job: the TF-IDF value of each word in each post. Example: {3823890314914825 today: 2.78834 shopping: 3.98071 sisters: 1.98712}
TIPS:
The fourth file output by the first MapReduce (count 1075) is needed to calculate every word's TF-IDF value, so the job loads it into memory at run time to improve efficiency. The second MapReduce's output file --> the number of posts containing each word, i.e., the DF value (e.g., today 5). Because it holds one entry per distinct word rather than the whole dataset, it is not very large either and can also be loaded into memory, improving the program's execution efficiency.
// Load the total post count into memory
job.addCacheFile(new Path("/user/tfidf/output/weibo1/part-r-00003")
        .toUri());
// Load DF into memory
job.addCacheFile(new Path("/user/tfidf/output/weibo2/part-r-00000")
        .toUri());
Mapper side: key: LongWritable (offset); value: today_3823890314914825 \t 2.
Step 1: before the map method runs, the setup(Context context) method executes first.
Purpose: wrap the total post count and the DF values loaded into memory into Map objects (cmap and df) so they are easy to use during the map operation.
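What setup() does with each cached file can be sketched as a simple parser, assuming each cached line has the form key<TAB>value written by the earlier jobs (class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CacheLoaderSketch {
    // Sketch of the setup() work: parse "key<TAB>value" lines into an
    // in-memory map. On the count file this yields cmap ({"count" -> 1075});
    // on the DF file this yields df ({word -> document frequency}).
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> m = new HashMap<>();
        for (String line : lines) {
            String[] kv = line.split("\t");
            m.put(kv[0], Integer.parseInt(kv[1].trim()));
        }
        return m;
    }
}
```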
Step 2: start the map operation. Because the mapper's input is the first MapReduce's output, it must again be determined whether the current file is the last one (count, 1075).
Process the data by cutting on \t:
TF value --> v[1] = 2. At the same time cut v[0] on "_" to get the word (today) and the post id (3823890314914825).
Get "count" from cmap and the word's DF value from df, then compute the word's tf * idf from its TF value:
double s = tf * Math.log(cmap.get("count") / df.get(w));
Step 3: output the data: key = post id, value = (w: tf*idf value).
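The formula above can be written as a small helper. One caveat worth noting: if cmap and df store Integer values, the expression cmap.get("count") / df.get(w) performs integer division before the logarithm is taken, so it is safer to cast to double first. A minimal sketch (names illustrative):

```java
public class TfIdfSketch {
    // The per-word score from the text: tf * ln(totalPosts / df).
    // Casting to double first avoids integer-division truncation.
    public static double tfidf(int tf, int totalPosts, int df) {
        return tf * Math.log((double) totalPosts / df);
    }
}
```

For example, tfidf(2, 1075, 5) scores the word "today" appearing twice in a post, with a DF of 5 over 1075 posts.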
Reducer side:
Key = post id, value = (w: tf*idf value). Data example: key = 3823890314914825, value = {today: 2.89101, shopping: 3.08092}
Step 1: integrate the data after the shuffle (values with the same key form one group; traverse the values in the iterator and use a StringBuffer to concatenate each word with its corresponding TF*IDF value).
Step 2: write the reduced result:
context.write(key, new Text(sb.toString())).
4. How do sellers implement precision marketing?
After the above process, the final data we get is 3823890314914825 {today: 2.89101, shopping: 3.08092},
that is, the TF*IDF value of each word in each post. For example, if Korean food is to be promoted, simply
sort the TF-IDF values of the words in each post in descending order and take the top three, then
traverse the data in the entire dataset:
every post whose top-three TF-IDF words contain "Korean food" marks a user the seller should push to.
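The selection step described above can be sketched as follows (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TopWordsSketch {
    // Sketch of the push decision: sort a post's words by TF-IDF descending
    // and keep the top three; the post's author is a marketing target if the
    // promoted word appears among them.
    public static List<String> top3(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(3, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```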