TF-IDF MapReduce: Java Code Implementation Ideas



Thursday, February 16, 2017

TF-IDF

1. Concept

2. Principles
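Briefly: TF-IDF weighs a word highly when it occurs often in one post but in few posts overall. The standard formulation, which matches the computation in the third MapReduce below, is:

    \mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log\frac{N}{\mathrm{df}(w)}

where tf(w, d) is the number of times word w occurs in post d, df(w) is the number of posts containing w, and N is the total number of posts in the dataset (the "count" value, 1075 here).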

3. Java code implementation ideas

Dataset: a file of Weibo posts (microblogs), one record per line in the form "id \t content". Three MapReduce jobs are used.

First MapReduce (use the IK tokenizer to split the words in a post, i.e., the content field of a record). Results of the first job:
1. The total number of Weibo posts in the dataset.
2. The TF value of each word in each post.

Mapper: key: LongWritable (offset); value: "3823890314914825 \t Today the weather is fine; the sisters have made plans to go shopping together."
Step 1: Split the line on '\t' to get the id and the content.
Step 2: Tokenize the content with the IK tokenizer (today, weather, sisters, ...) and, for each word w in the segmentation result, output (w_id, 1).
Step 3: After the content has been traversed, record the current post by outputting (count, 1).

Custom partitioner for the first job: extends HashPartitioner<Text, IntWritable> and overrides the default partition rule of getPartition (key.hashCode() % numReduceTasks). The key value is inspected here: if key.equals("count"), it is handed to the last reducer; otherwise it is hashed into the remaining reduceCount - 1 reducers. A sketch of this mapper and partitioner follows.
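A minimal sketch of the first job's mapper and partitioner, assuming Hadoop's org.apache.hadoop.mapreduce API and the IK Analyzer tokenizer (IKSegmenter/Lexeme); class and variable names such as FirstMapper and FirstPartition are illustrative, not the original post's code:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    public class FirstMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Step 1: split "id \t content".
            String[] parts = value.toString().trim().split("\t");
            if (parts.length != 2) {
                return;
            }
            String id = parts[0];

            // Step 2: tokenize the content and emit (word_id, 1) per word.
            IKSegmenter ik = new IKSegmenter(new StringReader(parts[1]), true);
            Lexeme word;
            while ((word = ik.next()) != null) {
                context.write(new Text(word.getLexemeText() + "_" + id), ONE);
            }

            // Step 3: one "count" record per post, summed later to get the
            // total number of posts in the dataset.
            context.write(new Text("count"), ONE);
        }
    }

    class FirstPartition extends HashPartitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Route "count" to the last reducer; hash every other key into
            // the remaining numReduceTasks - 1 reducers (assumes >= 2 reducers).
            if ("count".equals(key.toString())) {
                return numReduceTasks - 1;
            }
            return super.getPartition(key, value, numReduceTasks - 1);
        }
    }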
Reducer: two kinds of input: first -- key: w_id, value: {1, 1, ...}; second -- key: count, value: {1, 1, ...}.
Step 1: Integrate the data after the shuffle (values with the same key form one group; traverse the values in the iterator and sum them).
Step 2: Write the reduced result: context.write(key, new IntWritable(sum)).
Note: because FirstJob sets job.setNumReduceTasks(4), four files are output at the end, and the partitioner pins key = count to the last reducer. So the last file holds key: count, value: {1, 1, ...}, and the first three files hold key: w_id, value: {1, 1, ...}. A sketch of this sum reducer follows.
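A sketch of the sum reducer, under the same assumptions as above (the second job can reuse the same logic; SumReducer is an illustrative name):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Values with the same key (a w_id, or "count") arrive as one group.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }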
Second MapReduce: reads the output of the first MapReduce as its input.
Result of the second job: the number of Weibo posts containing each word in the dataset, i.e., the DF value.

Mapper: key: LongWritable (offset); value: "today_3823890314914825 \t 2".
Step 1: Get the input split (FileSplit) of the current map task and, from the split's file name, check that it is not the last file (because the last file holds "count 1075").
Step 2: The mapper's input value is "today_3823890314914825 \t 2". Cut it on "\t", then cut the first field on "_", and output context.write(today, 1). Note: here we only count the number of posts a word appears in, so the Weibo id is ignored. A sketch of this mapper follows after the reducer description.

Reducer: key: w, value: {1, 1, ...}. Data sample: key = today, value = {1, 1, ...} (each 1 means that one post in the dataset contains the word "today").
Step 1: Integrate the data after the shuffle (values with the same key form one group; traverse the values in the iterator and sum them).
Step 2: Write the reduced result: context.write(key, new IntWritable(sum)).
After the second MapReduce, the DF (document frequency) value of each word is known.
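A sketch of the second job's mapper; the file name "part-r-00003" matches the cache path used below and is the file written by the first job's last reducer:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SecondMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Step 1: skip the file holding the total post count ("count 1075").
            FileSplit split = (FileSplit) context.getInputSplit();
            if (split.getPath().getName().contains("part-r-00003")) {
                return;
            }
            // Step 2: "today_3823890314914825 \t 2" --> emit (today, 1),
            // ignoring the Weibo id and the TF value.
            String[] parts = value.toString().trim().split("\t");
            String word = parts[0].split("_")[0];
            context.write(new Text(word), new IntWritable(1));
        }
    }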
Third MapReduce: purpose --> compute the TF * IDF value.
Result of the third job: the TF-IDF value of each word in each Weibo post. Example: {3823890314914825 today: 2.78834 shopping: 3.98071 sisters: 1.98712}

TIPS:
The fourth file output by the first MapReduce ("count 1075") is needed to compute the TF-IDF value of every word, so it is loaded into memory when the job runs, to improve running efficiency. The same goes for the second MapReduce's output, the number of posts containing each word, i.e., the DF values (e.g., "today 5"): since it holds only the distinct words rather than the whole dataset, it is not very large and can also be loaded into memory, improving the execution efficiency of the program.
    // Load the total post count into memory
    job.addCacheFile(new Path("/user/tfidf/output/weibo1/part-r-00003")
            .toUri());
    // Load the DF values into memory
    job.addCacheFile(new Path("/user/tfidf/output/weibo2/part-r-00000")
            .toUri());
Mapper: key: LongWritable (offset); value: "today_3823890314914825 \t 2".
Step 1: Before the map method formally runs, the setup(Context context) method is executed. Its purpose is to load the cached total post count and DF values into Map objects (cmap and df) so the map operations can use them directly.
Step 2: Start the map operation. Because the mapper's input is the first MapReduce's output, it must again check whether the current file is the last one (count 1075). For each data line, cut on "\t": the TF value is v[1] = 2. Cut v[0] on "_" to get the word (today) and the Weibo id (3823890314914825). Get "count" from cmap and the word's DF value from df, then compute the word's TF-IDF from its TF value: double s = tf * Math.log(cmap.get("count") / df.get(w));
Step 3: Output the data: key = Weibo id, value = (w: tf*idf value). A sketch of this mapper follows.
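A sketch of the third job's mapper, assuming the two cache files added above. Reading the files through their URI paths in setup() is illustrative; the exact localization of cache files varies by Hadoop version and deployment, so treat that part as an assumption:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class LastMapper extends Mapper<LongWritable, Text, Text, Text> {
        // cmap holds the total post count ("count" -> 1075);
        // df holds each word's document frequency (word -> number of posts).
        private final Map<String, Integer> cmap = new HashMap<>();
        private final Map<String, Integer> df = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            for (URI uri : context.getCacheFiles()) {
                // part-r-00003 is the "count" file; the other holds DF values.
                boolean isCount = uri.getPath().endsWith("part-r-00003");
                try (BufferedReader br =
                         new BufferedReader(new FileReader(new File(uri.getPath())))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        String[] kv = line.trim().split("\t"); // "key \t sum"
                        if (kv.length == 2) {
                            (isCount ? cmap : df).put(kv[0], Integer.parseInt(kv[1]));
                        }
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Skip the "count 1075" file, as in the second job.
            FileSplit split = (FileSplit) context.getInputSplit();
            if (split.getPath().getName().contains("part-r-00003")) {
                return;
            }
            // "today_3823890314914825 \t 2" --> word, Weibo id, TF value.
            String[] v = value.toString().trim().split("\t");
            int tf = Integer.parseInt(v[1]);
            String w = v[0].split("_")[0];
            String id = v[0].split("_")[1];
            double s = tf * Math.log((double) cmap.get("count") / df.get(w));
            // key = Weibo id, value = "word:tfidf"
            context.write(new Text(id), new Text(w + ":" + s));
        }
    }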
Reducer: key = Weibo id, value = (w: tf*idf value). Data example: key = 3823890314914825, value = {today: 2.89101, shopping: 3.08092}.
Step 1: Integrate the data after the shuffle (values with the same key form one group). Define a StringBuffer and, while traversing the values in the iterator, concatenate each word with its corresponding TF * IDF value.
Step 2: Write the reduced result: context.write(key, new Text(sb.toString())). A sketch of this reducer follows.
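A sketch of the concatenating reducer, under the same assumptions (LastReducer is an illustrative name):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LastReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // All "word:tfidf" pairs of one Weibo post arrive as one group;
            // join them into a single output line for that post.
            StringBuffer sb = new StringBuffer();
            for (Text v : values) {
                sb.append(v.toString()).append('\t');
            }
            context.write(key, new Text(sb.toString()));
        }
    }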
4. How do sellers implement precision marketing?
After the above process, the final data we get is 3823890314914825 {today: 2.89101, shopping: 3.08092}, that is, the TF * IDF value of each word in each Weibo post. For example, if Korean food is to be pushed, you only need to sort the words of each post in the dataset by their TF-IDF values in descending order and take the top three, then traverse the whole dataset: every post whose top three TF-IDF words contain the seller's keyword belongs to the audience the seller should push to.
     
