TF-IDF MapReduce Java Code Implementation Ideas

Source: Internet
Author: User
Tags: shuffle, idf

TF-IDF: 1. Concept 2. Principle

3. Java Code Implementation Ideas

Data set: Weibo (microblog) posts. The computation uses three MapReduce jobs.

First MapReduce (the IK analyzer splits each post, i.e. the content of one record, into individual words). Results of the first MapReduce run:
1. The total number of microblogs in the data set;
2. The TF value of each word in each microblog.

Mapper end: key: LongWritable (offset); value: 3823890314914825 \t "The weather was fine today, and the sisters were about to go shopping together."
Step one: split the record on '\t' to get the microblog ID and the content.
Step two: segment the content with the IK analyzer (today, weather, sisters), traverse the segmentation result, and for each word w output (w_id, 1), i.e. the word joined to the microblog ID.
Step three: once the content has been traversed, record the current microblog by outputting (count, 1). A minimal sketch of this mapper is shown below.
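A minimal sketch of the first job's mapper, assuming the IKSegmenter/Lexeme API of the IK analyzer; the class name and the "word_id" key format follow the description above, everything else is illustrative.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    public class FirstMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Step one: split the record on '\t' into microblog id and content.
            String[] fields = value.toString().trim().split("\t");
            if (fields.length < 2) {
                return; // skip malformed records
            }
            String id = fields[0];
            String content = fields[1];

            // Step two: segment the content with the IK analyzer and
            // emit (word_id, 1) for every word of the post.
            IKSegmenter ik = new IKSegmenter(new StringReader(content), true);
            Lexeme lexeme;
            while ((lexeme = ik.next()) != null) {
                String w = lexeme.getLexemeText();
                context.write(new Text(w + "_" + id), ONE);
            }

            // Step three: count this microblog towards the total.
            context.write(new Text("count"), ONE);
        }
    }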
First MR custom partitioner: extends HashPartitioner<Text, IntWritable> and overrides getPartition. The default partition rule is key.hashCode() % number of reducers. Here the key is inspected: if key.equals("count") the record goes to the last reducer; otherwise the default hashing is applied over the remaining numReduceTasks - 1 reducers, as in the sketch below.
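The following sketch matches that rule, assuming the new-API HashPartitioner from org.apache.hadoop.mapreduce.lib.partition; the class name is illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class FirstPartition extends HashPartitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (key.equals(new Text("count"))) {
                // The microblog total goes to the last reducer, so it ends up
                // in its own output file.
                return numReduceTasks - 1;
            }
            // All word_id keys are hashed over the remaining reducers
            // (assumes numReduceTasks > 1).
            return super.getPartition(key, value, numReduceTasks - 1);
        }
    }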
Reducer end: the first kind of input is key: w_id, value: {1,1,1}; the second kind is key: count, value: {1,1,1,...}.
Step one: the data coming out of the shuffle is already grouped (identical keys form one group); iterate over the values in the iterator and sum them.
Step two: write out the reduce result, context.write(key, new IntWritable(sum)).
Note: because the number of reducers is set to four in FirstJob (job.setNumReduceTasks(4)), there are four output files; and since key=count is sent to a specific reducer, key:count value{1,1,1...} lands in the last file while key:w_id value:{1,1,1} lands in the first three files. A minimal reducer sketch is shown below.
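A minimal sketch of this summing reducer; the class name is illustrative. As the note says, the driver would additionally call job.setNumReduceTasks(4) and register the partitioner above.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FirstReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Step one: sum the 1s of one group.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // Step two: key is either "word_id" (sum = TF of that word in that post)
            // or "count" (sum = total number of microblogs).
            context.write(key, new IntWritable(sum));
        }
    }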
Second MapReduce: reads the output of the first MapReduce as its input.
Result of the second MapReduce run: 1. For each word, the number of microblogs in which it appears, i.e. its DF value.
Mapper end: key: LongWritable (offset); value: today_3823890314914825 2.
Step one: get the input split of the current mapper task and, judging by the FileSplit's file name, make sure it is not the last file (because the content of the last file is count 1075).
Step two: the mapper's input value is today_3823890314914825 \t 2;
process it by splitting on '\t' and then on '_', and output context.write(today, 1). (Note: only the number of microblogs containing "today" is being counted here, so the microblog ID is no longer needed.)
Reducer end: key: w, value: {1,1,1}. Data sample: key=today, value={1,1,1,1,1}; each 1 means that one microblog in the data set contains the word "today".
Step one: the data coming out of the shuffle is grouped (identical keys form one group); iterate over the values in the iterator and sum them.
Step two: write out the reduce result, context.write(key, new IntWritable(sum)).
After the second MapReduce has run, the DF (document frequency) value of every word is known. A minimal sketch of the second job's mapper is shown below; its reducer is the same summing reducer as in the first job.
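A sketch of the second job's mapper; the part-r-00003 file name and the record format are taken from the description above, the class name is illustrative.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SecondMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Step one: skip the file that only holds the microblog total ("count 1075").
            FileSplit split = (FileSplit) context.getInputSplit();
            if (split.getPath().getName().contains("part-r-00003")) {
                return;
            }
            // Step two: a record looks like "today_3823890314914825<TAB>2".
            String[] fields = value.toString().trim().split("\t");
            String[] wordAndId = fields[0].split("_");
            if (wordAndId.length >= 2) {
                // One (word, 1) per microblog that contains the word, so the
                // summing reducer yields the DF value.
                context.write(new Text(wordAndId[0]), new IntWritable(1));
            }
        }
    }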
Third MapReduce: compute the TF*IDF value.
Result of the third MapReduce run: 1. The TF-IDF value of every word in every microblog. Example result: {3823890314914825 today: 2.78834 shopping: 3.98071 sisters: 1.98712}.
Tips:
The fourth output file of the first MapReduce (count 1075) is needed for computing the TF-IDF value of every word, so it is loaded into memory when the job runs to improve efficiency. The output file of the second MapReduce, which records in how many microblogs each word appears (the DF value, e.g. today 5), only contains the distinct words; unlike the data set it is not very large and can also be loaded into memory to improve execution efficiency:
    // Load the microblog total into memory
    job.addCacheFile(new Path("/user/tfidf/output/weibo1/part-r-00003").toUri());
    // Load the DF values into memory
    job.addCacheFile(new Path("/user/tfidf/output/weibo2/part-r-00000").toUri());
Mapper end: key: LongWritable (offset); value: today_3823890314914825 2.
Step one: before the map method is executed, the setup(context) method runs first. Its purpose is to load the cached microblog total and the DF values into Map objects (cmap, df) so that the map method can use them conveniently.
Step two: the map operation starts. Because the mapper's input is the output of the first MapReduce, it must check whether the record comes from the last file (count 1075). Process the data: split on '\t' to get the TF value v[1] = 2, while v[0] is split on '_' to get the word (today) and the microblog ID (3823890314914825). Get "count" from cmap and the word's DF value from df, then compute the word's TF*IDF from its TF value: double s = tf * Math.log(cmap.get("count") / df.get(w));
Step three: output key = microblog ID, value = (w: tf*idf value). A sketch of this mapper, including the setup step, is shown below.
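A sketch of the third job's mapper under the description above. It assumes the files added with job.addCacheFile are symlinked into the task's working directory under their base names (the default behaviour) and that the count file can be recognised by the name part-r-00003; class and field names are illustrative.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class LastMapper extends Mapper<LongWritable, Text, Text, Text> {

        private Map<String, Integer> cmap = new HashMap<String, Integer>(); // microblog total
        private Map<String, Integer> df = new HashMap<String, Integer>();   // DF per word

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Step one: load the cached files into memory before map() runs.
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles == null) {
                return;
            }
            for (URI uri : cacheFiles) {
                String fileName = new Path(uri.getPath()).getName();
                BufferedReader reader = new BufferedReader(new FileReader(fileName));
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] kv = line.trim().split("\t");
                    if (fileName.contains("part-r-00003")) {
                        cmap.put(kv[0], Integer.parseInt(kv[1])); // "count<TAB>1075"
                    } else {
                        df.put(kv[0], Integer.parseInt(kv[1]));   // "today<TAB>5"
                    }
                }
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Step two: skip the file that only holds the microblog total.
            FileSplit split = (FileSplit) context.getInputSplit();
            if (split.getPath().getName().contains("part-r-00003")) {
                return;
            }
            String[] v = value.toString().trim().split("\t"); // "today_3823890314914825<TAB>2"
            int tf = Integer.parseInt(v[1]);
            String[] wordAndId = v[0].split("_");
            String w = wordAndId[0];
            String id = wordAndId[1];

            // TF * IDF with IDF = log(total microblogs / DF of the word).
            double s = tf * Math.log((double) cmap.get("count") / df.get(w));

            // Step three: key = microblog id, value = "word:tf-idf".
            context.write(new Text(id), new Text(w + ":" + s));
        }
    }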
Reducer end: key = microblog ID, value = (w: tf*idf value). Sample data: key=3823890314914825, value={today: 2.89101, shopping: 3.08092}.
Step one: the data coming out of the shuffle is grouped (identical keys form one group); iterate over the values in the iterator, define a StringBuffer, and concatenate every word in the iterator with its corresponding TF*IDF value.
Step two: write out the reduce result, context.write(key, new Text(sb.toString())). A minimal reducer sketch is shown below.
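A minimal sketch of that reducer; the class name and the separator between the word:score pairs are illustrative.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LastReduce extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Step one: all "word:tf-idf" pairs of one microblog arrive together;
            // concatenate them into a single line.
            StringBuffer sb = new StringBuffer();
            for (Text v : values) {
                sb.append(v.toString()).append("\t");
            }
            // Step two: key = microblog id, value = "word1:score1  word2:score2 ...".
            context.write(key, new Text(sb.toString()));
        }
    }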
4. How do businesses achieve precise marketing?
After the above process, the final data is 3823890314914825 {today: 2.89101, shopping: 3.08092}, i.e. the TF*IDF value of every word in every microblog. For example, if a Korean restaurant wants to promote its big bone soup, it only needs to sort the TF-IDF values of the words of each microblog in the data set in descending order and take the first three;
then, traversing the whole data set, the users whose top-three TF*IDF words include "big bone soup" are the ones the business should push to. A small in-memory sketch of this selection step is shown below.
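A hypothetical helper, not part of the article's MapReduce jobs, showing the selection rule for one microblog's scores:

    import java.util.Map;

    public class PushTargetCheck {

        // Given one microblog's "word -> tf-idf" scores, decide whether the
        // post (and hence its author) is a target for the given product word.
        public static boolean isPushTarget(Map<String, Double> wordScores, String productWord) {
            return wordScores.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(3)                                       // keep the top three words
                    .anyMatch(e -> e.getKey().equals(productWord)); // e.g. "big bone soup"
        }
    }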
