TF-IDF
1. Concept
2. Principle
3. Java Code Implementation Ideas
Data set: micro-blog (Weibo) posts, one record per line in the form ID \t content.
The computation uses three MapReduce jobs.
First MapReduce (each post, i.e. the content field of one record, is split into words with the IK tokenizer). Results of the first MapReduce run:
1. The total number of micro-blogs in the data set.
2. The TF value of each word in each micro-blog.
Mapper end: key: LongWritable (offset), value: 3823890314914825 \t "The weather was fine today, and the sisters were about to go shopping together."
Step one: split the line on '\t' to get the ID and the content.
Step two: segment the content with the IK tokenizer (today, weather, sisters, ...); traverse the segmentation results and, for each word w, output (w_id, 1).
Step three: once the content has been traversed, record the current micro-blog by outputting (count, 1).
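Here is a minimal sketch of such a first-job mapper. The class name FirstMapper is an assumption, and the tokenization relies on the IK Analyzer classes IKSegmenter and Lexeme being on the classpath; the original code may differ in details:

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Illustrative sketch of the first job's mapper.
public class FirstMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] line = value.toString().trim().split("\t");
        if (line.length >= 2) {
            String id = line[0].trim();       // Weibo ID
            String content = line[1].trim();  // post content
            // Tokenize the content with the IK Analyzer (assumed API).
            IKSegmenter iks = new IKSegmenter(new StringReader(content), true);
            Lexeme word;
            while ((word = iks.next()) != null) {
                String w = word.getLexemeText();
                // One (word_id, 1) pair per token; the sums become the TF values.
                context.write(new Text(w + "_" + id), new IntWritable(1));
            }
            // One (count, 1) pair per post; the sum becomes the total number of posts.
            context.write(new Text("count"), new IntWritable(1));
        }
    }
}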
Custom partitioner of the first MR: extends HashPartitioner<Text, IntWritable> and overrides getPartition. The default partition rule is key.hashCode() % (number of reducers). Here the key itself is inspected: if key.equals("count"), the record goes to the last reducer; otherwise it is spread over the first reduceCount - 1 reducers (a code sketch of this partitioner follows the note below).
Reducer end:
First case -- key: w_id, value: {1,1,1,...}
Second case -- key: count, value: {1,1,1,...}
Step one: consolidate the data after the shuffle (values with the same key form one group; traverse the values in the iterator and sum them).
Step two: write out the reduce result: context.write(key, new IntWritable(sum)).
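Here is a minimal sketch of that summing reducer, assuming the class name FirstReducer (it handles the w_id keys and the count key the same way):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: sums the 1s emitted by the mapper for each key.
public class FirstReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // key is either "word_id" (sum = its TF value) or "count" (sum = total number of posts)
        context.write(key, new IntWritable(sum));
    }
}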
Note: because the number of reducers of the first job is set with job.setNumReduceTasks(4), four output files are produced, and the key "count" is routed to its own reducer. As a result,
key: count, value: {1,1,1,...} ends up in the last file, while
key: w_id, value: {1,1,1,...} ends up in the first three files.
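Below is a minimal sketch of the custom partitioner described above; the class name FirstPartition is an assumption, and numReduceTasks - 1 is the index of the last reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Illustrative sketch: send the "count" key to the last reducer,
// and spread all word_id keys over the remaining reducers.
public class FirstPartition extends HashPartitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.toString().equals("count")) {
            return numReduceTasks - 1;
        }
        // hash partitioning over the first numReduceTasks - 1 reducers
        return super.getPartition(key, value, numReduceTasks - 1);
    }
}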
Second MapReduce: reads the output of the first MapReduce as its input.
Result of the second MapReduce run:
1. For each word, how many micro-blogs in the data set it appears in, i.e. its DF value.
Mapper end: key: LongWritable (offset), value: today_3823890314914825 \t 2
Step one: get the input split (FileSplit) of the current mapper task and check its file name, making sure it is not the last file (the last file only contains count 1075).
Step two: the mapper input value is today_3823890314914825 \t 2; split it on "\t", then split the first field on "_", and output context.write(today, 1). // Note: here we only count how many micro-blogs contain the word "today", so the Weibo ID is ignored. A sketch of this mapper follows below.
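Here is a minimal sketch of that second-job mapper. The class name SecondMapper is an assumption, and the check for the last file relies on the first job's output file name part-r-00003 (the same path that is later added to the distributed cache):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative sketch: emit (word, 1) once per micro-blog that contains the word.
public class SecondMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Skip the last file of the first job, which only holds the "count" total.
        FileSplit split = (FileSplit) context.getInputSplit();
        if (!split.getPath().getName().contains("part-r-00003")) {
            String[] line = value.toString().trim().split("\t"); // e.g. "today_3823890314914825" and "2"
            if (line.length >= 2) {
                String[] wordAndId = line[0].split("_");
                if (wordAndId.length >= 2) {
                    // Only the word matters here; the Weibo ID is ignored.
                    context.write(new Text(wordAndId[0]), new IntWritable(1));
                }
            }
        }
    }
}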
Reducer end: key: w, value: {1,1,1,...}. Data sample: key = today, value = {1,1,1,1,1} // each 1 means one micro-blog in the data set contains the word "today".
Step one: consolidate the data after the shuffle (values with the same key form one group; traverse the values in the iterator and sum them).
Step two: write out the reduce result: context.write(key, new IntWritable(sum)).
After the second MapReduce, the DF (document frequency) value of each word has been obtained.
Third MapReduce: calculates the TF*IDF values.
Result of the third MapReduce run:
1. The TF-IDF value of each word in each micro-blog. Example result: {3823890314914825 today: 2.78834 shopping: 3.98071 sisters: 1.98712}
Tips:
The fourth output file of the first MapReduce (count 1075) is needed when calculating each word's TF-IDF value, so it is loaded into memory before the job's map tasks run, to improve efficiency.
The second MapReduce's output file -- how many micro-blogs in the data set each word appears in, i.e. the DF value (e.g. today 5) -- only contains the vocabulary, so unlike the full data set it is not very large and can also be loaded into memory to improve execution efficiency.
// Load the total number of micro-blogs into memory
job.addCacheFile(new Path("/user/tfidf/output/weibo1/part-r-00003")
        .toUri());
// Load the DF values into memory
job.addCacheFile(new Path("/user/tfidf/output/weibo2/part-r-00000")
        .toUri());
Mapper end: key: LongWritable (offset), value: today_3823890314914825 \t 2
Step one: before the map method is executed, the setup(context) method runs first. Purpose: load the total number of micro-blogs and the DF values from the cached files into Map objects (cmap, df) so the map method can use them directly.
Step two: run the map logic. Because the mapper input is the output of the first MapReduce, it must first check that the current split is not the last file (count 1075). Split the value on "\t" to get the TF value --> v[1] = 2, and split v[0] on "_" to get the word (today) and the Weibo ID (3823890314914825). Take "count" from cmap and the word's DF value from df, then compute the word's TF*IDF value from its TF value:
double s = tf * Math.log(cmap.get("count") / df.get(w));
Step three: output the data with key = Weibo ID and value = (w:tf*idf value). A sketch of this mapper, including the setup method, follows below.
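Below is a minimal sketch of the third job's mapper with its setup method. It assumes the two cached files can be opened by their plain file names in the task's working directory (the usual behaviour of job.addCacheFile); class and variable names such as LastMapper, cmap and df are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative sketch of the third job's mapper.
public class LastMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Map<String, Integer> cmap = new HashMap<>(); // "count" -> total number of posts
    private Map<String, Integer> df = new HashMap<>();   // word -> DF value

    @Override
    protected void setup(Context context) throws IOException {
        // Load the cached files (total post count and DF values) into memory.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null) {
            for (URI uri : cacheFiles) {
                String name = new Path(uri.getPath()).getName();
                Map<String, Integer> target = uri.getPath().contains("weibo1") ? cmap : df;
                try (BufferedReader br = new BufferedReader(new FileReader(name))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        String[] kv = line.split("\t"); // e.g. "count\t1075" or "today\t5"
                        if (kv.length >= 2) {
                            target.put(kv[0], Integer.parseInt(kv[1].trim()));
                        }
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Skip the "count" file of the first job's output.
        FileSplit split = (FileSplit) context.getInputSplit();
        if (split.getPath().getName().contains("part-r-00003")) {
            return;
        }
        String[] v = value.toString().trim().split("\t"); // "today_3823890314914825" and "2"
        if (v.length >= 2) {
            int tf = Integer.parseInt(v[1].trim());
            String[] wordAndId = v[0].split("_");
            String w = wordAndId[0];
            String id = wordAndId[1];
            // TF * IDF, with IDF = log(total number of posts / DF of the word)
            double s = tf * Math.log(cmap.get("count") * 1.0 / df.get(w));
            context.write(new Text(id), new Text(w + ":" + s));
        }
    }
}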
Reducer end: key = Weibo ID, value = (w:tf*idf value). Sample data: key = 3823890314914825, value = {today: 2.89101, shopping: 3.08092}
Step one: consolidate the data after the shuffle (values with the same key form one group; traverse the values in the iterator, define a StringBuffer, and append each word from the iterator together with its TF*IDF value).
Step two: write out the reduce result: context.write(key, new Text(sb.toString())).
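A minimal sketch of that concatenating reducer, with an assumed class name LastReducer:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: concatenate all "word:tfidf" pairs of one micro-blog.
public class LastReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuffer sb = new StringBuffer();
        for (Text v : values) {
            sb.append(v.toString()).append("\t");
        }
        // key = Weibo ID, value = "today:2.89101   shopping:3.08092 ..."
        context.write(key, new Text(sb.toString()));
    }
}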
4. How can businesses achieve precision marketing? After the above process, the final data looks like 3823890314914825 {today: 2.89101, shopping: 3.08092}, i.e. the TF*IDF value of each word in each micro-blog. For example, suppose a Korean restaurant wants to promote its big bone soup. It only needs to sort the words of each micro-blog in the data set in descending order of their TF-IDF values and keep the top 3; traversing the whole data set, every user whose top-three words include "big bone soup" is a target the business should push the promotion to (a small sorting sketch follows).
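As a rough, non-MapReduce illustration of this last step, the sketch below picks the top-3 TF-IDF words of a single micro-blog and checks whether the promoted keyword is among them; the class name and the example values are made up:

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopWords {
    // Return the n words with the highest TF-IDF values for one micro-blog.
    static List<String> topN(Map<String, Double> tfidf, int n) {
        return tfidf.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy example values for a single post (Java 9+ Map.of).
        Map<String, Double> post = Map.of("today", 2.89101, "shopping", 3.08092, "big bone soup", 4.12);
        List<String> top3 = topN(post, 3);
        // If the promoted keyword is among the top-3 words, push the promotion to this user.
        System.out.println(top3.contains("big bone soup"));
    }
}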