Hadoop word co-occurrence implementation


I have never been sure how best to translate "word co-occurrence": word similarity? Co-occurring words? Or a word co-occurrence matrix?

It is a text-processing algorithm commonly used in statistics to find, across a set of documents, the pairs of words that occur together most frequently; strictly speaking it works on words in context (pairs of adjacent words) rather than on single words. It is a common building block from which other statistical algorithms can be derived, and it is useful for recommendation because it produces results of the form "people who look at this also look at that": for example, recommending shopping items (as an alternative to collaborative filtering), analyzing credit-card risk, or working out what everyone likes.


For example, in the sentence "I love you", the pair "I love" is often accompanied by "love you". Note that processing Chinese differs from English: Chinese text has no spaces between words, so it must first go through word segmentation as a preprocessing step.
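Before the MapReduce version, here is a minimal single-machine sketch of the same idea (the class name CoOccurrenceSketch and the sample lines are only illustrative, not part of the original code): each line is lowercased, split on non-word characters, and every (previous word, current word) pair is counted in a HashMap.

package wco;

import java.util.HashMap;
import java.util.Map;

public class CoOccurrenceSketch {
  public static void main(String[] args) {
    String[] lines = { "I love you", "I love Hadoop" };
    Map<String, Integer> counts = new HashMap<>();
    for (String line : lines) {
      String before = null;
      for (String word : line.toLowerCase().split("\\W+")) {
        if (word.length() > 0) {
          if (before != null) {
            // key is "previous word,current word", value is how often the pair was seen
            counts.merge(before + "," + word, 1, Integer::sum);
          }
          before = word;
        }
      }
    }
    // prints, in some order: i,love 2   love,you 1   love,hadoop 1
    counts.forEach((pair, n) -> System.out.println(pair + "\t" + n));
  }
}

The MapReduce code below does exactly this, except that the counting is distributed: the mapper emits each pair with a count of 1 and the reducer sums the counts.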


The code is split into a mapper, a reducer, and a driver.

Mapper program:

package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    /* Lowercase the entire line. */
    String line_lc = value.toString().toLowerCase();
    String before = null;
    /*
     * Split the line into words; the key is the previous word plus the
     * current word, and the value is 1.
     */
    for (String word : line_lc.split("\\W+")) { // loop over the words in the line
      if (word.length() > 0) {
        if (before != null) {
          // If there is a previous word, emit the pair; for the first word
          // "before" is still null, so we just fall through to "before = word".
          context.write(new Text(before + "," + word), new IntWritable(1));
        }
        before = word; // the current word becomes the previous word
      }
    }
  }
}
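A common refinement, shown here only as a sketch of a standard Hadoop idiom and not as part of the original code, is to reuse the output objects instead of allocating a new Text and IntWritable for every pair, which reduces object churn on large inputs (the class name WCoMapperReuse is made up for illustration):

package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative variant of the mapper above that reuses its output objects.
public class WCoMapperReuse extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1); // constant count of 1
  private final Text pair = new Text();                      // reused output key

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String before = null;
    for (String word : value.toString().toLowerCase().split("\\W+")) {
      if (word.length() > 0) {
        if (before != null) {
          pair.set(before + "," + word); // safe: context.write serializes the objects immediately
          context.write(pair, ONE);
        }
        before = word;
      }
    }
  }
}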


Reducer program:

package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get(); // add up the counts for this word pair
    }
    context.write(key, new IntWritable(wordCount));
  }
}
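Because the reduce step is just a sum, which is associative and commutative, the same class can also be registered as a combiner so that partial counts are aggregated on the map side before the shuffle. This is an optional line for the driver below, not something the original code does:

    // Optional, in the driver's run() method, next to setReducerClass:
    job.setCombinerClass(WCoReducer.class); // pre-aggregate pair counts on the map side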


The driver needs no explanation; drivers the world over look much the same:

package wco;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCo extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: hadoop jar wco.WCo <input> <output>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(WCo.class);
    job.setJobName("Word Co Occurrence");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WCoMapper.class);
    job.setReducerClass(WCoReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new WCo(), args);
    System.exit(exitCode);
  }
}
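Assuming the three classes are packaged into a jar (the name wco.jar here is only an example), the job is submitted as "hadoop jar wco.jar wco.WCo <input> <output>", where <input> and <output> are HDFS paths, matching the usage string printed by run(). For an input line such as "I love you, I love Hadoop", the output directory then contains tab-separated pair counts along the lines of "i,love 2", "love,you 1", "you,i 1", and "love,hadoop 1".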


The core of the algorithm is to use the word pair (the previous word plus the current word) as the key and a count of 1 as the value, then sum the counts to obtain the co-occurrence frequency of each pair, which can in turn be used to cluster text. Searching the Internet for text clustering mostly turns up k-means, but in practice the algorithm should follow the requirement: k-means or fuzzy k-means is not necessarily more advanced, and a wordcount-style job is not necessarily more primitive.

This article originally appeared on the "Practice tests truth" blog; reproduction is declined.
