I have never been sure how to translate "word co-occurrence" properly. Word similarity? Co-occurring words? A word co-occurrence matrix?
Whatever the name, it is a commonly used statistical text-processing algorithm that measures, across a set of documents, which phrases occur together most frequently. Strictly speaking, it operates on context phrases rather than single words. It is a simple algorithm, other statistical algorithms can be built on top of it, and it is handy for recommendation, because it answers questions of the form "people who saw this also saw that": recommending shopping items (as an alternative to collaborative filtering), analyzing credit card risk, or computing what everyone likes.
For example, in the sentence "I love you", the pair "I love" is often accompanied by the pair "love you". Note that processing Chinese differs from English: because Chinese text has no spaces between words, it must first be preprocessed with word segmentation. A concrete example of the pairs this produces is sketched below.
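To make the pairing concrete, here is what this job would produce for a corpus containing only the single line "i love you" (after lowercasing and splitting on non-word characters):

i,love	1
love,you	1

Every adjacent pair of words becomes a key; the reducer then sums the 1s emitted for identical pairs across the whole corpus.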
The code is split into a mapper, a reducer, and a driver.
Mapper program:
package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    /* Lowercase the whole line. */
    String line_lc = value.toString().toLowerCase();
    String before = null;
    /*
     * Split the line into words; the key is the previous word
     * plus the current word, and the value is 1.
     */
    for (String word : line_lc.split("\\W+")) { // loop over the words of the line
      if (word.length() > 0) {
        if (before != null) { // the first word has no predecessor, so emit nothing for it
          context.write(new Text(before + "," + word), new IntWritable(1));
        }
        before = word; // the current word becomes the predecessor
      }
    }
  }
}
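Note that before is reset to null at the start of every map() call, so pairs never span line boundaries: each input line acts as its own context window.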
Reducer program:
package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get(); // sum the counts for this word pair
    }
    context.write(key, new IntWritable(wordCount));
  }
}
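With Hadoop's default TextOutputFormat, each line of the job's output is the comma-joined word pair, a tab, and its total count across the corpus, e.g. a hypothetical "love,you	27" if that pair appeared 27 times.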
The driver needs no explanation; drivers the world over look the same:
package wco;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCo extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: hadoop jar wco.WCo <input> <output>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(WCo.class);
    job.setJobName("Word Co Occurrence");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WCoMapper.class);
    job.setReducerClass(WCoReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new WCo(), args);
    System.exit(exitCode);
  }
}
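Two optional usage notes, neither of which is part of the original code. First, because the reduce step is a plain associative and commutative sum, the reducer class can double as a combiner to cut shuffle traffic; this would be a one-line addition inside run(), after setReducerClass:

job.setCombinerClass(WCoReducer.class); // optional: pre-sum pairs on the map side

Second, assuming the three classes are packaged into a jar called wco.jar (the jar name and HDFS paths below are placeholders), the job would be launched as:

hadoop jar wco.jar wco.WCo /user/hadoop/input /user/hadoop/output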
The core of the algorithm is to use the previous word plus the current word as the key and a count of 1 as the value, then sum the counts to obtain co-occurrence frequencies that can be used to cluster text. The Internet is awash in K-means write-ups, but in practice the algorithm should follow the requirements: K-means or fuzzy K-means is not necessarily superior, and a humble wordcount-style job is not necessarily inferior.