The awk way to implement word frequency statistics:
Mode one: vi wordcount.awk

{
    # NF is the number of fields in the record currently being processed
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s %d\n", word, freq[word]
}

Run: awk -f wordcount.awk words.txt

Mode two: vi wordcount_awk.sh

#!/bin/sh
awk -F " " '{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s %d\n", word, freq[word]
}' $1

Then make it executable and run it:

chmod u+x wordcount_awk.sh
./wordcount_awk.sh words.txt

NF is the number of fields in the record being processed; $NF is the last field (column), so printing $NF outputs the contents of the last field:

[user@host shell]# free -m | grep buffers\/
-/+ buffers/cache:       1815       1859
[user@host shell]# free -m | grep buffers\/ | awk '{print $NF}'
1859
[user@host shell]# free -m | grep buffers\/ | awk '{print NF}'
4

printf conversion specifiers: %x is hexadecimal, %o is octal, %d or %i is a decimal integer, %c is a character, %s is a string, %f or %e is a real number (fixed-point or exponential notation), %ld is a long integer, and %% prints a literal percent sign.
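The same conversion specifiers described above also exist in Java's String.format, which makes them easy to try outside awk. A minimal sketch (the class name FormatDemo is only for illustration):

```java
public class FormatDemo {
    public static void main(String[] args) {
        int n = 255;
        // %d decimal, %o octal, %x hexadecimal -- the same conversions awk's printf uses
        System.out.println(String.format("%d %o %x", n, n, n));           // 255 377 ff
        // %s string, %c character, %f fixed-point, %e exponential notation
        System.out.println(String.format("%s %c %.2f %e", "word", 'w', 3.14, 31400.0));
    }
}
```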
The Java API way to implement word frequency statistics:
package cn.wordtongji;

import java.io.*;
import java.util.HashMap;
import java.util.Map;

/**
 * Created by Administrator on 2018/6/1.
 */
public class WordDemo {
    public static void main(String[] args) throws IOException {
        // Read the file contents and get a reader object
        BufferedReader br = new BufferedReader(new FileReader("D:\\test\\aaa.txt"));
        String nextLine;
        // Define the HashMap once, outside the loop
        Map<String, Integer> map = new HashMap<String, Integer>();
        while ((nextLine = br.readLine()) != null) {
            // Split each line on spaces to get an array of words
            String[] data = nextLine.split(" ");
            // Put the words into the map, traversing with a for loop
            for (String word : data) {
                // Increment the count for this word (putting a constant 1 would
                // overwrite the count instead of accumulating it)
                map.put(word, map.getOrDefault(word, 0) + 1);
            }
        }
        br.close();
        // Traverse the words in the map
        // keySet(): returns all the keys in the map as a Set
        for (String key : map.keySet()) {
            // Look up the value (the count) corresponding to each key
            System.out.println(key + "----" + map.get(key));
        }
    }
}
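Because the class above reads from a hard-coded file path, the counting loop is easiest to check against an in-memory array of lines. A small sketch of the same logic (the sample lines and class name are my own):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountInline {
    public static void main(String[] args) {
        // Stand-in for the lines read from the file
        String[] lines = {"a b a", "b a"};
        Map<String, Integer> map = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                // getOrDefault keeps the count cumulative instead of resetting it to 1
                map.put(word, map.getOrDefault(word, 0) + 1);
            }
        }
        System.out.println(map); // {a=3, b=2}
    }
}
```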
MapReduce implements word frequency statistics:
package cn.bcqm1711.mr.day01;

/**
 * Created by Administrator on 2018/5/2.
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

import java.io.IOException;

/**
 * @author : yongke.pan
 * @Desc : Custom word frequency statistics
 * @create 2018-05-02 9:44
 **/
public class CustomWordCount {

    // MapTask stage: by default one data block corresponds to one split,
    // and one split corresponds to one MapTask.
    // LongWritable, Text are the data types of each line's offset and content.
    // Text, IntWritable are the data types of each map output key/value.
    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once before the business code starts
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
        }

        // The business logic; the map method is called once per line
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the content of each line
            String line = value.toString();
            // Split the line to get the words
            String[] words = line.split(" ");
            for (String wd : words) {
                word.set(wd);
                // Output to local disk: <word, 1>
                context.write(word, one);
            }
        }

        // Called once after the business code has finished
        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
        }
    }

    // ReduceTask stage
    // The first Text, IntWritable pair receives the key/value data types of the MapTask output.
    // The second Text, IntWritable pair is the key/value data type written to HDFS
    // after the data has been processed.
    public static class WCReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once before the reduce business code starts
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
        }

        // Keys with the same hash code are assigned to the same reduce
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // Output the aggregated word and its count to HDFS
            context.write(key, new IntWritable(sum));
        }

        // Called once after the reduce business code has finished
        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
        }
    }

    // Driver portion of the job
    public static void main(String[] args) throws Exception {
        // Get the configuration object
        Configuration conf = new Configuration();
        // "CustomWordCount" is the name of the job, which makes it easy to find
        // on the history server
        // Job job = new Job();
        Job job = Job.getInstance(conf, "CustomWordCount");
        // Set the entry class of the program
        job.setJarByClass(CustomWordCount.class);

        // Wire up the MapTask stage
        job.setMapperClass(WCMapper.class);              // the business code for the map stage
        job.setMapOutputKeyClass(Text.class);            // tell the MR framework the map output key type
        job.setMapOutputValueClass(IntWritable.class);   // tell the MR framework the map output value type
        // Receive the parameters of the main method (passed in when the job is
        // submitted, e.g. /words3.txt)
        FileInputFormat.addInputPath(job, new Path(args[0])); // the file path to process

        // Wire up the ReduceTask stage
        job.setReducerClass(WCReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Which HDFS directory (e.g. /out0502) the data is written to
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(2);

        // Submit the job
        boolean isOk = job.waitForCompletion(true);
        System.exit(isOk ? 0 : 1);
    }
}
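With HashPartitioner and setNumReduceTasks(2), each key is routed to reducer (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A minimal sketch of that routing rule, runnable without Hadoop on the classpath (String.hashCode is used here as a stand-in; Hadoop's Text computes its own byte-based hash, so the actual reducer assignments may differ):

```java
public class PartitionDemo {
    // Mirrors HashPartitioner.getPartition: mask off the sign bit, then take the
    // remainder modulo the number of reduce tasks.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String word : new String[]{"hadoop", "hdfs", "mapreduce", "yarn"}) {
            System.out.println(word + " -> reducer " + partition(word, 2));
        }
    }
}
```

The masking with Integer.MAX_VALUE matters because hashCode can be negative, and a negative remainder would be an invalid partition number.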
The Scala approach to word frequency statistics:
package cn.qmScala.day04Scala

/**
 * Created by Administrator on 2018/6/2.
 */
object Demo15WordCount {
  val acc = true

  def main(args: Array[String]) {
    val data = Array("Jin Tian Tian qi bu cuo xiang chu qu Wan")

    // Split into words using the flatMap method
    val words: Array[String] = data.flatMap(_.split(" "))

    // Turn each word into the form (word, 1)
    val word_one: Array[(String, Int)] = words.map((_, 1))

    // Group by word
    val groupByWord: Map[String, Array[(String, Int)]] = word_one.groupBy(_._1)

    // 1. Count the number of occurrences of each word
    val words_times: Map[String, Int] = groupByWord.mapValues(_.size)
    // for ((k, v) <- words_times) println(s"$k, $v")

    // 2. Sort by the number of occurrences. Put the pairs into a List and sort
    // with the List's methods; the negated key sorts in descending order
    // (a plain sortBy(_._2) would put the LEAST frequent words first)
    val wordsTimesList: List[(String, Int)] = words_times.toList
    val wordCountTimeSort: List[(String, Int)] = wordsTimesList.sortBy(-_._2)
    // for ((k, v) <- wordCountTimeSort) println(s"$k, $v")

    // 3. Take the three most frequent words, the Scala way
    val wordCountTop3 = wordCountTimeSort.take(3)
    for ((k, v) <- wordCountTop3) println(s"$k, $v")
  }
}
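For comparison with the Java API section above, the same flatMap / group / count / sort / take-3 pipeline can be sketched with Java 8 streams (the class name Top3Words is my own; among words tied at one occurrence the order is unspecified):

```java
import java.util.*;
import java.util.stream.*;

public class Top3Words {
    public static void main(String[] args) {
        String[] data = {"Jin Tian Tian qi bu cuo xiang chu qu Wan"};
        List<Map.Entry<String, Long>> top3 = Arrays.stream(data)
                .flatMap(line -> Arrays.stream(line.split(" ")))               // split into words
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())) // word -> count
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()) // most frequent first
                .limit(3)                                                      // top three
                .collect(Collectors.toList());
        top3.forEach(e -> System.out.println(e.getKey() + ", " + e.getValue()));
    }
}
```

With this sample line, "Tian, 2" prints first, followed by two of the words that occur once.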
awk, the Java API, MapReduce, and Scala: four ways to implement word frequency statistics.