The awk way to implement word frequency statistics:
Mode one: vi wordcount.awk

{
    # NF is the number of fields in the record currently being processed
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s %d\n", word, freq[word]
}

Run: awk -f wordcount.awk words.txt

Mode two: vi wordcount_awk.sh

#!/bin/sh
awk -F " " '{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s %d\n", word, freq[word]
}' $1

Then make it executable and run it:

chmod u+x wordcount_awk.sh
./wordcount_awk.sh words.txt

NF is the number of fields in the record being processed; $NF is the last field (column), so printing $NF outputs the contents of the last field:

[user@host shell]# free -m | grep buffers\/
-/+ buffers/cache:       1815       1859
[user@host shell]# free -m | grep buffers\/ | awk '{print $NF}'
1859
[user@host shell]# free -m | grep buffers\/ | awk '{print NF}'
4

printf conversion specifiers: %x is hexadecimal, %o is octal, %d or %i is a decimal integer, %c is a character, %s is a string, %f or %e is a real number (fixed-point or exponential notation), %ld is a long integer, and %% prints a literal percent sign.
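The same conversion specifiers described above also exist in Java's String.format, which makes them easy to try outside awk. A minimal sketch (the class name FormatDemo is only for illustration):

```java
public class FormatDemo {
    public static void main(String[] args) {
        int n = 255;
        // %d decimal, %o octal, %x hexadecimal -- the same conversions awk's printf uses
        System.out.println(String.format("%d %o %x", n, n, n));           // 255 377 ff
        // %s string, %c character, %f fixed-point, %e exponential notation
        System.out.println(String.format("%s %c %.2f %e", "word", 'w', 3.14, 31400.0));
    }
}
```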
The Java API way to implement word frequency statistics:
package cn.wordtongji;

import java.io.*;
import java.util.HashMap;
import java.util.Map;

/**
 * Created by Administrator on 2018/6/1.
 */
public class WordDemo {
    public static void main(String[] args) throws IOException {
        // Read the file contents and get a reader object
        BufferedReader br = new BufferedReader(new FileReader("D:\\test\\aaa.txt"));
        String nextLine;
        // Define the HashMap once, outside the loop
        Map<String, Integer> map = new HashMap<String, Integer>();
        while ((nextLine = br.readLine()) != null) {
            // Split each line on spaces to get an array of words
            String[] data = nextLine.split(" ");
            // Put the words into the map, traversing with a for loop
            for (String word : data) {
                // Increment the count for this word (putting a constant 1 would
                // overwrite the count instead of accumulating it)
                map.put(word, map.getOrDefault(word, 0) + 1);
            }
        }
        br.close();
        // Traverse the words in the map
        // keySet(): returns all the keys in the map as a Set
        for (String key : map.keySet()) {
            // Look up the value (the count) corresponding to each key
            System.out.println(key + "----" + map.get(key));
        }
    }
}
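Because the class above reads from a hard-coded file path, the counting loop is easiest to check against an in-memory array of lines. A small sketch of the same logic (the sample lines and class name are my own):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountInline {
    public static void main(String[] args) {
        // Stand-in for the lines read from the file
        String[] lines = {"a b a", "b a"};
        Map<String, Integer> map = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                // getOrDefault keeps the count cumulative instead of resetting it to 1
                map.put(word, map.getOrDefault(word, 0) + 1);
            }
        }
        System.out.println(map); // {a=3, b=2}
    }
}
```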
MapReduce implements word frequency statistics:
package cn.bcqm1711.mr.day01;

/**
 * Created by Administrator on 2018/5/2.
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

import java.io.IOException;

/**
 * @author : yongke.pan
 * @Desc : Custom word frequency statistics
 * @create 2018-05-02 9:44
 **/
public class CustomWordCount {

    // MapTask stage: by default one data block corresponds to one split,
    // and one split corresponds to one MapTask.
    // LongWritable, Text are the data types of each line's offset and content.
    // Text, IntWritable are the data types of each map output key/value.
    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once before the business code starts
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
        }

        // The business logic; the map method is called once per line
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the content of each line
            String line = value.toString();
            // Split the line to get the words
            String[] words = line.split(" ");
            for (String wd : words) {
                word.set(wd);
                // Output to local disk: <word, 1>
                context.write(word, one);
            }
        }

        // Called once after the business code has finished
        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
        }
    }

    // ReduceTask stage
    // The first Text, IntWritable pair receives the key/value data types of the MapTask output.
    // The second Text, IntWritable pair is the key/value data type written to HDFS
    // after the data has been processed.
    public static class WCReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once before the reduce business code starts
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
        }

        // Keys with the same hash code are assigned to the same reduce
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // Output the aggregated word and its count to HDFS
            context.write(key, new IntWritable(sum));
        }

        // Called once after the reduce business code has finished
        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
        }
    }

    // Driver portion of the job
    public static void main(String[] args) throws Exception {
        // Get the configuration object
        Configuration conf = new Configuration();
        // "CustomWordCount" is the name of the job, which makes it easy to find
        // on the history server
        // Job job = new Job();
        Job job = Job.getInstance(conf, "CustomWordCount");
        // Set the entry class of the program
        job.setJarByClass(CustomWordCount.class);

        // Wire up the MapTask stage
        job.setMapperClass(WCMapper.class);              // the business code for the map stage
        job.setMapOutputKeyClass(Text.class);            // tell the MR framework the map output key type
        job.setMapOutputValueClass(IntWritable.class);   // tell the MR framework the map output value type
        // Receive the parameters of the main method (passed in when the job is
        // submitted, e.g. /words3.txt)
        FileInputFormat.addInputPath(job, new Path(args[0])); // the file path to process

        // Wire up the ReduceTask stage
        job.setReducerClass(WCReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Which HDFS directory (e.g. /out0502) the data is written to
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(2);

        // Submit the job
        boolean isOk = job.waitForCompletion(true);
        System.exit(isOk ? 0 : 1);
    }
}
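With HashPartitioner and setNumReduceTasks(2), each key is routed to reducer (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A minimal sketch of that routing rule, runnable without Hadoop on the classpath (String.hashCode is used here as a stand-in; Hadoop's Text computes its own byte-based hash, so the actual reducer assignments may differ):

```java
public class PartitionDemo {
    // Mirrors HashPartitioner.getPartition: mask off the sign bit, then take the
    // remainder modulo the number of reduce tasks.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String word : new String[]{"hadoop", "hdfs", "mapreduce", "yarn"}) {
            System.out.println(word + " -> reducer " + partition(word, 2));
        }
    }
}
```

The masking with Integer.MAX_VALUE matters because hashCode can be negative, and a negative remainder would be an invalid partition number.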
The Scala approach to word frequency statistics:
package cn.qmScala.day04Scala

/**
 * Created by Administrator on 2018/6/2.
 */
object Demo15WordCount {
  val acc = true

  def main(args: Array[String]) {
    val data = Array("Jin Tian Tian qi bu cuo xiang chu qu Wan")

    // Split into words using the flatMap method
    val words: Array[String] = data.flatMap(_.split(" "))

    // Turn each word into the form (word, 1)
    val word_one: Array[(String, Int)] = words.map((_, 1))

    // Group by word
    val groupByWord: Map[String, Array[(String, Int)]] = word_one.groupBy(_._1)

    // 1. Count the number of occurrences of each word
    val words_times: Map[String, Int] = groupByWord.mapValues(_.size)
    // for ((k, v) <- words_times) println(s"$k, $v")

    // 2. Sort by the number of occurrences. Put the pairs into a List and sort
    // with the List's methods; the negated key sorts in descending order
    // (a plain sortBy(_._2) would put the LEAST frequent words first)
    val wordsTimesList: List[(String, Int)] = words_times.toList
    val wordCountTimeSort: List[(String, Int)] = wordsTimesList.sortBy(-_._2)
    // for ((k, v) <- wordCountTimeSort) println(s"$k, $v")

    // 3. Take the three most frequent words, the Scala way
    val wordCountTop3 = wordCountTimeSort.take(3)
    for ((k, v) <- wordCountTop3) println(s"$k, $v")
  }
}
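For comparison with the Java API section above, the same flatMap / group / count / sort / take-3 pipeline can be sketched with Java 8 streams (the class name Top3Words is my own; among words tied at one occurrence the order is unspecified):

```java
import java.util.*;
import java.util.stream.*;

public class Top3Words {
    public static void main(String[] args) {
        String[] data = {"Jin Tian Tian qi bu cuo xiang chu qu Wan"};
        List<Map.Entry<String, Long>> top3 = Arrays.stream(data)
                .flatMap(line -> Arrays.stream(line.split(" ")))               // split into words
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())) // word -> count
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()) // most frequent first
                .limit(3)                                                      // top three
                .collect(Collectors.toList());
        top3.forEach(e -> System.out.println(e.getKey() + ", " + e.getValue()));
    }
}
```

With this sample line, "Tian, 2" prints first, followed by two of the words that occur once.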
awk, the Java API, MapReduce, and Scala: four ways to implement word frequency statistics.