Word Frequency Statistics
1. Requirement: read data from a specified directory and implement the word count function
2. Implementation Scheme:
The spout reads the files in the specified directory and emits each line to the bolt.
SplitBolt receives the lines from the spout, splits them into words, and sends the words on to CountBolt.
CountBolt receives each word sent by SplitBolt and performs the word count.
3. Topology Design:
DataSourceSpout + SplitBolt + CountBolt
The code is as follows:
package com.csylh;

import org.apache.commons.io.FileUtils;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.io.File;
import java.io.IOException;
import java.util.*;

/**
 * Description: use Storm to complete word frequency statistics.
 *
 * @author changge
 */
public class LocalWordCountStormTopology {

    /**
     * Reads the data and sends it to the bolt.
     */
    public static class DataSourceSpout extends BaseRichSpout {
        // Emitter used to send tuples downstream
        private SpoutOutputCollector collector;

        /**
         * Initialization method, called only once.
         * @param conf      configuration parameters
         * @param context   topology context
         * @param collector tuple emitter
         */
        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            // Assign the emitter defined above
            this.collector = collector;
        }

        /**
         * Produces the data. Business logic:
         * 1. read the files in the specified directory
         * 2. emit each line of every file
         */
        @Override
        public void nextTuple() {
            // Obtain all files with the given suffix under the directory (recursively)
            Collection<File> files = FileUtils.listFiles(new File("E:\\stormtext"), new String[]{"txt"}, true);
            // Traverse every file, since the directory may contain subdirectories
            for (File file : files) {
                try {
                    // Obtain the list of lines of each file
                    List<String> lines = FileUtils.readLines(file);
                    for (String line : lines) {
                        // Emit each line of data
                        this.collector.emit(new Values(line));
                    }
                    // TODO: after the data is processed, rename the file;
                    // otherwise it would be read over and over again
                    FileUtils.moveFile(file, new File(file.getAbsolutePath() + System.currentTimeMillis()));
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        /**
         * Declares the name of the output field.
         */
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    /**
     * Splits the lines sent by the spout.
     */
    public static class SplitBolt extends BaseRichBolt {
        private OutputCollector collector;

        /**
         * Initialization method, executed only once.
         * @param collector bolt emitter, used to send tuples to the next bolt
         */
        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        /**
         * Receives the data sent by the spout. Business logic:
         * each tuple is a single line, which is split into words here.
         */
        @Override
        public void execute(Tuple input) {
            String line = input.getStringByField("line");
            String[] words = line.split(",");
            for (String word : words) {
                // Emit each word
                this.collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    /**
     * Counts the words.
     */
    public static class CountBolt extends BaseRichBolt {

        /**
         * This is the last bolt and it prints its results directly,
         * so no collector needs to be kept here.
         */
        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        }

        Map<String, Integer> map = new HashMap<String, Integer>();

        /**
         * Business logic:
         * 1. get each word
         * 2. add up the count for each word
         * 3. print the result
         */
        @Override
        public void execute(Tuple input) {
            // Obtain each word
            String word = input.getStringByField("word");
            Integer count = map.get(word);
            if (count == null) {
                count = 0;
            }
            count++;
            // Store the updated count for the word
            map.put(word, count);
            // Print the running totals
            System.out.println("~~~~~~~~~~~~~~~~~~~~~~~");
            Set<Map.Entry<String, Integer>> entrySet = map.entrySet();
            for (Map.Entry<String, Integer> entry : entrySet) {
                System.out.println(entry);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

    /**
     * Main method.
     */
    public static void main(String[] args) {
        // Use TopologyBuilder to build the topology from the spout and bolts
        TopologyBuilder builder = new TopologyBuilder();
        // Register the spout and bolts and wire them together
        builder.setSpout("DataSourceSpout", new DataSourceSpout());
        builder.setBolt("SplitBolt", new SplitBolt()).shuffleGrouping("DataSourceSpout");
        builder.setBolt("CountBolt", new CountBolt()).shuffleGrouping("SplitBolt");
        // Create a local cluster and submit the topology to it
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("LocalWordCountStormTopology", new Config(), builder.createTopology());
    }
}
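A note on the grouping choice: the topology above wires CountBolt with shuffleGrouping, which only gives correct totals because CountBolt runs as a single instance. If CountBolt were given a parallelism hint greater than 1, shuffleGrouping would scatter occurrences of the same word across instances and split the counts. A minimal sketch of the variant (the parallelism value of 2 is an illustrative assumption, not part of the original example):

    // Assumed variant: route each word to a fixed CountBolt instance by hashing
    // the "word" field, so per-word counts stay correct with parallelism > 1.
    builder.setBolt("CountBolt", new CountBolt(), 2)        // 2 executors, illustrative
           .fieldsGrouping("SplitBolt", new Fields("word"));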
Summary: the steps for developing a Storm program are as follows:
Design the topology according to the requirements.
Generally, write the spout (the data generator) first; it emits the data to the bolts.
Next, write the bolts that process the data. When there are multiple bolts, every bolt except the last one keeps a collector so that it can emit to the next bolt.
The last bolt outputs the results directly, or writes them to HDFS or a relational database.
Finally, assemble the spout and bolts with TopologyBuilder and submit the topology; a cluster submission sketch follows this list.
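The example in this post submits to a LocalCluster for testing. A minimal sketch of submitting the same topology to a real cluster with StormSubmitter, assuming the jar is packaged and launched with the storm jar command (the class name ClusterSubmit and the worker count of 2 are illustrative assumptions):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class ClusterSubmit {
        public static void main(String[] args) throws Exception {
            // Wire the spout and bolts exactly as in LocalWordCountStormTopology#main
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("DataSourceSpout", new LocalWordCountStormTopology.DataSourceSpout());
            builder.setBolt("SplitBolt", new LocalWordCountStormTopology.SplitBolt())
                   .shuffleGrouping("DataSourceSpout");
            builder.setBolt("CountBolt", new LocalWordCountStormTopology.CountBolt())
                   .shuffleGrouping("SplitBolt");

            Config config = new Config();
            config.setNumWorkers(2); // illustrative worker-process count

            // Submits to the cluster configured in storm.yaml;
            // run via: storm jar <your-jar> ClusterSubmit
            StormSubmitter.submitTopology("LocalWordCountStormTopology", config, builder.createTopology());
        }
    }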