WordCount is the most commonly used example in distributed computing frameworks such as Hadoop, Storm, and iveely computing. Once you understand how WordCount works on iveely computing, it is easy to write a new distributed program. The previous article covered how to deploy iveely computing and submit tasks; now we dive into the WordCount code.
First, the code structure
Figure 3-1
As figure 3-1 shows, class WordCount contains two nested classes, WordInput and WordOutput, and one main method. WordCount.java is a topology. A topology must contain at least one input and one output (both are required; without them the topology is meaningless), plus the main function, which remains the entry point of the topology.
Now the question is: what exactly are input and output? And what is a topology?
Each topology is a complete chain of tasks that can contain multiple inputs and multiple outputs. An input's data can only be passed to one or more outputs, and an output can only pass data on to one or more other outputs, thus forming a complete topological structure.
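The structure described above can be sketched as a small directed graph. This is only an illustration: the class TopologySketch, its Node type, and the node names are hypothetical and are not part of iveely computing. It shows the rule that data may flow from an input to outputs and from an output to further outputs, but never back into an input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a topology graph; none of these names come from iveely computing.
final class TopologySketch {

    // A node is either an input (data source) or an output (processing unit).
    static final class Node {
        final String name;
        final boolean isInput;
        final List<Node> downstream = new ArrayList<>();

        Node(String name, boolean isInput) {
            this.name = name;
            this.isInput = isInput;
        }

        // Connects this node to a downstream node; only outputs may receive data.
        void connectTo(Node target) {
            if (target.isInput) {
                throw new IllegalArgumentException("data can only flow to an output: " + target.name);
            }
            downstream.add(target);
        }
    }

    // Collects the names of every node reachable from the given node, depth first.
    static List<String> reachableFrom(Node start) {
        List<String> names = new ArrayList<>();
        collect(start, names);
        return names;
    }

    private static void collect(Node node, List<String> names) {
        if (names.contains(node.name)) {
            return; // already visited
        }
        names.add(node.name);
        for (Node next : node.downstream) {
            collect(next, names);
        }
    }

    public static void main(String[] args) {
        Node input = new Node("wordInput", true);
        Node count = new Node("countOutput", false);
        Node store = new Node("storeOutput", false);
        input.connectTo(count); // input -> output
        count.connectTo(store); // output -> output
        System.out.println(reachableFrom(input)); // prints [wordInput, countOutput, storeOutput]
    }
}
```

Trying to connect any node to an input throws an exception in this sketch, which mirrors the one-way flow of a topology.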
Second, input in depth
An input is the source of the data. Let's look at WordInput to see how data is generated and passed to an output.
public static class WordInput extends IInput {

    /**
     * Output data to collector.
     */
    private StreamChannel _channel;

    /**
     * All sample words.
     */
    private final String[] _words = new String[] {
        "Welcome", "iveely", "Computing", "0.9.0", "Build",
        "by", "Liufanping", "Thanks", "github.com" };

    private int _index;

    @Override
    public void start(HashMap<String, Object> conf, StreamChannel channel) {
        // Here, the channel must be initialized.
        _channel = channel;
        _index = _words.length - 1;
    }

    @Override
    public void declareOutputFields(FieldsDeclarer declarer) {
        declarer.declare(new String[] { "word" }, new Integer[] { 0 });
    }

    @Override
    public void nextTuple() {
        if (_index < 0) {
            _channel.emitEnd();
        } else {
            for (int i = 0; i < 100; i++) {
                _channel.emit(_words[_index]);
            }
            _index--;
        }
    }

    @Override
    public void end() {
        System.out.println(getName() + "finished.");
    }

    @Override
    public void toOutput() {
        _channel.addOutputTo(new WordOutput());
    }
}
Function explanation:
start function: Called before the input begins executing. It performs initialization and other setup work, similar to a constructor. When the input emits data, the channel must be initialized here.
declareOutputFields function: Declares the fields of the data that will be output.
nextTuple function: Called repeatedly to produce data; it uses channel.emit to send data to an output.
end function: Runs after the input has finished executing, similar to a destructor.
toOutput function: Specifies which output the input's data is sent to.
There are several points to note in the above code:
2.1 WordInput must extend IInput.
2.2 In an input, the channel must be initialized in start, because an input always produces data.
2.3 In an input, the data flow must be specified in the toOutput function.
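To make the order of the callbacks concrete, here is a minimal, self-contained sketch of how an input might be driven. Everything in it is a hypothetical stand-in (Channel, Input, WordsInput, and the run loop); the real scheduling inside iveely computing is not shown in this article and may differ. The sketch assumes start is called once, nextTuple is called until the end signal is emitted, and end is called once afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the iveely computing classes.
final class InputLifecycleSketch {

    // Minimal channel that records emitted values and an end-of-stream flag.
    static final class Channel {
        final List<Object> emitted = new ArrayList<>();
        boolean ended;

        void emit(Object value) { emitted.add(value); }

        void emitEnd() { ended = true; }
    }

    // Mirrors the shape of the input callbacks described above.
    interface Input {
        void start(Channel channel);
        void nextTuple();
        void end();
    }

    // Emits each word once, last word first, then signals the end of the stream.
    static final class WordsInput implements Input {
        private final String[] words;
        private Channel channel;
        private int index;
        boolean finished;

        WordsInput(String... words) { this.words = words; }

        @Override
        public void start(Channel channel) {
            this.channel = channel; // the channel must be set before emitting
            this.index = words.length - 1;
        }

        @Override
        public void nextTuple() {
            if (index < 0) {
                channel.emitEnd();
            } else {
                channel.emit(words[index--]);
            }
        }

        @Override
        public void end() { finished = true; }
    }

    // Assumed driver loop: start once, call nextTuple until the end signal, then end once.
    static Channel run(Input input) {
        Channel channel = new Channel();
        input.start(channel);
        while (!channel.ended) {
            input.nextTuple();
        }
        input.end();
        return channel;
    }

    public static void main(String[] args) {
        Channel channel = run(new WordsInput("Welcome", "iveely", "Computing"));
        System.out.println(channel.emitted); // prints [Computing, iveely, Welcome]
    }
}
```

If start did not store the channel, the first call to nextTuple would fail with a NullPointerException, which is why point 2.2 insists on initializing it there.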
Third, output in depth
An output is a unit that processes data; it can also generate new data.
public static class WordOutput extends IOutput {

    private TreeMap<String, Integer> _map;

    @Override
    public void start(HashMap<String, Object> conf, StreamChannel channel) {
        _map = new TreeMap<>();
    }

    @Override
    public void declareOutputFields(FieldsDeclarer declarer) {
        declarer.declare(new String[] { "word", "totalcount" }, null);
    }

    @Override
    public void execute(Tuple tuple) {
        String word = tuple.get(0).toString();
        if (_map.containsKey(word)) {
            int currentCount = _map.get(word);
            _map.put(word, currentCount + 1);
        } else {
            _map.put(word, 1);
        }
    }

    @Override
    public void end() {
        // Output map to database or print.
        Iterator<String> it = _map.keySet().iterator();
        while (it.hasNext()) {
            String key = it.next();
            int value = _map.get(key);
            System.out.println(getName() + ":" + key + "," + value);
        }
    }

    @Override
    public void toOutput() {
    }
}
Compared with an input, an output has no nextTuple function; it has an execute function instead. nextTuple generates data, while execute processes it. If the data processed by execute must also be passed on to another output, emit it inside execute with channel.emit and specify the data flow in toOutput.
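The idea of one output passing data on to another can be sketched as follows. All names here (OutputChainSketch, Channel, LowerCaseOutput, CountOutput) are hypothetical stand-ins, not iveely computing classes, and the channel simply calls the next stage directly instead of going over the network. The first stage emits from inside execute; the final stage only counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration only; not the iveely computing classes.
final class OutputChainSketch {

    // Mirrors the execute callback of an output.
    interface Output {
        void execute(Object value);
    }

    // Minimal channel that forwards every emitted value to the registered outputs.
    static final class Channel {
        private final List<Output> targets = new ArrayList<>();

        void addOutputTo(Output target) { targets.add(target); }

        void emit(Object value) {
            for (Output target : targets) {
                target.execute(value);
            }
        }
    }

    // First stage: processes data, then passes the result on through its channel.
    static final class LowerCaseOutput implements Output {
        private final Channel channel;

        LowerCaseOutput(Channel channel) { this.channel = channel; }

        @Override
        public void execute(Object value) {
            channel.emit(value.toString().toLowerCase(Locale.ROOT));
        }
    }

    // Final stage: counts words and emits nothing further.
    static final class CountOutput implements Output {
        final Map<String, Integer> counts = new TreeMap<>();

        @Override
        public void execute(Object value) {
            counts.merge(value.toString(), 1, Integer::sum);
        }
    }

    // Wires LowerCaseOutput -> CountOutput and feeds the given words through the chain.
    static Map<String, Integer> countIgnoringCase(String... words) {
        CountOutput counter = new CountOutput();
        Channel channel = new Channel();
        channel.addOutputTo(counter);
        Output first = new LowerCaseOutput(channel);
        for (String word : words) {
            first.execute(word);
        }
        return counter.counts;
    }

    public static void main(String[] args) {
        System.out.println(countIgnoringCase("Iveely", "iveely", "Build")); // prints {build=1, iveely=2}
    }
}
```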
There are several points to note here:
3.1 If an output needs to pass data on, you must initialize the channel in start.
3.2 If the current output receives data from different inputs and the data formats are not uniform, you must distinguish the formats yourself, for example by passing an array whose first element is an int that identifies the data format.
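Point 3.2 can be illustrated with a small sketch of a tagged array. The tag values and the class TaggedTupleSketch are assumptions made up for this example; iveely computing does not prescribe them. The first element of each array is an int that tells the receiving output how to read the rest.

```java
// Hypothetical convention for illustration; the tag values are assumptions.
final class TaggedTupleSketch {

    static final int FORMAT_WORD = 0;       // layout: { tag, String word }
    static final int FORMAT_WORD_COUNT = 1; // layout: { tag, String word, Integer count }

    // Returns how many occurrences the tuple contributes, based on its leading tag.
    static int occurrences(Object[] tuple) {
        int format = (Integer) tuple[0];
        switch (format) {
            case FORMAT_WORD:
                return 1;
            case FORMAT_WORD_COUNT:
                return (Integer) tuple[2];
            default:
                throw new IllegalArgumentException("unknown data format: " + format);
        }
    }

    public static void main(String[] args) {
        Object[] single = { FORMAT_WORD, "iveely" };
        Object[] counted = { FORMAT_WORD_COUNT, "iveely", 5 };
        System.out.println(occurrences(single) + occurrences(counted)); // prints 6
    }
}
```

Rejecting unknown tags explicitly makes a mismatch between an input and an output show up immediately instead of producing wrong counts.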
Fourth, the main function
The main function is still the execution entry point of the topology. The difference is that it supports two execution modes, local and remote. Local mode is used for debugging.
public static void main(String[] args) {
    // true = local mode, false = remote mode.
    TopologyBuilder builder = new TopologyBuilder(true, WordCount.class.getName(), "WordCount");
    builder.setInput(new WordInput(), 1);
    builder.setOutput(new WordOutput(), 4);
    builder.setSlave(2);
    TopologySubmitter.submit(builder, args);
}
The main function mainly does the following work:
4.1 Create a new TopologyBuilder object. The first constructor parameter specifies whether to run in local mode (true) or remote mode (false), the second specifies the name of the class to execute, and the third specifies the name of the current topology.
4.2 Set the input and the output, and specify how many instances (threads) of each to run.
4.3 Specify the number of nodes (processes) to run on.
4.4 Submit the task with TopologySubmitter.
4.5 Note: be sure to change the first parameter of TopologyBuilder to remote mode (false) when building a jar for submission to the server.
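One way to reduce the risk of forgetting the switch in 4.5 is to read the mode from a JVM system property instead of hard-coding it. This is only a suggestion, not something iveely computing provides: the class RunModeSketch and the property name wordcount.local are assumptions. The returned value would be passed as the first TopologyBuilder constructor argument.

```java
// Hypothetical helper; the property name "wordcount.local" is an assumption.
final class RunModeSketch {

    // Returns true (local mode) unless -Dwordcount.local=false is passed to the JVM.
    static boolean isLocalMode() {
        return Boolean.parseBoolean(System.getProperty("wordcount.local", "true"));
    }

    public static void main(String[] args) {
        // The result would be used as the first TopologyBuilder constructor argument.
        System.out.println("local mode: " + isLocalMode());
    }
}
```

With this approach the same jar runs locally by default and in remote mode when started with -Dwordcount.local=false.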
Open source distributed real-time computing engine iveely computing: WordCount in detail (3)