A more complex example
The DRPC example above was deliberately simple and only introduced the concept. Let's look at a more complex example that really needs Storm's parallel computing capability: computing the reach of a URL on Twitter.
First, what is reach? It is the number of unique followers of everyone who tweeted the URL. To compute the reach of a URL, you need to:
- Get all the people who tweeted the URL
- Get the followers of all those people
- Remove duplicates from that set of followers
- Count the followers left after deduplication; this number is the reach
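The four steps above can be sketched on a single machine as follows. This is only an illustration: the class name `ReachSketch` and the two in-memory maps are hypothetical stand-ins for the database lookups a real system would make, and nothing here uses Storm.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ReachSketch {
    // tweetersByUrl and followersByTweeter are hypothetical in-memory
    // replacements for the two database lookups.
    static int reach(String url,
                     Map<String, List<String>> tweetersByUrl,
                     Map<String, List<String>> followersByTweeter) {
        Set<String> uniqueFollowers = new HashSet<>();
        // Step 1: everyone who tweeted the URL.
        for (String tweeter : tweetersByUrl.getOrDefault(url, List.of())) {
            // Step 2: the followers of each of those people.
            // Step 3: adding to a set drops the duplicates.
            uniqueFollowers.addAll(followersByTweeter.getOrDefault(tweeter, List.of()));
        }
        // Step 4: the size of the deduplicated set is the reach.
        return uniqueFollowers.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> tweeters =
            Map.of("http://example.com/a", List.of("alice", "bob"));
        Map<String, List<String>> followers = Map.of(
            "alice", List.of("carol", "dave"),
            "bob", List.of("dave", "erin"));
        // carol, dave, erin: dave follows both tweeters but counts once.
        System.out.println(reach("http://example.com/a", tweeters, followers)); // prints 3
    }
}
```

Every lookup in the loop is sequential here, which is why the single-machine version becomes slow once the follower lists get large.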
A single reach computation can involve thousands of database calls and millions of follower records. It is a genuinely intensive computation. As you are about to see, implementing it on Storm is very simple. On a single machine, one reach computation can take several minutes; on a Storm cluster, it takes only a few seconds, even for the hardest URLs.
A sample reach topology can be found in storm-starter. The reach topology is defined as follows:
```java
LinearDRPCTopologyBuilder builder
    = new LinearDRPCTopologyBuilder("reach");
builder.addBolt(new GetTweeters(), 3);
builder.addBolt(new GetFollowers(), 12)
    .shuffleGrouping();
builder.addBolt(new PartialUniquer(), 6)
    .fieldsGrouping(new Fields("id", "follower"));
builder.addBolt(new CountAggregator(), 2)
    .fieldsGrouping(new Fields("id"));
```
The topology executes in four steps:
- GetTweeters gets all the users who tweeted the URL. It takes an input stream of [id, url] and emits an output stream of [id, tweeter]. Each url tuple maps to many tweeter tuples.
- GetFollowers gets the followers of those tweeters. It takes an input stream of [id, tweeter] and emits an output stream of [id, follower].
- PartialUniquer groups the follower stream by follower id. This routes the same follower to the same task, so each task receives a disjoint set of followers and can remove duplicates on its own. Its output stream is [id, count], that is, the number of unique followers seen on that task.
- Finally, CountAggregator receives the partial counts from all tasks and sums them to compute the reach.
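Why the partial counts can simply be added up may be easier to see in a small model. The sketch below is not Storm code: `PartialCountSketch` and `taskFor` are hypothetical names, and routing by hash code modulo the task count is an assumption standing in for a fields grouping on the follower id.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PartialCountSketch {
    // Assumed routing rule: equal follower ids always land on the same task.
    static int taskFor(String follower, int numTasks) {
        return Math.floorMod(follower.hashCode(), numTasks);
    }

    static int reach(List<String> followers, int numTasks) {
        // One deduplicating set per task, as each PartialUniquer task would keep.
        List<Set<String>> perTask = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            perTask.add(new HashSet<>());
        }
        for (String follower : followers) {
            perTask.get(taskFor(follower, numTasks)).add(follower);
        }
        // The aggregation step: because no follower can appear on two tasks,
        // the sum of the per-task sizes equals the global unique count.
        int total = 0;
        for (Set<String> seen : perTask) {
            total += seen.size();
        }
        return total;
    }
}
```

If followers were distributed randomly instead, the same follower could be counted on several tasks and the sum would overstate the reach.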
Let's take a look at the PartialUniquer implementation:
```java
public static class PartialUniquer
        implements IRichBolt, FinishedCallback {
    OutputCollector _collector;
    // One set of follower ids per request id.
    Map<Object, Set<String>> _sets
        = new HashMap<Object, Set<String>>();

    public void prepare(Map conf,
                        TopologyContext context,
                        OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        Object id = tuple.getValue(0);
        Set<String> curr = _sets.get(id);
        if (curr == null) {
            curr = new HashSet<String>();
            _sets.put(id, curr);
        }
        // The set silently drops followers it has already seen.
        curr.add(tuple.getString(1));
        _collector.ack(tuple);
    }

    public void cleanup() {
    }

    // Called once all tuples for this request id have arrived at this task.
    public void finishedId(Object id) {
        Set<String> curr = _sets.remove(id);
        int count;
        if (curr != null) {
            count = curr.size();
        } else {
            count = 0;
        }
        _collector.emit(new Values(id, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "partial-count"));
    }
}
```
When PartialUniquer receives a follower tuple in its execute method, it adds the follower to the Set kept for that request id.
PartialUniquer also implements the FinishedCallback interface, which tells LinearDRPCTopologyBuilder that it wants to be notified once it has received all the tuples for a given request id. The callback is the finishedId method. In this callback, PartialUniquer emits the number of unique followers it has seen on this task for that request id.
Behind this simple interface, CoordinatedBolt is used to detect when a bolt has received all the tuples for a request. CoordinatedBolt does this by using direct streams.
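One way such "all tuples received" detection can work is for each upstream task to report how many tuples it sent for a request, so that the receiver can compare that total with what it has actually seen. The sketch below models only this idea and should not be read as CoordinatedBolt's actual implementation: the class `CoordinationSketch`, its methods, and the `finished` list (which stands in for a call to finishedId) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CoordinationSketch {
    private static final class Progress {
        int received; // data tuples seen so far for this request
        int expected; // sum of the counts reported by upstream tasks
        int reports;  // number of upstream tasks that have reported
    }

    private final int upstreamTasks;
    private final Map<Object, Progress> progress = new HashMap<>();
    // Request ids for which the callback would have fired, in order.
    final List<Object> finished = new ArrayList<>();

    CoordinationSketch(int upstreamTasks) {
        this.upstreamTasks = upstreamTasks;
    }

    // A regular data tuple for the given request id arrived.
    void onTuple(Object id) {
        progress.computeIfAbsent(id, k -> new Progress()).received++;
        checkFinished(id);
    }

    // An upstream task reports how many tuples it sent to this task.
    void onUpstreamReport(Object id, int sentToThisTask) {
        Progress p = progress.computeIfAbsent(id, k -> new Progress());
        p.expected += sentToThisTask;
        p.reports++;
        checkFinished(id);
    }

    private void checkFinished(Object id) {
        Progress p = progress.get(id);
        // Done only when every upstream task has reported and nothing is in flight.
        if (p.reports == upstreamTasks && p.received == p.expected) {
            progress.remove(id);
            finished.add(id); // stands in for finishedId(id)
        }
    }
}
```

The check runs on both kinds of input because reports and data tuples can arrive in any order.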
The rest of the topology should be self-explanatory. As you can see, every step of the reach computation runs in parallel, and defining this DRPC topology was very simple.
Non-linear DRPC Topology
LinearDRPCTopologyBuilder can only build "linear" DRPC topologies, where the computation proceeds as a sequence of steps connected in series, one after another. It is not hard to imagine computations that need other shapes, such as steps connected in parallel (think back to the parallel circuits from junior high school physics). For now, to handle such a case you need to use CoordinatedBolt directly and wire everything up yourself. If you have a use case like this, please discuss it on the mailing list.
How LinearDRPCTopologyBuilder works
- DRPCSpout emits tuples of the form [args, return-info]. return-info contains the host address and port of the DRPC server and the request id of the current request.
- The DRPC topology consists of the following elements:
  - DRPCSpout
  - PrepareRequest (generates the request id, the return info, and the args)
  - CoordinatedBolt
  - JoinResult (combines the result with the return info)
  - ReturnResult (connects to the DRPC server and returns the result)
- LinearDRPCTopologyBuilder is a good example of a higher-level abstraction built from Storm's primitives.
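The JoinResult step in the list above pairs a computed result with the return info for the same request. A minimal model of that join is sketched below. It is an illustration only, not Storm's JoinResult: the class `JoinResultSketch`, its method names, and the use of plain strings for both halves are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class JoinResultSketch {
    // Halves that are still waiting for their counterpart, keyed by request id.
    private final Map<Object, String> results = new HashMap<>();
    private final Map<Object, String> returnInfos = new HashMap<>();

    // Each method returns {result, returnInfo} once both halves are present.
    Optional<String[]> onResult(Object id, String result) {
        results.put(id, result);
        return tryJoin(id);
    }

    Optional<String[]> onReturnInfo(Object id, String returnInfo) {
        returnInfos.put(id, returnInfo);
        return tryJoin(id);
    }

    private Optional<String[]> tryJoin(Object id) {
        if (results.containsKey(id) && returnInfos.containsKey(id)) {
            // Remove both halves so the pair is emitted only once.
            return Optional.of(new String[] { results.remove(id), returnInfos.remove(id) });
        }
        return Optional.empty();
    }
}
```

Buffering both sides matters because the return info and the result travel through different parts of the topology and may arrive in either order.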
Advanced features
- How to use KeyedFairBolt to process multiple requests simultaneously
- How to use CoordinatedBolt directly