Storm real-time analysis-Example 2

Source: Internet
Author: User
A more complex example

The previous DRPC example was deliberately simple and only meant to introduce the concept. Now let's look at a more complex example that genuinely needs the parallelism a Storm cluster provides. This example computes the reach of a URL on Twitter.

First, let's define reach. To compute the reach of a URL, we need to:

  • Get all the people who tweeted the URL
  • Get the followers of each of those people
  • Deduplicate the combined set of followers
  • Count the followers that remain after deduplication: this count is the reach
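The four steps above can be sketched on a single machine as follows. This is only an illustration: the class name `ReachSketch` and the two in-memory maps are hypothetical stand-ins for the database lookups a real system would make.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Single-machine reach over in-memory maps (hypothetical stand-ins for database calls). */
class ReachSketch {
    static int reach(String url,
                     Map<String, List<String>> tweetersByUrl,
                     Map<String, List<String>> followersByTweeter) {
        Set<String> uniqueFollowers = new HashSet<>();
        // Step 1: everyone who tweeted the URL.
        for (String tweeter : tweetersByUrl.getOrDefault(url, List.of())) {
            // Steps 2 and 3: fetch each tweeter's followers; the set drops duplicates.
            uniqueFollowers.addAll(followersByTweeter.getOrDefault(tweeter, List.of()));
        }
        // Step 4: the number of distinct followers is the reach.
        return uniqueFollowers.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> tweeters =
                Map.of("foo.com/blog/1", List.of("sally", "bob"));
        Map<String, List<String>> followers = Map.of(
                "sally", List.of("bob", "tim", "alice"),
                "bob", List.of("sally", "tim"));
        // tim follows both tweeters but is counted once.
        System.out.println(reach("foo.com/blog/1", tweeters, followers)); // prints 4
    }
}
```

Every lookup here runs sequentially in one loop, which is exactly the part a Storm topology spreads across many tasks.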

A single reach computation can involve thousands of database calls and millions of follower records, which makes it a very intensive computation. As you will see, implementing it on Storm is simple. On a single machine, one reach computation can take several minutes; on a Storm cluster, it takes only a few seconds, even for the hardest URLs.

A sample reach topology is available in storm-starter. The reach topology is defined as follows:

 

```java
LinearDRPCTopologyBuilder builder
        = new LinearDRPCTopologyBuilder("reach");
builder.addBolt(new GetTweeters(), 3);
builder.addBolt(new GetFollowers(), 12)
        .shuffleGrouping();
builder.addBolt(new PartialUniquer(), 6)
        .fieldsGrouping(new Fields("id", "follower"));
builder.addBolt(new CountAggregator(), 2)
        .fieldsGrouping(new Fields("id"));
```

The topology is executed in four steps:

  • GetTweeters gets all the users who tweeted the URL. It takes an input stream of [id, url] and emits an output stream of [id, tweeter]. Each url tuple maps to many tweeter tuples.
  • GetFollowers gets the followers of each tweeter. It takes an input stream of [id, tweeter] and emits an output stream of [id, follower].
  • PartialUniquer groups the followers stream by follower id, so the same follower always goes to the same task. Each task therefore receives a disjoint set of followers, which is what makes deduplication possible. Its output stream is [id, count], where count is the number of distinct followers seen on that task.
  • Finally, CountAggregator receives the partial counts from all the tasks and sums them to compute the reach.
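To see why summing the partial counts gives the right answer, here is a minimal plain-Java sketch of the PartialUniquer and CountAggregator steps. It does not use Storm: the class `PartitionedUniqueCount` is hypothetical, hashing the follower name stands in for the fields grouping, and the request id is left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simulates fields grouping plus per-task deduplication without Storm. */
class PartitionedUniqueCount {
    static int countUnique(List<String> followers, int numTasks) {
        List<Set<String>> perTask = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            perTask.add(new HashSet<>());
        }
        for (String follower : followers) {
            // Same follower -> same hash -> same task, so duplicates always meet.
            int task = Math.floorMod(follower.hashCode(), numTasks);
            perTask.get(task).add(follower);
        }
        int total = 0;
        for (Set<String> partial : perTask) {
            // Each task's set size is the partial count it would emit;
            // the sets are disjoint, so the sizes can simply be added.
            total += partial.size();
        }
        return total;
    }
}
```

Because no follower can land on two tasks, the result is the same for any number of tasks, which is what lets the deduplication step scale out.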

Let's take a look at the PartialUniquer implementation:

 

```java
// Nested class inside the topology class; Storm and java.util imports are omitted.
public static class PartialUniquer
        implements IRichBolt, FinishedCallback {
    OutputCollector _collector;
    // Distinct followers seen so far on this task, keyed by request id.
    Map<Object, Set<String>> _sets
            = new HashMap<Object, Set<String>>();

    public void prepare(Map conf,
                        TopologyContext context,
                        OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        Object id = tuple.getValue(0);
        Set<String> curr = _sets.get(id);
        if (curr == null) {
            curr = new HashSet<String>();
            _sets.put(id, curr);
        }
        curr.add(tuple.getString(1));
        _collector.ack(tuple);
    }

    public void cleanup() {
    }

    // Called once all tuples for this request id have arrived at this task.
    public void finishedId(Object id) {
        Set<String> curr = _sets.remove(id);
        int count;
        if (curr != null) {
            count = curr.size();
        } else {
            count = 0;
        }
        _collector.emit(new Values(id, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "partial-count"));
    }
}
```

When PartialUniquer receives a follower tuple in its execute method, it adds the follower to the Set kept for that request id.

 

PartialUniquer also implements the FinishedCallback interface. This tells LinearDRPCTopologyBuilder that the bolt wants to be notified once it has received all the tuples for a given request id. The callback is the finishedId method. In this callback, PartialUniquer emits the number of distinct followers it has seen on this task for that request id.
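The per-request lifecycle can be exercised without a cluster. The sketch below is a plain-Java stand-in, not a Storm bolt: `PartialUniquerState` is a hypothetical class that keeps only the bookkeeping and returns the count where the real bolt would emit it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Storm-free stand-in for the bookkeeping done by PartialUniquer. */
class PartialUniquerState {
    private final Map<Object, Set<String>> sets = new HashMap<>();

    /** Mirrors execute: remember the follower under its request id. */
    void execute(Object requestId, String follower) {
        sets.computeIfAbsent(requestId, k -> new HashSet<>()).add(follower);
    }

    /** Mirrors finishedId: drop the request's state and return its partial count. */
    int finishedId(Object requestId) {
        Set<String> curr = sets.remove(requestId);
        return curr == null ? 0 : curr.size();
    }
}
```

Tuples from several requests can interleave on one task. Keying the sets by request id keeps their counts separate, and removing the set in finishedId frees the memory once a request completes.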

Behind this simple interface, CoordinatedBolt is used to detect when a bolt has received all the tuples for a request. CoordinatedBolt does this by using direct streams.

The rest of the topology is straightforward. As you can see, every step of the reach computation runs in parallel, and defining the DRPC topology was very simple.
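The text says only that CoordinatedBolt relies on direct streams. One plausible way such detection could work is for every upstream task to report how many tuples it sent to a given downstream task, so the downstream task can compare that total with what it has actually received. The sketch below illustrates that counting idea under this assumption; `CompletionTracker` and its methods are hypothetical and are not Storm's API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical completion detector for one downstream task. */
class CompletionTracker {
    private final int numUpstreamTasks;
    private final Map<Object, Integer> received = new HashMap<>();
    private final Map<Object, Integer> expected = new HashMap<>();
    private final Map<Object, Integer> reports = new HashMap<>();

    CompletionTracker(int numUpstreamTasks) {
        this.numUpstreamTasks = numUpstreamTasks;
    }

    /** Records one data tuple; returns true if the request is now complete. */
    boolean onTuple(Object requestId) {
        received.merge(requestId, 1, Integer::sum);
        return isComplete(requestId);
    }

    /** Records an upstream task's report of how many tuples it sent to this task. */
    boolean onUpstreamReport(Object requestId, int tuplesSent) {
        expected.merge(requestId, tuplesSent, Integer::sum);
        reports.merge(requestId, 1, Integer::sum);
        return isComplete(requestId);
    }

    /** Complete once every upstream task has reported and all announced tuples arrived. */
    boolean isComplete(Object requestId) {
        int reportCount = reports.getOrDefault(requestId, 0);
        int want = expected.getOrDefault(requestId, 0);
        int got = received.getOrDefault(requestId, 0);
        return reportCount == numUpstreamTasks && got == want;
    }
}
```

In this scheme a report addressed to one specific downstream task is the kind of message a direct stream is suited for, since the sender chooses the receiving task.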

Non-linear DRPC Topology

LinearDRPCTopologyBuilder can only build "linear" DRPC topologies, where the computation is a sequence of steps chained one after another. It is not hard to imagine other shapes, such as steps that run in parallel branches and then join (think of the parallel circuits from junior high school physics). For now, to handle such non-linear cases you have to use CoordinatedBolt directly and manage everything yourself. If you have a use case like this, please discuss it on the mailing list.

How LinearDRPCTopologyBuilder works
  • DRPCSpout emits tuples of the form [args, return-info]. return-info holds the host and port of the DRPC server, along with the request id of the current request.
  • The DRPC topology consists of the following elements:
    • DRPCSpout
    • PrepareRequest (generates the request id, return info, and args)
    • CoordinatedBolt
    • JoinResult: joins the result with the return info
    • ReturnResult: connects to the DRPC server and returns the result
  • LinearDRPCTopologyBuilder is a good example of building a high-level abstraction out of Storm primitives.
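The role of the request id in this pipeline can be shown with a small Storm-free sketch. It assumes the return info is set aside under the request id while the computation runs and is joined back to the result at the end. `LinearDrpcFlowSketch`, its method names, and the string-based return info are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical single-process walk through the prepare, compute, and join steps. */
class LinearDrpcFlowSketch {
    private long nextRequestId = 0;
    private final Map<Long, String> returnInfoById = new HashMap<>();

    /** PrepareRequest step: assign a request id and set the return info aside. */
    long prepareRequest(String returnInfo) {
        long id = nextRequestId++;
        returnInfoById.put(id, returnInfo);
        return id;
    }

    /** JoinResult step: pair the result with the return info saved for the same id. */
    String[] joinResult(long requestId, String result) {
        String returnInfo = returnInfoById.remove(requestId);
        return new String[] {result, returnInfo};
    }

    /** Runs one request end to end; the computation stands in for the user's bolts. */
    String[] handle(String args, String returnInfo, Function<String, String> computation) {
        long id = prepareRequest(returnInfo);
        String result = computation.apply(args);
        return joinResult(id, result);
    }
}
```

The computation never sees the return info. Only the request id travels with the data, which is why the user's bolts in the reach topology deal with [id, ...] tuples and nothing else.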
Advanced features
  • How to use KeyedFairBolt to process multiple requests at the same time
  • How to use CoordinatedBolt directly
