Do you know real-time computing?

Source: Internet
Author: User

What is real-time computing?

Take a look at the following figure:

Take the statistics of hot-selling products as an example, and look at the traditional way of computing them:

1. Clean user behavior, logs, and similar information, and store it in a database.
2. Store order information in the database.
3. Use triggers, coroutines, or similar mechanisms to build local indexes, or remote standalone indexes.
4. Join the order, order detail, user, and product tables; aggregate the hot-selling products within the last 20 minutes; and return the top 10.
5. Display the results on the web or in an app.
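Step 4 above can be sketched in miniature with plain Java. This is a hypothetical illustration, not production code: the `OrderItem` record and the in-memory list stand in for sharded database tables that a real system would join with SQL.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A toy version of step 4: filter order items to the last 20 minutes,
// count sales per product, and keep the top sellers.
public class HotProductsBatchSketch {
    // Hypothetical stand-in for a row of the order-detail table.
    record OrderItem(long productId, long epochSecond) {}

    public static List<Map.Entry<Long, Long>> hotProducts(List<OrderItem> items,
                                                          long nowEpochSecond, int topN) {
        long cutoff = nowEpochSecond - 20 * 60; // the 20-minute window
        Map<Long, Long> counts = items.stream()
                .filter(i -> i.epochSecond() >= cutoff) // keep only the last 20 minutes
                .collect(Collectors.groupingBy(OrderItem::productId, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<Long, Long>comparingByValue().reversed())
                .limit(topN)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<OrderItem> items = List.of(
                new OrderItem(1, 1500), new OrderItem(1, 1600),
                new OrderItem(2, 1600), new OrderItem(3, 10));
        // With "now" = 2000, the item at t=10 falls outside the window.
        System.out.println(hotProducts(items, 2000, 2));
    }
}
```

This is trivial on one machine; the problems described next appear once the data no longer fits on one machine and the query must be recomputed every few minutes.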

This is a hypothetical scenario, but if you have dealt with anything similar, you will recognize some of its problems and difficulties:

  1. Horizontal scaling issues (scale-out)

    Obviously, for an e-commerce website of any real scale, the volume of data is very large. Because order processing involves transactions, it is hard to abandon the transactional guarantees of a relational database and migrate to a NoSQL database with better scale-out capability.

    So the usual answer is sharding. Historical data, fortunately, can be archived by date, and results can be cached using batch-based offline computation.
    However, the requirement here is "within 20 minutes", which makes that difficult.

  2. Performance issues

    This problem goes hand in hand with scale-out: once we shard, the tables are scattered across nodes, so we must fetch data multiple times and perform the aggregate computation in the business layer.

    The question is, how many times do we need to put in the 20-minute time requirement?
    What about 10 minutes?
    What about 5 minutes?
    What about real time?
    Moreover, the business layer itself runs into the limit of single-node computing capacity and must scale horizontally, which brings consistency problems of its own.
    At this point, everything starts to look complicated.

  3. Business expansion issues

    Suppose we want not only hot-product statistics but also ad-click statistics, or we want to infer user characteristics from access behavior on the fly and adjust what each user sees to better match their latent needs, and so on. The business layer then becomes ever more complex.

Maybe you have a better idea, but what we really need is a new way of thinking:

What's happening in this world is real time.
So we need a model for real-time computing, not a batch model.
The model we need must be able to handle very large volumes of data, so it must have good scale-out ability; preferably, we should not have to worry much about consistency or replication.

This model is a real-time computing model, or what can be regarded as a stream model.

Now, assuming we have such a model, we can happily design new business scenarios:

    1. Which microblog post is forwarded the most?
    2. What are the best-selling items?
    3. What hot topics is everyone searching for?
    4. Which of our ads, in which position, is clicked the most?

Or, we can ask:

What's going on in this world?

What's the hottest microblogging topic?

Let's use a simple sliding-window counting problem to lift the veil on so-called real-time computing.

Assuming that our business requirements are:

Count the 10 hottest Weibo topics within the last 20 minutes.

To solve this problem, we need to consider:

    1. Data source
      Here, let's say our data comes from a feed of tweets (a "tweets" topic).

    2. Problem modeling
      We treat a topic as text wrapped in "#" marks, and the hottest topic is the one that appears more frequently than any other.
      For example: @foreach_break: Hello, #世界#, I love you, #微博#.
      Here "世界" (world) and "微博" (Weibo) are topics.

    3. Compute engine
      We use Storm.

    4. Define Time
How do I define time?

Defining time is a difficult task, and how to do it depends on how much precision is required.
In practice, we generally use a tick to represent the passing of time.
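Before looking at Storm's implementation, the idea of a tick can be sketched in plain Java (a hypothetical illustration, not Storm code): a timer thread periodically injects a "tick" event into the processing loop, and the consumer reacts to it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A minimal sketch of the tick idea: a timer fires at a fixed period,
// much as Storm's user timer publishes a tick tuple into the executor's
// receive queue. Here the "event" is just counting down a latch.
public class TickSketch {
    // Fires `ticks` tick events, one every `periodMillis` ms, and returns
    // true if they all fired within a generous timeout.
    public static boolean runTicks(int ticks, long periodMillis) {
        CountDownLatch seen = new CountDownLatch(ticks);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // In Storm, this callback would publish a tick tuple instead.
        timer.scheduleAtFixedRate(seen::countDown, periodMillis, periodMillis,
                TimeUnit.MILLISECONDS);
        try {
            return seen.await(ticks * periodMillis * 5, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("3 ticks fired: " + runTicks(3, 50));
    }
}
```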

In Storm's infrastructure, the executor sets up a timer during its startup phase to trigger periodic "tick" events.
As shown below:

(defn setup-ticks! [worker executor-data]
  (let [storm-conf (:storm-conf executor-data)
        tick-time-secs (storm-conf TOPOLOGY-TICK-TUPLE-FREQ-SECS)
        receive-queue (:receive-queue executor-data)
        context (:worker-context executor-data)]
    (when tick-time-secs
      (if (or (system-id? (:component-id executor-data))
              (and (= false (storm-conf TOPOLOGY-ENABLE-MESSAGE-TIMEOUTS))
                   (= :spout (:type executor-data))))
        (log-message "Timeouts disabled for executor "
                     (:component-id executor-data) ":" (:executor-id executor-data))
        (schedule-recurring
          (:user-timer worker)
          tick-time-secs
          tick-time-secs
          (fn []
            (disruptor/publish
              receive-queue
              [[nil (TupleImpl. context
                                [tick-time-secs]
                                Constants/SYSTEM_TASK_ID
                                Constants/SYSTEM_TICK_STREAM_ID)]])))))))

The relationships among these infrastructure components were analyzed in detail in a previous post; readers unfamiliar with them can refer to that article.

At fixed intervals an event is triggered, and when a downstream bolt on the stream receives such an event, it can choose to increment its counts, or to aggregate its results and emit them into the stream.

How does a bolt determine that a received tuple represents a "tick"?
The executor thread that manages the bolt calls the bolt's execute method when consuming messages from its subscribed queue, so the check can be made inside execute:

public static boolean isTick(Tuple tuple) {
    return tuple != null
           && Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
           && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
}

Combining this with the setup-ticks! Clojure code above, we can see that SYSTEM_TICK_STREAM_ID is passed to the tuple through a constructor parameter in the timer's callback. Then where does SYSTEM_COMPONENT_ID come from?
As you can see in the following code, SYSTEM_TASK_ID is also passed to the tuple:

;; Note SYSTEM_TASK_ID and SYSTEM_TICK_STREAM_ID
(TupleImpl. context [tick-time-secs] Constants/SYSTEM_TASK_ID Constants/SYSTEM_TICK_STREAM_ID)

Then use the following code to get SYSTEM_COMPONENT_ID:

    public String getComponentId(int taskId) {
        if (taskId == Constants.SYSTEM_TASK_ID) {
            return Constants.SYSTEM_COMPONENT_ID;
        } else {
            return _taskToComponent.get(taskId);
        }
    }
sliding window

With the infrastructure above, we also need some means to complete the "engineering" to turn the vision into reality.

Here, let's look at the sliding window design of Michael G. Noll.


Note: Images from http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/

topology
String spoutId = "wordGenerator";
String counterId = "counter";
String intermediateRankerId = "intermediateRanker";
String totalRankerId = "finalRanker";
// Here, suppose TestWordSpout is the source of the topic tuples we emit
builder.setSpout(spoutId, new TestWordSpout(), 5);
// RollingCountBolt's time window is 9 seconds; it emits its statistics downstream every 3 seconds
builder.setBolt(counterId, new RollingCountBolt(9, 3), 4).fieldsGrouping(spoutId, new Fields("word"));
// IntermediateRankingsBolt performs partial aggregation and computes the top-N topics
builder.setBolt(intermediateRankerId, new IntermediateRankingsBolt(TOP_N), 4).fieldsGrouping(counterId, new Fields("obj"));
// TotalRankingsBolt performs the final aggregation and computes the overall top-N topics
builder.setBolt(totalRankerId, new TotalRankingsBolt(TOP_N)).globalGrouping(intermediateRankerId);

The above topology is designed as follows:

Combine aggregation calculations with time

We described the tick event above; its callback invokes the bolt's execute method, where the following can be done:

RollingCountBolt:

  @Override
  public void execute(Tuple tuple) {
    if (TupleUtils.isTick(tuple)) {
      LOG.debug("Received tick tuple, triggering emit of current window counts");
      // A tick has arrived: emit the statistics for the current window and let the window slide
      emitCurrentWindowCounts();
    } else {
      // A regular tuple: count the topic
      countObjAndAck(tuple);
    }
  }

  // obj is the topic; increment its count.
  // Note that the rate here largely depends on the speed of the stream,
  // which may be millions of tuples per second or only dozens.
  // Running out of memory? Bolts can be scaled out.
  private void countObjAndAck(Tuple tuple) {
    Object obj = tuple.getValue(0);
    counter.incrementCount(obj);
    collector.ack(tuple);
  }

  // Emit the statistics downstream
  private void emitCurrentWindowCounts() {
    Map<Object, Long> counts = counter.getCountsThenAdvanceWindow();
    int actualWindowLengthInSeconds = lastModifiedTracker.secondsSinceOldestModification();
    lastModifiedTracker.markAsModified();
    if (actualWindowLengthInSeconds != windowLengthInSeconds) {
      LOG.warn(String.format(WINDOW_LENGTH_WARNING_TEMPLATE, actualWindowLengthInSeconds, windowLengthInSeconds));
    }
    emit(counts, actualWindowLengthInSeconds);
  }

The code above may seem a little abstract; the following diagram shows how the window slides each time a tick arrives:


Note: Images from http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
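The slot-based sliding window behind RollingCountBolt can be sketched in a few lines of plain Java. This is a simplified illustration of Michael G. Noll's design, not the actual Storm example code: the window is divided into equal slots, counting always goes into the head slot, and each tick sums all slots and then wipes the slot that just expired.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of a slot-based sliding window counter.
// With RollingCountBolt(9, 3), the 9-second window is split into
// 9 / 3 = 3 slots, and a tick arrives every 3 seconds.
public class SlidingWindowCounterSketch {
    private final int numSlots;
    private int headSlot = 0;
    private final Map<Object, long[]> slotCounts = new HashMap<>();

    public SlidingWindowCounterSketch(int numSlots) {
        this.numSlots = numSlots;
    }

    // Count one occurrence of obj in the current (head) slot.
    public void incrementCount(Object obj) {
        slotCounts.computeIfAbsent(obj, k -> new long[numSlots])[headSlot]++;
    }

    // Sum every slot for each object, then advance the window:
    // the next slot becomes the head and its expired contents are wiped.
    public Map<Object, Long> getCountsThenAdvanceWindow() {
        Map<Object, Long> totals = new HashMap<>();
        for (Map.Entry<Object, long[]> e : slotCounts.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) sum += c;
            totals.put(e.getKey(), sum);
        }
        headSlot = (headSlot + 1) % numSlots;
        for (long[] counts : slotCounts.values()) counts[headSlot] = 0;
        return totals;
    }

    public static void main(String[] args) {
        SlidingWindowCounterSketch c = new SlidingWindowCounterSketch(3);
        c.incrementCount("#world#");
        c.incrementCount("#world#");
        System.out.println(c.getCountsThenAdvanceWindow()); // window total: 2
        c.incrementCount("#world#");
        System.out.println(c.getCountsThenAdvanceWindow()); // window total: 3
        System.out.println(c.getCountsThenAdvanceWindow()); // window total: 3
        // The first slot's 2 counts have now slid out of the window:
        System.out.println(c.getCountsThenAdvanceWindow()); // window total: 1
    }
}
```

Note how old counts leave the window automatically as it slides, which is exactly the behavior the diagram above illustrates.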

IntermediateRankingsBolt & TotalRankingsBolt:

  public final void execute(Tuple tuple, BasicOutputCollector collector) {
    if (TupleUtils.isTick(tuple)) {
      getLogger().debug("Received tick tuple, triggering emit of current rankings");
      // Emit the aggregated and sorted results downstream
      emitRankings(collector);
    } else {
      // Aggregate and sort
      updateRankingsWithTuple(tuple);
    }
  }

IntermediateRankingsBolt and TotalRankingsBolt differ slightly in how they aggregate and sort:

IntermediateRankingsBolt's aggregation-and-sorting method:

  // IntermediateRankingsBolt's aggregation-and-sorting method:
  @Override
  void updateRankingsWithTuple(Tuple tuple) {
    // Extract the topic and the number of times it has appeared
    Rankable rankable = RankableObjectWithFields.from(tuple);
    // Aggregate the counts for this topic, then re-sort all topics
    super.getRankings().updateWith(rankable);
  }

TotalRankingsBolt's aggregation-and-sorting method:

  // TotalRankingsBolt's aggregation-and-sorting method
  @Override
  void updateRankingsWithTuple(Tuple tuple) {
    // Extract the intermediate results coming from IntermediateRankingsBolt
    Rankings rankingsToBeMerged = (Rankings) tuple.getValue(0);
    // Aggregate and sort
    super.getRankings().updateWith(rankingsToBeMerged);
    // Prune zero counts to save memory
    super.getRankings().pruneZeroCounts();
  }

The re-sorting method is simple and blunt, because we keep only the top N, and N is not very large:

  private void rerank() {
    Collections.sort(rankedItems);
    Collections.reverse(rankedItems);
  }
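The whole ranking step can be sketched as a self-contained function (a hypothetical illustration, not the Rankings class itself): take the window's topic counts, sort descending by count, and keep the first N entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// A minimal sketch of top-N ranking over a window's topic counts.
public class TopNSketch {
    public static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(counts.entrySet());
        // Sort by count, highest first (sort + reverse in the original code)
        ranked.sort(Comparator
                .comparingLong((Map.Entry<String, Long> e) -> e.getValue())
                .reversed());
        return ranked.subList(0, Math.min(n, ranked.size()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of("#world#", 5L, "#weibo#", 9L, "#storm#", 2L);
        System.out.println(topN(counts, 2)); // [#weibo#=9, #world#=5]
    }
}
```

Since N stays small (here, 10), a full sort on every tick is cheap and there is no need for anything cleverer.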
Conclusion

This may well be the result we want: we have completed the hot-topic statistics for the interval between t0 and t1 (where @foreach_break sneaks in purely as a freeloader :]).

This article has explained the concept of sliding-window counting and the key code in some detail. If anything is still unclear, please refer to http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/ and Storm's source code.

I hope you understand what real-time computing is:]

Copyright notice: this is an original article by the blog author; please do not reproduce it without permission.
