What is real-time computing?
Take a look at the figure below:
Take hot-product statistics as an example. The traditional way of computing them looks like this:
1. Clean the user behavior, logs, and other information, then save them in the database.
2. Keep the order information in the database.
3. Use triggers or a coprocessor to build a local index, or a remote standalone index.
4. Join the order, order-detail, user, and merchandise tables, aggregate the statistics for hot products within a 20-minute window, and return the top 10.
5. Display the result on the web or in the app.
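The aggregation in step 4 can be sketched in a few lines. This is a hypothetical illustration, not any real system's code: the class, the `OrderRow` shape, and the method names are all invented here, and a real deployment would run this as SQL over joined tables.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of step 4: count orders per product over the last
// 20 minutes and return the top-10 product ids. All names are invented.
public class HotProductBatch {

    static class OrderRow {
        final String productId;
        final long epochMillis;
        OrderRow(String productId, long epochMillis) {
            this.productId = productId;
            this.epochMillis = epochMillis;
        }
    }

    static List<String> top10(List<OrderRow> rows, long nowMillis) {
        long windowStart = nowMillis - 20 * 60 * 1000L;
        return rows.stream()
                .filter(r -> r.epochMillis >= windowStart)                     // keep only the 20-minute window
                .collect(Collectors.groupingBy(r -> r.productId, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()) // hottest first
                .limit(10)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<OrderRow> rows = Arrays.asList(
                new OrderRow("p1", now), new OrderRow("p1", now),
                new OrderRow("p2", now),
                new OrderRow("p3", now - 30 * 60 * 1000L)); // outside the window
        System.out.println(top10(rows, now)); // p1 ranks before p2; p3 is excluded
    }
}
```

Simple enough as a one-shot batch, which is exactly why the problems below appear when the data no longer fits on one node.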
This is a hypothetical scenario, but if you have experience with anything similar, you will recognize these problems and difficulties:
1. Horizontal scaling (scale-out)
Obviously, for an e-commerce website of any real scale, the data volume is huge. Because of transactions, it is hard to abandon the transactional capability of a relational database and migrate directly to a NoSQL database with better scale-out ability.
So we generally shard. Historical data is fine: we can archive it by date and cache the results of batch-style offline computations.
But the requirement here is a 20-minute window, and that is hard.
2. Performance
This problem goes hand in hand with scale-out. Assuming we shard, the table is spread across the nodes, so each query has to visit the storage layer multiple times and do the aggregation in the business layer.
The question is: how many round trips to storage do we need for a 20-minute window?
10 minutes?
5 minutes?
What about real time?
Furthermore, the business layer itself becomes a single point of computation and needs to scale horizontally, which in turn raises consistency problems.
So everything here gets complicated.
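To make the scatter-gather concrete, here is a hypothetical sketch of what the business layer ends up doing once the data is sharded: each shard returns partial counts, and the business layer merges them. The names are invented for illustration. Every query repeats this fan-out, which is what makes short windows, let alone real time, so expensive.

```java
import java.util.*;

// Hypothetical scatter-gather merge: each shard computes partial per-product
// counts, and the business layer sums them into a global total.
public class ShardMerge {

    static Map<String, Long> merge(List<Map<String, Long>> perShardCounts) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : perShardCounts) {
            // Add each shard's partial count for a product into the global total
            partial.forEach((product, count) -> total.merge(product, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> shard1 = new HashMap<>();
        shard1.put("p1", 3L);
        shard1.put("p2", 1L);
        Map<String, Long> shard2 = new HashMap<>();
        shard2.put("p1", 2L);
        shard2.put("p3", 5L);
        // Totals: p1 -> 5, p2 -> 1, p3 -> 5
        System.out.println(merge(Arrays.asList(shard1, shard2)));
    }
}
```

Note that once several business-layer nodes run this merge concurrently, they must agree on window boundaries and retries, which is the consistency problem mentioned above.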
3. Business expansion
Suppose we not only compute hot-product statistics but also count ad clicks, or quickly infer user characteristics from access behavior so we can adjust what users see to better match their potential needs. Then the business layer becomes even more complex.
Maybe you have a better idea, but what we really need is a new way of thinking:
What is happening in this world is happening in real time.
So we need a real-time computing model, not a batch model.
This model must be able to handle large volumes of data, so it needs good scale-out ability, and ideally we should not have to think too much about consistency and replication.
This model is a real-time computing model, and it can be thought of as a streaming computing model.
Now, assuming we have this model, we can happily design new business scenarios:
Which tweets are retweeted the most?
What are the best-selling items?
What are the hotspots everyone is searching for?
Which of our ads, in which position, is clicked the most?
Or we can ask:
What is going on in this world right now?
What is the hottest microblogging topic?
Let's use a simple sliding-window counting problem to lift the mysterious veil of real-time computing.
Suppose our business requirement is:
Count the 10 hottest Twitter topics over the last 20 minutes.
To solve this problem, we need to consider:
1. Data source
Here, let's assume our data comes from topics pushed over a Twitter long-lived connection.
2. Problem modeling
We define a topic as the text enclosed in a pair of # marks, and the hottest topic as the one that appears more frequently than the others.
For example, in: @foreach_break: Hello, #World#, I love you, #Weibo#
"World" and "Weibo" are the topics.
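Assuming topics are delimited by paired # marks as in the example above, the extraction step is a small regex exercise. This sketch is ours, not part of any real tokenizer; the class name and pattern are invented for illustration.

```java
import java.util.*;
import java.util.regex.*;

// Extract topics delimited by a pair of '#' marks, e.g. "#World#" -> "World".
public class TopicExtractor {

    // Match the shortest non-'#' run between two '#' marks, trimming surrounding spaces
    private static final Pattern TOPIC = Pattern.compile("#\\s*([^#]+?)\\s*#");

    static List<String> extract(String text) {
        List<String> topics = new ArrayList<>();
        Matcher m = TOPIC.matcher(text);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    public static void main(String[] args) {
        String tweet = "@foreach_break: Hello, #World#, I love you, #Weibo#";
        System.out.println(extract(tweet)); // [World, Weibo]
    }
}
```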
3. Computing engine
We use Storm.
4. Defining time
How do we define time?
Defining time is hard; it depends on how much precision is required.
In practice, we generally use a tick to express the concept of time.
In Storm's infrastructure, the executor uses a timer during its startup phase to trigger the "tick" event periodically.
As shown below:
(defn setup-ticks! [worker executor-data]
  (let [storm-conf (:storm-conf executor-data)
        tick-time-secs (storm-conf TOPOLOGY-TICK-TUPLE-FREQ-SECS)
        receive-queue (:receive-queue executor-data)
        context (:worker-context executor-data)]
    (when tick-time-secs
      (if (or (system-id? (:component-id executor-data))
              (and (= false (storm-conf TOPOLOGY-ENABLE-MESSAGE-TIMEOUTS))
                   (= :spout (:type executor-data))))
        (log-message "Timeouts disabled for executor " (:component-id executor-data) ":" (:executor-id executor-data))
        (schedule-recurring
          (:user-timer worker)
          tick-time-secs
          tick-time-secs
          (fn []
            (disruptor/publish
              receive-queue
              [[nil (TupleImpl. context [tick-time-secs] Constants/SYSTEM_TASK_ID Constants/SYSTEM_TICK_STREAM_ID)]])))))))
Every once in a while, a tick event is triggered. When a downstream bolt in the stream receives such an event, it can choose whether to keep incrementing its counts or to aggregate the results and emit them into the stream.
How does a bolt tell that the tuple it received is a tick?
The executor thread that manages the bolt calls the bolt's execute method while consuming messages from its subscribed message queue, so the check can be made inside execute:
public static boolean isTick(Tuple tuple) {
    return tuple != null
        && Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
        && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
}
Combining this with the setup-ticks! Clojure code above, we can see that SYSTEM_TICK_STREAM_ID is passed to the tuple as a constructor argument inside the timer callback. So where does SYSTEM_COMPONENT_ID come from?
As you can see in the following code, SYSTEM_TASK_ID is also passed to the tuple:
;; note that SYSTEM_TASK_ID and SYSTEM_TICK_STREAM_ID are both passed here
(TupleImpl. context [tick-time-secs] Constants/SYSTEM_TASK_ID Constants/SYSTEM_TICK_STREAM_ID)
SYSTEM_COMPONENT_ID can then be obtained from the task id with the following code:
public String getComponentId(int taskId) {
    if (taskId == Constants.SYSTEM_TASK_ID) {
        return Constants.SYSTEM_COMPONENT_ID;
    } else {
        return _taskToComponent.get(taskId);
    }
}
Sliding window
With the infrastructure above, we still need some engineering to turn the vision into reality.
Here, let's look at Michael G. Noll's sliding-window design.
Topology
String spoutId = "wordGenerator";
String counterId = "counter";
String intermediateRankerId = "intermediateRanker";
String totalRankerId = "finalRanker";
// Suppose TestWordSpout is the source of the topic tuples we emit
builder.setSpout(spoutId, new TestWordSpout(), 5);
// RollingCountBolt uses a 9-second time window and emits its statistics downstream every 3 seconds
builder.setBolt(counterId, new RollingCountBolt(9, 3), 4).fieldsGrouping(spoutId, new Fields("word"));
// IntermediateRankingsBolt does the partial aggregation: the top-N topics per task
builder.setBolt(intermediateRankerId, new IntermediateRankingsBolt(TOP_N), 4).fieldsGrouping(counterId, new Fields("obj"));
// TotalRankingsBolt does the final aggregation: the global top-N topics
builder.setBolt(totalRankerId, new TotalRankingsBolt(TOP_N)).globalGrouping(intermediateRankerId);
The topology above is laid out as shown in the following figure:
Combining aggregation with time
As described earlier, the tick event triggers the bolt's execute method, where the real work can be done:
RollingCountBolt:
@Override
public void execute(Tuple tuple) {
    if (TupleUtils.isTick(tuple)) {
        LOG.debug("Received tick tuple, triggering emit of current window counts");
        // A tick arrived: emit the statistics for the current time window and let the window roll
        emitCurrentWindowCounts();
    } else {
        // An ordinary tuple: just count the topic
        countObjAndAck(tuple);
    }
}

// obj is the topic; increment its count. Note that the rate here depends largely on the
// speed of the stream: possibly millions per second, possibly only dozens.
// Not enough memory? The bolt can scale out.
private void countObjAndAck(Tuple tuple) {
    Object obj = tuple.getValue(0);
    counter.incrementCount(obj);
    collector.ack(tuple);
}

// Emit the statistics for the window downstream
private void emitCurrentWindowCounts() {
    Map<Object, Long> counts = counter.getCountsThenAdvanceWindow();
    int actualWindowLengthInSeconds = lastModifiedTracker.secondsSinceOldestModification();
    lastModifiedTracker.markAsModified();
    if (actualWindowLengthInSeconds != windowLengthInSeconds) {
        LOG.warn(String.format(WINDOW_LENGTH_WARNING_TEMPLATE, actualWindowLengthInSeconds, windowLengthInSeconds));
    }
    emit(counts, actualWindowLengthInSeconds);
}
The code above may look a little abstract; the figure below makes it clear: on each tick, the window rolls forward:
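The rolling mechanics can also be captured in a tiny, self-contained sketch. To be clear, this is not the actual SlidingWindowCounter from the storm-starter code, which keeps a count per object; it is our simplification to a single counter, just to show the slot arithmetic: with a 9-second window and 3-second emits there are 9/3 = 3 slots, and each tick advances the head slot and zeroes the slot that falls out of the window.

```java
// Minimal sketch of a slot-based sliding-window counter (single key only;
// the real SlidingWindowCounter tracks a count per object).
public class SlidingWindowCounterSketch {

    private final long[] slots; // one count per emit interval
    private int head = 0;       // the slot currently being incremented

    SlidingWindowCounterSketch(int numSlots) {
        this.slots = new long[numSlots];
    }

    // Called for every ordinary tuple
    void increment() {
        slots[head]++;
    }

    // Called on each tick: return the windowed total, then roll the window
    long getCountThenAdvanceWindow() {
        long total = 0;
        for (long c : slots) total += c;
        head = (head + 1) % slots.length;
        slots[head] = 0; // the oldest slot is reused for the new interval
        return total;
    }

    public static void main(String[] args) {
        SlidingWindowCounterSketch w = new SlidingWindowCounterSketch(3);
        w.increment();
        w.increment();
        System.out.println(w.getCountThenAdvanceWindow()); // 2
        w.increment();
        System.out.println(w.getCountThenAdvanceWindow()); // 3 (2 + 1, both still in window)
    }
}
```

After three more ticks with no increments, the old counts fall out of the window and the total returns to zero, which is exactly the "window scrolling" in the figure.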
IntermediateRankingsBolt & TotalRankingsBolt:
public final void execute(Tuple tuple, BasicOutputCollector collector) {
    if (TupleUtils.isTick(tuple)) {
        getLogger().debug("Received tick tuple, triggering emit of current rankings");
        // Emit the aggregated, sorted results downstream
        emitRankings(collector);
    } else {
        // Merge the incoming tuple into the rankings and re-sort
        updateRankingsWithTuple(tuple);
    }
}
The aggregation-and-sort methods of IntermediateRankingsBolt and TotalRankingsBolt differ slightly.
IntermediateRankingsBolt's aggregation-and-sort method:

@Override
void updateRankingsWithTuple(Tuple tuple) {
    // Extract the topic and the number of times it appeared from the tuple
    Rankable rankable = RankableObjectWithFields.from(tuple);
    // Add the count to the aggregate, then re-sort all topics
    super.getRankings().updateWith(rankable);
}
TotalRankingsBolt's aggregation-and-sort method:

@Override
void updateRankingsWithTuple(Tuple tuple) {
    // The tuple carries the intermediate Rankings produced by IntermediateRankingsBolt
    Rankings rankingsToBeMerged = (Rankings) tuple.getValue(0);
    // Merge and sort
    super.getRankings().updateWith(rankingsToBeMerged);
    // Prune zero counts to save memory
    super.getRankings().pruneZeroCounts();
}
The re-sorting method is simple and crude, because we only keep the first N, and N is not very large:
private void rerank() {
    Collections.sort(rankedItems);
    Collections.reverse(rankedItems);
}
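As a quick sanity check, sorting ascending and then reversing does give descending order. This is a standalone toy with plain integers standing in for the Rankable items, not the bolt's actual code:

```java
import java.util.*;

// Standalone demonstration of the sort-then-reverse trick used by rerank above.
public class RerankDemo {

    static List<Integer> rerank(List<Integer> items) {
        List<Integer> copy = new ArrayList<>(items);
        Collections.sort(copy);      // ascending
        Collections.reverse(copy);   // now descending: hottest first
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(rerank(Arrays.asList(3, 1, 5, 2))); // [5, 3, 2, 1]
    }
}
```

For small N this O(n log n) full sort is perfectly fine; a heap would only pay off for large rankings.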
Conclusion
The figure below may be just the result we want: we have completed the hot-topic statistics for the interval from t0 to t1. (The foreach_break entries are only there as an anti-piracy watermark. :])
That is the whole article; I hope you enjoyed it.