Several difficult points of streaming statistics



Streaming statistics sounds like a very easy thing; in the end it is just counting, and almost every alarm system has a simple streaming statistics module. But when I built one on top of Storm, the following questions still bothered me for a long time. I have not used Spark Streaming or Flink, so I do not know whether they solve these problems well.

Time window slicing problem



The first problem in streaming statistics is counting the data of a time window together. The question is: what is the time window based on? There are two options: log time (the event timestamp) or wall time (the wall clock).



The simplest kind of time window is based on wall time: every minute a new window is cut. STATSD, for example, slices its windows this way. A very serious problem with wall-time statistics is that the data stream cannot be replayed. When the data stream is generated in real time, one minute of wall time corresponds to one minute's worth of events. But if the stream being counted consists of historical events, the number of events consumed within one minute is limited only by the processing speed. In addition, events collected from a distributed system arrive at different speeds: events generated within one minute may not all reach the statistics end within that minute, so fluctuations in collection latency affect the accuracy of the results. In fact, define the collection latency as:



collection latency = wall clock time - event timestamp



For statistics based on wall time to work well, the collection latency must be very small and fluctuate very little. Most of the time, the more realistic choice is to use log time for the window statistics.



The use of "Log Time" will introduce the problem of data chaos, for a real-time event stream stream, the timestamp of each event may not necessarily be strictly incremental. There are two factors for this disorder: The machine's clock is not fully synchronized (NTP has a 100ms or so of different steps) event from acquisition to reach Kafka speed imbalance (different network lines have fast and slow)



The streaming statistics we want look like this:






But in reality the data is only mostly ordered; at the edges of a time window there are always a few events that belong in the neighboring window:






The simplest code for assigning an event to a time window looks like this:



window_index = event_timestamp / window_size



For a one-minute time window the window size is 60: events whose timestamps, divided by 60, give the same window index belong to the same window. The real question is: when can I be sure that all events of a time window have arrived? If they have, the metrics of that window can be computed. And if an event then straggles in after everyone else, what do we do with a window that has already been computed? For most statistics it is not a big problem to store several partial results per window in the DB and merge them at query time. For some kinds of statistics (non-monoid ones), such as averages, once the events of a window are split into two batches the results cannot be merged again. Real-time computations are also time-sensitive, so data that arrives late is meaningless anyway; for alerting, for example, once a time window has passed it can simply be ignored.
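A minimal sketch of this bucketing in Python (the window size, the event dict layout and the windows map are my own illustrative assumptions, not from the original):

# Assign each event to a window by integer-dividing its log time by the
# window size. All names here are illustrative, not from the original post.
WINDOW_SIZE = 60  # one-minute windows, in seconds

windows = {}  # window_index -> list of events that fell into that window

def assign(event):
    window_index = int(event["timestamp"]) // WINDOW_SIZE
    windows.setdefault(window_index, []).append(event)
    return window_index

# Every event with a timestamp between 12:04:00 and 12:05:00 gets the same
# window_index, so they end up being counted together.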



So for late data there are two strategies: either count it into another result, or discard it directly. To determine when all events of a time window have arrived, there are several strategies:

Sleep and wait for a fixed period of wall time after the window ends.
Do not close the window when an event's timestamp is only slightly past it; close it when an event's timestamp is clearly past the window. For example, an event at 12:05:30 closes the 12:04:00 ~ 12:05:00 window.
Do not close the window when only one or two events are past it; close it only when "a large number" of events are past it. For example, one event past 12:05 does not close the window, but 100 events past 12:05 do.



All three strategies are essentially "waiting"; they differ only in what the wait is based on. In practice the second strategy, waiting based on log time, is the easiest to implement. If expired events are not discarded but counted into an additional result, the expired window has to be reopened, and another round of "waiting" is needed to decide when that reopened window is closed again.
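A minimal sketch of the second strategy in Python, assuming the 30-second lag from the example above and simple counting; the function names and the flush/late handlers are placeholders of my own, not the post's actual implementation:

import collections

WINDOW_SIZE = 60   # seconds per window
ALLOWED_LAG = 30   # close a window once log time is this far past its end

counters = collections.defaultdict(int)   # window_index -> event count
max_seen = 0                               # largest event timestamp seen so far

def flush_window(window_index, count):
    print("closed window", window_index, "count", count)   # stand-in

def handle_late_event(window_index):
    pass   # stand-in: drop it, or reopen the window and emit a second result

def on_event(timestamp):
    global max_seen
    window_index = int(timestamp) // WINDOW_SIZE
    window_end = (window_index + 1) * WINDOW_SIZE
    if max_seen - window_end >= ALLOWED_LAG:
        handle_late_event(window_index)    # its window was already closed
        return
    counters[window_index] += 1
    max_seen = max(max_seen, timestamp)
    for idx in [i for i in counters if (i + 1) * WINDOW_SIZE <= max_seen - ALLOWED_LAG]:
        flush_window(idx, counters.pop(idx))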



There has been a similar attempt on Spark: Building a Big Data Operational Intelligence Platform with Apache Spark - Eric Carr (Guavus).

Multi-stream merge problem



A Kafka partition is one stream, and the multiple partitions of a Kafka topic are multiple independent streams (their offsets grow independently of each other). Multiple Kafka topics are obviously multiple independent streams as well. Streaming statistics often needs to merge several streams together, and then two problems appear. First, the streams run at different speeds, so how do you judge that all events of a time window have arrived? The waiting strategy above can handle local disorder inside one mostly ordered stream, but it is helpless when streams have very different speeds: a very fast stream easily pushes the window forward and leaves the other streams far behind. Second, the downstream cannot simply buffer everything, because its memory is limited. Essentially a back-pressure mechanism is needed, so that the downstream can tell an upstream that produces too fast to slow down and wait for the others.



To give a specific example:


spout 1 emit 12:05
spout 1 emit 12:06
spout 2 emit 12:04
spout 1 emit 12:07
spout 2 emit 12:05   // at this point 12:05 is ready


To know that all events of the 12:05 window have arrived, you first have to know how many related streams there are (here two, spout 1 and spout 2), then know when spout 1 has produced data past 12:05 and when spout 2 has, and only then can you decide that the data for 12:05 is aligned. Somewhere a copy of each stream's progress has to be kept and tracked, and once a window's data is complete a signal has to be sent to the relevant downstream so it can advance the time window. In a distributed system, where to keep this tracking state and how to notify every interested party becomes a real question.
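A minimal sketch of that tracking for the two-spout example, in a single process; in a real distributed setup this map and the notification would live in the coordinator discussed below (all names are my assumptions):

# Track how far each stream has progressed (largest event timestamp emitted).
# A window is only complete once the slowest stream has moved past it.
stream_progress = {"spout 1": 0, "spout 2": 0}

def on_emit(stream, timestamp):
    stream_progress[stream] = max(stream_progress[stream], timestamp)
    slowest = min(stream_progress.values())
    notify_downstream(slowest)

def notify_downstream(completed_up_to):
    # stand-in: tell downstream it may flush every window ending at or
    # before completed_up_to
    print("windows ending <=", completed_up_to, "are aligned")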



An extreme example:


spout 1 emit 13:05
spout 2 emit 12:31
spout 1 emit 13:06
spout 2 emit 12:32


The rates of multiple streams can differ by more than half an hour. Consider what happens when historical data is poured into the real-time statistics system: different nodes compute at different speeds, so their processing progress easily diverges. To compute correct results, the downstream would have to cache all the data inside that half-hour gap, which easily blows up its memory. But how does the upstream know that the downstream cannot keep up? How do multiple upstreams perceive the speed differences between each other? And who arbitrates which of them should slow down?



A relatively simple approach is to introduce a coordinator role into the distributed streaming statistics system. It tracks the rate of each stream, notifies the downstream to flush once the data of a time window is aligned, and, when some upstream is too fast (for example, when the fastest stream is more than 10 minutes ahead of the slowest), sends a backoff instruction to that upstream, which then sleeps for a while after receiving it. A basic piece of code for tracking the rates of different streams: https://gist.github.com/taowen/2d0b3bcc0a4bfaecd404
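The gist above contains the actual tracking code; what follows is only my own rough sketch of the coordinator idea, with the 10-minute threshold taken from the example and all other names assumed:

MAX_GAP = 10 * 60   # seconds: fastest stream may lead the slowest by 10 minutes at most

class Coordinator:
    def __init__(self):
        self.progress = {}   # stream name -> latest event timestamp reported

    def report(self, stream, timestamp):
        self.progress[stream] = max(self.progress.get(stream, 0), timestamp)
        fastest = max(self.progress, key=self.progress.get)
        slowest_ts = min(self.progress.values())
        if self.progress[fastest] - slowest_ts > MAX_GAP:
            send_backoff(fastest)   # that spout sleeps for a while on receipt

def send_backoff(stream):
    print("backoff", stream)        # stand-in for the real instruction channel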

Data consistency issues



The less sophisticated way to state the requirement is this. Suppose this is the statistics curve:






If the statistics program is restarted somewhere in the middle, say around 08:35, the curve should still be continuous.



The more sophisticated way to state it is that streaming statistics can be understood as asynchronous synchronization from the primary database to the analysis database through the Kafka message queue, and eventual consistency should be maintained between the two databases.






To ensure that no data is lost, the write to the primary database and the produce to the Kafka message queue have to stay transactionally consistent. A simple example:


The user places an order
The primary database inserts an order data record
The Kafka message queue gets an OrderPlaced event


One problem with this flow is that after the insert into the primary database succeeds, enqueueing the event into the Kafka message queue may fail. If you reverse the two operations:


The user places an order
The Kafka message queue gets an OrderPlaced event
The primary database inserts an order data record


then it may happen that the event is enqueued into the Kafka message queue but the insert into the primary database fails. With Kafka's current design there is no way around this: once an event has been enqueued it cannot be deleted until it expires.
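A minimal sketch of the insert-then-enqueue order from the example, using sqlite3 as a stand-in for the primary database and a stub instead of the real Kafka produce call (both are my assumptions, not the original setup):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY)")

def enqueue_order_placed(order_id):
    pass   # stand-in for producing an OrderPlaced event to Kafka

def place_order(order_id):
    # Step 1: insert the order record into the primary database.
    db.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
    db.commit()
    # Step 2: enqueue the OrderPlaced event. If the process dies between the
    # two steps, the order exists but the analysis side never sees the event.
    # Reversing the steps only moves the inconsistency to the other side,
    # because an already-enqueued event cannot be deleted from Kafka.
    enqueue_order_placed(order_id)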



On the consumer side, taking the data out of Kafka and updating the analysis database is also a distributed transaction whose consistency has to be maintained:


Take out the next OrderPlaced event (move the pointer to offset+1)
Add 1 to the statistic value of the current time window
Repeat until the time window closes, then write the data to the analysis database


Kafka's data can be replayed: as long as an offset is specified, the data after it can be read out again. The so-called consumption process is just incrementing a client-held offset by 1. The question is where that offset pointer is stored. The usual practice is to save the consumed offset in ZooKeeper, and then there is a distributed consistency problem: the offset in ZooKeeper has moved to offset+1, but the analysis database has not actually recorded the value. Keep in mind that statistics generally do not update the analysis database for every input event; the intermediate state is cached in memory. So it is entirely possible to consume thousands of events whose state lives only in memory, and then the machine suddenly loses power. If the offset is moved every time an event is read, those events are effectively discarded; if it is not moved every time, the restart may double count.
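A minimal sketch of the trade-off, counting a single ordered stream in memory and moving the offset only after a window has been flushed (the function names and the list standing in for a Kafka partition are my assumptions):

WINDOW_SIZE = 60
counters = {}

def flush_window(window_index, count):
    print("write to analysis db:", window_index, count)   # stand-in

def save_offset(offset):
    print("save offset to zookeeper:", offset)            # stand-in

def consume(events, start_offset=0):
    # events stands in for a Kafka partition; the list index is the offset
    offset = start_offset
    open_window = None
    for event in events[start_offset:]:
        window_index = int(event["timestamp"]) // WINDOW_SIZE
        if open_window is not None and window_index != open_window:
            flush_window(open_window, counters.pop(open_window))
            save_offset(offset)   # only after the flush: a crash in between
                                  # re-counts the window instead of losing it
        open_window = window_index
        counters[open_window] = counters.get(open_window, 0) + 1
        offset += 1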



Do the people doing statistics care about one or two data points like that? In fact most do not. Many teams do not save the offset at all and simply seek to the tail of the queue every time they start counting. Real-time computing is mainly about being real time; accurate computation and replaying history are what Hadoop is for. But if we want to be a bit more rigorous, we can still avoid pursuing strict strong consistency and only require that the curve after a restart is not broken in such an ugly way.



I am not sure about other stream computing frameworks, but Storm's ack mechanism is of no help here.






Storm's ack mechanism is per message. That means if you compute statistics over 1 million events per minute, you have to track 1 million message IDs within that minute; even 1 million ints is a considerable memory overhead. Yet the events read from Kafka arrive in offset order and are processed in order, so recording a single offset is enough to track the consumption progress of the whole stream: 1 int versus 1 million ints. Storm's per-message ack mechanism for tracking progress does not take advantage of the ordering of message processing (Storm essentially assumes messages are handled independently of each other), and so becomes inefficient.



To be strictly consistent, inserting the statistic into the analysis database and updating the saved offset would have to be completed in one atomic transaction. Most analysis databases do not have atomic transactions; some cannot even keep three inserted rows visible at the same moment, let alone be used to record an offset. Given that Kafka cannot provide distributed transactions on the producer side either, the events are already not fully consistent at the point of production (some may be duplicated or missing), and truly high-consistency billing scenarios use a different technology stack anyway. So the problem to solve becomes: how to restore the in-memory state that was discarded during a restart, so that the computed curve remains continuous.



There are three parts to the solution:

Upstream backup: replay Kafka's historical data at reboot to rebuild the in-memory state.
Intermediate state persistence: keep the statistic state in an external persistent database instead of in memory.
Run two copies at the same time: keep two identical statistic tasks running, so that while one is restarted the other keeps running normally.
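A minimal sketch of the first approach (the function names are my own stand-ins): on startup, read the offset saved alongside the last flushed window and rebuild the counters by replaying Kafka from there.

def load_saved_offset():
    return 0          # stand-in: read the offset stored with the last flush

def read_from_kafka(start_offset):
    return []         # stand-in: yields events from the given offset onwards

def recover():
    counters = {}
    for event in read_from_kafka(load_saved_offset()):
        window_index = int(event["timestamp"]) // 60
        counters[window_index] = counters.get(window_index, 0) + 1
    return counters   # in-memory state is continuous again, so the curve stays unbroken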


Memory state management problems



There are two ways to do streaming statistics: rely on external storage to manage the state (for example, for each event received, do a Redis INCR), or do pure in-memory statistics (keep a counter in memory and add 1 for each event received).
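The two styles side by side, as a minimal sketch; the INCR call is redis-py's, but the key layout and window arithmetic are my own assumptions:

import collections
import redis

r = redis.Redis()

def count_in_redis(event):
    # every event costs a round trip to external storage
    r.incr("counter:%d" % (int(event["timestamp"]) // 60))

memory_counters = collections.defaultdict(int)

def count_in_memory(event):
    # state stays local; it is only flushed when the window closes
    memory_counters[int(event["timestamp"]) // 60] += 1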



With external storage, all the pressure is pushed onto the database. The stream rate is generally very high, far beyond what an ordinary relational database can handle, and it may even exceed the load of a single Redis instance. That makes pure in-memory statistics very attractive: most of the time only the window state in memory is updated, and only when the time window closes is the data flushed to the analysis database. When flushing, you also record how far the stream has been consumed (the offset).






This pure in-memory state is relatively easy to manage. Computation happens directly on it, and if it is lost on restart, a stretch of historical data is replayed to rebuild it.



But the problem with memory is that there is never enough of it. When the combination of statistical dimensions is very large, for example when one of the fields is the user ID, the in-memory state quickly exceeds the machine's memory limit. There are two ways around this: use partitioning to split the input, turning one stream into multiple streams so that each statistics program only has to track a smaller set of dimension combinations; or move the state to external storage.
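A minimal sketch of the partitioning approach, hashing the user ID (the high-cardinality field in the example) so each statistics task only sees its own slice; the partition count and routing are my assumptions:

import zlib

NUM_PARTITIONS = 8
partitions = [[] for _ in range(NUM_PARTITIONS)]   # stand-ins for 8 downstream tasks

def partition_for(event):
    # a stable hash, so the same user always lands on the same task
    return zlib.crc32(event["user_id"].encode()) % NUM_PARTITIONS

def route(event):
    partitions[partition_for(event)].append(event)

# Each task now only tracks roughly 1/8 of the dimension combinations,
# so its counters fit in the memory of a single machine.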



Simply switching the database connection inside the streaming statistics program seems to solve the capacity problem:






But such casual use of an external database causes two problems:

Processing becomes slow. Without bulk operations, the database round trips quickly become the bottleneck.
The two states are inconsistent with each other. The in-memory state is lost after a restart while the external state is not, so replaying the data stream leads to double counting.



But the benefit of keeping the window's intermediate state externally is also obvious: the state does not have to be rebuilt by recomputation after a restart. If a time window spans 24 hours, recomputing 24 hours of history can be very expensive.



Version tracking, batching and the like should not be the responsibility of the code that implements the specific statistical logic. In theory the framework should be responsible for separating hot and cold data, automatically sinking cold data to external storage to free local memory, and recording the offset after each small batch of events instead of waiting until the window closes.
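A minimal sketch of that per-batch bookkeeping (the batch size and the persistence stubs are my assumptions):

BATCH_SIZE = 1000

def persist_batch(partial_counts):
    pass   # stand-in for a bulk merge/upsert into external storage

def save_offset(offset):
    pass   # stand-in

def process(events, start_offset=0):
    batch = {}
    offset = start_offset
    in_batch = 0
    for event in events[start_offset:]:
        window_index = int(event["timestamp"]) // 60
        batch[window_index] = batch.get(window_index, 0) + 1
        offset += 1
        in_batch += 1
        if in_batch == BATCH_SIZE:
            persist_batch(batch)   # partial counts are merged externally
            save_offset(offset)    # at most one small batch is re-counted after a crash
            batch.clear()
            in_batch = 0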






The state in the database and the state in memory then become one tightly integrated whole. The relationship between the two can be imagined as the operating system's file system page cache: the state is mapped into memory as if by mmap, and the framework decides when in-memory changes are persisted to external storage.
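A minimal sketch of that hot/cold split: a bounded in-memory map holds the hot counters and cold ones are sunk to an external store (a plain dict stands in for it here; the size limit and eviction policy are my assumptions):

from collections import OrderedDict

MAX_HOT = 10000
hot = OrderedDict()   # counter key -> count, most recently used last
cold = {}             # stand-in for external storage such as Redis

def incr(key):
    if key not in hot:
        hot[key] = cold.pop(key, 0)   # fault the value back in if it went cold
    hot[key] += 1
    hot.move_to_end(key)
    if len(hot) > MAX_HOT:
        cold_key, cold_count = hot.popitem(last=False)   # evict the least recently used
        cold[cold_key] = cold_count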

Summary



Storm-based streaming statistics lacks mature solutions to the four basic issues above. The Trident framework may provide some answers, but in practice few people seem to use it and there is too little information about it. It is fairly safe to say this is not just Storm's problem; it is true of most streaming computing platforms. The four issues again:

The time window slicing problem
The multi-stream merge problem
The data consistency problem (the curve breaking after a restart)
The memory state management problem



Solving these problems properly takes real effort. The new generation of streaming computing frameworks, such as Spark Streaming and Flink, should have improved a lot here. But even if the underlying framework provides support, it is still worth looking, from these four angles, at exactly how they support it.

