Background
Compared with traditional batch analysis platforms such as Hadoop, the advantage of streaming analysis is timeliness: data can be analyzed with a delay on the order of seconds.
The disadvantage is that strong consistency, that is, exactly-once semantics, is hard to guarantee: with massive data volumes, a transaction-like strong consistency scheme cannot be used without sacrificing throughput.
A typical streaming analysis platform promises only weak consistency, that is, at-least-once semantics: data is not lost, but duplicates are allowed.
But this holds only under normal conditions. When any stage of the streaming pipeline fails, the whole flow is blocked, the upstream queue fills up, and data is eventually lost anyway.
Therefore, a streaming analysis platform that wants to guarantee consistency must rely on an external replay capability.
Lambda Architecture
Nathan Marz, the author of Storm, proposed the well-known Lambda architecture in "How to beat the CAP theorem" to solve the consistency problem in real-time systems.
The principle is simple: since streaming analysis cannot guarantee consistency, the full data set is stored in Hadoop, and batch analysis over it guarantees strong consistency.
Streaming analysis is used only to compute the real-time hot data, while cold data is handled by offline computation; at query time, the two results just need to be merged.
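The query-time merge can be illustrated with a minimal sketch. It assumes both layers produce simple per-key counts; the function name merge_views and the sample data are hypothetical and only for illustration, not part of any Lambda implementation.

```python
from collections import Counter

def merge_views(batch_view, realtime_view):
    """Combine the batch (offline) view with the real-time view at query time."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds the counts key by key
    return dict(merged)

# Hypothetical page-view counts: the batch view covers everything up to the
# last batch run, the real-time view covers only what has arrived since then.
batch_view = {"page_a": 1000, "page_b": 250}
realtime_view = {"page_a": 12, "page_c": 3}

result = merge_views(batch_view, realtime_view)
print(result)
```

The sketch only works because counts are additive; aggregations that cannot be merged this way need a view-specific merge function.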
Strictly speaking, this does not count as beating CAP, because it is just a framework that combines the strong consistency of batch analysis with the high availability of streaming analysis.
But it does offer a very constructive answer to how streaming analysis can guarantee consistency.
The flaws of the Lambda architecture are also obvious: it is too complex and too heavy, and it requires two systems, one real-time and one offline, which is costly to operate.
Worse, the analysis logic has to be implemented twice. Solutions such as Summingbird exist, but they are still rather idealized; faced with the reality of massive data, they fall short.
The architecture of LinkedIn
To address this issue, LinkedIn architect Jay Kreps, in "Questioning the Lambda Architecture", proposed an architecture based purely on Kafka and streaming analysis.
The principle is not complicated: make full use of Kafka's replay capability. As long as there is enough disk space, Kafka can retain data for as long as needed.
And because Kafka stores data on disk, it can be read repeatedly, which is why Kafka suits streaming scenarios better than other queue middleware.
1. Use streaming job_n to compute the hot data in real time and store the results in table_n, which serves real-time user queries.
2. When needed (data was partially lost after a failure, or the processing logic has changed), start streaming job_n+1 to process the full data set and store the results in table_n+1; once it has caught up, switch user traffic to table_n+1.
3. Delete job_n and table_n.
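The three steps can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: the retained log is a plain list standing in for a Kafka topic, the tables are dictionaries, and run_job, logic_v1 and logic_v2 are hypothetical names.

```python
def run_job(log, start_offset, process):
    """Replay the retained log from start_offset and build an output table."""
    table = {}
    for offset in range(start_offset, len(log)):
        key, value = process(log[offset])
        table[key] = table.get(key, 0) + value
    return table

# Hypothetical retained log of (user, bytes) records.
log = [("alice", 10), ("bob", 5), ("alice", 7)]

def logic_v1(record):
    """Original processing logic: sum bytes per user."""
    return record[0], record[1]

def logic_v2(record):
    """Changed processing logic: count records per user."""
    return record[0], 1

# Step 1: job_n fills table_n, which serves user queries.
tables = {"table_n": run_job(log, 0, logic_v1)}
serving = "table_n"

# Step 2: job_n+1 replays the full log into table_n+1; once it has caught up,
# queries are switched over to the new table.
tables["table_n+1"] = run_job(log, 0, logic_v2)
serving = "table_n+1"

# Step 3: the old job's table is deleted.
del tables["table_n"]

print(tables[serving])
```

The key property is that the old table keeps serving queries until the new one has been fully rebuilt from the log.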
This architecture is relatively lightweight, solves the consistency problem of streaming analysis platforms to a large extent, and is worth using as a reference.
Tradeoff Solutions
But for our scenario, this method is too idealistic:
The reason is that the data volume is too large: 7 days of logs require nearly 2 PB of disk space (Kafka has to keep replicas).
Replaying that much data within an acceptable time frame would need more analysis resources than we can provide.
In addition, switching the data source of an online business is not that simple.
So our idea is to backfill only the lost data rather than replay the full data set.
Step 1. Reset the online job to Kafka's latest offset so that it reads the newest data.
Using the online job to backfill the old data would hurt the user experience: real-time traffic is already very large, so catching up would be slow, and users would not see the latest logs for a long time.
Step 2. Find the data that needs to be backfilled.
There are many ways to do this. Our approach relies on MonitorBolt, which provides real-time business monitoring, so we know to the second when the service became abnormal and when it recovered.
Step 3. Start the catchup job, reading from the earliest offset.
A time filter configured in the processing bolt ensures that only data within the specified time range is processed; everything else is discarded.
Step 4. After the data is recovered, stop the catchup job.
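The time filter used in Step 3 can be sketched as follows. This is a minimal illustration in plain Python, not Storm bolt code: the record layout, the function names and the outage window are all hypothetical.

```python
from datetime import datetime

def in_window(event_time, start, end):
    """True if event_time falls in the half-open range [start, end)."""
    return start <= event_time < end

def catchup_process(records, start, end):
    """Process only the records inside the recovery window; discard the rest."""
    recovered = []
    for record in records:
        if in_window(record["time"], start, end):
            recovered.append(record["line"])
        # Records outside the window are dropped without further processing.
    return recovered

# Hypothetical outage window and log records.
outage_start = datetime(2024, 1, 1, 2, 0, 0)
outage_end = datetime(2024, 1, 1, 12, 0, 0)
records = [
    {"time": datetime(2024, 1, 1, 1, 59, 59), "line": "before outage"},
    {"time": datetime(2024, 1, 1, 2, 0, 0), "line": "first lost line"},
    {"time": datetime(2024, 1, 1, 11, 59, 59), "line": "last lost line"},
    {"time": datetime(2024, 1, 1, 12, 0, 0), "line": "after recovery"},
]

recovered = catchup_process(records, outage_start, outage_end)
print(recovered)
```

A half-open window is used so that a record stamped exactly at the recovery time is left to the online job rather than processed twice by design.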
This solution ensures that no data is lost. Of course, it is not perfect; the problems are as follows:
1. It cannot guarantee exactly-once, only at-least-once.
During the 10 hours of the failure, a relatively small amount of log data was still written successfully; after the replay, that part of the data is duplicated.
2. It reads some data that does not need to be replayed.
To keep things simple, our catchup job reads from the earliest offset and filters in the business bolt.
A better way is to periodically checkpoint (for example, every minute) the processed offsets in KafkaSpout.
Then, on recovery, reading can start from a suitable checkpoint, which is more precise, but the solution is much more complex.
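The checkpoint idea can be sketched as a lookup over recorded (timestamp, offset) pairs. This is only an illustration of the lookup, under the assumption that checkpoints are kept sorted by time; it is not KafkaSpout code, and replay_start_offset, earliest_offset and the sample values are hypothetical.

```python
import bisect

def replay_start_offset(checkpoints, outage_start, earliest_offset=0):
    """Return the offset of the latest checkpoint taken at or before outage_start.

    checkpoints is a list of (timestamp, offset) pairs sorted by timestamp.
    If no checkpoint precedes the outage, fall back to earliest_offset.
    """
    timestamps = [ts for ts, _ in checkpoints]
    i = bisect.bisect_right(timestamps, outage_start)
    if i == 0:
        return earliest_offset
    return checkpoints[i - 1][1]

# Hypothetical minute-level checkpoints: (seconds since start, processed offset).
checkpoints = [(60, 1000), (120, 2100), (180, 3300)]

start = replay_start_offset(checkpoints, outage_start=150)
print(start)
```

Starting from the checkpoint just before the outage still rereads up to one checkpoint interval of already-processed data, so the time filter remains useful alongside it.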
We eventually recovered the lost user SQL logs with this scheme; it may serve as a reference for others.
Summary
The CAP theorem still applies to streaming and has not been beaten.
For streaming scenarios, which put such emphasis on high availability, guaranteeing strong data consistency depends on the replay capability of an external system, and for massive data the cost (storage and processing) is high.
In practice, when resources are limited, a workable tradeoff is to guarantee at-least-once consistency and to have a way to recover when stream processing runs into problems.
How to ensure data consistency in streaming processes