The principle behind Spark Streaming's exactly-once guarantee


Yesterday I read the article "Why is it hard for Spark Streaming + Kafka to guarantee exactly once?". I disagree with that author's understanding of exactly once, so I am writing this article to explain my own understanding of how Spark Streaming guarantees exactly-once semantics.

Exactly once is a whole-system guarantee

First of all, one very important truth: an exactly-once guarantee for the whole system can never be achieved by relying on a single component; the entire pipeline has to work together to achieve it.

For Spark Streaming, realizing exactly once requires guarantees from three parts of the system acting together:

Input source --> Spark Streaming computation --> Output operation

"Input source" for the implementation of exactly once: Kafka directly API is to solve the input source input data exactly once semantics;

The "Spark streaming" section of the exactly once Shi implementation: Use Wal guarantee (note I did not mention checkpoint and replication, since these two failover mechanisms are not specifically addressed exactly Once this problem).

"Output operation" for the implementation of exactly once: the need for output to guarantee power, this official document has been said to be more clear:

In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets (see Semantics of output operations in the main programming guide for further information).
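A minimal sketch of the "atomic transaction that saves results and offsets" option, assuming the direct API stream shown earlier; the transactional store (`Tx`, `withTransaction`, `upsertResults`, `saveOffsets`) is a hypothetical stand-in for whatever database client the application uses:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // Hypothetical transactional store, standing in for a real database client.
    trait Tx {
      def upsertResults(rows: Seq[Int]): Unit
      def saveOffsets(offsets: Seq[OffsetRange]): Unit
    }
    def withTransaction(body: Tx => Unit): Unit = ???  // begin/commit/rollback

    stream.foreachRDD { rdd =>
      // RDDs produced by the direct API carry their Kafka offset ranges.
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val results = rdd.map { case (_, v) => v.length }.collect()

      // Results and offsets are committed in ONE transaction: a replayed
      // batch either finds its offsets already stored (and skips the write)
      // or writes both together, so the output happens exactly once.
      withTransaction { tx =>
        tx.upsertResults(results)
        tx.saveOffsets(offsets)
      }
    }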
Of these three, exactly once for the "input source" deserves a dedicated article of its own and is not the core subject here. Exactly once for the "output operation" is easy to understand: the result of each independent batch just has to be written in a re-runnable (idempotent) way. So this article focuses on how the Spark Streaming core implements its part of exactly once.

How the Spark Streaming computation framework implements failover

The streaming computation logic itself is well documented elsewhere, so I will not repeat it here; I will focus on the failover-handling part of the framework.

The computation framework uses three mechanisms to implement its overall failover (a configuration sketch follows the list below):

1 Checkpointing (note that this is different from RDD checkpointing): implemented in the driver, to restore the driver's state after a driver crash;

2 Replication: implemented in the receiver, to prevent loss of unsaved data when a single executor goes down;

3 WAL: implemented in both the driver and the receiver, to solve two problems:

(1) When the driver goes down, all executors go down with it, so all unsaved data is lost and replication is of no use;

(2) After the driver goes down, the information about which blocks were registered with the driver and which blocks were assigned to the batch job running at the time is lost, so this metadata must be persisted via the WAL.
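Here is a hedged configuration sketch showing all three mechanisms together; the checkpoint path, host name, port, and batch interval are assumptions for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // assumed path

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("FailoverSketch")
        // Mechanism 3: write received blocks to a WAL under the checkpoint dir
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)  // mechanism 1: driver checkpointing
      // Mechanism 2: the "_2" storage level replicates receiver blocks
      // to a second executor.
      val lines = ssc.socketTextStream("host1", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
      lines.count().print()
      ssc
    }

    // After a driver crash, rebuild the driver from the checkpoint
    // instead of constructing a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()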

Of these, the third mechanism, the WAL, is the computation framework's real effort toward exactly once; it is discussed in detail below.

How the WAL solves the data-loss problem

The driver processes incoming data in the following steps:

1 addBlock: convert the input data into blocks and save them to the blockQueue; handled in the ReceiverTrackerEndpoint thread;

2 allocateBlocksToBatch: assign all currently unallocated blocks to a batch, then delete those blocks from the blockQueue; handled in the JobGenerator's eventLoop thread;

3 run the batch job over all the data assigned to that batch; handled in a JobHandler thread.
Each step deletes data from the previous step's data structure; therefore the first two steps each write to the WAL to persist their data.
Simply put, the data stored in the WAL looks like this:

    A: addBlock1 --> B: addBlock2 --> C: allocate all blocks to a batch, then delete them from the queue --> D: addBlock3 --> E: addBlock4 --> ...
If the driver crashes at any of these stages and then recovers, the WAL lets it restore the data to exactly its state at that moment.
For example:

1 Suppose the driver crashes right after C: the C record has been written to the WAL and the corresponding batch job has just started running. On restart, the driver finds A --> B --> C in the WAL and replays it: block1 and block2 are put into the blockQueue, then those two blocks are assigned to a batch job and deleted from the blockQueue. The data is thus restored to the scene just before the crash.

2 Suppose the driver crashes at E: after recovery, A, B, C, D, E are replayed in order. The result is that block1 and block2 are assigned to the batch job and executed, while block3 and block4 sit in the blockQueue waiting to be assigned to the next batch job. A toy model of this replay appears below.
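To make the replay concrete, here is a toy model of the driver-side WAL described above; it illustrates the idea only and is not Spark's actual internal code:

    // Toy WAL records: the two operations the driver persists.
    sealed trait WalRecord
    case class AddBlock(blockId: Int) extends WalRecord
    case class AllocateBlocksToBatch(batchTime: Long) extends WalRecord

    // Replaying the log rebuilds the blockQueue and the block-to-batch map.
    def replay(wal: Seq[WalRecord]): (Seq[Int], Map[Long, Seq[Int]]) = {
      var blockQueue = Vector.empty[Int]        // blocks not yet allocated
      var batches    = Map.empty[Long, Seq[Int]]
      wal.foreach {
        case AddBlock(id) =>
          blockQueue = blockQueue :+ id
        case AllocateBlocksToBatch(t) =>
          batches += (t -> blockQueue)          // assign all pending blocks...
          blockQueue = Vector.empty             // ...then clear the queue
      }
      (blockQueue, batches)
    }

    // Crash at C: blocks 1 and 2 end up assigned to the batch, queue empty.
    replay(Seq(AddBlock(1), AddBlock(2), AllocateBlocksToBatch(1000L)))
    // Crash at E: additionally, blocks 3 and 4 wait in the queue for the next batch.
    replay(Seq(AddBlock(1), AddBlock(2), AllocateBlocksToBatch(1000L),
               AddBlock(3), AddBlock(4)))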
A problem the computation framework cannot solve: a job that dies halfway through

There is still one problem with no framework-level solution: if a job crashes halfway through (for example, after writing part of its results to the database), then after the driver recovers, the job re-runs, but the first half of the previous run has already written partial results to the database. Solving this requires the "output operation" to be idempotent; that is not a problem Spark Streaming solves, the application must guarantee it itself, as sketched below.
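A minimal sketch of such an idempotent output, again assuming the direct API stream from earlier; `upsert` is a hypothetical placeholder for something like an SQL INSERT ... ON DUPLICATE KEY UPDATE:

    // Hypothetical idempotent write: re-running the same batch overwrites
    // the same rows instead of appending duplicates.
    def upsert(key: String, value: String): Unit = ???

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { case (key, value) =>
          // The key must be deterministic across re-runs (e.g. derived from
          // topic/partition/offset), so a half-written batch is simply
          // overwritten when the job re-runs.
          upsert(key, value)
        }
      }
    }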

Summary

What the input source must implement for exactly once: if part of the data was already delivered to Spark as block1 before the crash, then after recovery that same data must not be delivered into Spark again.

What the Spark Streaming computation framework must implement for exactly once: neither receiving input data nor assigning data to a batch job may lose anything, because turning incoming data into blocks and distributing blocks to a batch are two separate steps with no transaction spanning them (this, essentially, is also why the Kafka direct API had to be invented). For example, if only the blocks were persisted, but not which blocks should be assigned to which batch job, then after recovery it would be completely unknown which data belonged to the running batch job. The most fatal case: suppose two batch jobs are pending, job1 first and then job2. If job2 has already finished while job1 is still running when the crash happens, then without persisted block-to-batch allocation information, on recovery all the data would be assigned to one batch job and executed, which means job2's data would effectively be output twice.
All the Spark Streaming framework guarantees is that incoming data is not lost, and that the batch job executed after recovery is assigned exactly the same data (both in content and in size) as the batch job that was running before the crash (concretely, this is what the WAL implements). Whether the input source re-sends data to the Spark Streaming framework is entirely beyond the framework's control.

In general, exactly once in the overall Spark Streaming architecture is achieved by the "input source", "the computation framework itself", and the "output operation" working in coordination. Without the Kafka direct API, even a framework that implements the WAL cannot provide true exactly-once semantics. (In fact, the Kafka direct API can already achieve exactly-once semantics without the WAL at all, but that implementation effectively merges the input source into the computation framework.)

