Contents of this issue:
1 Exactly-once fault tolerance
2 Ensuring output is not duplicated
All data that cannot be processed as a real-time stream is invalid data. In the stream-processing era, Spark Streaming has strong appeal and promising prospects; combined with Spark's ecosystem, it can easily call on other powerful frameworks such as Spark SQL and MLlib, so it is bound to rise to prominence.
The Spark Streaming runtime is not so much a streaming framework on top of Spark Core as one of the most complex applications running on Spark Core. If you can master an application as complex as Spark Streaming, other complex applications are easy by comparison. Choosing Spark Streaming as the starting point for a custom version is therefore a natural choice.
We all know how important it is that data be processed once and only once, and output once and only once. So how does Spark guarantee this? This section discusses the specific scenarios where problems may arise and provides solutions.
One: Exactly once
1 Transactional processing: data-source safety
From the executor's perspective: when the receiver receives data from Kafka, it first writes the data to memory (or disk) via the BlockManager, or persists it through the WAL to guarantee its safety; the executor generates an ACK signal only after replication completes.
From Kafka's perspective: only once that acknowledgement is confirmed, and the next batch of data is about to be read, are the Kafka offsets updated (updateOffsets).
From the WAL mechanism's perspective: the WAL writes all received data to an HDFS-like fault-tolerant store before it is processed, which addresses the data loss that would otherwise be caused by an executor crash.
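The receiver-plus-WAL path described above can be sketched outside of Spark as follows. This is an illustrative simulation, not Spark's actual receiver or `ReceivedBlockHandler` API: the key point is that each block is durably logged *before* it is acknowledged, so a restarted process can replay everything that was acked.

```python
# Illustrative sketch (not Spark's real API): a receiver appends each
# block to a write-ahead log *before* acknowledging it, so a crashed
# executor can replay the logged blocks on restart.
import json
import os
import tempfile

class WriteAheadLog:
    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Durably log the record before it is acknowledged upstream.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # On recovery, re-read every logged block.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]

def receive(wal, records):
    acked = []
    for r in records:
        wal.append(r)    # 1. persist to the WAL first
        acked.append(r)  # 2. only then acknowledge upstream
    return acked

path = os.path.join(tempfile.mkdtemp(), "wal.log")
receive(WriteAheadLog(path), [{"offset": 0, "value": "a"},
                              {"offset": 1, "value": "b"}])
# After a simulated crash, a fresh process recovers both blocks:
recovered = WriteAheadLog(path).replay()
print(len(recovered))  # → 2
```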
2 Output safety after transactional processing
From the driver's point of view, ensuring output safety on the driver side rests on two points:
First, checkpoint-based fault tolerance;
Second, lineage-based fault tolerance.
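These two driver-side recovery paths can be sketched together. The code below is a hypothetical simulation, not Spark's checkpoint or RDD API: a checkpoint snapshots computed state periodically, and lineage (a deterministic transformation chain over replayable input) recomputes only what came after the last snapshot.

```python
# Illustrative sketch of checkpoint- plus lineage-based recovery.
# All names here are hypothetical; the point is that a checkpoint
# bounds how much the deterministic lineage must recompute.
def process_batch(batch_id, data, state):
    state = dict(state)
    state["total"] += sum(data)  # deterministic transformation
    state["last_batch"] = batch_id
    return state

source = {0: [1, 2], 1: [3, 4], 2: [5]}  # replayable input batches
state = {"last_batch": -1, "total": 0}
for bid in (0, 1):
    state = process_batch(bid, source[bid], state)
checkpoint = state  # snapshot taken after batch 1
state = process_batch(2, source[2], state)

# Driver crashes; restore the checkpoint, then use lineage to
# recompute only the batches after it (here, batch 2).
recovered = checkpoint
for bid in range(recovered["last_batch"] + 1, 3):
    recovered = process_batch(bid, source[bid], recovered)
print(recovered["total"])  # → 15, same as before the crash
```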
3 Spark's exactly-once transaction processing
In view of the above two points, we know that to guarantee zero data loss we must have a reliable data source and reliable data reception, the entire application must be checkpointed, and data must be secured through the WAL. To avoid the WAL's performance loss while still achieving exactly-once semantics, Spark Streaming provides the Kafka Direct API. It treats Kafka as a file-storage system, combining the advantages of streaming with the advantages of a file system: every executor consumes data directly through the Kafka Direct API and manages the offsets itself, so there is neither a performance penalty nor repeated consumption of data.
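The Direct-API idea can be sketched as follows. This is a simulation, not the real Kafka client: each batch is defined by an explicit offset range, so a replayed batch deterministically reads exactly the same records, and the offsets are committed together with the output.

```python
# Illustrative sketch of direct, offset-range-based consumption.
# A Kafka partition is modeled as a plain list; "committed" stands in
# for a single store holding both the output and the consumed offset.
log = ["m0", "m1", "m2", "m3", "m4"]      # one partition's messages
committed = {"offset": 0, "results": []}  # offsets + output kept together

def run_batch(batch_size):
    start = committed["offset"]
    until = min(start + batch_size, len(log))
    records = log[start:until]  # deterministic: same range, same data
    # Commit output and the new offset atomically (one dict update
    # here; a real system would use one transaction on one store).
    committed.update(offset=until,
                     results=committed["results"] + records)
    return start, until

run_batch(2)  # consumes m0, m1
run_batch(2)  # consumes m2, m3
print(committed["offset"], committed["results"])
# → 4 ['m0', 'm1', 'm2', 'm3']
```

Because the offset advances only together with the output, a batch that fails before the commit is simply re-run over the same range, with no loss and no duplication.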
Two: Data output is not duplicated
Avoiding duplicated output is also an important issue in production environments.
1 Why does duplicated output occur?
Task retry
Speculative execution of slow tasks
Stage retry
Job retry
2 Solutions
2.1 For the retry problems of jobs, stages, and tasks: so that a single task failure fails the whole job (rather than being silently re-executed), set spark.task.maxFailures to 1;
2.2 Set spark.speculation to false to disable speculative execution of slow tasks;
2.3 If Spark Streaming is combined with Kafka and a job fails, set the Kafka parameter auto.offset.reset to largest, so consumption resumes automatically from the latest offset.
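The three settings above can be gathered in one place. The property names below are real Spark/Kafka keys taken from the text; note that the value "largest" applies to the old Kafka 0.8-style consumer, while newer Kafka consumers use "latest" instead.

```python
# The settings discussed above, as plain key/value configuration.
spark_conf = {
    "spark.task.maxFailures": "1",  # one task failure fails the job: no task retry
    "spark.speculation": "false",   # no speculative re-execution of slow tasks
}
kafka_params = {
    # On newer Kafka consumers this value is spelled "latest".
    "auto.offset.reset": "largest",  # after a failed job, resume from the newest offset
}
print(spark_conf["spark.task.maxFailures"], kafka_params["auto.offset.reset"])
```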
Final summary
Based on your business logic, you can use transform and foreachRDD for logic-level control to achieve non-repeated consumption and non-duplicated output of data! These two operators are like back doors into Spark Streaming and can be manipulated in almost any way you can conceive!
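The business-level control mentioned above usually means making the output idempotent inside foreachRDD. A minimal sketch of that idea, with a hypothetical key-value sink rather than any real database API: each record is written under a unique (batch id, index) key, so a retried batch overwrites the same keys instead of appending duplicates.

```python
# Illustrative sketch of idempotent output: writes are keyed by
# (batch_id, index), so replaying a batch after a retry upserts the
# same keys rather than producing duplicate rows.
sink = {}  # stands in for a key-value store with upsert semantics

def foreach_batch(batch_id, records):
    for i, rec in enumerate(records):
        sink[(batch_id, i)] = rec  # upsert: a retry cannot duplicate

foreach_batch(7, ["a", "b"])
foreach_batch(7, ["a", "b"])  # the same batch retried: no duplicates
print(len(sink))  # → 2
```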
Spark Version Customization, Day 4: Exactly-once transaction processing