This article approaches the topic from two main angles.
Contents of this session:
1. Exactly-once semantics
2. Non-duplicated output
1. Exactly-once semantics
Transactions:
Take a bank transfer as an example: user A transfers money to user B. If B never receives the money, or receives it more than once, the consistency of the transaction is broken. A transaction must be handled and processed exactly once: A is debited exactly once and B is credited exactly once.
Decoding the Spark Streaming architecture from a transactional perspective:
Once a Spark Streaming application has started and been allocated resources, it generally runs without problems unless the cluster hardware itself fails. A Spark Streaming program consists of two parts: the Driver and the Executors. The Receiver receives the data and sends its metadata to the Driver; when the Driver receives that metadata, it checkpoints it. The checkpoint contains the configuration (SparkConf and the Spark Streaming configuration), the block metadata, the DStreamGraph, and the unhandled and waiting jobs. The Receiver can of course run, like the jobs, across multiple Executor nodes, with job execution scheduled entirely by Spark Core.
The Executors hold only the processing logic and the data. The external input stream flows into the Receiver, which writes it through the BlockManager to disk, memory, and the WAL for fault tolerance. Because the WAL is written to disk before the data is handed on to the Executor, the likelihood of failure is small. However, if, say, 1 GB of data is to be processed, the Executor receives it record by record while the Receiver accumulates a certain number of records before writing them to the WAL, so if the Receiver thread fails before that write, the data can still be lost.
The Driver checkpoints the metadata before processing it. Spark Streaming obtains the data and generates the jobs, but that alone does not execute anything; execution must go through the SparkContext. Recovering at the Driver level means reading the data back from the Driver checkpoint, internally rebuilding the SparkContext, the StreamingContext, and the Spark jobs, and then resubmitting them to the Spark cluster. Receiver recovery is done from disk through the WAL.
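As a minimal sketch of the Driver-side recovery just described, the snippet below enables checkpointing and rebuilds the StreamingContext from the checkpoint on restart via StreamingContext.getOrCreate. The HDFS path, host name, and port are hypothetical, and the trivial count-and-print pipeline stands in for real business logic.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  // Hypothetical checkpoint directory; in practice this is an HDFS-like, fault-tolerant path.
  val checkpointDir = "hdfs://namenode:8020/spark/checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointRecoverySketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Driver metadata (configuration, block metadata, DStreamGraph, pending jobs) is checkpointed here.
    ssc.checkpoint(checkpointDir)
    val lines = ssc.socketTextStream("host", 9999) // hypothetical source, for illustration only
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a fresh start this calls createContext; after a Driver crash it rebuilds the
    // StreamingContext from the checkpoint so unfinished batches can be re-run.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```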
When Spark Streaming is combined with Kafka, the WAL data-loss problem does not arise; Spark Streaming also has to take the external data pipeline into account.
How can we achieve complete semantics and transactional consistency, that is, guarantee zero data loss and exactly-once processing?
1. How do we guarantee zero data loss?
There must be a reliable data source and a reliable Receiver, the metadata of the whole application must be checkpointed, and data safety must be guaranteed through the WAL. (In a production environment where the Receiver receives data from Kafka, there are by default two copies of the data in the Executors; the data must be replicated into two copies by default. If the Receiver crashes while receiving data and no replica exists yet, the data is copied again, and that copy is driven by the metadata kept in ZooKeeper.)
You can regard Kafka as a simple file-storage system. On the Executor side, the Receiver confirms that every record received from Kafka has been replicated to another Executor, then sends an ACK back to Kafka, and only then reads the next message from Kafka.
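A spark-shell-style sketch of this receiver-based setup follows: checkpointing plus the write-ahead log, with a receiver-based Kafka stream (the 0.8-style spark-streaming-kafka API). The ZooKeeper address, consumer group, topic name, and checkpoint path are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("ReliableReceiverSketch")
  // Persist received blocks to a write-ahead log under the checkpoint directory.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint") // WAL and metadata live here (hypothetical path)

// Receiver-based Kafka stream; with the WAL enabled a single in-memory/disk copy is enough,
// otherwise the default MEMORY_AND_DISK_SER_2 keeps two copies across Executors.
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zookeeper:2181",        // hypothetical ZooKeeper quorum
  "demo-consumer-group",   // hypothetical consumer group
  Map("events" -> 1),      // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER)
```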
2. Driver fault tolerance, as shown in the figure:
Think again: where else might data be lost?
The main data-loss scenario is the following:
The Receiver has received the data and the Driver has scheduled it, and the Executors have started to compute on it, when the Driver suddenly crashes, which causes the Executors to be killed. The data held in those Executors is then lost. That is why a WAL-like mechanism is needed: all data must pass through HDFS-like, fault-tolerant storage so that data lost when an Executor is killed can be recovered through the WAL.
Two further important scenarios need to be considered:
How do we guarantee that the data is processed only once?
Zero data loss does not by itself guarantee exactly-once processing: if the Receiver receives data but does not get to update the offsets in time, the data will be processed again.
A more detailed description of the scenario in which data is read repeatedly:
The Receiver receives data and persists it to HDFS, but it crashes before the persistence engine has had time to update the offsets. When the Receiver restarts, it reads the metadata again from the ZooKeeper instance that manages the Kafka offsets, and therefore reads the same data a second time. From Spark Streaming's point of view the read succeeded, but Kafka regards it as a failure (because the crashed Receiver never updated the offsets in ZooKeeper) and lets the data be consumed once more, so the data is consumed repeatedly.
A supplement on performance:
1. The WAL guarantees that data is not lost, but its drawback is that writing through the WAL badly hurts the throughput of the Receiver in Spark Streaming (in current production environments the Kafka Direct API is usually used instead).
2. Note that if Kafka is used as the data source, the data already resides in Kafka; when the Receiver then receives it, another copy of the data is created, which is actually a waste of storage. (To resolve repeated reads, the metadata of the data can be written to an in-memory database at read time and checked to see whether it has already been computed; a sketch of this idea follows.)
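The sketch below illustrates that de-duplication idea under a simplifying assumption: a driver-side concurrent set stands in for the in-memory database (in practice this would be an external store such as Redis so it survives Driver restarts). The registry object and the batch-time key are illustrative choices, not the course's code.

```scala
import scala.collection.concurrent.TrieMap

// Records which batches have already been processed, so a replayed batch can be skipped.
object ProcessedBatchRegistry {
  private val processed = TrieMap.empty[Long, Boolean]

  // Returns true only the first time a given batch time is seen.
  def markIfNew(batchTimeMs: Long): Boolean =
    processed.putIfAbsent(batchTimeMs, true).isEmpty
}

// Usage inside the streaming program (sketch):
// dstream.foreachRDD { (rdd, time) =>
//   if (ProcessedBatchRegistry.markIfNew(time.milliseconds)) {
//     rdd.foreachPartition { records => /* write to the sink */ }
//   } // otherwise this batch was already handled before a crash; skip it
// }
```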
In Spark 1.3, to avoid the WAL performance loss and to implement exactly-once processing, the Kafka Direct API was introduced, treating Kafka as a file-storage system. Kafka then combines the advantages of a stream with the advantages of a file system, and Spark Streaming plus Kafka builds a complete stream-processing world!
The data needs no extra copy and no WAL with its performance cost, and no Receiver is required: all Executors consume the data directly through the Kafka Direct API and manage the offsets themselves, so the data is not consumed repeatedly and transactional behavior is achieved! (A sketch of the direct approach follows.)
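A minimal sketch of the receiver-less direct approach (Spark 1.3+), assuming the spark-streaming-kafka artifact for Kafka 0.8; the broker address and topic name are purely illustrative.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectKafkaSketch")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // hypothetical broker
val topics = Set("events")                                     // hypothetical topic

// No Receiver and no WAL: each Executor reads its Kafka partitions directly,
// and the offsets are tracked by Spark Streaming itself (recoverable via checkpoint).
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
```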
2. Non-duplicated output
Why does this problem arise? Because Spark Streaming computes on top of Spark Core, and Spark Core naturally does the following things, which can lead to Spark Streaming results being (partially) output more than once:
1. Task retries;
2. Slow-task speculation;
3. Stage retries;
4. Job retries;
any of which can cause results to be output repeatedly.
Corresponding solutions:
1. Make a single task failure count as a job failure: set spark.task.maxFailures to 1;
2. Set spark.speculation to false (slow-task speculation is very performance-intensive anyway, so turning it off can also noticeably improve Spark Streaming performance);
3. For Spark Streaming on Kafka, if a job fails, set the Kafka parameter auto.offset.reset to largest so that execution automatically resumes. A configuration sketch of these settings follows this list.
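The snippet below simply collects the three settings named above into a spark-shell-style configuration sketch; the values shown are the ones the text recommends, and the broker address is hypothetical.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("NoDuplicateOutputSketch")
  .set("spark.task.maxFailures", "1")  // 1. a single task failure fails the whole job
  .set("spark.speculation", "false")   // 2. disable slow-task speculation

// 3. For the 0.8-style Kafka consumer, resume from the latest offset after a failed job:
val kafkaParams = Map(
  "metadata.broker.list" -> "broker:9092", // hypothetical broker address
  "auto.offset.reset"    -> "largest")
```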
Finally, we emphasize once again:
You can use transform and foreachRDD, driven by your business-logic code, to control the processing yourself and ensure that the data is neither consumed repeatedly nor output repeatedly! These two operators are like Spark's back doors: they let you manipulate the underlying RDDs in almost any way you can think of (see the sketch below).
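As one possible illustration of the foreachRDD side of this idea, the sketch below writes each partition through a single connection using an idempotent upsert, so that a replayed batch overwrites rather than duplicates its output. The in-memory H2 JDBC URL, the events table, and the id key column are all hypothetical stand-ins for a real sink.

```scala
import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

// Writes a DStream of (id, payload) pairs idempotently: re-running a batch
// produces the same rows instead of duplicates, because the write is keyed on id.
def writeIdempotently(dstream: DStream[(String, String)]): Unit = {
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val conn = DriverManager.getConnection("jdbc:h2:mem:sink") // hypothetical sink
      try {
        val stmt = conn.prepareStatement(
          "MERGE INTO events (id, payload) KEY (id) VALUES (?, ?)") // upsert keyed on id
        records.foreach { case (id, payload) =>
          stmt.setString(1, id)
          stmt.setString(2, payload)
          stmt.executeUpdate()
        }
      } finally conn.close()
    }
  }
}
```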
Note:
Material from: DT Big Data DreamWorks (Spark custom course).
For more exclusive content, please follow the public account: DT_Spark.
If you are interested in big data and Spark, you can listen free of charge to teacher Liaoliang's permanently free public Spark class, held every evening at 20:00, YY room number: 68917580.
Spark Customization Class 4: Complete mastery of Spark Streaming's exactly-once transactions and non-duplicated output.