First, what is a transaction, and what does its consistency mean?
A transaction has four properties: atomicity, consistency, isolation, and durability, commonly known as the ACID properties.
Atomicity. A transaction is an indivisible unit of work: the operations it contains are either all performed or none are.
Consistency. A transaction must take the database from one consistent state to another. Consistency is closely related to atomicity.
Isolation. The execution of one transaction cannot be interfered with by other transactions. That is, the operations inside a transaction and the data it uses are isolated from other concurrently executing transactions; concurrent transactions cannot interfere with one another.
Durability. Durability, also called permanence, means that once a transaction is committed, its changes to the data in the database are permanent. Subsequent operations or failures should not have any effect on them.
Take a bank transfer as an example: customer A transfers 10,000 yuan to customer B once. Normally, A's account is debited exactly once, by 10,000 yuan, and B's account receives the transfer exactly once, also 10,000 yuan. This is the concrete embodiment of a transaction and its consistency: the data is processed, and processed correctly, exactly once.
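The transfer example can be sketched as a small simulation. This is an illustrative, in-memory model of atomicity (not a real banking or database API); the account names, the `transfer` helper, and the rollback-by-snapshot approach are all hypothetical.

```python
# Toy illustration of an atomic transfer: either both the debit and the
# credit happen, or neither does (rollback on any failure).
class InsufficientFunds(Exception):
    pass

def transfer(accounts, src, dst, amount):
    """Debit src and credit dst as one all-or-nothing unit."""
    snapshot = dict(accounts)          # remember state for rollback
    try:
        if accounts[src] < amount:
            raise InsufficientFunds(src)
        accounts[src] -= amount
        accounts[dst] += amount        # total balance is preserved
    except Exception:
        accounts.clear()
        accounts.update(snapshot)      # roll back: all-or-nothing
        raise

accounts = {"A": 10000, "B": 0}
transfer(accounts, "A", "B", 10000)    # A is debited once, B credited once
```

A failed transfer leaves both accounts exactly as they were, which is the "processed correctly exactly once" property the example describes.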
However, transaction processing in Spark Streaming differs from the transaction and consistency in the case above: a Spark Streaming transaction focuses on the consistency of a job's execution.
This lecture explores the Spark Streaming architecture and mechanism from a transactional perspective.
Once a Spark Streaming application has started and been allocated resources, it generally runs without problems unless the cluster's hardware fails. A Spark Streaming program is divided into two parts: the driver and the executors. The receiver receives data and sends the metadata to the driver; after receiving the metadata, the driver checkpoints it. The checkpoint includes: configuration information (including SparkConf and Spark Streaming configuration), block metadata, the DStreamGraph, and unhandled and waiting jobs. Receivers can, of course, run on multiple executor nodes, and job execution is entirely based on the Spark Core scheduling model.
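The driver-side checkpoint idea can be sketched as follows. This is a minimal simulation, not Spark's actual checkpoint format: the file name and metadata fields are hypothetical, and a real Spark checkpoint serializes the configuration, DStreamGraph, and pending jobs to a reliable store such as HDFS.

```python
# Minimal sketch: persist driver metadata so a restarted driver process
# can rebuild its state from the last checkpoint.
import json, os, tempfile

def checkpoint(path, metadata):
    # Write atomically (temp file + rename) so a crash mid-write never
    # leaves a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(metadata, f)
    os.replace(tmp, path)

def recover(path):
    with open(path) as f:
        return json.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "driver.ckpt")
checkpoint(ckpt, {"batch_time": 1000,
                  "pending_jobs": ["job-1"],
                  "blocks": ["blk-0"]})
state = recover(ckpt)   # a restarted driver would start from this state
```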
Executors only run the processing logic on the data. The external input stream flows into the receiver, which writes it via the BlockManager to disk or memory, with a WAL for fault tolerance. The WAL is written to disk first, and only then is the data handed on, so the likelihood of loss there is small. However, suppose 1 GB of data is to be processed: the receiver accumulates records up to a certain threshold before writing them to the WAL, so if the receiver thread fails before that write, the buffered data is likely to be lost.
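The write-ahead-log mechanism described above can be modeled in a few lines. This is an illustrative sketch of the write-then-acknowledge ordering, not Spark's actual WAL implementation; the `Receiver` class and file layout are hypothetical.

```python
# Simplified WAL model: each record is appended to durable storage
# BEFORE it is buffered in memory, so records lost with a crashed
# receiver can be replayed from the log.
import os, tempfile

class Receiver:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memory = []                      # in-memory block buffer

    def receive(self, record):
        with open(self.wal_path, "a") as wal: # 1. write to the WAL first
            wal.write(record + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self.memory.append(record)            # 2. only then buffer it

    @staticmethod
    def recover(wal_path):
        # After a crash, a restarted receiver replays the WAL from disk.
        with open(wal_path) as wal:
            return [line.rstrip("\n") for line in wal]

wal = os.path.join(tempfile.mkdtemp(), "receiver.wal")
r = Receiver(wal)
for rec in ["r1", "r2", "r3"]:
    r.receive(rec)
# Simulate a crash: the in-memory buffer is gone, but the WAL survives.
recovered = Receiver.recover(wal)
```

Note that if the receiver instead buffered records and wrote the WAL in batches (as the text describes), anything buffered but not yet logged at crash time would be lost.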
The driver checkpoints metadata before processing: Spark Streaming gets the data and generates jobs, but that alone does not execute anything; execution must go through the SparkContext. Data recovery at the driver level requires reading from the driver checkpoint, internally rebuilding the SparkContext, StreamingContext, and Spark jobs, and then resubmitting them to the Spark cluster. Receiver recovery recovers from disk through the WAL.
Combining Spark Streaming with Kafka avoids the WAL data-loss problem, but Spark Streaming then has to take an external pipelining approach.
The illustration above explains well how complete semantics and transactional consistency are achieved: how is zero data loss guaranteed, and how is exactly-once transaction processing achieved?
A. How do we guarantee zero data loss?
There must be a reliable data source and a reliable receiver, the entire application's metadata must be checkpointed, and data safety must be ensured through the WAL. (When a receiver receives Kafka data in a production environment, by default two copies of the data exist across executors; two copies must be backed up by default. If the receiver crashes while receiving data and no copy exists, the data is re-read from Kafka, based on the ZooKeeper metadata.)
You can think of Kafka as a simple file-storage system. In the executor, the receiver confirms that each record received from Kafka has been replicated to another executor, then sends an acknowledgment (ACK) to Kafka and continues reading the next message.
B. Driver fault tolerance, as shown:
Think again: where might data be lost?
The main scenarios for data loss are as follows:
The receiver receives the data and, via the driver's scheduling, an executor starts to compute on it; then the driver suddenly crashes, which causes the executor to be killed, and the data in the executor is lost. Therefore, a mechanism such as the WAL must first pass all the data through HDFS-like, safe, fault-tolerant storage, so that data lost when an executor is killed can be restored through the WAL.
Here are two important scenarios to consider:
How do we guarantee that data is processed exactly once?
Zero data loss does not guarantee exactly-once processing: if the receiver receives data but crashes before updating the offsets (updateOffsets), the data will be processed repeatedly.
A more detailed description of the scenario where the data is read repeatedly:
The receiver crashes after receiving data and saving it to HDFS, but before the persistence engine has updated the offsets. When the receiver restarts, it reads the metadata again from the ZooKeeper instance managing Kafka, resulting in a repeated read. From Spark Streaming's point of view the read was successful, but Kafka considers it a failure (because the receiver crashed before updating the offsets to ZooKeeper) and re-delivers the data on recovery, which results in re-consumption.
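The duplicate-read scenario can be replayed in a toy model. All names here are hypothetical stand-ins: `source` plays the Kafka topic, `committed_offset` plays the offset stored in ZooKeeper, and `saved` plays the records persisted to HDFS.

```python
# Toy replay of the duplicate-read scenario: records are saved, but the
# receiver crashes before committing the new offset, so the restarted
# receiver re-reads (and re-saves) the same records.
source = ["m0", "m1", "m2", "m3"]

committed_offset = 0   # last offset committed to ZooKeeper
saved = []             # records persisted to HDFS

# First run: saves two records but crashes BEFORE committing offsets.
for off in range(committed_offset, 2):
    saved.append(source[off])
# -- crash here: committed_offset is still 0 --

# Restart: consumption resumes from the stale committed offset,
# so m0 and m1 are saved a second time.
for off in range(committed_offset, len(source)):
    saved.append(source[off])
```

The end state contains m0 and m1 twice, which is exactly the re-consumption the text describes.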
Performance notes:
A. The WAL approach ensures that data is not lost, but its disadvantage is that it significantly hurts the receiver's data-ingestion performance in Spark Streaming (in current production environments, the Kafka direct API is usually used for processing instead).
B. Note that if Kafka is used as the data source, the data already exists in Kafka; when the receiver then accepts it, another copy is made, which is actually a waste of storage resources. (To resolve repeated reads: when the data is read, put its metadata into an in-memory database, and on each read check whether that metadata has already been processed.)
To avoid the WAL's performance loss and to implement exactly-once processing, Spark 1.3 introduced the Kafka direct API, treating Kafka as a file-storage system. Kafka then combines the advantages of a stream with the advantages of a file system; with this, Spark Streaming + Kafka builds the perfect stream-processing world!
The data needs no extra copies, there is no WAL performance loss, and no receiver is needed: all executors consume data directly through the Kafka direct API and manage the offsets themselves. Therefore data is not consumed repeatedly, and transactional processing is achieved!
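The key idea behind direct offset management can be sketched without Spark or Kafka. This is a simplified, in-memory model under stated assumptions: `source` stands in for a Kafka partition, and `store` stands in for an output store where the result and the offset are committed together in one atomic step, so a replayed batch cannot produce duplicate output.

```python
# Sketch of direct offset management: the consumer tracks offsets itself
# and commits the output AND the advanced offset atomically, so crashes
# and replays cannot duplicate output.
source = ["m0", "m1", "m2", "m3"]

store = {"offset": 0, "results": []}   # result and offset live together

def process_batch(batch_size):
    start = store["offset"]
    batch = source[start:start + batch_size]
    # One atomic update: if we crash before this line, the stale offset
    # means the *same* batch is simply reprocessed on restart.
    store.update({"offset": start + len(batch),
                  "results": store["results"] + batch})

process_batch(2)
process_batch(2)   # replaying process_batch after a crash is safe
```

Because the offset only advances together with the output, re-running a batch after a failure either redoes exactly the uncommitted batch or is a no-op, never a duplicate.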
Repeated output of Spark Streaming results, and its solution
A. Why does this problem occur? Because Spark Streaming computes on top of Spark Core, and Spark Core naturally does the following things, which cause Spark Streaming's results to be (partially) output repeatedly:
Task retry;
Slow-task speculation;
Stage retry;
Job retry.
B. Specific solutions:
Set spark.task.maxFailures (the maximum number of allowed failures) to 1; with it set to 1, there are no task, stage, or job retries.
Set spark.speculation to off (slow-task speculation is actually very performance-intensive, so turning it off can significantly improve Spark Streaming's processing performance).
With Spark Streaming on Kafka, a task failure will cause the job to fail; when the job fails, set auto.offset.reset to "largest".
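The settings above can be collected in one place. The property names (spark.task.maxFailures, spark.speculation, auto.offset.reset) are the ones named in the text; holding them in plain dictionaries is just an illustration of how they would be passed to a SparkConf and to the Kafka consumer parameters, not actual Spark code.

```python
# The anti-retry settings from the text, gathered as plain dictionaries.
spark_conf = {
    "spark.task.maxFailures": "1",   # at most one attempt: no task/stage/job retries
    "spark.speculation": "false",    # turn off slow-task speculation
}
kafka_params = {
    "auto.offset.reset": "largest",  # on job failure, resume from the latest offset
}
```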
Finally, we emphasize once again:
Based on the business-logic code, you can use transform and foreachRDD for logical control to achieve non-repeated consumption of data and non-repeated output! These two methods are like Spark's back door and can be manipulated in any conceivable way!
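The idempotent-output pattern behind foreachRDD can be sketched as follows. This is a hypothetical, in-memory model: `written_batches` stands in for a durable metadata store, and `write_batch` stands in for the per-batch output function you would pass to foreachRDD; before writing a batch's results, it checks whether that batch id was already written and skips it if so.

```python
# Idempotent output sketch: a retried job re-submits the same batch id,
# but the sink writes each batch at most once.
output = []
written_batches = set()    # stands in for a durable metadata store

def write_batch(batch_id, records):
    if batch_id in written_batches:   # batch already output: skip it
        return False
    output.extend(records)
    written_batches.add(batch_id)     # record the batch as written
    return True

write_batch(1, ["a", "b"])
write_batch(1, ["a", "b"])   # a retried job re-submits the same batch
write_batch(2, ["c"])
```

Even if the framework retries a job (and hence re-runs the output step), each batch reaches the sink exactly once.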
4. Spark Streaming Transaction Processing