Contents of this issue:
- Executor WAL
- Message Replay
Looking at Spark Streaming as a whole from the perspective of data safety:
1, Spark Streaming receives data continuously, keeps generating jobs, and keeps submitting those jobs to the cluster to run, so the most important issue is the safety of the received data.
2, Spark Streaming is built on Spark Core, so when an error or failure occurs during a run, Spark Streaming can recover automatically through the RDD fault tolerance of Spark Core, on the premise that the data itself is safe and reliable.
Therefore what matters most is how the executor handles the data it receives, because fault tolerance rests on the safety of that data. Fault tolerance at the scheduling level is largely handled by Spark Core,
so fault tolerance on the executor is mainly fault tolerance of the data; at computation time Spark Streaming relies on the RDD fault tolerance of Spark Core.
Fault tolerance for data safety:
1, The most natural approach is replication: make a copy of the data first, before processing it.
2, Do not replicate when receiving data; instead rely on a data source that supports replay, so the data can be read repeatedly. For example, if reading the data of the past 10 seconds fails, the data of those 10 seconds can be read again.
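The second approach can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ReplayableSource`, `process_with_replay`, and `flaky_handler` are hypothetical names invented here, the source is an in-memory list, and nothing in it is Spark or Kafka code. It shows that when the source keeps its history, a failed batch needs no local copy, because the same range can simply be read again.

```python
from typing import Callable, List


class ReplayableSource:
    """Hypothetical in-memory stand-in for a source that keeps its history."""

    def __init__(self, records: List[str]) -> None:
        self._records = list(records)

    def read(self, start: int, end: int) -> List[str]:
        # Reads are non-destructive, so the same range can be read again.
        return self._records[start:end]


def process_with_replay(
    source: ReplayableSource,
    start: int,
    end: int,
    handler: Callable[[List[str]], List[str]],
    max_attempts: int = 3,
) -> List[str]:
    """Re-read the same range from the source until the handler succeeds."""
    last_error = None
    for _ in range(max_attempts):
        batch = source.read(start, end)
        try:
            return handler(batch)
        except RuntimeError as err:
            last_error = err  # nothing was kept locally; just read again
    raise RuntimeError("batch failed after replay") from last_error


source = ReplayableSource([f"event-{i}" for i in range(10)])
attempts = {"count": 0}


def flaky_handler(batch: List[str]) -> List[str]:
    # Fails on the first call to simulate losing the executor mid-batch.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated executor failure")
    return [record.upper() for record in batch]


result = process_with_replay(source, 0, 3, flaky_handler)
```

The trade-off is that the handler may see the same records more than once, so it has to tolerate reprocessing.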
First, the executor WAL
Spark Core's BlockManager is responsible for reading and writing data on each specific executor, and is a master-slave structure.
The StorageLevel controls backup, which is carried out with Spark's underlying storage system, BlockManager.
1. BlockManagerBasedBlockHandler: the replication mechanism
2. WriteAheadLogBasedBlockHandler: the WAL (write-ahead log) mode
A log is written in a specific directory, and later failures can be recovered from that log. The directory the log is written to must be configured:
it lives under the checkpoint directory, and there can be many such directories. StreamingContext.checkpoint sets the specific directory for a context.
The directory is typically placed on HDFS. The advantage is safety, since HDFS keeps multiple replicas; the disadvantage is the performance cost and the extra storage space.
In this mode the data is put in both the WAL and BlockManager:
The executor writes the data sequentially. Because the log is used as a WAL, the data is never modified afterwards, and reads generally go by index rather than by a full scan, so reading is very fast.
3. Specific implementation: management of the concrete WAL files, which covers writing files periodically, producing output as files are written, and cleaning up old files
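The properties listed above (sequential writes, reads by index instead of a scan, periodic new files, cleanup of old files) can be illustrated with a toy append-only log. This is a minimal sketch and not Spark's implementation: `SimpleWriteAheadLog`, its `Handle` tuple, and the `records_per_file` parameter are assumptions made for the example, and it starts a new file after a fixed number of records purely to keep the code short.

```python
import os
import struct
import tempfile
from typing import List, Tuple

Handle = Tuple[str, int, int]  # (file path, byte offset, payload length)


class SimpleWriteAheadLog:
    """Toy append-only log for illustration; not a Spark class."""

    def __init__(self, directory: str, records_per_file: int = 2) -> None:
        self.directory = directory
        self.records_per_file = records_per_file  # assumption: roll by count
        self._file_index = 0
        self._records_in_file = 0

    def write(self, payload: bytes) -> Handle:
        # Start a new file once the current one is full.
        if self._records_in_file >= self.records_per_file:
            self._file_index += 1
            self._records_in_file = 0
        path = os.path.join(self.directory, f"log-{self._file_index:04d}")
        with open(path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            # Length prefix followed by the payload, always appended.
            f.write(struct.pack(">I", len(payload)))
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # make the record durable before returning
        self._records_in_file += 1
        return (path, offset, len(payload))

    def read(self, handle: Handle) -> bytes:
        # Indexed read: seek straight to the record, no scan of the file.
        path, offset, length = handle
        with open(path, "rb") as f:
            f.seek(offset + 4)  # skip the 4-byte length prefix
            return f.read(length)

    def clean(self, keep_from_index: int) -> List[str]:
        # Delete whole files that are older than the given file index.
        removed = []
        for name in sorted(os.listdir(self.directory)):
            if int(name.split("-")[1]) < keep_from_index:
                os.remove(os.path.join(self.directory, name))
                removed.append(name)
        return removed


with tempfile.TemporaryDirectory() as tmp:
    wal = SimpleWriteAheadLog(tmp)
    handles = [wal.write(f"block-{i}".encode()) for i in range(5)]
    recovered = [wal.read(h) for h in handles]
    removed = wal.clean(keep_from_index=2)
    remaining = sorted(os.listdir(tmp))
```

The `fsync` on every write is what makes the log safe and also what makes it slow, which is the trade-off discussed in the summary that follows.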
Backup Storage Summary:
1, BlockManager-based replication: for example, two machines hold the data, and if one of them fails, processing switches to the other.
2, WAL: the WAL approach takes more time. If your performance requirements are very strict, WAL is generally not a good choice; if you can tolerate a delay of more than 1 minute, WAL is usually the safer option.
Note: data may still be lost if a failure happens before it has been written to the WAL.
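The first option in the summary, switching to the other machine when one fails, can be sketched as follows. The classes `Node` and `ReplicatedBlockStore` are hypothetical and exist only for this illustration; they model two machines as dictionaries and do not reflect how BlockManager is actually implemented.

```python
from typing import Dict, List, Tuple


class Node:
    """Hypothetical machine holding blocks in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}
        self.alive = True


class ReplicatedBlockStore:
    """Writes every block to all live nodes and reads from the first live one."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes

    def put(self, block_id: str, data: bytes) -> int:
        copies = 0
        for node in self.nodes:
            if node.alive:
                node.blocks[block_id] = data
                copies += 1
        return copies  # number of replicas actually written

    def get(self, block_id: str) -> Tuple[str, bytes]:
        for node in self.nodes:
            if node.alive and block_id in node.blocks:
                return node.name, node.blocks[block_id]
        raise KeyError(f"no live replica holds {block_id}")


node_a, node_b = Node("a"), Node("b")
store = ReplicatedBlockStore([node_a, node_b])
copies = store.put("block-1", b"payload")

served_before = store.get("block-1")
node_a.alive = False  # simulate losing the first machine
served_after = store.get("block-1")
```

Reads stay in memory and failover is immediate, but every block occupies space on two machines.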
Second, support for message replay
This is mainly based on Kafka, which is replicated and fault tolerant by nature and has long been used as a storage system in its own right.
Kafka can be consumed in two ways, the Receiver way and the Direct way:
1, Receiver mode: the offset metadata is handed to ZooKeeper to manage, and Kafka data is re-read based on that offset. If a read fails, no ACK is sent to ZooKeeper,
so ZooKeeper considers the data not yet consumed; this is ZooKeeper's guarantee. There is a duplicate consumption problem: if the data has been consumed but the offset has not yet been synchronized to ZooKeeper, the data may be consumed again.
2, Direct mode: operates on Kafka directly and manages the offsets itself. Kafka itself has offsets, and this mode can guarantee that data is processed once and only once. This requires checkpointing, which is more time-consuming.
To manage this offset, each batch will call this method, and the latest offset minus the previous value determines the range of data for the batch.
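The offset arithmetic described for Direct mode can be written out as a small sketch. Everything here is an assumption made for illustration: `OffsetRange`, `plan_batch`, and `commit` are hypothetical names defined in this example (not the Spark or Kafka API), partitions are plain integers, and the offsets are hard-coded. It shows that a batch is fully described by a start offset and an end offset per partition, so a failed batch can be planned again and will cover exactly the same data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OffsetRange:
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    def count(self) -> int:
        # Latest offset minus the previous offset gives the batch size.
        return self.until_offset - self.from_offset


def plan_batch(current: Dict[int, int], latest: Dict[int, int]) -> List[OffsetRange]:
    """Build one range per partition, from the last processed offset to the latest."""
    return [
        OffsetRange(p, current.get(p, 0), latest[p])
        for p in sorted(latest)
    ]


def commit(current: Dict[int, int], ranges: List[OffsetRange]) -> Dict[int, int]:
    """Advance the stored offsets only after the batch has been processed."""
    updated = dict(current)
    for r in ranges:
        updated[r.partition] = r.until_offset
    return updated


current = {0: 0, 1: 0}   # offsets already processed, per partition
latest = {0: 3, 1: 2}    # newest offsets available in the source

ranges = plan_batch(current, latest)
retry_ranges = plan_batch(current, latest)  # same plan if the batch is retried

current = commit(current, ranges)
next_ranges = plan_batch(current, latest)   # nothing new to read yet
```

Because the stored offsets move forward only in `commit`, a crash before that point leads to the same ranges being planned again rather than to skipped data.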
Note:
- Data from: Liaoliang (Spark release version customization)
- Sina Weibo:http://www.weibo.com/ilovepains
Spark Streaming source interpretation of executor fault-tolerant security