1. Spark Streaming data security considerations:
- Spark Streaming continuously receives data, continuously generates jobs, and continuously submits jobs to the cluster to run, so it involves a very important problem: data security.
- Spark Streaming is built on Spark Core, and the jobs it generates run on RDDs. So as long as the received data is kept safe, even if something goes wrong while a job is running, Spark Streaming can fall back on Spark Core's fault-tolerance mechanism and recover automatically.
- Executor fault tolerance is therefore mainly about keeping the received data safe.
- > Why we need not worry about fault tolerance of the computation itself: during computation Spark Streaming sits on top of Spark Core's fault tolerance, so the computation is naturally safe and reliable.
Executor fault-tolerance modes:
1. The simplest fault tolerance is replication, which relies on the underlying BlockManager's replica mechanism; this is the default.
2. WAL (write-ahead log) mode.
3. Make no copy after receiving the data, but support data replay, i.e. support re-reading the data from the source.
BlockManager replication:
- By default there are two copies in memory: when the Spark Streaming receiver stores the data it has received, it specifies the StorageLevel MEMORY_AND_DISK_SER_2. The underlying storage is delegated to the BlockManager, whose semantics guarantee that if two replicas are specified they will generally both be in memory, so at least two executors hold the data.
The receiver hands its data to the BlockManager through a ReceivedBlockHandler, of which there are two implementations: 1. WriteAheadLogBasedBlockHandler; 2. BlockManagerBasedBlockHandler.
The StorageLevel here is passed in when the InputDStream is built; socketTextStream's default storage level is StorageLevel.MEMORY_AND_DISK_SER_2.
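A minimal sketch of passing the storage level explicitly (host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ReplicationDemo")
val ssc = new StreamingContext(conf, Seconds(5))

// The third argument is the storage level; when omitted, socketTextStream
// defaults to StorageLevel.MEMORY_AND_DISK_SER_2, i.e. two serialized replicas.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
```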
If you use WriteAheadLogBasedBlockHandler you need to enable the WAL, which is off by default:
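A minimal sketch of turning the WAL on (the checkpoint path is a placeholder):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WalDemo")
  // Off by default; enables the write-ahead log for receiver-based streams.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
// WAL files are stored under the checkpoint directory, so one must be set.
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint")
```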
WAL log mode: in this mode the data is first written to a log file under the checkpoint directory; on an exception, the data is re-read from the checkpoint directory for recovery. When the WAL is enabled, there is no need to set the replica count greater than 1, nor to require serialization in the storage level.
The WAL handler writes the data to the BlockManager and to the write-ahead log at the same time; the two block writes run in parallel, and the call returns only after both stores have completed.
Storing the block into the BlockManager:
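A simplified paraphrase of this step from WriteAheadLogBasedBlockHandler.storeBlock in the Spark source (not user-facing API; blockManager, blockId, serializedBlock and effectiveStorageLevel are members or locals of the handler):

```scala
// Store the serialized block into the BlockManager on a background future
// (the handler runs these futures on its own ExecutionContext).
val storeInBlockManagerFuture = Future {
  val putSucceeded = blockManager.putBytes(
    blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
  if (!putSucceeded) {
    throw new SparkException(
      s"Could not store $blockId to block manager with storage level $effectiveStorageLevel")
  }
}
```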
Writing the block into the WAL log:
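And the parallel WAL write plus the join of the two futures, again paraphrased from the same storeBlock method:

```scala
// In parallel, append the same serialized bytes to the write-ahead log.
val storeInWriteAheadLogFuture = Future {
  writeAheadLog.write(serializedBlock.toByteBuffer, clock.getTimeMillis())
}

// storeBlock returns only after BOTH writes have completed; the WAL record
// handle it returns is what allows the block to be re-read after a crash.
val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
```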
The WAL writes data sequentially and the data is immutable, so a read simply follows a pointer (the handle records where the desired record is and how long it is). That is why the WAL is very fast.
Browse WriteAheadLog, which is an abstract class:
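Its contract, rendered here in Scala for readability (in Spark it is actually a Java abstract class in org.apache.spark.streaming.util):

```scala
import java.nio.ByteBuffer

abstract class WriteAheadLog {
  // Append a record; the returned handle can locate it later.
  def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle
  // Read back one record via its handle.
  def read(handle: WriteAheadLogRecordHandle): ByteBuffer
  // Read every record, used for full recovery after a failure.
  def readAll(): java.util.Iterator[ByteBuffer]
  // Delete records older than threshTime.
  def clean(threshTime: Long, waitForCompletion: Boolean): Unit
  // Close the log and release resources.
  def close(): Unit
}
```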
Take a look at the write method of FileBasedWriteAheadLog, an implementation class of WriteAheadLog: depending on the time, it obtains a different writer, writes the serialized result to the log file, and returns a FileBasedWriteAheadLogSegment object (a FileSegment).
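A simplified paraphrase of that write method (retry bookkeeping trimmed; maxFailures, getLogWriter and resetWriter are members of FileBasedWriteAheadLog):

```scala
def write(byteBuffer: ByteBuffer, time: Long): FileBasedWriteAheadLogSegment = synchronized {
  var fileSegment: FileBasedWriteAheadLogSegment = null
  var failures = 0
  var succeeded = false
  while (!succeeded && failures < maxFailures) {
    try {
      // getLogWriter(time) returns a different writer for different times,
      // because the log files are rolled over on a time interval.
      fileSegment = getLogWriter(time).write(byteBuffer)
      succeeded = true
    } catch {
      case ex: Exception =>
        logWarning("Failed to write to write ahead log")
        resetWriter()
        failures += 1
    }
  }
  // The segment records the file path, offset and length of the record.
  fileSegment
}
```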
Reading data:
It creates a FileBasedWriteAheadLogRandomReader object and then calls that object's read method:
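The read method, paraphrased from FileBasedWriteAheadLogRandomReader: because the segment handle already records the offset and length, reading is a single seek plus a bounded read:

```scala
def read(segment: FileBasedWriteAheadLogSegment): ByteBuffer = synchronized {
  assertOpen()
  instream.seek(segment.offset)        // jump straight to the record
  val nextLength = instream.readInt()  // length header written with the record
  require(nextLength == segment.length,
    s"Expected message length to be ${segment.length}, but was $nextLength")
  val buffer = new Array[Byte](nextLength)
  instream.readFully(buffer)           // read exactly the record's bytes
  ByteBuffer.wrap(buffer)
}
```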
The third mode: supporting data replay.
Kafka has a receiver mode and a direct mode.
- Receiver mode: ZooKeeper manages the metadata, i.e. the offsets. If a failure occurs, Kafka re-reads the data based on the offset: because the crash happened while the data was still being processed, no ACK was sent to ZooKeeper, so ZooKeeper considers that data not yet consumed. In practice, however, the direct mode is used more and more, because it operates on and manages the offsets directly.
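A minimal direct-mode stream using the Kafka 0.8 integration's KafkaUtils.createDirectStream (assumes an existing StreamingContext ssc; broker list and topic are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct mode: no receiver and no ZooKeeper-managed offsets; the stream
// itself tracks offsets and can replay any range of data from Kafka.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("logs")

val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
```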
- DirectKafkaInputDStream checks the latest offsets and puts the offset range into the batch.
- Each time a batch is generated, latestLeaderOffsets is called to look up the latest offsets; subtracting the previous offsets gives the batch's offset range, so the data can then be read.
```scala
@tailrec
protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
  val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
  // Either.fold would confuse @tailrec, do it manually
  if (o.isLeft) {
    val err = o.left.get.toString
    if (retries <= 0) {
      throw new SparkException(err)
    } else {
      log.error(err)
      Thread.sleep(kc.config.refreshLeaderBackoffMs)
      latestLeaderOffsets(retries - 1)
    }
  } else {
    o.right.get
  }
}
```
Lesson 12: Spark Streaming source code interpretation of executor fault tolerance and data security