Spark Streaming source code walkthrough: executor fault tolerance and data safety

Source: Internet
Author: User

Contents of this issue:

    • Executor WAL (write-ahead log)
    • Message replay

Consider Spark Streaming as a whole from the perspective of data safety:

1. Spark Streaming receives data continuously, keeps generating jobs, and keeps submitting them to the cluster to run, so the most important concern is the safety of the received data.

2. Spark Streaming is built on Spark Core, so when an error or failure occurs at run time it can recover automatically through the RDD fault tolerance of Spark Core, provided that the data itself is safe and reliable.

Fault tolerance therefore rests on the data that the executor receives. Fault tolerance at the scheduling level is handled almost entirely by Spark Core, so executor fault tolerance is mainly about the data; during computation, Spark Streaming relies on the RDD fault tolerance of Spark Core.

Fault tolerance for data:

1. The most natural approach is replication: make a copy of the data before processing it.

2. Do not replicate on receipt and rely on a data source that supports replay instead, so the data can be read more than once. For example, if processing the data of the past 10 seconds fails, the same 10 seconds of data can be read again.

  

First, the executor WAL

The BlockManager in Spark Core is responsible for reading and writing data on each executor and is organized as a master-slave structure.

Backup relies on the StorageLevel of the BlockManager, Spark's underlying storage system.

1. BlockManagerBasedBlockHandler: the replication mechanism
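As a rough illustration of the replication idea (this is not Spark's BlockManager code), the sketch below keeps every block on two in-memory nodes and shows that a read still succeeds after one of them is lost. The class `ReplicatedStore` and its methods `put`, `get`, and `fail_node` are hypothetical names invented for this example.

```python
class ReplicatedStore:
    """Toy block store that keeps every block on `replication` nodes."""

    def __init__(self, num_nodes=3, replication=2):
        if replication > num_nodes:
            raise ValueError("replication cannot exceed the number of nodes")
        self.nodes = [dict() for _ in range(num_nodes)]
        self.alive = [True] * num_nodes
        self.replication = replication

    def put(self, block_id, data):
        # Pick the first `replication` live nodes and store a copy on each.
        targets = [i for i, up in enumerate(self.alive) if up][: self.replication]
        if len(targets) < self.replication:
            raise RuntimeError("not enough live nodes to replicate the block")
        for i in targets:
            self.nodes[i][block_id] = data
        return targets

    def fail_node(self, index):
        # Simulate an executor crash: the node's in-memory blocks are gone.
        self.alive[index] = False
        self.nodes[index].clear()

    def get(self, block_id):
        # Read from any surviving node that still holds a copy.
        for i, up in enumerate(self.alive):
            if up and block_id in self.nodes[i]:
                return self.nodes[i][block_id]
        raise KeyError(block_id)


store = ReplicatedStore()
holders = store.put("block-0", b"records")
store.fail_node(holders[0])        # lose one replica
recovered = store.get("block-0")   # still served by the other replica
```

The price of this approach is that every block occupies memory on two nodes, which is the trade-off the replication mode accepts in exchange for fast failover.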

2. WriteAheadLogBasedBlockHandler: the WAL (write-ahead log) mode

This mode writes a log to a specific directory so that later failures can be recovered from the log. The log directory has to be configured: the WAL is written under the checkpoint directory, which can contain many subdirectories and is set on the context with StreamingContext.checkpoint.

The directory is typically placed on HDFS. The advantage is safety, because HDFS keeps multiple replicas; the disadvantage is the performance cost and the extra storage space.

The data is written to the WAL and to the BlockManager at the same time.

The executor writes the data sequentially. Because the log serves as a WAL, the data is never modified after it is written, and reads generally go through an index instead of a full scan, so reading is fast.
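The append-only write and index-based read described above can be sketched as follows. This is a minimal sketch that assumes a single local file and a 4-byte length prefix per record; the names `SimpleWAL` and `Segment` are hypothetical and the on-disk format is not Spark's.

```python
import os
import struct
import tempfile
from collections import namedtuple

# Hypothetical handle describing where a record lives inside the log file.
Segment = namedtuple("Segment", ["path", "offset", "length"])


class SimpleWAL:
    def __init__(self, path):
        self.path = path
        self._file = open(path, "ab")

    def write(self, payload):
        # Append only: a 4-byte length prefix followed by the payload.
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(struct.pack(">I", len(payload)))
        self._file.write(payload)
        self._file.flush()
        os.fsync(self._file.fileno())  # make the record durable before returning
        return Segment(self.path, offset, len(payload))

    def read(self, segment):
        # Jump straight to the recorded offset instead of scanning the file.
        with open(segment.path, "rb") as f:
            f.seek(segment.offset)
            (length,) = struct.unpack(">I", f.read(4))
            return f.read(length)

    def close(self):
        self._file.close()


log_path = os.path.join(tempfile.mkdtemp(), "receiver.wal")
wal = SimpleWAL(log_path)
first = wal.write(b"event-1")
second = wal.write(b"event-22")
second_payload = wal.read(second)
wal.close()
```

Because each write returns the segment that locates the record, a reader only needs to keep those small handles to fetch any record later with a single seek.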

3. Concrete implementation: managing the WAL files, which covers writing to files periodically, the output produced when a file is written, and cleaning up old files
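The file management tasks listed above (rolling to a new file after a time interval and deleting files older than a threshold) might look like the in-memory sketch below. It assumes time-windowed segments and uses hypothetical names (`RollingLogManager`, `rolling_interval_ms`, `clean`); real files are replaced by dictionaries to keep the example short.

```python
class RollingLogManager:
    """Tracks log segments by time window; names are illustrative only."""

    def __init__(self, rolling_interval_ms):
        self.rolling_interval_ms = rolling_interval_ms
        self.segments = []  # each segment: {"start", "end", "records"}

    def write(self, record, now_ms):
        # Start a new segment when there is none or the current one has expired.
        if not self.segments or now_ms >= self.segments[-1]["end"]:
            self.segments.append(
                {
                    "start": now_ms,
                    "end": now_ms + self.rolling_interval_ms,
                    "records": [],
                }
            )
        self.segments[-1]["records"].append(record)

    def clean(self, threshold_ms):
        # Drop segments whose whole time window ended before the threshold.
        kept = [s for s in self.segments if s["end"] >= threshold_ms]
        removed = len(self.segments) - len(kept)
        self.segments = kept
        return removed


manager = RollingLogManager(rolling_interval_ms=1000)
for t in (0, 400, 1000, 1500, 2200):
    manager.write(f"record@{t}", now_ms=t)
segments_before = len(manager.segments)
removed = manager.clean(threshold_ms=2100)
```

Cleaning by whole segment keeps deletion cheap: once every batch that needed a time window has finished, the segment for that window can be dropped in one step.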

Backup storage summary:

1. BlockManager-based: for example, two machines hold the data; if one of them fails, processing switches to the other.

2. WAL-based: the WAL is more time-consuming. If your performance requirements are strict, the WAL is generally not a good choice; if you can tolerate a delay of more than 1 minute, the WAL is often the safer option.

Note: data can still be lost if a failure occurs before the data has been written to the WAL.

  

Second, support for message replay

Message replay is mainly based on Kafka, which is replicated and fault tolerant by design and is often used as a storage system.

Kafka can be consumed in two ways, the Receiver mode and the Direct mode:

1. Receiver mode: the offset metadata is managed by ZooKeeper, and after a failure the data is re-read from Kafka based on that offset. If a read fails, no ACK is sent to ZooKeeper, so ZooKeeper treats the data as not yet consumed; this guarantee comes from ZooKeeper. The mode has a duplicate-consumption problem: if the data has been consumed but the offset has not yet been synchronized to ZooKeeper, the same data may be consumed again.

2. Direct mode: the application operates on Kafka directly and manages the offsets itself, using the offsets that Kafka already maintains. This mode can guarantee that data is processed once and only once. It requires checkpointing, which is more time-consuming.
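The duplicate-consumption case of the Receiver mode can be simulated in a few lines. This is a sketch under simplifying assumptions: the Kafka topic is a Python list, and the hypothetical `OffsetStore` class stands in for the offset kept in ZooKeeper; `run_consumer` and `crash_before_commit` are likewise invented for the example.

```python
class OffsetStore:
    """Stands in for the external offset store (ZooKeeper in Receiver mode)."""

    def __init__(self):
        self.committed = 0  # next offset to read


def run_consumer(log, store, batch_size, crash_before_commit=False):
    start = store.committed
    batch = log[start:start + batch_size]
    processed = list(batch)  # the records are consumed here
    if crash_before_commit:
        return processed     # failure: the new offset is never acknowledged
    store.committed = start + len(batch)
    return processed


log = ["m0", "m1", "m2", "m3"]
store = OffsetStore()
seen = []
seen += run_consumer(log, store, 2)                            # m0, m1 consumed and committed
seen += run_consumer(log, store, 2, crash_before_commit=True)  # m2, m3 consumed, not committed
seen += run_consumer(log, store, 2)                            # restart re-reads m2, m3
```

No record is lost, but `m2` and `m3` appear twice in `seen`, which is the at-least-once behavior described for the Receiver mode.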

    

To manage the offsets, each batch calls a method that returns the current offset; the latest offset minus this value determines the range of data that the batch covers.
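The per-batch offset arithmetic can be sketched as below. The `OffsetRange` tuple is loosely modeled on the idea of a half-open range per partition, and `plan_batch`, `current`, and `latest` are hypothetical names; the offsets are made-up sample values, not output from a real cluster.

```python
from collections import namedtuple

# One half-open range [from_offset, until_offset) per topic partition.
OffsetRange = namedtuple(
    "OffsetRange", ["topic", "partition", "from_offset", "until_offset"]
)


def plan_batch(topic, current_offsets, latest_offsets):
    """Return the range of offsets each partition contributes to the next batch."""
    ranges = []
    for partition in sorted(latest_offsets):
        start = current_offsets.get(partition, 0)
        end = latest_offsets[partition]
        ranges.append(OffsetRange(topic, partition, start, end))
    return ranges


current = {0: 100, 1: 250}   # where the previous batch stopped
latest = {0: 130, 1: 250}    # newest offsets available in each partition
ranges = plan_batch("events", current, latest)
sizes = [r.until_offset - r.from_offset for r in ranges]

# After the batch succeeds, its end offsets become the start of the next batch.
current = {r.partition: r.until_offset for r in ranges}
```

Because each batch is fully described by its offset ranges, a failed batch can be recomputed by reading exactly the same ranges from Kafka again.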

    

Note:

      • Data from: Liaoliang (Spark release version customization)
      • Sina Weibo: http://www.weibo.com/ilovepains
