Spark Streaming source code interpretation: executor fault tolerance and data safety

Last Update:2016-05-24 Source: Internet

Author: User


Contents of this issue:

Executor's WAL (write-ahead log)
Message Replay

Consider Spark Streaming as a whole from the perspective of data safety:

1. Spark Streaming receives data continuously, keeps generating jobs, and keeps submitting those jobs to the cluster to run. The most important issue is the safety of the received data.

2. Because Spark Streaming is built on Spark Core, when errors or failures occur at runtime it can rely on the RDD fault tolerance of Spark Core to recover automatically, provided that the data itself is safe and reliable.

Therefore, receiving data safely on the executor is critical, because everything else builds on the fault tolerance of the data. Fault tolerance at the scheduling level is essentially provided by Spark Core, so fault tolerance for the executor is mainly fault tolerance of the data. During computation, Spark Streaming relies on the RDD fault tolerance of Spark Core.

Fault tolerance for data:

1. The most natural approach is replication: make a copy of the data before processing it.

2. Do not replicate when receiving data, and instead rely on a data source that supports replay, so the data can be read repeatedly. For example, after reading the data of the past 10 seconds, if an error occurs, the data of the past 10 seconds can be read again.

First, the executor WAL

Spark Core's BlockManager is responsible for reading and writing data on each executor, and has a master/slave structure.

Replication relies on the StorageLevel of BlockManager, Spark's underlying storage system.

1. BlockManagerBasedBlockHandler: replication mechanism

2. WriteAheadLogBasedBlockHandler: WAL (write-ahead log) mode

In WAL mode, a log is written to a specific directory, and later failures can be recovered from that log. The log needs a directory to be written to:

You need to set a checkpoint directory, and there can be many such directories. StreamingContext.checkpoint specifies the directory for a given context.

The directory is typically placed on HDFS. The advantage is safety through multiple replicas; the disadvantages are the performance cost and the extra storage space.

The data is written to both the WAL and the BlockManager:

The executor writes data sequentially. Because the WAL is only appended to and its data is never modified, records are generally read by index without a full scan, so reads are very fast.

3. Specific implementation: managing the WAL files, writing files periodically, handling output when writing files, and cleaning up old files.
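The append-only, read-by-index behaviour described above can be illustrated with a small sketch. This is a toy log written with the Python standard library only, not Spark's WriteAheadLog code; the class name SimpleWriteAheadLog, the file name wal-0.log, and the (offset, length) handle are assumptions made for illustration.

```python
import os
import tempfile


class SimpleWriteAheadLog:
    """Toy append-only log (hypothetical, not Spark's implementation)."""

    def __init__(self, path):
        self.path = path
        # Create the log file if it does not exist yet.
        open(self.path, "ab").close()

    def append(self, record: bytes):
        """Append a record at the end and return an (offset, length) handle."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before returning
        return (offset, len(record))

    def read(self, handle):
        """Read one record directly by its handle, without scanning the file."""
        offset, length = handle
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


# Usage: records are appended in order and never modified afterwards.
log_dir = tempfile.mkdtemp()
wal = SimpleWriteAheadLog(os.path.join(log_dir, "wal-0.log"))
h1 = wal.append(b"block-1")
h2 = wal.append(b"block-2")
recovered = wal.read(h2)  # seeks straight to the second record
```

Because each record is located by its handle, recovery only has to seek and read. A record that was received but not yet appended has no handle and cannot be recovered from the log.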

Backup storage summary:

1. BlockManager-based replication: for example, two machines hold the data, and if one of them fails, processing switches to the other.

2. WAL: the WAL approach takes more time. If your performance requirements are very demanding, WAL is generally not a good choice; if you can tolerate a delay of more than 1 minute, WAL is often the safer option.

Note: data can still be lost if a failure occurs before the data has been written to the WAL.

Second, support for message replay

This is mainly based on Kafka, which is replicated and fault tolerant by nature and can itself serve as a storage system.

Kafka can be consumed in receiver mode or in direct mode:

1. Receiver mode: the offset metadata is managed by ZooKeeper, and after a failure the Kafka data is re-read based on that offset. If a read fails, no ACK is sent to ZooKeeper, so ZooKeeper considers the data not yet consumed. This guarantee comes from ZooKeeper. There is a duplicate consumption problem: if the data has been consumed but the offset has not yet been synchronized to ZooKeeper, the data may be consumed again.
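The duplicate consumption problem in receiver mode can be reproduced with a small simulation. This is a sketch under stated assumptions, not Kafka or ZooKeeper client code: the function consume, the in-memory message list, and the crash_before_commit_at parameter are all hypothetical, and the committed offset stands in for the value stored in ZooKeeper.

```python
def consume(messages, committed_offset, crash_before_commit_at=None):
    """Process messages starting at committed_offset, committing after each one.

    Returns (processed, committed_offset). If crash_before_commit_at is set,
    the consumer stops right after processing that offset, before committing.
    """
    processed = []
    offset = committed_offset
    while offset < len(messages):
        processed.append(messages[offset])      # the message is consumed first
        if offset == crash_before_commit_at:
            return processed, committed_offset  # crash: the commit never happens
        committed_offset = offset + 1           # the "ACK" to the offset store
        offset += 1
    return processed, committed_offset


messages = ["a", "b", "c", "d"]

# First run crashes after consuming offset 1 but before committing it.
first_run, committed = consume(messages, 0, crash_before_commit_at=1)

# The restarted consumer resumes from the last committed offset.
second_run, committed_after = consume(messages, committed)
```

In this run the message at offset 1 appears in both first_run and second_run: nothing is lost, but that message is consumed twice.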

2. Direct mode: Spark Streaming operates on Kafka directly and manages the offsets itself. Kafka itself keeps offsets, so this approach can guarantee exactly-once processing. It requires checkpointing, which is more time-consuming.

To manage the offsets, each batch calls this method, and the latest offset minus this value determines the range of data for the batch.
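The idea of deriving each batch from an offset range can be sketched as follows. This is an illustrative model only, not the Spark Kafka integration API: BatchOffsetRange, plan_batch, read_range, and the in-memory partition logs are hypothetical names, and the range is assumed to run from the last processed offset (inclusive) to the latest available offset (exclusive).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchOffsetRange:
    """Hypothetical description of the data one batch reads from a partition."""
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    def count(self):
        return self.until_offset - self.from_offset


def plan_batch(current_offsets, latest_offsets):
    """Build one range per partition, from the current offset to the latest."""
    return [
        BatchOffsetRange(p, current_offsets[p], latest_offsets[p])
        for p in sorted(latest_offsets)
    ]


def read_range(partition_logs, r):
    """Read exactly the records covered by the range; safe to repeat."""
    return partition_logs[r.partition][r.from_offset:r.until_offset]


# In-memory stand-in for two Kafka partitions.
partition_logs = {0: ["a0", "a1", "a2", "a3", "a4"], 1: ["b0", "b1", "b2"]}
current_offsets = {0: 2, 1: 0}   # where the previous batch ended
latest_offsets = {0: 5, 1: 3}    # latest offsets seen for this batch

ranges = plan_batch(current_offsets, latest_offsets)
first_read = [read_range(partition_logs, r) for r in ranges]
replayed = [read_range(partition_logs, r) for r in ranges]  # after a failure
```

Because a batch is fully described by its ranges, saving the ranges (for example in a checkpoint) is enough to re-read the same records after a failure.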

Note:

- Data from: Liaoliang (Spark release version customization)
- Sina Weibo:http://www.weibo.com/ilovepains


This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
