Contents of this issue:
1 Exactly-once fault tolerance
2 Ensuring output is not duplicated
All data that cannot be processed as a real-time stream is invalid data. In the stream-processing era, Spark Streaming has strong appeal and promising prospects; combined with Spark's ecosystem, it can easily call on other powerful frameworks such as Spark SQL and MLlib, so it is bound to rise to prominence.
The Spark Streaming runtime is not so much a streaming framework on top of Spark Core as one of the most complex applications running on Spark Core. If you can master an application as complex as Spark Streaming, other complex applications are easy by comparison. Choosing Spark Streaming as the starting point for a custom version is therefore a natural choice.
We all know how important it is that data be processed once and only once, and output once and only once. So how does Spark guarantee this? This section discusses the specific scenarios where problems may arise and provides solutions.
One: Exactly once
1 Transactional processing: data-source safety
From the executor's perspective: when the receiver receives data from Kafka, it first writes the data to memory (or disk) via the BlockManager, or persists it through the WAL to guarantee its safety; the executor generates an ACK signal only after replication completes.
From Kafka's perspective: only once that acknowledgement is confirmed, and the next batch of data is about to be read, are the Kafka offsets updated (updateOffsets).
From the WAL mechanism's perspective: the WAL writes all received data to an HDFS-like fault-tolerant store before it is processed, which addresses the data loss that would otherwise be caused by an executor crash.
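The receiver-plus-WAL path described above can be sketched outside of Spark as follows. This is an illustrative simulation, not Spark's actual receiver or `ReceivedBlockHandler` API: the key point is that each block is durably logged *before* it is acknowledged, so a restarted process can replay everything that was acked.

```python
# Illustrative sketch (not Spark's real API): a receiver appends each
# block to a write-ahead log *before* acknowledging it, so a crashed
# executor can replay the logged blocks on restart.
import json
import os
import tempfile

class WriteAheadLog:
    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Durably log the record before it is acknowledged upstream.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # On recovery, re-read every logged block.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]

def receive(wal, records):
    acked = []
    for r in records:
        wal.append(r)    # 1. persist to the WAL first
        acked.append(r)  # 2. only then acknowledge upstream
    return acked

path = os.path.join(tempfile.mkdtemp(), "wal.log")
receive(WriteAheadLog(path), [{"offset": 0, "value": "a"},
                              {"offset": 1, "value": "b"}])
# After a simulated crash, a fresh process recovers both blocks:
recovered = WriteAheadLog(path).replay()
print(len(recovered))  # → 2
```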
2 Output safety after transactional processing
From the driver's point of view, ensuring output safety on the driver side rests on two points:
First, checkpoint-based fault tolerance;
Second, lineage-based fault tolerance.
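These two driver-side recovery paths can be sketched together. The code below is a hypothetical simulation, not Spark's checkpoint or RDD API: a checkpoint snapshots computed state periodically, and lineage (a deterministic transformation chain over replayable input) recomputes only what came after the last snapshot.

```python
# Illustrative sketch of checkpoint- plus lineage-based recovery.
# All names here are hypothetical; the point is that a checkpoint
# bounds how much the deterministic lineage must recompute.
def process_batch(batch_id, data, state):
    state = dict(state)
    state["total"] += sum(data)  # deterministic transformation
    state["last_batch"] = batch_id
    return state

source = {0: [1, 2], 1: [3, 4], 2: [5]}  # replayable input batches
state = {"last_batch": -1, "total": 0}
for bid in (0, 1):
    state = process_batch(bid, source[bid], state)
checkpoint = state  # snapshot taken after batch 1
state = process_batch(2, source[2], state)

# Driver crashes; restore the checkpoint, then use lineage to
# recompute only the batches after it (here, batch 2).
recovered = checkpoint
for bid in range(recovered["last_batch"] + 1, 3):
    recovered = process_batch(bid, source[bid], recovered)
print(recovered["total"])  # → 15, same as before the crash
```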
3 Spark's exactly-once transaction processing
In view of the above two points, we know that to guarantee zero data loss we must have a reliable data source and reliable data reception, the entire application must be checkpointed, and data must be secured through the WAL. To avoid the WAL's performance loss while still achieving exactly-once semantics, Spark Streaming provides the Kafka Direct API. It treats Kafka as a file-storage system, combining the advantages of streaming with the advantages of a file system: every executor consumes data directly through the Kafka Direct API and manages the offsets itself, so there is neither a performance penalty nor repeated consumption of data.
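The Direct-API idea can be sketched as follows. This is a simulation, not the real Kafka client: each batch is defined by an explicit offset range, so a replayed batch deterministically reads exactly the same records, and the offsets are committed together with the output.

```python
# Illustrative sketch of direct, offset-range-based consumption.
# A Kafka partition is modeled as a plain list; "committed" stands in
# for a single store holding both the output and the consumed offset.
log = ["m0", "m1", "m2", "m3", "m4"]      # one partition's messages
committed = {"offset": 0, "results": []}  # offsets + output kept together

def run_batch(batch_size):
    start = committed["offset"]
    until = min(start + batch_size, len(log))
    records = log[start:until]  # deterministic: same range, same data
    # Commit output and the new offset atomically (one dict update
    # here; a real system would use one transaction on one store).
    committed.update(offset=until,
                     results=committed["results"] + records)
    return start, until

run_batch(2)  # consumes m0, m1
run_batch(2)  # consumes m2, m3
print(committed["offset"], committed["results"])
# → 4 ['m0', 'm1', 'm2', 'm3']
```

Because the offset advances only together with the output, a batch that fails before the commit is simply re-run over the same range, with no loss and no duplication.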
Two: Data output is not duplicated
Avoiding duplicated output is also an important issue in production environments.
1 Why does duplicated output occur?
Task retry
Speculative execution of slow tasks
Stage retry
Job retry
2 Solutions
2.1 For the retry problems of jobs, stages, and tasks: so that a single task failure fails the whole job (rather than being silently re-executed), set spark.task.maxFailures to 1;
2.2 Set spark.speculation to false to disable speculative execution of slow tasks;
2.3 If Spark Streaming is combined with Kafka and a job fails, set the Kafka parameter auto.offset.reset to largest, so consumption resumes automatically from the latest offset.
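The three settings above can be gathered in one place. The property names below are real Spark/Kafka keys taken from the text; note that the value "largest" applies to the old Kafka 0.8-style consumer, while newer Kafka consumers use "latest" instead.

```python
# The settings discussed above, as plain key/value configuration.
spark_conf = {
    "spark.task.maxFailures": "1",  # one task failure fails the job: no task retry
    "spark.speculation": "false",   # no speculative re-execution of slow tasks
}
kafka_params = {
    # On newer Kafka consumers this value is spelled "latest".
    "auto.offset.reset": "largest",  # after a failed job, resume from the newest offset
}
print(spark_conf["spark.task.maxFailures"], kafka_params["auto.offset.reset"])
```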
Final summary
Based on your business logic, you can use transform and foreachRDD for logic-level control to achieve non-repeated consumption and non-duplicated output of data! These two operators are like back doors into Spark Streaming and can be manipulated in almost any way you can conceive!
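The business-level control mentioned above usually means making the output idempotent inside foreachRDD. A minimal sketch of that idea, with a hypothetical key-value sink rather than any real database API: each record is written under a unique (batch id, index) key, so a retried batch overwrites the same keys instead of appending duplicates.

```python
# Illustrative sketch of idempotent output: writes are keyed by
# (batch_id, index), so replaying a batch after a retry upserts the
# same keys rather than producing duplicate rows.
sink = {}  # stands in for a key-value store with upsert semantics

def foreach_batch(batch_id, records):
    for i, rec in enumerate(records):
        sink[(batch_id, i)] = rec  # upsert: a retry cannot duplicate

foreach_batch(7, ["a", "b"])
foreach_batch(7, ["a", "b"])  # the same batch retried: no duplicates
print(len(sink))  # → 2
```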
Spark Version Customization, Day 4: Exactly-once transaction processing