Lesson 4, Version Customization: Complete Mastery of Spark Streaming Transaction Processing

Source: Internet
Author: User

Contents of this lesson

1. Exactly-once processing

2. Output without duplication

Transactions:

Take a bank transfer as an example: user A transfers money to user B. If the operation is replayed, B could be credited multiple times. How do we guarantee transactional consistency? The transfer must be output exactly once: A is debited exactly once, and B is credited exactly once.

Decrypting the Spark Streaming architecture from a transactional perspective:

After a Spark Streaming application starts and is allocated resources, it generally runs without problems unless the cluster's hardware fails. A Spark Streaming program has two parts: one is the Driver, the other is the Executors. The Receiver receives data and sends the metadata to the Driver; after receiving the metadata, the Driver checkpoints it. The checkpoint includes: configuration information (SparkConf and the Spark Streaming configuration), block metadata, the DStreamGraph, and the unhandled and waiting jobs. Jobs are executed on multiple Executor nodes entirely according to the Spark Core scheduling model.

Executors only run the processing logic and hold the data. The external input stream flows into the Receiver, which writes it through the BlockManager to disk and memory, and to a WAL (write-ahead log) for fault tolerance; since the WAL is written to disk first, failure is unlikely once a record is logged. However, if, say, 1 GB of data is to be processed and the Receiver accumulates records up to a threshold before writing them to the WAL, then if the Receiver thread fails, the records buffered but not yet logged may be lost.

Before the checkpoint, the Driver processes the metadata: Spark Streaming obtains data and generates jobs, but the jobs are not executed yet; execution must go through the SparkContext. Recovering data at the Driver level requires reading the Driver's checkpoint, internally rebuilding the SparkContext, StreamingContext, and Spark jobs, and then resubmitting them to the Spark cluster. The Receiver recovers its data from the WAL on disk.


Spark Streaming together with Kafka does not have the WAL data-loss problem, but Spark Streaming must still consider the external pipeline end to end.

How do we achieve complete semantics and transactional consistency, guaranteeing zero data loss and exactly-once transaction processing?

1. How do we guarantee zero data loss?

There must be a reliable data source and a reliable Receiver, the entire application's metadata must be checkpointed, and data safety must be guaranteed through the WAL. (When the Receiver receives Kafka data in a production environment, by default two copies of the data are kept across Executors; if the Receiver crashes while receiving data and no replica exists yet, the data is copied again from Kafka, and that copy is driven by the offset metadata in ZooKeeper.)
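The WAL idea above can be sketched in a few lines: the receiver durably appends each record before acknowledging it upstream, and replays the log on restart. This is an illustrative stand-alone sketch; the `WriteAheadLog` class and file layout are invented for illustration and are not Spark's actual WAL implementation.

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Append-only log: records are flushed to disk *before* they are
    acknowledged upstream, so a crashed receiver can replay them."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Durably log the record before the caller may ACK the source.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # On recovery, re-read every record that reached the log.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]

# Log two records, then "crash" and recover them.
wal = WriteAheadLog(os.path.join(tempfile.mkdtemp(), "receiver.wal"))
wal.append({"offset": 0, "value": "a"})
wal.append({"offset": 1, "value": "b"})
recovered = wal.replay()   # what a restarted receiver would see
print(recovered)
```

Note the `fsync` before returning from `append`: the ACK must only be sent after the record is durable, otherwise the window for loss reappears.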

You can think of Kafka as a simple file-storage system. On the Executor, the Receiver confirms that each record received from Kafka has been replicated to another Executor, then sends an acknowledgement (ACK) back to Kafka, and only then reads the next message from Kafka.

2. Driver fault tolerance, as shown in the figure:


Think again: where else might data be lost?

Where data can be lost, and how to solve it:

If the Driver suddenly crashes while the Receiver is receiving data and the Driver has started dispatching the computation to Executors, the Executors will be killed (a Driver crash causes its Executors to be killed), and the data held in those Executors will be lost. Therefore all data must first be made fault-tolerant through a mechanism such as the WAL, for example written to HDFS; data lost in an Executor can then be recovered from the WAL.

How do we guarantee that data is processed exactly once? (Important)

Zero data loss does not guarantee exactly-once processing. If the Receiver saves the received data but crashes before it can update the offsets, the data will be processed again (repeated consumption).

A more detailed description of the repeated-read scenario:

The Receiver receives data and saves it to a persistence engine such as HDFS, but crashes before it has time to update the offsets. After the Receiver restarts, it reads the metadata again from the ZooKeeper cluster that manages Kafka, causing the data to be read repeatedly: from Spark Streaming's point of view the processing succeeded, but Kafka considers it a failure (because the Receiver crashed before updating the offsets in ZooKeeper) and consumes the data again on recovery, which leads to repeated consumption.
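One common fix for this failure mode is to store the processed data and the advanced offset in a single atomic transaction, so that no crash can separate "data saved" from "offset updated". A minimal sketch using SQLite; the table names and the `process_batch` helper are hypothetical, not part of Spark or Kafka:

```python
import sqlite3

# A single SQLite transaction stores both the processed records and the
# advanced offset, so a crash can never leave "data saved, offset stale".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (off INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_off INTEGER)")
conn.execute("INSERT INTO offsets VALUES (0, 0)")
conn.commit()

def process_batch(conn, part, records):
    """Commit the results and the new offset atomically."""
    with conn:  # BEGIN ... COMMIT, or ROLLBACK if anything fails
        for off, value in records:
            # INSERT OR IGNORE also makes a redelivered batch harmless.
            conn.execute("INSERT OR IGNORE INTO results VALUES (?, ?)",
                         (off, value))
        conn.execute("UPDATE offsets SET next_off = ? WHERE part = ?",
                     (records[-1][0] + 1, part))

process_batch(conn, 0, [(0, "a"), (1, "b")])
process_batch(conn, 0, [(0, "a"), (1, "b")])  # redelivery: no duplicates
row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
next_off = conn.execute(
    "SELECT next_off FROM offsets WHERE part = 0").fetchone()[0]
print(row_count, next_off)  # → 2 2
```

Because the offset lives in the same store as the results, a restarted consumer simply reads `next_off` and resumes; the redelivered batch in the example changes nothing.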


Performance notes:

    1. The WAL ensures that data is not lost, but its drawback is that it greatly hurts the throughput of Spark Streaming's Receiver (in today's production environments, the Kafka Direct API is usually used instead).
    2. Note that if you use Kafka as the data source, the data already exists in Kafka; when the Receiver receives it, another copy is created, which is a waste of storage resources. (To resolve repeated reads, record the metadata of each batch in an in-memory database as the data is read, and on every read check whether that metadata has already been computed.)
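The dedupe-by-metadata idea in point 2 can be sketched as follows: record each batch's identifying metadata (for example its topic, partition, and offset range) before computing it, and skip any batch whose metadata has already been seen. A Python set stands in for the in-memory database here, and `maybe_compute` is an invented helper:

```python
# A set of batch metadata stands in for the in-memory database: a batch
# is computed only if its (topic, partition, from_off, until_off) key
# has not been seen before.
seen = set()

def maybe_compute(batch_key, records, compute):
    if batch_key in seen:
        return None          # replayed batch: already computed, skip it
    seen.add(batch_key)
    return compute(records)

first = maybe_compute(("logs", 0, 0, 2), [1, 2], sum)   # computed
replay = maybe_compute(("logs", 0, 0, 2), [1, 2], sum)  # replay → skipped
print(first, replay)  # → 3 None
```

In a real deployment the `seen` set would live in an external store shared by all workers, not in process memory.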

To avoid the WAL performance loss and to implement exactly-once semantics, Spark 1.3 introduced the Kafka Direct API, treating Kafka as a file-storage system. Kafka thereby combines the advantages of a stream with the advantages of a file system; from this point on, Spark Streaming plus Kafka builds the perfect stream-processing world!

With the Direct API the data needs no extra copies, there is no WAL performance loss, and no Receiver is needed: all Executors consume the data directly through the Kafka API and manage the offsets themselves. Data is therefore not consumed repeatedly, and transactional semantics are achieved!

One last question: why can Spark Streaming output data multiple times, and how is that solved?

Why does this problem arise? Because Spark Streaming computes on top of Spark Core, and Spark Core naturally does the following things, which can lead to Spark Streaming's results being (partially) output more than once:

1. Task retry;

2. Slow-task speculation;

3. Stage retry;

4. Job retry;

any of which can cause (partial) results to be output more than once.

Corresponding solutions:

1. Make a task failure count as a job failure: set spark.task.maxFailures to 1;

2. Set spark.speculation to false (slow-task speculation is very performance-intensive anyway, so turning it off can also significantly improve Spark Streaming's throughput);

3. For Spark Streaming on Kafka, if a job fails, set Kafka's auto.offset.reset to "largest" so that job execution resumes automatically.

Finally, we emphasize again:

Based on your business logic, you can use transform and foreachRDD to take control of the processing and achieve non-repeated consumption and non-duplicated output. These two operators are like back doors into Spark Streaming: through them you can manipulate the data in any way you can imagine!
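One way to make foreachRDD-style output idempotent is to key each partition's output by (batch time, partition id), so a retried task overwrites its own earlier attempt instead of appending a duplicate. A minimal sketch with an in-memory dict standing in for the external store; `write_partition` is an invented helper:

```python
# Key each partition's output by (batch_time, partition_id): a retried
# task overwrites its previous attempt instead of appending a duplicate.
store = {}

def write_partition(batch_time, partition_id, rows):
    store[(batch_time, partition_id)] = rows  # idempotent overwrite

write_partition(1000, 0, ["a", "b"])
write_partition(1000, 0, ["a", "b"])  # task retry: same key, no duplicates
write_partition(1000, 1, ["c"])
total_rows = sum(len(rows) for rows in store.values())
print(total_rows)  # → 3
```

The same pattern works against any store with upsert semantics (an overwrite-on-key database table, or an overwritten file path per batch and partition).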

Source: DT Big Data DreamWorks (Spark Version Customization course)

For more exclusive content, follow the public account: DT_Spark.
If you are interested in big data and Spark, you can attend teacher Liaoliang's free public Spark class every night at 20:00, in YY room number 68917580.
