Spark Streaming fault-tolerance improvements and zero data loss


This article is based on a blog post by Tathagata Das, the Spark Streaming project lead, who now works at Databricks. He previously worked on big data and Spark Streaming at the AMPLab at UC Berkeley. The post discusses the fault-tolerance improvements in Spark Streaming and how it achieves zero data loss.

The following is the original text:

A real-time streaming system must operate 24/7, so it needs to be able to recover from all kinds of system failures. From the beginning, Spark Streaming has supported recovering from both driver and worker failures. However, with some input sources, data could still be lost after a recovery. In the Spark 1.2 release, we added preliminary support for write-ahead logs (also known as journaling) in Spark Streaming, which improves the recovery mechanism and provides stronger guarantees of zero data loss for more data sources. This article describes in detail how this feature works and how developers can enable it in their Spark Streaming applications.

Background

Spark and its RDD abstraction are designed to seamlessly handle the failure of any worker node in the cluster. Since Spark Streaming is built on Spark, its worker nodes have the same fault-tolerance capability. However, Spark Streaming's long-uptime requirement means its applications must also be able to recover from failures of the driver process, the main application process that coordinates all the workers. Making the Spark driver fault-tolerant is tricky because it may be an arbitrary user program with an arbitrary computation pattern. However, Spark Streaming applications have an inherent structure to their computation: they run the same Spark computation periodically on each micro-batch of data. This structure allows the application's state (also known as a checkpoint) to be periodically saved to reliable storage and recovered when the driver restarts.
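As a concrete illustration, the sketch below shows the checkpoint-based driver-recovery pattern in a Spark Streaming application. The application name and checkpoint path are hypothetical. StreamingContext.getOrCreate builds a fresh context on the first run and reconstructs it from the checkpoint data after a driver restart.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Builds the streaming context from scratch on a fresh start; after a driver
    // failure, getOrCreate reconstructs the context from the checkpoint instead.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("RecoverableApp")    // hypothetical app name
      val ssc = new StreamingContext(conf, Seconds(10))          // 10-second batch interval
      ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")   // hypothetical checkpoint path
      // ... define input streams and transformations here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(
      "hdfs://namenode:8020/spark/checkpoints", () => createContext())
    ssc.start()
    ssc.awaitTermination()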

For sources such as files, this driver-recovery mechanism is sufficient for zero data loss, because all of the data is stored in a fault-tolerant file system such as HDFS or S3. However, for other sources such as Kafka and Flume, some of the received data that is buffered only in memory and has not yet been processed can be lost. This is a consequence of how Spark applications operate in a distributed manner. When the driver process fails, all of the executors running in a standalone/YARN/Mesos cluster are killed as well, along with any data in their memory. In Spark Streaming, data received from sources such as Kafka and Flume is buffered in executor memory until it has been processed. This buffered data cannot be recovered even if the driver is restarted. To avoid this loss, we introduced the write-ahead log feature in the Spark 1.2 release.

Write-Ahead Logs

Write-ahead logs (also known as journals) are commonly used in databases and file systems to ensure the durability of data operations. The idea is to first write the intended operation to a durable log, and only then apply it to the data. If the system fails while applying the operation, it can recover by reading the log and re-applying the operations it had intended to perform. Let's look at how this concept is used to guarantee the durability of the received data.

Sources such as Kafka and Flume deliver data through receivers. Receivers run as long-running tasks inside executors, are responsible for receiving data from the source, and, when the source supports it, also acknowledge the received data. The received data is stored in executor memory, and the driver then runs tasks on the executors to process it.

When the write-ahead log is enabled, all of the received data is additionally saved to log files in a fault-tolerant file system. Therefore, even if Spark Streaming fails, the received data is not lost. Moreover, the receiver acknowledges data only after it has been written to the log, so buffered data that was not yet saved can be re-sent by the source after the driver restarts. Together, these two mechanisms ensure zero data loss: all data is either recovered from the log or re-sent by the source.

Configuration

To enable the write-ahead log feature, perform the following steps.

Set the checkpoint directory via streamingContext.checkpoint(path-to-directory). The directory can be on any file system compatible with the Hadoop API, and it is used both for storing the streaming checkpoints and for storing the write-ahead logs.

Set the SparkConf property spark.streaming.receiver.writeAheadLog.enable to true (the default is false).

Once the log is enabled, all receivers gain the benefit of being able to recover from reliably received data. We recommend disabling in-memory replication by setting an appropriate persistence level (storage level) on the input stream, because the fault-tolerant file system used for the write-ahead log is likely to replicate the data anyway.
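Putting these settings together, here is a minimal configuration sketch. The application name, checkpoint path, and the socket source used to illustrate passing a non-replicated storage level are all hypothetical.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("WriteAheadLogExample")                             // hypothetical app name
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // enable the write-ahead log

    val ssc = new StreamingContext(conf, Seconds(10))
    // Checkpoint directory on a Hadoop-API-compatible file system; it stores both
    // the streaming checkpoints and the write-ahead logs.
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")          // hypothetical path

    // With the log enabled, a single in-memory copy is usually enough, so a
    // non-replicated storage level is used instead of a replicated one.
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)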

In addition, if you want to be able to recover even the buffered data, you need a source that supports acknowledgments (such as Kafka, Flume, and Kinesis) and a reliable receiver, one that acknowledges data to the source only after it has been reliably saved to the log. The built-in Kafka and Flume polling receivers are already reliable.
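To make the acknowledgment behavior concrete, here is a rough sketch of a custom reliable receiver for a hypothetical acking source; fetchBatchFromSource and ackBatchToSource are placeholders, not a real client API. The key point is that the multi-record store() call blocks until the records are safely stored (including the write-ahead log, when enabled), and only then is the source acknowledged.

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class AckingReceiver(/* connection parameters for a hypothetical acking source */)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      def onStart(): Unit = {
        // Receive on a separate thread so onStart() returns immediately.
        new Thread("Acking Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {
        // Close the connection to the source here.
      }

      private def receive(): Unit = {
        while (!isStopped()) {
          val batch: ArrayBuffer[String] = fetchBatchFromSource()
          // store(multiple-records) blocks until the records are stored reliably,
          // which includes the write-ahead log when it is enabled.
          store(batch)
          // Acknowledge only after store() has returned, so unacknowledged data
          // can be re-sent by the source after a failure.
          ackBatchToSource(batch)
        }
      }

      // Hypothetical placeholders standing in for a real source client.
      private def fetchBatchFromSource(): ArrayBuffer[String] = ArrayBuffer.empty
      private def ackBatchToSource(batch: ArrayBuffer[String]): Unit = ()
    }

Such a receiver would be plugged into an application with ssc.receiverStream(new AckingReceiver(...)).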

Finally, note that enabling the write-ahead log slightly reduces receiving throughput. Because all of the data is written to the fault-tolerant file system, the file system's write throughput and the network bandwidth used for its replication can become bottlenecks. In that case, it is best to create more receivers to increase the parallelism of receiving and/or use better hardware to increase the throughput of the fault-tolerant file system.
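For example, receiving parallelism can be increased by creating several input streams (each backed by its own receiver) and unioning them; this sketch reuses the hypothetical socket source from the configuration example above.

    // Each call creates its own receiver, so data is received in parallel.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    }
    // Union the individual streams so downstream processing sees a single DStream.
    val unified = ssc.union(streams)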

Implementation Details

Let's dig a little deeper and see how the write-ahead log works. To do so, let's first revisit the typical Spark Streaming architecture.

When a Spark Streaming application starts (that is, when the driver starts), the associated StreamingContext (the base of all streaming functionality) uses the SparkContext to launch the receivers as long-running tasks. These receivers receive stream data and save it in Spark's memory for processing. The life cycle of the data sent by a user is as follows (see the diagram below).

Receiving data (blue arrows): a receiver divides the data stream into blocks and stores them in executor memory. In addition, when the write-ahead log is enabled, the data is also written to the log in the fault-tolerant file system.

Notifying the driver (green arrows): the metadata of the received blocks is sent to the StreamingContext in the driver. This metadata includes (i) a block reference ID that locates the data in executor memory, and (ii) the offset information of the block data in the log (if the log is enabled).

Processing the data (red arrows): at every batch interval, the StreamingContext uses the block information to generate resilient distributed datasets (RDDs) and their corresponding jobs. The StreamingContext executes the jobs by running tasks on the executors to process the in-memory blocks.

Checkpointing the computation (orange arrows): the streaming computation (in other words, the DStreams set up with the StreamingContext) is periodically checkpointed for recovery purposes, and the checkpoints are saved to a separate set of files in the same fault-tolerant file system.

[Illustration: data flow in Spark Streaming during normal operation]
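From the application's point of view, these steps are driven by the transformations and output operations declared on the input stream. A minimal sketch, continuing the hypothetical lines stream from the configuration example, might look like this; at each batch interval, the declared operations become RDDs and a job that processes that interval's blocks.

    // Transformations declared once; materialized as RDDs and a job per batch interval.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1L))
                      .reduceByKey(_ + _)
    counts.print()            // output operation: triggers one job per batch

    ssc.start()               // launches the receivers and starts generating jobs
    ssc.awaitTermination()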

When a failed driver is restarted, the following happens (see the next illustration).

Recovering the computation (orange arrows): the driver is restarted using the checkpoint information, the context is reconstructed, and the receivers are restarted.

Recovering the block metadata (green arrows): the metadata of all of the blocks needed to continue processing is recovered.

Reading the block data saved in the logs (blue arrows): when the jobs are executed, the block data is read directly from the write-ahead logs. This recovers all of the necessary data that was reliably saved to the logs.

Resending unacknowledged data (purple arrows): buffered data that was not saved to the log at the time of the failure is re-sent by the source, because the receiver had not yet acknowledged it.

[Illustration: recovery flow after a driver restart]

Thus, with write-ahead logs and reliable receivers, Spark Streaming can guarantee that no input data is lost due to a driver failure (or, for that matter, any failure).
