The Checkpoint of Spark Streaming


Reproduced from "Beef round powder without onions"
Link: http://www.jianshu.com/p/00b591c5f623

A streaming application often needs to run 7x24 without interruption, so it must be able to withstand unexpected failures (such as a machine or system going down, a JVM crash, and so on). To make this possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system so that the application can recover from failures. Spark Streaming checkpoints two kinds of data.

Metadata checkpointing: saves the information that defines the streaming computation to a fault-tolerant storage system such as HDFS. It is used to recover the driver. The metadata includes:
- Configuration: the configuration used to create the streaming application
- DStream operations: the series of DStream operations that define the application
- Incomplete batches: batches whose jobs have been queued but have not yet been executed or completed

Data checkpointing: saves the generated RDDs to reliable storage. This is required for some stateful transformations, in which the generated RDDs depend on RDDs from preceding batches, so the dependency chain keeps growing over time. To avoid this unbounded growth, the intermediate RDDs are periodically saved to reliable storage to cut off the dependency chain.

In short, metadata checkpointing is mainly used to recover the driver, while data (RDD) checkpointing is necessary for stateful transformations.

When to enable checkpointing

Enable checkpointing when either of the following is true:
- Stateful transformations are used: if operations such as updateStateByKey or reduceByKeyAndWindow are used in the application, a checkpoint directory must be provided so that RDDs can be checkpointed periodically (a sketch follows this list).
- You want to recover from driver failures.
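For the first case, here is a minimal, self-contained sketch (not from the original article; the socket host/port and the HDFS path are placeholder assumptions) of a running word count with updateStateByKey, which only works once a checkpoint directory has been set:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Required for updateStateByKey: the state RDDs are checkpointed here.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    val words = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Running count per word; without periodic RDD checkpointing its
    // lineage would grow with every batch.
    val counts = words.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}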

If the streaming application has no stateful operations and it is acceptable to lose some progress when the driver goes down and is restarted, there is no need to enable checkpointing.

How to use checkpointing

To enable checkpointing, set a directory in a fault-tolerant, reliable file system (such as HDFS or S3) to hold the checkpoint data. This is done by calling StreamingContext.checkpoint(checkpointDirectory). In addition, if you want the application to recover from driver failures, it should behave as follows: when the application starts for the first time, a new StreamingContext instance is created; when the application is restarted after a failure, the checkpoint data in the checkpoint directory is used to recreate the StreamingContext instance.

This can be achieved with StreamingContext.getOrCreate:

// Function to create and set up a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)     // new context
  val lines = ssc.socketTextStream(...)   // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)     // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()

If checkpointDirectory exists, the context is recreated from the checkpoint data. If the directory does not exist, the function functionToCreateContext is invoked and a new context is created.

In addition to calling getOrCreate, your cluster mode must support restarting the driver after it goes down. For example, in YARN mode the driver runs in the ApplicationMaster; if the ApplicationMaster fails, YARN automatically launches a new ApplicationMaster on another node.

Note that as the streaming application keeps running, the storage space occupied by the checkpoint data keeps growing, so the checkpoint interval needs to be chosen carefully. The smaller the interval, the more often checkpoints are written and the more space and overhead they incur; the larger the interval, the more data and progress are lost on recovery. The general recommendation is 5 to 10 times the batch duration, as in the sketch below.
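As an illustration (placeholder names and paths, not from the original article): with a 10-second batch duration, the stateful stream below is checkpointed every 50 seconds, i.e. 5 times the batch duration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointIntervalDemo")
val ssc = new StreamingContext(conf, Seconds(10))    // batch duration: 10 seconds
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // placeholder directory

// A windowed count with an inverse reduce function is stateful and
// therefore requires checkpointing.
val counts = ssc.socketTextStream("localhost", 9999)
  .map(line => (line, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
                        (a: Int, b: Int) => a - b,
                        Seconds(60), Seconds(10))

// Checkpoint this DStream every 50 seconds: 5 x the batch duration,
// within the recommended 5~10x range.
counts.checkpoint(Seconds(50))

counts.print()
ssc.start()
ssc.awaitTermination()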

Exporting checkpoint data

As mentioned above, checkpoint data is periodically exported to a reliable storage system, which raises two questions: at what moments is the checkpoint written, and what form does it take?

When the checkpoint is written

In Spark Streaming, JobGenerator generates the jobs for each batch. It has a timer whose period is the batchDuration set when the StreamingContext is initialized. Every time this period elapses, JobGenerator calls the generateJobs method to generate and submit the jobs for the batch, after which the doCheckpoint method is called to write a checkpoint. doCheckpoint checks whether the difference between the current time and the start time of the streaming application is a multiple of the checkpoint duration, and only performs the checkpoint if it is.
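As a rough illustration only (a simplified sketch, not the actual Spark source; the parameter names are assumptions), the timing rule amounts to:

import org.apache.spark.streaming.{Duration, Time}

// Simplified sketch of the check performed for each batch time.
def shouldWriteCheckpoint(batchTime: Time,
                          appStartTime: Time,
                          checkpointDuration: Duration): Boolean =
  // Checkpoint only when the time elapsed since the application started
  // is a whole multiple of the checkpoint duration.
  (batchTime - appStartTime).isMultipleOf(checkpointDuration)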

The form of the checkpoint

The final form of a checkpoint is an instance of the class Checkpoint serialized to external storage; it is worth mentioning that a dedicated thread writes the serialized checkpoint to the external store. The Checkpoint class holds the metadata described earlier, such as the configuration, the DStream operation graph, the checkpoint directory and interval, and the times of pending (incomplete) batches.

Besides the Checkpoint class, there is a CheckpointWriter class used to export checkpoints and a CheckpointReader class used to import them.

Limitations of the checkpoint

Spark Streaming's checkpoint mechanism looks good, but it has a flaw. As mentioned above, what is ultimately written to external storage is the serialized form of a Checkpoint object. Consequently, after the Spark Streaming application is recompiled, the old checkpoint data can no longer be deserialized, and a new StreamingContext must be created.

To deal with this situation, in our Spark Streaming + Kafka applications we maintain the consumed offsets ourselves, so that even after recompiling the application we can resume consuming from the required offsets. This is only mentioned as an example and is not expanded on in detail here; a minimal sketch follows.
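The sketch below (not from the original article) uses the spark-streaming-kafka-0-10 direct stream; the broker address, topic, and group id are placeholders, and since the article does not say where the offsets are stored, this sketch simply commits them back to Kafka after each batch.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("KafkaOffsetTracking")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",             // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-group",              // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  // Capture the offset ranges of this batch before any processing.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.map(_.value).count()                           // placeholder processing

  // Persist the offsets after the batch has been processed; on restart
  // (even after a recompile) consumption can resume from these offsets
  // instead of relying on the serialized checkpoint.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()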
