Spark Checkpoint Complete decryption (41)

Source: Internet
Author: User


First, what exactly is Checkpoint?

1, In a Spark production environment we often face a very large number of RDD transformations (for example, a Job that contains 1 million RDDs), or an RDD produced by a specific transformation whose computation is particularly complex and time-consuming (for example, it often takes more than 1 hour). In such cases we must consider persisting the computed data;

2, Spark is good at multi-step iteration and at reuse across Jobs; if data produced earlier in the computation can be reused, efficiency improves greatly;

3, If you use persist to keep the data in memory, it is the fastest option but also the least reliable. Placing the data on disk is not completely reliable either; the disk may be damaged, for example;

4, With checkpoint you can specify that the data be placed locally and in multiple copies, but in a normal production environment it is placed on HDFS, whose high fault tolerance and high reliability give the most reliable persistence of the data;

5, Checkpoint is the Spark feature for reliably reusing the computed data of an RDD: the data is persisted to HDFS;

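6, For a step in the RDD computation chain that particularly needs persistence (for example, an RDD that later steps reuse), persistence based on HDFS is started for that RDD; enabling the checkpoint mechanism on the RDD provides fault tolerance and high availability;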
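The points above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the author's code: the helper name `checkpoint_then_run`, the app name, and the directory `hdfs:///tmp/spark-checkpoints` are hypothetical, while `setCheckpointDir`, `persist`, `checkpoint`, `count`, and `isCheckpointed` are public PySpark methods. The helper assumes no job has run on the RDD yet.

```python
def checkpoint_then_run(sc, rdd, checkpoint_dir):
    """Mark `rdd` for checkpointing and run one action on it.

    `sc` is expected to be a pyspark.SparkContext and `rdd` a pyspark.RDD
    on which no job has been executed yet.
    """
    # Where checkpoint files go; on a cluster this should be an HDFS path.
    sc.setCheckpointDir(checkpoint_dir)
    # Cache first so the later checkpoint job can reuse the computed data.
    rdd.persist()
    # Only marks the RDD for checkpointing; nothing is written yet.
    rdd.checkpoint()
    # The first action runs the job; the checkpoint is written after it.
    n = rdd.count()
    return n, rdd.isCheckpointed()


def demo():
    """Example wiring; needs a working PySpark installation to run."""
    from pyspark import SparkContext

    sc = SparkContext(appName="checkpoint-demo")  # hypothetical app name
    try:
        squares = sc.parallelize(range(1000)).map(lambda x: x * x)
        # Hypothetical directory; use any path the cluster can write to.
        print(checkpoint_then_run(sc, squares, "hdfs:///tmp/spark-checkpoints"))
    finally:
        sc.stop()
```

`demo()` is deliberately not called here; on a machine without HDFS, a local directory can be passed instead of the hypothetical HDFS path.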

Second, the Checkpoint mechanism

1, Calling SparkContext.setCheckpointDir specifies where RDDs that are checkpointed put their data; in a production cluster this directory is on HDFS;

2, When an RDD is checkpointed, all of the RDDs it depends on are removed from its computation chain;

3, As a best practice, persist is generally called before checkpoint so that the current RDD's data is persisted to memory or disk. The reason is that checkpoint is lazy: a Job must execute first, and only after that Job completes does Spark trace back through the lineage for RDDs that were marked for checkpointing, then start a new Job for each marked RDD to carry out the actual checkpoint. Without persist, that new Job has to compute the RDD a second time;

4, Checkpoint changes the lineage of the RDD;

5, When we call the checkpoint method on an RDD, the framework automatically creates an RDDCheckpointData. Once a Job has run on the RDD, the checkpoint method of that RDDCheckpointData is triggered immediately, and it calls doCheckpoint internally. In a production environment the doCheckpoint that is actually called belongs to ReliableRDDCheckpointData, which in turn calls writeRDDToCheckpointDirectory of ReliableCheckpointRDD. Inside writeRDDToCheckpointDirectory, runJob is triggered to write the current RDD's data into the checkpoint directory, and a ReliableCheckpointRDD instance is produced as well;
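The behaviour described in this list can be imitated with a small toy model in plain Python. It is a sketch for intuition only, not Spark's implementation: `ToyRDD`, `run_job`, and `build` are hypothetical names, a Python list stands in for the files in the checkpoint directory, and the real classes (RDDCheckpointData, ReliableCheckpointRDD) are not modeled.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a parent link plus a function (not Spark's API)."""

    def __init__(self, name, parent=None, fn=None, source=None):
        self.name = name
        self.parent = parent         # lineage: the RDD this one depends on
        self.fn = fn                 # transformation applied to parent data
        self.source = source         # data of a root RDD
        self.persisted = False
        self.cached = None           # filled on first compute if persisted
        self.marked = False          # set by checkpoint(); lazy
        self.checkpoint_file = None  # stands in for the checkpoint directory
        self.computations = 0        # how often the data was really computed

    def persist(self):
        self.persisted = True
        return self

    def checkpoint(self):
        self.marked = True  # only a mark; no data is written here

    def compute(self):
        if self.checkpoint_file is not None:
            return list(self.checkpoint_file)  # read back, no lineage needed
        if self.cached is not None:
            return list(self.cached)
        self.computations += 1
        if self.parent is None:
            data = list(self.source)
        else:
            data = [self.fn(x) for x in self.parent.compute()]
        if self.persisted:
            self.cached = data
        return list(data)

    def lineage(self):
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


def run_job(rdd):
    """Run the user's job, then a second job for each marked RDD in the lineage."""
    result = rdd.compute()
    node = rdd
    while node is not None:
        parent = node.parent  # remember it, the link may be cut below
        if node.marked and node.checkpoint_file is None:
            node.checkpoint_file = node.compute()  # the extra checkpoint job
            node.parent = None                     # parents leave the chain
        node = parent
    return result


def build(persist_first):
    base = ToyRDD("base", source=[1, 2, 3])
    mapped = ToyRDD("mapped", parent=base, fn=lambda x: x * 10)
    if persist_first:
        mapped.persist()
    mapped.checkpoint()
    final = ToyRDD("final", parent=mapped, fn=lambda x: x + 1)
    return mapped, final


mapped_a, final_a = build(persist_first=False)
run_job(final_a)
without_persist = mapped_a.computations  # user job + checkpoint job

mapped_b, final_b = build(persist_first=True)
run_job(final_b)
with_persist = mapped_b.computations     # checkpoint job reads the cache

print(without_persist, with_persist)     # prints: 2 1
print(final_b.lineage())                 # prints: ['final', 'mapped']
```

In the toy model, calling `persist` before `checkpoint` halves the number of times the marked RDD is computed, and after the job the lineage of `final` stops at the checkpointed RDD instead of reaching back to `base`.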




Note:

Sina Weibo: Http://www.weibo.com/ilovepains

WeChat public account: Dt_spark






This article is from the "onepeople" blog, make sure to keep this source http://5233240.blog.51cto.com/5223240/1773649
