Spark checkpointing writes an RDD to disk as a checkpoint. It complements lineage-based fault tolerance: when the lineage grows too long, recomputing it after a failure becomes too expensive, so a checkpoint is taken at an intermediate stage. If a node later fails and partitions are lost, recovery replays the lineage starting from the checkpointed RDD instead of from the very beginning, which reduces overhead. Checkpointing is mainly useful in two cases: 1. The lineage in the DAG is too long and recomputing it would be too costly, as in iterative algorithms such as PageRank or ALS; 2. It is especially suitable to checkpoint at a wide dependency, because this avoids the redundant computation that recomputing the lineage across that dependency would otherwise cause. A minimal usage sketch follows.
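As a minimal sketch (the checkpoint directory, the sample data, and the iteration loop are all illustrative, not from the original post), checkpointing an RDD in Spark means setting a checkpoint directory on the SparkContext, calling checkpoint() on the RDD you want to persist, and running an action so the data is actually materialized to reliable storage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CheckpointExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Directory where checkpointed RDD data is written (use HDFS in production).
    // "/tmp/spark-checkpoints" is just an illustrative local path.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // An iterative computation whose lineage keeps growing, e.g. a PageRank-style loop.
    var ranks = sc.parallelize(1 to 1000).map(i => (i, 1.0))
    for (iter <- 1 to 10) {
      ranks = ranks.mapValues(r => r * 0.85 + 0.15)
      if (iter % 5 == 0) {
        // Cache first so the checkpoint job does not recompute the RDD from scratch,
        // then mark it for checkpointing; the write happens after the next action.
        ranks.cache()
        ranks.checkpoint()
      }
    }

    // The action triggers the computation and the pending checkpoint writes,
    // truncating the lineage at the checkpointed RDDs.
    println(ranks.count())
    sc.stop()
  }
}
```

After the action completes, the checkpointed RDD's lineage is cut off at the checkpoint, so a lost partition is rebuilt from the files on disk rather than by replaying the full chain of transformations.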
This article is from the "Liaoliang Big Data Quotes" blog; please be sure to keep this source: http://wangjialin2dt.blog.51cto.com/10467465/1723419
Liaoliang's Daily Big Data Quote, Spark 0022 (2015.11.18, Zhuhai)