"Spark" spark fault tolerance mechanism

Introduction

In general, there are two ways to make a distributed dataset fault-tolerant: data checkpointing and logging the updates made to the data.
For large-scale data analytics, data checkpointing is costly: it requires replicating large datasets between machines over the data-center network, whose bandwidth is usually far lower than memory bandwidth, and it also consumes extra storage.
Spark therefore chooses to log updates. However, if updates were logged at a fine granularity, the logging cost would not be low either. For this reason RDDs only support coarse-grained transformations: only the single operation applied to a whole block is recorded, producing the RDD's sequence of transformations. Each RDD keeps the information about how it was derived from other RDDs and how to reconstruct its data, which is why the RDD fault-tolerance mechanism is also known as "lineage" fault tolerance; these records are what is used to recover lost partitions.
Lineage is essentially similar to a database redo log, except that a redo log is very fine-grained and restores data by replaying the same operations over the global data.

Brief introduction to the Lineage mechanism

Compared with systems that back up or log fine-grained in-memory updates, the lineage of an RDD records coarse-grained transformation operations on specific data (such as filter, map, and join). When part of an RDD's partition data is lost, the RDD can obtain enough information through its lineage to recompute and recover the lost partitions. Because this coarse-grained data model limits how Spark can be used, Spark is not suitable for every scenario, but in return it provides performance advantages over fine-grained data models.
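
As a minimal sketch (assuming an existing SparkContext named sc, e.g. the one provided by spark-shell; the input path and variable names are purely illustrative), the lineage recorded for a chain of such coarse-grained transformations can be inspected with toDebugString:

    // Assumes an existing SparkContext `sc`; the HDFS path is hypothetical.
    val lines  = sc.textFile("hdfs:///data/input.txt")
    val words  = lines.flatMap(_.split(" "))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // Nothing has been computed yet: each RDD only records how it was derived
    // from its parents, and Spark replays this lineage to rebuild a lost partition.
    println(counts.toDebugString)

toDebugString is only a convenience for printing the recorded lineage; the same dependency information is what the scheduler consults when a partition has to be recomputed.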

Two kinds of dependency relationships

Lineage distinguishes two types of RDD dependencies, narrow dependencies (Narrow Dependencies) and wide dependencies (Wide Dependencies, called Shuffle Dependencies in the source code), in order to address the efficiency of data fault tolerance.

  • A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD: either one parent partition corresponds to one child partition, or several parent partitions correspond to one child partition. In other words, one parent partition can never correspond to multiple child partitions.
    The case of one parent partition corresponding to one child partition comes in two forms: one child partition corresponds to one parent partition (e.g. map, filter), or one child partition corresponds to N parent partitions (e.g. a co-partitioned join).
  • A wide dependency means that a partition of the child RDD depends on multiple partitions, or even all partitions, of the parent RDD; that is, one parent partition corresponds to multiple child partitions.
    This comes in two cases: one parent partition corresponds to all child partitions (e.g. a join that is not co-partitioned), or one parent partition corresponds to several, but not all, child partitions (e.g. groupByKey). Both kinds of dependency are illustrated in the sketch below.
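
A small sketch of both dependency types (again assuming an existing SparkContext sc; the dependency class names in the comments, OneToOneDependency and ShuffleDependency, are the ones used in Spark's source code):

    // Narrow dependencies: each child partition reads from at most one parent partition.
    val nums    = sc.parallelize(1 to 100, 4)
    val doubled = nums.map(_ * 2).filter(_ % 3 == 0)

    // Wide dependency: groupByKey shuffles, so one parent partition can feed
    // several child partitions.
    val grouped = nums.map(n => (n % 10, n)).groupByKey()

    println(doubled.dependencies.map(_.getClass.getSimpleName))  // e.g. List(OneToOneDependency)
    println(grouped.dependencies.map(_.getClass.getSimpleName))  // e.g. List(ShuffleDependency)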

The essence: narrow and wide dependencies are distinguished by whether a partition of the parent RDD corresponds to one or to multiple partitions of the child RDD (narrow: a parent partition feeds one child partition; wide: a parent partition feeds multiple child partitions). If it feeds multiple, then when a lost partition is recomputed for fault tolerance, only part of the parent partition's data is actually needed by that lost partition; recomputing the remaining data produces redundant computation.

For wide dependencies, a stage's input and output are computed on different nodes. If the input node is intact and the output node crashes, recovering the data by recomputation works; otherwise it does not, because the computation cannot simply be retried and Spark must trace back through the RDD's ancestors to see what can be retried (this is the meaning of lineage). Recomputing data under a narrow dependency is much cheaper than recomputing it under a wide dependency.

The concepts of narrow and wide dependency are used mainly in two places: in fault tolerance, where lineage plays the role of a redo log, and in scheduling, where wide dependencies serve as the dividing points between stages when the DAG is constructed.

Characteristics of dependency relationships

First, with a narrow dependency a partition of the child RDD can be computed directly on one compute node from the corresponding partition(s) of the parent RDD. With a wide dependency, all partitions of the parent RDD must be computed first, and the parent's data must be hashed and transferred to the corresponding nodes, before the child RDD can be computed.
Second, when data is lost, a narrow dependency only requires recomputing the lost partition, whereas a wide dependency may require recomputing all partitions of the ancestor RDDs to recover. On a long lineage chain, especially one containing wide dependencies, it is therefore worth setting data checkpoints at appropriate points. These two characteristics also mean that different dependency types call for different task scheduling and fault-tolerance recovery mechanisms.
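
A sketch of the first characteristic (same assumption of an existing SparkContext sc): the narrow map and filter steps can be pipelined on one node within a single stage, while the wide reduceByKey forces all parent partitions to be computed and shuffled first:

    // Narrow transformations: pipelined inside a single stage.
    val base      = sc.parallelize(1 to 1000, 8)
    val pipelined = base.map(_ + 1).filter(_ % 2 == 0)

    // Wide transformation: introduces a shuffle and therefore a new stage.
    val shuffled  = pipelined.map(n => (n % 4, n)).reduceByKey(_ + _)

    // toDebugString shows the shuffle as a separately indented branch, i.e. the
    // point where the child RDD must wait for all parent partitions.
    println(shuffled.toDebugString)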

Fault tolerance principle

In the fault-tolerance mechanism, if a node crashes and the operation is a narrow dependency, only the lost parent RDD partition needs to be recomputed, and this does not depend on other nodes. A wide dependency, by contrast, needs all partitions of the parent RDD, which is expensive. The cost difference can be understood as follows: under a narrow dependency, when a lost child partition is recomputed from its parent partition, all of the data in that parent partition belongs to the lost child partition, so there is no redundant computation. Under a wide dependency, the data that the lost child partition needs from each parent partition is only a subset; the rest belongs to child partitions that were not lost, so recomputing the parent partitions produces redundant work. This is also why recovery under wide dependencies is so much more expensive. Therefore, when deciding where to call the checkpoint operator, one should consider not only whether the lineage is long enough, but also whether it contains wide dependencies; placing checkpoints on wide dependencies gives the best value for money.

Checkpoint mechanism

The above analysis shows that an RDD should be checkpointed in the following two situations:

  1. The lineage in the DAG is too long, so recomputing it would be too expensive (as in PageRank); a sketch of this case follows the list.
  2. The benefit of checkpointing is even greater when it is placed on a wide dependency.
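
A hedged sketch of the first situation: an iterative job loosely in the spirit of PageRank (heavily simplified; the checkpoint interval of 10, the directory, and all names are illustrative) whose lineage grows by one wide dependency per iteration and is truncated by periodic checkpoints:

    // Assumes an existing SparkContext `sc`; the checkpoint directory is illustrative.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    var ranks = sc.parallelize(Seq(("a", 1.0), ("b", 1.0), ("c", 1.0)))
    for (i <- 1 to 30) {
      // Each iteration appends a wide dependency (reduceByKey) to the lineage.
      ranks = ranks.map { case (k, v) => (k, 0.15 + 0.85 * v) }
                   .reduceByKey(_ + _)
      if (i % 10 == 0) {
        ranks.checkpoint()  // mark this RDD so that its lineage is cut here
        ranks.count()       // an action materializes the checkpoint to disk
      }
    }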

Because RDDs are read-only, consistency of Spark's RDD computations is not a major concern, and memory is relatively easy to manage. This is a far-sighted aspect of the design: it reduces the complexity of the framework, improves performance and scalability, and lays a solid foundation for the rich set of frameworks built on top.
In RDD computation, fault tolerance is also achieved through the checkpoint mechanism. Traditionally, checkpointing is done in one of two ways: keeping redundant data or logging update operations. The doCheckpoint method on an RDD corresponds to caching data through redundancy, while the lineage described earlier provides fault tolerance through fairly coarse-grained logging of update operations.

Checkpointing (essentially done by writing an RDD to disk) complements lineage-based fault tolerance: when the lineage becomes too long, the cost of fault tolerance becomes too high, so it is better to checkpoint at an intermediate stage. If a node fails later and partitions are lost, the lineage is replayed starting from the checkpointed RDD, which reduces the overhead.
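
The basic mechanics, as a minimal sketch (again assuming an existing SparkContext sc; the paths are illustrative): once the first action has written the checkpoint files, the lineage above that RDD is truncated, so recovering a lost partition later replays the lineage from the checkpoint on disk rather than from the very beginning.

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // reliable storage for checkpoint files

    val events = sc.textFile("hdfs:///data/events.log")    // hypothetical input
      .map(_.split(","))
      .filter(_.length > 2)

    events.cache()        // optional: avoids computing the RDD twice while checkpointing
    events.checkpoint()   // mark the RDD to be written to the checkpoint directory
    events.count()        // the first action runs the job and triggers the checkpoint write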

Reprints must credit the author, Jason Ding, and cite the source:
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to reach my blog homepage

Copyright notice: this is an original article by the blog author and may not be reproduced without permission.

"Spark" spark fault tolerance mechanism

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.