Explanation of Spark RDD Dependency Source Code

1. Dependency of RDD
An RDD's dependencies fall into two categories: narrow dependencies and wide dependencies. We can think of them this way:

(1) Narrow dependency: each parent RDD partition is used by at most one partition of child RDD.
(2) Wide dependency: each parent RDD partition is used by multiple child RDD partitions.
Under a narrow dependency, each child RDD partition can be computed in parallel from its parent partitions; under a wide dependency, the shuffle output of all parent RDD partitions must be available before computation can proceed.
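The difference is easy to observe by inspecting an RDD's dependencies at runtime. Below is a minimal, self-contained sketch; the object name DependencyDemo and the local[2] master are illustrative choices, not part of the source being analyzed:

// DependencyDemo.scala (illustrative)
import org.apache.spark.{SparkConf, SparkContext}

object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dependency-demo").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

    // map keeps a one-to-one partition mapping: narrow dependency
    val mapped = pairs.map { case (k, v) => (k, v * 2) }
    println(mapped.dependencies.head.getClass.getSimpleName)  // OneToOneDependency

    // reduceByKey regroups data by key across partitions: wide dependency
    val reduced = pairs.reduceByKey(_ + _)
    println(reduced.dependencies.head.getClass.getSimpleName) // ShuffleDependency

    sc.stop()
  }
}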

2. org.apache.spark.Dependency.scala source code analysis
Dependency is an abstract class:

// Dependency.scala
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}

It has two subclasses, NarrowDependency and ShuffleDependency, which correspond to narrow and wide dependencies, respectively.

(1) NarrowDependency is also an abstract class
It defines the abstract method getParents, which takes a child partitionId and returns all partitions of the parent RDD that this child partition depends on.

// Dependency.scala
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}

Narrow dependencies have two specific implementations: OneToOneDependency and RangeDependency.
(A) OneToOneDependency means that each partition of the child RDD depends on exactly one partition of the parent RDD. Operators that produce a OneToOneDependency include map, filter, and flatMap. Its getParents implementation is trivial: it simply wraps the incoming partitionId in a List, as the example after the code demonstrates.

// Dependency.scala
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
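We can verify this mapping directly (a small sketch, again assuming a SparkContext named sc as in the demo above):

// Each child partition i depends on exactly parent partition i
val parent = sc.parallelize(1 to 8, 4)
val child  = parent.map(_ + 1)
val oneToOne = child.dependencies.head
  .asInstanceOf[org.apache.spark.OneToOneDependency[Int]]
(0 until child.partitions.length).foreach { i =>
  println(s"child partition $i <- parent partitions ${oneToOne.getParents(i)}")
}
// child partition 0 <- parent partitions List(0)
// child partition 1 <- parent partitions List(1), and so on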
(B) RangeDependency means that partitions of the child RDD depend one-to-one on partitions of the parent RDD within a contiguous range; it is mainly used to implement union. A worked example follows the code.

// Dependency.scala
// inStart is the starting partition index within the parent RDD;
// outStart is the starting partition index within the child RDD.
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      // Shift the child index back into the parent's partition range
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
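Since union simply concatenates the partitions of its parents, it creates one RangeDependency per parent. A small sketch (assuming a SparkContext named sc) traces the index arithmetic:

// a contributes child partitions 0..1, b contributes child partitions 2..4
val a = sc.parallelize(1 to 4, 2)
val b = sc.parallelize(5 to 9, 3)
val u = a.union(b)
u.dependencies.foreach { d =>
  val range = d.asInstanceOf[org.apache.spark.RangeDependency[Int]]
  println(s"getParents(3) = ${range.getParents(3)}")
}
// Dependency on a (outStart = 0, length = 2): partition 3 is out of range -> List()
// Dependency on b (inStart = 0, outStart = 2, length = 3): 3 - 2 + 0 = 1 -> List(1)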

(2) ShuffleDependency represents a wide dependency
It indicates that a parent RDD partition may be used by multiple child RDD partitions, which can only arise through a shuffle.

// Dependency.scala
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  // Shuffle operates on pair RDDs, so the incoming RDD must be of key-value type
  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  // Obtain a new shuffleId
  val shuffleId: Int = _rdd.context.newShuffleId()

  // Register the shuffle with the shuffleManager
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}

Since a shuffle involves network transfer, a serializer is required. To reduce the amount of data transferred, aggregation can be performed on the map side, controlled through mapSideCombine and aggregator; keyOrdering governs how keys are sorted, and partitioner determines how the output data is repartitioned. The keyClassName, valueClassName, and combinerClassName fields record the class information involved in the shuffle. The partition-to-partition relationship is cut off at the shuffle, so shuffles are the basis for dividing stages.
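The parameters a concrete operator chooses can be inspected at runtime; for instance, reduceByKey enables map-side combining (a sketch assuming a SparkContext named sc):

val shuffleDep = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
  .reduceByKey(_ + _)
  .dependencies.head
  .asInstanceOf[org.apache.spark.ShuffleDependency[String, Int, Int]]

println(shuffleDep.partitioner)           // HashPartitioner by default
println(shuffleDep.mapSideCombine)        // true: values pre-combined per map task
println(shuffleDep.aggregator.isDefined)  // true: wraps the merge function _ + _
println(shuffleDep.keyOrdering)           // None: reduceByKey needs no key sorting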

3. The distinction between the two kinds of dependency
First, narrow dependencies allow pipelined computation of all parent partitions on a single cluster node, for example performing a map and then a filter element by element. A wide dependency instead requires the data of all parent partitions to be computed first and then shuffled between nodes, which is similar to MapReduce. Second, narrow dependencies allow more efficient recovery from node failures: only the parent partitions of the lost RDD partitions need to be recomputed, and different nodes can do so in parallel. With a wide dependency in the lineage graph, the failure of a single node may cause the loss of some partitions in all ancestors of an RDD, requiring a complete recomputation. The stage boundary that a shuffle introduces is visible in an RDD's lineage, as the sketch below shows.
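A quick way to see the stage split (assuming a SparkContext named sc): narrow operators such as map and filter stay at one indentation level in toDebugString, while the ShuffleDependency introduced by reduceByKey starts a new one.

val result = sc.parallelize(1 to 100, 4)
  .map(x => (x % 10, x))  // narrow: pipelined with the following filter
  .filter(_._2 > 5)       // narrow: same stage as map
  .reduceByKey(_ + _)     // wide: the shuffle marks a stage boundary
println(result.toDebugString)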