Explanation of Spark RDD Dependency Source Code

1. Dependency of RDD
An RDD's dependencies fall into two categories: narrow dependencies and wide dependencies. We can think of them this way:

(1) Narrow dependency: each parent RDD partition is used by at most one partition of child RDD.
(2) Wide dependency: each parent RDD partition is used by multiple child RDD partitions.
Under a narrow dependency, each child RDD partition can be computed in parallel from its parent partitions; under a wide dependency, the shuffle output of all parent RDD partitions must be available before computation can proceed.
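The difference is easy to observe by inspecting an RDD's dependencies at runtime. Below is a minimal, self-contained sketch; the object name DependencyDemo and the local[2] master are illustrative choices, not part of the source being analyzed:

// DependencyDemo.scala (illustrative)
import org.apache.spark.{SparkConf, SparkContext}

object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dependency-demo").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

    // map keeps a one-to-one partition mapping: narrow dependency
    val mapped = pairs.map { case (k, v) => (k, v * 2) }
    println(mapped.dependencies.head.getClass.getSimpleName)  // OneToOneDependency

    // reduceByKey regroups data by key across partitions: wide dependency
    val reduced = pairs.reduceByKey(_ + _)
    println(reduced.dependencies.head.getClass.getSimpleName) // ShuffleDependency

    sc.stop()
  }
}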

2. org.apache.spark.Dependency.scala source code analysis
Dependency is an abstract class:

// Dependency.scala
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}

It has two subclasses, NarrowDependency and ShuffleDependency, which correspond to narrow and wide dependencies, respectively.

(1) NarrowDependency is also an abstract class
It defines the abstract method getParents, which takes a child partitionId and returns all partitions of the parent RDD that this child partition depends on.

// Dependency.scala
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}

Narrow dependencies have two specific implementations: OneToOneDependency and RangeDependency.
(A) OneToOneDependency means that each partition of the child RDD depends on exactly one partition of the parent RDD. Operators that produce a OneToOneDependency include map, filter, and flatMap. Its getParents implementation is trivial: it simply wraps the incoming partitionId in a List, as the example after the code demonstrates.

// Dependency.scala
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
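We can verify this mapping directly (a small sketch, again assuming a SparkContext named sc as in the demo above):

// Each child partition i depends on exactly parent partition i
val parent = sc.parallelize(1 to 8, 4)
val child  = parent.map(_ + 1)
val oneToOne = child.dependencies.head
  .asInstanceOf[org.apache.spark.OneToOneDependency[Int]]
(0 until child.partitions.length).foreach { i =>
  println(s"child partition $i <- parent partitions ${oneToOne.getParents(i)}")
}
// child partition 0 <- parent partitions List(0)
// child partition 1 <- parent partitions List(1), and so on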
(B) RangeDependency means that partitions of the child RDD depend one-to-one on partitions of the parent RDD within a contiguous range; it is mainly used to implement union. A worked example follows the code.

// Dependency.scala
// inStart is the starting partition index within the parent RDD;
// outStart is the starting partition index within the child RDD.
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      // Shift the child index back into the parent's partition range
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
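Since union simply concatenates the partitions of its parents, it creates one RangeDependency per parent. A small sketch (assuming a SparkContext named sc) traces the index arithmetic:

// a contributes child partitions 0..1, b contributes child partitions 2..4
val a = sc.parallelize(1 to 4, 2)
val b = sc.parallelize(5 to 9, 3)
val u = a.union(b)
u.dependencies.foreach { d =>
  val range = d.asInstanceOf[org.apache.spark.RangeDependency[Int]]
  println(s"getParents(3) = ${range.getParents(3)}")
}
// Dependency on a (outStart = 0, length = 2): partition 3 is out of range -> List()
// Dependency on b (inStart = 0, outStart = 2, length = 3): 3 - 2 + 0 = 1 -> List(1)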

(2) ShuffleDependency represents a wide dependency
It indicates that a parent RDD partition may be used by multiple child RDD partitions, which can only arise through a shuffle.

// Dependency.scala
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  // Shuffle operates on pair RDDs, so the incoming RDD must be of key-value type
  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  // Obtain a new shuffleId
  val shuffleId: Int = _rdd.context.newShuffleId()

  // Register the shuffle with the shuffleManager
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}

Since a shuffle involves network transfer, a serializer is required. To reduce the amount of data transferred, aggregation can be performed on the map side, controlled through mapSideCombine and aggregator; keyOrdering governs how keys are sorted, and partitioner determines how the output data is repartitioned. The keyClassName, valueClassName, and combinerClassName fields record the class information involved in the shuffle. The partition-to-partition relationship is cut off at the shuffle, so shuffles are the basis for dividing stages.
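The parameters a concrete operator chooses can be inspected at runtime; for instance, reduceByKey enables map-side combining (a sketch assuming a SparkContext named sc):

val shuffleDep = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
  .reduceByKey(_ + _)
  .dependencies.head
  .asInstanceOf[org.apache.spark.ShuffleDependency[String, Int, Int]]

println(shuffleDep.partitioner)           // HashPartitioner by default
println(shuffleDep.mapSideCombine)        // true: values pre-combined per map task
println(shuffleDep.aggregator.isDefined)  // true: wraps the merge function _ + _
println(shuffleDep.keyOrdering)           // None: reduceByKey needs no key sorting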

3. The distinction between the two kinds of dependency
First, narrow dependencies allow pipelined computation of all parent partitions on a single cluster node, for example performing a map and then a filter element by element. A wide dependency instead requires the data of all parent partitions to be computed first and then shuffled between nodes, which is similar to MapReduce. Second, narrow dependencies allow more efficient recovery from node failures: only the parent partitions of the lost RDD partitions need to be recomputed, and different nodes can do so in parallel. With a wide dependency in the lineage graph, the failure of a single node may cause the loss of some partitions in all ancestors of an RDD, requiring a complete recomputation. The stage boundary that a shuffle introduces is visible in an RDD's lineage, as the sketch below shows.
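A quick way to see the stage split (assuming a SparkContext named sc): narrow operators such as map and filter stay at one indentation level in toDebugString, while the ShuffleDependency introduced by reduceByKey starts a new one.

val result = sc.parallelize(1 to 100, 4)
  .map(x => (x % 10, x))  // narrow: pipelined with the following filter
  .filter(_._2 > 5)       // narrow: same stage as map
  .reduceByKey(_ + _)     // wide: the shuffle marks a stage boundary
println(result.toDebugString)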