Spark Growth Path (2): The RDD Partition Dependency System

Source: Internet
Author: User
Tags: abstract, shuffle, split, spark, rdd

Reference articles:
Deep Understanding of the Spark RDD Abstraction Model and Writing RDD Functions
RDD Dependencies
Spark Scheduling Series
Partial Functions

Contents: Introduction · Dependency Diagram · Dependency Concept Class · Narrow Dependency Classes (OneToOneDependency, RangeDependency, PruneDependency) · Wide Dependency Class (ShuffleDependency)

Introduction

Dependencies between RDDs fall broadly into two categories: narrow dependencies and wide dependencies.
Borrowing from the reference articles: Narrow dependency: a partition of the parent RDD is used by at most one partition of the child RDD. In other words, the data inside one parent partition cannot be split up; it must go, in its entirety, to a single child partition. Wide dependency: a partition of the parent RDD may be used by multiple child partitions, so its data is split.

Dependency Diagram

The class hierarchy in the source code is as follows:

Dependency (org.apache.spark)
    ShuffleDependency (org.apache.spark)
    NarrowDependency (org.apache.spark)
        PruneDependency (org.apache.spark.rdd)
        RangeDependency (org.apache.spark)
        OneToOneDependency (org.apache.spark)
Dependency Concept Class

Dependency

@DeveloperApi
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}

Fairly simple: it only provides an rdd method that returns the parent RDD, that is, the RDD this dependency points to.
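
As a quick sanity check (a minimal sketch; the local-mode setup and value names are mine, not from the article), every RDD exposes a dependencies sequence, and each entry hands back its parent through this rdd method:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dep-base").setMaster("local[2]"))
val parent = sc.parallelize(1 to 10, 2)
val child  = parent.map(_ * 2)

// Dependency#rdd returns the parent RDD that the child depends on.
assert(child.dependencies.head.rdd == parent)
sc.stop()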

Narrow Dependency Classes

NarrowDependency

@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}

This adds a getParents method: given the ID of a child RDD partition, it returns the IDs of the parent RDD partitions that partition depends on, as a Seq[Int].

This class is also abstract; narrow dependency is further subdivided into three subclasses:
PruneDependency, RangeDependency, and OneToOneDependency.

OneToOneDependency

Each partition of the child RDD corresponds to exactly one partition of the parent RDD, and each parent partition corresponds to only one child partition; the child has the same number of partitions as the parent.

(Figure A and Figure B: two examples of 1:1 partition mappings)

Both of these cases are 1:1 dependencies.

@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

Getting the parent partition is trivial: the incoming ID is simply wrapped in a Seq and returned. Because the dependency is 1:1, the parent partition's ID is the same as the child partition's ID.
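
To see this concretely (a sketch under the same local-mode assumption as above, not from the article), a plain map yields a OneToOneDependency whose getParents is the identity:

import org.apache.spark.{OneToOneDependency, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("one-to-one").setMaster("local[2]"))
val parent = sc.parallelize(1 to 8, 4)
val child  = parent.map(_ + 1)  // map keeps the 1:1 partition layout

val dep = child.dependencies.head.asInstanceOf[OneToOneDependency[Int]]
// Child partition i depends on exactly parent partition i.
(0 until child.partitions.length).foreach { i =>
  assert(dep.getParents(i) == List(i))
}
sc.stop()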

RangeDependency

Multiple parent RDDs feed a single child RDD. Except for the first parent, whose partition IDs coincide with the child's, the parents' partition IDs differ from the child's.
This dependency exists only between a UnionRDD, produced by the union operation, and its parents. We can verify that only UnionRDD uses this dependency by searching the source for places that construct a RangeDependency object:

(Figure: two parent RDDs combined by union into one child RDD)

@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}

Getting the parent partition ID of a child partition now involves an if test, so it is not as simple as before. First, a few concepts. inStart: the starting partition ID within the parent RDD. outStart: the partition ID, within the child RDD, where the parent's range starts. length: the number of partitions in the parent RDD.

The code that generates the dependency is as follows:

UnionRDD.getDependencies

override def getDependencies: Seq[Dependency[_]] = {
  val deps = new ArrayBuffer[Dependency[_]]
  var pos = 0
  for (rdd <- rdds) {
    deps += new RangeDependency(rdd, 0, pos, rdd.partitions.length)
    pos += rdd.partitions.length
  }
  deps
}

For the two parent RDDs in the figure above, the dependencies generated are:
rdd0: RangeDependency(rdd0, 0, 0, 2)
rdd1: RangeDependency(rdd1, 0, 2, 3)

So, to find the parent partition behind the child partition with partitionId = 3:

outStart = 2
inStart = 0
3 - 2 + 0 = 1

That is, it maps to the partition with partitionId = 1 in rdd1.
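
The same numbers can be reproduced in code (a hedged sketch, again assuming a local SparkContext): rdd0 gets 2 partitions, rdd1 gets 3, and child partition 3 should map back to partition 1 of rdd1:

import org.apache.spark.{NarrowDependency, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("range-dep").setMaster("local[2]"))
val rdd0 = sc.parallelize(1 to 4, 2)   // becomes child partitions 0..1
val rdd1 = sc.parallelize(5 to 10, 3)  // becomes child partitions 2..4
val unioned = rdd0.union(rdd1)         // UnionRDD with two RangeDependencies

val deps = unioned.dependencies.map(_.asInstanceOf[NarrowDependency[Int]])
// Child partition 3 lies in rdd1's range: 3 - outStart(2) + inStart(0) = 1.
assert(deps(1).getParents(3) == List(1))
// It lies outside rdd0's range, so rdd0's dependency returns Nil for it.
assert(deps(0).getParents(3) == Nil)
sc.stop()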

PruneDependency

I could not find any articles specifically about this one, so the following is just my own reading of the code.

This dependency is used only by the PartitionPruningRDD object. Prune means clipping: only part of the parent RDD's partitions are depended on by the child RDD; the rest are removed by a filter function, usually based on the partition index, as in the code below.

For example, the filter function might keep only the parent partitions whose index is 0 or 2.

private[spark] class PruneDependency[T](rdd: RDD[T], partitionFilterFunc: Int => Boolean)
  extends NarrowDependency[T](rdd) {

  @transient
  val partitions: Array[Partition] = rdd.partitions
    .filter(s => partitionFilterFunc(s.index)).zipWithIndex
    .map { case (split, idx) => new PartitionPruningRDDPartition(idx, split) : Partition }

  override def getParents(partitionId: Int): List[Int] = {
    List(partitions(partitionId).asInstanceOf[PartitionPruningRDDPartition].parentSplit.index)
  }
}

Focus on the partitions field. First the parent's partitions are run through filter; the filter function is supplied when the object is created. zipWithIndex then pairs each surviving partition with a new consecutive index, producing an Array[(Partition, Int)] of the form:

Array((partition, 0), (partition, 1), ...)

Finally, the map wraps each pair in a PartitionPruningRDDPartition object.
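
A runnable version of the index-0-or-2 example (a sketch: PartitionPruningRDD.create is the public factory in org.apache.spark.rdd; the rest of the setup is my own):

import org.apache.spark.{NarrowDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.PartitionPruningRDD

val sc = new SparkContext(new SparkConf().setAppName("prune-dep").setMaster("local[2]"))
val parent = sc.parallelize(1 to 12, 4)  // parent partitions 0..3
// Keep only the parent partitions whose index is 0 or 2.
val pruned = PartitionPruningRDD.create(parent, idx => idx == 0 || idx == 2)

assert(pruned.partitions.length == 2)
// PruneDependency is private[spark], so inspect it through NarrowDependency.
val dep = pruned.dependencies.head.asInstanceOf[NarrowDependency[Int]]
assert(dep.getParents(0) == List(0))  // new partition 0 -> parent partition 0
assert(dep.getParents(1) == List(2))  // new partition 1 -> parent partition 2
sc.stop()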

Wide Dependency Class

In a wide dependency, the data of a single partition in the parent RDD is depended on by multiple child partitions, and that data is split across them.

The key point for ShuffleDependency is that the data of a dependent parent partition is cut into pieces, and each piece is depended on by a different child partition. In a full (narrow) dependency, even when a parent partition is depended on by multiple child partitions, each of them depends on the entire data of that parent partition, not on a shard of it. Spark defines only two kinds of full dependency, 1:1 and N:1, and discards N:N, because that case would be hard to distinguish from a ShuffleDependency.
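
A practical way to see the difference (a sketch; the exact output varies by Spark version): a shuffle dependency introduces a stage boundary, which toDebugString shows as a change of indentation in the lineage dump:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wide-dep").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val reduced = pairs.reduceByKey(_ + _)  // introduces a ShuffleDependency

// Narrow dependencies stay inside one stage; the shuffle starts a new one.
println(reduced.toDebugString)
sc.stop()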

ShuffleDependency

@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  val shuffleId: Int = _rdd.context.newShuffleId()

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}

The RDD in a ShuffleDependency must hold data in the form of (K, V) key-value pairs.

The members, briefly: the rdd method casts the RDD to an RDD of (K, V) pairs. keyClassName is the runtime class name of K, and valueClassName that of V. combinerClassName is the class name of the combiner C used to merge the shuffled data. shuffleId identifies each shuffle; every shuffle process gets its own ID. shuffleHandle is the handle obtained by registering this shuffle with the ShuffleManager; it is used when the shuffle is actually performed.

The last statement registers the shuffle with the ContextCleaner so that its state can be cleaned up later. There is a ShuffleManager involved here that deserves a closer look in a follow-up post.
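
To poke at these members directly (a sketch, assuming the same local setup as earlier; reduceByKey combines on the map side, so mapSideCombine should be true):

import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-dep").setMaster("local[2]"))
val reduced = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2).reduceByKey(_ + _)

val dep = reduced.dependencies.head
  .asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.shuffleId)       // the ID assigned to this shuffle
println(dep.partitioner)     // e.g. HashPartitioner(2)
println(dep.mapSideCombine)  // true for reduceByKey
sc.stop()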
