The conversion of RDDs and the generation of the DAG
Spark generates dependencies between RDDs based on the transformations and actions in the user-submitted computation logic, and this compute chain forms a logical DAG. The following takes Word Count as an example to describe in detail how this DAG is built.
The Scala version of the Word Count program is as follows:
1: val file = spark.textFile("hdfs://...")
2: val counts = file.flatMap(line => line.split(" "))
3:                  .map(word => (word, 1))
4:                  .reduceByKey(_ + _)
5: counts.saveAsTextFile("hdfs://...")
Both file and counts are RDDs. file is created by reading a file from HDFS, and counts is produced from file through the three RDD transformations flatMap, map, and reduceByKey. Finally, counts invokes the action saveAsTextFile, which is the point at which the user's computation logic is submitted to the cluster for execution. So what actually happens in these five lines of code?
1) Line 1: spark is an instance of org.apache.spark.SparkContext and is the interface between the user program and Spark. SparkContext is responsible for connecting to the cluster manager, requesting compute resources according to user settings or system defaults, creating RDDs, and so on.
spark.textFile("hdfs://...") creates an org.apache.spark.rdd.HadoopRDD and then performs one RDD transformation: a map to an org.apache.spark.rdd.MapPartitionsRDD.
That is, file is actually a MapPartitionsRDD, which holds the contents of all the lines of the file.
2) Line 2: Splits the content of every line in file into a list of words, then merges these per-line word lists into a single list. The result, one element per word, is stored in a MapPartitionsRDD.
3) Line 3: The MapPartitionsRDD produced in step 2 goes through another map that turns each word into a tuple (word, 1). These tuples end up in a new MapPartitionsRDD.
4) Line 4: First a MapPartitionsRDD is generated to act as the map-side combiner; then a ShuffledRDD is generated, which reads the output of the previous RDD and marks the start of the reduce side; finally another MapPartitionsRDD is generated to perform the reduce-side aggregation.
5) Line 5: First a MapPartitionsRDD is generated; this RDD is then written out to HDFS via org.apache.spark.rdd.PairRDDFunctions#saveAsHadoopDataset. Finally, org.apache.spark.SparkContext#runJob is called to submit this compute job to the cluster.
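To see this chain of RDDs concretely, the lineage of counts can be printed with RDD.toDebugString. The following is a minimal, self-contained sketch; the local master and the placeholder input path are assumptions for illustration only:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLineage {
  def main(args: Array[String]): Unit = {
    // A local SparkContext, standing in for the "spark" object used above
    val spark = new SparkContext(new SparkConf().setAppName("wc").setMaster("local[*]"))
    val file = spark.textFile("hdfs://...")             // HadoopRDD -> MapPartitionsRDD
    val counts = file.flatMap(line => line.split(" "))  // MapPartitionsRDD
                     .map(word => (word, 1))            // MapPartitionsRDD
                     .reduceByKey(_ + _)                // ShuffledRDD plus map-side combine
    // toDebugString prints the recursive lineage, one line per RDD in the chain
    println(counts.toDebugString)
    spark.stop()
  }
}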
The relationship between RDDs can be understood along two dimensions: which RDD(s) an RDD is transformed from, i.e. what its parent RDD(s) are, and which partition(s) of the parent RDD(s) it depends on. This relationship is the dependency between RDDs, org.apache.spark.Dependency. Depending on how the partitions of the parent RDD(s) are used, Spark divides dependencies into two types: wide dependencies and narrow dependencies.
The dependency relationships of RDDs
The relationship between an RDD and the parent RDD(s) it depends on comes in two different types: narrow dependencies and wide dependencies.
1) A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD, as shown in Figure 1.
2) A wide dependency means that multiple partitions of the child RDD depend on the same partition of the parent RDD, as shown in Figure 2.
Figure 1 The narrow dependency of the RDD
Figure 2 The wide dependency of the RDD
The difference between narrow and wide dependencies can be understood further by looking at different kinds of transformations, as shown in Figure 3.
For map- and filter-style transformations, the partition data is simply converted according to the transformation rule, with no other processing involved; they can be thought of as just converting data from one form to another. For union, multiple RDDs are merely merged into one, the partition(s) of the parent RDDs are unchanged, and it can be seen as simply copying and concatenating the parent partitions. For join, if each partition only joins with a known, specific partition of the other parent, then the dependency is also narrow; such a co-partitioned join does not introduce an expensive shuffle. With narrow dependencies, because each partition of the child RDD depends on a fixed number of partitions of the parent RDD(s), each partition can be handled by a single compute task, and since these partitions are independent of each other, those tasks can be executed in parallel.
For groupByKey, every partition of the child RDD depends on all partitions of the parent RDD, because each child partition is the result of shuffling all of the parent's partitions; therefore these two RDDs cannot be handled by a single compute task. Similarly, a join that requires all partitions of both parents also needs a shuffle, and the dependency of such a join is a wide dependency rather than the narrow dependency described above.
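As an illustration (not part of the original example), the type of dependency can be inspected at runtime through RDD.dependencies: a narrow transformation such as mapValues yields a NarrowDependency, while groupByKey yields a ShuffleDependency. A minimal sketch, assuming an existing SparkContext named sc:

import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD

def describeDependencies(rdd: RDD[_]): Unit =
  rdd.dependencies.foreach {
    case s: ShuffleDependency[_, _, _] => println(s"wide (shuffle) dependency on ${s.rdd}")
    case n: NarrowDependency[_]        => println(s"narrow dependency on ${n.rdd}")
  }

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
describeDependencies(pairs.mapValues(_ + 1))   // narrow: OneToOneDependency
describeDependencies(pairs.groupByKey())       // wide: ShuffleDependency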
All dependencies are implementations of Dependency[T]:
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
Here rdd is the parent RDD that is depended upon.
The narrow dependency is implemented as:
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  // Returns all the partition(s) of the parent RDD that the given partitionId
  // of the child RDD depends on
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}
There are currently two concrete implementations of narrow dependency. One is the one-to-one dependency, OneToOneDependency:
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int) = List(partitionId)
}
From the getParents implementation it is easy to see that each partition of the child RDD depends only on the partition of the parent RDD with the same ID.
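As a small illustration, the dependency can be constructed by hand just to show what getParents returns; in real programs Spark creates it internally when you call map, filter, and so on. Assuming an existing SparkContext named sc:

import org.apache.spark.OneToOneDependency

val parent = sc.parallelize(1 to 100, 4)   // an RDD with 4 partitions
val dep    = new OneToOneDependency(parent)
println(dep.getParents(2))                 // List(2): child partition 2 depends on parent partition 2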
The other is the range dependency, RangeDependency, which is used only by org.apache.spark.rdd.UnionRDD. UnionRDD combines multiple RDDs into one by concatenating them: the relative order of each parent RDD's partitions does not change; what differs is only the starting position of each parent RDD's partitions within the UnionRDD. Its getParents is as follows:
override def getParents(partitionId: Int) = {
  if (partitionId >= outStart && partitionId < outStart + length) {
    List(partitionId - outStart + inStart)
  } else {
    Nil
  }
}
Here inStart is the starting position of these partitions within the parent RDD, outStart is their starting position within the UnionRDD, and length is the number of partitions in the parent RDD.
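As a worked illustration (again constructing the dependency by hand; UnionRDD builds these internally), suppose rdd1 has 3 partitions and rdd2 has 4, so their union has 7 partitions. For rdd2, inStart is 0, outStart is 3 and length is 4. Assuming an existing SparkContext named sc:

import org.apache.spark.RangeDependency

val rdd1 = sc.parallelize(1 to 30, 3)    // 3 partitions
val rdd2 = sc.parallelize(31 to 70, 4)   // 4 partitions
val unioned = rdd1.union(rdd2)           // UnionRDD with 3 + 4 = 7 partitions

// The dependency UnionRDD would build for rdd2: inStart = 0, outStart = 3, length = 4
val depForRdd2 = new RangeDependency(rdd2, 0, 3, 4)
println(depForRdd2.getParents(5))   // List(2): union partition 5 maps to rdd2's partition 2
println(depForRdd2.getParents(1))   // Nil: union partition 1 comes from rdd1, not rdd2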
There is only one implementation of wide dependency: ShuffleDependency. The child RDD depends on all partitions of the parent RDD and therefore requires a shuffle:
class ShuffleDependency[K, V, C](
    @transient _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Option[Serializer] = None,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  // Obtain a new shuffleId
  val shuffleId: Int = _rdd.context.newShuffleId()

  // Register this shuffle with the ShuffleManager
  val shuffleHandle: ShuffleHandle =
    _rdd.context.env.shuffleManager.registerShuffle(
      shuffleId, _rdd.partitions.size, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
Wide dependencies support two shuffle managers, org.apache.spark.shuffle.hash.HashShuffleManager (the hash-based shuffle mechanism) and org.apache.spark.shuffle.sort.SortShuffleManager (the sort-based shuffle mechanism).
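For completeness, which shuffle manager is used can be chosen through configuration. A minimal sketch for the Spark 1.x series, where the spark.shuffle.manager setting applies:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-manager-demo")
  .setMaster("local[*]")
  .set("spark.shuffle.manager", "sort")   // "sort" -> SortShuffleManager, "hash" -> HashShuffleManager
val sc = new SparkContext(conf)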
Generation of the DAG
The original RDD(s) form a DAG through a series of transformations. The dependencies between RDDs, namely which parent RDD(s) an RDD is transformed from and which partitions of the parent RDD(s) it depends on, are an important attribute of the DAG. With these dependencies, the RDDs in the DAG form a lineage. Lineage guarantees that before an RDD is computed, all of the parent RDDs it depends on have already been computed; it also provides the RDD's fault tolerance: if some or all of an RDD's computed results are lost, the missing data can be recomputed from its parents.
So how does Spark generate compute tasks from a DAG? First, the DAG is divided into different stages according to the dependencies. For narrow dependencies, because the partition dependencies are deterministic, the partition transformations can be handled in the same thread, so Spark places narrow dependencies in the same stage. For wide dependencies, because of the shuffle, the next computation can begin only after the shuffle of the parent RDD(s) has completed, so wide dependencies are the basis on which Spark divides stages: Spark splits the DAG into different stages at each wide dependency. Within a stage, each partition is assigned a compute task (task), and these tasks can be executed in parallel. The stages themselves form a coarser-grained DAG derived from the dependencies, and this DAG is executed from front to back. In other words, a stage can be executed only if it has no parent stage or all of its parent stages have finished executing.
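The stage-splitting rule can be illustrated with a simplified sketch (this is not Spark's actual DAGScheduler code, only the idea): starting from the final RDD, walk the dependencies backwards; narrow dependencies stay inside the current stage, while every ShuffleDependency marks the boundary of a parent stage.

import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Returns the RDDs that sit on the far side of a shuffle boundary of the stage
// containing `rdd`; each of them is the final RDD of a parent stage.
def parentStageBoundaries(rdd: RDD[_]): Seq[RDD[_]] = {
  val boundaries = mutable.ArrayBuffer[RDD[_]]()
  val visited    = mutable.Set[RDD[_]]()
  def visit(r: RDD[_]): Unit = if (visited.add(r)) {
    r.dependencies.foreach {
      case s: ShuffleDependency[_, _, _] => boundaries += s.rdd   // cut here: new parent stage
      case n: NarrowDependency[_]        => visit(n.rdd)          // same stage, keep walking
    }
  }
  visit(rdd)
  boundaries.toSeq
}

// For the word-count example there is exactly one shuffle (reduceByKey),
// so counts has one parent-stage boundary and the job runs as two stages:
// println(parentStageBoundaries(counts).size)   // 1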