The Conversion of RDDs
Spark builds dependencies between RDDs from the transformations and actions in the user-submitted computation logic, and this compute chain forms a logical DAG. Next, we take Word Count as an example to describe in detail how this DAG is built.
The Scala version of the Word Count program for Spark is as follows:
1: val file = spark.textFile("hdfs://...")
2: val counts = file.flatMap(line => line.split(" "))
3:                  .map(word => (word, 1))
4:                  .reduceByKey(_ + _)
5: counts.saveAsTextFile("hdfs://...")
Both file and counts are RDDs: file is an RDD created by reading a file from HDFS, and counts is derived from file through the three RDD transformations flatMap, map, and reduceByKey. Finally, counts invokes the action saveAsTextFile, and it is at this point that the user's computation logic is submitted to the cluster for execution. So what do these five lines of code actually do?
1) Line 1: spark is an instance of org.apache.spark.SparkContext, the interface between the user program and Spark. Spark is responsible for connecting to the cluster manager, requesting compute resources according to the user's settings or the system defaults, creating RDDs, and so on.
spark.textFile("hdfs://...") creates an org.apache.spark.rdd.HadoopRDD and then performs one RDD transformation: a map to an org.apache.spark.rdd.MapPartitionsRDD.
That is, file is actually a MapPartitionsRDD that holds the contents of all the lines of the file.
2) Line 2: Splits the contents of every line in file into words, and merges the per-line word lists into a single list. The result, an RDD whose elements are individual words, is again stored in a MapPartitionsRDD.
3) Line 3: The MapPartitionsRDD produced in step 2 is passed through map again, turning each word into a tuple (word, 1). These tuples end up in yet another MapPartitionsRDD.
4) Line 4: First generates a MapPartitionsRDD that acts as the map-side combiner; then generates a ShuffledRDD, which reads the output of the previous RDD and marks the start of the reducer; finally, it also generates a MapPartitionsRDD that performs the reduce on the reducer side (the combineByKey sketch after this list makes these three roles explicit).
5) Line 5: First generates a MapPartitionsRDD; this RDD is then written out to HDFS by org.apache.spark.rdd.PairRDDFunctions#saveAsHadoopDataset. Finally, org.apache.spark.SparkContext#runJob is called to submit this compute job to the cluster. A short sketch that inspects the RDDs built by these five lines follows this list.
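The RDD chain described in steps 1) through 5) can be observed directly from the driver. Below is a minimal sketch, assuming a local run and keeping the placeholder HDFS path from the example; the exact class names printed can vary between Spark versions, but getClass and toDebugString show the concrete RDDs and the lineage Spark records as the logical DAG.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLineage {
  def main(args: Array[String]): Unit = {
    // local[*] is used only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("WordCountLineage").setMaster("local[*]")
    val spark = new SparkContext(conf)

    val file = spark.textFile("hdfs://...")             // MapPartitionsRDD built on top of a HadoopRDD
    val counts = file.flatMap(line => line.split(" "))  // MapPartitionsRDD of words
      .map(word => (word, 1))                           // MapPartitionsRDD of (word, 1) tuples
      .reduceByKey(_ + _)                               // introduces a ShuffledRDD

    println(file.getClass.getSimpleName)    // concrete RDD class behind `file`
    println(counts.getClass.getSimpleName)  // concrete RDD class behind `counts`
    println(counts.toDebugString)           // the recorded lineage, i.e. the logical DAG

    spark.stop()
  }
}

Because no action is called on counts here, nothing is actually computed; the printed lineage is exactly the dependency information discussed next.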
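Step 4) (map-side combine, shuffle, reduce-side merge) can also be spelled out with the public combineByKey API, of which reduceByKey(_ + _) is a special case. This is an illustrative equivalent, not the internal implementation; words stands for the (word, 1) RDD built in step 3).

// words stands for the RDD of (word, 1) tuples from step 3).
val words = file.flatMap(line => line.split(" ")).map(word => (word, 1))

val countsViaCombine = words.combineByKey[Int](
  (v: Int) => v,                 // createCombiner: first value seen for a key in a partition
  (c: Int, v: Int) => c + v,     // mergeValue: map-side combine within a partition
  (c1: Int, c2: Int) => c1 + c2  // mergeCombiners: merge after the shuffle, on the reduce side
)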
The relationship between RDDs can be understood along two dimensions: first, which RDD(s) an RDD is converted from, that is, what its parent RDD(s) are; second, which partition(s) of the parent RDD(s) it depends on. This relationship is the dependency between RDDs, org.apache.spark.Dependency. Depending on how the partitions of the parent RDD(s) are used, Spark divides these dependencies into two types: wide dependencies and narrow dependencies.
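As a small illustration (a sketch assuming an existing SparkContext named spark), the dependencies field of an RDD shows which kind of org.apache.spark.Dependency was recorded:

// Assumes an existing SparkContext named spark.
val pairs   = spark.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val mapped  = pairs.mapValues(_ + 1)    // narrow: each partition depends on one parent partition
val reduced = pairs.reduceByKey(_ + _)  // wide: a partition may depend on many parent partitions

println(mapped.dependencies.map(_.getClass.getSimpleName))   // e.g. List(OneToOneDependency)
println(reduced.dependencies.map(_.getClass.getSimpleName))  // e.g. List(ShuffleDependency)

OneToOneDependency (and RangeDependency) are narrow dependencies, while ShuffleDependency is the wide dependency that forces a shuffle between stages.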