Apache Spark RDD Conversion

Source: Internet
Author: User
Tags: spark, rdd

The Conversion of RDDs

Spark builds dependencies between RDDs based on the transformations and actions in the user-submitted computation logic, and this chain of computations forms a logical DAG. The following takes Word Count as an example to describe in detail how this DAG is built.

The Scala version of the Spark Word Count program is as follows:

1: val file = spark.textFile("hdfs://...")
2: val counts = file.flatMap(line => line.split(" "))
3:              .map(word => (word, 1))
4:              .reduceByKey(_ + _)
5: counts.saveAsTextFile("hdfs://...")

Both file and counts are RDDs: file is created by reading a file from HDFS, and counts is derived from file through three RDD transformations, flatMap, map, and reduceByKey. Finally, counts invokes the action saveAsTextFile, and it is at this point that the user's computation logic is submitted to the cluster for execution. So what actually happens behind these five lines of code?

1) Line 1: spark is an instance of org.apache.spark.SparkContext, the interface between the user program and Spark. SparkContext is responsible for connecting to the cluster manager, requesting compute resources according to user settings or system defaults, creating RDDs, and so on.

spark.textFile("hdfs://...") creates an org.apache.spark.rdd.HadoopRDD and then completes one RDD transformation: a map to an org.apache.spark.rdd.MapPartitionsRDD.

That is, file is actually a MapPartitionsRDD, which holds the contents of all the lines of the file.
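A quick way to confirm this two-RDD lineage is to inspect the RDD returned by textFile. The following is a minimal sketch (not from the original text); it assumes a local SparkContext, and the application name and input path are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object TextFileLineage {
  def main(args: Array[String]): Unit = {
    // Local SparkContext; the app name and input path below are placeholders.
    val sc = new SparkContext(new SparkConf().setAppName("textfile-lineage").setMaster("local[*]"))
    val file = sc.textFile("file:///tmp/input.txt")

    // The RDD returned by textFile is a MapPartitionsRDD ...
    println(file.getClass.getSimpleName)                        // MapPartitionsRDD
    // ... whose single parent is the HadoopRDD that actually reads the input splits.
    println(file.dependencies.head.rdd.getClass.getSimpleName)  // HadoopRDD

    sc.stop()
  }
}

Neither println forces the file to be read; the lineage exists before any job runs.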

2) Line 2: Splits the content of each line in file into a list of words, then merges these per-line word lists into a single list. The result, a list whose elements are individual words, is held in a new MapPartitionsRDD (a small sketch of this step and the next appears after step 3).

3) Line 3: The MapPartitionsRDD produced in step 2 goes through map once more, turning each word into a tuple (word, 1). These tuples end up in another MapPartitionsRDD.
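To make steps 2 and 3 concrete, here is a small illustrative sketch. It uses parallelize on two made-up lines instead of an HDFS file, so it runs without any input data; all names are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object FlatMapMapSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("flatmap-map-sketch").setMaster("local[*]"))

    // Stand-in for `file`: two lines of text.
    val lines = sc.parallelize(Seq("to be or", "not to be"))

    // Step 2 (flatMap): each line becomes a list of words, and the per-line lists are flattened.
    val words = lines.flatMap(line => line.split(" "))
    println(words.collect().mkString(", "))   // to, be, or, not, to, be

    // Step 3 (map): each word becomes a (word, 1) tuple.
    val pairs = words.map(word => (word, 1))
    println(pairs.collect().mkString(", "))   // (to,1), (be,1), (or,1), (not,1), (to,1), (be,1)

    sc.stop()
  }
}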

4) Line 4: First generates a MapPartitionsRDD that acts as the map-side combiner, then generates a ShuffledRDD that reads data from the output of the previous RDD and marks the start of the reduce side; it also generates a MapPartitionsRDD that performs the reduce-side aggregation.
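The chain of RDDs behind reduceByKey can be observed with toDebugString, which prints an RDD's lineage. The sketch below again uses parallelize as stand-in input. Note that the structure described above (combiner MapPartitionsRDD, ShuffledRDD, reduce-side MapPartitionsRDD) reflects older Spark versions; in newer versions the map-side combine and the reduce-side merge are attached to the ShuffledRDD itself, so the printed chain may be shorter.

import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyLineage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reducebykey-lineage").setMaster("local[*]"))

    val counts = sc.parallelize(Seq("to be or", "not to be"))
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the parent chain; the indentation step in the output marks where the shuffle happens.
    println(counts.toDebugString)

    println(counts.collect().toMap)   // Map(to -> 2, be -> 2, or -> 1, not -> 1)

    sc.stop()
  }
}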

5) Line 5: First generates a MapPartitionsRDD; this RDD is then written to HDFS by org.apache.spark.rdd.PairRDDFunctions#saveAsHadoopDataset. Finally, org.apache.spark.SparkContext#runJob is called to submit this computation job to the cluster.
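The relationship between the action and runJob can be sketched as follows (illustrative only; the output path is a placeholder, and runJob is normally invoked for you inside the action rather than called by hand):

import org.apache.spark.{SparkConf, SparkContext}

object ActionSubmitsJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("action-submits-job").setMaster("local[*]"))

    val counts = sc.parallelize(Seq("to be or", "not to be"))
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Up to this point everything is a lazy transformation; no job has run yet.

    // saveAsTextFile is an action: internally it goes through the Hadoop save path described above
    // and ends up in SparkContext#runJob. The placeholder output directory must not already exist.
    counts.saveAsTextFile("file:///tmp/wordcount-output")

    // runJob can also be called directly, e.g. to count the elements in each partition.
    val perPartition: Array[Int] = sc.runJob(counts, (it: Iterator[(String, Int)]) => it.size)
    println(perPartition.mkString(", "))

    sc.stop()
  }
}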

The relationship between RDDs can be understood from two dimensions: which RDD(s) an RDD is converted from, that is, its parent RDD(s); and which partition(s) of the parent RDD(s) it depends on. This relationship is the dependency between RDDs, org.apache.spark.Dependency. Depending on how the partitions of the parent RDD(s) are used, Spark divides dependencies into two types: wide dependencies and narrow dependencies.
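As a rough sketch of how these two kinds of dependency show up in the API, an RDD's dependencies can be inspected directly. The class names below are from org.apache.spark; the data is made up.

import org.apache.spark.{Dependency, NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyKinds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dependency-kinds").setMaster("local[*]"))

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val mapped  = pairs.map { case (k, v) => (k, v * 2) }  // narrow: each partition depends on one parent partition
    val reduced = pairs.reduceByKey(_ + _)                 // wide: needs data from many parent partitions (shuffle)

    def kind(dep: Dependency[_]): String = dep match {
      case _: ShuffleDependency[_, _, _] => "wide (ShuffleDependency)"
      case _: NarrowDependency[_]        => "narrow (" + dep.getClass.getSimpleName + ")"
      case other                         => other.getClass.getSimpleName
    }

    println(mapped.dependencies.map(kind).mkString(", "))   // narrow (OneToOneDependency)
    println(reduced.dependencies.map(kind).mkString(", "))  // wide (ShuffleDependency)

    sc.stop()
  }
}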
