The Conversion of RDDs
Spark builds dependencies between RDDs from the transformations and actions in the user-submitted computation logic, and this compute chain forms a logical DAG. Next, we take Word Count as an example to describe in detail how this DAG is built.
The Scala version of the Word Count program for Spark is as follows:
1: val file = spark.textFile("hdfs://...")
2: val counts = file.flatMap(line => line.split(" "))
3:                  .map(word => (word, 1))
4:                  .reduceByKey(_ + _)
5: counts.saveAsTextFile("hdfs://...")
Both file and counts are RDDs: file is an RDD created by reading a file from HDFS, and counts is derived from file through the three RDD transformations flatMap, map, and reduceByKey. Finally, counts invokes the action saveAsTextFile, and it is at this point that the user's computation logic is submitted to the cluster for execution. So what do these five lines of code actually do?
1) Line 1: spark is an instance of org.apache.spark.SparkContext, the interface between the user program and Spark. Spark is responsible for connecting to the cluster manager, requesting compute resources according to the user's settings or the system defaults, creating RDDs, and so on.
spark.textFile("hdfs://...") creates an org.apache.spark.rdd.HadoopRDD and then performs one RDD transformation: a map to an org.apache.spark.rdd.MapPartitionsRDD.
That is, file is actually a MapPartitionsRDD that holds the contents of all the lines of the file.
2) Line 2: Splits the contents of every line in file into words, and merges the per-line word lists into a single list. The result, an RDD whose elements are individual words, is again stored in a MapPartitionsRDD.
3) Line 3: The MapPartitionsRDD produced in step 2 is passed through map again, turning each word into a tuple (word, 1). These tuples end up in yet another MapPartitionsRDD.
4) Line 4: First generates a MapPartitionsRDD that acts as the map-side combiner; then generates a ShuffledRDD, which reads the output of the previous RDD and marks the start of the reducer; finally, it also generates a MapPartitionsRDD that performs the reduce on the reducer side (the combineByKey sketch after this list makes these three roles explicit).
5) Line 5: First generates a MapPartitionsRDD; this RDD is then written out to HDFS by org.apache.spark.rdd.PairRDDFunctions#saveAsHadoopDataset. Finally, org.apache.spark.SparkContext#runJob is called to submit this compute job to the cluster. A short sketch that inspects the RDDs built by these five lines follows this list.
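The RDD chain described in steps 1) through 5) can be observed directly from the driver. Below is a minimal sketch, assuming a local run and keeping the placeholder HDFS path from the example; the exact class names printed can vary between Spark versions, but getClass and toDebugString show the concrete RDDs and the lineage Spark records as the logical DAG.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLineage {
  def main(args: Array[String]): Unit = {
    // local[*] is used only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("WordCountLineage").setMaster("local[*]")
    val spark = new SparkContext(conf)

    val file = spark.textFile("hdfs://...")             // MapPartitionsRDD built on top of a HadoopRDD
    val counts = file.flatMap(line => line.split(" "))  // MapPartitionsRDD of words
      .map(word => (word, 1))                           // MapPartitionsRDD of (word, 1) tuples
      .reduceByKey(_ + _)                               // introduces a ShuffledRDD

    println(file.getClass.getSimpleName)    // concrete RDD class behind `file`
    println(counts.getClass.getSimpleName)  // concrete RDD class behind `counts`
    println(counts.toDebugString)           // the recorded lineage, i.e. the logical DAG

    spark.stop()
  }
}

Because no action is called on counts here, nothing is actually computed; the printed lineage is exactly the dependency information discussed next.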
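Step 4) (map-side combine, shuffle, reduce-side merge) can also be spelled out with the public combineByKey API, of which reduceByKey(_ + _) is a special case. This is an illustrative equivalent, not the internal implementation; words stands for the (word, 1) RDD built in step 3).

// words stands for the RDD of (word, 1) tuples from step 3).
val words = file.flatMap(line => line.split(" ")).map(word => (word, 1))

val countsViaCombine = words.combineByKey[Int](
  (v: Int) => v,                 // createCombiner: first value seen for a key in a partition
  (c: Int, v: Int) => c + v,     // mergeValue: map-side combine within a partition
  (c1: Int, c2: Int) => c1 + c2  // mergeCombiners: merge after the shuffle, on the reduce side
)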
The relationship between RDDs can be understood along two dimensions: first, which RDD(s) an RDD is converted from, that is, what its parent RDD(s) are; second, which partition(s) of the parent RDD(s) it depends on. This relationship is the dependency between RDDs, org.apache.spark.Dependency. Depending on how the partitions of the parent RDD(s) are used, Spark divides these dependencies into two types: wide dependencies and narrow dependencies.
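As a small illustration (a sketch assuming an existing SparkContext named spark), the dependencies field of an RDD shows which kind of org.apache.spark.Dependency was recorded:

// Assumes an existing SparkContext named spark.
val pairs   = spark.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val mapped  = pairs.mapValues(_ + 1)    // narrow: each partition depends on one parent partition
val reduced = pairs.reduceByKey(_ + _)  // wide: a partition may depend on many parent partitions

println(mapped.dependencies.map(_.getClass.getSimpleName))   // e.g. List(OneToOneDependency)
println(reduced.dependencies.map(_.getClass.getSimpleName))  // e.g. List(ShuffleDependency)

OneToOneDependency (and RangeDependency) are narrow dependencies, while ShuffleDependency is the wide dependency that forces a shuffle between stages.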