"Spark" Rdd operation detailed 1--transformation and actions overview


The role of Spark operators

This section describes how Spark transforms RDDs through operators during a run. Operators are functions defined on RDDs that transform and manipulate the data inside them.

    1. Input: While a Spark program runs, data enters Spark from external data space (for example, distributed storage: textFile reads from HDFS, or the parallelize method takes a Scala collection or local data). The data enters Spark's runtime data space and is turned into blocks in Spark, managed by the BlockManager.
    2. Operation: Once the input data forms an RDD, it can be manipulated with transformation operators such as filter, which turn one RDD into a new RDD; action operators then trigger Spark to submit a job. If the data needs to be reused, it can be cached in memory with the cache operator.
    3. Output: When the program finishes, data leaves Spark's runtime space and is written to distributed storage (e.g. saveAsTextFile writes to HDFS) or returned as Scala data or collections (collect returns a Scala collection; count returns a Scala Long).
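A minimal end-to-end sketch of this input/operation/output flow, assuming Spark's Scala API with an existing SparkContext named sc; the HDFS paths are hypothetical:

    // Input: read lines from HDFS into an RDD; its blocks are managed by the BlockManager.
    val lines = sc.textFile("hdfs:///data/input.txt")      // hypothetical path

    // Operation: transformations build new RDDs lazily, without running anything yet.
    val errors = lines.filter(_.contains("ERROR"))
    errors.cache()                                         // keep the RDD in memory for reuse

    // Actions trigger job submission and return results to the driver.
    val n = errors.count()                                 // returns a Scala Long

    // Output: write the result back to distributed storage.
    errors.saveAsTextFile("hdfs:///data/errors")           // hypothetical path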

Spark's core data model is the RDD. The RDD itself is an abstract class, implemented by subclasses such as MappedRDD and ShuffledRDD; Spark translates common big-data operations into such RDD subclasses.
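A small sketch that makes these subclasses visible, assuming a SparkContext named sc; the exact class names printed depend on the Spark version (e.g. MappedRDD in older releases, MapPartitionsRDD in newer ones):

    val rdd    = sc.parallelize(1 to 10)    // backed by a ParallelCollectionRDD
    val mapped = rdd.map(_ * 2)             // map produces a new RDD subclass

    // toDebugString prints the lineage, one concrete RDD subclass per operator.
    println(mapped.toDebugString)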

Transformations and Actions overview

Transformations in detail
  • map(func): Returns a new distributed dataset formed by passing each element of the source through the function func.
  • filter(func): Returns a new dataset consisting of the source elements for which func returns true.
  • flatMap(func): Similar to map, but each input element is mapped to 0 or more output elements (so func returns a Seq rather than a single element).
  • sample(withReplacement, frac, seed): Randomly samples a fraction frac of the data, with or without replacement, using the given random seed.
  • union(otherDataset): Returns a new dataset formed by the union of the source dataset and the argument.
  • groupByKey([numTasks]): Called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default 8 parallel tasks are used for grouping; you can pass the optional numTasks parameter to set a different number of tasks depending on the amount of data.
  • reduceByKey(func, [numTasks]): Called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which all values for the same key are aggregated with the given reduce function. As with groupByKey, the number of tasks is configurable through an optional second parameter.
  • join(otherDataset, [numTasks]): Called on datasets of types (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
  • groupWith(otherDataset, [numTasks]): Called on datasets of types (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is called cogroup.
  • cartesian(otherDataset): Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs, i.e. all combinations of elements. (A combined usage sketch of these transformations follows the list.)
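A minimal combined sketch of the transformations listed above, assuming a SparkContext named sc; the sample data is made up:

    val nums  = sc.parallelize(Seq(1, 2, 3, 4))
    val lines = sc.parallelize(Seq("a b", "c"))

    val doubled = nums.map(_ * 2)                        // map
    val evens   = nums.filter(_ % 2 == 0)                // filter
    val words   = lines.flatMap(_.split(" "))            // flatMap: one line -> many words
    val sampled = nums.sample(false, 0.5, 42)            // sample(withReplacement, frac, seed)
    val more    = nums.union(sc.parallelize(Seq(5, 6)))  // union

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    val grouped = pairs.groupByKey()                     // (K, V) -> (K, Seq[V])
    val summed  = pairs.reduceByKey(_ + _)               // values per key reduced with func
    val joined  = pairs.join(other)                      // (K, (V, W)) for matching keys
    val cogrp   = pairs.groupWith(other)                 // a.k.a. cogroup
    val cross   = nums.cartesian(lines)                  // all (T, U) pairs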
Actions in detail
    • reduce(func): Aggregates all elements of the dataset using the function func, which takes two arguments and returns one value. The function must be associative so that it can be computed correctly in parallel.
    • collect(): Returns all elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation has produced a sufficiently small subset of the data; calling collect directly on an entire RDD can easily cause the driver program to OOM.
    • count(): Returns the number of elements in the dataset.
    • take(n): Returns an array with the first n elements of the dataset. Note that this operation is currently not executed in parallel on multiple nodes; instead, the machine where the driver program runs computes all the elements (memory pressure on that gateway machine will increase, so use with caution).
    • first(): Returns the first element of the dataset (similar to take(1)).
    • saveAsTextFile(path): Writes the elements of the dataset as a text file to the local file system, HDFS, or any other Hadoop-supported file system. Spark calls the toString method of each element to turn it into a line of text in the file.
    • saveAsSequenceFile(path): Writes the elements of the dataset as a SequenceFile to the given directory in the local file system, HDFS, or any other Hadoop-supported file system. The elements of the RDD must be key-value pairs that implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double, String, etc.).
    • foreach(func): Runs the function func on each element of the dataset. This is usually done to update an accumulator variable or to interact with an external storage system. (A combined usage sketch of these actions follows the list.)
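A minimal sketch of the actions listed above, assuming a SparkContext named sc; the output path is hypothetical, and the accumulator uses the classic sc.accumulator API of the Spark era this article describes:

    val nums = sc.parallelize(1 to 100)

    val sum   = nums.reduce(_ + _)               // func must be associative
    val first = nums.first()                     // like take(1)
    val five  = nums.take(5)                     // first 5 elements, computed at the driver
    val n     = nums.count()                     // returns a Long
    val small = nums.filter(_ < 10).collect()    // collect only small subsets to the driver

    nums.saveAsTextFile("hdfs:///out/nums")      // hypothetical path; one line per element via toString

    val acc = sc.accumulator(0)                  // accumulator updated from foreach
    nums.foreach(x => acc += 1)
    println(acc.value)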
Operator classification

Operators can be broadly divided into three types:
1. Transformation operators on Value-type data, which do not trigger job submission; the data items they process are of Value type.
2. Transformation operators on Key-Value-type data, which do not trigger job submission; the data items they process are Key-Value pairs.
3. Action operators, which trigger the SparkContext to submit a job.
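A minimal sketch of this classification in practice, assuming a SparkContext named sc: the two kinds of transformations only build the lineage, and nothing runs until the action is called:

    val data  = sc.parallelize(1 to 1000000)

    // Value-type transformation: no job submitted yet.
    val big   = data.filter(_ > 10)

    // Key-Value-type transformation: still no job submitted.
    val pairs = big.map(x => (x % 10, x)).reduceByKey(_ + _)

    // Action: this call makes the SparkContext submit a job.
    val result = pairs.collect()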

When reprinting, please credit the author, Jason Ding, and indicate the source:
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Or search Google for jasonding1354 to reach my blog homepage.

Copyright notice: This article is an original post by the blogger and may not be reproduced without the blogger's permission.

"Spark" Rdd operation detailed 1--transformation and actions overview

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.