Spark RDD Operations


Compared with MapReduce, which provides only the two operations map and reduce, Spark supports a much richer set of operations on RDDs. They are described below.

***********************************************

map(func)

Returns a new distributed dataset formed by passing each element of the source through the function func.
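
For illustration, a minimal sketch in the Scala spark-shell, where a SparkContext is already bound to sc; the sample data is invented:

    // map is one-to-one: double every element
    val nums = sc.parallelize(Seq(1, 2, 3, 4))
    nums.map(_ * 2).collect()   // Array(2, 4, 6, 8)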

***********************************************
filter(func)
Returns a new dataset formed by selecting those elements of the source on which func returns true.
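
A quick spark-shell sketch with made-up data:

    // keep only the elements for which the predicate returns true
    val nums = sc.parallelize(1 to 10)
    nums.filter(_ % 2 == 0).collect()   // Array(2, 4, 6, 8, 10)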

***********************************************

flatMap(func)
Similar to map, but each input element can be mapped to zero or more output elements (so func should return a Seq rather than a single element).
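
A sketch of the zero-to-many behaviour, again in the spark-shell with invented input:

    // each line yields zero or more words, flattened into a single RDD
    val lines = sc.parallelize(Seq("hello world", "", "spark rdd"))
    lines.flatMap(_.split(" ").filter(_.nonEmpty)).collect()
    // Array(hello, world, spark, rdd)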

***********************************************

sample(withReplacement, fraction, seed)
Randomly samples a fraction of the data, with or without replacement, using the given random seed.
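
A sketch; note that the fraction is only an expected proportion, not an exact count:

    // sample roughly half the elements without replacement; the seed fixes the randomness
    val nums = sc.parallelize(1 to 100)
    nums.sample(withReplacement = false, fraction = 0.5, seed = 42L).count()
    // around 50; the exact count depends on the seed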

***********************************************

union(otherDataset)
Returns a new dataset that contains the union of the elements of the source dataset and the argument.
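
A small sketch; note that union keeps duplicates:

    val a = sc.parallelize(Seq(1, 2, 3))
    val b = sc.parallelize(Seq(3, 4, 5))
    a.union(b).collect()   // Array(1, 2, 3, 3, 4, 5) -- duplicates are not removed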

***********************************************

groupByKey([numTasks])
Called on a dataset of (K, V) pairs; returns a dataset of (K, Seq[V]) pairs. Note: by default, 8 parallel tasks are used for grouping; you can pass the optional numTasks argument to set a different number of tasks according to the data volume.
(groupByKey combined with filter can be used to get reduce functionality similar to Hadoop's.)
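
A grouping sketch in the spark-shell (the concrete Seq type printed for the groups is an implementation detail):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    pairs.groupByKey().collect()
    // e.g. Array((a,CompactBuffer(1, 3)), (b,CompactBuffer(2)))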

***********************************************

reduceByKey(func, [numTasks])
Called on a dataset of (K, V) pairs; returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function. As with groupByKey, the number of tasks is configurable through an optional second argument.
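
The classic word-count sketch:

    // values with the same key are summed
    val words = sc.parallelize(Seq("a", "b", "a"))
    words.map(w => (w, 1)).reduceByKey(_ + _).collect()   // Array((a,2), (b,1))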

***********************************************

join(otherDataset, [numTasks])
Called on datasets of type (K, V) and (K, W); returns a dataset of (K, (V, W)) pairs containing all pairs of elements for each key.
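
A sketch of the inner-join behaviour:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("a", "y")))
    left.join(right).collect()   // Array((a,(1,x)), (a,(1,y))) -- "b" has no match and is dropped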

***********************************************

groupWith(otherDataset, [numTasks])
Called on datasets of type (K, V) and (K, W); returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is called cogroup.
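
A sketch; in current Spark the method is also available under the name cogroup:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
    left.groupWith(right).collect()
    // every key from either side appears once, paired with (possibly empty) groups:
    // Array((a,(CompactBuffer(1),CompactBuffer(x))), (b,(CompactBuffer(2),CompactBuffer())), (c,(CompactBuffer(),CompactBuffer(y))))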

***********************************************

cartesian(otherDataset)
Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs covering all combinations of elements.
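
A tiny sketch (beware: the result size is |T| x |U|):

    val t = sc.parallelize(Seq(1, 2))
    val u = sc.parallelize(Seq("a", "b"))
    t.cartesian(u).collect()   // Array((1,a), (1,b), (2,a), (2,b))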

***********************************************

sortByKey([ascendingOrder])
Called on a dataset of (K, V) pairs; returns a dataset of (K, V) pairs sorted by the key K. Ascending or descending order is determined by the boolean ascendingOrder argument.
(Comparable to the sort between the map and reduce stages of Hadoop MapReduce, which sorts by key.)
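
A sketch of both sort directions:

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    pairs.sortByKey().collect()        // ascending by default: Array((1,a), (2,b), (3,c))
    pairs.sortByKey(false).collect()   // descending: Array((3,c), (2,b), (1,a))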

***********************************************

reduce(func)
Aggregates all elements of the dataset using the function func, which takes two arguments and returns one value. The function must be associative so that it can be executed correctly in parallel.
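
A sketch with addition, which is associative, so partial results from different partitions can be combined in any order:

    val nums = sc.parallelize(1 to 100)
    nums.reduce(_ + _)   // 5050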

***********************************************

Collect ()
In the driver program, all elements of the dataset are returned as an array. This usually returns a subset of data that is small enough to be used after using filter or other operations, directly returning the entire RDD set collect, which is likely to cause the driver program to Oom
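
A sketch of the safe pattern: shrink the data before collecting it to the driver:

    val nums = sc.parallelize(1 to 1000000)
    nums.filter(_ % 100000 == 0).collect()   // only 10 elements reach the driver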

***********************************************

count()
Returns the number of elements in the dataset.
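
A one-line sketch:

    sc.parallelize(Seq("a", "b", "c")).count()   // 3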

***********************************************

take(n)
Returns an array with the first n elements of the dataset. Note that this operation is currently not executed in parallel on multiple nodes; instead, the driver program computes all the elements.
(Memory pressure on the driver/gateway machine increases, so use it with caution.)
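
A sketch; the result lands in the driver's memory:

    sc.parallelize(1 to 1000).take(3)   // Array(1, 2, 3)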

***********************************************

first()
Returns the first element of the dataset (similar to take(1)).
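
A one-line sketch:

    sc.parallelize(Seq(10, 20, 30)).first()   // 10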

***********************************************

saveAsTextFile(path)
Saves the elements of the dataset as a text file (or set of text files) in a given directory on the local filesystem, HDFS, or any other Hadoop-supported file system. Spark calls the toString method of each element to convert it to a line of text in the file.
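
A sketch; the output paths here are hypothetical, and must refer to a directory that does not already exist:

    val nums = sc.parallelize(1 to 5)
    nums.saveAsTextFile("/tmp/nums-out")   // or e.g. "hdfs:///tmp/nums-out"; hypothetical paths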

***********************************************

saveAsSequenceFile(path)
Saves the elements of the dataset as a Hadoop SequenceFile in the given directory on the local filesystem, HDFS, or any other Hadoop-supported file system. The elements of the RDD must be key-value pairs that implement Hadoop's Writable interface, or be implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double, and String).
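
A sketch with basic types, which Spark converts to Writables implicitly; the output path is hypothetical:

    val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
    pairs.saveAsSequenceFile("/tmp/pairs-out")   // hypothetical output directory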

***********************************************

foreach(func)
Runs the function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable or interacting with an external storage system.
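
An accumulator sketch; this assumes the Spark 2.x accumulator API (sc.longAccumulator):

    // sum elements via an accumulator; closures must not mutate ordinary driver-side variables
    val acc = sc.longAccumulator("sum")
    sc.parallelize(1 to 10).foreach(x => acc.add(x))
    acc.value   // 55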

***********************************************

mapToPair(func): runs func on each element of the dataset and returns a key-value pair.
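
mapToPair belongs to Spark's Java API (JavaRDD); in Scala, an ordinary map that returns a tuple yields a pair RDD, as sketched here:

    val words = sc.parallelize(Seq("a", "b", "a"))
    val pairs = words.map(w => (w, 1))   // RDD[(String, Int)], ready for reduceByKey etc.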

***********************************************
