Spark RDD Operations


Compared with MapReduce, which provides only the two operations map and reduce, Spark supports a much richer set of operations on RDDs. They are described below.

***********************************************

map(func)

Returns a new distributed dataset formed by passing each element of the source through the function func.
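
For illustration, a minimal sketch in the Scala spark-shell, where a SparkContext is already bound to sc; the sample data is invented:

    // map is one-to-one: double every element
    val nums = sc.parallelize(Seq(1, 2, 3, 4))
    nums.map(_ * 2).collect()   // Array(2, 4, 6, 8)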

***********************************************
filter(func)
Returns a new dataset formed by selecting those elements of the source on which func returns true.
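
A quick spark-shell sketch with made-up data:

    // keep only the elements for which the predicate returns true
    val nums = sc.parallelize(1 to 10)
    nums.filter(_ % 2 == 0).collect()   // Array(2, 4, 6, 8, 10)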

***********************************************

flatMap(func)
Similar to map, but each input element can be mapped to zero or more output elements (so func should return a Seq rather than a single element).
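
A sketch of the zero-to-many behaviour, again in the spark-shell with invented input:

    // each line yields zero or more words, flattened into a single RDD
    val lines = sc.parallelize(Seq("hello world", "", "spark rdd"))
    lines.flatMap(_.split(" ").filter(_.nonEmpty)).collect()
    // Array(hello, world, spark, rdd)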

***********************************************

sample(withReplacement, fraction, seed)
Randomly samples a fraction of the data, with or without replacement, using the given random seed.
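
A sketch; note that the fraction is only an expected proportion, not an exact count:

    // sample roughly half the elements without replacement; the seed fixes the randomness
    val nums = sc.parallelize(1 to 100)
    nums.sample(withReplacement = false, fraction = 0.5, seed = 42L).count()
    // around 50; the exact count depends on the seed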

***********************************************

union(otherDataset)
Returns a new dataset that contains the union of the elements of the source dataset and the argument.
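
A small sketch; note that union keeps duplicates:

    val a = sc.parallelize(Seq(1, 2, 3))
    val b = sc.parallelize(Seq(3, 4, 5))
    a.union(b).collect()   // Array(1, 2, 3, 3, 4, 5) -- duplicates are not removed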

***********************************************

groupByKey([numTasks])
Called on a dataset of (K, V) pairs; returns a dataset of (K, Seq[V]) pairs. Note: by default, 8 parallel tasks are used for grouping; you can pass the optional numTasks argument to set a different number of tasks according to the data volume.
(groupByKey combined with filter can be used to get reduce functionality similar to Hadoop's.)
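
A grouping sketch in the spark-shell (the concrete Seq type printed for the groups is an implementation detail):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    pairs.groupByKey().collect()
    // e.g. Array((a,CompactBuffer(1, 3)), (b,CompactBuffer(2)))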

***********************************************

reduceByKey(func, [numTasks])
Called on a dataset of (K, V) pairs; returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function. As with groupByKey, the number of tasks is configurable through an optional second argument.
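
The classic word-count sketch:

    // values with the same key are summed
    val words = sc.parallelize(Seq("a", "b", "a"))
    words.map(w => (w, 1)).reduceByKey(_ + _).collect()   // Array((a,2), (b,1))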

***********************************************

join(otherDataset, [numTasks])
Called on datasets of type (K, V) and (K, W); returns a dataset of (K, (V, W)) pairs containing all pairs of elements for each key.
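
A sketch of the inner-join behaviour:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("a", "y")))
    left.join(right).collect()   // Array((a,(1,x)), (a,(1,y))) -- "b" has no match and is dropped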

***********************************************

groupWith(otherDataset, [numTasks])
Called on datasets of type (K, V) and (K, W); returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is called cogroup.
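
A sketch; in current Spark the method is also available under the name cogroup:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
    left.groupWith(right).collect()
    // every key from either side appears once, paired with (possibly empty) groups:
    // Array((a,(CompactBuffer(1),CompactBuffer(x))), (b,(CompactBuffer(2),CompactBuffer())), (c,(CompactBuffer(),CompactBuffer(y))))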

***********************************************

cartesian(otherDataset)
Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs covering all combinations of elements.
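
A tiny sketch (beware: the result size is |T| x |U|):

    val t = sc.parallelize(Seq(1, 2))
    val u = sc.parallelize(Seq("a", "b"))
    t.cartesian(u).collect()   // Array((1,a), (1,b), (2,a), (2,b))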

***********************************************

sortByKey([ascendingOrder])
Called on a dataset of (K, V) pairs; returns a dataset of (K, V) pairs sorted by the key K. Ascending or descending order is determined by the boolean ascendingOrder argument.
(Comparable to the sort between the map and reduce stages of Hadoop MapReduce, which sorts by key.)
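
A sketch of both sort directions:

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    pairs.sortByKey().collect()        // ascending by default: Array((1,a), (2,b), (3,c))
    pairs.sortByKey(false).collect()   // descending: Array((3,c), (2,b), (1,a))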

***********************************************

reduce(func)
Aggregates all elements of the dataset using the function func, which takes two arguments and returns one value. The function must be associative so that it can be executed correctly in parallel.
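
A sketch with addition, which is associative, so partial results from different partitions can be combined in any order:

    val nums = sc.parallelize(1 to 100)
    nums.reduce(_ + _)   // 5050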

***********************************************

Collect ()
In the driver program, all elements of the dataset are returned as an array. This usually returns a subset of data that is small enough to be used after using filter or other operations, directly returning the entire RDD set collect, which is likely to cause the driver program to Oom
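
A sketch of the safe pattern: shrink the data before collecting it to the driver:

    val nums = sc.parallelize(1 to 1000000)
    nums.filter(_ % 100000 == 0).collect()   // only 10 elements reach the driver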

***********************************************

count()
Returns the number of elements in the dataset.
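
A one-line sketch:

    sc.parallelize(Seq("a", "b", "c")).count()   // 3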

***********************************************

take(n)
Returns an array with the first n elements of the dataset. Note that this operation is currently not executed in parallel on multiple nodes; instead, the driver program computes all the elements.
(Memory pressure on the driver/gateway machine increases, so use it with caution.)
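
A sketch; the result lands in the driver's memory:

    sc.parallelize(1 to 1000).take(3)   // Array(1, 2, 3)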

***********************************************

first()
Returns the first element of the dataset (similar to take(1)).
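
A one-line sketch:

    sc.parallelize(Seq(10, 20, 30)).first()   // 10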

***********************************************

saveAsTextFile(path)
Saves the elements of the dataset as a text file (or set of text files) in a given directory on the local filesystem, HDFS, or any other Hadoop-supported file system. Spark calls the toString method of each element to convert it to a line of text in the file.
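
A sketch; the output paths here are hypothetical, and must refer to a directory that does not already exist:

    val nums = sc.parallelize(1 to 5)
    nums.saveAsTextFile("/tmp/nums-out")   // or e.g. "hdfs:///tmp/nums-out"; hypothetical paths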

***********************************************

saveAsSequenceFile(path)
Saves the elements of the dataset as a Hadoop SequenceFile in the given directory on the local filesystem, HDFS, or any other Hadoop-supported file system. The elements of the RDD must be key-value pairs that implement Hadoop's Writable interface, or be implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double, and String).
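
A sketch with basic types, which Spark converts to Writables implicitly; the output path is hypothetical:

    val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
    pairs.saveAsSequenceFile("/tmp/pairs-out")   // hypothetical output directory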

***********************************************

foreach(func)
Runs the function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable or interacting with an external storage system.
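
An accumulator sketch; this assumes the Spark 2.x accumulator API (sc.longAccumulator):

    // sum elements via an accumulator; closures must not mutate ordinary driver-side variables
    val acc = sc.longAccumulator("sum")
    sc.parallelize(1 to 10).foreach(x => acc.add(x))
    acc.value   // 55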

***********************************************

mapToPair(func): runs func on each element of the dataset and returns a key-value pair.
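
mapToPair belongs to Spark's Java API (JavaRDD); in Scala, an ordinary map that returns a tuple yields a pair RDD, as sketched here:

    val words = sc.parallelize(Seq("a", "b", "a"))
    val pairs = words.map(w => (w, 1))   // RDD[(String, Int)], ready for reduceByKey etc.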

***********************************************
