Spark Transformation (transform) and Action (action) List



For the func arguments below, anonymous functions (lambdas) are recommended most of the time to keep the logic clearer.


"Ps:java and Python APIs are the same, names and parameters are unchanged." Transformation meaning map (func) Output one element filter (func) after each INPUT element passes the Func function conversion returns the return value true after the Func function evaluates Input elements of a new dataset Flatmap (func) is similar to map, but each INPUT element can be mapped to 0 or more output elements, so func should return a sequence Mappartitions (func) similar to map, but independently in RDD Runs on every block of the RDD, so when running on a type T,


func function Type must be iterator[t]? Iterator[u] Mappartitionswithsplit (func) is similar to Mappartitions, but Func has an integer parameter representing the index value of the block. So when running on a rdd of type T,

The function type of the
func must Be (Int, iterator[t])? Iterator[u] Sample (Withreplacement,fraction, Seed) samples the data according to the proportions specified by fraction, optionally whether to replace it with random numbers,
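
To make the element-wise versus partition-wise distinction concrete, here is a minimal Scala sketch (the sample data is made up for illustration; it assumes a spark-shell session, where sc is the predefined SparkContext; note that newer Spark releases expose the index-aware variant as mapPartitionsWithIndex):

    // Element-wise transformations: func is applied to individual elements.
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
    val doubled  = nums.map(x => x * 2)              // 2, 4, 6, 8, 10
    val evens    = nums.filter(x => x % 2 == 0)      // 2, 4
    val expanded = nums.flatMap(x => Seq(x, x * 10)) // 1, 10, 2, 20, ...

    // Partition-wise: func receives a whole Iterator[T] per partition.
    val partSums = nums.mapPartitions(iter => Iterator(iter.sum))

    // Index-aware variant: func is (Int, Iterator[T]) => Iterator[U].
    val tagged = nums.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x)))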


union(otherDataset): Returns a new dataset that is the union of the source dataset and the argument dataset.

distinct([numTasks]): Returns a new dataset containing the distinct elements of the source dataset.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default, only 8 parallel tasks are used for the operation, but you can pass an optional numTasks parameter to change this.

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function func. As with groupByKey, the number of reduce tasks can be configured through an optional second parameter.

sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements the Ordered interface, returns a dataset of (K, V) pairs sorted by key. The Boolean parameter ascending determines ascending or descending order.
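
A minimal sketch of the keyed aggregations above (the sample pairs are made up; assumes spark-shell's predefined sc):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // groupByKey: gathers every value per key -> ("a", Seq(1, 1)), ("b", Seq(1))
    val grouped = pairs.groupByKey()

    // reduceByKey: merges values per key with func -> ("a", 2), ("b", 1)
    val counts = pairs.reduceByKey((x, y) => x + y)

    // sortByKey: orders by key; pass ascending = false for descending order.
    val sorted = counts.sortByKey(ascending = true)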


join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. This operation is also called groupWith.
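
For example, joining and cogrouping two keyed datasets might look like this (a sketch with made-up data, again assuming spark-shell's predefined sc):

    val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
    val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Rome")))

    // join: ("alice", (30, "Paris")), ("bob", (25, "Rome"))
    val joined = ages.join(cities)

    // cogroup (a.k.a. groupWith): all values from both sides, grouped per key.
    val cogrouped = ages.cogroup(cities)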


cartesian(otherDataset): Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of two elements).

pipe(command, [envVars]): Pipes the RDD through a shell command.

coalesce(numPartitions): Decreases the number of partitions in the RDD to the specified value; useful after filtering a large dataset down.

repartition(numPartitions): Reshuffles the data in the RDD to produce the given number of partitions.

repartitionAndSortWithinPartitions(partitioner): Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys.
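
A small sketch of the partitioning operators (the partition counts are arbitrary; assumes spark-shell's predefined sc):

    // Start with 100 partitions.
    val big = sc.parallelize(1 to 1000000, numSlices = 100)

    // After a selective filter, shrink the partition count; coalesce avoids a full shuffle.
    val small = big.filter(_ % 1000 == 0).coalesce(10)

    // repartition always shuffles and can also increase the partition count.
    val rebalanced = small.repartition(50)

    println(small.partitions.length)      // 10
    println(rebalanced.partitions.length) // 50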


Actions:

reduce(func): Aggregates all the elements of the dataset using the function func. The function must be commutative and associative so that it can be executed correctly in parallel.

collect(): Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Returns the number of elements in the dataset.

first(): Returns the first element of the dataset; similar to take(1).

take(n): Returns an array with the first n elements of the dataset. Note that this operation is currently not executed in parallel; instead, the driver program computes all the elements.

takeSample(withReplacement, num, seed): Returns an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed.
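
The basic actions in use, as a minimal sketch (the values are illustrative; assumes spark-shell's predefined sc):

    val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

    val total    = nums.reduce((x, y) => x + y) // 15; func must be commutative and associative
    val all      = nums.collect()               // Array(5, 1, 4, 2, 3), materialized at the driver
    val n        = nums.count()                 // 5
    val head     = nums.first()                 // 5
    val firstTwo = nums.take(2)                 // Array(5, 1)
    val sampled  = nums.takeSample(withReplacement = false, num = 3)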


takeOrdered(n, [ordering]): Returns the first n elements of the dataset, using either their natural order or a custom ordering.

saveAsTextFile(path): Writes the elements of the dataset as a text file (or set of text files) in a given directory in the local file system, HDFS, or any other Hadoop-supported file system. Spark calls toString on each element to convert it into a line of text in the file.
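
For instance (the output path is a placeholder; assumes spark-shell's predefined sc):

    val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

    val smallest = nums.takeOrdered(2)                        // Array(1, 2), natural order
    val largest  = nums.takeOrdered(2)(Ordering[Int].reverse) // Array(5, 4), custom ordering

    // Each element is rendered with toString, one line each, across part files.
    nums.saveAsTextFile("hdfs:///tmp/nums-output")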


saveAsSequenceFile(path) (Java and Scala): Writes the elements of the dataset as a Hadoop SequenceFile in the given directory.

saveAsObjectFile(path) (Java and Scala): Writes the elements of the dataset using Java serialization to the given directory.

countByKey(): Only available on RDDs of type (K, V). Returns a Map of (K, Int) pairs with the count of elements for each key.

foreach(func): Runs the function func on each element of the dataset. This is usually done for side effects such as updating an accumulator or interacting with an external storage system such as HBase.
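
A closing sketch of countByKey and a side-effecting foreach (made-up data; the accumulator call uses the Spark 2.x API; assumes spark-shell's predefined sc):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // countByKey: computed and returned at the driver -> Map("a" -> 2, "b" -> 1)
    val perKey = pairs.countByKey()

    // foreach runs func on the executors purely for its side effects.
    val acc = sc.longAccumulator("sum of values")
    pairs.foreach { case (_, v) => acc.add(v) }
    println(acc.value) // 6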