RDD operations in Spark

Last Update:2018-07-23 Source: Internet

Author: User

Transformations

Transformation	Description
map(func)	Applies func to each element of the source RDD and assembles the returned objects into a new RDD. Elements of the new RDD correspond one-to-one with elements of the old RDD.
filter(func)	Applies func to each element of the RDD and forms a new RDD from the elements that satisfy the filter condition.
flatMap(func)	Performs a map operation and then flattens the result. If the map operation would return Array[Array[String]], flatMap returns Array[String], automatically merging the string arrays into one. Put another way, one element of the old RDD can generate multiple new elements: a one-to-many relationship.
mapPartitions(func)	Similar to map, but func processes the data of each partition as a whole, and the results from all partitions are merged into the new RDD.
mapPartitionsWithIndex(func)	Like mapPartitions, but the index of the partition is also passed to func.
sample(withReplacement, fraction, seed)	Samples the RDD, with or without replacement as determined by the withReplacement parameter; fraction is the sampling rate and seed is the random seed.
union(otherDataset)	Combines two RDDs; duplicates are not removed.
intersection(otherDataset)	Returns the intersection of two RDDs, with duplicates removed.
distinct([numTasks])	Removes duplicate elements.
groupByKey([numTasks])	Groups by key; the values for the same key are placed in one collection.
reduceByKey(func, [numTasks])	Combines the values for each key using the function passed in.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])	Aggregates the values for each key; the result is also a pair RDD.
sortByKey([ascending], [numTasks])	Sorts a pair RDD by key.
join(otherDataset, [numTasks])	Equivalent to an inner join in SQL.
cogroup(otherDataset, [numTasks])	Comparable to a full outer join in SQL: every key present in either RDD appears in the result, paired with the grouped values from both RDDs.
cartesian(otherDataset)	Computes the Cartesian product of two RDDs and returns a CartesianRDD.
pipe(command, [envVars])	Feeds each partition of the RDD to the standard input of a shell command. The command's output forms a new RDD whose elements are strings.
coalesce(numPartitions)	Merges partitions; the parameter specifies the number of partitions after merging.
repartition(numPartitions)	A coalesce operation that performs a shuffle.
repartitionAndSortWithinPartitions(partitioner)	Repartitions the RDD according to partitioner and sorts the records by key within each resulting partition. This is more efficient than repartitioning and then sorting within each partition, because the sorting can be integrated into the shuffle phase.
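
To make the difference between map, flatMap, and reduceByKey in the table above more concrete, here is a minimal sketch in plain Python that mimics their semantics on an ordinary list. It does not use Spark: the helper names (rdd_map, rdd_flat_map, rdd_reduce_by_key) are hypothetical and exist only for illustration, and a real RDD would be partitioned across a cluster and evaluated lazily.

```python
from functools import reduce
from itertools import chain


def rdd_map(data, func):
    """Hypothetical stand-in for map: one output element per input element."""
    return [func(x) for x in data]


def rdd_flat_map(data, func):
    """Hypothetical stand-in for flatMap: func returns an iterable per
    element, and all of those iterables are flattened into one list."""
    return list(chain.from_iterable(func(x) for x in data))


def rdd_reduce_by_key(pairs, func):
    """Hypothetical stand-in for reduceByKey: values that share a key are
    combined pairwise with a two-argument function."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


lines = ["a b", "b c"]

# map keeps the nesting: one list of words per input line.
split_lines = rdd_map(lines, str.split)      # [['a', 'b'], ['b', 'c']]

# flatMap merges the per-line lists into a single list of words.
words = rdd_flat_map(lines, str.split)       # ['a', 'b', 'b', 'c']

# Classic word count: pair each word with 1, then sum the values per key.
counts = rdd_reduce_by_key(rdd_map(words, lambda w: (w, 1)),
                           lambda x, y: x + y)
# {'a': 1, 'b': 2, 'c': 1}
```

The sketch collects all values per key before reducing, which is simple to read but is closer to groupByKey followed by a reduction; Spark's reduceByKey also combines values locally within each partition before the shuffle.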

Actions

Action	Description
reduce(func)	Aggregates the elements of the RDD using the function func, which takes two elements and returns one.
collect()	Returns the elements of the RDD as an array.
count()	Returns the number of elements in the RDD.
first()	Returns the first element of the RDD.
take(n)	Returns the first n elements of the RDD as an array.
takeSample(withReplacement, num, [seed])	Returns an array of num randomly sampled elements of the RDD.
takeOrdered(n, [ordering])	Returns the first n elements of the RDD after sorting with the given comparator.
saveAsTextFile(path)	Saves the RDD to a text file.
saveAsSequenceFile(path)	Saves the RDD as a file in Hadoop SequenceFile format.
saveAsObjectFile(path)	Serializes the elements of the RDD into objects and stores them in a file.
countByKey()	For a pair RDD, counts the number of elements for each key.
foreach(func)	Returns nothing; traverses the RDD, applying the function func to each element.
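
As a rough illustration of what several of the actions above return, the following plain-Python sketch reproduces the results of reduce, take, takeOrdered, and countByKey on small in-memory lists. It is an analogy using the standard library, not Spark code; the variable names and sample data are made up for the example.

```python
from collections import Counter
from functools import reduce
import heapq

data = [5, 3, 8, 1]                         # stands in for an RDD of numbers
pairs = [("a", 1), ("b", 2), ("a", 3)]      # stands in for a pair RDD

# Like reduce(func): fold the elements with a two-argument function.
total = reduce(lambda x, y: x + y, data)    # 17

# Like take(2): the first two elements in their existing order.
first_two = data[:2]                        # [5, 3]

# Like takeOrdered(2): the two smallest elements under natural ordering.
smallest_two = heapq.nsmallest(2, data)     # [1, 3]

# Like countByKey(): how many elements exist for each key (values ignored).
per_key = dict(Counter(key for key, _ in pairs))   # {'a': 2, 'b': 1}
```

Note that take preserves the existing order while takeOrdered sorts first, which is why the two results differ on the same data.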

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
