RDD Operations in Spark

Source: Internet
Author: User
Tags: shuffle, sorts
Transformations
Transformation | Description
--- | ---
map(func) | Applies func to each element of the source RDD and assembles the returned values into a new RDD. Elements of the new RDD correspond one-to-one with elements of the old RDD.
filter(func) | Applies func to each element and forms a new RDD from the elements that satisfy the filter condition.
flatMap(func) | Performs a map and then flattens the results. Where map would return Array[Array[String]], flatMap returns Array[String], merging the string arrays into one. Put another way, one element of the old RDD can generate multiple new elements (a one-to-many relationship).
mapPartitions(func) | Similar to map, but func processes the data of each partition as a whole, and the results from all partitions are merged into the new RDD.
mapPartitionsWithIndex(func) | Like mapPartitions, but the index of the partition is also passed to func.
sample(withReplacement, fraction, seed) | Samples the RDD. withReplacement selects sampling with or without replacement, fraction is the sampling rate, and seed seeds the random number generator.
union(otherDataset) | Combines two RDDs without removing duplicates.
intersection(otherDataset) | Returns the intersection of two RDDs, with duplicates removed.
distinct([numTasks]) | Removes duplicate elements.
groupByKey([numTasks]) | Groups a pair RDD by key; the values that share a key are placed in one collection.
reduceByKey(func, [numTasks]) | Combines the values that belong to each key using func.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | Aggregates the values that belong to each key, starting from zeroValue. The result is also a pair RDD.
sortByKey([ascending], [numTasks]) | Sorts a pair RDD by key.
join(otherDataset, [numTasks]) | Equivalent to an inner join in SQL.
cogroup(otherDataset, [numTasks]) | Comparable to a full outer join in SQL: for every key found in either RDD, returns the values from both RDDs grouped under that key.
cartesian(otherDataset) | Computes the Cartesian product of two RDDs and returns a CartesianRDD.
pipe(command, [envVars]) | Feeds each partition of the RDD to the standard input of a shell command. The output of the command forms the new RDD, which is an RDD of strings.
coalesce(numPartitions) | Merges partitions; the parameter gives the number of partitions after merging.
repartition(numPartitions) | A coalesce that performs a shuffle.
repartitionAndSortWithinPartitions(partitioner) | Partitions the RDD according to partitioner and sorts the records by key within each resulting partition. This is more efficient than repartitioning and then sorting within each partition, because the sort is integrated into the shuffle phase.
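
To make the element-level behavior of a few of these transformations concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use the Spark API: an RDD is modeled as a list of partitions, and every name in it (`map_rdd`, `flat_map_rdd`, `reduce_by_key`, `toy_partitioner`, and so on) is a hypothetical helper introduced only for illustration. The partitioner is a toy stand-in chosen to be deterministic, and the sketch ignores laziness and distribution entirely.

```python
from itertools import chain

# Hypothetical stand-in for an RDD: a list of partitions, each a plain list.
lines = [["a b", "c"], ["a c c"]]

def map_rdd(parts, func):
    # map: exactly one output element per input element
    return [[func(x) for x in part] for part in parts]

def filter_rdd(parts, pred):
    # filter: keep only the elements for which pred is true
    return [[x for x in part if pred(x)] for part in parts]

def flat_map_rdd(parts, func):
    # flatMap: func returns a sequence per element; the sequences are flattened
    return [[y for x in part for y in func(x)] for part in parts]

def map_partitions_with_index(parts, func):
    # func receives the partition index and an iterator over that partition
    return [list(func(i, iter(part))) for i, part in enumerate(parts)]

def toy_partitioner(key, num_partitions):
    # Assumption: a deterministic toy partitioner, not the one Spark uses
    return sum(ord(ch) for ch in str(key)) % num_partitions

def reduce_by_key(parts, func, num_partitions):
    # Combine values per key inside each partition first, then route each
    # key to its target partition (the "shuffle") and combine again there.
    out = [dict() for _ in range(num_partitions)]
    for part in parts:
        combined = {}
        for k, v in part:
            combined[k] = func(combined[k], v) if k in combined else v
        for k, v in combined.items():
            target = out[toy_partitioner(k, num_partitions)]
            target[k] = func(target[k], v) if k in target else v
    return [list(d.items()) for d in out]

def collect(parts):
    # Flatten all partitions into one list
    return list(chain.from_iterable(parts))

words = flat_map_rdd(lines, str.split)            # one line -> many words
pairs = map_rdd(words, lambda w: (w, 1))          # one word -> one pair
counts = reduce_by_key(pairs, lambda a, b: a + b, 2)
not_a = filter_rdd(words, lambda w: w != "a")
tagged = map_partitions_with_index(words, lambda i, it: ((i, w) for w in it))

print(dict(collect(counts)))  # word counts across all partitions
```

In this model, `flat_map_rdd` turns two lines in the first partition into three words, which is the one-to-many relationship described for flatMap, while `map_rdd` always preserves the element count. `reduce_by_key` is the only helper here that moves data between partitions, which mirrors why the by-key operations are associated with a shuffle.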
Actions
Action | Description
--- | ---
reduce(func) | Aggregates the elements of the RDD using func, which takes two elements and returns one.
collect() | Converts the RDD to an array.
count() | Returns the number of elements in the RDD.
first() | Returns the first element of the RDD.
take(n) | Returns the first n elements of the RDD as an array.
takeSample(withReplacement, num, [seed]) | Returns an array of num randomly sampled elements.
takeOrdered(n, [ordering]) | Returns the first n elements after sorting with a comparator.
saveAsTextFile(path) | Saves the RDD to a text file.
saveAsSequenceFile(path) | Saves the RDD as a Hadoop SequenceFile.
saveAsObjectFile(path) | Serializes the elements of the RDD into objects and stores them in a file.
countByKey() | For a pair RDD, counts the number of elements for each key.
foreach(func) | Returns nothing; traverses the RDD and applies func to each element.
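
The actions above can be sketched the same way. The following is a plain-Python model, not Spark code: the RDD is again a hypothetical list of partitions, and the helper names (`reduce_rdd`, `take_ordered`, `count_by_key`, and so on) are assumptions made for illustration. It is meant only to show what each action returns to the caller.

```python
import heapq
from collections import Counter
from functools import reduce
from itertools import chain, islice

# Hypothetical stand-in for an RDD: a list of partitions, each a plain list.
numbers = [[5, 3], [8, 1, 9]]

def collect(parts):
    # collect: every element, in partition order, as one list
    return list(chain.from_iterable(parts))

def count(parts):
    # count: total number of elements
    return sum(len(part) for part in parts)

def reduce_rdd(parts, func):
    # Reduce each non-empty partition on its own, then merge the partial
    # results. Because the grouping depends on partitioning, func should be
    # commutative and associative to give a stable answer.
    partials = [reduce(func, part) for part in parts if part]
    return reduce(func, partials)

def take(parts, n):
    # take: the first n elements in partition order, with no sorting
    return list(islice(chain.from_iterable(parts), n))

def take_ordered(parts, n, key=None):
    # takeOrdered: the n smallest elements under the given ordering
    return heapq.nsmallest(n, chain.from_iterable(parts), key=key)

def count_by_key(pair_parts):
    # countByKey: number of elements per key; the values are ignored
    return dict(Counter(k for k, _ in chain.from_iterable(pair_parts)))

total = reduce_rdd(numbers, lambda a, b: a + b)
first_two = take(numbers, 2)
smallest_three = take_ordered(numbers, 3)
largest_two = take_ordered(numbers, 2, key=lambda x: -x)
per_key = count_by_key([[("a", 1), ("b", 2)], [("a", 3)]])

print(total, first_two, smallest_three, largest_two, per_key)
```

Note the difference between `take` and `take_ordered` in this model: the former returns elements in the order they are stored, while the latter sorts by the comparator first, which matches the two descriptions in the table.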
