map(func) |
Applies func to every element of the source RDD and returns a new RDD made up of the results. The elements of the new RDD correspond one-to-one with those of the source RDD |
filter(func) |
Returns a new RDD containing only the elements of the source RDD for which func returns true |
flatMap(func) |
Applies a map operation and then flattens the results into a single collection. If the map step alone would return Array[Array[String]], flatMap returns Array[String], merging the string arrays into one. Put another way, one element of the source RDD can produce several new elements (a one-to-many relationship) |
mapPartitions(func) |
Similar to map, but func runs once per partition: it receives an iterator over that partition's elements and returns an iterator, and the results from all partitions together form the new RDD |
mapPartitionsWithIndex(func) |
Similar to mapPartitions, but func also receives the index of the partition as an argument |
sample(withReplacement, fraction, seed) |
Samples the data. withReplacement selects sampling with or without replacement, fraction is the sampling rate, and seed is the seed for the random number generator |
union(otherDataset) |
Combines two RDDs into one containing the elements of both; duplicates are not removed |
intersection(otherDataset) |
Returns the intersection of two RDDs, with duplicates removed |
distinct([numTasks]) |
Removes duplicates, returning a new RDD that contains the distinct elements of the source RDD |
groupByKey([numTasks]) |
Groups a pair RDD by key; the values that share a key are placed in one collection |
reduceByKey(func, [numTasks]) |
On a pair RDD, merges the values for each key using func, which combines two values into one, leaving a single value per key |
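aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) |
Aggregates the values for each key, starting from zeroValue: seqOp folds values into the running result within a partition, and combOp merges the results from different partitions. The return value is also a pair RDD |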
sortByKey([ascending], [numTasks]) |
Sorts a pair RDD by key, in ascending or descending order according to the ascending argument |
join(otherDataset, [numTasks]) |
Equivalent to an inner join in SQL: for pair RDDs of (K, V) and (K, W), returns (K, (V, W)) pairs for every key present in both |
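cogroup(otherDataset, [numTasks]) |
Comparable to a full outer join in SQL: every key from either pair RDD appears in the result, paired with the collection of its values from each RDD (an empty collection where the key is missing) |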
cartesian(otherDataset) |
Computes the Cartesian product of two RDDs, returning a CartesianRDD of all (a, b) pairs |
pipe(command, [envVars]) |
Each partition of the RDD is piped to the standard input of the shell command. The lines the command writes to standard output form a new RDD, which is an RDD of strings |
coalesce(numPartitions) |
Merges partitions; the parameter specifies the number of partitions after merging |
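repartition(numPartitions) |
A coalesce that performs a shuffle: the data is redistributed across numPartitions partitions, which may be more or fewer than before |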
repartitionAndSortWithinPartitions(partitioner) |
Repartitions the RDD according to the given partitioner and sorts the records by key within each resulting partition. This is more efficient than repartitioning and then sorting within each partition, because the sorting can be folded into the shuffle phase |
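
The behaviour of a few of these transformations can be sketched in plain Python. This is only a toy model, not Spark code: an "RDD" is assumed to be a list of partitions, each partition a list of elements, and the helper names (rdd_map, rdd_flat_map, and so on) are hypothetical stand-ins for the real methods. It is meant to show how map, flatMap, mapPartitions, mapPartitionsWithIndex, reduceByKey and cogroup differ in what func receives and what the result looks like.

```python
from collections import defaultdict
from itertools import chain

# Toy model (assumption): an "RDD" is a list of partitions,
# and each partition is a list of elements. No Spark involved.
partitions = [[1, 2, 3], [4, 5]]


def rdd_map(parts, func):
    # One output element per input element.
    return [[func(x) for x in p] for p in parts]


def rdd_flat_map(parts, func):
    # func returns a sequence; the sequences are flattened.
    return [[y for x in p for y in func(x)] for p in parts]


def rdd_map_partitions(parts, func):
    # func is called once per partition with an iterator.
    return [list(func(iter(p))) for p in parts]


def rdd_map_partitions_with_index(parts, func):
    # Like rdd_map_partitions, but func also gets the partition index.
    return [list(func(i, iter(p))) for i, p in enumerate(parts)]


def rdd_reduce_by_key(parts, func):
    # Merge all values that share a key into one value using func.
    merged = {}
    for key, value in chain.from_iterable(parts):
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def rdd_cogroup(left, right):
    # Every key from either side maps to (values from left, values from right).
    groups = defaultdict(lambda: ([], []))
    for key, value in chain.from_iterable(left):
        groups[key][0].append(value)
    for key, value in chain.from_iterable(right):
        groups[key][1].append(value)
    return dict(groups)


doubled = rdd_map(partitions, lambda x: x * 2)
# [[2, 4, 6], [8, 10]]
expanded = rdd_flat_map(partitions, lambda x: [x, x * 10])
# [[1, 10, 2, 20, 3, 30], [4, 40, 5, 50]]
partition_sums = rdd_map_partitions(partitions, lambda it: [sum(it)])
# [[6], [9]]
tagged = rdd_map_partitions_with_index(
    partitions, lambda i, it: ((i, x) for x in it)
)
# [[(0, 1), (0, 2), (0, 3)], [(1, 4), (1, 5)]]

pairs = [[("a", 1), ("b", 2)], [("a", 3)]]
other = [[("a", "x")], [("c", "y")]]
totals = rdd_reduce_by_key(pairs, lambda a, b: a + b)
# {'a': 4, 'b': 2}
cogrouped = rdd_cogroup(pairs, other)
# {'a': ([1, 3], ['x']), 'b': ([2], []), 'c': ([], ['y'])}
```

In this model the cogroup result keeps keys "b" and "c" even though each appears on one side only, which is the sense in which it resembles a full outer join; a join would keep only "a".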