map
Applies a one-to-one mapping to each data item in the RDD; the number of elements and the number of partitions stay the same.
Example:
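A minimal sketch in the Spark shell (sc is the SparkContext; the data is made up for illustration):
val rdd = sc.parallelize(List(1, 2, 3, 4), 2)
val mapped = rdd.map(_ * 10)     // one-to-one: 4 elements in, 4 elements out
mapped.collect()                 // Array(10, 20, 30, 40)
mapped.partitions.length         // still 2 partitions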
flatMap
Like map, but the result produced for each element is flattened afterwards; it can be understood as one-to-many. (Note: a String is treated as a sequence of characters, so it too gets flattened.)
Example:
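A sketch along the same lines (sc is the SparkContext, the strings are made up):
val lines = sc.parallelize(List("a b", "c d"))
lines.map(_.split(" ")).collect()       // Array(Array(a, b), Array(c, d))
lines.flatMap(_.split(" ")).collect()   // Array(a, b, c, d) -- one-to-many, then flattened
sc.parallelize(List("ab")).flatMap(_.toSeq).collect()   // Array(a, b): a String flattens into Chars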
distinct
Removes duplicate data items from the RDD.
Example:
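For instance (sc is the SparkContext):
val rdd = sc.parallelize(List(1, 2, 2, 3, 3, 3))
rdd.distinct().collect()   // Array(1, 2, 3) -- order is not guaranteed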
coalesce
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
Purpose: re-partitions the RDD, using a HashPartitioner when shuffling.
Parameter 1: the new number of partitions
Parameter 2: whether to shuffle; the default is false (note: the main advantage of coalesce over repartition is that it can avoid a shuffle)
Example: (I gave my virtual machine two cores, so the default is two partitions, and I could not go above 2; perhaps the partition count is bounded by the number of available processors. Some sources say the shuffle parameter must be true when increasing the partition count, yet setting the partitions from 1 to 2 also worked for me. Whether that holds for big data I do not know.)
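A sketch of the behavior described above (starting from 2 partitions; sc is the SparkContext):
val rdd = sc.parallelize(1 to 10, 2)
rdd.coalesce(1).partitions.length                  // 1, no shuffle needed
rdd.coalesce(4).partitions.length                  // typically stays 2: without shuffle, coalesce cannot grow the count
rdd.coalesce(4, shuffle = true).partitions.length  // 4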
repartition
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
Also re-partitions the RDD, but always shuffles; just pass the target partition count. (For jobs with many small tasks, Spark has its own partitioning mechanism; forcing a smaller number of partitions can sometimes speed up the program.)
Example:
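For example (sc is the SparkContext):
val rdd = sc.parallelize(1 to 10, 4)
val fewer = rdd.repartition(2)   // equivalent to coalesce(2, shuffle = true)
fewer.partitions.length          // 2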
randomSplit
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
Randomly splits one RDD into multiple RDDs, according to the weights in the Double array.
The second parameter is the random seed and can usually be ignored.
Example:
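A minimal sketch (the 0.7/0.3 weights and the seed are illustrative):
val rdd = sc.parallelize(1 to 100)
val parts = rdd.randomSplit(Array(0.7, 0.3), seed = 11L)
parts(0).count() + parts(1).count()   // 100 in total; each split's size is only roughly proportional to its weight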
union
def union(other: RDD[T]): RDD[T]
Merges two RDDs without removing duplicates.
Example:
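For instance:
val a = sc.parallelize(List(1, 2, 3))
val b = sc.parallelize(List(3, 4, 5))
a.union(b).collect()   // Array(1, 2, 3, 3, 4, 5) -- the duplicate 3 is kept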
intersection
Returns the intersection of two RDDs, with duplicates removed.
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
numPartitions: specifies the number of partitions of the returned RDD.
partitioner: specifies the partition function.
Example:
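A sketch (sc is the SparkContext):
val a = sc.parallelize(List(1, 2, 2, 3))
val b = sc.parallelize(List(2, 3, 3, 4))
a.intersection(b).collect()   // Array(2, 3) -- deduplicated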
subtract
Returns the data items that appear in this RDD but not in the other RDD.
def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
Example:
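For example:
val a = sc.parallelize(List(1, 2, 3, 4))
val b = sc.parallelize(List(3, 4, 5))
a.subtract(b).collect()   // Array(1, 2) -- the items of a that are absent from b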
mapPartitions
mapPartitionsWithIndex
zip
def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
Effect: pairs up the two RDDs element by element into a key/value-style RDD.
Note: this assumes the two RDDs have the same number of partitions and the same number of elements; otherwise an exception is thrown.
Example:
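A minimal sketch (both RDDs deliberately get 2 partitions and 3 elements):
val nums = sc.parallelize(List(1, 2, 3), 2)
val strs = sc.parallelize(List("a", "b", "c"), 2)
nums.zip(strs).collect()   // Array((1,a), (2,b), (3,c))
// zipping RDDs whose partition or element counts differ throws a SparkException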
zipPartitions
zipWithIndex
def zipWithIndex(): RDD[(T, Long)]
Pairs each element of the RDD with its index (position) in the RDD, producing key/value pairs; this can be used for numbering. Unlike zipWithUniqueId below, it needs to trigger a Spark job to count the elements in each partition first.
Example:
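For instance:
val rdd = sc.parallelize(List("a", "b", "c", "d"), 2)
rdd.zipWithIndex().collect()   // Array((a,0), (b,1), (c,2), (d,3))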
zipWithUniqueId
def zipWithUniqueId(): RDD[(T, Long)]
This function pairs each element of the RDD with a unique ID, producing key/value pairs. The unique ID is generated as follows:
the unique ID of the first element in each partition is that partition's index;
the unique ID of the Nth element in each partition is (the unique ID of the previous element) + (the total number of partitions of the RDD).
Example: (although each machine numbers its elements separately, there are no duplicate IDs across the whole RDD; note the difference from zipWithIndex())
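A sketch that makes the ID formula visible (5 elements over 2 partitions):
val rdd = sc.parallelize(List("a", "b", "c", "d", "e"), 2)
// partition 0 holds (a, b); partition 1 holds (c, d, e)
rdd.zipWithUniqueId().collect()
// Array((a,0), (b,2), (c,1), (d,3), (e,5)) -- start at the partition index, then step by the partition count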
partitionBy
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
This function generates a new ShuffledRDD according to the given partitioner, re-partitioning the original RDD.
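Example: (a minimal sketch using the built-in HashPartitioner; sc is the SparkContext)
import org.apache.spark.HashPartitioner
val rdd = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)), 3)
val repartitioned = rdd.partitionBy(new HashPartitioner(2))
repartitioned.partitions.length   // 2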
mapValues
def mapValues[U](f: (V) => U): RDD[(K, U)]
Like map in the basic transformations, except that mapValues applies the function only to the V part of each [K, V] pair.
Example:
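For instance:
val pairs = sc.parallelize(List(("a", 1), ("b", 2)))
pairs.mapValues(_ * 10).collect()   // Array((a,10), (b,20)) -- keys untouched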
flatMapValues
def flatMapValues[U](f: (V) => TraversableOnce[U]): RDD[(K, U)]
Like flatMap in the basic transformations, except that flatMapValues applies the flatMap only to the V part of each [K, V] pair.
Example:
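A sketch (made-up pairs):
val pairs = sc.parallelize(List(("a", "1 2"), ("b", "3")))
pairs.flatMapValues(_.split(" ")).collect()   // Array((a,1), (a,2), (b,3)) -- the key repeats for each produced value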
reduceByKey
Merges the elements of a key-value RDD by key: values whose keys are equal are combined.
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
This function computes, for each K in RDD[K, V], the combined V value according to the mapping function.
The parameter numPartitions specifies the number of partitions;
the parameter partitioner specifies the partition function.
Example:
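A minimal word-count-style sketch:
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))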
groupByKey
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
This function gathers the V values of each K in RDD[K, V] into a single Iterable[V].
The parameter numPartitions specifies the number of partitions;
the parameter partitioner specifies the partition function.
Example:
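For example:
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.groupByKey().mapValues(_.toList).collect()   // Array((a,List(1, 3)), (b,List(2)))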
reduceByKeyLocally
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
This function combines the V values of each K in RDD[K, V] according to the mapping function, but the result is returned as a Map[K, V] on the driver instead of an RDD[K, V]. (In Python it is actually a dictionary.)
Example:
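A sketch; note that the result is an ordinary Scala Map on the driver, not an RDD:
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKeyLocally(_ + _)   // Map(a -> 4, b -> 2)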
combineByKey
foldByKey
subtractByKey
subtractByKey is similar to subtract in the basic transformations,
but it operates on K: it returns the elements whose keys appear in this RDD but not in the other RDD. An example follows below.
The parameter numPartitions specifies the number of partitions of the result;
the parameter partitioner specifies the partition function.
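Example: (a minimal sketch; the values in the other RDD are irrelevant, only its keys matter)
val a = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
val b = sc.parallelize(List(("b", 99)))
a.subtractByKey(b).collect()   // Array((a,1), (c,3)) -- only keys absent from b survive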
cogroup
(Example failed to run.)
join
leftOuterJoin
rightOuterJoin