groupByKey
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
This function collects the V values for each key K in an RDD[(K, V)] into a single Iterable[V].
The numPartitions parameter specifies the number of partitions of the resulting RDD.
The partitioner parameter specifies the partitioner (partition function) to use.
scala> var rdd1 = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[89] at makeRDD at <console>:21

scala> rdd1.groupByKey().collect
res81: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(2, 1)), (C,CompactBuffer(1)))
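The numPartitions overload is not exercised in the transcript above, so here is a minimal sketch of it. It assumes a local Spark installation and is written as top-level statements to paste into spark-shell or run as a script; the master setting local[2], the application name, and the names pairs, grouped and groupedValues are illustrative choices, not part of the original example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; in spark-shell, getOrCreate
// simply returns the context that already exists.
val conf = new SparkConf().setMaster("local[2]").setAppName("groupByKey-sketch")
val sc = SparkContext.getOrCreate(conf)

// Same key/value pairs as in the transcript above.
val pairs = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))

// Ask for exactly 3 partitions in the grouped RDD.
val grouped = pairs.groupByKey(3)
println(grouped.partitions.length) // prints 3

// The order of values inside each group is not guaranteed, so sort before comparing.
val groupedValues = grouped.mapValues(_.toList.sorted).collect().toMap
println(groupedValues("A")) // prints List(0, 2)
```

Sorting the grouped values is only there to make the result deterministic; groupByKey itself makes no promise about the order of values within a group.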
reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
This function combines the V values for each key K in an RDD[(K, V)] using the supplied function func, producing one value per key.
The numPartitions parameter specifies the number of partitions of the resulting RDD.
The partitioner parameter specifies the partitioner (partition function) to use.
scala> var rdd1 = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at <console>:21

scala> rdd1.partitions.size
res82: Int =

scala> var rdd2 = rdd1.reduceByKey((x, y) => x + y)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at reduceByKey at <console>:23

scala> rdd2.collect
res85: Array[(String, Int)] = Array((A,2), (B,3), (C,1))

scala> rdd2.partitions.size
res86: Int =

scala> var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2), (x, y) => x + y)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[95] at reduceByKey at <console>:23

scala> rdd2.collect
res87: Array[(String, Int)] = Array((B,3), (A,2), (C,1))

scala> rdd2.partitions.size
res88: Int = 2
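The transcript covers the default and the partitioner overloads; the following is a small sketch of the remaining one, which takes numPartitions. As an assumption it runs against a local Spark installation as top-level statements (spark-shell or a script), and the names pairs, summed and totals, the master local[2], and the partition count 4 are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master; getOrCreate reuses an existing context if there is one.
val conf = new SparkConf().setMaster("local[2]").setAppName("reduceByKey-sketch")
val sc = SparkContext.getOrCreate(conf)

// Same key/value pairs as in the transcript above.
val pairs = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))

// Sum the values per key and request 4 partitions for the result.
val summed = pairs.reduceByKey((x: Int, y: Int) => x + y, 4)
println(summed.partitions.length) // prints 4

// Collect into a Map so the comparison does not depend on partition order.
val totals = summed.collect().toMap
println(totals("B")) // prints 3
```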
reduceByKeyLocally
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
This function combines the V values for each key K in an RDD[(K, V)] using the supplied function func, but returns the result to the driver as a Map[K, V] instead of an RDD[(K, V)].
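This section has no transcript of its own, so here is a minimal sketch of reduceByKeyLocally on the same data used for the other two functions. It assumes a local Spark installation and is written as top-level statements for spark-shell or a script; the names pairs and localTotals, the master local[2], and the application name are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master; getOrCreate reuses an existing context if there is one.
val conf = new SparkConf().setMaster("local[2]").setAppName("reduceByKeyLocally-sketch")
val sc = SparkContext.getOrCreate(conf)

val pairs = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))

// The result is a scala.collection.Map held on the driver, not an RDD,
// so no collect call is needed.
val localTotals: scala.collection.Map[String, Int] =
  pairs.reduceByKeyLocally((x: Int, y: Int) => x + y)

println(localTotals("A")) // prints 2
println(localTotals("B")) // prints 3
println(localTotals("C")) // prints 1
```

Because the whole result is materialized on the driver, this variant is only a reasonable choice when the number of distinct keys is small enough to fit in driver memory.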