The aggregateByKey RDD operation is a bit cumbersome to use, so here is a tidied-up usage example for reference.
Straight to the code:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by Edward on 2016/10/27.
 */
object AggregateByKey {
  def main(args: Array[String]) {
    val sparkConf: SparkConf = new SparkConf()
      .setAppName("AggregateByKey")
      .setMaster("local")
    val sc: SparkContext = new SparkContext(sparkConf)

    val data = List((1, 3), (1, 2), (1, 4), (2, 3))
    var rdd = sc.parallelize(data, 2) // split the data into two partitions

    // Merges values across different partitions; both a and b have the data type of zeroValue
    def comb(a: String, b: String): String = {
      println("comb: " + a + "\t" + b)
      a + b
    }

    // Merges values within the same partition; a has the data type of zeroValue,
    // b has the data type of the original value
    def seq(a: String, b: Int): String = {
      println("seq: " + a + "\t" + b)
      a + b
    }

    rdd.foreach(println)

    // zeroValue: a neutral value that defines the return type and participates in the computation
    // seqOp: used to combine values within a single partition
    // combOp: used to combine values across different partitions
    val aggregateByKeyRDD: RDD[(Int, String)] = rdd.aggregateByKey("100")(seq, comb)

    // print the output
    aggregateByKeyRDD.foreach(println)

    sc.stop()
  }
}
Explanation of the output:
/*
The data is split into two partitions:
Partition one: (1,3) (1,2)
Partition two: (1,4) (2,3)

Data with the same key inside partition one is merged:
seq: 100    3    // (1,3) is merged with the neutral value first, giving 1003
seq: 1003   2    // (1,2) is merged next, giving 10032

Data with the same key inside partition two is merged:
seq: 100    4    // (1,4) is merged with the neutral value, giving 1004
seq: 100    3    // (2,3) is merged with the neutral value, giving 1003

The results of the two partitions are then merged:
Key 2 exists in only one partition, so no cross-partition merge is needed: (2,1003)
Key 1 exists in both partitions and the data types are consistent, so the partials are merged:
comb: 10032 1004
*/
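The walkthrough above can be reproduced without Spark at all. The following is a minimal pure-Scala sketch (the object name `AggregateByKeySim` and the `simulate` helper are invented for illustration): it folds each key's values within a partition starting from the zero value, as seqOp does, and then merges the per-partition partials with the comb function, as combOp does.

```scala
object AggregateByKeySim {
  // seqOp analogue: merge a value into the partition accumulator
  def seq(a: String, b: Int): String = a + b
  // combOp analogue: merge two partition accumulators
  def comb(a: String, b: String): String = a + b

  def simulate(partitions: List[List[(Int, Int)]], zero: String): Map[Int, String] = {
    // per-partition aggregation: fold each key's values starting from the zero value
    val perPartition: List[Map[Int, String]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) =>
        k -> kvs.map(_._2).foldLeft(zero)(seq)
      }
    }
    // cross-partition merge of the partial results
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(comb)
    }
  }

  def main(args: Array[String]): Unit = {
    // same data and partitioning as the Spark example above
    val result = simulate(List(List((1, 3), (1, 2)), List((1, 4), (2, 3))), "100")
    assert(result(1) == "100321004") // comb("10032", "1004")
    assert(result(2) == "1003")      // key 2 lives in a single partition
    println(result)
  }
}
```

This reproduces the final pairs (1,100321004) and (2,1003) from the explanation above.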
Refer to the code together with the following descriptions to understand it.
Description from the official website:
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
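The key point in that description is that the aggregated type U may differ from the input value type V. A common illustration is the per-key average, where each Int value (V) is folded into a (sum, count) pair (U). The sketch below simulates this with plain Scala collections; the object name `SumCountAvg` is invented here, and in Spark the same two functions would be passed as `rdd.aggregateByKey((0, 0))(seqOp, combOp)`.

```scala
object SumCountAvg {
  // seqOp analogue: merge a value V = Int into an accumulator U = (sum, count)
  val seqOp: ((Int, Int), Int) => (Int, Int) =
    (acc, v) => (acc._1 + v, acc._2 + 1)
  // combOp analogue: merge two accumulators of type U
  val combOp: ((Int, Int), (Int, Int)) => (Int, Int) =
    (u1, u2) => (u1._1 + u2._1, u1._2 + u2._2)

  def averages(partitions: List[List[(String, Int)]]): Map[String, Double] = {
    // per-partition partials, one (sum, count) per key per partition
    val partials = partitions.flatMap(_.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).foldLeft((0, 0))(seqOp)
    })
    // cross-partition merge, then divide sum by count
    partials.groupBy(_._1).map { case (k, kvs) =>
      val (sum, count) = kvs.map(_._2).reduce(combOp)
      k -> sum.toDouble / count
    }
  }

  def main(args: Array[String]): Unit = {
    val avg = averages(List(List(("a", 1), ("a", 3)), List(("a", 5), ("b", 4))))
    assert(avg == Map("a" -> 3.0, "b" -> 4.0))
    println(avg)
  }
}
```

Note that a plain reduceByKey could not do this directly, because its merge function must return the same type as the input values.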
Description of the function in the Spark source code:
/**
 * Aggregate the values of each key, using given combine functions and a neutral "zero value".
 * This function can return a different result type, U, than the type of the values in this RDD,
 * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
 * as in scala.TraversableOnce. The former operation is used for merging values within a
 * partition, and the latter is used for merging values between partitions. To avoid memory
 * allocation, both of these functions are allowed to modify and return their first argument
 * instead of creating a new U.
 */
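The last sentence of that scaladoc, about modifying and returning the first argument, deserves a small illustration. Below is a hedged sketch in plain Scala (no Spark; the object name `MutableAccSketch` is invented) where both functions mutate a mutable.HashSet accumulator in place instead of allocating a new set on every merge; mutating the accumulator is safe in Spark because the zero value is copied for each key rather than shared.

```scala
import scala.collection.mutable

object MutableAccSketch {
  // seqOp-style: add the value to the accumulator and return the SAME set (no new allocation)
  val seqOp: (mutable.HashSet[Int], Int) => mutable.HashSet[Int] =
    (set, v) => { set += v; set }
  // combOp-style: drain one partition's set into the other and return the first
  val combOp: (mutable.HashSet[Int], mutable.HashSet[Int]) => mutable.HashSet[Int] =
    (s1, s2) => { s1 ++= s2; s1 }

  def main(args: Array[String]): Unit = {
    // two simulated partitions of values for a single key
    val part1 = List(3, 2, 3).foldLeft(mutable.HashSet.empty[Int])(seqOp)
    val part2 = List(4).foldLeft(mutable.HashSet.empty[Int])(seqOp)
    val merged = combOp(part1, part2)
    assert(merged == mutable.HashSet(2, 3, 4))
    // merged is the very same object as part1: it was mutated in place, not copied
    assert(merged eq part1)
    println(merged)
  }
}
```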
Spark RDD aggregateByKey