The aggregateByKey RDD operation is a bit cumbersome to use, so here is a tidied-up usage example for reference.
Straight to the code:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by Edward on 2016/10/27.
 */
object AggregateByKey {
  def main(args: Array[String]) {
    val sparkConf: SparkConf = new SparkConf()
      .setAppName("AggregateByKey")
      .setMaster("local")
    val sc: SparkContext = new SparkContext(sparkConf)

    val data = List((1, 3), (1, 2), (1, 4), (2, 3))
    var rdd = sc.parallelize(data, 2) // split the data into two partitions

    // Merges values across different partitions; both a and b have the data type of zeroValue
    def comb(a: String, b: String): String = {
      println("comb: " + a + "\t" + b)
      a + b
    }

    // Merges values within the same partition; a has the data type of zeroValue,
    // b has the data type of the original value
    def seq(a: String, b: Int): String = {
      println("seq: " + a + "\t" + b)
      a + b
    }

    rdd.foreach(println)

    // zeroValue: a neutral value that defines the return type and participates in the computation
    // seqOp: used to combine values within a single partition
    // combOp: used to combine values across different partitions
    val aggregateByKeyRDD: RDD[(Int, String)] = rdd.aggregateByKey("100")(seq, comb)

    // print the output
    aggregateByKeyRDD.foreach(println)

    sc.stop()
  }
}
Explanation of the output:
/*
The data is split into two partitions:
Partition one: (1,3) (1,2)
Partition two: (1,4) (2,3)

Data with the same key inside partition one is merged:
seq: 100    3    // (1,3) is merged with the neutral value first, giving 1003
seq: 1003   2    // (1,2) is merged next, giving 10032

Data with the same key inside partition two is merged:
seq: 100    4    // (1,4) is merged with the neutral value, giving 1004
seq: 100    3    // (2,3) is merged with the neutral value, giving 1003

The results of the two partitions are then merged:
Key 2 exists in only one partition, so no cross-partition merge is needed: (2,1003)
Key 1 exists in both partitions and the data types are consistent, so the partials are merged:
comb: 10032 1004
*/
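The walkthrough above can be reproduced without Spark at all. The following is a minimal pure-Scala sketch (the object name `AggregateByKeySim` and the `simulate` helper are invented for illustration): it folds each key's values within a partition starting from the zero value, as seqOp does, and then merges the per-partition partials with the comb function, as combOp does.

```scala
object AggregateByKeySim {
  // seqOp analogue: merge a value into the partition accumulator
  def seq(a: String, b: Int): String = a + b
  // combOp analogue: merge two partition accumulators
  def comb(a: String, b: String): String = a + b

  def simulate(partitions: List[List[(Int, Int)]], zero: String): Map[Int, String] = {
    // per-partition aggregation: fold each key's values starting from the zero value
    val perPartition: List[Map[Int, String]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) =>
        k -> kvs.map(_._2).foldLeft(zero)(seq)
      }
    }
    // cross-partition merge of the partial results
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(comb)
    }
  }

  def main(args: Array[String]): Unit = {
    // same data and partitioning as the Spark example above
    val result = simulate(List(List((1, 3), (1, 2)), List((1, 4), (2, 3))), "100")
    assert(result(1) == "100321004") // comb("10032", "1004")
    assert(result(2) == "1003")      // key 2 lives in a single partition
    println(result)
  }
}
```

This reproduces the final pairs (1,100321004) and (2,1003) from the explanation above.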
Refer to the code together with the following descriptions to understand it.
Description from the official website:
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
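The key point in that description is that the aggregated type U may differ from the input value type V. A common illustration is the per-key average, where each Int value (V) is folded into a (sum, count) pair (U). The sketch below simulates this with plain Scala collections; the object name `SumCountAvg` is invented here, and in Spark the same two functions would be passed as `rdd.aggregateByKey((0, 0))(seqOp, combOp)`.

```scala
object SumCountAvg {
  // seqOp analogue: merge a value V = Int into an accumulator U = (sum, count)
  val seqOp: ((Int, Int), Int) => (Int, Int) =
    (acc, v) => (acc._1 + v, acc._2 + 1)
  // combOp analogue: merge two accumulators of type U
  val combOp: ((Int, Int), (Int, Int)) => (Int, Int) =
    (u1, u2) => (u1._1 + u2._1, u1._2 + u2._2)

  def averages(partitions: List[List[(String, Int)]]): Map[String, Double] = {
    // per-partition partials, one (sum, count) per key per partition
    val partials = partitions.flatMap(_.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).foldLeft((0, 0))(seqOp)
    })
    // cross-partition merge, then divide sum by count
    partials.groupBy(_._1).map { case (k, kvs) =>
      val (sum, count) = kvs.map(_._2).reduce(combOp)
      k -> sum.toDouble / count
    }
  }

  def main(args: Array[String]): Unit = {
    val avg = averages(List(List(("a", 1), ("a", 3)), List(("a", 5), ("b", 4))))
    assert(avg == Map("a" -> 3.0, "b" -> 4.0))
    println(avg)
  }
}
```

Note that a plain reduceByKey could not do this directly, because its merge function must return the same type as the input values.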
Description of the function in the Spark source code:
/**
 * Aggregate the values of each key, using given combine functions and a neutral "zero value".
 * This function can return a different result type, U, than the type of the values in this RDD,
 * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
 * as in scala.TraversableOnce. The former operation is used for merging values within a
 * partition, and the latter is used for merging values between partitions. To avoid memory
 * allocation, both of these functions are allowed to modify and return their first argument
 * instead of creating a new U.
 */
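The last sentence of that scaladoc, about modifying and returning the first argument, deserves a small illustration. Below is a hedged sketch in plain Scala (no Spark; the object name `MutableAccSketch` is invented) where both functions mutate a mutable.HashSet accumulator in place instead of allocating a new set on every merge; mutating the accumulator is safe in Spark because the zero value is copied for each key rather than shared.

```scala
import scala.collection.mutable

object MutableAccSketch {
  // seqOp-style: add the value to the accumulator and return the SAME set (no new allocation)
  val seqOp: (mutable.HashSet[Int], Int) => mutable.HashSet[Int] =
    (set, v) => { set += v; set }
  // combOp-style: drain one partition's set into the other and return the first
  val combOp: (mutable.HashSet[Int], mutable.HashSet[Int]) => mutable.HashSet[Int] =
    (s1, s2) => { s1 ++= s2; s1 }

  def main(args: Array[String]): Unit = {
    // two simulated partitions of values for a single key
    val part1 = List(3, 2, 3).foldLeft(mutable.HashSet.empty[Int])(seqOp)
    val part2 = List(4).foldLeft(mutable.HashSet.empty[Int])(seqOp)
    val merged = combOp(part1, part2)
    assert(merged == mutable.HashSet(2, 3, 4))
    // merged is the very same object as part1: it was mutated in place, not copied
    assert(merged eq part1)
    println(merged)
  }
}
```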
Spark RDD aggregateByKey