aggregate(zeroValue, seqOp, combOp)
The initial value and the first element of the first partition are passed to the seqOp function; its result and the second element are then passed to seqOp again, and so on until the last element of the partition has been consumed. The second partition is processed the same way. Finally, the initial value and the per-partition results are folded together by the combOp function (the first two are combined, then that result and the next partition result are passed to combOp, and so on), and the final result is returned.
>>> data = sc.parallelize([1, 2, 3, 4, 5, 6], 2)
>>> def seq(a, b):
...     print 'seqOp: ' + str(a) + "\t" + str(b)
...     return min(a, b)
...
>>> def combine(a, b):
...     print 'comOp: ' + str(a) + "\t" + str(b)
...     return a + b
...
>>> data.aggregate(3, seq, combine)
seqOp: 3	1
seqOp: 1	2
seqOp: 1	3
seqOp: 3	4
seqOp: 3	5
seqOp: 3	6
comOp: 3	1
comOp: 4	3
7
>>>
From the output above we can see that 1, 2, 3 went into one partition and 4, 5, 6 into the other. The initial value 3 and the first element 1 are passed to seq, which returns the minimum, 1; then 1 and the second element 2 are passed to seq, returning 1, and so on, so the first partition finally yields the minimum 1. The second partition works the same way and yields the minimum 3. Finally, the initial value and the results of the two partitions are folded by combine: the initial value 3 and the first partition's result 1 are passed to combine, which returns 4; then 4 and the second partition's result 3 are passed to combine, which returns the final result 7.
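To make the evaluation order explicit, the same folding can be sketched in plain Python (my own illustration, not part of the original example):

from functools import reduce  # built in on Python 2, imported on Python 3

def simulate_aggregate(partitions, zero, seq_op, comb_op):
    # fold each partition with seq_op, starting from the initial value
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # fold the partition results with comb_op, again starting from the initial value
    return reduce(comb_op, partials, zero)

partitions = [[1, 2, 3], [4, 5, 6]]  # the two partitions from the example
print(simulate_aggregate(partitions, 3, min, lambda a, b: a + b))  # prints 7

Note that the initial value participates both in each partition's fold and in the final combine fold, which is why the combine step above starts from 3 rather than from the first partition's result.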
aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions)

In a key-value RDD, values are grouped by key and merged. Within each partition, the values for a key are folded by the seqFunc function starting from the initial value (the initial value and the first value are passed to seqFunc, then its result and the next value, and so on), producing an intermediate key-value pair per partition. The intermediate results are then merged by key: for each key, the per-partition results are passed to the combFunc function (the first two results are combined, then that result and the next one, and so on), and the key together with the computed result is output as a new key-value pair.
See the code:
>>> data = sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)])
>>> def seq(a, b):
...     return max(a, b)
...
>>> def combine(a, b):
...     return a + b
...
>>> data.aggregateByKey(3, seq, combine, 4).collect()
[(1, 10), (2, 3)]
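To see where (1, 10) comes from, here is a quick hand check in plain Python (my own sketch, assuming each of the four pairs lands in its own input partition, which is the default when parallelizing on four cores):

parts = [[(1, 3)], [(1, 2)], [(1, 4)], [(2, 3)]]  # assumed partitioning
partial = {}
for part in parts:
    for k, v in part:
        # seq combines the initial value 3 with each value, once per partition
        partial.setdefault(k, []).append(max(3, v))
# combine then sums the per-partition results for each key
result = sorted((k, sum(vs)) for k, vs in partial.items())
print(result)  # [(1, 10), (2, 3)]

For key 1 this gives max(3,3) + max(3,2) + max(3,4) = 3 + 3 + 4 = 10, and for key 2 just max(3,3) = 3.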
However, I ran into something confusing when using it:
When PySpark is started with ./bin/pyspark --master local[3] (or more cores), the expected result is returned.
If the number of cores is less than or equal to 2, the result differs, e.g. [(1, 7), (2, 3)].
I do not know the reason. Someone online said: presumably the data is computed in parallel by default, and since we set local without specifying the number of cores to use, the parallel computation could not be carried out, only part of the results was kept, and so the final result came out wrong...
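For what it's worth, Spark's documented semantics suggest a simpler explanation: zeroValue is applied once per input partition for each key, and sc.parallelize without an explicit partition count defaults to the number of cores, so the result depends on how the input happens to be partitioned. A quick check (my own experiment, reusing the seq and combine defined above; the outputs shown are what those semantics predict, and collect() order may vary):

>>> rdd2 = sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)], 2)  # force 2 input partitions
>>> rdd2.aggregateByKey(3, seq, combine).collect()
[(1, 7), (2, 3)]
>>> rdd4 = sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)], 4)  # force 4 input partitions
>>> rdd4.aggregateByKey(3, seq, combine).collect()
[(1, 10), (2, 3)]

With 2 partitions, key 1 is folded as max(3, 3), then max(3, 2), giving 3 in the first partition, and max(3, 4) = 4 in the second, so combine yields 3 + 4 = 7; with 4 partitions it yields 3 + 3 + 4 = 10, which matches the behavior described above.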