The difference between aggregate and aggregateByKey in Spark, and a point of confusion

aggregate(zeroValue, seqOp, combOp)

Within the first partition, the initial value and the first element are passed to the seq function; the result of that call and the second element are then passed to seq, and so on until the last element has been folded in. Every other partition is processed the same way. Finally, the initial value and the per-partition results are folded together by the combine function (the initial value and the first partition result are combined, then the returned result and the next partition result are passed to combine, and so on), and the final result is returned.
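Conceptually, aggregate behaves like the following local sketch (a simplified model only, not Spark's real distributed implementation; simulate_aggregate is a made-up helper, and partitions is assumed to be a list of lists):

from functools import reduce

def simulate_aggregate(partitions, zero, seq_op, comb_op):
    # seqOp folds each partition's elements, starting from the zero value
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # combOp then folds the per-partition results, again starting from zero
    return reduce(comb_op, partials, zero)

# simulate_aggregate([[1, 2, 3], [4, 5, 6]], 3, min, lambda a, b: a + b)
# returns 7, matching the Spark run below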

>>> data = sc.parallelize([1, 2, 3, 4, 5, 6], 2)
>>> def seq(a, b):
...     print 'seqOp: ' + str(a) + "\t" + str(b)
...     return min(a, b)
...
>>> def combine(a, b):
...     print 'combOp: ' + str(a) + "\t" + str(b)
...     return a + b
...
>>> data.aggregate(3, seq, combine)
seqOp: 3    1
seqOp: 1    2
seqOp: 1    3
seqOp: 3    4
seqOp: 3    5
seqOp: 3    6
combOp: 3   1
combOp: 4   3
7
>>>
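You can check how parallelize split the data with glom(), which collects each partition's contents as a separate list (output shown from an assumed local run):

>>> data.glom().collect()
[[1, 2, 3], [4, 5, 6]]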

From the output above you can see that 1, 2, 3 went into one partition and 4, 5, 6 into the other. The initial value 3 and the first element 1 are passed to the seq function, which returns the minimum, 1; then 1 and the second element 2 are passed to seq, which returns 1, and so on, so the first partition's result is its minimum, 1. The second partition works the same way, but since the initial value 3 is smaller than 4, 5, and 6, its result is 3. Finally, the initial value 3 and the two partition results are folded by the combine function: 3 and the first partition's result 1 are passed to combine, which returns 4; then 4 and the second partition's result 3 are passed to combine, which returns the final result, 7.

aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions)

For an RDD of key-value pairs, the values are grouped by key. Within each input partition, every value for a given key is folded into the initial value by the seqFunc function, yielding one intermediate result per key per partition. These intermediate results are then merged by key: for each key, the per-partition results are passed to the combFunc function (the first two results are combined, then the returned result and the next result are passed to combFunc, and so on), and each key together with its combined result is output as a new key-value pair. Note that, unlike aggregate, the initial value is not folded in again at the combine step.
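As a rough local model of those two phases (a sketch only; simulate_aggregate_by_key is a made-up name, and real Spark shuffles the intermediate results between the two steps):

from collections import defaultdict
from functools import reduce

def simulate_aggregate_by_key(partitions, zero, seq_func, comb_func):
    # map side: within each partition, fold each key's values into the zero value
    per_key = defaultdict(list)
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = seq_func(acc.get(k, zero), v)
        for k, r in acc.items():
            per_key[k].append(r)
    # reduce side: combine the per-partition results for each key;
    # note the zero value is not applied again here
    return [(k, reduce(comb_func, rs)) for k, rs in per_key.items()]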

See the code:

>>> data = sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)])
>>> def seq(a, b):
...     return max(a, b)
...
>>> def combine(a, b):
...     return a + b
...
>>> data.aggregateByKey(3, seq, combine, 4).collect()
[(1, 10), (2, 3)]

However, a confusing problem came up when using it:

When pyspark is started with ./bin/pyspark --master local[3], or with more than 3 cores, the expected result above is returned.
If the core count is less than or equal to 2, there is a discrepancy, such as [(1, 7), (2, 3)].
I did not know the reason. Someone online said: I suspect the data is computed in parallel by default, and because we set local without specifying the number of cores to use, the parallel computation could not be carried out, only part of the computation's results were kept, and so the result came out wrong...
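The behavior is actually consistent with how the zero value is applied: aggregateByKey folds it in once per key per input partition, and parallelize called without a slice count uses sc.defaultParallelism, which in local[N] mode is N. Under local[2] the four pairs land in two partitions, [(1, 3), (1, 2)] and [(1, 4), (2, 3)], so key 1 yields max(max(3, 3), 2) = 3 from the first partition and max(3, 4) = 4 from the second, and combining gives 3 + 4 = 7. Pinning the number of input partitions makes the result independent of the core count; a sketch of the check (output order from collect() may vary):

>>> sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)], 4).aggregateByKey(3, seq, combine).collect()
[(1, 10), (2, 3)]
>>> sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)], 2).aggregateByKey(3, seq, combine).collect()
[(1, 7), (2, 3)]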
