aggregate(zeroValue, seqOp, combOp)
The initial value and the first element of the first partition are passed to the seqOp function; its result and the second element are then passed to seqOp again, and so on until the last element of the partition has been consumed. The second partition is processed the same way. Finally, the initial value and the per-partition results are folded together by the combOp function (the first two are combined, then that result and the next partition result are passed to combOp, and so on), and the final result is returned.
>>> data = sc.parallelize([1, 2, 3, 4, 5, 6], 2)
>>> def seq(a, b):
...     print 'seqOp: ' + str(a) + "\t" + str(b)
...     return min(a, b)
...
>>> def combine(a, b):
...     print 'comOp: ' + str(a) + "\t" + str(b)
...     return a + b
...
>>> data.aggregate(3, seq, combine)
seqOp: 3	1
seqOp: 1	2
seqOp: 1	3
seqOp: 3	4
seqOp: 3	5
seqOp: 3	6
comOp: 3	1
comOp: 4	3
7
>>>
From the output above we can see that 1, 2, 3 went into one partition and 4, 5, 6 into the other. The initial value 3 and the first element 1 are passed to seq, which returns the minimum, 1; then 1 and the second element 2 are passed to seq, returning 1, and so on, so the first partition finally yields the minimum 1. The second partition works the same way and yields the minimum 3. Finally, the initial value and the results of the two partitions are folded by combine: the initial value 3 and the first partition's result 1 are passed to combine, which returns 4; then 4 and the second partition's result 3 are passed to combine, which returns the final result 7.
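To make the evaluation order explicit, the same folding can be sketched in plain Python (my own illustration, not part of the original example):

from functools import reduce  # built in on Python 2, imported on Python 3

def simulate_aggregate(partitions, zero, seq_op, comb_op):
    # fold each partition with seq_op, starting from the initial value
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # fold the partition results with comb_op, again starting from the initial value
    return reduce(comb_op, partials, zero)

partitions = [[1, 2, 3], [4, 5, 6]]  # the two partitions from the example
print(simulate_aggregate(partitions, 3, min, lambda a, b: a + b))  # prints 7

Note that the initial value participates both in each partition's fold and in the final combine fold, which is why the combine step above starts from 3 rather than from the first partition's result.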
aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions)

In a key-value RDD, values are grouped by key and merged. Within each partition, the values for a key are folded by the seqFunc function starting from the initial value (the initial value and the first value are passed to seqFunc, then its result and the next value, and so on), producing an intermediate key-value pair per partition. The intermediate results are then merged by key: for each key, the per-partition results are passed to the combFunc function (the first two results are combined, then that result and the next one, and so on), and the key together with the computed result is output as a new key-value pair.
See the code:
>>> data = sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)])
>>> def seq(a, b):
...     return max(a, b)
...
>>> def combine(a, b):
...     return a + b
...
>>> data.aggregateByKey(3, seq, combine, 4).collect()
[(1, 10), (2, 3)]
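To see where (1, 10) comes from, here is a quick hand check in plain Python (my own sketch, assuming each of the four pairs lands in its own input partition, which is the default when parallelizing on four cores):

parts = [[(1, 3)], [(1, 2)], [(1, 4)], [(2, 3)]]  # assumed partitioning
partial = {}
for part in parts:
    for k, v in part:
        # seq combines the initial value 3 with each value, once per partition
        partial.setdefault(k, []).append(max(3, v))
# combine then sums the per-partition results for each key
result = sorted((k, sum(vs)) for k, vs in partial.items())
print(result)  # [(1, 10), (2, 3)]

For key 1 this gives max(3,3) + max(3,2) + max(3,4) = 3 + 3 + 4 = 10, and for key 2 just max(3,3) = 3.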
However, I ran into something confusing when using it:
When PySpark is started with ./bin/pyspark --master local[3] (or more cores), the expected result is returned.
If the number of cores is less than or equal to 2, the result differs, e.g. [(1, 7), (2, 3)].
I do not know the reason. Someone online said: presumably the data is computed in parallel by default, and since we set local without specifying the number of cores to use, the parallel computation could not be carried out, only part of the results was kept, and so the final result came out wrong...
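For what it's worth, Spark's documented semantics suggest a simpler explanation: zeroValue is applied once per input partition for each key, and sc.parallelize without an explicit partition count defaults to the number of cores, so the result depends on how the input happens to be partitioned. A quick check (my own experiment, reusing the seq and combine defined above; the outputs shown are what those semantics predict, and collect() order may vary):

>>> rdd2 = sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)], 2)  # force 2 input partitions
>>> rdd2.aggregateByKey(3, seq, combine).collect()
[(1, 7), (2, 3)]
>>> rdd4 = sc.parallelize([(1, 3), (1, 2), (1, 4), (2, 3)], 4)  # force 4 input partitions
>>> rdd4.aggregateByKey(3, seq, combine).collect()
[(1, 10), (2, 3)]

With 2 partitions, key 1 is folded as max(3, 3), then max(3, 2), giving 3 in the first partition, and max(3, 4) = 4 in the second, so combine yields 3 + 4 = 7; with 4 partitions it yields 3 + 3 + 4 = 10, which matches the behavior described above.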