The Spark aggregate function explained in detail


aggregate is a fairly common function in Spark, but it is also one of the harder ones to understand. Here we work through a few detailed examples to pin down how it is used.

1. First, the function signature of aggregate

In Spark's source code, you can see the signature of the aggregate function as follows:

  def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

As you can see, this is a curried function: its parameters are split into two lists, (zeroValue: U) and (seqOp: (U, T) => U, combOp: (U, U) => U).

2. Usage of aggregate

The signature looks complicated and may leave you dizzy at first glance. Don't worry: once we read the comment above the function, its usage becomes clear.

  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into a U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   *
   * @param zeroValue the initial value for the accumulated result of each partition for the
   *                  `seqOp` operator, and also the initial value for the combine results from
   *                  different partitions for the `combOp` operator - this will typically be the
   *                  neutral element (e.g. `Nil` for list concatenation or `0` for summation)
   * @param seqOp an operator used to accumulate results within a partition
   * @param combOp an associative operator used to combine results from different partitions
   */

In other words: aggregate first aggregates the elements within each partition, and then aggregates the results of all the partitions, using the given functions and an initial "zero value". The function can return a result type U that differs from the RDD's element type T, so it needs one operation that merges a T into a U and one operation that merges two U's. Both functions are allowed to modify and return their first argument instead of creating a new U, to avoid reallocating memory.
Parameter zeroValue: the initial value for the accumulated result of each partition (for the seqOp operator), and also the initial value for combining the results from different partitions (for the combOp operator). This is typically the neutral element, e.g. Nil for list concatenation or 0 for summation.
Parameter seqOp: the aggregate function that accumulates results within a partition.
Parameter combOp: an associative operator that combines the results from different partitions.
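
To make the three parameters concrete, here is a minimal sketch (assuming a SparkContext named sc, as in the examples below) that sums an RDD of integers:

    // zeroValue = 0; seqOp folds an element into the running total within a partition;
    // combOp adds the per-partition totals together.
    val total = sc.parallelize(List(1, 2, 3, 4), 2).aggregate(0)(
      (acc, n) => acc + n, // seqOp: (U, T) => U, here U = T = Int
      (a, b) => a + b      // combOp: (U, U) => U
    )
    // total == 10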

3. Averaging

That covers the theory; now for the practical part.
First, look at one of the most common examples found online:

    val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
    val (mul, sum, count) = sc.parallelize(list, 2).aggregate((1, 0, 0))(
      (acc, number) => (acc._1 * number, acc._2 + number, acc._3 + 1),
      (x, y) => (x._1 * y._1, x._2 + y._2, x._3 + y._3)
    )
    (sum / count, mul)

This is a slight variation on the usual averaging example: sum is the running sum, count is the number of elements accumulated, and mul is the product of all the elements.
The process in detail:
1. The initial value is (1, 0, 0).
2. number corresponds to the T in the signature, i.e. an element of the list, here of type Int. The type of acc is (Int, Int, Int). acc._1 * number multiplies the elements together (initial value 1), acc._2 + number adds them up, and acc._3 + 1 counts them.
3. sum / count then gives the average.
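
To see what each partition contributes, the same computation can be simulated locally with foldLeft. This sketch assumes the nine elements are split as [1, 2, 3, 4] and [5, 6, 7, 8, 9], which is how parallelize typically slices a local collection, though the exact split is an implementation detail:

    val seqOp  = (acc: (Int, Int, Int), n: Int) => (acc._1 * n, acc._2 + n, acc._3 + 1)
    val combOp = (x: (Int, Int, Int), y: (Int, Int, Int)) => (x._1 * y._1, x._2 + y._2, x._3 + y._3)

    val p0 = List(1, 2, 3, 4).foldLeft((1, 0, 0))(seqOp)    // (24, 10, 4)
    val p1 = List(5, 6, 7, 8, 9).foldLeft((1, 0, 0))(seqOp) // (15120, 35, 5)
    val merged = combOp(p0, p1)                             // (362880, 45, 9)
    // sum / count = 45 / 9 = 5, and mul = 362880 (i.e. 9!)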

4. A further example

To deepen your understanding, look at another example.

    val raw = List("A", "B", "D", "F", "G", "H", "O", "Q", "X", "Y")
    val (biggerThanF, lessThanF) = sc.parallelize(raw, 1).aggregate((0, 0))(
      (cc, str) => {
        var biggerF = cc._1
        var lessF = cc._2
        if (str.compareTo("F") >= 0) biggerF = cc._1 + 1
        else if (str.compareTo("F") < 0) lessF = cc._2 + 1
        (biggerF, lessF)
      },
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )

In this example, we count how many elements of the raw list compare greater than or equal to "F" and how many compare less than "F"; here the result is (7, 3), since "F" itself falls into the first bucket. The logic of the code is straightforward, so we won't walk through it.

5. Comparing aggregateByKey and combineByKey

aggregate operates on a plain sequence of elements, while aggregateByKey operates on (K, V) pairs. As the name implies, aggregateByKey performs an aggregate operation per key. Its prototype in Spark is as follows:

  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }
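
As a quick illustration of the per-key behaviour (a hypothetical example, assuming a SparkContext sc), here aggregateByKey builds a (sum, count) pair for each key:

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 2)
    val sumCountByKey = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1), // seqOp: fold one value into (sum, count)
      (x, y) => (x._1 + y._1, x._2 + y._2)  // combOp: merge per-partition (sum, count) pairs
    )
    // sumCountByKey.collect() => Array(("a", (4, 2)), ("b", (6, 2))) (order may vary)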

All of this is about (K, V) pairs, and Spark also has a combineByKey operation:

  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
  }

To see how the two are connected, let's look at what the real implementation of aggregateByKey does internally:
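
The overload that the prototype above delegates to looks like this in the Spark source (quoted from memory of the 2.x codebase, so details may vary slightly between versions):

    def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
        combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
      // Serialize the zero value to a byte array so that we can get a new clone of it on each key
      val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
      val zeroArray = new Array[Byte](zeroBuffer.limit)
      zeroBuffer.get(zeroArray)

      lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
      val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

      // We will clean the combiner closure later in `combineByKey`
      val cleanedSeqOp = self.context.clean(seqOp)
      combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
        cleanedSeqOp, combOp, partitioner)
    }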

From the source above it is clear that aggregateByKey ends up calling combineByKey (via combineByKeyWithClassTag): the seqOp function becomes mergeValue, the combOp function becomes mergeCombiners, and cleanedSeqOp(createZero(), v) becomes createCombiner. In other words, createCombiner is simply the passed-in seqOp applied to the zeroValue.
Therefore, aggregateByKey is the more convenient choice whenever createCombiner and mergeValue would be the same function.
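
As a rough equivalence sketch (reusing the hypothetical pairs RDD from the earlier example), the per-key (sum, count) aggregation can be written with either operation:

    // These two produce the same result:
    val viaAggregate = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )

    val viaCombine = pairs.combineByKey(
      (v: Int) => (0 + v, 0 + 1),                                   // createCombiner: seqOp applied to the zero value
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue  = seqOp
      (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)  // mergeCombiners = combOp
    )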
