The Spark aggregate function explained in detail


aggregate is a fairly common function in Spark, but it is also one of the harder ones to understand. Here we work through a few detailed examples to pin down how it is used.

1. First, the function signature of aggregate

In Spark's source code, you can see the signature of the aggregate function as follows:

  def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

As you can see, this is a curried function: its parameters are split into two lists, (zeroValue: U) and (seqOp: (U, T) => U, combOp: (U, U) => U).

2. Usage of aggregate

The signature looks complicated and may leave you dizzy at first glance. Don't worry: once we read the comment above the function, its usage becomes clear.

  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into a U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   *
   * @param zeroValue the initial value for the accumulated result of each partition for the
   *                  `seqOp` operator, and also the initial value for the combine results from
   *                  different partitions for the `combOp` operator - this will typically be the
   *                  neutral element (e.g. `Nil` for list concatenation or `0` for summation)
   * @param seqOp an operator used to accumulate results within a partition
   * @param combOp an associative operator used to combine results from different partitions
   */

In other words: aggregate first aggregates the elements within each partition, and then aggregates the results of all the partitions, using the given functions and an initial "zero value". The function can return a result type U that differs from the RDD's element type T, so it needs one operation that merges a T into a U and one operation that merges two U's. Both functions are allowed to modify and return their first argument instead of creating a new U, to avoid reallocating memory.
Parameter zeroValue: the initial value for the accumulated result of each partition (for the seqOp operator), and also the initial value for combining the results from different partitions (for the combOp operator). This is typically the neutral element, e.g. Nil for list concatenation or 0 for summation.
Parameter seqOp: the aggregate function that accumulates results within a partition.
Parameter combOp: an associative operator that combines the results from different partitions.
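
To make the three parameters concrete, here is a minimal sketch (assuming a SparkContext named sc, as in the examples below) that sums an RDD of integers:

    // zeroValue = 0; seqOp folds an element into the running total within a partition;
    // combOp adds the per-partition totals together.
    val total = sc.parallelize(List(1, 2, 3, 4), 2).aggregate(0)(
      (acc, n) => acc + n, // seqOp: (U, T) => U, here U = T = Int
      (a, b) => a + b      // combOp: (U, U) => U
    )
    // total == 10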

3. Averaging

That covers the theory; now for the practical part.
First, look at one of the most common examples found online:

    val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
    val (mul, sum, count) = sc.parallelize(list, 2).aggregate((1, 0, 0))(
      (acc, number) => (acc._1 * number, acc._2 + number, acc._3 + 1),
      (x, y) => (x._1 * y._1, x._2 + y._2, x._3 + y._3)
    )
    (sum / count, mul)

This is a slight variation on the usual averaging example: sum is the running sum, count is the number of elements accumulated, and mul is the product of all the elements.
The process in detail:
1. The initial value is (1, 0, 0).
2. number corresponds to the T in the signature, i.e. an element of the list, here of type Int. The type of acc is (Int, Int, Int). acc._1 * number multiplies the elements together (initial value 1), acc._2 + number adds them up, and acc._3 + 1 counts them.
3. sum / count then gives the average.
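
To see what each partition contributes, the same computation can be simulated locally with foldLeft. This sketch assumes the nine elements are split as [1, 2, 3, 4] and [5, 6, 7, 8, 9], which is how parallelize typically slices a local collection, though the exact split is an implementation detail:

    val seqOp  = (acc: (Int, Int, Int), n: Int) => (acc._1 * n, acc._2 + n, acc._3 + 1)
    val combOp = (x: (Int, Int, Int), y: (Int, Int, Int)) => (x._1 * y._1, x._2 + y._2, x._3 + y._3)

    val p0 = List(1, 2, 3, 4).foldLeft((1, 0, 0))(seqOp)    // (24, 10, 4)
    val p1 = List(5, 6, 7, 8, 9).foldLeft((1, 0, 0))(seqOp) // (15120, 35, 5)
    val merged = combOp(p0, p1)                             // (362880, 45, 9)
    // sum / count = 45 / 9 = 5, and mul = 362880 (i.e. 9!)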

4. A further example

To deepen your understanding, look at another example.

    val raw = List("A", "B", "D", "F", "G", "H", "O", "Q", "X", "Y")
    val (biggerThanF, lessThanF) = sc.parallelize(raw, 1).aggregate((0, 0))(
      (cc, str) => {
        var biggerF = cc._1
        var lessF = cc._2
        if (str.compareTo("F") >= 0) biggerF = cc._1 + 1
        else if (str.compareTo("F") < 0) lessF = cc._2 + 1
        (biggerF, lessF)
      },
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )

In this example, we count how many elements of the raw list compare greater than or equal to "F" and how many compare less than "F"; here the result is (7, 3), since "F" itself falls into the first bucket. The logic of the code is straightforward, so we won't walk through it.

5. Comparing aggregateByKey and combineByKey

aggregate operates on a plain sequence of elements, while aggregateByKey operates on (K, V) pairs. As the name implies, aggregateByKey performs an aggregate operation per key. Its prototype in Spark is as follows:

  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }
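
As a quick illustration of the per-key behaviour (a hypothetical example, assuming a SparkContext sc), here aggregateByKey builds a (sum, count) pair for each key:

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 2)
    val sumCountByKey = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1), // seqOp: fold one value into (sum, count)
      (x, y) => (x._1 + y._1, x._2 + y._2)  // combOp: merge per-partition (sum, count) pairs
    )
    // sumCountByKey.collect() => Array(("a", (4, 2)), ("b", (6, 2))) (order may vary)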

All of this is about (K, V) pairs, and Spark also has a combineByKey operation:

  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
  }

To see how the two are connected, let's look at what the real implementation of aggregateByKey does internally:
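
The overload that the prototype above delegates to looks like this in the Spark source (quoted from memory of the 2.x codebase, so details may vary slightly between versions):

    def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
        combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
      // Serialize the zero value to a byte array so that we can get a new clone of it on each key
      val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
      val zeroArray = new Array[Byte](zeroBuffer.limit)
      zeroBuffer.get(zeroArray)

      lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
      val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

      // We will clean the combiner closure later in `combineByKey`
      val cleanedSeqOp = self.context.clean(seqOp)
      combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
        cleanedSeqOp, combOp, partitioner)
    }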

From the source above it is clear that aggregateByKey ends up calling combineByKey (via combineByKeyWithClassTag): the seqOp function becomes mergeValue, the combOp function becomes mergeCombiners, and cleanedSeqOp(createZero(), v) becomes createCombiner. In other words, createCombiner is simply the passed-in seqOp applied to the zeroValue.
Therefore, aggregateByKey is the more convenient choice whenever createCombiner and mergeValue would be the same function.
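
As a rough equivalence sketch (reusing the hypothetical pairs RDD from the earlier example), the per-key (sum, count) aggregation can be written with either operation:

    // These two produce the same result:
    val viaAggregate = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )

    val viaCombine = pairs.combineByKey(
      (v: Int) => (0 + v, 0 + 1),                                   // createCombiner: seqOp applied to the zero value
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue  = seqOp
      (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)  // mergeCombiners = combOp
    )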
