The fold, foldByKey, treeAggregate, and treeReduce basic RDD operators for Spark programming

1) fold
def fold(zeroValue: T)(op: (T, T) => T): T

This operator receives an initial value; fold takes a function that merges two values of the same type and returns a value of that same type.

This operator merges the values within each partition, and zeroValue is used as the initial value each time a partition is merged.

val a = sc.parallelize(List(1, 2, 3), 3)
a.fold(0)(_ + _)   // somewhat similar to reduce, except that an initial value is added
res59: Int = 6
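
To make the role of zeroValue more visible, here is a small sketch on the same RDD (assuming, as above, that the three elements land in three separate partitions). Note that fold also applies zeroValue once more when the per-partition results are combined, so with 3 partitions it is added 4 times in total:

val a = sc.parallelize(List(1, 2, 3), 3)
a.fold(1)(_ + _)   // per partition: 1+1, 1+2, 1+3; final combine: 1 + 2 + 3 + 4
res: Int = 10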
2) foldByKey
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]

From the API, foldByKey receives an initial value zeroValue, and the value type of the returned key-value pairs is the same as the type of that initial value. Compared with reduceByKey, foldByKey simply adds an initial value.
Let's look at a few examples:

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))   // b is of type RDD[(Int, String)]
b.foldByKey("")(_ + _).collect      // aggregate by key
// Since every word has length 3, the aggregation result is:
res84: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.foldByKey("")(_ + _).collect
res85: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))

scala> val rdd = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)), 2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at makeRDD at <console>:27

scala> rdd.foldByKey(100)(_ + _).collect.foreach(println)
(B,103)
(A,102)
(C,101)

In fact, foldByKey internally calls combineByKey: zeroValue plays a role similar to createCombiner, while mergeValue and mergeCombiners are both the function we pass in. Merging is first performed within each partition, and the per-partition results are then merged again.
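
As a hedged illustration of that relationship, the last foldByKey call above can be rewritten with combineByKey directly (a sketch of the equivalence, not Spark's actual internal code; the initial value 100 is folded into the first value seen for each key within a partition):

rdd.combineByKey(
  (v: Int) => 100 + v,            // createCombiner: apply the initial value to the first value of a key in a partition
  (acc: Int, v: Int) => acc + v,  // mergeValue: merge further values within the partition
  (a: Int, b: Int) => a + b       // mergeCombiners: merge the per-partition results
).collect.foreach(println)
// prints the same (B,103), (A,102), (C,101)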
3) treeAggregate

First, take a look at the API of the treeAggregate operator:

def treeAggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U, depth: Int = 2)(implicit arg0: ClassTag[U]): U

This operator returns a result of type U. An initial value is passed in first; the first function, seqOp, operates within each partition, merging the T-typed elements it encounters into the U type. The U-typed results of the different partitions are then merged with the second function, combOp. In short, the first function works within a partition and the second works across partitions.

treeAggregate is similar to aggregate, except that the aggregation is performed as a multi-level tree. Another difference is that the initial value is only used in the first function, not in the second one. The default depth is 2.

val z = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)

def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}

z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])

z.treeAggregate(0)(math.max(_, _), _ + _)
res40: Int = 9
// As before: first find the maximum value within each partition, then merge across partitions.
// The initial value is not used by the second function.
// If the initial value is changed to 5, the 1st partition gives max(5,1,2,3) = 5
// and the 2nd partition gives max(5,4,5,6) = 6.
// The final result is 5 + 6 = 11; the initial value is not added in the final combine.
z.treeAggregate(5)(math.max(_, _), _ + _)
res42: Int = 11
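
For comparison, a small sketch with the plain aggregate operator: there the initial value is also applied in the final combine, so the same call produces a different result:

z.aggregate(5)(math.max(_, _), _ + _)
// per partition: max(5,1,2,3) = 5 and max(5,4,5,6) = 6; final combine: 5 + 5 + 6
res: Int = 16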
4) treeReduce
def treeReduce(f: (T, T) => T, depth: Int = 2): T

treeReduce is similar to the reduce function and does not require an initial value; the difference is that this operator performs the reduce as a multi-level tree.

val z = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
z.treeReduce(_ + _)
res49: Int = 21
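
The depth parameter can also be passed explicitly; it only matters when there are many partitions, since it controls how many levels of intermediate combining happen before the results reach the driver. A small sketch (the RDD and depth here are made up for illustration):

val big = sc.parallelize(1 to 1000, 100)
big.treeReduce(_ + _, depth = 3)   // same sum as reduce, but combined in a 3-level tree
res: Int = 500500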
