10.aggregate
aggregate works in two phases: it first uses seqOp to aggregate the T-type elements within each partition of the RDD into a value of type U, and then uses combOp to merge the per-partition U results into a single U. Note that both seqOp and combOp incorporate the zeroValue, whose type is U.
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
  // Clone the zero value since we'll also be serializing it as part of the tasks
  var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
  val cleanSeqOp = sc.clean(seqOp)
  val cleanCombOp = sc.clean(combOp)
  // zeroValue is the initial value; aggregatePartition runs on the executors
  val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
  // jobResult is the initial value; the per-partition results are merged on the driver
  val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
  sc.runJob(this, aggregatePartition, mergeResult)
  jobResult
}
For example:
scala> var rdd1 = sc.makeRDD(1 to 10, 2)
// the first partition contains 5, 4, 3, 2, 1
// the second partition contains 10, 9, 8, 7, 6
scala> rdd1.aggregate(1)(
     |   { (x: Int, y: Int) => x + y },
     |   { (a: Int, b: Int) => a + b }
     | )
res17: Int = 58
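To see where 58 comes from, here is a minimal plain-Python sketch (no Spark involved; `simulate_aggregate` is a hypothetical helper, not a Spark API) that mimics aggregate's two-phase semantics: the zeroValue seeds the fold inside each partition and also seeds the final merge on the driver.

```python
from functools import reduce

def simulate_aggregate(partitions, zero, seq_op, comb_op):
    """Mimic RDD.aggregate: fold each partition from zero with seq_op
    (what happens on the executors), then merge the per-partition
    results into zero with comb_op (what happens on the driver)."""
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)

# The two partitions of sc.makeRDD(1 to 10, 2)
parts = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
add = lambda a, b: a + b

print(simulate_aggregate(parts, 1, add, add))  # 58: (1+15) + (1+40) + 1
```

With a commutative, associative op this matches Spark's result; note Spark merges task results in completion order, which only matters for non-commutative functions.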
Why is it 58? The zeroValue 1 seeds the fold of each of the two partitions (1+1+2+3+4+5 = 16 and 1+6+7+8+9+10 = 41) and is applied once more when the partition results are merged on the driver: 1 + 16 + 41 = 58.

11.fold
A simplified aggregate that uses the same function op for both seqOp and combOp.
/**
 * Aggregate the elements of each partition, and then the results for all the partitions,
 * using a given associative and commutative function and a neutral "zero value". The
 * function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid
 * object allocation; however, it should not modify t2.
 *
 * This behaves somewhat differently from fold operations implemented for non-distributed
 * collections in functional languages like Scala. This fold operation is applied to
 * partitions individually, and then folds those results into the final result, rather than
 * applying the fold to each element sequentially in some defined ordering. For functions
 * that are not commutative, the result may differ from that of a fold applied to a
 * non-distributed collection.
 */
def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  // Clone the zero value since we'll also be serializing it as part of the tasks
  var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
  val cleanOp = sc.clean(op)
  // First fold each partition on the executors
  val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
  // Then merge the per-partition results on the driver
  val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}
For example, the aggregate example above can be rewritten as a fold operation:
scala> var rdd1 = sc.makeRDD(1 to 10, 2)
// the first partition contains 5, 4, 3, 2, 1
// the second partition contains 10, 9, 8, 7, 6
scala> rdd1.fold(1)(
     |   (x, y) => x + y
     | )
res19: Int = 58
// the result is the same as the first aggregate example above, i.e.:
scala> rdd1.aggregate(1)(
     |   { (x, y) => x + y },
     |   { (a, b) => a + b }
     | )
res20: Int = 58
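The equivalence is easy to check outside Spark as well: fold is just aggregate with a single op. A plain-Python sketch (`simulate_fold` is a hypothetical helper, not a Spark API):

```python
from functools import reduce

def simulate_fold(partitions, zero, op):
    """Mimic RDD.fold: like aggregate, but the same op is used both
    inside each partition and when merging the partition results."""
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)

parts = [[5, 4, 3, 2, 1], [10, 9, 8, 7, 6]]
print(simulate_fold(parts, 1, lambda x, y: x + y))  # 58, same as aggregate(1)(_+_, _+_)
```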
12.treeAggregate
treeAggregate aggregates in multiple levels. With aggregate, every partition's result is sent to the driver for the final merge; when there are many partitions, the returned data can be large and the driver must buffer and combine many intermediate results, increasing its load. treeAggregate instead keeps the partition results on the executor side and merges them there round by round, reducing the amount of data returned to the driver, which only performs the last merge.
/**
 * Aggregates the elements of this RDD in a multi-level tree pattern.
 *
 * @param depth suggested depth of the tree (default: 2)
 * @see [[org.apache.spark.rdd.RDD#aggregate]]
 */
def treeAggregate[U: ClassTag](zeroValue: U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U = withScope {
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
  if (partitions.length == 0) {
    Utils.clone(zeroValue, context.env.closureSerializer.newInstance())
  } else {
    val cleanSeqOp = context.clean(seqOp)
    val cleanCombOp = context.clean(combOp)
    // Aggregate function for the initial partitions
    val aggregatePartition =
      (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
    // Partially aggregate each of the initial partitions first
    var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))
    var numPartitions = partiallyAggregated.partitions.length
    // Compute how much each round shrinks the partition count, based on the requested depth
    val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
    // If creating an extra level doesn't help reduce
    // the wall-clock time, we stop tree aggregation.
    while (numPartitions > scale + numPartitions / scale) {
      // Reduce the number of partitions, merging part of the partial results
      numPartitions /= scale
      val curNumPartitions = numPartitions
      partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
        (i, iter) => iter.map((i % curNumPartitions, _))
      }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
    }
    // Perform the last reduce to get the final result
    partiallyAggregated.reduce(cleanCombOp)
  }
}
For example:
scala> def seq(a: Int, b: Int): Int = { a + b }
seq: (a: Int, b: Int)Int

scala> def comb(a: Int, b: Int): Int = { a + b }
comb: (a: Int, b: Int)Int

scala> val z = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18), 9)

scala> z.treeAggregate(0)(seq, comb, 2)
res1: Int = 171
Its concrete execution follows the merge rounds in the source above: the partitions are partially aggregated, then repeatedly combined in groups on the executors before the final reduce on the driver.
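The tree-merging logic can be simulated without Spark. The sketch below (`simulate_tree_aggregate` is a hypothetical helper) reproduces the scale computation and the modulo-based regrouping of partial results from the source above:

```python
import math
from functools import reduce

def simulate_tree_aggregate(partitions, zero, seq_op, comb_op, depth=2):
    """Mimic RDD.treeAggregate: aggregate each partition, then repeatedly
    shrink the number of partial results by combining groups of them
    (modelling the executor-side merge rounds) before the final reduce."""
    partials = [reduce(seq_op, p, zero) for p in partitions]
    n = len(partials)
    scale = max(int(math.ceil(n ** (1.0 / depth))), 2)
    # Stop adding levels once an extra round would not shrink the work
    while n > scale + n // scale:
        n //= scale
        groups = {}
        for i, value in enumerate(partials):
            key = i % n  # same regrouping as i % curNumPartitions + reduceByKey
            groups[key] = comb_op(groups[key], value) if key in groups else value
        partials = [groups[k] for k in sorted(groups)]
    return reduce(comb_op, partials)  # the last merge, done on the driver

# 18 elements in 9 partitions, as in the shell example above
parts = [[2 * i + 1, 2 * i + 2] for i in range(9)]
print(simulate_tree_aggregate(parts, 0, lambda a, b: a + b, lambda a, b: a + b, depth=2))  # 171
```

With 9 partitions and depth 2, scale is 3, so one intermediate round collapses 9 partials into 3 before the final reduce.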
13.reduce
reduce passes the first two elements of the RDD to the input function, producing a new value; that value is then passed to the input function together with the next element of the RDD (the third), and so on, until only a single value remains.
/**
 * Reduces the elements of this RDD using the specified commutative and
 * associative binary operator.
 */
def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)
  // Function that reduces one partition; it runs on the executors
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      // reduceLeft traverses from left to right
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  // Function that merges the partition results; it runs on the driver
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
For example:
val c = sc.parallelize(1 to 10, 2)
c.reduce((x, y) => x + y)  // result: 55
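The same two-level structure (reduceLeft inside each partition, then a merge of the partial results, with empty partitions skipped) can be sketched in plain Python (`simulate_reduce` is a hypothetical helper):

```python
from functools import reduce

def simulate_reduce(partitions, f):
    """Mimic RDD.reduce: reduceLeft each non-empty partition on the
    executors, then merge the per-partition results on the driver.
    There is no zero value, so an empty RDD is an error."""
    per_partition = [reduce(f, p) for p in partitions if p]
    if not per_partition:
        raise ValueError("empty collection")
    return reduce(f, per_partition)

# sc.parallelize(1 to 10, 2).reduce(_ + _)
print(simulate_reduce([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], lambda x, y: x + y))  # 55
```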
14.max
Returns the maximum element under the default (implicit) ordering.
/**
 * Returns the max of this RDD as defined by the implicit Ordering[T].
 * @return the maximum element of the RDD
 */
def max()(implicit ord: Ordering[T]): T = withScope {
  this.reduce(ord.max)
}
In essence this just supplies an ordering's max as the binary operator and calls reduce. For example:
scala> var rdd1 = sc.makeRDD(1 to 10, 2)
// the first partition contains 5, 4, 3, 2, 1
// the second partition contains 10, 9, 8, 7, 6
scala> rdd1.max()
res19: Int = 10
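Since max is just reduce with the ordering's max, the simulation is one line per phase (`simulate_max` is a hypothetical helper):

```python
from functools import reduce

def simulate_max(partitions):
    """Mimic RDD.max, which is simply reduce with the ordering's max:
    take the max inside each non-empty partition, then the max of those."""
    per_partition = [reduce(max, p) for p in partitions if p]
    return reduce(max, per_partition)

print(simulate_max([[5, 4, 3, 2, 1], [10, 9, 8, 7, 6]]))  # 10
```

min is identical with `min` substituted for `max`.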
15.min
Returns the minimum element under the default (implicit) ordering.
/**
 * Returns the min of this RDD as defined by the implicit Ordering[T].
 * @return the minimum element of the RDD
 */
def min()(implicit ord: Ordering[T]): T = withScope {
  this.reduce(ord.min)
}
As with max, this supplies an ordering's min as the binary operator and calls reduce. For example:
scala> var rdd1 = sc.makeRDD(1 to 10, 2)
// the first partition contains 5, 4, 3, 2, 1
// the second partition contains 10, 9, 8, 7, 6
scala> rdd1.min()
res19: Int = 1
16.treeReduce
Similar to treeAggregate: multiple rounds of merging are performed on the executors to reduce the computational load on the driver.
/**
 * Reduces the elements of this RDD in a multi-level tree pattern.
 *
 * @param depth suggested depth of the tree (default: 2)
 * @see [[org.apache.spark.rdd.RDD#reduce]]
 */
def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
  val cleanF = context.clean(f)
  // Reduce function for the initial partitions
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  // Partially reduce each of the initial partitions first
  val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))
  val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
    if (c.isDefined && x.isDefined) {
      Some(cleanF(c.get, x.get))
    } else if (c.isDefined) {
      c
    } else if (x.isDefined) {
      x
    } else {
      None
    }
  }
  // Finally delegate to the treeAggregate method
  partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
    .getOrElse(throw new UnsupportedOperationException("empty collection"))
}
The treeReduce function first reduces each partition with Scala's reduceLeft, then calls treeAggregate on the RDD of partial results; the seqOp and combOp there share an empty (None) initial value. In practice, treeReduce can be used instead of reduce to cut the overhead of a single large reduce on the driver, and the size of each merge round can be controlled by adjusting depth. Its detailed execution is not repeated here; refer to the treeAggregate method.
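Combining the pieces above, treeReduce can be sketched as treeAggregate over optional partial results, with None standing in for Scala's Option.empty (`simulate_tree_reduce` is a hypothetical helper, not a Spark API):

```python
import math
from functools import reduce

def simulate_tree_reduce(partitions, f, depth=2):
    """Mimic RDD.treeReduce: reduceLeft each partition into an optional
    partial result, then combine the optionals with treeAggregate-style
    merge rounds, using None as the empty 'zero value'."""
    def combine(a, b):
        # The op passed twice to treeAggregate: merge two optionals
        if a is None:
            return b
        if b is None:
            return a
        return f(a, b)

    partials = [reduce(f, p) if p else None for p in partitions]
    n = len(partials)
    scale = max(int(math.ceil(n ** (1.0 / depth))), 2)
    while n > scale + n // scale:
        n //= scale
        groups = {}
        for i, value in enumerate(partials):
            key = i % n
            groups[key] = combine(groups[key], value) if key in groups else value
        partials = [groups[k] for k in sorted(groups)]
    result = reduce(combine, partials, None)
    if result is None:
        raise ValueError("empty collection")
    return result

# 18 elements in 9 partitions, matching the treeAggregate example
parts = [[2 * i + 1, 2 * i + 2] for i in range(9)]
print(simulate_tree_reduce(parts, lambda x, y: x + y, depth=2))  # 171
```

Empty partitions contribute None and are absorbed by `combine`, mirroring how the Option-based op skips undefined results.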