The three steps of the Spark operator execution process

10.aggregate

To aggregate the elements of an RDD, seqOp is first used to aggregate the T-type elements within each partition into a value of type U, and combOp is then used to merge the per-partition U-type results into a single U-type result. Note in particular that both seqOp and combOp make use of zeroValue, and the type of zeroValue is U.
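Because the result type U can differ from the element type T, aggregate can, for example, build a (sum, count) pair from an RDD of Int and derive an average from it. A minimal sketch, assuming a SparkContext named sc is already available (e.g. inside spark-shell):

// seqOp folds one Int element into a (sum, count) accumulator of type (Int, Int);
// combOp merges two (sum, count) accumulators coming from different partitions.
val nums = sc.parallelize(1 to 10, 2)

val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)

val avg = sum.toDouble / count   // 5.5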

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
  // Clone the zero value since we'll also be serializing it as part of the tasks
  var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
  val cleanSeqOp = sc.clean(seqOp)
  val cleanCombOp = sc.clean(combOp)

  // zeroValue is the initial value; aggregatePartition is executed on the executor side
  val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)

  // jobResult is the initial value; merging the result of each partition is executed on the driver side
  val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)

  sc.runJob(this, aggregatePartition, mergeResult)
  jobResult
}

For example:

scala> var rdd1 = sc.makeRDD(1 to 10, 2)

## The first partition contains 5, 4, 3, 2, 1

## The second partition contains 10, 9, 8, 7, 6

scala> rdd1.aggregate(1)(
     | {(x: Int, y: Int) => x + y},
     | {(a: Int, b: Int) => a + b}
     | )

res17: Int = 58

Why is it 58? The zeroValue of 1 is applied once inside each partition and once more when the per-partition results are merged on the driver: the first partition gives 1+1+2+3+4+5 = 16, the second gives 1+6+7+8+9+10 = 41, and the driver computes 1+16+41 = 58.

11.fold

fold is a simplified aggregate that uses the same function op for both seqOp and combOp.

/**
 * Aggregate the elements of each partition, and then the results for all the partitions, using a
 * given associative and commutative function and a neutral "zero value". The function
 * op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object
 * allocation; however, it should not modify t2.
 *
 * This behaves somewhat differently from fold operations implemented for non-distributed
 * collections in functional languages like Scala. This fold operation may be applied to
 * partitions individually, and then fold those results into the final result, rather than
 * apply the fold to each element sequentially in some defined ordering. For functions
 * that are not commutative, the result may differ from that of a fold applied to a
 * non-distributed collection.
 */
def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  // Clone the zero value since we'll also be serializing it as part of the tasks
  var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
  val cleanOp = sc.clean(op)

  // First, fold the elements of each partition on the executor side
  val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)

  // Then merge the per-partition results on the driver side
  val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}

For example, the aggregate example above can be rewritten as a fold operation:

scala> var rdd1 = sc.makeRDD(1 to 10, 2)

## The first partition contains 5, 4, 3, 2, 1

## The second partition contains 10, 9, 8, 7, 6

scala> rdd1.fold(1)(
     | (x, y) => x + y
     | )

res19: Int = 58

## The result is the same as in the first aggregate example above, i.e.:

scala> rdd1.aggregate(1)(
     | {(x, y) => x + y},
     | {(a, b) => a + b}
     | )

res20: Int = 58
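Because zeroValue is folded in once per partition and once more on the driver, a non-neutral zeroValue makes the result depend on the number of partitions. A small sketch of this effect, assuming a SparkContext named sc:

// Same data, different partition counts, zeroValue = 1
val twoParts  = sc.makeRDD(1 to 10, 2)
val fourParts = sc.makeRDD(1 to 10, 4)

twoParts.fold(1)(_ + _)    // 55 + 2 (once per partition) + 1 (driver) = 58
fourParts.fold(1)(_ + _)   // 55 + 4 (once per partition) + 1 (driver) = 60

// With a neutral zero value the result no longer depends on the partitioning
twoParts.fold(0)(_ + _)    // 55
fourParts.fold(0)(_ + _)   // 55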

12.treeAggregate

treeAggregate aggregates in layers. With aggregate, every partition's result is sent to the driver to be merged, so when there are many partitions the driver has to receive and cache a large amount of intermediate data, which increases its load. treeAggregate instead keeps the per-partition results on the executor side and keeps merging them there, reducing the amount of data returned to the driver, which only performs the final merge.

/**
 * Aggregates the elements of this RDD in a multi-level tree pattern.
 *
 * @param depth suggested depth of the tree (default: 2)
 * @see [[org.apache.spark.rdd.RDD#aggregate]]
 */
def treeAggregate[U: ClassTag](zeroValue: U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U = withScope {
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
  if (partitions.length == 0) {
    Utils.clone(zeroValue, context.env.closureSerializer.newInstance())
  } else {
    val cleanSeqOp = context.clean(seqOp)
    val cleanCombOp = context.clean(combOp)

    // Aggregate function for the initial partitions
    val aggregatePartition =
      (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)

    // Partially aggregate each of the initial partitions first
    var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))
    var numPartitions = partiallyAggregated.partitions.length

    // Compute, from the suggested depth, how much to shrink the number of partitions per level
    val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
    // If creating an extra level doesn't help reduce
    // the wall-clock time, we stop tree aggregation.
    while (numPartitions > scale + numPartitions / scale) {
      numPartitions /= scale
      val curNumPartitions = numPartitions

      // Reduce the number of partitions and merge the partial per-partition results
      partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
        (i, iter) => iter.map((i % curNumPartitions, _))
      }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
    }

    // Perform the last reduce to return the final result
    partiallyAggregated.reduce(cleanCombOp)
  }
}

For example:

scala> def seq(a: Int, b: Int): Int = {
     | a + b }
seq: (a: Int, b: Int)Int

scala> def comb(a: Int, b: Int): Int = {
     | a + b }
comb: (a: Int, b: Int)Int

scala> val z = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18), 9)

scala> z.treeAggregate(0)(seq, comb, 2)
res1: Int = 171

Its specific implementation process is as follows:
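A rough trace of this run, assuming the 18 elements are distributed evenly so that each of the 9 partitions holds two consecutive values: the per-partition aggregate step produces the partial sums 3, 7, 11, 15, 19, 23, 27, 31, 35 (the zeroValue 0 contributes nothing). The scale factor is max(ceil(9^(1/2)), 2) = 3, and since 9 > 3 + 9/3 = 6, one tree level is built: the 9 partial results are keyed by i % 3 and combined into 3 values (3+15+27 = 45, 7+19+31 = 57, 11+23+35 = 69). At that point 3 > 3 + 3/3 no longer holds, so the loop stops and the driver performs the final reduce: 45 + 57 + 69 = 171.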

13.reduce

The first two elements of the RDD are passed to the input function to produce a new value; that new value and the next element of the RDD (the third element) are then passed to the input function again, and so on, until only one value remains.
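As a side note on ordering: within a single partition, reduceLeft combines strictly from left to right, but the order in which partition results are merged on the driver is not guaranteed, which is why the function passed to reduce should be commutative and associative. A small local Scala illustration:

// reduceLeft combines strictly left to right within one sequence:
List(1, 2, 3, 4).reduceLeft(_ - _)   // ((1 - 2) - 3) - 4 = -8

// A non-commutative function like subtraction would therefore give
// partition-order-dependent results when used with RDD.reduce.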

/**
 * Reduces the elements of this RDD using the specified commutative and
 * associative binary operator.
 */
def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)

  // Define the function that traverses a partition; it is executed on the executor side
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      // reduceLeft traverses from left to right
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None

  // Define the function that merges the partition results; it is executed on the driver side
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)

  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}

For example:

val c = sc.parallelize(1 to 10, 2)

c.reduce((x, y) => x + y)   // result: 55
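As the source above shows, reduce has no zero value, so it throws an exception on an empty RDD; a small sketch of that edge case, assuming a SparkContext named sc:

// reduce on an empty RDD throws UnsupportedOperationException("empty collection")
val empty = sc.parallelize(Seq.empty[Int])
// empty.reduce(_ + _)             // would throw at runtime

// fold, by contrast, returns its zero value for an empty RDD
empty.fold(0)(_ + _)               // 0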

The specific execution process follows from the source above: each partition is reduced with reduceLeft on the executors, and the partial results are merged on the driver.

14.max

Returns the maximum element as defined by the implicit Ordering[T] (by default, the natural ordering).

/**
 * Returns the max of this RDD as defined by the implicit Ordering[T].
 * @return the maximum element of the RDD
 */
def max()(implicit ord: Ordering[T]): T = withScope {
  this.reduce(ord.max)
}

In essence, max simply supplies an ordering and delegates to reduce, as in the following example:

scala> var rdd1 = sc.makeRDD(1 to 10, 2)

## The first partition contains 5, 4, 3, 2, 1

## The second partition contains 10, 9, 8, 7, 6

scala> rdd1.max()
res19: Int = 10
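Because the ordering is an implicit parameter, a custom Ordering can also be supplied explicitly; a small sketch reusing rdd1 from above:

// A reversed ordering makes max behave like min
rdd1.max()(Ordering.Int.reverse)   // 1

// Order pairs by their second field
val pairs = sc.makeRDD(Seq(("a", 3), ("b", 7), ("c", 5)))
pairs.max()(Ordering.by((p: (String, Int)) => p._2))   // ("b", 7)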

The execution flow is the same as that of reduce, with ord.max as the reduce function.

15.min

Returns the minimum element as defined by the implicit Ordering[T] (by default, the natural ordering).

/**
 * Returns the min of this RDD as defined by the implicit Ordering[T].
 * @return the minimum element of the RDD
 */
def min()(implicit ord: Ordering[T]): T = withScope {
  this.reduce(ord.min)
}

In essence, min likewise supplies an ordering and delegates to reduce:

scala> var rdd1 = sc.makeRDD(1 to 10, 2)

## The first partition contains 5, 4, 3, 2, 1

## The second partition contains 10, 9, 8, 7, 6

scala> rdd1.min()
res19: Int = 1

The execution flow is the same as that of reduce, with ord.min as the reduce function.

16.treeReduce

Similar to treeAggregate: the merging is performed in multiple rounds on the executor side to reduce the computational load on the driver.

/**
 * Reduces the elements of this RDD in a multi-level tree pattern.
 *
 * @param depth suggested depth of the tree (default: 2)
 * @see [[org.apache.spark.rdd.RDD#reduce]]
 */
def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
  val cleanF = context.clean(f)

  // Reduce function applied to each initial partition
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }

  // Partially reduce each of the initial partitions first
  val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))
  val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
    if (c.isDefined && x.isDefined) {
      Some(cleanF(c.get, x.get))
    } else if (c.isDefined) {
      c
    } else if (x.isDefined) {
      x
    } else {
      None
    }
  }

  // The final call is still to the treeAggregate method
  partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
    .getOrElse(throw new UnsupportedOperationException("empty collection"))
}

treeReduce first reduces each partition with Scala's reduceLeft, and then merges the partially reduced RDD with treeAggregate; the seqOp and combOp used there take an empty Option as the initial value. In practice, treeReduce can be used in place of reduce, mainly to lower the cost of a single large reduce on the driver, and the size of each merge round can be controlled by adjusting depth. Its detailed execution process is not described again here; refer to the treeAggregate method.
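For example, reusing z from the treeAggregate section, treeReduce produces the same result as reduce while performing the intermediate merges on the executors; a minimal sketch:

val z = sc.parallelize(1 to 18, 9)

z.reduce(_ + _)          // 171
z.treeReduce(_ + _)      // 171, merged in a tree with the default depth of 2
z.treeReduce(_ + _, 3)   // 171; the depth parameter (default 2) controls the tree height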
