Spark Notes: Understanding the Complex RDD APIs (Part 1)

Source: Internet
Author: User
Tags: spark, notes

This article continues the explanation of the RDD API, focusing on the operations that are not easy to understand. It also shows how to pass external functions into the RDD API calls, and along the way covers some of the Scala syntax involved in RDD development.

1) aggregate(zeroValue)(seqOp, combOp)

Like the reduce function, this function aggregates the data, but aggregate can return a result of a different type from the original RDD's elements, and it requires an initial value (zeroValue) when used.

Let's take a look at a basic usage; the code is as follows:

    val rddInt: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5), 1)
    val rddAggr1: (Int, Int) = rddInt.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
    println("====aggregate 1====:" + rddAggr1.toString()) // (15,5)

This call sums the values of the RDD while also counting the number of elements, so that we can then compute an average, which is very useful in practice.
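For example, once aggregate has returned the (sum, count) tuple, the average is one division away. A minimal follow-up sketch using rddAggr1 from the code above (the val name avg is my own):

    // average = sum / count, from the (sum, count) pair produced by aggregate
    val avg: Double = rddAggr1._1.toDouble / rddAggr1._2 // 15 / 5 = 3.0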

If the reader does not know the Scala language, or knows only a little Scala syntax, this API is very hard to understand: what is this x, what is this y, and why does combining them this way produce the desired result?

In fact, the aggregate method relies on Scala's tuple structure. The tuple is a very distinctive data structure in Scala; let's look at the following code:

    val tuple2Param1: Tuple2[String, Int] = Tuple2("x01", 12) // standard definition of a two-tuple
    val tuple2Param2: (String, Int) = ("x02", 29) // literal definition of a two-tuple
    /* Result: x01:12 */
    println("====tuple2Param1====:" + tuple2Param1._1 + ":" + tuple2Param1._2)
    /* Result: x02:29 */
    println("====tuple2Param2====:" + tuple2Param2._1 + ":" + tuple2Param2._2)
    val tuple6Param1: Tuple6[String, Int, Int, Int, Int, String] = Tuple6("xx01", 1, 2, 3, 4, "x1x") // standard definition of a six-tuple
    val tuple6Param2: (String, Int, Int, Int, Int, String) = ("xx02", 1, 2, 3, 4, "x2x") // literal definition of a six-tuple
    /* Result: xx01:1:2:3:4:x1x */
    println("====tuple6Param1====:" + tuple6Param1._1 + ":" + tuple6Param1._2 + ":" + tuple6Param1._3 + ":" + tuple6Param1._4 + ":" + tuple6Param1._5 + ":" + tuple6Param1._6)
    /* Result: xx02:1:2:3:4:x2x */
    println("====tuple6Param2====:" + tuple6Param2._1 + ":" + tuple6Param2._2 + ":" + tuple6Param2._3 + ":" + tuple6Param2._4 + ":" + tuple6Param2._5 + ":" + tuple6Param2._6)

Tuples in Scala are constructed with the TupleN classes, where the numeric suffix gives the number of elements: Tuple2 is a two-tuple holding two elements, Tuple6 is a six-tuple holding six. Tuples look like arrays, but an array can only store elements of a single data type, whereas a tuple can store elements of different data types. Tuple elements are accessed as _1, _2, and so on, with the first element numbered from 1, unlike arrays, which index from 0.

In practice we seldom construct tuples with TupleN; we use the literal syntax instead (see the code comments). This also shows that in Spark a key-value (pair) RDD actually uses a two-tuple to represent each key-value pair. Coming back to the aggregate method, its operations are likewise carried out through a two-tuple data structure.
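To make that concrete, here is a minimal sketch (with hypothetical data named pairs) showing that the elements of a pair RDD are ordinary Scala two-tuples:

    // each element of a pair RDD is just a Scala two-tuple of (key, value)
    val pairs: RDD[(String, Int)] = sc.parallelize(List(("a", 1), ("b", 2)), 1)
    pairs.collect().foreach(kv => println(kv._1 + " -> " + kv._2)) // a -> 1, b -> 2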

Let's take a closer look at how aggregate operates. This time I will use external functions as the operators of the aggregate method, as shown in the code below:

    def aggrFtnOne(par: ((Int, Int), Int)): (Int, Int) = {
      /* aggregate initial value is (0,0):
         ====aggrFtnOne param===:((0,0),1)
         ====aggrFtnOne param===:((1,1),2)
         ====aggrFtnOne param===:((3,2),3)
         ====aggrFtnOne param===:((6,3),4)
         ====aggrFtnOne param===:((10,4),5) */
      /* aggregate initial value is (1,1):
         ====aggrFtnOne param===:((1,1),1)
         ====aggrFtnOne param===:((2,2),2)
         ====aggrFtnOne param===:((4,3),3)
         ====aggrFtnOne param===:((7,4),4)
         ====aggrFtnOne param===:((11,5),5) */
      println("====aggrFtnOne param===:" + par.toString())
      val ret: (Int, Int) = (par._1._1 + par._2, par._1._2 + 1)
      ret
    }

    def aggrFtnTwo(par: ((Int, Int), (Int, Int))): (Int, Int) = {
      /* aggregate initial value is (0,0): ====aggrFtnTwo param===:((0,0),(15,5)) */
      /* aggregate initial value is (1,1): ====aggrFtnTwo param===:((1,1),(16,6)) */
      println("====aggrFtnTwo param===:" + par.toString())
      val ret: (Int, Int) = (par._1._1 + par._2._1, par._1._2 + par._2._2)
      ret
    }

    val rddAggr2: (Int, Int) = rddInt.aggregate((0, 0))((x, y) => aggrFtnOne(x, y), (x, y) => aggrFtnTwo(x, y)) // arguments may omit the tuple parentheses
    println("====aggregate 2====:" + rddAggr2.toString()) // (15,5)
    val rddAggr3: (Int, Int) = rddInt.aggregate((1, 1))((x, y) => aggrFtnOne((x, y)), (x, y) => aggrFtnTwo((x, y))) // arguments wrapped in tuple parentheses
    println("====aggregate 3====:" + rddAggr3.toString()) // (17,7)

From the output of the code above we can clearly see the actual computation process of the aggregate method.

The parameter of the aggrFtnOne method has the type ((Int, Int), Int). In this nested two-tuple, the second element is the current value taken from the RDD, and the first element is the accumulator we initialized: its first field holds the running sum, and its second field holds the running count of elements.

In the parameter of the aggrFtnTwo method, the first element of the outer two-tuple is the initial value and the second element is the result computed by aggrFtnOne; merging them yields the result we want.

As a comparison, I changed the initial value to (1, 1). The final result has 2 added to both the sum and the element count, because the initial value takes part in the computation twice: once when it seeds the seqOp pass and once more when combOp merges it with the partition result, as the traces in the code above show.
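Note that this effect scales with the number of partitions: the zeroValue enters once per partition in the seqOp phase and once more in the combOp merge. A minimal sketch assuming the same data split across two partitions (rddInt2 is a hypothetical name):

    // with 2 partitions, zeroValue (1, 1) is used 3 times: once per partition plus once in the merge
    val rddInt2: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5), 2)
    val aggr: (Int, Int) = rddInt2.aggregate((1, 1))((x, y) => (x._1 + y, x._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
    println(aggr) // (18,8): sum 15 + 3 * 1, count 5 + 3 * 1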

If we want the first element of the resulting two-tuple to be a product rather than a sum, the initial value cannot be (0, 0) but should be (1, 0). Once we understand the principle, it is clear how to choose the initial value; the specific code is as follows:

    val rddAggr4: (Int, Int) = rddInt.aggregate((1, 0))((x, y) => (x._1 * y, x._2 + 1), (x, y) => (x._1 * y._1, x._2 + y._2))
    println("====aggregate 4====:" + rddAggr4.toString()) // (120,5)


2) fold(zeroValue)(op)

This function works the same way as the reduce function, except that you must supply an initial value (zeroValue) when using it.

The code looks like this:

    def foldFtn(par: (Int, Int)): Int = {
      /* fold initial value is 0:
         =====foldFtn param====:(0,1)
         =====foldFtn param====:(1,2)
         =====foldFtn param====:(3,3)
         =====foldFtn param====:(6,4)
         =====foldFtn param====:(10,5)
         =====foldFtn param====:(0,15) */
      /* fold initial value is 1:
         =====foldFtn param====:(1,1)
         =====foldFtn param====:(2,2)
         =====foldFtn param====:(4,3)
         =====foldFtn param====:(7,4)
         =====foldFtn param====:(11,5)
         =====foldFtn param====:(1,16) */
      println("=====foldFtn param====:" + par.toString())
      val ret: Int = par._1 + par._2
      ret
    }

    val rddFold2: Int = rddInt.fold(0)((x, y) => foldFtn(x, y)) // arguments may omit the tuple parentheses
    println("====fold 2====:" + rddFold2) // 15
    val rddFold3: Int = rddInt.fold(1)((x, y) => foldFtn((x, y))) // arguments wrapped in tuple parentheses
    println("====fold 3====:" + rddFold3) // 17

We find that when the initial value is 1, the sum increases not by 1 but by 2. The reason is that fold pairs the initial value with the data twice: once when folding the elements inside the partition, and once more when merging the partition result into the final answer. Because the initial value takes part in the summation twice, the actual result grows by 2 rather than 1.
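As with aggregate, the effect scales with the number of partitions: the zeroValue is folded in once per partition plus once in the final merge. A minimal sketch assuming two partitions (rddInt3 is a hypothetical name):

    // with 2 partitions, zeroValue 1 is added 3 times: 15 + 3 * 1 = 18
    val rddInt3: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5), 2)
    println(rddInt3.fold(1)((x, y) => x + y)) // 18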

As a comparison, let's look at how reduce actually operates; the code is as follows:

    def reduceFtn(par: (Int, Int)): Int = {
      /* ======reduceFtn param=====:1:2
         ======reduceFtn param=====:3:3
         ======reduceFtn param=====:6:4
         ======reduceFtn param=====:10:5 */
      println("======reduceFtn param=====:" + par._1 + ":" + par._2)
      par._1 + par._2
    }

    val rddReduce1: Int = rddInt.reduce((x, y) => x + y)
    println("====rddReduce 1====:" + rddReduce1) // 15
    val rddReduce2: Int = rddInt.reduce((x, y) => reduceFtn(x, y))
    println("====rddReduce 2====:" + rddReduce2) // 15


3) combineByKey[C](createCombiner: Int => C, mergeValue: (C, Int) => C, mergeCombiners: (C, C) => C): RDD[(String, C)]

The effect of combineByKey is to combine the values that share the same key, and it may use a return type different from the value type. combineByKey applies to key-value (pair) RDDs; an ordinary RDD does not have this method.

From the definition above we can see that combineByKey goes through three operators, where the result of each step becomes the input of the next step. Let's look at the following code:

    def combineFtnOne(par: Int): (Int, Int) = {
      /* ====combineFtnOne param====:2
         ====combineFtnOne param====:5
         ====combineFtnOne param====:8
         ====combineFtnOne param====:3 */
      println("====combineFtnOne param====:" + par)
      val ret: (Int, Int) = (par, 1)
      ret
    }

    def combineFtnTwo(par: ((Int, Int), Int)): (Int, Int) = {
      /* ====combineFtnTwo param====:((2,1),12)
         ====combineFtnTwo param====:((8,1),9) */
      println("====combineFtnTwo param====:" + par.toString())
      val ret: (Int, Int) = (par._1._1 + par._2, par._1._2 + 1)
      ret
    }

    def combineFtnThree(par: ((Int, Int), (Int, Int))): (Int, Int) = {
      /* no output printed */
      println("====combineFtnThree param===:" + par.toString())
      val ret: (Int, Int) = (par._1._1 + par._2._1, par._1._2 + par._2._2)
      ret
    }

    val rddPair: RDD[(String, Int)] = sc.parallelize(List(("x01", 2), ("x02", 5), ("x03", 8), ("x04", 3), ("x01", 12), ("x03", 9)), 1)
    /* def combineByKey[C](createCombiner: Int => C, mergeValue: (C, Int) => C, mergeCombiners: (C, C) => C): RDD[(String, C)] */
    val rddCombine1: RDD[(String, (Int, Int))] = rddPair.combineByKey(x => (x, 1), (com: (Int, Int), x) => (com._1 + x, com._2 + 1), (com1: (Int, Int), com2: (Int, Int)) => (com1._1 + com2._1, com1._2 + com2._2))
    println("====combineByKey 1====:" + rddCombine1.collect().mkString(",")) // (x02,(5,1)),(x03,(17,2)),(x01,(14,2)),(x04,(3,1))
    val rddCombine2: RDD[(String, (Int, Int))] = rddPair.combineByKey(x => combineFtnOne(x), (com: (Int, Int), x) => combineFtnTwo(com, x), (com1: (Int, Int), com2: (Int, Int)) => combineFtnThree(com1, com2))
    println("====combineByKey 2====:" + rddCombine2.collect().mkString(",")) // (x02,(5,1)),(x03,(17,2)),(x01,(14,2)),(x04,(3,1))

This algorithm is very similar to the aggregate-based summation above, but combineByKey looks strange at first: its third operator does not seem to execute, and the second operator appears to print only part of the data. In fact this is expected behavior: mergeValue (the second operator) only runs for keys that already have a combiner, i.e. keys with more than one value in a partition (here x01 and x03, which matches the two printed lines), and mergeCombiners (the third operator) only runs when combiners from different partitions are merged, which never happens here because the RDD has a single partition.
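To verify this, we can split the same data across two partitions, so that x01 and x03 each land in both; the third operator then fires when their per-partition combiners are merged. A minimal sketch under that assumption (rddPair2 and rddCombine3 are hypothetical names):

    // with 2 partitions, keys appearing in both partitions trigger mergeCombiners (combineFtnThree)
    val rddPair2: RDD[(String, Int)] = sc.parallelize(List(("x01", 2), ("x02", 5), ("x03", 8), ("x04", 3), ("x01", 12), ("x03", 9)), 2)
    val rddCombine3: RDD[(String, (Int, Int))] = rddPair2.combineByKey(x => combineFtnOne(x), (com: (Int, Int), x) => combineFtnTwo(com, x), (com1: (Int, Int), com2: (Int, Int)) => combineFtnThree(com1, com2))
    println("====combineByKey 3====:" + rddCombine3.collect().mkString(",")) // same totals as before, e.g. (x01,(14,2)),(x03,(17,2))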

That is all for this article; I will explain the remaining content in the next one.
