What is an RDD?
The RDD is Spark's core abstract data type; all data in Spark is represented as RDDs. From a programming point of view, an RDD can be viewed simply as an array. Unlike a normal array, the data in an RDD is partitioned, so that different partitions can be distributed across different machines and processed in parallel. What a Spark application does, then, is simply convert the data to be processed into RDDs and perform a series of transformations and actions on those RDDs to obtain the result. This first part of the article covers the map- and reduce-related APIs of the Spark RDD.
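As a minimal sketch of that flow (assuming a spark-shell session where the SparkContext `sc` is already available), the snippet below creates an RDD, applies a transformation (map), and then an action (reduce):

scala> val nums = sc.parallelize(1 to 4, 2)
scala> nums.map(_ * 2).reduce(_ + _)
res0: Int = 20

The map doubles each element (2, 4, 6, 8), and reduce sums them to 20.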
How do I create an RDD?
An RDD can be created from an ordinary array, or from a file in a file system such as HDFS.
Example: create an RDD from an ordinary array containing the 9 numbers 1 to 9, in 3 partitions.
scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12
Example: read the file README.md to create an RDD; each line in the file becomes an element of the RDD.
scala> val b = sc.textFile("README.md")
b: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at <console>:12
There are other ways to create an RDD, but in this article we use these two methods to illustrate the RDD API.
map
map applies a specified function to each element of the RDD to produce a new RDD. Every element in the original RDD corresponds to exactly one element in the new RDD.
Example:
scala> val a = sc.parallelize(1 to 9, 3)
scala> val b = a.map(x => x*2)
scala> a.collect
res10: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> b.collect
res11: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
In the example above, each element of the original RDD is multiplied by 2 to create the new RDD.
mapPartitions
mapPartitions is a variant of map. The input function of map is applied to each element of the RDD, while the input function of mapPartitions is applied to each partition; that is, the contents of each partition are treated as a whole.
Its function is defined as:
def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
f is the input function, which processes the contents of one partition. The contents of each partition are passed to f as an Iterator[T], and f's output is an Iterator[U]. The final RDD is the combination of the results of applying the input function to every partition.
Example:
scala> val a = sc.parallelize(1 to 9, 3)
scala> def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
         var res = List[(T, T)]()
         var pre = iter.next
         while (iter.hasNext) {
           val cur = iter.next
           res ::= (pre, cur)
           pre = cur
         }
         res.iterator
       }
scala> a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
The function myfunc in the example above pairs each element in a partition with its next element. Because the last element in a partition has no successor within that partition, (3,4) and (6,7) are not in the result.
mapPartitions also has some variants, such as mapPartitionsWithContext, which can pass state information from the framework to the user-specified input function, and mapPartitionsWithIndex, which passes the index of the partition to the user-specified input function.
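As a brief sketch of mapPartitionsWithIndex (assuming the same 3-partition RDD as above), the input function receives the partition index and that partition's iterator, so we can tag each element with the partition it lives in:

scala> val a = sc.parallelize(1 to 9, 3)
scala> a.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x))).collect
res1: Array[(Int, Int)] = Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9))

With 9 elements in 3 partitions, elements 1-3 land in partition 0, 4-6 in partition 1, and 7-9 in partition 2, which the output confirms.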
mapValues
mapValues, as the name implies, applies the input function to the value of each key-value pair in the RDD; the key stays unchanged and, together with the new value, forms the elements of the new RDD. The function therefore applies only to RDDs whose elements are key-value pairs.
Example:
scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
scala> val b = a.map(x => (x.length, x))
scala> b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))
mapWith
mapWith is another variant of map. map takes only one input function, while mapWith takes two. It is defined as follows:
def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]
- The first function, constructA, takes the partition index of the RDD (starting from 0) as input and outputs a value of the new type A;
- The second function, f, takes the tuple (T, A) as input (where T is an element of the original RDD and A is the output of the first function) and outputs type U.
Example: multiply the partition index by 10 and add 2, producing the elements of the new RDD.
scala> val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
scala> x.mapWith(a => a * 10)((a, b) => (b + 2)).collect
res4: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)
flatMap
flatMap is similar to map; the difference is that with map each element of the original RDD produces exactly one element, while with flatMap each element of the original RDD can produce multiple elements, which together construct the new RDD.
Example: for each element x in the original RDD, generate y elements (from 1 to y, where y is the value of element x).
scala> val a = sc.parallelize(1 to 4, 2)
scala> val b = a.flatMap(x => 1 to x)
scala> b.collect
res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
flatMapWith
flatMapWith is very similar to mapWith: both take two functions. The first takes the partition index as input and outputs a value of the new type A; the second takes the tuple (T, A) as input and outputs a sequence, and the elements of these sequences make up the new RDD. It is defined as follows:
def flatMapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]): RDD[U]
Example:
scala> val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
scala> a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect
res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)
flatMapValues
flatMapValues is similar to mapValues, except that flatMapValues applies to the value of each key-value element in the RDD. Each element's value is mapped by the input function to a sequence of values, and those values then form a series of new key-value pairs with the original key.
Example
scala> val a = sc.parallelize(List((1,2),(3,4),(3,6)))
scala> val b = a.flatMapValues(x => x.to(5))
scala> b.collect
res3: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (3,4), (3,5))
In the example above, the value of each element in the RDD is converted to a sequence (from its current value up to 5). For the first key-value pair (1,2), its value 2 is converted to 2, 3, 4, 5, which then form the new key-value pairs (1,2), (1,3), (1,4), (1,5) with the original key. Note that the pair (3,6) produces an empty sequence (6 to 5), so it contributes nothing to the result.
reduce
reduce passes the elements of the RDD to the input function two at a time, generating a new value; the newly generated value and the next element of the RDD are then passed to the input function, until only one value remains.
Example
scala> val c = sc.parallelize(1 to 10)
scala> c.reduce((x, y) => x + y)
res4: Int = 55
The above example sums the elements in the RDD.
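Because the partitions are reduced in parallel, the input function should be commutative and associative; addition qualifies, and so does taking the maximum, as in this sketch:

scala> val c = sc.parallelize(1 to 10)
scala> c.reduce((x, y) => if (x > y) x else y)
res5: Int = 10

A non-associative function such as subtraction may return different results depending on how the elements are partitioned.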
reduceByKey
As the name implies, reduceByKey reduces the values of elements that share the same key in an RDD of key-value pairs: the values of all elements with the same key are reduced to a single value, which then forms a new key-value pair with that key.
Example:
scala> val a = sc.parallelize(List((1,2),(3,4),(3,6)))
scala> a.reduceByKey((x,y) => x + y).collect
res7: Array[(Int, Int)] = Array((1,2), (3,10))
In the example above, the values of elements with the same key are summed, so the two elements with key 3 are reduced to (3,10).
Copyright notice: this is an original article by the author; please do not reproduce it without permission.
Spark RDD API Explained (Part 1): map and reduce