Spark RDD API Explained (Part 1): map and reduce

Source: Internet
Author: User
Tags: spark, rdd

Original link: https://www.zybuluo.com/jewes/note/35032

What is an RDD?

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable (non-modifiable), partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.

Internally, each RDD is characterized by five main properties:

    • A list of partitions
    • A function for computing each split
    • A list of dependencies on other RDDs
    • Optionally, a partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
    • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions. Refer to the Spark paper for more details on RDD internals.
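
To make these five properties concrete, here is a minimal sketch (not from the original article) of such a custom RDD in Scala: it serves the integers 1 to n split across a fixed number of partitions by overriding getPartitions and compute. The class names SimpleRangeRDD and RangePartition are illustrative assumptions, not part of the Spark API.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One split of the custom RDD: the sub-range [start, end] of the integers.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// A custom RDD with no parent dependencies (Nil). It demonstrates the
// "list of partitions" and "function for computing each split" properties;
// dependencies, partitioner, and preferred locations keep their defaults.
class SimpleRangeRDD(sc: SparkContext, n: Int, numPartitions: Int)
  extends RDD[Int](sc, Nil) {

  // Property 1: the list of partitions.
  override protected def getPartitions: Array[Partition] = {
    val step = math.ceil(n.toDouble / numPartitions).toInt
    (0 until numPartitions).map { i =>
      new RangePartition(i, i * step + 1, math.min((i + 1) * step, n))
    }.toArray
  }

  // Property 2: how to compute one split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start to p.end).iterator
  }
}

Used from the shell, new SimpleRangeRDD(sc, 9, 3).collect should behave like sc.parallelize(1 to 9, 3).collect.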

The RDD is an abstract data type in Spark; any data in Spark is represented as an RDD. From a programming point of view, an RDD can be viewed simply as an array. Unlike a normal array, though, the data in an RDD is partitioned, so that data in different partitions can be distributed across different machines and processed in parallel. So what a Spark application does is simply convert the data that needs to be processed into an RDD, then perform a series of transformations and actions on that RDD to obtain the result. This first part of the series describes the map- and reduce-related APIs of the Spark RDD.
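
As a minimal sketch of that workflow (assuming a running spark-shell, where sc is the predefined SparkContext):

scala> val data = sc.parallelize(1 to 100)   // convert the data into an RDD
scala> val squares = data.map(x => x * x)    // a transformation: lazily builds a new RDD
scala> squares.reduce((x, y) => x + y)       // an action: triggers the computation
Int = 338350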

How do I create an RDD?

An RDD can be created from an ordinary array, or from a file in a file system such as HDFS.

Example: create an RDD from an ordinary array containing the nine numbers 1 through 9, in 3 partitions.

scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12

Example: read the file README.md to create an RDD; each line in the file is one element of the RDD.

scala> val b = sc.textFile("README.md")
b: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at <console>:12

There are other ways to create an RDD, but in this article we use these two methods to create the RDDs that illustrate the RDD API.

map

map applies a specified function to each element of the RDD to produce a new RDD. Each element in the original RDD has exactly one corresponding element in the new RDD.

Example:

scala> val a = sc.parallelize(1 to 9, 3)
scala> val b = a.map(x => x * 2)
scala> a.collect
Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> b.collect
Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)

In the example above, each element of the original RDD is multiplied by 2 to create the new RDD.

mapPartitions

mapPartitions is a variant of map. map's input function is applied to each element of the RDD, while mapPartitions' input function is applied to each partition, that is, the contents of each partition are treated as a whole.
Its function is defined as:

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

f is the input function, which processes the contents of one partition. The contents of each partition are passed to f as an Iterator[T], and f's output is an Iterator[U]. The final RDD is the combination of the results of all partitions after they are processed by the input function.

Example:

scala> val a = sc.parallelize(1 to 9, 3)
scala> def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
     |   var res = List[(T, T)]()
     |   var pre = iter.next
     |   while (iter.hasNext) {
     |     val cur = iter.next
     |     res = (pre, cur) :: res   // pair each element with its successor
     |     pre = cur
     |   }
     |   res.iterator
     | }
scala> a.mapPartitions(myfunc).collect
Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))

The function myfunc in the above example pairs each element of a partition with the element that follows it. Because the last element in a partition has no successor, (3,4) and (6,7) are not in the result.
mapPartitions also has some variants, such as mapPartitionsWithContext, which can pass state information about the processing to the user-specified input function, and mapPartitionsWithIndex, which passes the index of the partition to the user-specified input function; see the sketch below.
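
As a minimal sketch of mapPartitionsWithIndex (this example is not from the original article), the following tags each element with the index of the partition it lives in:

scala> val a = sc.parallelize(1 to 9, 3)
scala> a.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x))).collect
Array[(Int, Int)] = Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9))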

mapValues

mapValues, as the name implies, applies the input function to the value of each key-value pair in the RDD; the key from the original RDD stays unchanged and, together with the new value, forms an element of the new RDD. Therefore, this function applies only to RDDs whose elements are key-value pairs.

Example:

scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
scala> val b = a.map(x => (x.length, x))
scala> b.mapValues("x" + _ + "x").collect
Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))

mapWith

mapWith is another variant of map. map takes only one input function, while mapWith takes two. It is defined as follows:

def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]

    • The first function, constructA, takes the partition index of the RDD (starting from 0) as input and outputs a value of the new type A;
    • The second function, f, takes a two-tuple (T, A) as input (where T is an element of the original RDD and A is the output of the first function) and outputs a value of type U.

Example: multiply the partition index by 10 and add 2, using the result as the elements of the new RDD.

scala> val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
scala> x.mapWith(a => a * 10)((a, b) => b + 2).collect
Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)

flatMap

flatMap is similar to map; the difference is that each element of the original RDD generates exactly one element under map, while each element of the original RDD can generate multiple elements under flatMap, and those elements construct the new RDD.
Example: for each element x of the original RDD, generate the sequence of elements from 1 to x (so element x yields x elements).

scala> val a = sc.parallelize(1 to 4, 2)
scala> val b = a.flatMap(x => 1 to x)
scala> b.collect
Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)

flatMapWith

flatMapWith is very similar to mapWith: both take two functions, one taking the partition index as input and outputting a value of a new type A, the other taking a two-tuple (T, A) as input and outputting a sequence, whose elements make up the new RDD. It is defined as follows:

def flatMapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]): RDD[U]

Example:

scala> val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
scala> a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect
Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)

Here the first function returns the partition index itself, and each element x is emitted together with its partition index y as List(y, x), which is flattened into the result.

flatMapValues

flatMapValues is similar to mapValues, except that flatMapValues applies to the values of an RDD whose elements are key-value pairs. The value of each element is mapped by the input function to a series of values, and each of those values then forms a new key-value pair with the key from the original element.

Example

scala> val a = sc.parallelize(List((1,2), (3,4), (3,6)))
scala> val b = a.flatMapValues(x => x.to(5))
scala> b.collect
Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (3,4), (3,5))

In the above example, the value of each element in the RDD is converted to a sequence running from its current value to 5; the first key-value pair (1,2), for example, has its value 2 converted to 2, 3, 4, 5, which then form the new key-value pairs (1,2), (1,3), (1,4), (1,5) with the original key. The pair (3,6) contributes nothing, since 6.to(5) is an empty sequence.

reduce

reduce passes the elements of the RDD to the input function two at a time, producing a new value; the new value and the next element of the RDD are then passed to the input function together, until only one value is left at the end.

Example

scala> val c = sc.parallelize(1 to 10)
scala> c.reduce((x, y) => x + y)
Int = 55

The above example sums the elements in the RDD.

reduceByKey

As the name implies, reduceByKey operates on an RDD of key-value pairs and reduces the values of elements that share the same key, so the values of multiple elements with the same key are reduced to a single value, which then forms a new key-value pair with the key from the original RDD.

Example:

scala> val a = sc.parallelize(List((1,2), (3,4), (3,6)))
scala> a.reduceByKey((x, y) => x + y).collect
Array[(Int, Int)] = Array((1,2), (3,10))

In the example above, the values of elements with the same key are summed, so the two elements with key 3 are reduced to (3,10), while (1,2), whose key is unique, is unchanged.
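
As a short sketch tying together the operations covered in this article (not from the original article; it assumes a text file README.md exists), the classic word count combines flatMap, map, and reduceByKey:

scala> val words = sc.textFile("README.md").flatMap(line => line.split(" "))
scala> val counts = words.map(w => (w, 1)).reduceByKey((x, y) => x + y)
scala> counts.take(3)   // returns an Array of (word, count) pairs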

Reference

Some examples in this article are from: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
