Spark RDD API (Scala)


1. RDD

The RDD (Resilient Distributed Dataset) is the core data abstraction in Spark. From a programming point of view, an RDD can be treated much like an array. The difference from an ordinary array is that the data in an RDD is partitioned, so different partitions can be placed on different machines and processed in parallel. A Spark application therefore converts the data to be processed into an RDD and then applies a series of transformations and actions to that RDD to obtain the result.

2. RDD creation

An RDD can be created from an ordinary array (or other collection), or from a file in the local file system or in HDFS.

1) Create an RDD from an ordinary array containing the 9 numbers from 1 to 9, split across 3 partitions

scala> val a = sc.parallelize(1 to 9, 3)

2) Read the file readme.md to create an RDD; each line of the file becomes one element of the RDD

scala> val b = sc.textFile("readme.md")
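To make the partitioning visible, the following is a minimal sketch rather than a definitive recipe. It assumes Spark is on the classpath and runs in local mode; the names numParts and layout are made up for illustration. In spark-shell, getOrCreate should simply return the existing context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with two worker threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val a = sc.parallelize(1 to 9, 3)

// Number of partitions requested in parallelize.
val numParts = a.getNumPartitions

// glom() turns each partition into a single array, so the layout can be inspected.
val layout = a.glom().collect().map(_.toList).toList

println(numParts) // 3
println(layout)   // List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))
```

Collecting the layout like this is only sensible for small data, since collect brings every element back to the driver.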

3. Two types of operators

Operators fall into two main categories: transformations and actions. The main difference between them is that a transformation takes an RDD and returns an RDD, whereas an action takes an RDD and returns something that is not an RDD.

Transformations are lazy: a transformation that derives one RDD from another is not executed immediately. The computation is actually triggered only when an action is invoked.

An action triggers Spark to submit a job and returns the result to the driver program or writes it to external storage.
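Laziness can be observed by counting how often the function passed to map actually runs. This is a sketch under assumptions: local mode, Spark 2.x or later (for longAccumulator), and illustrative names such as calls, before, and after.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

// Accumulator used only to count invocations of the map function.
val calls = sc.longAccumulator("map calls")

// Transformation: recorded in the lineage, but not executed yet.
val doubled = sc.parallelize(1 to 9, 3).map { x => calls.add(1); x * 2 }
val before: Long = calls.value

// Action: submits a job, which runs the map function on every element.
val total = doubled.count()
val after: Long = calls.value

println(s"before=$before total=$total after=$after")
```

Accumulator updates made inside transformations can be applied more than once if a task is retried, so this counting trick is meant for illustration, not for production metrics.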

4. Transformation operators


1) map

Applies a specified function to each element of the RDD to produce a new RDD. Every element of the original RDD corresponds to exactly one element of the new RDD.

Example:

scala> val a = sc.parallelize(1 to 9, 3)

scala> val b = a.map(x => x * 2)

scala> a.collect

res10: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> b.collect

res11: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)

flatMap, described next, differs in that each input element may produce several output elements instead of exactly one.

2) flatMap

Similar to map. The difference is that under map each element of the original RDD produces exactly one element, whereas under flatMap each element can produce multiple elements, all of which go into the new RDD.

Example: for each element x in the original RDD, generate x elements (the numbers from 1 to x)

scala> val a = sc.parallelize(1 to 4, 2)

scala> val b = a.flatMap(x => 1 to x)

scala> b.collect

res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)

In contrast, applying map with the same function would produce exactly one element per input, each of them a whole sequence, so the result would be an RDD of four sequences rather than a flat RDD of ten integers.
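The contrast between map and flatMap can be checked with a short sketch. It assumes local mode; the names flat and nested are illustrative, and the sequences are converted to List only to make the results easy to compare.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val a = sc.parallelize(1 to 4, 2)

// flatMap: the sequences produced for each element are flattened into one RDD.
val flat = a.flatMap(x => 1 to x).collect().toList

// map: each element yields exactly one output element, here a whole list.
val nested = a.map(x => (1 to x).toList).collect().toList

println(flat)   // List(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
println(nested) // List(List(1), List(1, 2), List(1, 2, 3), List(1, 2, 3, 4))
```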

3) mapPartitions

mapPartitions is a variant of map. The input function of map is applied to each element of the RDD, whereas the input function of mapPartitions is applied to each partition, that is, the contents of each partition are processed as a whole.

Its function is defined as:

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

f is the input function, which processes the contents of each partition. The contents of each partition are passed to f as an Iterator[T], and the output of f is an Iterator[U]. The final RDD is the combination of the results that f produces for all partitions.

Example: consider a function myfunc that pairs each element in a partition with its next element. Applied through mapPartitions to the numbers 1 to 9 in 3 partitions, it never produces (3,4) or (6,7), because the last element in each partition has no next element within that partition.
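One way such a pairing function could be written is sketched below. This is not the original myfunc, whose code is not shown here; pairWithNext is a hypothetical stand-in built on Iterator.sliding, and local mode with the numbers 1 to 9 in 3 partitions is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

// Hypothetical helper: pairs each element of one partition with its successor.
// sliding(2) yields windows of two elements; a shorter trailing window is dropped.
def pairWithNext[T](iter: Iterator[T]): Iterator[(T, T)] =
  iter.sliding(2).collect { case Seq(x, y) => (x, y) }

val a = sc.parallelize(1 to 9, 3)

// The function sees one whole partition at a time, never elements of two partitions.
val pairs = a.mapPartitions(it => pairWithNext(it)).collect().toList

println(pairs) // List((1,2), (2,3), (4,5), (5,6), (7,8), (8,9))
```

Because each call only sees a single partition, pairs that would span a partition boundary, such as (3,4) and (6,7), cannot appear.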

mapPartitions also has some variants. mapPartitionsWithContext passes some state information about the processing to the user-specified input function, and mapPartitionsWithIndex passes the index of the partition to the user-specified input function.

4) mapWith

mapWith is another variant of map. map takes a single input function, whereas mapWith takes two. It is defined as follows:

def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]

The first function, constructA, takes the partition index of the RDD (starting from 0) as input and outputs a value of the new type A;

The second function, f, takes the pair (T, A) as input (where T is an element of the original RDD and A is the output of the first function) and outputs a value of type U.

Example: multiply the partition index by 10 and add 2 to get the elements of the new RDD.

scala> val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)

scala> x.mapWith(a => a * 10)((a, b) => (b + 2)).collect

res4: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)
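mapWith was deprecated in the Spark 1.x line and appears to be absent from Spark 2.x and later, so the example above may not compile on a current installation. Assuming that is the case, the same result can be sketched with mapPartitionsWithIndex. The names constructed and result are illustrative, and local mode is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)

val result = x.mapPartitionsWithIndex { (index, iter) =>
  val constructed = index * 10      // plays the role of constructA
  iter.map(_ => constructed + 2)    // plays the role of f; the element itself is ignored
}.collect().toList

println(result) // List(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)
```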

5) flatMapWith

flatMapWith is very similar to mapWith. Both take two functions: the first takes the partition index as input and outputs a value of a new type A; the second takes the pair (T, A) as input and outputs a sequence, and the elements of these sequences make up the new RDD. It is defined as follows:

def flatMapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]): RDD[U]

Example:

scala> val a = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

scala> a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect

res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)
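flatMapWith belongs to the same deprecated 1.x-era API as mapWith. Where it is unavailable, a sketch of the equivalent computation with mapPartitionsWithIndex might look as follows (local mode assumed, result is an illustrative name).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val a = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

// For every element emit the partition index followed by the element itself.
val result = a.mapPartitionsWithIndex(
  (index, iter) => iter.flatMap(x => List(index, x)),
  preservesPartitioning = true
).collect().toList

println(result) // List(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)
```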

6) flatMapValues

flatMapValues is similar to mapValues: it applies to the values of an RDD whose elements are key-value (KV) pairs. The difference is that the input function maps the value of each element to a sequence of values, and each of those values is then paired with the original key to form a new KV pair.

Example:

scala> val a = sc.parallelize(List((1,2), (3,4), (3,6)))

scala> val b = a.flatMapValues(x => x.to(5))

scala> b.collect

res3: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (3,4), (3,5))

In the example above, the value of each element in the RDD is converted to a sequence (from its current value up to 5). For the first KV pair (1,2), the value 2 is converted to 2, 3, 4, 5, which combine with the original key to form the new KV pairs (1,2), (1,3), (1,4), (1,5). The pair (3,6) contributes nothing, because the range from 6 to 5 is empty.

7) union

8) cartesian

9) groupBy

10) filter
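As a minimal sketch of union, cartesian, groupBy, and filter: the data and the value names below are made up for illustration, and a local-mode context is assumed. Single-partition inputs are used so that the collected order is easy to follow.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(3 to 5, 1)
val letters = sc.parallelize(List("x", "y"), 1)

// union: concatenates the two RDDs and keeps duplicates.
val unioned = a.union(b).collect().toList

// cartesian: every element of a paired with every element of letters.
val product = a.cartesian(letters).collect().toSet

// filter: keeps only the elements for which the predicate holds.
val evens = a.union(b).filter(_ % 2 == 0).collect().toList

// groupBy: groups elements by the key computed from each element.
val byParity = a.groupBy(_ % 2).mapValues(_.toList.sorted).collect().toMap

println(unioned)  // List(1, 2, 3, 3, 4, 5)
println(evens)    // List(2, 4)
println(byParity) // key 1 holds the odd numbers, key 0 the even ones
```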


11) sample

12) cache

cache keeps the elements of an RDD in memory.

If the data needs to be reused, it can be cached in memory with the cache operator, so that later actions do not have to recompute it.
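A sketch of reusing a cached RDD across two actions follows. It assumes local mode; squares, n, sum, and level are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val squares = sc.parallelize(1 to 1000, 4).map(x => x.toLong * x)

// cache() only marks the RDD for in-memory storage; nothing is computed yet.
squares.cache()

// The first action computes the partitions and stores them in memory.
val n = squares.count()

// The second action can read the cached partitions instead of recomputing them.
val sum = squares.reduce(_ + _)

// cache() is shorthand for persist with the MEMORY_ONLY storage level.
val level = squares.getStorageLevel

println(s"n=$n sum=$sum memoryOnly=${level == StorageLevel.MEMORY_ONLY}")
```

persist accepts other storage levels, for example StorageLevel.MEMORY_AND_DISK, when the data may not fit in memory.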

13) persist

14) mapValues

As the name implies, mapValues applies the input function to the value of each key-value pair in the RDD. The key stays unchanged and is paired with the new value to form an element of the new RDD. The function therefore applies only to RDDs whose elements are KV pairs.

Example:

scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)

scala> val b = a.map(x => (x.length, x))

scala> b.mapValues("x" + _ + "x").collect

res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))

15) combineByKey

16) reduceByKey

As the name implies, reduceByKey applies a reduce operation to the values of elements that share the same key in an RDD of KV pairs. The values of all elements with the same key are thus reduced to a single value, which is then paired with that key to form a new KV pair.

Example:

scala> val a = sc.parallelize(List((1,2), (3,4), (3,6)))

scala> a.reduceByKey((x, y) => x + y).collect

res7: Array[(Int, Int)] = Array((1,2), (3,10))

In the example above, the values of elements with the same key are summed, so the two elements with key 3 are combined into (3,10).
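A common use of reduceByKey is counting words. The sketch below combines it with flatMap and map; the input lines and the name counts are invented for illustration, and local mode is assumed. The result is sorted on the driver because collect does not guarantee key order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val lines = sc.parallelize(List("a b a", "b c", "a"), 2)

val counts = lines
  .flatMap(_.split(" "))      // one element per word
  .map(word => (word, 1))     // KV pair: (word, 1)
  .reduceByKey(_ + _)         // sum the 1s of each word
  .collect()
  .sortBy(_._1)               // sort on the driver for a stable order
  .toList

println(counts) // List((a,3), (b,2), (c,1))
```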

17) reduce

reduce passes the elements of the RDD to the input function two at a time, producing a new value; that value and the next element of the RDD are then passed to the input function again, and so on until only a single value remains. Unlike the other operators in this section, reduce returns a value rather than an RDD.

Example: summing the elements of an RDD

scala> val c = sc.parallelize(1 to 10)

scala> c.reduce((x, y) => x + y)

res4: Int = 55

18) join

19) zip

20) intersection
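A brief sketch of join, zip, and intersection follows. The data and value names are made up for illustration and local mode is assumed. Results of join and intersection are sorted on the driver because both involve a shuffle and their output order is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val names = sc.parallelize(List((1, "a"), (2, "b"), (3, "c")), 1)
val scores = sc.parallelize(List((1, 10), (3, 30), (4, 40)), 1)

// join: inner join on the key; keys present in only one RDD are dropped.
val joined = names.join(scores).collect().sortBy(_._1).toList

// zip: pairs elements by position; both RDDs need the same number of
// partitions and the same number of elements in each partition.
val zipped = sc.parallelize(1 to 3, 1)
  .zip(sc.parallelize(List("x", "y", "z"), 1))
  .collect().toList

// intersection: elements that occur in both RDDs, without duplicates.
val common = sc.parallelize(1 to 5, 2)
  .intersection(sc.parallelize(4 to 8, 2))
  .collect().sorted.toList

println(joined) // List((1,(a,10)), (3,(c,30)))
println(zipped) // List((1,x), (2,y), (3,z))
println(common) // List(4, 5)
```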

5. Action operators

1) foreach

2) saveAsTextFile

3) collect

Equivalent to toArray: the distributed RDD is returned to the driver as a single local Scala Array.

4) count
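The four actions listed above can be tried together in a small sketch. It assumes local mode and writes to a temporary directory on the local file system; outPath, partFiles, and the other value names are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

val a = sc.parallelize(1 to 9, 3)

// count: number of elements in the RDD.
val n = a.count()

// collect: brings all elements back to the driver as a local array.
val all = a.collect().toList

// foreach: runs the function on the executors; output order is not guaranteed.
a.foreach(x => println(x))

// saveAsTextFile: writes one part file per partition. The target directory
// must not exist yet, hence the fresh sub-directory of a temp directory.
val outPath = Files.createTempDirectory("rdd-sketch").resolve("numbers")
a.saveAsTextFile(outPath.toUri.toString) // file:// URI forces the local file system

val partFiles = outPath.toFile.listFiles()
  .map(_.getName).filter(_.startsWith("part-")).sorted.toList

println(s"n=$n all=$all partFiles=$partFiles")
```

collect should be reserved for results small enough to fit in the driver's memory; for large outputs, saveAsTextFile or a similar writer is the safer choice.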

Source: https://www.jianshu.com/p/bfeed8f7583d?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes&utm_source=recommendation
