Spark Functions in Detail Series -- RDD Basic Transformations

Source: Internet
Author: User
Tags: define function, shuffle

Summary: An RDD (Resilient Distributed Dataset) is a special collection that supports multiple data sources, has a fault-tolerance mechanism, can be cached, and supports parallel operations. An RDD represents a partitioned dataset.
There are two kinds of RDD operators:
Transformation (conversion): transformations are computed lazily. When one RDD is converted into another, nothing is computed immediately; Spark only records the logical operation on the dataset.
Action (execution): actions trigger a Spark job and actually cause the recorded transformations to be computed.

This series focuses on the function operations commonly used in Spark: 1. RDD basic transformations; 2. key-value RDD transformations; 3. Action operations.

1. map(func): Each element of the dataset is transformed by a user-defined function to form a new RDD, called MappedRDD. (Example 1)
import org.apache.spark.{SparkConf, SparkContext}

object Map {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("Map")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)   // create an RDD
    val map = rdd.map(_ * 2)            // multiply each element in the RDD by 2
    map.foreach(x => print(x + " "))
    sc.stop()
  }
}
Output:
2 4 6 8 10 12 14 16 18 20
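Because map is a transformation, the doubling above is not computed when map is called but only when an action runs. A minimal sketch of this deferred evaluation (assuming the same sc as in Example 1):

val lazyRDD = sc.parallelize(1 to 10).map { x =>
  println("computing " + x)   // not printed yet: the map is only recorded
  x * 2
}
// nothing has executed so far; the action below triggers the actual computation
lazyRDD.count()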
(RDD dependency graph: the red block represents an RDD, the black blocks represent its partition collection; the same applies below.)

2. flatMap(func): Similar to map, but each input element can be mapped to 0 or more output elements, producing a "flattened" result. (Example 2)
// ... omit sc
val rdd = sc.parallelize(1 to 5)
val fm = rdd.flatMap(x => (1 to x)).collect()
fm.foreach(x => print(x + " "))
Output:
1 1 2 1 2 3 1 2 3 4 1 2 3 4 5
If the map function were used instead, the output would be as follows:
Range(1) Range(1, 2) Range(1, 2, 3) Range(1, 2, 3, 4) Range(1, 2, 3, 4, 5)
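For comparison, a sketch of that map version, using the same rdd as Example 2:

val m = rdd.map(x => (1 to x)).collect()   // an Array of Range objects, not flattened
m.foreach(x => print(x + " "))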

(RDD dependency graph)

3. mapPartitions(func): Similar to map, but while map's function acts on each element, mapPartitions' function acts on each partition. func type: Iterator[T] => Iterator[U]. If there are N elements in M partitions, map's function is called N times, while mapPartitions' is called only M times. mapPartitions is much more efficient when objects are created inside the mapping, for example when writing data to a database: with map, a connection object would be created for every element, whereas with mapPartitions only one connection object is needed per partition (see the sketch after Example 3's output). (Example 3): output the names of all females:
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitions {
  // collect the names of all females in one partition
  def partitionsFun(/*index: Int,*/ iter: Iterator[(String, String)]): Iterator[String] = {
    var woman = List[String]()
    while (iter.hasNext) {
      val next = iter.next()
      next match {
        case (_, "female") => woman = /*"[" + index + "]" +*/ next._1 :: woman
        case _ =>
      }
    }
    woman.iterator
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("MapPartitions")
    val sc = new SparkContext(conf)
    val l = List(("Kpop", "female"), ("Zorro", "male"), ("Mobin", "male"), ("Lucy", "female"))
    val rdd = sc.parallelize(l, 2)
    val mp = rdd.mapPartitions(partitionsFun)
    /*val mp = rdd.mapPartitionsWithIndex(partitionsFun)*/
    mp.collect.foreach(x => print(x + " "))   // collect the per-partition results and print them
  }
}
Output:
Kpop Lucy
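The database scenario mentioned above can be sketched with mapPartitions so that only one connection is created per partition. DummyConnection is a hypothetical stand-in for a real driver; note the iterator is materialized before the connection is closed:

// hypothetical connection type standing in for a real database driver
class DummyConnection {
  def insert(record: (String, String)): Unit = println("inserted " + record)
  def close(): Unit = ()
}

val saved = rdd.mapPartitions { iter =>
  val conn = new DummyConnection()   // one connection per partition, not per element
  val out = iter.map { r => conn.insert(r); r }.toList   // force evaluation before closing
  conn.close()
  out.iterator
}
saved.count()   // an action is still needed to trigger the writes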
In fact, the same effect can be achieved with a single statement:
val mp = rdd.mapPartitions(x => x.filter(_._2 == "female")).map(x => x._1)
It is not done that way here in order to demonstrate defining a function. (RDD dependency graph)

4. mapPartitionsWithIndex(func): Similar to mapPartitions, except that the function takes an extra partition-index parameter. func type: (Int, Iterator[T]) => Iterator[U]. (Example 4): uncomment the commented-out parts of Example 3; the output then carries the partition index:
[0]Kpop [1]Lucy
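The same result can be written as a single inline statement, without a named function (a sketch using the rdd from Example 3):

val mpi = rdd.mapPartitionsWithIndex((index, iter) =>
  iter.filter(_._2 == "female").map(x => "[" + index + "]" + x._1))
mpi.collect.foreach(x => print(x + " "))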

5. union(otherDataset): Combines the datasets of two RDDs and returns their union; if the same element exists in both RDDs, duplicates are not removed.
// ... omit sc
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(3 to 5)
val unionRDD = rdd1.union(rdd2)
unionRDD.collect.foreach(x => print(x + " "))
sc.stop
Output:
1 2 3 3 4 5
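Since union keeps duplicates, they can be removed by chaining distinct (section 7 below); a minimal sketch:

val dedupRDD = rdd1.union(rdd2).distinct()
dedupRDD.collect.foreach(x => print(x + " "))   // 3 now appears only once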

  

6. intersection(otherDataset): Returns the intersection of the two RDDs.
// ... omit sc
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(3 to 5)
val intersectionRDD = rdd1.intersection(rdd2)
intersectionRDD.collect.foreach(x => print(x + " "))
sc.stop
Output:
3

7. distinct([numTasks]): Removes duplicate elements from the RDD.
// ... omit sc
val list = List(1, 1, 2, 5, 2, 9, 6, 1)
val rdd = sc.parallelize(list)
val distinctRDD = rdd.distinct()
distinctRDD.collect.foreach(x => print(x + " "))
Output:
1 6 9 5 2
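The optional [numTasks] argument controls the parallelism of the de-duplication; a minimal sketch using the rdd above:

val distinct2 = rdd.distinct(2)        // run the de-duplication with 2 tasks
println(distinct2.partitions.size)     // 2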

8. cartesian(otherDataset): Performs a Cartesian product over all elements of the two RDDs.
// ... omit sc
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(2 to 5)
val cartesianRDD = rdd1.cartesian(rdd2)
cartesianRDD.foreach(x => println(x + " "))
Output:
(1,2) (1,3) (1,4) (1,5) (2,2) (2,3) (2,4) (2,5) (3,2) (3,3) (3,4) (3,5)

(RDD dependency graph)

9. coalesce(numPartitions, shuffle): Repartitions the RDD. shuffle defaults to false; when shuffle=false the number of partitions cannot be increased. No error is raised, the partition count simply stays at the original value (see the sketch after the output below). (Example 9:) shuffle=false
// ... omit sc
val rdd = sc.parallelize(1 to 16, 4)
val coalesceRDD = rdd.coalesce(3)   // when shuffle is false, the number of partitions cannot be increased (e.g. not from 4 to 7)
println("Number of partitions after repartitioning: " + coalesceRDD.partitions.size)
Output:
Number of partitions after repartitioning: 3
// datasets in each partition: List(1, 2, 3, 4) List(5, 6, 7, 8) List(...)
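To see the silent no-op described in the comment, asking for more partitions while shuffle stays false, a minimal sketch using the same rdd:

val tryGrow = rdd.coalesce(7)       // shuffle defaults to false
println(tryGrow.partitions.size)    // still 4, and no error is raised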

 

(Example 9.1:) shuffle=true
// ... omit sc
val rdd = sc.parallelize(1 to 16, 4)
val coalesceRDD = rdd.coalesce(7, true)
println("Number of partitions after repartitioning: " + coalesceRDD.partitions.size)
println("RDD dependencies: " + coalesceRDD.toDebugString)
Output:
Number of partitions after repartitioning: 5
RDD dependencies: (5) MapPartitionsRDD[4] at coalesce at Coalesce.scala:14 []
 |  CoalescedRDD[3] at coalesce at Coalesce.scala:14 []
 |  ShuffledRDD[2] at coalesce at Coalesce.scala:14 []
 +-(4) MapPartitionsRDD[1] at coalesce at Coalesce.scala:14 []
    |  ParallelCollectionRDD[0] at parallelize at Coalesce.scala:13 []
// datasets in each partition: List(10, ...) List(1, 5, 11, ...) List(2, 6, ...) List(3, 7, ...) List(...)

(RDD dependency graph: coalesce(3, false))

(RDD dependency graph: coalesce(3, true))

10. repartition(numPartitions): Implemented as coalesce(numPartitions, true); the effect is the same as coalesce(numPartitions, true) in Example 9.1, as sketched below.
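A minimal sketch of repartition, mirroring Example 9.1 (repartitionRDD is a name introduced here):

// ... omit sc
val rdd = sc.parallelize(1 to 16, 4)
val repartitionRDD = rdd.repartition(7)   // equivalent to coalesce(7, shuffle = true)
println("Number of partitions after repartitioning: " + repartitionRDD.partitions.size)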
11. glom(): Converts the elements of type T in each partition of the RDD into an array Array[T].

// ... omit sc
val rdd = sc.parallelize(1 to 16, 4)
val glomRDD = rdd.glom()   // RDD[Array[T]]
glomRDD.foreach(arr => println(arr.getClass.getSimpleName))
sc.stop
Output:
int   // shows that the elements of each partition in the RDD were converted into an Array[Int]
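To inspect the actual partition contents rather than just the element type, the glommed RDD can be collected (a sketch; run it before the sc.stop above):

glomRDD.collect().foreach(arr => println(arr.toList))   // prints one List per partition, e.g. List(1, 2, 3, 4) for the first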
12. randomSplit(weights: Array[Double], seed): Splits one RDD into multiple RDDs according to the weights array; the higher an entry's weight, the greater the chance that elements are assigned to the corresponding split.
// ... omit sc
val rdd = sc.parallelize(1 to 10)
val randomSplitRDD = rdd.randomSplit(Array(1.0, 2.0, 7.0))
randomSplitRDD(0).foreach(x => print(x + " "))
randomSplitRDD(1).foreach(x => print(x + " "))
randomSplitRDD(2).foreach(x => print(x + " "))
sc.stop
Output:
2 4
3 8 9
1 5 6 7 10
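The second parameter of randomSplit is a random seed; fixing it makes the split reproducible across runs (a minimal sketch, to be run before sc.stop):

val seeded = rdd.randomSplit(Array(1.0, 2.0, 7.0), seed = 11L)
println(seeded.map(_.count()).mkString(" "))   // the same counts on every run with the same seed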

