Spark Functions in Detail Series -- RDD Basic Transformations

Source: Internet
Author: User
Tags: define function, shuffle

Summary: An RDD (Resilient Distributed Dataset) is a special collection that supports multiple data sources, has a fault-tolerance mechanism, can be cached, and supports parallel operations. An RDD represents a partitioned dataset.
There are two kinds of RDD operators:
Transformation (conversion): transformations are computed lazily. When one RDD is converted into another, nothing is computed immediately; Spark only records the logical operation on the dataset.
Action (execution): actions trigger a Spark job and actually cause the recorded transformations to be computed.

This series focuses on the function operations commonly used in Spark: 1. RDD basic transformations; 2. key-value RDD transformations; 3. Action operations.

1. map(func): Each element of the dataset is transformed by a user-defined function to form a new RDD, called MappedRDD. (Example 1)
import org.apache.spark.{SparkConf, SparkContext}

object Map {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("Map")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)   // create an RDD
    val map = rdd.map(_ * 2)            // multiply each element in the RDD by 2
    map.foreach(x => print(x + " "))
    sc.stop()
  }
}
Output:
2 4 6 8 10 12 14 16 18 20
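Because map is a transformation, the doubling above is not computed when map is called but only when an action runs. A minimal sketch of this deferred evaluation (assuming the same sc as in Example 1):

val lazyRDD = sc.parallelize(1 to 10).map { x =>
  println("computing " + x)   // not printed yet: the map is only recorded
  x * 2
}
// nothing has executed so far; the action below triggers the actual computation
lazyRDD.count()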
(RDD dependency graph: the red block represents an RDD, the black blocks represent its partition collection; the same applies below.)

2. flatMap(func): Similar to map, but each input element can be mapped to 0 or more output elements, producing a "flattened" result. (Example 2)
// ... omit sc
val rdd = sc.parallelize(1 to 5)
val fm = rdd.flatMap(x => (1 to x)).collect()
fm.foreach(x => print(x + " "))
Output:
1 1 2 1 2 3 1 2 3 4 1 2 3 4 5
If the map function were used instead, the output would be as follows:
Range(1) Range(1, 2) Range(1, 2, 3) Range(1, 2, 3, 4) Range(1, 2, 3, 4, 5)
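For comparison, a sketch of that map version, using the same rdd as Example 2:

val m = rdd.map(x => (1 to x)).collect()   // an Array of Range objects, not flattened
m.foreach(x => print(x + " "))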

(RDD dependency graph)

3. mapPartitions(func): Similar to map, but while map's function acts on each element, mapPartitions' function acts on each partition. func type: Iterator[T] => Iterator[U]. If there are N elements in M partitions, map's function is called N times, while mapPartitions' is called only M times. mapPartitions is much more efficient when objects are created inside the mapping, for example when writing data to a database: with map, a connection object would be created for every element, whereas with mapPartitions only one connection object is needed per partition (see the sketch after Example 3's output). (Example 3): output the names of all females:
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitions {
  // collect the names of all females in one partition
  def partitionsFun(/*index: Int,*/ iter: Iterator[(String, String)]): Iterator[String] = {
    var woman = List[String]()
    while (iter.hasNext) {
      val next = iter.next()
      next match {
        case (_, "female") => woman = /*"[" + index + "]" +*/ next._1 :: woman
        case _ =>
      }
    }
    woman.iterator
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("MapPartitions")
    val sc = new SparkContext(conf)
    val l = List(("Kpop", "female"), ("Zorro", "male"), ("Mobin", "male"), ("Lucy", "female"))
    val rdd = sc.parallelize(l, 2)
    val mp = rdd.mapPartitions(partitionsFun)
    /*val mp = rdd.mapPartitionsWithIndex(partitionsFun)*/
    mp.collect.foreach(x => print(x + " "))   // collect the per-partition results and print them
  }
}
Output:
Kpop Lucy
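The database scenario mentioned above can be sketched with mapPartitions so that only one connection is created per partition. DummyConnection is a hypothetical stand-in for a real driver; note the iterator is materialized before the connection is closed:

// hypothetical connection type standing in for a real database driver
class DummyConnection {
  def insert(record: (String, String)): Unit = println("inserted " + record)
  def close(): Unit = ()
}

val saved = rdd.mapPartitions { iter =>
  val conn = new DummyConnection()   // one connection per partition, not per element
  val out = iter.map { r => conn.insert(r); r }.toList   // force evaluation before closing
  conn.close()
  out.iterator
}
saved.count()   // an action is still needed to trigger the writes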
In fact, the same effect can be achieved with a single statement:
val mp = rdd.mapPartitions(x => x.filter(_._2 == "female")).map(x => x._1)
It is not done that way here in order to demonstrate defining a function. (RDD dependency graph)

4. mapPartitionsWithIndex(func): Similar to mapPartitions, except that the function takes an extra partition-index parameter. func type: (Int, Iterator[T]) => Iterator[U]. (Example 4): uncomment the commented-out parts of Example 3; the output then carries the partition index:
[0]Kpop [1]Lucy
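The same result can be written as a single inline statement, without a named function (a sketch using the rdd from Example 3):

val mpi = rdd.mapPartitionsWithIndex((index, iter) =>
  iter.filter(_._2 == "female").map(x => "[" + index + "]" + x._1))
mpi.collect.foreach(x => print(x + " "))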

5. union(otherDataset): Combines the datasets of two RDDs and returns their union; if the same element exists in both RDDs, duplicates are not removed.
// ... omit sc
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(3 to 5)
val unionRDD = rdd1.union(rdd2)
unionRDD.collect.foreach(x => print(x + " "))
sc.stop
Output:
1 2 3 3 4 5
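Since union keeps duplicates, they can be removed by chaining distinct (section 7 below); a minimal sketch:

val dedupRDD = rdd1.union(rdd2).distinct()
dedupRDD.collect.foreach(x => print(x + " "))   // 3 now appears only once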

  

6. intersection(otherDataset): Returns the intersection of the two RDDs.
// ... omit sc
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(3 to 5)
val intersectionRDD = rdd1.intersection(rdd2)
intersectionRDD.collect.foreach(x => print(x + " "))
sc.stop
Output:
3

7. distinct([numTasks]): Removes duplicate elements from the RDD.
// ... omit sc
val list = List(1, 1, 2, 5, 2, 9, 6, 1)
val rdd = sc.parallelize(list)
val distinctRDD = rdd.distinct()
distinctRDD.collect.foreach(x => print(x + " "))
Output:
1 6 9 5 2
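The optional [numTasks] argument controls the parallelism of the de-duplication; a minimal sketch using the rdd above:

val distinct2 = rdd.distinct(2)        // run the de-duplication with 2 tasks
println(distinct2.partitions.size)     // 2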

8. cartesian(otherDataset): Performs a Cartesian product over all elements of the two RDDs.
// ... omit sc
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(2 to 5)
val cartesianRDD = rdd1.cartesian(rdd2)
cartesianRDD.foreach(x => println(x + " "))
Output:
(1,2) (1,3) (1,4) (1,5) (2,2) (2,3) (2,4) (2,5) (3,2) (3,3) (3,4) (3,5)

(RDD dependency graph)

9. coalesce(numPartitions, shuffle): Repartitions the RDD. shuffle defaults to false; when shuffle=false the number of partitions cannot be increased. No error is raised, the partition count simply stays at the original value (see the sketch after the output below). (Example 9:) shuffle=false
// ... omit sc
val rdd = sc.parallelize(1 to 16, 4)
val coalesceRDD = rdd.coalesce(3)   // when shuffle is false, the number of partitions cannot be increased (e.g. not from 4 to 7)
println("Number of partitions after repartitioning: " + coalesceRDD.partitions.size)
Output:
Number of partitions after repartitioning: 3
// datasets in each partition: List(1, 2, 3, 4) List(5, 6, 7, 8) List(...)
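To see the silent no-op described in the comment, asking for more partitions while shuffle stays false, a minimal sketch using the same rdd:

val tryGrow = rdd.coalesce(7)       // shuffle defaults to false
println(tryGrow.partitions.size)    // still 4, and no error is raised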

 

(Example 9.1:) shuffle=true
// ... omit sc
val rdd = sc.parallelize(1 to 16, 4)
val coalesceRDD = rdd.coalesce(7, true)
println("Number of partitions after repartitioning: " + coalesceRDD.partitions.size)
println("RDD dependencies: " + coalesceRDD.toDebugString)
Output:
Number of partitions after repartitioning: 5
RDD dependencies: (5) MapPartitionsRDD[4] at coalesce at Coalesce.scala:14 []
 |  CoalescedRDD[3] at coalesce at Coalesce.scala:14 []
 |  ShuffledRDD[2] at coalesce at Coalesce.scala:14 []
 +-(4) MapPartitionsRDD[1] at coalesce at Coalesce.scala:14 []
    |  ParallelCollectionRDD[0] at parallelize at Coalesce.scala:13 []
// datasets in each partition: List(10, ...) List(1, 5, 11, ...) List(2, 6, ...) List(3, 7, ...) List(...)

(RDD dependency graph: coalesce(3, false))

(RDD dependency graph: coalesce(3, true))

10. repartition(numPartitions): Implemented as coalesce(numPartitions, true); the effect is the same as coalesce(numPartitions, true) in Example 9.1, as sketched below.
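A minimal sketch of repartition, mirroring Example 9.1 (repartitionRDD is a name introduced here):

// ... omit sc
val rdd = sc.parallelize(1 to 16, 4)
val repartitionRDD = rdd.repartition(7)   // equivalent to coalesce(7, shuffle = true)
println("Number of partitions after repartitioning: " + repartitionRDD.partitions.size)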
11. glom(): Converts the elements of type T in each partition of the RDD into an array Array[T].

// ... omit sc
val rdd = sc.parallelize(1 to 16, 4)
val glomRDD = rdd.glom()   // RDD[Array[T]]
glomRDD.foreach(arr => println(arr.getClass.getSimpleName))
sc.stop
Output:
int   // shows that the elements of each partition in the RDD were converted into an Array[Int]
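To inspect the actual partition contents rather than just the element type, the glommed RDD can be collected (a sketch; run it before the sc.stop above):

glomRDD.collect().foreach(arr => println(arr.toList))   // prints one List per partition, e.g. List(1, 2, 3, 4) for the first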
12. randomSplit(weights: Array[Double], seed): Splits one RDD into multiple RDDs according to the weights array; the higher an entry's weight, the greater the chance that elements are assigned to the corresponding split.
// ... omit sc
val rdd = sc.parallelize(1 to 10)
val randomSplitRDD = rdd.randomSplit(Array(1.0, 2.0, 7.0))
randomSplitRDD(0).foreach(x => print(x + " "))
randomSplitRDD(1).foreach(x => print(x + " "))
randomSplitRDD(2).foreach(x => print(x + " "))
sc.stop
Output:
2 4
3 8 9
1 5 6 7 10
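The second parameter of randomSplit is a random seed; fixing it makes the split reproducible across runs (a minimal sketch, to be run before sc.stop):

val seeded = rdd.randomSplit(Array(1.0, 2.0, 7.0), seed = 11L)
println(seeded.map(_.count()).mkString(" "))   // the same counts on every run with the same seed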

