Spark Operator Summary and Case Studies


Spark operators can be broadly divided into three classes:

1. Value-type transformation operators: these transformations do not trigger job submission; the data items they process are of Value type.

2. Key-Value-type transformation operators: these transformations likewise do not trigger job submission; the data items they process are Key-Value pairs.

3. Action operators: these operators cause SparkContext to submit a job (see the sketch below).
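
To make the distinction concrete, here is a minimal spark-shell sketch (the RDD names nums and doubled are illustrative, not from the original article): the map call only builds a new RDD, while count and collect each make SparkContext submit a job.

// Transformations such as map are lazy: no job is submitted yet.
val nums = sc.parallelize(1 to 5, 2)
val doubled = nums.map(_ * 2)
// Actions such as count and collect trigger SparkContext to submit a job.
doubled.count      // res0: Long = 5
doubled.collect    // res1: Array[Int] = Array(2, 4, 6, 8, 10)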

First, Value type transformation operators

1) map

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.map(_.length)
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))

2) flatMap

val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

3) mapPartitions

val x = sc.parallelize(1 to 10)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)
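
Note that the snippet above is really another flatMap example. For comparison, here is a minimal mapPartitions sketch (the function name sumPerPartition is illustrative, not from the original article): the supplied function receives one iterator per partition and returns a new iterator.

val a = sc.parallelize(1 to 9, 3)
// Sum the elements of each partition; one result per partition.
def sumPerPartition(iter: Iterator[Int]): Iterator[Int] = Iterator(iter.sum)
a.mapPartitions(sumPerPartition).collect
// With partitions (1,2,3), (4,5,6), (7,8,9) this yields Array(6, 15, 24).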

4) glom (form the elements of each partition into an array)

val a = sc.parallelize(1 to 100, 3)
a.glom.collect
res8: Array[Array[Int]] = Array(Array(1, 2, 3, ..., 32, 33), Array(34, 35, 36, ..., 65, 66), Array(67, 68, 69, ..., 99, 100))

5) union

val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)

6) cartesian (Cartesian product)

val x = sc.parallelize(List(1, 2, 3, 4, 5))
val y = sc.parallelize(List(6, 7, 8, 9, 10))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

7) groupBy (generate a key for each element; elements with the same key are grouped together)

val a = sc.parallelize(1 to 9, 3)
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7, 9)))

8) filter

val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res3: Array[Int] = Array(2, 4, 6, 8, 10)

9) distinct (remove duplicates)

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

10) subtract (remove the elements that also appear in another RDD)

val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.collect
res3: Array[Int] = Array(6, 9, 4, 7, 5, 8)

11) sample

val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1, 0).count
res24: Long = 960

12) takeSample

val x = sc.parallelize(1 to 1000, 3)
x.takeSample(true, 100, 1)
res3: Array[Int] = Array(339, 718, 810, 105, 71, 268, 333, 360, 341, 300, 68, 848, 431, 449, 773, 172, 802, 339, 431, 285, 937, 301, 167, 69, 330, 864, 40, 645, 65, 349, 613, 468, 982, 314, 160, 675, 232, 794, 577, 571, 805, 317, 136, 860, 522, 45, 628, 178, 321, 482, 657, 114, 332, 728, 901, 290, 175, 876, 227, 130, 863, 773, 559, 301, 694, 460, 839, 952, 664, 851, 260, 729, 823, 880, 792, 964, 614, 821, 683, 364, 80, 875, 813, 951, 663, 344, 546, 918, 436, 451, 397, 670, 756, 512, 391, 70, 213, 896, 123, 858)

13) cache, persist

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
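
cache is simply persist with the default MEMORY_ONLY storage level. Here is a minimal sketch of choosing an explicit storage level (the MEMORY_AND_DISK level and the RDD name d are illustrative, not from the original article):

import org.apache.spark.storage.StorageLevel

val d = sc.parallelize(1 to 100, 2)
d.persist(StorageLevel.MEMORY_AND_DISK)   // keep blocks in memory, spill to disk if needed
d.getStorageLevel                         // now reports a level with memory and disk enabled
d.unpersist()                             // release the cached blocks when no longer needed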

Second, Key-Value type transformation operators

1) mapValues

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))

2) combineByKey

val a = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
val b = sc.parallelize(List(1, 1, 2, 2,
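
The example above is truncated in the source. The following is a minimal sketch in the same spirit (the zip pairing and the list-building combiner functions are assumptions, not the original article's code): combineByKey takes a createCombiner function, a mergeValue function applied within a partition, and a mergeCombiners function applied across partitions.

val a = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
val b = sc.parallelize(List(1, 1, 2, 2, 2, 1, 2, 2, 2), 3)
val c = b.zip(a)                          // RDD[(Int, String)]
val d = c.combineByKey(
  (v: String) => List(v),                                     // createCombiner: start a list for a new key
  (acc: List[String], v: String) => v :: acc,                 // mergeValue: add a value within a partition
  (acc1: List[String], acc2: List[String]) => acc1 ::: acc2)  // mergeCombiners: merge partition results
d.collect
// e.g. Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf))); ordering may vary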
