Spark's Key-Value RDD Transformations (Reprint)

Source: Internet
Author: User

1. mapValues(func): applies a map operation to the values of a [K,V]-type RDD.
(Example 1): Add 2 to each person's age.

import org.apache.spark.{SparkConf, SparkContext}

object MapValues {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("map")
    val sc = new SparkContext(conf)
    val list = List(("mobin", 22), ("kpop", 20), ("lufei", 23))
    val rdd = sc.parallelize(list)
    val mapValuesRDD = rdd.mapValues(_ + 2)
    mapValuesRDD.foreach(println)
  }
}
Output:
(mobin,24)
(kpop,22)
(lufei,25)
(RDD dependency graph: a red block represents an RDD, and the black blocks inside it represent its partitions; the same applies to the diagrams below.)




2. flatMapValues(func): applies a flatMap operation to the values of a [K,V]-type RDD.
(Example 2):

// SparkConf/SparkContext setup omitted
val list = List(("mobin", 22), ("kpop", 20), ("lufei", 23))
val rdd = sc.parallelize(list)
val flatMapValuesRDD = rdd.flatMapValues(x => Seq(x, "male"))
flatMapValuesRDD.foreach(println)

Output:
(mobin,22)
(mobin,male)
(kpop,20)
(kpop,male)
(lufei,23)
(lufei,male)
If mapValues were used instead, the output would be:
(mobin,List(22, male))
(kpop,List(20, male))
(lufei,List(23, male))
(RDD dependency graph)
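The contrast between flatMapValues and mapValues can be reproduced with plain Scala collections, no cluster required. This is an illustrative sketch, not Spark code; the object name ValuesDemo is made up:

```scala
// Plain-Scala sketch (no Spark) of the flatMapValues vs. mapValues contrast.
object ValuesDemo {
  def main(args: Array[String]): Unit = {
    val pairs = List(("mobin", 22), ("kpop", 20), ("lufei", 23))

    // flatMapValues-style: every value expands into several (key, value) pairs
    val flat = pairs.flatMap { case (k, v) => Seq[Any](v, "male").map(x => (k, x)) }

    // mapValues-style: every value maps to exactly one pair, whose value is the whole Seq
    val mapped = pairs.map { case (k, v) => (k, Seq[Any](v, "male")) }

    flat.foreach(println)   // two lines per key, e.g. (mobin,22) then (mobin,male)
    mapped.foreach(println) // one line per key, e.g. (mobin,List(22, male))
  }
}
```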




3. combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)

combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions)

combineByKey(createCombiner, mergeValue, mergeCombiners)

createCombiner: the combiner-creation function, invoked the first time a key is encountered; it converts a value of type V in the RDD into a value of the combined type C (V => C).
See Example 3.




mergeValue: the value-merging function, invoked when the same key is encountered again; it merges the C-type value accumulated so far with the newly arriving V-type value into a new C-type value ((C, V) => C).
See Example 3.

mergeCombiners: the combiner-merging function, which merges two C-type values (one from each partition) into a single C-type value ((C, C) => C).
See Example 3.

partitioner: an existing or custom partition function; the default is HashPartitioner.

mapSideCombine: whether to perform combining on the map side; the default is true.

Note that the parameter types of the three functions above must correspond: createCombiner is called the first time a key is encountered, and mergeValue is called to merge the value each time the same key is encountered again.
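To see how createCombiner and mergeValue cooperate when the value type changes (V => C), here is a hedged, plain-Scala emulation (single partition, no Spark; combineByKeyLocal is an invented illustrative helper) that accumulates a per-key (sum, count) pair and derives an average:

```scala
// Plain-Scala emulation (single partition, no Spark) of the createCombiner /
// mergeValue cooperation; combineByKeyLocal is an invented illustrative helper.
object CombineSketch {
  def combineByKeyLocal[K, V, C](data: Seq[(K, V)],
                                 createCombiner: V => C,
                                 mergeValue: (C, V) => C): Map[K, C] =
    data.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case None    => acc.updated(k, createCombiner(v)) // first time this key is seen
        case Some(c) => acc.updated(k, mergeValue(c, v))  // key seen again
      }
    }

  def main(args: Array[String]): Unit = {
    val scores = Seq(("A", 10), ("A", 20), ("B", 30))
    // V = Int, C = (Int, Int): the value type changes to a (sum, count) pair
    val sumCount = combineByKeyLocal[String, Int, (Int, Int)](
      scores,
      (v: Int) => (v, 1),
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1))
    val avg = sumCount.map { case (k, (s, n)) => (k, s.toDouble / n) }
    println(avg) // Map(A -> 15.0, B -> 30.0)
  }
}
```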

(Example 3): Count the number of males and females, and output in the form (sex, (List(name, name, ...), count)).

import org.apache.spark.{SparkConf, SparkContext}

object CombineByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("combineByKey")
    val sc = new SparkContext(conf)
    val people = List(("Male", "Mobin"), ("Male", "Kpop"), ("Female", "Lucy"), ("Male", "Lufei"), ("Female", "Amy"))
    val rdd = sc.parallelize(people)
    val combineByKeyRDD = rdd.combineByKey(
      (x: String) => (List(x), 1),
      (peo: (List[String], Int), x: String) => (x :: peo._1, peo._2 + 1),
      (sex1: (List[String], Int), sex2: (List[String], Int)) => (sex1._1 ::: sex2._1, sex1._2 + sex2._2))
    combineByKeyRDD.foreach(println)
    sc.stop()
  }
}

Output:
(Male,(List(Lufei, Kpop, Mobin),3))
(Female,(List(Amy, Lucy),2))
Process decomposition:

Partition1:
k="Male"   -> ("Male","Mobin")  -> createCombiner("Mobin") -> peo1 = (List("Mobin"), 1)
k="Male"   -> ("Male","Kpop")   -> mergeValue(peo1, "Kpop") -> peo2 = ("Kpop" :: peo1._1, 1 + 1)  // same key, so mergeValue merges the value
k="Female" -> ("Female","Lucy") -> createCombiner("Lucy")   -> peo3 = (List("Lucy"), 1)

Partition2:
k="Male"   -> ("Male","Lufei") -> createCombiner("Lufei") -> peo4 = (List("Lufei"), 1)
k="Female" -> ("Female","Amy") -> createCombiner("Amy")   -> peo5 = (List("Amy"), 1)

Merging the partitions:
k="Male"   -> mergeCombiners(peo2, peo4) = (List("Lufei", "Kpop", "Mobin"), 3)
k="Female" -> mergeCombiners(peo3, peo5) = (List("Amy", "Lucy"), 2)
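The decomposition above can be replayed as ordinary Scala function applications. This is a sketch (the names peo1 through peo5 mirror the walkthrough); note that in a real Spark run the order of names inside the resulting list depends on the order in which partition combiners are merged:

```scala
// Plain-Scala replay of the Example 3 walkthrough; C = (List[String], Int).
object CombineWalkthrough {
  type C = (List[String], Int)
  val createCombiner: String => C = x => (List(x), 1)
  val mergeValue: (C, String) => C = (peo, x) => (x :: peo._1, peo._2 + 1)
  val mergeCombiners: (C, C) => C = (a, b) => (a._1 ::: b._1, a._2 + b._2)

  def main(args: Array[String]): Unit = {
    // partition 1
    val peo1 = createCombiner("Mobin")  // (List(Mobin),1)
    val peo2 = mergeValue(peo1, "Kpop") // (List(Kpop, Mobin),2)
    val peo3 = createCombiner("Lucy")   // (List(Lucy),1)
    // partition 2
    val peo4 = createCombiner("Lufei")  // (List(Lufei),1)
    val peo5 = createCombiner("Amy")    // (List(Amy),1)
    // merging the partitions
    println(mergeCombiners(peo2, peo4)) // (List(Kpop, Mobin, Lufei),3)
    println(mergeCombiners(peo3, peo5)) // (List(Lucy, Amy),2)
  }
}
```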

(RDD dependency graph)





4. foldByKey(zeroValue)(func)

foldByKey(zeroValue, partitioner)(func)

foldByKey(zeroValue, numPartitions)(func)

The foldByKey function is implemented by calling the combineByKey function.

zeroValue: the initial value for each key's fold. It is realized through combineByKey's createCombiner, which maps the first V seen for a key (per partition) to func(zeroValue, V). In Example 4 this means each key's aggregation starts from 2.

func: values with the same key are merged by func (actually implemented by combineByKey's mergeValue and mergeCombiners functions, which here are one and the same).
Example 4:
// SparkConf/SparkContext setup omitted
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3))
val rdd = sc.parallelize(people)
val foldByKeyRDD = rdd.foldByKey(2)(_ + _)
foldByKeyRDD.foreach(println)

Output:
(Amy,3)
(Mobin,5)
(Lucy,7)
Each key's fold starts from the zeroValue 2, after which the values with the same key are added: e.g. Mobin: 2 + 2 + 1 = 5. Note that zeroValue is applied once per key in each partition, not once per value.


5. reduceByKey(func, numPartitions): groups by key and aggregates the values with the given func; numPartitions sets the number of partitions to raise job parallelism.
Example 5

// SparkConf/SparkContext setup omitted
val arr = List(("A", 3), ("A", 2), ("B", 1), ("B", 3))
val rdd = sc.parallelize(arr)
val reduceByKeyRDD = rdd.reduceByKey(_ + _)
reduceByKeyRDD.foreach(println)
sc.stop()

Output:
(A,5)
(B,4)
(RDD dependency graph)



6. groupByKey(numPartitions): groups by key and returns an RDD of [K, Iterable[V]]; numPartitions sets the number of partitions to raise job parallelism.
Example 6:

// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val groupByKeyRDD = rdd.groupByKey()
groupByKeyRDD.foreach(println)
sc.stop()

Output:
(B,CompactBuffer(2, 3))
(A,CompactBuffer(1, 2))

The foldByKey, reduceByKey, and groupByKey functions above are all ultimately implemented by calling the combineByKey function.
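As a hedged sketch of that claim (simplified: the real Spark implementations also thread through partitioners, serializers, and CompactBuffers), each of the three can be phrased as a combineByKey call:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: simplified combineByKey phrasings of the three operations.
// The real Spark implementations also handle partitioners and serialization.
object ViaCombineByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("viaCombineByKey"))
    val rdd = sc.parallelize(List(("A", 1), ("B", 2), ("A", 2), ("B", 3)))

    // roughly rdd.reduceByKey(_ + _)
    val reduced = rdd.combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)

    // roughly rdd.foldByKey(10)(_ + _): zeroValue is folded in by createCombiner
    val folded = rdd.combineByKey((v: Int) => 10 + v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)

    // roughly rdd.groupByKey(), modulo element order and the buffer type
    val grouped = rdd.combineByKey((v: Int) => List(v), (c: List[Int], v: Int) => v :: c, (c1: List[Int], c2: List[Int]) => c1 ::: c2)

    reduced.foreach(println)
    folded.foreach(println)
    grouped.foreach(println)
    sc.stop()
  }
}
```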

7. sortByKey(ascending, numPartitions): returns an RDD of (K,V) pairs sorted by key; ascending = true means ascending order, false means descending; numPartitions sets the number of partitions to raise job parallelism.
Example 7:

// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val sortByKeyRDD = rdd.sortByKey()
sortByKeyRDD.foreach(println)
sc.stop()

Output:
(A,1)
(A,2)
(B,2)
(B,3)

8. cogroup(otherDataset, numPartitions): for two RDDs (e.g. of (K,V) and (K,W) pairs), gathers the elements with the same key from each, returning an RDD of (K, (Iterable[V], Iterable[W])); numPartitions sets the number of partitions to raise job parallelism.
Example 8:

// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr, 3)
val rdd2 = sc.parallelize(arr1, 3)
val cogroupRDD = rdd1.cogroup(rdd2)
cogroupRDD.foreach(println)
sc.stop()

Output:
(B,(CompactBuffer(2, 3),CompactBuffer(B1, B2)))
(A,(CompactBuffer(1, 2),CompactBuffer(A1, A2)))
(RDD dependency graph)




9. join(otherDataset, numPartitions): first performs a cogroup on the two RDDs to form a new RDD, then takes the Cartesian product of the values under each key; numPartitions sets the number of partitions to raise job parallelism.
Example 9
// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val joinRDD = rdd.join(rdd1)
joinRDD.foreach(println)

Output:

(B, (2,B1))
(B, (2,B2))
(B, (3,B1))
(B, (3,B2))

(A, (1,A1))
(A, (1,A2))
(A, (2,A1))
(A, (2,A2))

(RDD dependency graph)
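The "cogroup, then per-key Cartesian product" description can be imitated with plain Scala collections. This is a sketch of the idea, not Spark's implementation; the object and method names are made up:

```scala
// Plain-Scala sketch of "join = cogroup + per-key Cartesian product".
object JoinSketch {
  def join[K, V, W](left: List[(K, V)], right: List[(K, W)]): List[(K, (V, W))] = {
    // cogroup step: collect each side's values under the shared key
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    val cogrouped = keys.map { k =>
      (k, (left.collect { case (`k`, v) => v }, right.collect { case (`k`, w) => w }))
    }
    // join step: Cartesian product of the two value groups per key
    cogrouped.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
  }

  def main(args: Array[String]): Unit = {
    val left  = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
    val right = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
    join(left, right).foreach(println) // (A,(1,A1)) ... (B,(3,B2))
  }
}
```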




10. leftOuterJoin(otherDataset, numPartitions): left outer join; contains all the data of the left RDD, and if there is no match on the right, the value is None; numPartitions sets the number of partitions to raise job parallelism.
Example 10:
// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3), ("C", 1))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val leftOuterJoinRDD = rdd.leftOuterJoin(rdd1)
leftOuterJoinRDD.foreach(println)
sc.stop()

Output:

(B,(2,Some(B1)))
(B,(2,Some(B2)))
(B,(3,Some(B1)))
(B,(3,Some(B2)))

(C,(1,None))

(A,(1,Some(A1)))
(A,(1,Some(A2)))
(A,(2,Some(A1)))
(A,(2,Some(A2)))
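The Some/None pattern in this output can be mirrored in plain Scala (an illustrative sketch; leftOuterJoinLocal is an invented name): every left pair survives, and the right-hand value is wrapped in an Option.

```scala
// Plain-Scala sketch of leftOuterJoin semantics: Option wraps the right side.
object LeftOuterSketch {
  def leftOuterJoinLocal[K, V, W](left: List[(K, V)], right: List[(K, W)]): List[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches = right.collect { case (`k`, w) => w }
      // no match on the right: keep the left pair with None
      val opts: List[Option[W]] = if (matches.isEmpty) List(None) else matches.map(Some(_))
      opts.map(o => (k, (v, o)))
    }

  def main(args: Array[String]): Unit = {
    val left  = List(("A", 1), ("C", 1))
    val right = List(("A", "A1"), ("A", "A2"))
    leftOuterJoinLocal(left, right).foreach(println)
    // (A,(1,Some(A1))), (A,(1,Some(A2))), (C,(1,None))
  }
}
```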

11. rightOuterJoin(otherDataset, numPartitions): right outer join; contains all the data of the right RDD, and if there is no match on the left, the value is None; numPartitions sets the number of partitions to raise job parallelism.
Example 11:
// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"), ("C", "C1"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val rightOuterJoinRDD = rdd.rightOuterJoin(rdd1)
rightOuterJoinRDD.foreach(println)
sc.stop()
Output:

(B,(Some(2),B1))
(B,(Some(2),B2))
(B,(Some(3),B1))
(B,(Some(3),B2))

(C,(None,C1))

(A,(Some(1),A1))
(A,(Some(1),A2))
(A,(Some(2),A1))
(A,(Some(2),A2))
