Spark's Key-Value RDD Transformations (Reprint)

Source: Internet
Author: User

1. mapValues(func): applies a map operation to the values of a [K,V]-type RDD.
(Example 1): Add 2 to each person's age.

import org.apache.spark.{SparkConf, SparkContext}

object MapValues {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("map")
    val sc = new SparkContext(conf)
    val list = List(("mobin", 22), ("kpop", 20), ("lufei", 23))
    val rdd = sc.parallelize(list)
    val mapValuesRDD = rdd.mapValues(_ + 2)
    mapValuesRDD.foreach(println)
  }
}
Output:
(mobin,24)
(kpop,22)
(lufei,25)
(RDD dependency graph: a red block represents an RDD, and the black blocks inside it represent its partitions; the same applies to the diagrams below.)




2. flatMapValues(func): applies a flatMap operation to the values of a [K,V]-type RDD.
(Example 2):

// SparkConf/SparkContext setup omitted
val list = List(("mobin", 22), ("kpop", 20), ("lufei", 23))
val rdd = sc.parallelize(list)
val flatMapValuesRDD = rdd.flatMapValues(x => Seq(x, "male"))
flatMapValuesRDD.foreach(println)

Output:
(mobin,22)
(mobin,male)
(kpop,20)
(kpop,male)
(lufei,23)
(lufei,male)
If mapValues were used instead, the output would be:
(mobin,List(22, male))
(kpop,List(20, male))
(lufei,List(23, male))
(RDD dependency graph)
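The contrast between flatMapValues and mapValues can be reproduced with plain Scala collections, no cluster required. This is an illustrative sketch, not Spark code; the object name ValuesDemo is made up:

```scala
// Plain-Scala sketch (no Spark) of the flatMapValues vs. mapValues contrast.
object ValuesDemo {
  def main(args: Array[String]): Unit = {
    val pairs = List(("mobin", 22), ("kpop", 20), ("lufei", 23))

    // flatMapValues-style: every value expands into several (key, value) pairs
    val flat = pairs.flatMap { case (k, v) => Seq[Any](v, "male").map(x => (k, x)) }

    // mapValues-style: every value maps to exactly one pair, whose value is the whole Seq
    val mapped = pairs.map { case (k, v) => (k, Seq[Any](v, "male")) }

    flat.foreach(println)   // two lines per key, e.g. (mobin,22) then (mobin,male)
    mapped.foreach(println) // one line per key, e.g. (mobin,List(22, male))
  }
}
```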




3. combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)

combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions)

combineByKey(createCombiner, mergeValue, mergeCombiners)

createCombiner: the combiner-creation function, invoked the first time a key is encountered; it converts a value of type V in the RDD into a value of the combined type C (V => C).
See Example 3.




mergeValue: the value-merging function, invoked when the same key is encountered again; it merges the C-type value accumulated so far with the newly arriving V-type value into a new C-type value ((C, V) => C).
See Example 3.

mergeCombiners: the combiner-merging function, which merges two C-type values (one from each partition) into a single C-type value ((C, C) => C).
See Example 3.

partitioner: an existing or custom partition function; the default is HashPartitioner.

mapSideCombine: whether to perform combining on the map side; the default is true.

Note that the parameter types of the three functions above must correspond: createCombiner is called the first time a key is encountered, and mergeValue is called to merge the value each time the same key is encountered again.
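To see how createCombiner and mergeValue cooperate when the value type changes (V => C), here is a hedged, plain-Scala emulation (single partition, no Spark; combineByKeyLocal is an invented illustrative helper) that accumulates a per-key (sum, count) pair and derives an average:

```scala
// Plain-Scala emulation (single partition, no Spark) of the createCombiner /
// mergeValue cooperation; combineByKeyLocal is an invented illustrative helper.
object CombineSketch {
  def combineByKeyLocal[K, V, C](data: Seq[(K, V)],
                                 createCombiner: V => C,
                                 mergeValue: (C, V) => C): Map[K, C] =
    data.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case None    => acc.updated(k, createCombiner(v)) // first time this key is seen
        case Some(c) => acc.updated(k, mergeValue(c, v))  // key seen again
      }
    }

  def main(args: Array[String]): Unit = {
    val scores = Seq(("A", 10), ("A", 20), ("B", 30))
    // V = Int, C = (Int, Int): the value type changes to a (sum, count) pair
    val sumCount = combineByKeyLocal[String, Int, (Int, Int)](
      scores,
      (v: Int) => (v, 1),
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1))
    val avg = sumCount.map { case (k, (s, n)) => (k, s.toDouble / n) }
    println(avg) // Map(A -> 15.0, B -> 30.0)
  }
}
```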

(Example 3): Count the number of males and females, and output in the form (sex, (List(name, name, ...), count)).

import org.apache.spark.{SparkConf, SparkContext}

object CombineByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("combineByKey")
    val sc = new SparkContext(conf)
    val people = List(("Male", "Mobin"), ("Male", "Kpop"), ("Female", "Lucy"), ("Male", "Lufei"), ("Female", "Amy"))
    val rdd = sc.parallelize(people)
    val combineByKeyRDD = rdd.combineByKey(
      (x: String) => (List(x), 1),
      (peo: (List[String], Int), x: String) => (x :: peo._1, peo._2 + 1),
      (sex1: (List[String], Int), sex2: (List[String], Int)) => (sex1._1 ::: sex2._1, sex1._2 + sex2._2))
    combineByKeyRDD.foreach(println)
    sc.stop()
  }
}

Output:
(Male,(List(Lufei, Kpop, Mobin),3))
(Female,(List(Amy, Lucy),2))
Process decomposition:

Partition1:
k="Male"   -> ("Male","Mobin")  -> createCombiner("Mobin") -> peo1 = (List("Mobin"), 1)
k="Male"   -> ("Male","Kpop")   -> mergeValue(peo1, "Kpop") -> peo2 = ("Kpop" :: peo1._1, 1 + 1)  // same key, so mergeValue merges the value
k="Female" -> ("Female","Lucy") -> createCombiner("Lucy")   -> peo3 = (List("Lucy"), 1)

Partition2:
k="Male"   -> ("Male","Lufei") -> createCombiner("Lufei") -> peo4 = (List("Lufei"), 1)
k="Female" -> ("Female","Amy") -> createCombiner("Amy")   -> peo5 = (List("Amy"), 1)

Merging the partitions:
k="Male"   -> mergeCombiners(peo2, peo4) = (List("Lufei", "Kpop", "Mobin"), 3)
k="Female" -> mergeCombiners(peo3, peo5) = (List("Amy", "Lucy"), 2)
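The decomposition above can be replayed as ordinary Scala function applications. This is a sketch (the names peo1 through peo5 mirror the walkthrough); note that in a real Spark run the order of names inside the resulting list depends on the order in which partition combiners are merged:

```scala
// Plain-Scala replay of the Example 3 walkthrough; C = (List[String], Int).
object CombineWalkthrough {
  type C = (List[String], Int)
  val createCombiner: String => C = x => (List(x), 1)
  val mergeValue: (C, String) => C = (peo, x) => (x :: peo._1, peo._2 + 1)
  val mergeCombiners: (C, C) => C = (a, b) => (a._1 ::: b._1, a._2 + b._2)

  def main(args: Array[String]): Unit = {
    // partition 1
    val peo1 = createCombiner("Mobin")  // (List(Mobin),1)
    val peo2 = mergeValue(peo1, "Kpop") // (List(Kpop, Mobin),2)
    val peo3 = createCombiner("Lucy")   // (List(Lucy),1)
    // partition 2
    val peo4 = createCombiner("Lufei")  // (List(Lufei),1)
    val peo5 = createCombiner("Amy")    // (List(Amy),1)
    // merging the partitions
    println(mergeCombiners(peo2, peo4)) // (List(Kpop, Mobin, Lufei),3)
    println(mergeCombiners(peo3, peo5)) // (List(Lucy, Amy),2)
  }
}
```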

(RDD dependency graph)





4. foldByKey(zeroValue)(func)

foldByKey(zeroValue, partitioner)(func)

foldByKey(zeroValue, numPartitions)(func)

The foldByKey function is implemented by calling the combineByKey function.

zeroValue: the initial value for each key's fold. It is realized through combineByKey's createCombiner, which maps the first V seen for a key (per partition) to func(zeroValue, V). In Example 4 this means each key's aggregation starts from 2.

func: values with the same key are merged by func (actually implemented by combineByKey's mergeValue and mergeCombiners functions, which here are one and the same).
Example 4:
// SparkConf/SparkContext setup omitted
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3))
val rdd = sc.parallelize(people)
val foldByKeyRDD = rdd.foldByKey(2)(_ + _)
foldByKeyRDD.foreach(println)

Output:
(Amy,3)
(Mobin,5)
(Lucy,7)
Each key's fold starts from the zeroValue 2, after which the values with the same key are added: e.g. Mobin: 2 + 2 + 1 = 5. Note that zeroValue is applied once per key in each partition, not once per value.


5. reduceByKey(func, numPartitions): groups by key and aggregates the values with the given func; numPartitions sets the number of partitions to raise job parallelism.
Example 5

// SparkConf/SparkContext setup omitted
val arr = List(("A", 3), ("A", 2), ("B", 1), ("B", 3))
val rdd = sc.parallelize(arr)
val reduceByKeyRDD = rdd.reduceByKey(_ + _)
reduceByKeyRDD.foreach(println)
sc.stop()

Output:
(A,5)
(B,4)
(RDD dependency graph)



6. groupByKey(numPartitions): groups by key and returns an RDD of [K, Iterable[V]]; numPartitions sets the number of partitions to raise job parallelism.
Example 6:

// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val groupByKeyRDD = rdd.groupByKey()
groupByKeyRDD.foreach(println)
sc.stop()

Output:
(B,CompactBuffer(2, 3))
(A,CompactBuffer(1, 2))

The foldByKey, reduceByKey, and groupByKey functions above are all ultimately implemented by calling the combineByKey function.
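As a hedged sketch of that claim (simplified: the real Spark implementations also thread through partitioners, serializers, and CompactBuffers), each of the three can be phrased as a combineByKey call:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: simplified combineByKey phrasings of the three operations.
// The real Spark implementations also handle partitioners and serialization.
object ViaCombineByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("viaCombineByKey"))
    val rdd = sc.parallelize(List(("A", 1), ("B", 2), ("A", 2), ("B", 3)))

    // roughly rdd.reduceByKey(_ + _)
    val reduced = rdd.combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)

    // roughly rdd.foldByKey(10)(_ + _): zeroValue is folded in by createCombiner
    val folded = rdd.combineByKey((v: Int) => 10 + v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)

    // roughly rdd.groupByKey(), modulo element order and the buffer type
    val grouped = rdd.combineByKey((v: Int) => List(v), (c: List[Int], v: Int) => v :: c, (c1: List[Int], c2: List[Int]) => c1 ::: c2)

    reduced.foreach(println)
    folded.foreach(println)
    grouped.foreach(println)
    sc.stop()
  }
}
```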

7. sortByKey(ascending, numPartitions): returns an RDD of (K,V) pairs sorted by key; ascending = true means ascending order, false means descending; numPartitions sets the number of partitions to raise job parallelism.
Example 7:

// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val sortByKeyRDD = rdd.sortByKey()
sortByKeyRDD.foreach(println)
sc.stop()

Output:
(A,1)
(A,2)
(B,2)
(B,3)

8. cogroup(otherDataset, numPartitions): for two RDDs (e.g. of (K,V) and (K,W) pairs), gathers the elements with the same key from each, returning an RDD of (K, (Iterable[V], Iterable[W])); numPartitions sets the number of partitions to raise job parallelism.
Example 8:

// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr, 3)
val rdd2 = sc.parallelize(arr1, 3)
val cogroupRDD = rdd1.cogroup(rdd2)
cogroupRDD.foreach(println)
sc.stop()

Output:
(B,(CompactBuffer(2, 3),CompactBuffer(B1, B2)))
(A,(CompactBuffer(1, 2),CompactBuffer(A1, A2)))
(RDD dependency graph)




9. join(otherDataset, numPartitions): first performs a cogroup on the two RDDs to form a new RDD, then takes the Cartesian product of the values under each key; numPartitions sets the number of partitions to raise job parallelism.
Example 9
// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val joinRDD = rdd.join(rdd1)
joinRDD.foreach(println)

Output:

(B, (2,B1))
(B, (2,B2))
(B, (3,B1))
(B, (3,B2))

(A, (1,A1))
(A, (1,A2))
(A, (2,A1))
(A, (2,A2))

(RDD dependency graph)
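The "cogroup, then per-key Cartesian product" description can be imitated with plain Scala collections. This is a sketch of the idea, not Spark's implementation; the object and method names are made up:

```scala
// Plain-Scala sketch of "join = cogroup + per-key Cartesian product".
object JoinSketch {
  def join[K, V, W](left: List[(K, V)], right: List[(K, W)]): List[(K, (V, W))] = {
    // cogroup step: collect each side's values under the shared key
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    val cogrouped = keys.map { k =>
      (k, (left.collect { case (`k`, v) => v }, right.collect { case (`k`, w) => w }))
    }
    // join step: Cartesian product of the two value groups per key
    cogrouped.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
  }

  def main(args: Array[String]): Unit = {
    val left  = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
    val right = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
    join(left, right).foreach(println) // (A,(1,A1)) ... (B,(3,B2))
  }
}
```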




10. leftOuterJoin(otherDataset, numPartitions): left outer join; contains all the data of the left RDD, and if there is no match on the right, the value is None; numPartitions sets the number of partitions to raise job parallelism.
Example 10:
// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3), ("C", 1))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val leftOuterJoinRDD = rdd.leftOuterJoin(rdd1)
leftOuterJoinRDD.foreach(println)
sc.stop()

Output:

(B,(2,Some(B1)))
(B,(2,Some(B2)))
(B,(3,Some(B1)))
(B,(3,Some(B2)))

(C,(1,None))

(A,(1,Some(A1)))
(A,(1,Some(A2)))
(A,(2,Some(A1)))
(A,(2,Some(A2)))
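The Some/None pattern in this output can be mirrored in plain Scala (an illustrative sketch; leftOuterJoinLocal is an invented name): every left pair survives, and the right-hand value is wrapped in an Option.

```scala
// Plain-Scala sketch of leftOuterJoin semantics: Option wraps the right side.
object LeftOuterSketch {
  def leftOuterJoinLocal[K, V, W](left: List[(K, V)], right: List[(K, W)]): List[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches = right.collect { case (`k`, w) => w }
      // no match on the right: keep the left pair with None
      val opts: List[Option[W]] = if (matches.isEmpty) List(None) else matches.map(Some(_))
      opts.map(o => (k, (v, o)))
    }

  def main(args: Array[String]): Unit = {
    val left  = List(("A", 1), ("C", 1))
    val right = List(("A", "A1"), ("A", "A2"))
    leftOuterJoinLocal(left, right).foreach(println)
    // (A,(1,Some(A1))), (A,(1,Some(A2))), (C,(1,None))
  }
}
```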

11. rightOuterJoin(otherDataset, numPartitions): right outer join; contains all the data of the right RDD, and if there is no match on the left, the value is None; numPartitions sets the number of partitions to raise job parallelism.
Example 11:
// SparkConf/SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"), ("C", "C1"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val rightOuterJoinRDD = rdd.rightOuterJoin(rdd1)
rightOuterJoinRDD.foreach(println)
sc.stop()
Output:

(B,(Some(2),B1))
(B,(Some(2),B2))
(B,(Some(3),B1))
(B,(Some(3),B2))

(C,(None,C1))

(A,(Some(1),A1))
(A,(Some(1),A2))
(A,(Some(2),A1))
(A,(Some(2),A2))
