1. mapValues(func): performs a map operation on the V values of [K,V]-typed data, leaving the keys unchanged
(Example 1): add 2 to each person's age
import org.apache.spark.{SparkConf, SparkContext}

object MapValues {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("map")
    val sc = new SparkContext(conf)
    val list = List(("mobin", 22), ("kpop", 20), ("lufei", 23))
    val rdd = sc.parallelize(list)
    val mapValuesRDD = rdd.mapValues(_ + 2)
    mapValuesRDD.foreach(println)
  }
}
Output:
(mobin,24)
(kpop,22)
(lufei,25)
(RDD dependency graph: a red block represents an RDD, the black blocks represent its partition collection; blocks of the same color belong to the same RDD)
2. flatMapValues(func): performs a flatMap operation on the V values of [K,V]-typed data
(Example 2):
// SparkConf and SparkContext setup omitted
val list = List(("Mobin", 22), ("Kpop", 20), ("Lufei", 23))
val rdd = sc.parallelize(list)
val flatMapValuesRDD = rdd.flatMapValues(x => Seq(x, "male"))
flatMapValuesRDD.foreach(println)
Output:
(Mobin,22)
(Mobin,male)
(Kpop,20)
(Kpop,male)
(Lufei,23)
(Lufei,male)
If mapValues were used instead, the output would be:
(Mobin,List(22, male))
(Kpop,List(20, male))
(Lufei,List(23, male))
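The contrast can be sketched with plain Scala collections, without Spark. `FlatMapValuesSketch` below is a hypothetical helper, and the ages are kept as strings so the produced Seq stays uniformly typed:

```scala
// Plain-collections sketch of flatMapValues vs. mapValues semantics.
object FlatMapValuesSketch {
  val pairs = List(("Mobin", "22"), ("Kpop", "20"), ("Lufei", "23"))

  // flatMapValues-style: each value expands into several (key, element) pairs
  def flattened: List[(String, String)] =
    pairs.flatMap { case (k, v) => Seq(v, "male").map(e => (k, e)) }

  // mapValues-style: the whole Seq stays attached to the key as one value
  def nested: List[(String, Seq[String])] =
    pairs.map { case (k, v) => (k, Seq(v, "male")) }

  def main(args: Array[String]): Unit = {
    flattened.foreach(println) // (Mobin,22), (Mobin,male), (Kpop,20), ...
    nested.foreach(println)    // (Mobin,List(22, male)), ...
  }
}
```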
(Rdd dependency graph)
3. combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)
combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions)
combineByKey(createCombiner, mergeValue, mergeCombiners)
createCombiner: the combiner-creation function, called the first time a key is encountered; it converts a V-type value in the RDD dataset into a C-type value (V => C), as in Example 3.
mergeValue: the value-merging function, called when the same key is encountered again in the same partition; it merges the C-type value built by createCombiner with the incoming V-type value into a new C-type value ((C, V) => C), as in Example 3.
mergeCombiners: the combiner-merging function, which merges two C-type values from different partitions into one C-type value ((C, C) => C), as in Example 3.
partitioner: an existing or custom partitioning function; defaults to HashPartitioner
mapSideCombine: whether to perform the combine operation on the map side; defaults to true
Note that the parameter types of these three functions must correspond: createCombiner is called the first time a key is encountered, and mergeValue is called to merge values each time the same key is seen again.
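How the three functions cooperate can be sketched with plain Scala collections, without Spark. `combinePartition` and `mergePartitions` below are hypothetical stand-ins for what Spark does inside and across partitions; note the exact list order in the result depends on which partition merges first:

```scala
// Plain-collections sketch of combineByKey's three functions.
object CombineSketch {
  // createCombiner: first sighting of a key turns V into C
  val create: String => (List[String], Int) = x => (List(x), 1)
  // mergeValue: same key seen again in the same partition, (C, V) => C
  val mergeV: ((List[String], Int), String) => (List[String], Int) =
    (c, x) => (x :: c._1, c._2 + 1)
  // mergeCombiners: per-partition results merged, (C, C) => C
  val mergeC: ((List[String], Int), (List[String], Int)) => (List[String], Int) =
    (a, b) => (a._1 ::: b._1, a._2 + b._2)

  // fold one partition's records into per-key combiners
  def combinePartition[K, V, C](part: Seq[(K, V)], createCombiner: V => C,
                                mergeValue: (C, V) => C): Map[K, C] =
    part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }

  // combine the per-partition maps key by key
  def mergePartitions[K, C](m1: Map[K, C], m2: Map[K, C],
                            mergeCombiners: (C, C) => C): Map[K, C] =
    (m1.keySet ++ m2.keySet).map { k =>
      k -> List(m1.get(k), m2.get(k)).flatten.reduce(mergeCombiners)
    }.toMap

  val part1 = Seq(("Male", "Mobin"), ("Male", "Kpop"), ("Female", "Lucy"))
  val part2 = Seq(("Male", "Lufei"), ("Female", "Amy"))

  def result: Map[String, (List[String], Int)] =
    mergePartitions(combinePartition(part1, create, mergeV),
                    combinePartition(part2, create, mergeV), mergeC)

  def main(args: Array[String]): Unit =
    result.foreach(println) // Male -> (List(Kpop, Mobin, Lufei),3), Female -> (List(Lucy, Amy),2)
}
```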
(Example 3): count the number of males and females, and output in the form (sex, (List(name, name, …), count))
import org.apache.spark.{SparkConf, SparkContext}

object CombineByKey {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("CombineByKey")
    val sc = new SparkContext(conf)
    val people = List(("Male", "Mobin"), ("Male", "Kpop"), ("Female", "Lucy"), ("Male", "Lufei"), ("Female", "Amy"))
    val rdd = sc.parallelize(people)
    val combineByKeyRDD = rdd.combineByKey(
      (x: String) => (List(x), 1),
      (peo: (List[String], Int), x: String) => (x :: peo._1, peo._2 + 1),
      (sex1: (List[String], Int), sex2: (List[String], Int)) => (sex1._1 ::: sex2._1, sex1._2 + sex2._2))
    combineByKeyRDD.foreach(println)
    sc.stop()
  }
}
Output:
(Male,(List(Lufei, Kpop, Mobin),3))
(Female,(List(Amy, Lucy),2))
Process decomposition:
Partition 1:
k = "Male"   => ("Male","Mobin")  => createCombiner("Mobin") => peo1 = (List("Mobin"), 1)
k = "Male"   => ("Male","Kpop")   => mergeValue(peo1, "Kpop") => peo2 = ("Kpop" :: peo1._1, 1 + 1)  // same key: mergeValue merges the value
k = "Female" => ("Female","Lucy") => createCombiner("Lucy") => peo3 = (List("Lucy"), 1)
Partition 2:
k = "Male"   => ("Male","Lufei")  => createCombiner("Lufei") => peo4 = (List("Lufei"), 1)
k = "Female" => ("Female","Amy")  => createCombiner("Amy") => peo5 = (List("Amy"), 1)
Merging partitions:
k = "Male"   => mergeCombiners(peo2, peo4) = (List("Lufei","Kpop","Mobin"), 3)
k = "Female" => mergeCombiners(peo3, peo5) = (List("Amy","Lucy"), 2)
(Rdd dependency graph)
4. foldByKey(zeroValue)(func)
foldByKey(zeroValue, partitioner)(func)
foldByKey(zeroValue, numPartitions)(func)
The foldByKey function is implemented by calling the combineByKey function.
zeroValue: initializes each V. This is actually done by combineByKey's createCombiner, which maps the first V for a key in each partition to func(zeroValue, v); in Example 4 this means the first value of each key per partition becomes 2 + v.
func: values with the same key are merged by func (actually implemented by combineByKey's mergeValue and mergeCombiners, which are the same function here).
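The role of zeroValue can be sketched on plain Scala collections, without Spark. `foldPartition` below is a hypothetical stand-in for the per-partition fold; it shows that the zero enters once per key in each partition, so a non-neutral zero makes the result depend on partitioning:

```scala
// Plain-collections sketch: zeroValue starts the fold for each key,
// once in every partition that holds that key.
object FoldByKeySketch {
  def foldPartition(part: List[(String, Int)], zero: Int)(func: (Int, Int) => Int): Map[String, Int] =
    part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).foldLeft(zero)(func) }

  def main(args: Array[String]): Unit = {
    val data = List(("Mobin", 2), ("Mobin", 1))
    // one partition: 2 (zero) + 2 + 1 = 5
    println(foldPartition(data, 2)(_ + _))
    // two partitions: the zero enters twice, (2 + 2) + (2 + 1) = 7
    val m1 = foldPartition(List(data(0)), 2)(_ + _)
    val m2 = foldPartition(List(data(1)), 2)(_ + _)
    println(m1("Mobin") + m2("Mobin"))
  }
}
```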
Example 4:
// SparkConf and SparkContext setup omitted
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3))
val rdd = sc.parallelize(people)
val foldByKeyRDD = rdd.foldByKey(2)(_ + _)
foldByKeyRDD.foreach(println)
Output:
(Amy,3)
(Mobin,5)
(Lucy,7)
The zero value 2 initializes the fold for each key (it is applied once per key in each partition), after which the values of the same key are summed.
5. reduceByKey(func, numPartitions): groups by key and aggregates the values with the given func; numPartitions sets the number of partitions, raising job parallelism
Example 5
// SparkConf and SparkContext setup omitted
val arr = List(("A", 3), ("A", 2), ("B", 1), ("B", 3))
val rdd = sc.parallelize(arr)
val reduceByKeyRDD = rdd.reduceByKey(_ + _)
reduceByKeyRDD.foreach(println)
sc.stop()
Output:
(A,5)
(B,4)
(Rdd dependency graph)
6. groupByKey(numPartitions): groups by key and returns [K, Iterable[V]]; numPartitions sets the number of partitions, raising job parallelism
Example 6:
// SparkConf and SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val groupByKeyRDD = rdd.groupByKey()
groupByKeyRDD.foreach(println)
sc.stop()
Output:
(B,CompactBuffer(2, 3))
(A,CompactBuffer(1, 2))
The foldByKey, reduceByKey, and groupByKey functions above are all ultimately implemented by calling the combineByKey function.
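That note can be illustrated with plain Scala collections, without Spark. `combine` below is a hypothetical single-partition stand-in for combineByKey; the three operations differ only in the functions passed to it:

```scala
// Single-partition sketch: reduceByKey, groupByKey and foldByKey are all
// just combineByKey with different createCombiner/mergeValue functions.
object BuiltOnCombine {
  def combine[K, V, C](data: List[(K, V)], create: V => C, merge: (C, V) => C): Map[K, C] =
    data.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(merge(_, v)).getOrElse(create(v)))
    }

  val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))

  // reduceByKey(_ + _): createCombiner is the identity
  def reduced: Map[String, Int] = combine[String, Int, Int](arr, identity, _ + _)
  // groupByKey: createCombiner wraps the first value in a collection
  def grouped: Map[String, List[Int]] = combine(arr, (v: Int) => List(v), (l: List[Int], v: Int) => l :+ v)
  // foldByKey(10)(_ + _): createCombiner folds the zero into the first value
  def folded: Map[String, Int] = combine[String, Int, Int](arr, v => 10 + v, _ + _)

  def main(args: Array[String]): Unit = {
    println(reduced) // Map(A -> 3, B -> 5)
    println(grouped) // Map(A -> List(1, 2), B -> List(2, 3))
    println(folded)  // Map(A -> 13, B -> 15)
  }
}
```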
7. sortByKey(ascending, numPartitions): returns a (K,V) RDD sorted by key; ascending = true means ascending order, false means descending; numPartitions sets the number of partitions, raising job parallelism
Example 7:
// SparkConf and SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val sortByKeyRDD = rdd.sortByKey()
sortByKeyRDD.foreach(println)
sc.stop()
Output:
(A,1)
(A,2)
(B,2)
(B,3)
8. cogroup(otherDataset, numPartitions): for two RDDs (e.g. (K,V) and (K,W)), first groups the elements with the same key in each RDD, finally returning an RDD of (K, (Iterable[V], Iterable[W])); numPartitions sets the number of partitions, raising job parallelism
Example 8:
// SparkConf and SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr, 3)
val rdd2 = sc.parallelize(arr1, 3)
val cogroupRDD = rdd1.cogroup(rdd2)
cogroupRDD.foreach(println)
sc.stop()
Output:
(B,(CompactBuffer(2, 3),CompactBuffer(B1, B2)))
(A,(CompactBuffer(1, 2),CompactBuffer(A1, A2)))
(Rdd dependency graph)
9. join(otherDataset, numPartitions): first performs a cogroup on the two RDDs to form a new RDD, then takes the Cartesian product of the elements under each key; numPartitions sets the number of partitions, raising job parallelism
Example 9
// SparkConf and SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val joinRDD = rdd.join(rdd1)
joinRDD.foreach(println)
Output:
(B,(2,B1))
(B,(2,B2))
(B,(3,B1))
(B,(3,B2))
(A,(1,A1))
(A,(1,A2))
(A,(2,A1))
(A,(2,A2))
(Rdd dependency graph)
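The "cogroup, then Cartesian product" description can be sketched with plain Scala collections, without Spark. `cogroup` and `join` below are hypothetical helpers mirroring the two steps:

```scala
// Plain-collections sketch: join = cogroup, then a per-key Cartesian product.
object JoinSketch {
  def cogroup[K, V, W](left: List[(K, V)], right: List[(K, W)]): Map[K, (List[V], List[W])] = {
    val l = left.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val r = right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (l.keySet ++ r.keySet).map(k => k -> ((l.getOrElse(k, Nil), r.getOrElse(k, Nil)))).toMap
  }

  def join[K, V, W](left: List[(K, V)], right: List[(K, W)]): List[(K, (V, W))] =
    cogroup(left, right).toList.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w)) // Cartesian product under each key
    }

  def main(args: Array[String]): Unit = {
    val joined = join(List(("A", 1), ("B", 2)), List(("A", "A1"), ("A", "A2")))
    joined.foreach(println) // (A,(1,A1)) and (A,(1,A2)); "B" has no right-side match, so it is dropped
  }
}
```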
10. leftOuterJoin(otherDataset, numPartitions): left outer join; keeps all data of the left RDD, filling in None where the right side has no match; numPartitions sets the number of partitions, raising job parallelism
Example 10:
// SparkConf and SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3), ("C", 1))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val leftOuterJoinRDD = rdd.leftOuterJoin(rdd1)
leftOuterJoinRDD.foreach(println)
sc.stop()
Output:
(B,(2,Some(B1)))
(B,(2,Some(B2)))
(B,(3,Some(B1)))
(B,(3,Some(B2)))
(C,(1,None))
(A,(1,Some(A1)))
(A,(1,Some(A2)))
(A,(2,Some(A1)))
(A,(2,Some(A2)))
11. rightOuterJoin(otherDataset, numPartitions): right outer join; keeps all data of the right RDD, filling in None where the left side has no match; numPartitions sets the number of partitions, raising job parallelism
Example 11:
// SparkConf and SparkContext setup omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"), ("C", "C1"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val rightOuterJoinRDD = rdd.rightOuterJoin(rdd1)
rightOuterJoinRDD.foreach(println)
sc.stop()
Output:
(B,(Some(2),B1))
(B,(Some(2),B2))
(B,(Some(3),B1))
(B,(Some(3),B2))
(C,(None,C1))
(A,(Some(1),A1))
(A,(Some(1),A2))
(A,(Some(2),A1))
(A,(Some(2),A2))
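The Option-wrapping behavior of the two outer joins can be sketched with plain Scala collections, without Spark. `leftOuterJoin` below is a hypothetical helper; rightOuterJoin is symmetric, wrapping the left side instead:

```scala
// Plain-collections sketch: a left outer join keeps every left record and
// wraps the right side in Option, with None where no match exists.
object OuterJoinSketch {
  def leftOuterJoin[K, V, W](left: List[(K, V)], right: List[(K, W)]): List[(K, (V, Option[W]))] = {
    val r = right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    left.flatMap { case (k, v) =>
      r.get(k) match {
        case Some(ws) => ws.map(w => (k, (v, Option(w)))) // matched: one row per right value
        case None     => List((k, (v, Option.empty[W])))  // unmatched left record survives as None
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val out = leftOuterJoin(List(("A", 1), ("C", 1)), List(("A", "A1"), ("B", "B1")))
    out.foreach(println) // (A,(1,Some(A1))) and (C,(1,None)); "B" only exists on the right, so it is absent
  }
}
```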
Spark key-value RDD transformations (reprint)