Summary:
RDD (Resilient Distributed Dataset) is a special collection: it supports multiple data sources, has a fault-tolerance mechanism, can be cached, and supports parallel operations. An RDD represents a partitioned dataset.
There are two kinds of RDD operators:
Transformation: transformations are computed lazily. When one RDD is converted into another, no computation happens immediately; Spark only records the logical operation on the dataset.
Action: an action triggers a Spark job, which actually executes the computation recorded by the transformation operators.
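The lazy nature of transformations can be sketched, by analogy, with Scala's lazy collection views (plain Scala, no Spark required): the mapping function runs only when a result is forced, just as an RDD transformation runs only when an action triggers a job.

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var calls = 0
    // .view is lazy, like an RDD transformation: nothing is computed here,
    // the mapping is only recorded
    val mapped = List(1, 2, 3).view.map { x => calls += 1; x * 2 }
    println(s"calls after map: $calls")   // prints 0
    val result = mapped.toList            // forcing the view plays the "action" role
    println(s"calls after force: $calls") // prints 3
    println(result)                       // List(2, 4, 6)
  }
}
```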
This series focuses on the functions commonly used in Spark:
1. Basic RDD transformations
2. Key-value RDD transformations
3. Actions
This article covers key-value RDD transformations.
1.mapValues(func): applies a map operation to the V values in [K,V] data
(Example 1): add 2 to everyone's age
import org.apache.spark.{SparkConf, SparkContext}

object MapValues {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("map")
    val sc = new SparkContext(conf)
    val list = List(("Mobin", 22), ("Kpop", 20), ("Lufei", 23)) // illustrative ages
    val rdd = sc.parallelize(list)
    val mapValuesRDD = rdd.mapValues(_ + 2)
    mapValuesRDD.foreach(println)
  }
}
Output:
(Mobin,24)
(Kpop,22)
(Lufei,25)
(RDD dependency graph: a red block represents an RDD, the black blocks inside it represent its partitions; the same applies to the diagrams below.)
2.flatMapValues(func): applies a flatMap operation to the V values in [K,V] data
(Example 2):
(Initialization omitted)
val list = List(("Mobin", "Mobin"), ("Kpop", "Kpop"), ("Lufei", "Lufei"))
val rdd = sc.parallelize(list)
val flatMapValuesRDD = rdd.flatMapValues(x => Seq(x, "male"))
flatMapValuesRDD.foreach(println)
Output:
(Mobin,Mobin)
(Mobin,male)
(Kpop,Kpop)
(Kpop,male)
(Lufei,Lufei)
(Lufei,male)
With mapValues the output would instead be:
(Mobin,List(Mobin, male))
(Kpop,List(Kpop, male))
(Lufei,List(Lufei, male))
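The difference can be modeled on plain Scala pairs (a sketch, no Spark involved): flatMapValues flattens each produced Seq into separate pairs, while mapValues keeps the Seq as a single value.

```scala
object ValuesDemo {
  def main(args: Array[String]): Unit = {
    val pairs = List(("Mobin", "Mobin"), ("Kpop", "Kpop"))
    // models flatMapValues: one output pair per element of the produced Seq
    val flat = pairs.flatMap { case (k, v) => Seq(v, "male").map(x => (k, x)) }
    // models mapValues: the produced Seq stays a single value
    val kept = pairs.map { case (k, v) => (k, Seq(v, "male")) }
    flat.foreach(println) // (Mobin,Mobin) (Mobin,male) (Kpop,Kpop) (Kpop,male)
    kept.foreach(println) // (Mobin,List(Mobin, male)) (Kpop,List(Kpop, male))
  }
}
```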
(Rdd dependency graph)
3.combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)
combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions)
combineByKey(createCombiner, mergeValue, mergeCombiners)
createCombiner: the combiner-creation function, called the first time a key is encountered; it converts a V-type value in the RDD into a C-type value (V => C), as in Example 3.
mergeValue: the value-merging function, called when the same key is encountered again within a partition; it combines the C-type value produced by createCombiner with the newly arrived V-type value into a new C-type value ((C, V) => C), as in Example 3.
mergeCombiners: the combiner-merging function; it combines two C-type values (from different partitions) into one C-type value ((C, C) => C), as in Example 3.
partitioner: an existing or custom partitioner; the default is HashPartitioner.
mapSideCombine: whether to perform the combine operation on the map side; the default is true.
Note that the parameter types of the first three functions must correspond: createCombiner is called the first time a key is encountered, and mergeValue is called to merge the value when the same key appears again.
(Example 3): count the number of males and females, and output in the form (sex, ((name, name, ...), count))
object CombineByKey {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("combineByKey")
    val sc = new SparkContext(conf)
    val people = List(("Male", "Mobin"), ("Male", "Kpop"), ("Female", "Lucy"), ("Male", "Lufei"), ("Female", "Amy"))
    val rdd = sc.parallelize(people)
    val combineByKeyRDD = rdd.combineByKey(
      (x: String) => (List(x), 1),
      (peo: (List[String], Int), x: String) => (x :: peo._1, peo._2 + 1),
      (sex1: (List[String], Int), sex2: (List[String], Int)) => (sex1._1 ::: sex2._1, sex1._2 + sex2._2))
    combineByKeyRDD.foreach(println)
    sc.stop()
  }
}
Output:
(Male,(List(Lufei, Kpop, Mobin),3))
(Female,(List(Amy, Lucy),2))
Process decomposition:
Partition 1:
  k="Male"   -> ("Male","Mobin")  -> createCombiner("Mobin") => peo1 = (List("Mobin"), 1)
  k="Male"   -> ("Male","Kpop")   -> mergeValue(peo1, "Kpop") => peo2 = ("Kpop" :: peo1._1, 1 + 1)  // same key: mergeValue merges the value
  k="Female" -> ("Female","Lucy") -> createCombiner("Lucy")  => peo3 = (List("Lucy"), 1)
Partition 2:
  k="Male"   -> ("Male","Lufei")  -> createCombiner("Lufei") => peo4 = (List("Lufei"), 1)
  k="Female" -> ("Female","Amy")  -> createCombiner("Amy")   => peo5 = (List("Amy"), 1)
Merging the partitions:
  k="Male"   -> mergeCombiners(peo2, peo4) => (List(Lufei, Kpop, Mobin), 3)
  k="Female" -> mergeCombiners(peo3, peo5) => (List(Amy, Lucy), 2)
(Rdd dependency graph)
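The walk-through above can be reproduced with a small plain-Scala model of combineByKey (an illustrative sketch, not Spark's implementation): each partition folds its records into per-key combiners, and combiners from different partitions are then merged with mergeCombiners.

```scala
object CombineByKeyModel {
  type C = (List[String], Int) // combiner: (names seen so far, count)

  val createCombiner: String => C = x => (List(x), 1)
  val mergeValue: (C, String) => C = (c, x) => (x :: c._1, c._2 + 1)
  val mergeCombiners: (C, C) => C = (a, b) => (a._1 ::: b._1, a._2 + b._2)

  // Each inner Seq models one partition of the RDD.
  def run(partitions: Seq[Seq[(String, String)]]): Map[String, C] = {
    // Phase 1: per-partition fold with createCombiner / mergeValue
    val perPart = partitions.map(_.foldLeft(Map.empty[String, C]) {
      case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    })
    // Phase 2: merge the per-partition combiners with mergeCombiners
    perPart.flatMap(_.toSeq).groupBy(_._1).map { case (k, cs) =>
      k -> cs.map(_._2).reduce(mergeCombiners)
    }
  }

  def main(args: Array[String]): Unit = {
    val p1 = Seq(("Male", "Mobin"), ("Male", "Kpop"), ("Female", "Lucy"))
    val p2 = Seq(("Male", "Lufei"), ("Female", "Amy"))
    run(Seq(p1, p2)).foreach(println)
  }
}
```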
4.foldByKey(zeroValue)(func)
foldByKey(zeroValue, partitioner)(func)
foldByKey(zeroValue, numPartitions)(func)
The foldByKey function is implemented by calling the combineByKey function.
zeroValue: the initial value. Through combineByKey's createCombiner, the first value of each key (once per partition) is folded with zeroValue via func, i.e. func(zeroValue, v); the remaining values are then merged by func. In Example 4 (a single partition) each key's first value v therefore becomes 2 + v.
func: values with the same key are merged by func (actually implemented through combineByKey's mergeValue and mergeCombiners functions, except that here the two functions are the same).
(Example 4):
(Initialization omitted)
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3))
val rdd = sc.parallelize(people)
val foldByKeyRDD = rdd.foldByKey(2)(_ + _)
foldByKeyRDD.foreach(println)
Output:
(Amy,3)
(Mobin,5)
(Lucy,7)
With a single partition, 2 is first added to each key's first value, and the values of the same key are then summed.
5.reduceByKey(func, numPartitions): groups by key and aggregates the values with the given func; numPartitions sets the number of partitions to increase job parallelism. (Example 5):
(Initialization omitted)
val arr = List(("A", 3), ("A", 2), ("B", 1), ("B", 3))
val rdd = sc.parallelize(arr)
val reduceByKeyRDD = rdd.reduceByKey(_ + _)
reduceByKeyRDD.foreach(println)
sc.stop()
Output:
(A,5)
(B,4)
(RDD dependency graph)
6.groupByKey(numPartitions): groups by key and returns [K, Iterable[V]]; numPartitions sets the number of partitions to increase job parallelism. (Example 6):
(Initialization omitted)
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val groupByKeyRDD = rdd.groupByKey()
groupByKeyRDD.foreach(println)
sc.stop()
Output:
(B,CompactBuffer(2, 3))
(A,CompactBuffer(1, 2))
The foldByKey, reduceByKey and groupByKey functions above are all ultimately implemented by calling the combineByKey function.
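How these three reduce to the combineByKey triple can be sketched in plain Scala (a single-partition model, not Spark's actual source): each is just a particular choice of createCombiner and merge function.

```scala
object ViaCombineByKey {
  // Models combineByKey on a single partition:
  // create builds the first combiner for a key, merge folds in later values.
  def combine[K, V, C](data: Seq[(K, V)])(create: V => C, merge: (C, V) => C): Map[K, C] =
    data.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(merge(_, v)).getOrElse(create(v)))
    }

  // reduceByKey(f): createCombiner is the identity, mergeValue is f
  def reduceByKey[K, V](data: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    combine(data)((v: V) => v, f)

  // foldByKey(zero)(f): createCombiner folds zero into the first value
  def foldByKey[K, V](data: Seq[(K, V)])(zero: V)(f: (V, V) => V): Map[K, V] =
    combine(data)((v: V) => f(zero, v), f)

  // groupByKey: createCombiner wraps the value in a list, mergeValue appends
  def groupByKey[K, V](data: Seq[(K, V)]): Map[K, List[V]] =
    combine(data)((v: V) => List(v), (c: List[V], v: V) => c :+ v)

  def main(args: Array[String]): Unit = {
    val arr = List(("A", 3), ("A", 2), ("B", 1), ("B", 3))
    println(reduceByKey(arr)(_ + _))  // Map(A -> 5, B -> 4)
    println(foldByKey(arr)(2)(_ + _)) // Map(A -> 7, B -> 6)
    println(groupByKey(arr))          // Map(A -> List(3, 2), B -> List(1, 3))
  }
}
```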
7.sortByKey(ascending, numPartitions): returns an RDD of (K,V) pairs sorted by key; ascending = true sorts in ascending order, false in descending order; numPartitions sets the number of partitions to increase job parallelism. (Example 7):
(Initialization omitted)
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val sortByKeyRDD = rdd.sortByKey()
sortByKeyRDD.foreach(println)
sc.stop()
Output:
(A,1)
(A,2)
(B,2)
(B,3)
8.cogroup(otherDataset, numPartitions): for two RDDs (e.g. of (K,V) and (K,W) pairs), gathers the elements with the same key together and returns an RDD of (K, (Iterable[V], Iterable[W])); numPartitions sets the number of partitions to increase job parallelism. (Example 8):
(Initialization omitted)
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr, 3)
val rdd2 = sc.parallelize(arr1, 3)
val cogroupRDD = rdd1.cogroup(rdd2)
cogroupRDD.foreach(println)
sc.stop()
Output:
(B,(CompactBuffer(2, 3),CompactBuffer(B1, B2)))
(A,(CompactBuffer(1, 2),CompactBuffer(A1, A2)))
(Rdd dependency graph)
9.join(otherDataset, numPartitions): the two RDDs are first cogrouped into a new RDD, then the elements under each key are combined as a Cartesian product; numPartitions sets the number of partitions to increase job parallelism. (Example 9):
(Initialization omitted)
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val joinRDD = rdd.join(rdd1)
joinRDD.foreach(println)
Output:
(B,(2,B1))
(B,(2,B2))
(B,(3,B1))
(B,(3,B2))
(A,(1,A1))
(A,(1,A2))
(A,(2,A1))
(A,(2,A2))
(Rdd dependency graph)
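The "cogroup, then per-key Cartesian product" description can be sketched with plain Scala collections (an illustrative model, not Spark's implementation):

```scala
object JoinViaCogroup {
  // Models cogroup: collect each key's values from both sides
  def cogroup[K, V, W](a: Seq[(K, V)], b: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (a.map(_._1) ++ b.map(_._1)).distinct
    keys.map { k =>
      k -> (a.filter(_._1 == k).map(_._2), b.filter(_._1 == k).map(_._2))
    }.toMap
  }

  // Models join: cogroup, then the Cartesian product of each key's two value lists.
  // Keys present on only one side produce an empty product and so disappear.
  def join[K, V, W](a: Seq[(K, V)], b: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(a, b).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }

  def main(args: Array[String]): Unit = {
    val arr  = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
    val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
    join(arr, arr1).foreach(println)
  }
}
```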
10.leftOuterJoin(otherDataset, numPartitions): left outer join; contains all data of the left RDD, with None where the right side has no match; numPartitions sets the number of partitions to increase job parallelism. (Example 10):
(Initialization omitted)
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3), ("C", 1))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val leftOuterJoinRDD = rdd.leftOuterJoin(rdd1)
leftOuterJoinRDD.foreach(println)
sc.stop()
Output:
(B,(2,Some(B1)))
(B,(2,Some(B2)))
(B,(3,Some(B1)))
(B,(3,Some(B2)))
(C,(1,None))
(A,(1,Some(A1)))
(A,(1,Some(A2)))
(A,(2,Some(A1)))
(A,(2,Some(A2)))
11.rightOuterJoin(otherDataset, numPartitions): right outer join; contains all data of the right RDD, with None where the left side has no match; numPartitions sets the number of partitions to increase job parallelism. (Example 11):
(Initialization omitted)
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"), ("C", "C1"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val rightOuterJoinRDD = rdd.rightOuterJoin(rdd1)
rightOuterJoinRDD.foreach(println)
sc.stop()
Output:
(B,(Some(2),B1))
(B,(Some(2),B2))
(B,(Some(3),B1))
(B,(Some(3),B2))
(C,(None,C1))
(A,(Some(1),A1))
(A,(Some(1),A2))
(A,(Some(2),A1))
(A,(Some(2),A2))
Source for the examples above: https://github.com/hadoop-mobin/SparkExample/tree/master/src/main/scala/com/mobin/SparkRDDFun/Transformation/kvrdd
Spark common functions explained--Key value RDD conversion