Spark Common Functions Explained: Key-Value RDD Transformations

Summary:

RDD: Resilient Distributed Dataset, a special collection that supports multiple data sources, has a fault-tolerance mechanism, can be cached, and supports parallel operations; an RDD represents a partitioned dataset.
An RDD supports two kinds of operators:

Transformation: transformations are lazily evaluated. When one RDD is transformed into another, nothing is computed immediately; Spark only records the logical operation on the dataset.
Action: triggers a Spark job, which actually forces the computation of the recorded transformations.
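This laziness can be seen in a short sketch (a minimal example, assuming a SparkContext named sc as in the examples below):

val rdd = sc.parallelize(1 to 100)
val doubled = rdd.map(_ * 2)  // transformation: returns immediately, nothing is computed yet
val n = doubled.count()       // action: this triggers the actual Spark job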

This series covers the functions commonly used in Spark:
1. Basic RDD transformations
2. Key-value RDD transformations
3. Actions

1. mapValues(func): applies a map operation to the V value of each pair in [K,V]-typed data.
(Example 1): add 2 to each person's age
import org.apache.spark.{SparkConf, SparkContext}

object MapValues {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("map")
    val sc = new SparkContext(conf)
    val list = List(("Mobin", 22), ("Kpop", 20), ("Lufei", 23))  // sample ages
    val rdd = sc.parallelize(list)
    val mapValuesRDD = rdd.mapValues(_ + 2)
    mapValuesRDD.foreach(println)
  }
}
Output:
(Mobin,24) (Kpop,22) (Lufei,25)
(RDD dependency graph: red blocks represent an RDD, black blocks represent the partition collection; the same applies below.)

2. flatMapValues(func): applies a flatMap operation to the V value of each pair in [K,V]-typed data.
(Example 2):
Omitted
val list = List(("Mobin", "22"), ("Kpop", "20"), ("Lufei", "23"))  // sample ages, as strings
val rdd = sc.parallelize(list)
val flatMapValuesRDD = rdd.flatMapValues(x => Seq(x, "male"))
flatMapValuesRDD.foreach(println)
Output:
(Mobin,22) (Mobin,male) (Kpop,20) (Kpop,male) (Lufei,23) (Lufei,male)
If mapValues were used instead, the output would be:
(Mobin,List(22, male)) (Kpop,List(20, male)) (Lufei,List(23, male))
(RDD dependency graph)

3. combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)
   combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions)
   combineByKey(createCombiner, mergeValue, mergeCombiners)

createCombiner: invoked the first time a key is encountered; creates a combiner that converts a V-type value in the RDD dataset to a C-type value (V => C), as in Example 3.
mergeValue: the merge-value function; when the same key is encountered again, it merges the C-type value produced by createCombiner with the newly arrived V-type value into a new C-type value ((C, V) => C), as in Example 3.
mergeCombiners: the merge-combiners function; merges C-type values pairwise into a single C-type value ((C, C) => C), as in Example 3.
partitioner: an existing or custom partition function; defaults to HashPartitioner.
mapSideCombine: whether to perform the combine operation on the map side; defaults to true.

Note that the parameter types of the first three functions must correspond: createCombiner is called the first time a key is encountered, and mergeValue is called to merge the value when the same key is encountered again.

(Example 3): count the number of males and females, and output in the form (sex, (List(name, name, ...), count))
import org.apache.spark.{SparkConf, SparkContext}

object CombineByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("combineByKey")
    val sc = new SparkContext(conf)
    val people = List(("male", "Mobin"), ("male", "Kpop"), ("female", "Lucy"), ("male", "Lufei"), ("female", "Amy"))
    val rdd = sc.parallelize(people)
    val combineByKeyRDD = rdd.combineByKey(
      (x: String) => (List(x), 1),
      (peo: (List[String], Int), x: String) => (x :: peo._1, peo._2 + 1),
      (sex1: (List[String], Int), sex2: (List[String], Int)) => (sex1._1 ::: sex2._1, sex1._2 + sex2._2))
    combineByKeyRDD.foreach(println)
    sc.stop()
  }
}
Output:
(male,(List(Lufei, Kpop, Mobin),3)) (female,(List(Amy, Lucy),2))
Process decomposition:
Partition 1:
k = "male"   → ("male","Mobin")  → createCombiner("Mobin") => peo1 = (List("Mobin"), 1)
k = "male"   → ("male","Kpop")   → mergeValue(peo1, "Kpop") => peo2 = ("Kpop" :: peo1._1, 1 + 1)   // same key, so mergeValue merges the value
k = "female" → ("female","Lucy") → createCombiner("Lucy") => peo3 = (List("Lucy"), 1)

Partition 2:
k = "male"   → ("male","Lufei")  → createCombiner("Lufei") => peo4 = (List("Lufei"), 1)
k = "female" → ("female","Amy")  → createCombiner("Amy") => peo5 = (List("Amy"), 1)

Merging partitions:
k = "male"   → mergeCombiners(peo2, peo4) => (List("Lufei", "Kpop", "Mobin"), 3)
k = "female" → mergeCombiners(peo3, peo5) => (List("Amy", "Lucy"), 2)
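The (V => C, (C, V) => C, (C, C) => C) signature fits many aggregations. As a further illustration (a minimal sketch, not from the original article), a per-key average can be computed by carrying a (sum, count) pair as the combiner:

val scores = sc.parallelize(List(("A", 90.0), ("B", 80.0), ("A", 70.0)))  // hypothetical data
val avg = scores.combineByKey(
    (v: Double) => (v, 1),                                                // createCombiner: V => C, where C = (sum, count)
    (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),          // mergeValue: fold one more V into the C
    (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))   // mergeCombiners: merge two partitions' Cs
  .mapValues { case (sum, count) => sum / count }
avg.foreach(println)  // (A,80.0) (B,80.0)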
(RDD dependency graph)

4. foldByKey(zeroValue)(func)
   foldByKey(zeroValue, partitioner)(func)
   foldByKey(zeroValue, numPartitions)(func)

The foldByKey function is implemented by calling the combineByKey function.
zeroValue: the initial value. combineByKey's createCombiner initializes the first value of each key as func(zeroValue, v); the remaining values are then merged in by func. Note that zeroValue is folded in once per key per partition, not once per value.
func: merges the values by key (actually implemented through combineByKey's mergeValue and mergeCombiners functions, except that here the two functions are the same).

(Example 4):
Omitted
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3))
val rdd = sc.parallelize(people)
val foldByKeyRDD = rdd.foldByKey(2)(_ + _)
foldByKeyRDD.foreach(println)
Output:
(Amy,3) (Mobin,5) (Lucy,7)
For each key, the zero value 2 is folded in once (per partition), and then the values with the same key are added up.
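A minimal sketch (hypothetical data, not from the original article) showing that zeroValue is applied once per key per partition, so the partitioning changes the result:

val pairs = List(("A", 1), ("A", 1))
sc.parallelize(pairs, 1).foldByKey(2)(_ + _).foreach(println)  // (A,4): 2 + 1 + 1, single partition
sc.parallelize(pairs, 2).foldByKey(2)(_ + _).foreach(println)  // (A,6): (2 + 1) + (2 + 1), zero value folded in once per partition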
5. reduceByKey(func, numPartitions): groups by key and aggregates the values with the given func; numPartitions sets the number of partitions, increasing job parallelism.
(Example 5):
Omitted
val arr = List(("A", 3), ("A", 2), ("B", 1), ("B", 3))
val rdd = sc.parallelize(arr)
val reduceByKeyRDD = rdd.reduceByKey(_ + _)
reduceByKeyRDD.foreach(println)
sc.stop()
Output:
(A,5) (B,4)
(RDD dependency graph)

6. groupByKey(numPartitions): groups by key and returns [K, Iterable[V]]; numPartitions sets the number of partitions, increasing job parallelism.
(Example 6):
Omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val groupByKeyRDD = rdd.groupByKey()
groupByKeyRDD.foreach(println)
sc.stop()
Output:
(B,CompactBuffer(2, 3)) (A,CompactBuffer(1, 2))
The foldByKey, reduceByKey, and groupByKey functions above are all ultimately implemented by calling the combineByKey function.
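That relationship can be made concrete with a sketch (a simplified equivalent, not Spark's actual source): reduceByKey(func) behaves like a combineByKey in which the combiner type C is simply V itself:

val arr = List(("A", 3), ("A", 2), ("B", 1), ("B", 3))
val rdd = sc.parallelize(arr)
// Equivalent in effect to rdd.reduceByKey(_ + _):
val reduced = rdd.combineByKey(
  (v: Int) => v,                  // createCombiner: the first value is the combiner itself
  (c: Int, v: Int) => c + v,      // mergeValue: the same function as func
  (c1: Int, c2: Int) => c1 + c2)  // mergeCombiners: also func
reduced.foreach(println)          // (A,5) (B,4)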
7. sortByKey(ascending, numPartitions): returns an RDD of (K,V) pairs sorted by key; ascending = true sorts in ascending order, false in descending order; numPartitions sets the number of partitions, increasing job parallelism.
(Example 7):
Omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val rdd = sc.parallelize(arr)
val sortByKeyRDD = rdd.sortByKey()
sortByKeyRDD.foreach(println)
sc.stop()
Output:
(A,1) (A,2) (B,2) (B,3)
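Passing false sorts the keys in descending order (a minimal sketch on the same rdd):

val descRDD = rdd.sortByKey(false)
descRDD.foreach(println)  // keys in descending order: the B pairs before the A pairs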
8. cogroup(otherDataset, numPartitions): for two RDDs (such as (K,V) and (K,W)), aggregates the elements with the same key, returning an RDD of the form (K, (Iterable[V], Iterable[W])); numPartitions sets the number of partitions, increasing job parallelism.
(Example 8):
Omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr, 3)
val rdd2 = sc.parallelize(arr1, 3)
val cogroupRDD = rdd1.cogroup(rdd2)
cogroupRDD.foreach(println)
sc.stop()
Output:
(B,(CompactBuffer(2, 3),CompactBuffer(B1, B2))) (A,(CompactBuffer(1, 2),CompactBuffer(A1, A2)))
(RDD dependency graph)

9. join(otherDataset, numPartitions): first performs a cogroup on the two RDDs to form a new RDD, then takes the Cartesian product of the elements under each key; numPartitions sets the number of partitions, increasing job parallelism.
(Example 9):
Omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val joinRDD = rdd.join(rdd1)
joinRDD.foreach(println)
Output:
(B,(2,B1)) (B,(2,B2)) (B,(3,B1)) (B,(3,B2)) (A,(1,A1)) (A,(1,A2)) (A,(2,A1)) (A,(2,A2))
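The cogroup-then-Cartesian-product behavior can be sketched directly (a simplified equivalent on the same rdd and rdd1, not Spark's actual source):

// Equivalent in effect to rdd.join(rdd1):
val joined = rdd.cogroup(rdd1).flatMapValues {
  case (vs, ws) => for (v <- vs; w <- ws) yield (v, w)  // Cartesian product of the values under each key
}
joined.foreach(println)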
(RDD dependency graph)

10. leftOuterJoin(otherDataset, numPartitions): left outer join; contains all the data of the left RDD, and keys with no match on the right side get None; numPartitions sets the number of partitions, increasing job parallelism.
(Example 10):
Omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3), ("C", 1))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val leftOutJoinRDD = rdd.leftOuterJoin(rdd1)
leftOutJoinRDD.foreach(println)
sc.stop()
Output:
(B,(2,Some(B1))) (B,(2,Some(B2))) (B,(3,Some(B1))) (B,(3,Some(B2))) (C,(1,None)) (A,(1,Some(A1))) (A,(1,Some(A2))) (A,(2,Some(A1))) (A,(2,Some(A2)))
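Because the right-side value is wrapped in Option, a follow-up step often unwraps it; a minimal sketch using getOrElse with a hypothetical default:

// Replace missing right-side values with a default string:
val filled = leftOutJoinRDD.mapValues { case (v, opt) => (v, opt.getOrElse("NA")) }
filled.foreach(println)  // e.g. (C,(1,NA)) for the unmatched key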
11. rightOuterJoin(otherDataset, numPartitions): right outer join; contains all the data of the right RDD, and keys with no match on the left side get None; numPartitions sets the number of partitions, increasing job parallelism.
(Example 11):
Omitted
val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr1 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"), ("C", "C1"))
val rdd = sc.parallelize(arr, 3)
val rdd1 = sc.parallelize(arr1, 3)
val rightOutJoinRDD = rdd.rightOuterJoin(rdd1)
rightOutJoinRDD.foreach(println)
sc.stop()
Output:
(B,(Some(2),B1)) (B,(Some(2),B2)) (B,(Some(3),B1)) (B,(Some(3),B2)) (C,(None,C1)) (A,(Some(1),A1)) (A,(Some(1),A2)) (A,(Some(2),A1)) (A,(Some(2),A2))

Source code for the examples above: https://github.com/hadoop-mobin/SparkExample/tree/master/src/main/scala/com/mobin/SparkRDDFun/Transformation/kvrdd
