Summary:
RDD (Resilient Distributed Dataset) is a special collection that supports multiple data sources, has a fault-tolerance mechanism, can be cached, and supports parallel operations. An RDD represents a dataset divided into partitions.
An RDD has two kinds of operators:
Transformation: transformations are lazily evaluated. When one RDD is converted into another, the conversion is not executed immediately; Spark only records the logical operation on the dataset.
Action: an action triggers a Spark job, which actually executes the computation of the transformation operators.
This series focuses on the function operations commonly used in Spark:
1. RDD basic transformations
2. Key-value RDD transformations
3. Action operations

This article covers the Action functions:
1.reduce 2.collect 3.count 4.first 5.take 6.top 7.takeOrdered 8.countByKey 9.collectAsMap 10.lookup 11.aggregate 12.fold 13.saveAsTextFile 14.saveAsSequenceFile

1.reduce(func): aggregates the data across partitions through the function func. func takes two parameters and returns a new value; the new value is passed back into func as an argument together with the next element, until the last element is processed.
2.collect(): returns all elements of the dataset as an array to the driver program. To prevent the driver program from running out of memory, it is generally advisable to control the size of the returned dataset.
3.count(): returns the number of elements in the dataset.
4.first(): returns the first element of the dataset.
5.take(n): returns the first n elements of the dataset as an array.
6.top(n): returns the first n elements by the default ordering or a specified ordering; by default they are output in descending order.
7.takeOrdered(n, [ordering]): returns the first n elements in natural order or by a specified ordering.
Example 1:
def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("local").setAppName("Reduce")
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(1 to 10, 2)
  val reduceRDD = rdd.reduce(_ + _)
  val reduceRDD1 = rdd.reduce(_ - _)      // if the data is in one partition, the result is -53
  val countRDD = rdd.count()
  val firstRDD = rdd.first()
  val takeRDD = rdd.take(5)               // first five elements
  val topRDD = rdd.top(3)                 // first three elements, from high to low
  val takeOrderedRDD = rdd.takeOrdered(3) // first three elements in natural order, from low to high
  println("func +: " + reduceRDD)
  println("func -: " + reduceRDD1)
  println("count: " + countRDD)
  println("first: " + firstRDD)
  println("take:")
  takeRDD.foreach(x => print(x + " "))
  println("\ntop:")
  topRDD.foreach(x => print(x + " "))
  println("\ntakeOrdered:")
  takeOrderedRDD.foreach(x => print(x + " "))
  sc.stop
}
Output:
func +: 55
func -: 15  (if the data is in one partition, the result is -53)
count: 10
first: 1
take:
1 2 3 4 5
top:
10 9 8
takeOrdered:
1 2 3
(RDD dependency graph: the red block represents an RDD, and the black blocks represent its partition collection; the same applies below.)
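Example 1 only uses the default orderings and never calls collect() directly. As a supplement, the sketch below shows collect() and the optional Ordering parameter of top/takeOrdered mentioned above. It is a minimal sketch assuming a local SparkContext; the object name and app name are illustrative, not from the article:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("OrderingSketch")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10, 2)

    // collect() pulls the whole dataset back to the driver as an array,
    // so it should only be used when the result is known to be small
    val all = rdd.collect() // Array(1, 2, ..., 10)

    // with a reversed Ordering, top() returns the smallest elements
    // and takeOrdered() returns the largest ones
    val reverseOrd = Ordering[Int].reverse
    val smallest = rdd.top(3)(reverseOrd)         // Array(1, 2, 3)
    val largest  = rdd.takeOrdered(3)(reverseOrd) // Array(10, 9, 8)

    println(all.mkString(" "))
    println(smallest.mkString(" "))
    println(largest.mkString(" "))
    sc.stop()
  }
}
```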
8.countByKey(): acts on an RDD of key-value (K, V) type; counts the number of elements for each key and returns a Map of (K, count).
9.collectAsMap(): acts on an RDD of (K, V) type. It differs from collect in that collectAsMap returns a Map and therefore does not contain duplicate keys; for duplicate keys, later elements overwrite earlier ones.
10.lookup(k): acts on an RDD of (K, V) type; returns all values for the specified key k.
Example 2:
def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("local").setAppName("KVFunc")
  val sc = new SparkContext(conf)
  val arr = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
  val rdd = sc.parallelize(arr, 2)
  val countByKeyRDD = rdd.countByKey()
  val collectAsMapRDD = rdd.collectAsMap()
  println("countByKey:")
  countByKeyRDD.foreach(print)
  println("\ncollectAsMap:")
  collectAsMapRDD.foreach(print)
  sc.stop
}
Output:
countByKey: (B,2)(A,2)
collectAsMap: (A,2)(B,3)
(RDD dependency graph)
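Example 2 does not actually exercise lookup. A minimal sketch of it on the same data, assuming a local SparkContext (the object name and app name are illustrative), could look like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LookupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("LookupSketch")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(("A", 1), ("B", 2), ("A", 2), ("B", 3)), 2)

    // lookup(k) returns ALL values for key k, unlike collectAsMap,
    // which keeps only the last value for a duplicate key
    val valuesForA = rdd.lookup("A") // Seq(1, 2)
    println(valuesForA.mkString(" "))
    sc.stop()
  }
}
```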
11.aggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U):
The seqOp function aggregates the data of each partition into a single value of type U, and the combOp function then merges the U-typed results of all partitions into one value of type U.
def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("local").setAppName("Fold")
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(List(1, 2, 3, 4), 2)
  val aggregateRDD = rdd.aggregate(2)(_ + _, _ * _)
  println(aggregateRDD)
  sc.stop
}
Output:
90
Step 1: partition 1: zeroValue + 1 + 2 = 5; partition 2: zeroValue + 3 + 4 = 9
Step 2: zeroValue * (result of partition 1) * (result of partition 2) = 2 * 5 * 9 = 90
(RDD dependency graph)
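In the example above U and T are both Int. To show why aggregate takes two separate functions, the sketch below (illustrative, assuming a local SparkContext; names are not from the article) computes an average by aggregating Int elements into a (sum, count) pair, i.e. U = (Int, Int) while T = Int:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("AggregateSketch")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4), 2)

    // seqOp folds one element into a partition's running (sum, count);
    // combOp merges the per-partition (sum, count) pairs
    val (sum, count) = rdd.aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),
      (a, b)   => (a._1 + b._1, a._2 + b._2)
    )
    println(sum.toDouble / count) // average of 1..4: 2.5
    sc.stop()
  }
}
```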
12.fold(zeroValue: T)(op: (T, T) => T):
Uses the op function to aggregate the elements within each partition and then to merge the per-partition results. op takes two parameters; at the start, the first parameter passed in is zeroValue. T is the element type of the RDD dataset. fold is equivalent to an aggregate in which seqOp and combOp are the same function.
Example 3