1. Create an RDD
val lines = sc.parallelize(List("Pandas", "I like Pandas"))
2. Load a local file into an RDD
val linesRDD = sc.textFile("yangsy.txt")
3. filter() does not filter the original RDD in place; instead it creates a new RDD containing only the elements that satisfy the predicate
val spark = linesRDD.filter(line => line.contains("Damowang"))
4. count() is also an action. Spark evaluates lazily: whether the preceding statements are right or wrong, nothing actually executes until an action such as count(), first(), or foreach() is invoked
spark.count()
5. foreach(println) prints the data for inspection (use take() to fetch just a small sample; if the object is a DataFrame you can call show(1)). There is also the collect() function, which loads all the data onto the driver side. When the data volume is large, do not call collect(), or it will blow up the driver server (although in our project collect() was indeed used to load more than 40 million rows onto the driver...)
spark.take(1).foreach(println)
6. Common transformation and action operations. Common transformations include map() and filter()
For example, calculate the square of each value in the RDD:
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
7. flatMap() is similar to map(), but the function returns an iterator over a sequence of values; the resulting RDD contains all the elements reachable through those iterators. A simple use is splitting a string into words
val lines = sc.parallelize(List("Xiaojingjing Is My Love", "Damowang", "kings_landing"))
val words = lines.flatMap(line => line.split(","))
words.first() // call first() to return the first value
8. Summary of RDD transformation operations:
Basic RDD transformations, for an RDD containing the data set {1, 2, 3, 3}:
map(): applies a function to each element in the RDD; the return values form a new RDD. e.g. rdd.map(x => x + 1) Result: {2, 3, 4, 4}
flatMap(): applies a function to each element in the RDD and makes all the contents of the returned iterators into a new RDD; often used for splitting. e.g. (on a string RDD) rdd.flatMap(x => x.split(",")).take(1).foreach(println) Result: 1
filter(): returns an RDD consisting of the elements that pass the predicate. e.g. rdd.filter(x => x != 1) Result: {2, 3, 3}
distinct(): removes duplicates. e.g. rdd.distinct() Result: {1, 2, 3}
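As a quick sanity check, the four basic transformations above can be sketched in plain Python. This only illustrates the semantics on the same data set {1, 2, 3, 3}; real RDDs are lazy and distributed, and the string example for flatMap is made-up sample data.

```python
# Plain-Python sketch of map / flatMap / filter / distinct semantics.
data = [1, 2, 3, 3]

mapped = [x + 1 for x in data]                         # map(x => x + 1)
flat = [w for s in ["1,2,3,3"] for w in s.split(",")]  # flatMap(x => x.split(","))
filtered = [x for x in data if x != 1]                 # filter(x => x != 1)
deduped = sorted(set(data))                            # distinct()

print(mapped)    # [2, 3, 4, 4]
print(flat)      # ['1', '2', '3', '3']
print(filtered)  # [2, 3, 3]
print(deduped)   # [1, 2, 3]
```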
Transformations on two RDDs, for RDDs containing {1, 2, 3} and {3, 4, 5} respectively:
union(): produces an RDD containing all elements from both RDDs. e.g. rdd.union(other) Result: {1, 2, 3, 3, 4, 5}
intersection(): finds the elements common to both RDDs. e.g. rdd.intersection(other) Result: {3}
subtract(): removes the contents of the other RDD. e.g. rdd.subtract(other) Result: {1, 2}
cartesian(): Cartesian product with the other RDD. e.g. rdd.cartesian(other) Result: {(1,3), (1,4), (1,5), ..., (3,5)}
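The two-RDD transformations above can likewise be sketched in plain Python on the same datasets. Note that union() keeps duplicates, while intersection() and subtract() behave like set operations:

```python
# Plain-Python sketch of union / intersection / subtract / cartesian.
rdd = [1, 2, 3]
other = [3, 4, 5]

union = rdd + other                               # union(): duplicates kept
intersection = sorted(set(rdd) & set(other))      # intersection()
subtract = sorted(set(rdd) - set(other))          # subtract()
cartesian = [(a, b) for a in rdd for b in other]  # cartesian(): all pairs

print(union)         # [1, 2, 3, 3, 4, 5]
print(intersection)  # [3]
print(subtract)      # [1, 2]
print(cartesian)     # [(1, 3), (1, 4), ..., (3, 5)] - 9 pairs in total
```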
All of the above are transformations; next come the action operations
9. reduce() aggregates all the data in the RDD in parallel
val lines1 = sc.parallelize(List(1, 2, 3, 3))
lines1.reduce((x, y) => x + y)
10. reduceByKey(): the simplest use is implementing a word count. The principle is that a map() converts the RDD into two-tuples, and reduceByKey() then combines those tuples by key.
val linesRDD = sc.textFile("yangsy.txt")
val count = linesRDD.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()
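The word-count pipeline above can be traced step by step in plain Python. The input lines here are made-up sample data, and a dict stands in for reduceByKey's per-key combining:

```python
# Plain-Python sketch of flatMap -> map -> reduceByKey word count.
lines = ["a b a", "b c"]  # stands in for the lines of the text file

words = [w for line in lines for w in line.split(" ")]  # flatMap: split into words
pairs = [(w, 1) for w in words]                         # map: (word, 1) tuples
counts = {}
for key, value in pairs:                                # reduceByKey(_ + _)
    counts[key] = counts.get(key, 0) + value

print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```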
11. aggregate() is similar to reduce(), but it can return a result of a different type than the element type. The example below accumulates a (sum, count) pair:
val result = input.aggregate((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1), (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
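What aggregate() computes here is a (sum, count) pair, whose ratio gives the average. A plain-Python sketch follows; the two-"partition" split is an assumption made only to show where the combiner (the second function) comes in:

```python
# Plain-Python sketch of aggregate((0, 0))(seqOp, combOp) on [1, 2, 3, 4].
data = [1, 2, 3, 4]

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)  # fold one element in
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])         # merge two accumulators

# Fold each pretend partition with seq_op, then merge with comb_op.
part1 = part2 = (0, 0)
for v in data[:2]:
    part1 = seq_op(part1, v)
for v in data[2:]:
    part2 = seq_op(part2, v)

result = comb_op(part1, part2)   # (10, 4): sum and count
average = result[0] / result[1]  # 2.5

print(result, average)
```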
There are many more, such as count() and take(num), which are not practiced here
12. The collect() and foreach() functions have actually been used above, so there is not much to say here.
Summary of the RDD action operations:
Operations on an RDD containing the data {1, 2, 3, 3}:
collect(): returns all elements in the RDD. rdd.collect()
count(): the number of elements in the RDD
countByValue(): returns the number of occurrences of each element in the RDD. e.g. rdd.countByValue() Result: {(1,1), (2,1), (3,2)}
take(num): returns num elements from the RDD
top(num): returns the top num elements from the RDD
takeSample(withReplacement, num, [seed]): returns randomly sampled elements from the RDD. e.g. rdd.takeSample(false, 1)
reduce(func): aggregates all the data in the RDD in parallel. e.g. rdd.reduce((x, y) => x + y)
foreach(func): applies the given function to each element in the RDD
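The actions in the summary above can be mimicked in plain Python on the same data {1, 2, 3, 3}; Counter stands in for countByValue, and list slicing stands in for take/top:

```python
# Plain-Python sketch of the action semantics on [1, 2, 3, 3].
from collections import Counter
from functools import reduce

data = [1, 2, 3, 3]

collected = list(data)                    # collect()
count = len(data)                         # count()
count_by_value = dict(Counter(data))      # countByValue()
taken = data[:2]                          # take(2)
top2 = sorted(data, reverse=True)[:2]     # top(2)
total = reduce(lambda x, y: x + y, data)  # reduce((x, y) => x + y)

print(collected)       # [1, 2, 3, 3]
print(count)           # 4
print(count_by_value)  # {1: 1, 2: 1, 3: 2}
print(taken)           # [1, 2]
print(top2)            # [3, 3]
print(total)           # 9
```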
When you cache data in memory by calling the persist() function, you can call unpersist() if you want to remove it from memory
Spark RDD transformation and action function summary (to be continued)