Lesson 16: RDD in Practice
Because RDDs are immutable, operating on an RDD differs from ordinary object-oriented manipulation. RDD operations fall into three basic categories: transformations, actions, and controllers.
1. Transformation
A transformation creates a new RDD from an existing RDD through a conversion.
map(func): applies func to each element of the RDD on which map is called and returns a new RDD; the returned dataset is a distributed dataset
filter(func): applies func to each element of the RDD on which filter is called and returns a new RDD containing the elements for which func is true
flatMap(func): similar to map, but each input element can produce zero or more output elements
mapPartitions(func): like map, but while map works on each element, mapPartitions works on each partition
mapPartitionsWithSplit(func): similar to mapPartitions, but func is applied to one particular split, so func must take the partition index as an argument (newer Spark versions call this mapPartitionsWithIndex)
sample(withReplacement, fraction, seed): samples the data
union(otherDataset): returns a new dataset containing the union of the elements of the source dataset and the given dataset
distinct([numTasks]): returns a new dataset containing the distinct elements of the source dataset
groupByKey([numTasks]): returns (K, Seq[V]) pairs, the same key/value-list shape that the reduce function in Hadoop receives
reduceByKey(func, [numTasks]): applies the given reduce function to the (K, Seq[V]) groups as produced by groupByKey, for example to sum or average the values
sortByKey([ascending], [numTasks]): sorts by key in ascending or descending order; ascending is a Boolean
join(otherDataset, [numTasks]): given two key-value datasets (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs; numTasks is the number of concurrent tasks
cogroup(otherDataset, [numTasks]): given two key-value datasets (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples; numTasks is the number of concurrent tasks
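The semantics of the most common transformations can be sketched with plain Scala collections instead of RDDs. flatMap, map, and filter below are the same method names Spark exposes; reduceByKey has no collection equivalent, so it is emulated here with groupBy plus a per-key sum (the object and method names are illustrative only):

```scala
// A minimal sketch of transformation semantics on plain Scala collections,
// not on real RDDs. flatMap/map/filter behave like their RDD namesakes;
// groupBy + per-key sum emulates reduceByKey(_ + _).
object TransformationSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(line => line.split(" "))          // flatMap: one line -> many words
         .map(word => (word, 1))                    // map: word -> (word, 1)
         .groupBy(_._1)                             // emulates reduceByKey(_ + _)
         .map { case (w, ps) => (w, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("spark spark hadoop", "hadoop flink spark"))
    println(counts.filter { case (_, n) => n > 1 }) // filter: keep frequent words
  }
}
```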
Transformation Features:
Lazy evaluation: because transformations are lazy, creating one does not run it immediately. For the framework this means there is time to see as many steps as possible before executing, and the more steps it can see, the more room it has to optimize. The simplest optimization is merging steps: for example, instead of computing a = b * 3; b = c * 3; c = d * 3; d = 3 one statement at a time, the merged computation is simply a = 3 * 3 * 3 * 3.
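This laziness can be illustrated with a Scala view rather than Spark (the counter and names below are illustrative): like an RDD transformation, view.map only records the step, and nothing runs until a terminal call such as sum forces the pipeline.

```scala
// A minimal sketch of lazy evaluation using a Scala collection view,
// not Spark. The mutable counter shows when triple actually runs.
object LazySketch {
  var evaluations = 0
  def triple(x: Int): Int = { evaluations += 1; x * 3 }

  def main(args: Array[String]): Unit = {
    val pipeline = Seq(3).view.map(triple).map(triple).map(triple)
    println(evaluations)  // still 0: the three maps are only recorded
    println(pipeline.sum) // 81: forcing the view runs all three maps
    println(evaluations)  // 3
  }
}
```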
2. Action
The purpose of an action is to obtain a value or a result.
reduce(func): aggregates the dataset; the function passed in takes two arguments and returns one value, and it must be commutative and associative
collect(): usually used after a filter or when the result is small enough; returns the elements wrapped in an array
count(): returns the number of elements in the dataset
first(): returns the first element of the dataset
take(n): returns the first n elements, delivered to the driver program
takeSample(withReplacement, num, seed): samples the dataset and returns num elements, using seed as the random seed
saveAsTextFile(path): writes the dataset as a text file to the local file system, HDFS, or any HDFS-supported file system; Spark converts each record into a line of text and writes it to the file
saveAsSequenceFile(path): can only be used on key-value RDDs; generates a SequenceFile and writes it to the local or Hadoop file system
countByKey(): returns a map from each key of the RDD to the number of elements with that key
foreach(func): applies func to each element of the dataset
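The action semantics, too, can be sketched on plain Scala collections (not Spark): reduce, size, head, and take behave on local data like reduce, count(), first(), and take(n) do on RDDs, and countByKey is emulated with groupBy (the object name and helper are illustrative only).

```scala
// A minimal sketch of action semantics using plain Scala collections.
object ActionSketch {
  // Emulates RDD countByKey(): number of elements per key.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.size.toLong) }

  def main(args: Array[String]): Unit = {
    val nums = Seq(1, 2, 3, 4)
    println(nums.reduce(_ + _)) // 10: (_ + _) is commutative and associative
    println(nums.size)          // 4, like count()
    println(nums.head)          // 1, like first()
    println(nums.take(2))       // the first two elements, like take(2)
    println(countByKey(Seq(("a", 1), ("a", 2), ("b", 3))))
  }
}
```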
3. Controller
Controller operations mainly deal with persisting RDDs, for example cache(), persist(), and checkpoint().
The details will be covered in a follow-up article.
4. Spark WordCount Hands-On
This section uses IDEA to step through a WordCount example so that readers can see the concrete RDD type produced at each step, paving the way for the analysis that follows.
(1) The WordCount code used is as follows:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()                 // create SparkConf
    conf.setAppName("Wow,my first Spark app")  // set the app name
    conf.setMaster("local")                    // run locally
    val sc = new SparkContext(conf)
    val lines = sc.textFile("C://users//feng//ideaprojects//wordcount//src//sparktext.txt")
    val words = lines.flatMap { line => line.split(" ") }
    val pairs = words.map(word => (word, 1))
    val reduce = pairs.reduceByKey(_ + _)
    val sort_1 = reduce.map(pair => (pair._2, pair._1))
    val sort_2 = sort_1.sortByKey(false)
    val sort_3 = sort_2.map(pair => (pair._2, pair._1))
    val filter = sort_3.filter(pair => pair._2 > 2)
    filter.collect.foreach(wordNumberPair => println(wordNumberPair._1 + ":" + wordNumberPair._2))
    sc.stop()
  }
}
(2) The contents of the SparkText.txt file used by the program are as follows:
Hadoop Hadoop Hadoop
Spark Flink Spark
Scala Scala Object
Object Spark Scala
Spark Spark
Hadoop Hadoop
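As a minimal sketch (plain Scala, no Spark required, names illustrative), the same pipeline can be replayed on these six sample lines, so the expected result can be checked without running a cluster; groupBy plus a per-key sum stands in for reduceByKey(_ + _):

```scala
// Replays the WordCount pipeline on local collections instead of RDDs.
object WordCountLocal {
  def run(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .groupBy(_._1)                            // emulates reduceByKey(_ + _)
         .map { case (w, ps) => (w, ps.map(_._2).sum) }
         .toSeq
         .filter { case (_, n) => n > 2 }          // keep words seen more than twice
         .sortBy { case (_, n) => -n }             // descending by count

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "Hadoop Hadoop Hadoop", "Spark Flink Spark",
      "Scala Scala Object", "Object Spark Scala",
      "Spark Spark", "Hadoop Hadoop")
    run(lines).foreach { case (w, n) => println(w + ":" + n) }
    // Hadoop and Spark each appear 5 times and Scala 3 times;
    // Object (2) and Flink (1) fall below the filter threshold.
  }
}
```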
(3) WordCount debug results:
By stepping through with IDEA's debugger, the Debug window shows, for each line of code, which type of RDD it produces and important information such as which parent RDD that RDD depends on (Figure 2-14); the program's output is shown in Figure 2-15.
Figure 2-14 Debug process diagram
Figure 2-15 WordCount results
2.8.2 Parsing the internal mechanism of RDD generation
This section builds on the debugging results above: it revisits the debug information, explains it, and in doing so reviews the contents of this chapter.
(1) lines = sc.textFile()
The purpose of this statement is to read external data and generate a MapPartitionsRDD. Note the following:
As shown in Figure 2-16, the deps (dependency) of the MapPartitionsRDD is a HadoopRDD. From this we can see that textFile() actually consists of two steps: the first step converts the file contents into a HadoopRDD (in key-value form, where the key is the offset of each line); the second step converts the HadoopRDD into a MapPartitionsRDD (in value form, dropping the key of each key-value pair).
Figure 2-16 Getting data from the HadoopRDD
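These two steps can be mimicked with plain Scala (the object and method names are illustrative): step 1 pairs each line with an offset key, like the key-value form of the HadoopRDD; step 2 drops the keys, like the MapPartitionsRDD that textFile() actually returns. For simplicity the key here is the line index; the real HadoopRDD keys each line by its byte offset in the file.

```scala
// A minimal sketch of the two steps hidden inside textFile().
object TextFileSketch {
  // Step 1: key-value form, mimicking HadoopRDD (key = line index here).
  def step1HadoopStyle(text: String): Seq[(Long, String)] =
    text.split("\n").toSeq.zipWithIndex.map { case (line, i) => (i.toLong, line) }

  // Step 2: value form, mimicking the MapPartitionsRDD (keys dropped).
  def step2ValuesOnly(kv: Seq[(Long, String)]): Seq[String] =
    kv.map(_._2)

  def main(args: Array[String]): Unit = {
    val kv = step1HadoopStyle("Hadoop Hadoop Hadoop\nSpark Flink Spark")
    println(kv)                  // keyed lines, as in the HadoopRDD
    println(step2ValuesOnly(kv)) // plain lines, as in the MapPartitionsRDD
  }
}
```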
(2) words = lines.flatMap()
This statement performs a transformation on the RDD: it splits each record of the MapPartitionsRDD on spaces and puts every resulting fragment into a new MapPartitionsRDD.
(3) pairs = words.map(word => (word, 1))
This statement performs a transformation on the RDD: it converts each record of the MapPartitionsRDD (e.g. "Spark", a plain value) into key-value form (e.g. ("Spark", 1)), which makes the next step, reduceByKey, straightforward.
(4) reduce = pairs.reduceByKey(_ + _)
This statement performs a transformation on the RDD: all records in pairs are shuffled so that records with the same key are added together, and the result is placed in a ShuffledRDD. For example, (("Spark", 1), ("Spark", 1)) becomes ("Spark", 2).
Two further points deserve attention. First, this step is essentially split in two: the first part is a local-level reduce, in which each node first reduces the data it holds, generating a MapPartitionsRDD; the second part is a shuffle-level reduce, which shuffles the results of the first part, reduces them, and generates the final ShuffledRDD. Second, transformations are only executed when an action is performed; this is why, during debugging, every statement except textFile runs almost instantly: the RDD conversions are not carried out immediately.
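The two-phase structure described above can be sketched with plain Scala (names illustrative): phase 1 reduces locally inside each partition, and phase 2 merges the per-partition results, which is where Spark would perform the shuffle.

```scala
// A minimal sketch of reduceByKey's two phases on local collections.
object TwoPhaseReduce {
  // Phase 1: reduce the data held by one partition (map-side combine).
  def localReduce(partition: Seq[(String, Int)]): Map[String, Int] =
    partition.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // Phase 2: merge per-partition results (the shuffle-level reduce).
  def shuffleReduce(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(("Spark", 1), ("Spark", 1), ("Hadoop", 1)), // data held by node 0
      Seq(("Spark", 1), ("Hadoop", 1)))               // data held by node 1
    val partials = partitions.map(localReduce)        // phase 1: local reduce
    println(shuffleReduce(partials))                  // phase 2: merged counts
  }
}
```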
(5) sort_1 = reduce.map(pair => (pair._2, pair._1))
This statement performs a transformation on the RDD: it swaps the key and value of each record in the ShuffledRDD, generating a new MapPartitionsRDD. Example: ("Spark", 2) becomes (2, "Spark").
(6) sort_2 = sort_1.sortByKey(false)
This statement performs a transformation on the RDD: it sorts the MapPartitionsRDD by key and generates a ShuffledRDD.
(7) sort_3 = sort_2.map(pair => (pair._2, pair._1))
This statement performs a transformation on the RDD: it swaps the key and value of each record in the ShuffledRDD back again, generating a new MapPartitionsRDD. Example: (2, "Spark") becomes ("Spark", 2).
(8) filter = sort_3.filter(pair => pair._2 > 2)
This statement performs a transformation on the RDD: it filters the records of the MapPartitionsRDD by value, keeping only records whose value is greater than 2.
(9) Finally, collect() gathers the results to the driver, foreach() traverses the data, and println() prints every record.
Note: this content is based on the IMP course notes.
If you have any technical questions, feel free to reach me on QQ: 1106373297