Lesson 16: RDD in Practice
Because RDDs are immutable, operating on an RDD differs from ordinary object-oriented manipulation. RDD operations fall into three basic categories: transformations, actions, and controllers.
1. Transformation
A transformation creates a new RDD from an existing RDD through a conversion.
map(func): applies func to each element of the RDD on which map is called and returns a new RDD; the returned dataset is a distributed dataset
filter(func): applies func to each element of the RDD on which filter is called and returns a new RDD containing the elements for which func is true
flatMap(func): similar to map, but each input element can produce zero or more output elements
mapPartitions(func): like map, but while map works on each element, mapPartitions works on each partition
mapPartitionsWithSplit(func): similar to mapPartitions, but func is applied to one particular split, so func must take the partition index as an argument (newer Spark versions call this mapPartitionsWithIndex)
sample(withReplacement, fraction, seed): samples the data
union(otherDataset): returns a new dataset containing the union of the elements of the source dataset and the given dataset
distinct([numTasks]): returns a new dataset containing the distinct elements of the source dataset
groupByKey([numTasks]): returns (K, Seq[V]) pairs, the same key/value-list shape that the reduce function in Hadoop receives
reduceByKey(func, [numTasks]): applies the given reduce function to the (K, Seq[V]) groups as produced by groupByKey, for example to sum or average the values
sortByKey([ascending], [numTasks]): sorts by key in ascending or descending order; ascending is a Boolean
join(otherDataset, [numTasks]): given two key-value datasets (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs; numTasks is the number of concurrent tasks
cogroup(otherDataset, [numTasks]): given two key-value datasets (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples; numTasks is the number of concurrent tasks
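The semantics of the most common transformations can be sketched with plain Scala collections instead of RDDs. flatMap, map, and filter below are the same method names Spark exposes; reduceByKey has no collection equivalent, so it is emulated here with groupBy plus a per-key sum (the object and method names are illustrative only):

```scala
// A minimal sketch of transformation semantics on plain Scala collections,
// not on real RDDs. flatMap/map/filter behave like their RDD namesakes;
// groupBy + per-key sum emulates reduceByKey(_ + _).
object TransformationSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(line => line.split(" "))          // flatMap: one line -> many words
         .map(word => (word, 1))                    // map: word -> (word, 1)
         .groupBy(_._1)                             // emulates reduceByKey(_ + _)
         .map { case (w, ps) => (w, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("spark spark hadoop", "hadoop flink spark"))
    println(counts.filter { case (_, n) => n > 1 }) // filter: keep frequent words
  }
}
```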
Transformation Features:
Lazy evaluation: because transformations are lazy, creating one does not run it immediately. For the framework this means there is time to see as many steps as possible before executing, and the more steps it can see, the more room it has to optimize. The simplest optimization is merging steps: for example, instead of computing a = b * 3; b = c * 3; c = d * 3; d = 3 one statement at a time, the merged computation is simply a = 3 * 3 * 3 * 3.
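This laziness can be illustrated with a Scala view rather than Spark (the counter and names below are illustrative): like an RDD transformation, view.map only records the step, and nothing runs until a terminal call such as sum forces the pipeline.

```scala
// A minimal sketch of lazy evaluation using a Scala collection view,
// not Spark. The mutable counter shows when triple actually runs.
object LazySketch {
  var evaluations = 0
  def triple(x: Int): Int = { evaluations += 1; x * 3 }

  def main(args: Array[String]): Unit = {
    val pipeline = Seq(3).view.map(triple).map(triple).map(triple)
    println(evaluations)  // still 0: the three maps are only recorded
    println(pipeline.sum) // 81: forcing the view runs all three maps
    println(evaluations)  // 3
  }
}
```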
2. Action
The purpose of an action is to obtain a value or a result.
reduce(func): aggregates the dataset; the function passed in takes two arguments and returns one value, and it must be commutative and associative
collect(): usually used after a filter or when the result is small enough; returns the elements wrapped in an array
count(): returns the number of elements in the dataset
first(): returns the first element of the dataset
take(n): returns the first n elements, delivered to the driver program
takeSample(withReplacement, num, seed): samples the dataset and returns num elements, using seed as the random seed
saveAsTextFile(path): writes the dataset as a text file to the local file system, HDFS, or any HDFS-supported file system; Spark converts each record into a line of text and writes it to the file
saveAsSequenceFile(path): can only be used on key-value RDDs; generates a SequenceFile and writes it to the local or Hadoop file system
countByKey(): returns a map from each key of the RDD to the number of elements with that key
foreach(func): applies func to each element of the dataset
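The action semantics, too, can be sketched on plain Scala collections (not Spark): reduce, size, head, and take behave on local data like reduce, count(), first(), and take(n) do on RDDs, and countByKey is emulated with groupBy (the object name and helper are illustrative only).

```scala
// A minimal sketch of action semantics using plain Scala collections.
object ActionSketch {
  // Emulates RDD countByKey(): number of elements per key.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.size.toLong) }

  def main(args: Array[String]): Unit = {
    val nums = Seq(1, 2, 3, 4)
    println(nums.reduce(_ + _)) // 10: (_ + _) is commutative and associative
    println(nums.size)          // 4, like count()
    println(nums.head)          // 1, like first()
    println(nums.take(2))       // the first two elements, like take(2)
    println(countByKey(Seq(("a", 1), ("a", 2), ("b", 3))))
  }
}
```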
3. Controller
Controller operations mainly deal with persisting RDDs, for example cache(), persist(), and checkpoint().
The details will be covered in a follow-up article.
4. Spark WordCount Hands-On
This section uses IDEA to step through a WordCount example so that readers can see the concrete RDD type produced at each step, paving the way for the analysis that follows.
(1) The WordCount code used is as follows:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()                 // create SparkConf
    conf.setAppName("Wow,my first Spark app")  // set the app name
    conf.setMaster("local")                    // run locally
    val sc = new SparkContext(conf)
    val lines = sc.textFile("C://users//feng//ideaprojects//wordcount//src//sparktext.txt")
    val words = lines.flatMap { line => line.split(" ") }
    val pairs = words.map(word => (word, 1))
    val reduce = pairs.reduceByKey(_ + _)
    val sort_1 = reduce.map(pair => (pair._2, pair._1))
    val sort_2 = sort_1.sortByKey(false)
    val sort_3 = sort_2.map(pair => (pair._2, pair._1))
    val filter = sort_3.filter(pair => pair._2 > 2)
    filter.collect.foreach(wordNumberPair => println(wordNumberPair._1 + ":" + wordNumberPair._2))
    sc.stop()
  }
}
(2) The contents of the SparkText.txt file used by the program are as follows:
Hadoop Hadoop Hadoop
Spark Flink Spark
Scala Scala Object
Object Spark Scala
Spark Spark
Hadoop Hadoop
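As a minimal sketch (plain Scala, no Spark required, names illustrative), the same pipeline can be replayed on these six sample lines, so the expected result can be checked without running a cluster; groupBy plus a per-key sum stands in for reduceByKey(_ + _):

```scala
// Replays the WordCount pipeline on local collections instead of RDDs.
object WordCountLocal {
  def run(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .groupBy(_._1)                            // emulates reduceByKey(_ + _)
         .map { case (w, ps) => (w, ps.map(_._2).sum) }
         .toSeq
         .filter { case (_, n) => n > 2 }          // keep words seen more than twice
         .sortBy { case (_, n) => -n }             // descending by count

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "Hadoop Hadoop Hadoop", "Spark Flink Spark",
      "Scala Scala Object", "Object Spark Scala",
      "Spark Spark", "Hadoop Hadoop")
    run(lines).foreach { case (w, n) => println(w + ":" + n) }
    // Hadoop and Spark each appear 5 times and Scala 3 times;
    // Object (2) and Flink (1) fall below the filter threshold.
  }
}
```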
(3) WordCount debug results:
By stepping through with IDEA's debugger, the Debug window shows, for each line of code, which type of RDD it produces and important information such as which parent RDD that RDD depends on (Figure 2-14); the program's output is shown in Figure 2-15.
Figure 2-14 Debug process diagram
Figure 2-15 WordCount results
2.8.2 Parsing the internal mechanism of RDD generation
This section builds on the debugging results above: it revisits the debug information, explains it, and in doing so reviews the contents of this chapter.
(1) lines = sc.textFile()
The purpose of this statement is to read external data and generate a MapPartitionsRDD. Note the following:
As shown in Figure 2-16, the deps (dependency) of the MapPartitionsRDD is a HadoopRDD. From this we can see that textFile() actually consists of two steps: the first step converts the file contents into a HadoopRDD (in key-value form, where the key is the offset of each line); the second step converts the HadoopRDD into a MapPartitionsRDD (in value form, dropping the key of each key-value pair).
Figure 2-16 Getting data from the HadoopRDD
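These two steps can be mimicked with plain Scala (the object and method names are illustrative): step 1 pairs each line with an offset key, like the key-value form of the HadoopRDD; step 2 drops the keys, like the MapPartitionsRDD that textFile() actually returns. For simplicity the key here is the line index; the real HadoopRDD keys each line by its byte offset in the file.

```scala
// A minimal sketch of the two steps hidden inside textFile().
object TextFileSketch {
  // Step 1: key-value form, mimicking HadoopRDD (key = line index here).
  def step1HadoopStyle(text: String): Seq[(Long, String)] =
    text.split("\n").toSeq.zipWithIndex.map { case (line, i) => (i.toLong, line) }

  // Step 2: value form, mimicking the MapPartitionsRDD (keys dropped).
  def step2ValuesOnly(kv: Seq[(Long, String)]): Seq[String] =
    kv.map(_._2)

  def main(args: Array[String]): Unit = {
    val kv = step1HadoopStyle("Hadoop Hadoop Hadoop\nSpark Flink Spark")
    println(kv)                  // keyed lines, as in the HadoopRDD
    println(step2ValuesOnly(kv)) // plain lines, as in the MapPartitionsRDD
  }
}
```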
(2) words = lines.flatMap()
This statement performs a transformation on the RDD: it splits each record of the MapPartitionsRDD on spaces and puts every resulting fragment into a new MapPartitionsRDD.
(3) pairs = words.map(word => (word, 1))
This statement performs a transformation on the RDD: it converts each record of the MapPartitionsRDD (e.g. "Spark", a plain value) into key-value form (e.g. ("Spark", 1)), which makes the next step, reduceByKey, straightforward.
(4) reduce = pairs.reduceByKey(_ + _)
This statement performs a transformation on the RDD: all records in pairs are shuffled so that records with the same key are added together, and the result is placed in a ShuffledRDD. For example, (("Spark", 1), ("Spark", 1)) becomes ("Spark", 2).
Two further points deserve attention. First, this step is essentially split in two: the first part is a local-level reduce, in which each node first reduces the data it holds, generating a MapPartitionsRDD; the second part is a shuffle-level reduce, which shuffles the results of the first part, reduces them, and generates the final ShuffledRDD. Second, transformations are only executed when an action is performed; this is why, during debugging, every statement except textFile runs almost instantly: the RDD conversions are not carried out immediately.
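The two-phase structure described above can be sketched with plain Scala (names illustrative): phase 1 reduces locally inside each partition, and phase 2 merges the per-partition results, which is where Spark would perform the shuffle.

```scala
// A minimal sketch of reduceByKey's two phases on local collections.
object TwoPhaseReduce {
  // Phase 1: reduce the data held by one partition (map-side combine).
  def localReduce(partition: Seq[(String, Int)]): Map[String, Int] =
    partition.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // Phase 2: merge per-partition results (the shuffle-level reduce).
  def shuffleReduce(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(("Spark", 1), ("Spark", 1), ("Hadoop", 1)), // data held by node 0
      Seq(("Spark", 1), ("Hadoop", 1)))               // data held by node 1
    val partials = partitions.map(localReduce)        // phase 1: local reduce
    println(shuffleReduce(partials))                  // phase 2: merged counts
  }
}
```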
(5) sort_1 = reduce.map(pair => (pair._2, pair._1))
This statement performs a transformation on the RDD: it swaps the key and value of each record in the ShuffledRDD, generating a new MapPartitionsRDD. Example: ("Spark", 2) becomes (2, "Spark").
(6) sort_2 = sort_1.sortByKey(false)
This statement performs a transformation on the RDD: it sorts the MapPartitionsRDD by key and generates a ShuffledRDD.
(7) sort_3 = sort_2.map(pair => (pair._2, pair._1))
This statement performs a transformation on the RDD: it swaps the key and value of each record in the ShuffledRDD back again, generating a new MapPartitionsRDD. Example: (2, "Spark") becomes ("Spark", 2).
(8) filter = sort_3.filter(pair => pair._2 > 2)
This statement performs a transformation on the RDD: it filters the records of the MapPartitionsRDD by value, keeping only records whose value is greater than 2.
(9) Finally, collect() gathers the results to the driver, foreach() traverses the data, and println() prints every record.
Note: this content is based on the IMP course notes.
If you have any technical questions, feel free to reach me on QQ: 1106373297