Spark RDD transformations and actions: a consolidated summary (unfinished)

Source: Internet
Author: User
Tags: spark, rdd

1. Create an RDD

val lines = sc.parallelize(List("Pandas", "I like Pandas"))
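These snippets assume the spark-shell, where `sc` (a SparkContext) is already defined. In a standalone program you would create it yourself; a minimal sketch (the app name and the `local[*]` master are placeholder choices, not part of the original):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal standalone setup; "local[*]" runs Spark on all local cores.
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val lines = sc.parallelize(List("Pandas", "I like Pandas"))
```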

2. Load a local file into an RDD

val linesRdd = sc.textFile("Yangsy.txt")

3. Filtering with filter. Note that filter does not modify the original RDD; instead it creates a new RDD from the elements that pass the filter condition

val spark = linesRdd.filter(line => line.contains("Damowang"))

4. count() is also an action operation. Because Spark evaluates lazily, the statements before it (right or wrong) have not actually executed; real execution happens only when an action such as count(), first(), or foreach() is invoked

spark.count()
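A quick way to see this lazy loading in practice (a sketch; the missing file name is made up): a transformation on a nonexistent file raises no error until an action forces execution.

```scala
// textFile is lazy: it only records the lineage, nothing is read here.
val missing = sc.textFile("no_such_file.txt")
val upper = missing.map(_.toUpperCase) // still nothing has executed

// Only an action forces execution; the missing file surfaces as an
// error here, not on the lines above.
try {
  upper.count()
} catch {
  case e: Exception => println("failed only at the action: " + e.getMessage)
}
```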

5. foreach(println) outputs the data for inspection (use take() to fetch a small amount of data; if the object is a DataFrame, you can call show(1)). There is also the collect() function, which loads all of the data onto the driver side. When the data volume is large, do not call collect(), or it will blow up the driver server. (Although in our project we did in fact use collect() to load more than 40 million rows onto the driver...)

spark.take(1).foreach(println)

6. Common transformation operations and action operations. Common transformations include map() and filter()

For example, calculate the square of each value in the RDD:

val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))

7. flatMap() is similar to map, but the function passed to it returns an iterator over a sequence of values; the result is an RDD containing all the elements reachable through those iterators. A simple use is splitting a string into words

val lines = sc.parallelize(List("Xiaojingjing Is My Love", "Damowang", "kings_landing"))
val words = lines.flatMap(line => line.split(","))
words.first() // call first() to see a returned value

To summarize the RDD transformation operations:

Basic RDD transformations, for an RDD containing the data {1, 2, 3, 3}:

map: applies a function to each element in the RDD; the return values make up a new RDD. eg: rdd.map(x => x + 1) Result: {2, 3, 4, 4}

flatMap: applies a function to each element in the RDD; all the contents of the returned iterators make up a new RDD; often used for splitting. eg (on an RDD of strings): rdd.flatMap(x => x.split(",")).take(1).foreach(println) Result: 1

filter: returns an RDD made up of the elements that pass the filter. eg: rdd.filter(x => x != 1) Result: {2, 3, 3}

distinct: used for deduplication. eg: rdd.distinct() Result: {1, 2, 3}

Transformations on two RDDs, for RDDs containing the data {1, 2, 3} and {3, 4, 5} respectively:

union: generates an RDD containing all elements from both RDDs. eg: rdd.union(other) Result: {1, 2, 3, 3, 4, 5}

intersection: finds the elements common to both RDDs. eg: rdd.intersection(other) Result: {3}

subtract(): removes the other RDD's elements from this RDD. eg: rdd.subtract(other) Result: {1, 2}

cartesian(): the Cartesian product with another RDD. eg: rdd.cartesian(other) Result: {(1,3), (1,4), (1,5) ... (3,5)}
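The set-style operations can be run directly; a sketch using rdd = {1, 2, 3} and other = {3, 4, 5} (the data implied by the results listed above; collect() output order may vary across partitions):

```scala
val rdd = sc.parallelize(List(1, 2, 3))
val other = sc.parallelize(List(3, 4, 5))

rdd.union(other).collect()        // elements 1, 2, 3, 3, 4, 5 -- union keeps duplicates
rdd.intersection(other).collect() // element 3 -- intersection deduplicates (and shuffles)
rdd.subtract(other).collect()     // elements 1, 2
rdd.cartesian(other).collect()    // 9 pairs: (1,3), (1,4), ... (3,5)
```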

All of the above are transformation operations; next come the action operations

9. reduce integrates all the data in the RDD in parallel

val lines1 = sc.parallelize(List(1, 2, 3, 3))
lines1.reduce((x, y) => x + y)

10. reduceByKey: the simplest use is implementing a wordcount. The principle is to use the map function to convert the RDD into one of two-element tuples, and then aggregate the tuples by key with reduceByKey

val linesRdd = sc.textFile("Yangsy.txt")
val count = linesRdd.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()

11. The aggregate function is similar to reduce, but it can return a result of a different type than the RDD's elements

val result = input.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)) // accumulates (sum, count)

There are many more, such as count(), take(num), and so on, which we won't practice one by one here
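A few of those, sketched on the same {1, 2, 3, 3} data:

```scala
val rdd = sc.parallelize(List(1, 2, 3, 3))

rdd.count()  // 4 -- the number of elements
rdd.take(2)  // two elements, e.g. Array(1, 2) (no ordering guarantee)
rdd.top(2)   // the two largest: Array(3, 3)
rdd.first()  // the first element: 1
```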

12. The collect and foreach functions have in fact already been used above, so there's not much more to say here

To summarize the RDD action operations:

Operations on an RDD containing the data {1, 2, 3, 3}:

collect: returns all elements in the RDD. eg: rdd.collect()

count: returns the number of elements in the RDD

countByValue: returns the number of occurrences of each element in the RDD. eg: rdd.countByValue() Result: [(1,1), (2,1), (3,2)]

take(num): returns num elements from the RDD

top(num): returns the top num elements (the largest, by ordering) from the RDD

takeSample(withReplacement, num, [seed]): returns num randomly sampled elements from the RDD. eg: rdd.takeSample(false, 1)

reduce(func): integrates all the data in the RDD in parallel. eg: rdd.reduce((x, y) => x + y)

foreach(func): applies the given function to each element in the RDD

After calling persist() to cache data in memory, you can call unpersist() when you want the cached data removed
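A sketch of that caching pattern (MEMORY_ONLY is the default storage level for RDD persist(); the data here is made up):

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000).map(x => x * x)

// Keep the computed RDD in memory so repeated actions reuse it
data.persist(StorageLevel.MEMORY_ONLY) // equivalent to data.cache()

data.count() // first action: computes the RDD and caches it
data.count() // second action: served from the cache

// Remove it from memory when it is no longer needed
data.unpersist()
```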

