Spark RDD transformations and actions: a consolidated summary (unfinished)

Source: Internet
Author: User
Tags: spark, rdd

1. Create an RDD

val lines = sc.parallelize(List("Pandas", "I like Pandas"))
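These snippets assume the spark-shell, where `sc` (a SparkContext) is already defined. In a standalone program you would create it yourself; a minimal sketch (the app name and the `local[*]` master are placeholder choices, not part of the original):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal standalone setup; "local[*]" runs Spark on all local cores.
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val lines = sc.parallelize(List("Pandas", "I like Pandas"))
```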

2. Load a local file into an RDD

val linesRdd = sc.textFile("Yangsy.txt")

3. Filtering with filter. Note that filter does not modify the original RDD; instead it creates a new RDD from the elements that pass the filter condition

val spark = linesRdd.filter(line => line.contains("Damowang"))

4. count() is also an action operation. Because Spark evaluates lazily, the statements before it (right or wrong) have not actually executed; real execution happens only when an action such as count(), first(), or foreach() is invoked

spark.count()
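A quick way to see this lazy loading in practice (a sketch; the missing file name is made up): a transformation on a nonexistent file raises no error until an action forces execution.

```scala
// textFile is lazy: it only records the lineage, nothing is read here.
val missing = sc.textFile("no_such_file.txt")
val upper = missing.map(_.toUpperCase) // still nothing has executed

// Only an action forces execution; the missing file surfaces as an
// error here, not on the lines above.
try {
  upper.count()
} catch {
  case e: Exception => println("failed only at the action: " + e.getMessage)
}
```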

5. foreach(println) outputs the data for inspection (use take() to fetch a small amount of data; if the object is a DataFrame, you can call show(1)). There is also the collect() function, which loads all of the data onto the driver side. When the data volume is large, do not call collect(), or it will blow up the driver server. (Although in our project we did in fact use collect() to load more than 40 million rows onto the driver...)

spark.take(1).foreach(println)

6. Common transformation operations and action operations. Common transformations include map() and filter()

For example, calculate the square of each value in the RDD:

val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))

7. flatMap() is similar to map, but the function passed to it returns an iterator over a sequence of values; the result is an RDD containing all the elements reachable through those iterators. A simple use is splitting a string into words

val lines = sc.parallelize(List("Xiaojingjing Is My Love", "Damowang", "kings_landing"))
val words = lines.flatMap(line => line.split(","))
words.first() // call first() to see a returned value

To summarize the RDD transformation operations:

Basic RDD transformations, for an RDD containing the data {1, 2, 3, 3}:

map: applies a function to each element in the RDD; the return values make up a new RDD. eg: rdd.map(x => x + 1) Result: {2, 3, 4, 4}

flatMap: applies a function to each element in the RDD; all the contents of the returned iterators make up a new RDD; often used for splitting. eg (on an RDD of strings): rdd.flatMap(x => x.split(",")).take(1).foreach(println) Result: 1

filter: returns an RDD made up of the elements that pass the filter. eg: rdd.filter(x => x != 1) Result: {2, 3, 3}

distinct: used for deduplication. eg: rdd.distinct() Result: {1, 2, 3}

Transformations on two RDDs, for RDDs containing the data {1, 2, 3} and {3, 4, 5} respectively:

union: generates an RDD containing all elements from both RDDs. eg: rdd.union(other) Result: {1, 2, 3, 3, 4, 5}

intersection: finds the elements common to both RDDs. eg: rdd.intersection(other) Result: {3}

subtract(): removes the other RDD's elements from this RDD. eg: rdd.subtract(other) Result: {1, 2}

cartesian(): the Cartesian product with another RDD. eg: rdd.cartesian(other) Result: {(1,3), (1,4), (1,5) ... (3,5)}
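The set-style operations can be run directly; a sketch using rdd = {1, 2, 3} and other = {3, 4, 5} (the data implied by the results listed above; collect() output order may vary across partitions):

```scala
val rdd = sc.parallelize(List(1, 2, 3))
val other = sc.parallelize(List(3, 4, 5))

rdd.union(other).collect()        // elements 1, 2, 3, 3, 4, 5 -- union keeps duplicates
rdd.intersection(other).collect() // element 3 -- intersection deduplicates (and shuffles)
rdd.subtract(other).collect()     // elements 1, 2
rdd.cartesian(other).collect()    // 9 pairs: (1,3), (1,4), ... (3,5)
```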

All of the above are transformation operations; next come the action operations

9. reduce integrates all the data in the RDD in parallel

val lines1 = sc.parallelize(List(1, 2, 3, 3))
lines1.reduce((x, y) => x + y)

10. reduceByKey: the simplest use is implementing a wordcount. The principle is to use the map function to convert the RDD into one of two-element tuples, and then aggregate the tuples by key with reduceByKey

val linesRdd = sc.textFile("Yangsy.txt")
val count = linesRdd.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()

11. The aggregate function is similar to reduce, but it can return a result of a different type than the RDD's elements

val result = input.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)) // accumulates (sum, count)

There are many more, such as count(), take(num), and so on, which we won't practice one by one here
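A few of those, sketched on the same {1, 2, 3, 3} data:

```scala
val rdd = sc.parallelize(List(1, 2, 3, 3))

rdd.count()  // 4 -- the number of elements
rdd.take(2)  // two elements, e.g. Array(1, 2) (no ordering guarantee)
rdd.top(2)   // the two largest: Array(3, 3)
rdd.first()  // the first element: 1
```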

12. The collect and foreach functions have in fact already been used above, so there's not much more to say here

To summarize the RDD action operations:

Operations on an RDD containing the data {1, 2, 3, 3}:

collect: returns all elements in the RDD. eg: rdd.collect()

count: returns the number of elements in the RDD

countByValue: returns the number of occurrences of each element in the RDD. eg: rdd.countByValue() Result: [(1,1), (2,1), (3,2)]

take(num): returns num elements from the RDD

top(num): returns the top num elements (the largest, by ordering) from the RDD

takeSample(withReplacement, num, [seed]): returns num randomly sampled elements from the RDD. eg: rdd.takeSample(false, 1)

reduce(func): integrates all the data in the RDD in parallel. eg: rdd.reduce((x, y) => x + y)

foreach(func): applies the given function to each element in the RDD

After calling persist() to cache data in memory, you can call unpersist() when you want the cached data removed
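A sketch of that caching pattern (MEMORY_ONLY is the default storage level for RDD persist(); the data here is made up):

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000).map(x => x * x)

// Keep the computed RDD in memory so repeated actions reuse it
data.persist(StorageLevel.MEMORY_ONLY) // equivalent to data.cache()

data.count() // first action: computes the RDD and caches it
data.count() // second action: served from the cache

// Remove it from memory when it is no longer needed
data.unpersist()
```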

