"Spark in-depth learning 05" RDD Programming Tour Basics 02-spaek Shell

Source: Internet
Author: User
Tags random seed shuffle spark rdd

---------------------

Contents of this section:

· Spark transformation RDD operation examples

· Spark action RDD operation examples

· References

---------------------

Everyone has their own way of learning to program. For me, the best way is hands-on: the more demos I build and the more code I write, the deeper my understanding. This section therefore uses examples to explain how the various Spark RDD operations are used and what to watch out for, covering about 20 RDD demos.

I. Spark transformation RDD operation examples

A transformation on an RDD returns a new RDD, while an action returns some other data type.

1. Example: textFile/collect/foreach

---------------------

val line = sc.textFile("/tmp/test/core-site.xml")
line.collect().foreach(println)

------

Description

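textFile: reads a text file from HDFS (or any other Hadoop-supported file system) into an RDD with one element per line.

collect: returns all elements of the RDD to the driver as an array.

foreach: iterates over the elements, here printing each one.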

---------------------

val line = sc.parallelize(List(1, 2, 3, 4))
// map returns a new RDD, so keep its result instead of discarding it
val squares = line.map(x => x * x)
println(squares.collect().mkString(","))

------

Description

parallelize: creates an RDD from a collection that already exists in the driver program.

map: takes a function and applies it to each element of the RDD; the input type and the return type do not need to be the same.

mkString: joins the elements into a single string, separated by the given delimiter.

---------------------

2. Example: flatMap/first

---------------------

val lines = sc.parallelize(List("Hello World", "Hi Hi hhe"))
// split each line on spaces to get its words
val words = lines.flatMap(line => line.split(" "))
words.collect().foreach(println)
words.first()

------

Description

flatMap: applies a function that returns a sequence (iterator) of results for each element, then flattens all of those sequences into a single RDD.

first: returns the first element of the RDD.
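To see how flatMap differs from map, the sketch below runs the same split function on an ordinary Scala List instead of an RDD. Local collections flatten results in the same way, so no cluster is needed. The object name FlatMapSketch and its fields are made up for illustration and are not part of the original demo.

```scala
// Local-collection analogue of the flatMap/first demo.
// Assumption: a plain List stands in for the RDD; FlatMapSketch is a hypothetical name.
object FlatMapSketch {
  val lines: List[String] = List("Hello World", "Hi Hi hhe")

  // map keeps one output per input: a list holding two arrays of words
  val mapped: List[Array[String]] = lines.map(line => line.split(" "))

  // flatMap flattens those arrays into one list of words
  val words: List[String] = lines.flatMap(line => line.split(" "))

  // head plays the role of RDD.first here
  val firstWord: String = words.head
}
```

Here mapped has two elements (one array per input line), while words has five (one per word).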

---------------------

3. Example: filter/union

---------------------

val lines = sc.textFile("/tmp/test/core-site.xml")
val name = lines.filter(line => line.contains("name"))
val value = lines.filter(line => line.contains("value"))
val result = name.union(value)
result.collect().foreach(println)

------

Description

filter: leaves the source RDD unchanged and returns a new RDD containing only the elements that satisfy the filter condition.

union: combines the elements of two RDDs into one RDD. The two RDDs must have the same element type, and duplicates are kept.

---------------------

4. Example: distinct/sample/intersection/subtract/cartesian

---------------------

// distinct
val lines = sc.parallelize(List(1, 2, 3, 4, 1, 2, 3, 3))
val result = lines.distinct()
result.collect().foreach(println)

// sample
val a = sc.parallelize(1 to 1000, 3)
val result = a.sample(false, 0.02, 0)
result.collect().foreach(println)

// intersection and subtract
val a = sc.parallelize(List(1, 2, 3, 4))
// the element values of this list were lost in the source; 1 and 2 are assumed for illustration
val b = sc.parallelize(List(1, 2))
val result = a.intersection(b)
val result2 = a.subtract(b)
result.collect().foreach(println)
result2.collect().foreach(println)

// cartesian
val a = sc.parallelize(List("A", "B", "C"))
val b = sc.parallelize(List("1", "2"))
val result = a.cartesian(b)
result.collect().foreach(println)

------

Description

distinct: removes duplicate elements from the RDD. It triggers a shuffle, so it is expensive.

sample: samples the elements of the RDD. The first parameter, withReplacement, is true for sampling with replacement and false for sampling without replacement. The second parameter, fraction, is the expected proportion of elements to select (not an exact count). The third parameter is the random seed.

intersection: returns the elements common to both RDDs, with duplicates removed. It triggers a shuffle.

subtract: returns the elements of the first RDD that do not appear in the second RDD. It triggers a shuffle.

cartesian: returns the Cartesian product of the two RDDs. It is very costly on large data sets.
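The results of these set-like operations can be previewed on local collections. The sketch below is only an analogue: plain Lists stand in for RDDs, the values of b are chosen for illustration, and SetOpsSketch is a hypothetical name. On a real cluster the order of the returned elements is not guaranteed.

```scala
// Local-collection analogue of distinct/intersection/subtract/cartesian.
// Assumption: plain Lists stand in for RDDs; SetOpsSketch is a hypothetical name.
object SetOpsSketch {
  val a: List[Int] = List(1, 2, 3, 4, 1, 2, 3, 3)
  val b: List[Int] = List(3, 4, 5) // illustrative values

  // distinct: each value is kept once
  val distinctA: List[Int] = a.distinct

  // intersection: values present in both, with duplicates removed
  val common: List[Int] = a.distinct.filter(v => b.contains(v))

  // subtract: every element of a whose value does not occur in b
  val onlyInA: List[Int] = a.filterNot(v => b.contains(v))

  // cartesian: every pairing of an element of x with an element of y
  val x: List[String] = List("A", "B", "C")
  val y: List[String] = List("1", "2")
  val pairs: List[(String, String)] = for (l <- x; r <- y) yield (l, r)
}
```

Note that the Cartesian product of 3 and 2 elements already has 6 pairs; the size grows multiplicatively, which is why it is costly at scale.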

---------------------

II. Spark action RDD operation examples

1. Example: reduce/fold/aggregate [action operations]

--------Reduce-------------

val line = sc.parallelize(List(1, 2, 3, 4))
val sum = line.reduce((x, y) => x + y)
println(sum)

--------Fold-------------

val line = sc.parallelize(List(1, 2, 3, 4), 2)
val sum = line.fold(1)((x, y) => x + y)
println(sum)

--------Aggregate demo01-------------

val line = sc.parallelize(List(1, 2, 3, 4))
val result = line.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)
val avg = result._1 / result._2.toDouble
println(avg)

--------Aggregate demo02-------------

def seqOp(a: Int, b: Int): Int = {
  println("seqOp: " + a + "\t" + b)
  math.min(a, b)
}

def comOp(a: Int, b: Int): Int = {
  println("comOp: " + a + "\t" + b)
  a + b
}

val line = sc.parallelize(List(1, 2, 3, 4, 5), 1)
val result = line.aggregate(2)(seqOp, comOp)
println(result)

------

Description

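reduce: takes a function as its argument; the function takes two elements of the RDD's element type and returns a new element of the same type.

fold: works like reduce, but also takes an initial "zero value". The zero value is the starting value inside every partition and is applied once more when the partition results are combined, so unless it is the identity of the operation (0 for addition) the result depends on the number of partitions. The fold calculation proceeds as follows: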

If line has only 1 partition:

First partition calculation (starting from the zero value 1)

First time: 1+1=2;

Second time: 2+2=4;

Third time: 4+3=7;

Fourth time: 7+4=11;

Combine calculation (starting from the zero value 1):

First time: 1+11=12, final result: 12

If line has 2 partitions, "val line = sc.parallelize(List(1, 2, 3, 4), 2)":

First partition calculation (starting from the zero value 1)

First time: 1+1=2;

Second time: 2+2=4;

Second partition calculation (starting from the zero value 1)

First time: 1+3=4;

Second time: 4+4=8;

Combine calculation (starting from the zero value 1):

First time: 1+4=5;

Second time: 5+8=13

Final result: 13
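The walkthrough above can be reproduced without a cluster. The following is a minimal sketch that imitates the behavior with plain Scala collections, assuming each inner Seq represents one partition; FoldSketch and foldLikeSpark are hypothetical names, not part of the Spark API.

```scala
// Simulates how fold applies the zero value per partition and again on combine.
// Assumption: each inner Seq stands for one partition; names are hypothetical.
object FoldSketch {
  def foldLikeSpark(partitions: Seq[Seq[Int]], zero: Int)(op: (Int, Int) => Int): Int = {
    // fold every partition on its own, each starting from the zero value
    val perPartition = partitions.map(p => p.foldLeft(zero)(op))
    // combine the partition results, starting from the zero value once more
    perPartition.foldLeft(zero)(op)
  }

  // zero value 1, one partition: (1+1+2+3+4) = 11, then 1+11 = 12
  val onePartition: Int = foldLikeSpark(Seq(Seq(1, 2, 3, 4)), 1)(_ + _)

  // zero value 1, two partitions: 4 and 8, then 1+4+8 = 13
  val twoPartitions: Int = foldLikeSpark(Seq(Seq(1, 2), Seq(3, 4)), 1)(_ + _)

  // zero value 0 is the identity for addition, so partitioning no longer matters
  val zeroIdentity: Int = foldLikeSpark(Seq(Seq(1, 2), Seq(3, 4)), 0)(_ + _)
}
```

With a zero value of 0 the result is 10 however the data is partitioned, which is why an identity element is the usual choice.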

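aggregate: takes a zero value, a function that folds each element into the accumulator within a partition, and a function that merges the accumulators of different partitions. Unlike reduce and fold, the result type may differ from the element type. Execution process:

Demo1 execution process (shown as if all elements sit in one partition)

Step1: (0+1,0+1) = (1,1)

Step2: (1+2,1+1) = (3,2)

Step3: (3+3,2+1) = (6,3)

Step4: (6+4,3+1) = (10,4)

Step5 (combine with the zero value): (0+10,0+4) = (10,4)

avg = 10/4 = 2.5

Demo2 execution process (1 partition, zero value 2)

Step1: math.min(2,1) = 1

Step2: math.min(1,2) = 1

Step3: math.min(1,3) = 1

Step4: math.min(1,4) = 1

Step5: math.min(1,5) = 1

Step6 (combine with the zero value): 2+1 = 3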
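Both aggregate demos can be checked locally in the same spirit. This is a sketch under the assumption that each inner Seq represents one partition; AggregateSketch and aggregateLikeSpark are hypothetical names that only imitate the behavior described above and are not Spark API.

```scala
// Simulates aggregate: a per-partition fold, then a merge of partition results.
// Assumption: each inner Seq stands for one partition; names are hypothetical.
object AggregateSketch {
  def aggregateLikeSpark[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // fold each partition's elements into an accumulator that starts at zero
    val perPartition = partitions.map(p => p.foldLeft(zero)(seqOp))
    // merge the accumulators, again starting from zero
    perPartition.foldLeft(zero)(combOp)
  }

  // Demo 1: running (sum, count) pair, then the average
  val sumCount: (Int, Int) =
    aggregateLikeSpark[Int, (Int, Int)](Seq(Seq(1, 2, 3, 4)), (0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )
  val avg: Double = sumCount._1 / sumCount._2.toDouble

  // Demo 2: minimum within the partition, sum across partition results
  val minThenSum: Int =
    aggregateLikeSpark[Int, Int](Seq(Seq(1, 2, 3, 4, 5)), 2)(
      (a, b) => math.min(a, b),
      (a, b) => a + b
    )
}
```

The first demo shows why aggregate is useful: the accumulator is a pair while the elements are integers, something reduce and fold cannot express.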

2. Example: count/countByValue/take/top/takeOrdered/takeSample

---------------------

val line = sc.parallelize(List(1, 2, 3, 3), 1)
val result = line.count()
println(result)

val line = sc.parallelize(List(1, 2, 3, 3), 1)
val result = line.countByValue()
println(result)

val line = sc.parallelize(List(1, 2, 3, 3), 1)
val result = line.take(3)
result.foreach(println)

val line = sc.parallelize(List(1, 2, 3, 3), 1)
val result = line.top(2)
result.foreach(println)

val line = sc.parallelize(List(1, 2, 3, 3), 1)
val result = line.takeOrdered(2)
result.foreach(println)

val line = sc.parallelize(List(1, 2, 3, 3), 1)
val result = line.takeSample(false, 2)
result.foreach(println)

------

Description

count: returns the number of elements in the RDD.

countByValue: returns the number of times each distinct element appears in the RDD.

take: returns the first n elements of the RDD. Like collect, it fetches elements from the remote cluster to the driver, but collect fetches all of the data while take fetches only the first n elements.

top: returns the n largest elements, in descending order.

takeOrdered: returns the first n elements in ascending order, or in the order defined by a supplied Ordering.

takeSample: returns a random sample of the requested number of elements from the RDD.
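The difference between take, top, and takeOrdered is easiest to see side by side. The sketch below mirrors them on a local List; it is an analogue under the assumption that a List stands in for a single-partition RDD, and CountTakeSketch is a hypothetical name.

```scala
// Local-collection analogue of count/countByValue/take/top/takeOrdered.
// Assumption: a List stands in for a single-partition RDD; the name is hypothetical.
object CountTakeSketch {
  val data: List[Int] = List(1, 2, 3, 3)

  // count (the RDD version returns a Long; size returns an Int here)
  val count: Int = data.size

  // countByValue: occurrences of each distinct value
  val byValue: Map[Int, Int] =
    data.groupBy(identity).map { case (v, occurrences) => (v, occurrences.size) }

  // take(3): the first three elements in their existing order
  val firstThree: List[Int] = data.take(3)

  // top(2): the two largest elements, in descending order
  val topTwo: List[Int] = data.sorted(Ordering[Int].reverse).take(2)

  // takeOrdered(2): the two smallest elements, in ascending order
  val smallestTwo: List[Int] = data.sorted.take(2)
}
```

takeSample is left out because its result is random and depends on the seed.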

---------------------

III. References

1. fold calculation process: http://www.aboutyun.com/home.php?mod=space&uid=1&do=blog&id=368

2. fold calculation process: http://www.cnblogs.com/mobin/p/5414490.html#12

3. aggregate calculation process: https://www.iteblog.com/archives/1268.html

"Spark in-depth learning 05" RDD Programming Tour Basics 02-spaek Shell
