Some RDD operations and transformations in Spark
```scala
// create an RDD from a text file
val textFile = sc.textFile("README.md")

// get the first element of the textFile RDD
textFile.first()
res3: String = # Apache Spark

// filter out the lines that contain the keyword "Spark", then count them
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()
res10: Long = 19
```
```scala
// find the line with the most words in the textFile RDD
textFile.map(line => line.split(" ").size).reduce((a, b) => math.max(a, b))
res12: Int = 14   // the longest line contains 14 words

// Java methods can also be used in the Scala shell:
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
```
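Building on the same idea, a minimal word-count sketch over the same textFile RDD; this is not in the original notes, but it combines the flatMap and reduceByKey transformations described in the table below:

```scala
// split every line into words, pair each word with a count of 1,
// then sum the counts per word
val wordCounts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

wordCounts.take(5)   // a few (word, count) pairs
```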
```scala
// cache the RDD linesWithSpark, then count again
linesWithSpark.cache()
res13: linesWithSpark.type = MapPartitionsRDD[8] at filter at <console>:23

linesWithSpark.count()
res15: Long = 19
```
RDD creation:

makeRDD and parallelize do the same thing, but makeRDD appears to exist only in the Scala API, while parallelize is also available in Python and R.
```scala
// create an RDD, thingsRDD, from a list of words
val thingsRDD = sc.parallelize(List("spoon", "fork", "plate", "cup", "bottle"))

// count the number of elements in thingsRDD
thingsRDD.count()
res16: Long = 5
```
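For comparison, a minimal sketch of the same RDD built with makeRDD (Scala API only, as noted above):

```scala
// makeRDD behaves like parallelize in the Scala API
val thingsRDD2 = sc.makeRDD(List("spoon", "fork", "plate", "cup", "bottle"))
thingsRDD2.count()   // 5
```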
The groupByKey() transformation:

```scala
pairRDD.groupByKey()
```

yields:

```
Banana [Yellow]
Apple [Red, Green]
Kiwi [Green]
Figs [Black]
```
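pairRDD is never defined in these notes; a minimal sketch of a (fruit, color) pair RDD that would produce the output above:

```scala
// hypothetical (fruit, color) pair RDD behind the groupByKey example
val pairRDD = sc.parallelize(Seq(
  ("Banana", "Yellow"),
  ("Apple", "Red"), ("Apple", "Green"),
  ("Kiwi", "Green"),
  ("Figs", "Black")
))

// groupByKey gathers all values sharing a key into one iterable
pairRDD.groupByKey().collect().foreach { case (fruit, colors) =>
  println(s"$fruit ${colors.mkString("[", ", ", "]")}")
}
```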
Collect (materialize) the data in the linesWithSpark RDD. The collect() method returns the computed values to the driver:

```scala
linesWithSpark.collect()
```
Cache the RDD linesWithSpark:

```scala
linesWithSpark.cache()
```

Remove linesWithSpark from memory:

```scala
linesWithSpark.unpersist()
```
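cache() is shorthand for persist() with the default MEMORY_ONLY storage level; a minimal sketch (my addition, not from the original notes) of choosing a different level:

```scala
import org.apache.spark.storage.StorageLevel

// keep the RDD in memory, spilling to disk if it does not fit
linesWithSpark.persist(StorageLevel.MEMORY_AND_DISK)
linesWithSpark.count()     // the first action computes and caches the RDD
linesWithSpark.unpersist()
```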
Some of the RDD transformation operations:

| Transformation | Function |
| --- | --- |
| filter() | Keeps only the elements that satisfy the given predicate. |
| map() | Maps each element of the RDD to a new element through a function and returns the resulting RDD. |
| flatMap() | Maps each element first, then flattens all of the outputs into a single collection. |
| distinct() | Removes duplicate elements from the RDD. |
| coalesce() | Repartitions the RDD; when shuffling is enabled it uses the HashPartitioner. |
| repartition() | Equivalent to coalesce() with the second parameter (shuffle) set to true. |
| sample() | Returns a random sample of the RDD. |
| union() | Combines two RDDs, without removing duplicates. |
| intersection() | Returns the intersection of two RDDs, with duplicates removed. |
| subtract() | Similar to intersection(); returns the elements that appear in this RDD but not in the other RDD, without removing duplicates. |
| mapPartitions() | Similar to map(), but the function is applied once per partition. |
| mapPartitionsWithIndex() | Like mapPartitions(), but the function also receives the partition index as a parameter. |
| zip() | Combines two RDDs into an RDD of key/value pairs; the two RDDs must have the same number of partitions and elements, otherwise an exception is thrown. |
| zipPartitions() | Combines multiple RDDs into a new RDD partition by partition; the RDDs must have the same number of partitions, but there is no requirement on the number of elements per partition. |
| partitionBy() | Repartitions a key/value RDD according to the given partitioner. |
| mapValues() | Applies a function to the value of each key/value pair, leaving the keys unchanged. |
| flatMapValues() | Like mapValues(), but flattens the output values. |
| combineByKey() | Combines the values for each key using custom create and merge functions. |
| foldByKey() | Merges the values for each key using a function and a zero value. |
| groupByKey() | Groups all of the values that share a key into a single iterable. |
| reduceByKey() | Merges the values for each key using a reduce function. |
| reduceByKeyLocally() | Like reduceByKey(), but immediately returns the result to the driver as a Map. |
| randomSplit() | Splits one RDD into multiple RDDs according to the given weights. |
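A minimal sketch exercising a few of the transformations above (assuming a spark-shell session with sc available):

```scala
val nums = sc.parallelize(Seq(1, 2, 2, 3, 4))
val more = sc.parallelize(Seq(3, 4, 5))

nums.map(_ * 10).collect()           // Array(10, 20, 20, 30, 40)
nums.filter(_ % 2 == 0).collect()    // Array(2, 2, 4)
nums.distinct().collect()            // Array(1, 2, 3, 4) in some order
nums.union(more).collect()           // duplicates kept: 8 elements
nums.intersection(more).collect()    // Array(3, 4) in some order

// pair-RDD transformations
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2)) in some order
pairs.mapValues(_ * 2).collect()     // Array((a,2), (b,4), (a,6))
```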
Some of the RDD action operations:

| Action | Description |
| --- | --- |
| first() | Returns the first element of the RDD. |
| count() | Returns the number of elements in the RDD. |
| reduce() | Aggregates the elements of the RDD with the given function. |
| collect() | Returns all of the elements of the RDD to the driver. |
| take(n) | Returns the first n elements of the RDD. |
| top(n) | Returns the largest n elements, by the implicit ordering. |
| takeOrdered(n) | Returns the smallest n elements, by the implicit ordering. |
| aggregate() | Aggregates the elements of each partition, then the partition results, using a zero value and two separate functions. |
| fold() | Like reduce(), but with a zero value. |
| lookup(key) | Returns all of the values for the given key in a key/value RDD. |
| countByKey() | Counts the number of elements for each key. |
| foreach() | Applies a function to each element of the RDD. |
| foreachPartition() | Applies a function to each partition of the RDD. |
| sortBy() | Returns a new RDD sorted by the given key function (strictly a transformation, though often listed with the actions). |
| saveAsTextFile() | Writes the elements of the RDD as a text file. |
| saveAsSequenceFile() | Writes a key/value RDD as a Hadoop SequenceFile. |
| saveAsObjectFile() | Writes the elements of the RDD as serialized Java objects. |
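A minimal sketch of a few of these actions (again assuming spark-shell; the output directory is hypothetical):

```scala
val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

nums.first()           // 5
nums.count()           // 5
nums.reduce(_ + _)     // 15
nums.take(2)           // Array(5, 1)
nums.top(2)            // Array(5, 4)
nums.takeOrdered(2)    // Array(1, 2)
nums.fold(0)(_ + _)    // 15

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.countByKey()     // Map(a -> 2, b -> 1)
pairs.lookup("a")      // values for key "a": Seq(1, 3)

// hypothetical output directory
nums.saveAsTextFile("/tmp/nums-output")
```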