Basic Spark operations in Scala (unfinished)

Source: Internet
Author: User

[Introduction to Apache Spark Big Data Analysis (I)](http://www.csdn.net/article/2015-11-25/2826324)

Spark Note 5: SparkContext, SparkConf

Spark reads HBase

Scala's powerful collection data operations example

Some RDD operations and transformations in Spark

// Create the RDD textFile
val textFile = sc.textFile("README.md")
textFile.first()  // get the first element of the RDD textFile
res3: String = # Apache Spark

// Filter out the lines that contain the keyword "Spark", then count them
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()
res10: Long = 19

// Find the line with the most words in the RDD textFile
textFile.map(line => line.split(" ").size).reduce((a, b) => math.max(a, b))
res12: Int =  // line 14 contains the most words

// Use Java methods in the Scala shell:
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))

// Cache the RDD linesWithSpark, then count it
linesWithSpark.cache()
res13: linesWithSpark.type = MapPartitionsRDD[8] at filter at <console>:23
linesWithSpark.count()
res15: Long = 19

Creating RDDs:
makeRDD and parallelize do the same thing, but makeRDD seems to exist only in the Scala API, while parallelize is also available in Python and R.
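
As a quick illustration (a minimal sketch, assuming a running Spark shell where the SparkContext sc already exists), both calls build the same kind of RDD from a local collection:

// makeRDD is Scala-only; parallelize also exists in the Python and R APIs
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5))
val rdd2 = sc.makeRDD(Seq(1, 2, 3, 4, 5))
rdd1.collect()   // Array(1, 2, 3, 4, 5)
rdd2.collect()   // Array(1, 2, 3, 4, 5)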

// Create the RDD thingsRDD from a list of words
val thingsRDD = sc.parallelize(List("Spoon", "Fork", "Plate", "Cup", "Bottle"))

// Count the number of elements in the RDD thingsRDD
thingsRDD.count()
res16: Long = 5

The groupByKey() transformation:

pairRDD.groupByKey()
// Result:
// Banana [Yellow]
// Apple  [Red, Green]
// Kiwi   [Green]
// Figs   [Black]
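
For context, here is a minimal sketch of how a pair RDD like the one above could be built; the (fruit, color) pairs are taken from the output shown, and the name pairRDD is just the one used in this note:

// Build a pair RDD of (fruit, color) and group the colors by fruit
val pairRDD = sc.parallelize(Seq(
  ("Banana", "Yellow"), ("Apple", "Red"), ("Apple", "Green"),
  ("Kiwi", "Green"), ("Figs", "Black")))
pairRDD.groupByKey().collect().foreach {
  case (fruit, colors) => println(s"$fruit ${colors.mkString("[", ", ", "]")}")
}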
Collect (materialize) the data in the RDD linesWithSpark:

The collect method returns the computed values ...

linesWithSpark.collect()

Cache the RDD linesWithSpark:

linesWithSpark.cache()

Remove linesWithSpark from memory:

linesWithSpark.unpersist()

Some of the RDD transformations:

| Transformation | Description |
| --- | --- |
| filter() | Filters elements |
| map() | Applies a function to each element of the RDD and returns a new RDD of the results |
| flatMap() | Maps each element first, then flattens all the outputs into one collection |
| distinct() | Removes duplicate elements from the RDD |
| coalesce() | Repartitions the RDD, using a HashPartitioner |
| repartition() | Equivalent to calling coalesce() with its second parameter (shuffle) set to true |
| sample() | |
| union() | Merges two RDDs without removing duplicates |
| intersection() | Returns the intersection of two RDDs |
| subtract() | Similar to intersection(); returns the elements that appear in this RDD but not in the other RDD, without removing duplicates |
| mapPartitions() | Like map(), but applied one partition at a time |
| mapPartitionsWithIndex() | Like mapPartitions(), but the function also receives the partition index |
| zip() | Combines two RDDs into an RDD of key/value pairs; by default the two RDDs must have the same number of partitions and the same number of elements, otherwise an exception is thrown |
| zipPartitions() | Combines several RDDs into a new RDD partition by partition; the RDDs must have the same number of partitions, but there is no requirement on the number of elements per partition |
| partitionBy() | |
| mapValues() | |
| flatMapValues() | |
| combineByKey() | |
| foldByKey() | |
| groupByKey() | |
| reduceByKey() | |
| reduceByKeyLocally() | |
| randomSplit() | Splits an RDD into multiple RDDs according to the given weights |
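
A minimal sketch exercising a few of the transformations listed above (assuming the Spark shell's sc; the sample data is made up for illustration):

val words = sc.parallelize(Seq("spark", "scala", "spark", "rdd"))
val lines = sc.parallelize(Seq("hello spark", "hello scala"))
words.distinct().collect()                           // remove duplicates
lines.flatMap(line => line.split(" ")).collect()     // map, then flatten
words.map(w => (w, 1)).reduceByKey(_ + _).collect()  // word counts
words.union(lines).collect()                         // merge two RDDs, duplicates kept
val Array(part1, part2) = words.randomSplit(Array(0.5, 0.5))  // split by weights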

Some of the RDD actions:

countByKey
first
count
reduce
collect
take
top
takeOrdered
aggregate
fold
lookup
foreach
foreachPartition
sortBy
saveAsTextFile
saveAsSequenceFile
saveAsObjectFile
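
And a minimal sketch of some of the actions above, under the same assumptions (the save path is only a placeholder):

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))
nums.first()                            // 5
nums.count()                            // 5
nums.reduce(_ + _)                      // 15
nums.take(2)                            // Array(5, 1)
nums.top(2)                             // Array(5, 4)
nums.takeOrdered(2)                     // Array(1, 2)
nums.fold(0)(_ + _)                     // 15
nums.map(n => (n % 2, n)).countByKey()  // Map(0 -> 2, 1 -> 3)
// nums.saveAsTextFile("/tmp/nums-output")  // placeholder output path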
