Some RDD operations and transformations in Spark
```scala
// create an RDD from a text file
val textFile = sc.textFile("README.md")

// get the first element of the textFile RDD
textFile.first()
res3: String = # Apache Spark

// filter out the lines that contain the keyword "Spark", then count them
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()
res10: Long = 19
```
```scala
// find the line with the most words in the textFile RDD
textFile.map(line => line.split(" ").size).reduce((a, b) => math.max(a, b))
res12: Int = 14   // the longest line contains 14 words

// Java methods can also be used in the Scala shell:
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
```
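Building on the same idea, a minimal word-count sketch over the same textFile RDD; this is not in the original notes, but it combines the flatMap and reduceByKey transformations described in the table below:

```scala
// split every line into words, pair each word with a count of 1,
// then sum the counts per word
val wordCounts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

wordCounts.take(5)   // a few (word, count) pairs
```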
```scala
// cache the RDD linesWithSpark, then count again
linesWithSpark.cache()
res13: linesWithSpark.type = MapPartitionsRDD[8] at filter at <console>:23

linesWithSpark.count()
res15: Long = 19
```
RDD creation:

makeRDD and parallelize do the same thing, but makeRDD appears to exist only in the Scala API, while parallelize is also available in Python and R.
```scala
// create an RDD, thingsRDD, from a list of words
val thingsRDD = sc.parallelize(List("spoon", "fork", "plate", "cup", "bottle"))

// count the number of elements in thingsRDD
thingsRDD.count()
res16: Long = 5
```
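For comparison, a minimal sketch of the same RDD built with makeRDD (Scala API only, as noted above):

```scala
// makeRDD behaves like parallelize in the Scala API
val thingsRDD2 = sc.makeRDD(List("spoon", "fork", "plate", "cup", "bottle"))
thingsRDD2.count()   // 5
```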
The groupByKey() transformation:

```scala
pairRDD.groupByKey()
```

yields:

```
Banana [Yellow]
Apple [Red, Green]
Kiwi [Green]
Figs [Black]
```
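pairRDD is never defined in these notes; a minimal sketch of a (fruit, color) pair RDD that would produce the output above:

```scala
// hypothetical (fruit, color) pair RDD behind the groupByKey example
val pairRDD = sc.parallelize(Seq(
  ("Banana", "Yellow"),
  ("Apple", "Red"), ("Apple", "Green"),
  ("Kiwi", "Green"),
  ("Figs", "Black")
))

// groupByKey gathers all values sharing a key into one iterable
pairRDD.groupByKey().collect().foreach { case (fruit, colors) =>
  println(s"$fruit ${colors.mkString("[", ", ", "]")}")
}
```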
Collect (materialize) the data in the linesWithSpark RDD. The collect() method returns the computed values to the driver:

```scala
linesWithSpark.collect()
```
Cache the RDD linesWithSpark:

```scala
linesWithSpark.cache()
```

Remove linesWithSpark from memory:

```scala
linesWithSpark.unpersist()
```
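cache() is shorthand for persist() with the default MEMORY_ONLY storage level; a minimal sketch (my addition, not from the original notes) of choosing a different level:

```scala
import org.apache.spark.storage.StorageLevel

// keep the RDD in memory, spilling to disk if it does not fit
linesWithSpark.persist(StorageLevel.MEMORY_AND_DISK)
linesWithSpark.count()     // the first action computes and caches the RDD
linesWithSpark.unpersist()
```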
Some of the RDD transformation operations:

| Transformation | Function |
| --- | --- |
| filter() | Keeps only the elements that satisfy the given predicate. |
| map() | Maps each element of the RDD to a new element through a function and returns the resulting RDD. |
| flatMap() | Maps each element first, then flattens all of the outputs into a single collection. |
| distinct() | Removes duplicate elements from the RDD. |
| coalesce() | Repartitions the RDD; when shuffling is enabled it uses the HashPartitioner. |
| repartition() | Equivalent to coalesce() with the second parameter (shuffle) set to true. |
| sample() | Returns a random sample of the RDD. |
| union() | Combines two RDDs, without removing duplicates. |
| intersection() | Returns the intersection of two RDDs, with duplicates removed. |
| subtract() | Similar to intersection(); returns the elements that appear in this RDD but not in the other RDD, without removing duplicates. |
| mapPartitions() | Similar to map(), but the function is applied once per partition. |
| mapPartitionsWithIndex() | Like mapPartitions(), but the function also receives the partition index as a parameter. |
| zip() | Combines two RDDs into an RDD of key/value pairs; the two RDDs must have the same number of partitions and elements, otherwise an exception is thrown. |
| zipPartitions() | Combines multiple RDDs into a new RDD partition by partition; the RDDs must have the same number of partitions, but there is no requirement on the number of elements per partition. |
| partitionBy() | Repartitions a key/value RDD according to the given partitioner. |
| mapValues() | Applies a function to the value of each key/value pair, leaving the keys unchanged. |
| flatMapValues() | Like mapValues(), but flattens the output values. |
| combineByKey() | Combines the values for each key using custom create and merge functions. |
| foldByKey() | Merges the values for each key using a function and a zero value. |
| groupByKey() | Groups all of the values that share a key into a single iterable. |
| reduceByKey() | Merges the values for each key using a reduce function. |
| reduceByKeyLocally() | Like reduceByKey(), but immediately returns the result to the driver as a Map. |
| randomSplit() | Splits one RDD into multiple RDDs according to the given weights. |
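A minimal sketch exercising a few of the transformations above (assuming a spark-shell session with sc available):

```scala
val nums = sc.parallelize(Seq(1, 2, 2, 3, 4))
val more = sc.parallelize(Seq(3, 4, 5))

nums.map(_ * 10).collect()           // Array(10, 20, 20, 30, 40)
nums.filter(_ % 2 == 0).collect()    // Array(2, 2, 4)
nums.distinct().collect()            // Array(1, 2, 3, 4) in some order
nums.union(more).collect()           // duplicates kept: 8 elements
nums.intersection(more).collect()    // Array(3, 4) in some order

// pair-RDD transformations
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2)) in some order
pairs.mapValues(_ * 2).collect()     // Array((a,2), (b,4), (a,6))
```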
Some of the RDD action operations:

| Action | Description |
| --- | --- |
| first() | Returns the first element of the RDD. |
| count() | Returns the number of elements in the RDD. |
| reduce() | Aggregates the elements of the RDD with the given function. |
| collect() | Returns all of the elements of the RDD to the driver. |
| take(n) | Returns the first n elements of the RDD. |
| top(n) | Returns the largest n elements, by the implicit ordering. |
| takeOrdered(n) | Returns the smallest n elements, by the implicit ordering. |
| aggregate() | Aggregates the elements of each partition, then the partition results, using a zero value and two separate functions. |
| fold() | Like reduce(), but with a zero value. |
| lookup(key) | Returns all of the values for the given key in a key/value RDD. |
| countByKey() | Counts the number of elements for each key. |
| foreach() | Applies a function to each element of the RDD. |
| foreachPartition() | Applies a function to each partition of the RDD. |
| sortBy() | Returns a new RDD sorted by the given key function (strictly a transformation, though often listed with the actions). |
| saveAsTextFile() | Writes the elements of the RDD as a text file. |
| saveAsSequenceFile() | Writes a key/value RDD as a Hadoop SequenceFile. |
| saveAsObjectFile() | Writes the elements of the RDD as serialized Java objects. |
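A minimal sketch of a few of these actions (again assuming spark-shell; the output directory is hypothetical):

```scala
val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

nums.first()           // 5
nums.count()           // 5
nums.reduce(_ + _)     // 15
nums.take(2)           // Array(5, 1)
nums.top(2)            // Array(5, 4)
nums.takeOrdered(2)    // Array(1, 2)
nums.fold(0)(_ + _)    // 15

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.countByKey()     // Map(a -> 2, b -> 1)
pairs.lookup("a")      // values for key "a": Seq(1, 3)

// hypothetical output directory
nums.saveAsTextFile("/tmp/nums-output")
```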