Spark RDD Programming in Scala
An RDD (Resilient Distributed Dataset) is an immutable collection of objects that is split into partitions running on different nodes of the cluster. RDDs support two types of operations: transformations and actions. Spark evaluates RDDs lazily: an RDD produced by a transformation is not computed immediately, but only the first time an action is invoked on it. If you want to reuse an RDD across multiple actions, call rdd.persist() to have Spark cache it.

0. Initialize SparkContext
```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("spark-rdd-demo"))
```
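The lazy-evaluation and persist() behaviour described above can be sketched as follows (a minimal illustration, assuming the SparkContext `sc` created here; variable names are illustrative):

```scala
// Nothing is computed when a transformation is declared.
val nums = sc.parallelize(List(1, 2, 3, 3))
val squares = nums.map(x => x * x) // transformation: not computed yet
squares.persist()                  // mark for caching on first computation

// The first action triggers the computation; the second reuses the cache.
val total = squares.reduce(_ + _)  // 1 + 4 + 9 + 9 = 23
val howMany = squares.count()      // 4
```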
1. Create an RDD
Spark provides two ways to create an RDD:

1.1 Parallelize a collection in the driver program

```scala
val lines = sc.parallelize(List("Java", "Scala", "C++"))
```

1.2 Read an external dataset

```scala
val lines = sc.textFile("hdfs://dash-dev:9000/input/test.txt")
```
2. Rdd Operation
2.1 Transformations

A transformation is an operation that returns a new RDD. The common transformations are summarized below:
Table 1: Basic transformations on an RDD containing {1, 2, 3, 3}

| Function | Purpose | Example | Result |
| --- | --- | --- | --- |
| map() | Apply a function to each element of the RDD; the return values form the new RDD | rdd.map(x => x + 1) | {2, 3, 4, 4} |
| flatMap() | Apply a function to each element of the RDD and flatten the returned iterators into a new RDD; commonly used to split lines into words | rdd.flatMap(x => x.to(2)) | {1, 2, 2} |
| filter() | Return an RDD of only the elements for which the predicate passed to filter() is true | rdd.filter(x => x > 2) | {3, 3} |
| distinct() | Remove duplicates | rdd.distinct() | {1, 2, 3} |
| sample(withReplacement, fraction, [seed]) | Sample the RDD, with or without replacement | rdd.sample(false, 0.5) | Nondeterministic |
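The Table 1 transformations can be tried directly in the shell; a short sketch, assuming the SparkContext `sc` from section 0:

```scala
val rdd = sc.parallelize(List(1, 2, 3, 3))

rdd.map(x => x + 1).collect()       // Array(2, 3, 4, 4)
rdd.flatMap(x => x.to(2)).collect() // Array(1, 2, 2): 1.to(2) = {1,2}, 2.to(2) = {2}, 3.to(2) is empty
rdd.filter(x => x > 2).collect()    // Array(3, 3)
rdd.distinct().collect()            // the elements 1, 2, 3 (order not guaranteed)
```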
Table 2: Transformations on two RDDs, rdd containing {1, 2, 3} and other containing {2, 3, 4}

| Function | Purpose | Example | Result |
| --- | --- | --- | --- |
| union() | Union of the two RDDs (duplicates are kept) | rdd.union(other) | {1, 2, 3, 2, 3, 4} |
| intersection() | Intersection of the two RDDs | rdd.intersection(other) | {2, 3} |
| subtract() | Elements of rdd that are not in other | rdd.subtract(other) | {1} |
| cartesian() | Cartesian product of the two RDDs | rdd.cartesian(other) | {(1, 2), (1, 3), (1, 4) ... (3, 4)} |
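The two-RDD transformations in Table 2 can be sketched like this (assuming the SparkContext `sc` from section 0):

```scala
val rdd = sc.parallelize(List(1, 2, 3))
val other = sc.parallelize(List(2, 3, 4))

rdd.union(other).collect()        // Array(1, 2, 3, 2, 3, 4): duplicates are kept
rdd.intersection(other).collect() // the elements 2 and 3 (order not guaranteed)
rdd.subtract(other).collect()     // the element 1
rdd.cartesian(other).collect()    // all 9 pairs: (1,2), (1,3), (1,4), ..., (3,4)
```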
2.2 Actions

An action computes a result from an RDD and either returns it to the driver program or writes it to an external storage system.
Table 3: Basic actions on an RDD containing {1, 2, 3, 3}

| Function | Purpose | Example | Result |
| --- | --- | --- | --- |
| reduce() | Combine all elements of the RDD in parallel | rdd.reduce((x, y) => x + y) | 9 |
| collect() | Return all elements of the RDD | rdd.collect() | {1, 2, 3, 3} |
| count() | Count the elements of the RDD | rdd.count() | 4 |
| countByValue() | Count how many times each element occurs in the RDD | rdd.countByValue() | {(1, 1), (2, 1), (3, 2)} |
| take(n) | Return n elements from the RDD | rdd.take(2) | {1, 2} |
| top(n) | Return the largest n elements of the RDD | rdd.top(3) | {3, 3, 2} |
| foreach(func) | Apply the given function to each element of the RDD | rdd.foreach(print) | 1, 2, 3, 3 |
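The Table 3 actions, sketched on the same data (assuming the SparkContext `sc` from section 0):

```scala
val rdd = sc.parallelize(List(1, 2, 3, 3))

rdd.reduce((x, y) => x + y) // 9
rdd.collect()               // Array(1, 2, 3, 3)
rdd.count()                 // 4
rdd.countByValue()          // Map(1 -> 1, 2 -> 1, 3 -> 2)
rdd.take(2)                 // Array(1, 2)
rdd.top(3)                  // Array(3, 3, 2): largest elements, descending
```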
2.3 Passing functions to Spark

Most of Spark's transformations and actions depend on functions passed in by the user. Be careful: when the function you pass is a method of an object, or references a field of an object (such as this.field), Spark ships the entire object to the worker nodes, which can be far more than you intended to send. The alternative is to copy the field you need into a local variable and pass the local variable instead:
```scala
import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }

  def getMatchesFunctionReference(rdd: RDD[String]): RDD[Boolean] = {
    // Problem: "isMatch" means "this.isMatch", so the entire "this" is shipped
    rdd.map(isMatch)
  }

  def getMatchesFieldReference(rdd: RDD[String]): RDD[Array[String]] = {
    // Problem: "query" means "this.query", so the entire "this" is shipped
    rdd.map(x => x.split(query))
  }

  def getMatchesNoReference(rdd: RDD[String]): RDD[Array[String]] = {
    // Safe: copy just the field we need into a local variable
    val localQuery = this.query
    rdd.map(x => x.split(localQuery))
  }
}
```
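A short usage sketch of the safe variant (the sample data and query string are illustrative; assumes the SparkContext `sc` from section 0):

```scala
val lines = sc.parallelize(List("spark is fast", "hello world", "spark streaming"))
val search = new SearchFunctions("spark")

// Only the local copy of `query` is captured by the closure,
// so the SearchFunctions instance itself is not shipped to the workers.
val pieces = search.getMatchesNoReference(lines)
pieces.map(_.length).collect() // number of split pieces per line
```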
Also be aware that Spark requires the functions you pass in, and the data they reference, to be serializable (i.e. to implement Java's Serializable interface); otherwise a NotSerializableException is thrown.
Author @wusuopubupt
November 11, 2016