Overview
A Spark application consists of a driver program that runs the user's main function and executes various parallel operations on the cluster.
Main abstraction: the RDD
Spark provides the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
RDDs can be created from:
1. A file in the Hadoop file system (or any other Hadoop-supported file system)
2. An existing Scala collection in the driver program
3. Transforming an existing RDD
Main abstraction: shared variables
Shared variables can also be used in parallel operations.
By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to every task.
Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program.
Spark supports two types of shared variables:
Broadcast variables: used to cache a value in memory on all nodes
Accumulators: variables that are only "added" to, such as counters and sums (a brief sketch of both follows).
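As a minimal sketch, assuming a Spark 2.x SparkContext named sc (the lookup map and accumulator name are illustrative, not from the guide):

// Broadcast variable: a read-only value cached on every node.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
// Accumulator: tasks may only add to it; the driver reads the result.
val errorCount = sc.longAccumulator("errors")
sc.parallelize(Seq("a", "x", "b")).foreach { key =>
  if (!lookup.value.contains(key)) errorCount.add(1)
}
println(errorCount.value)  // prints 1 back in the driver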
Initialize Spark
The first thing a Spark program must do is create a SparkContext object, which tells Spark how to connect to the cluster.
To create a SparkContext, you first need to build a SparkConf object, which contains information about your application.
Note: Only one SparkContext can be active per JVM; you must stop() the active one before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
appName is the name of your application, shown in the cluster UI.
master is a Spark, Mesos, or YARN cluster URL, or the special string "local" to run in local mode.
In practice, when running on a cluster you do not hard-code master in the program; instead you launch the application with spark-submit and pass it there.
For local testing and unit tests, however, you can pass "local" to run Spark in-process.
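For example, a local-mode setup for a unit test might look like the following sketch (the app name and thread count are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// "local[2]" runs Spark in-process with 2 worker threads.
val conf = new SparkConf().setAppName("MyAppTest").setMaster("local[2]")
val sc = new SparkContext(conf)
// ... run the code under test against sc ...
sc.stop()  // stop the context so another one can be created later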
Parallelized Collections
Parallelized collections are created by calling the parallelize method of SparkContext (sc) on an existing collection in the driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
One important parameter for parallelized collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally Spark sets the number of partitions automatically based on your cluster, but you can also set it manually by passing it as a second argument to parallelize, e.g. sc.parallelize(data, 10).
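As an illustrative sketch, the distributed dataset above can then be operated on in parallel, for example to add up its elements:

// reduce is an action: it aggregates the elements in parallel
// and returns the result to the driver program.
val sum = distData.reduce((a, b) => a + b)  // 15 for the array above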
External data sets
Spark can create distributed datasets from text files, SequenceFiles, and any other Hadoop InputFormat.
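For example, text file RDDs are created with SparkContext's textFile method; the file name "data.txt" below is just a placeholder:

// Each element of the resulting RDD is one line of the file.
val lines = sc.textFile("data.txt")
// Illustrative follow-up: total length of all lines in the file.
val totalLength = lines.map(_.length).reduce(_ + _)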
Translated from the official Spark Programming Guide.