Spark Programming Guide website translation


Overview

A Spark application consists of a driver program, which runs the user's main function and performs various parallel operations on the cluster.

Main abstraction: the RDD

  Spark provides the RDD (resilient distributed dataset): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

RDDs can be created from three sources (a short sketch of each follows the list):

1. A file in HDFS or any other Hadoop-supported file system

2. An existing Scala collection in the driver program

3. A transformation of an existing RDD
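
A minimal sketch of the three creation paths, assuming an active SparkContext named sc ("data.txt" is a placeholder path):

        // Minimal sketch of the three ways to obtain an RDD.
        val fromFile       = sc.textFile("data.txt")              // 1. Hadoop-supported file system
        val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))   // 2. existing Scala collection
        val fromTransform  = fromCollection.map(_ * 2)            // 3. transformation of another RDD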

Main abstraction: shared variables

Shared variables can also be used in parallel operations.

By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of every variable used in the function to each task.

Sometimes, however, a variable needs to be shared across tasks, or between the tasks and the driver program.

  Spark supports two types of shared variables (a brief sketch of both follows):

     Broadcast variables: cache a value in memory on all nodes.

     Accumulators: variables that are only "added" to, such as counters and sums.
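
A minimal sketch of both kinds of shared variables, assuming an active SparkContext named sc (the variable names are placeholders):

        // Broadcast variable: a read-only value cached in memory on every node.
        val lookup = sc.broadcast(Array(1, 2, 3))
        println(lookup.value.mkString(","))

        // Accumulator: tasks may only add to it; the driver reads the result.
        val counter = sc.longAccumulator("counter")
        sc.parallelize(1 to 4).foreach(x => counter.add(x.toLong))
        println(counter.value)   // 10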

Initialize Spark

The first thing a Spark program does is create a SparkContext object, which tells Spark how to connect to the cluster.

To create a SparkContext, you first need to build a SparkConf object, which contains information about your application.

Note: Only one SparkContext may be active per JVM; to create a new one, you must first stop the old one.

   

       val conf = new SparkConf().setAppName(appName).setMaster(master)
       new SparkContext(conf)

appName specifies the name of your application, which is shown in the cluster UI.

master is a Spark, Mesos, or YARN cluster URL, or the special string "local" to run in local mode.


When actually running on a cluster, you do not need to hard-code these in the program; instead, launch the application with spark-submit and supply them there.

For local runs and unit tests, however, you can pass "local" in the program to run Spark in-process.
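
As a minimal sketch (the application name and thread count below are arbitrary placeholders), a local in-process context for unit testing might look like this:

        import org.apache.spark.{SparkConf, SparkContext}

        // "local[2]" runs Spark in-process with 2 worker threads, convenient for tests.
        val testConf = new SparkConf().setAppName("unit-test").setMaster("local[2]")
        val testSc = new SparkContext(testConf)
        // ... exercise small RDDs here ...
        testSc.stop()   // only one SparkContext may be active per JVM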


Parallelized Collections

Parallelized collections are created by calling the SparkContext's parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:

             val data = Array(1, 2, 3, 4, 5)
             val distData = sc.parallelize(data)

One important parameter for parallelized collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark sets the number of partitions automatically based on your cluster, but you can also set it manually by passing it as a second argument to parallelize, for example: sc.parallelize(data, 10)
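
A minimal sketch building on the distData example above (the partition count of 10 is just an illustration):

        // Ask for 10 partitions explicitly instead of letting Spark choose.
        val distData10 = sc.parallelize(data, 10)
        println(distData10.getNumPartitions)   // 10

        // The dataset can then be operated on in parallel, e.g. summing its elements.
        val sum = distData10.reduce((a, b) => a + b)
        println(sum)                            // 15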

External Datasets

Spark can create distributed datasets from text files, SequenceFiles, and any other Hadoop InputFormat.
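
As a minimal sketch, text files can be read with the SparkContext's textFile method ("data.txt" below is a placeholder path):

        // Each line of the file becomes one element of the resulting RDD.
        val distFile = sc.textFile("data.txt")

        // Example parallel operation: total length of all lines in the file.
        val totalLength = distFile.map(line => line.length).reduce((a, b) => a + b)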
