Official Spark Documentation: Programming Guide


This article is adapted from the official programming guide, with minor additions: https://github.com/mesos/spark/wiki/Spark-Programming-Guide

Spark Programming Guide

 

At a high level, every Spark application consists of a driver program that runs the user's main function and performs various parallel operations and computations on a cluster.

The most important abstraction Spark provides is the resilient distributed dataset (RDD), a special collection that is partitioned across the nodes of the cluster and can be operated on in parallel with a set of functions. RDDs can be created from a file in HDFS or from an existing collection in the driver program. Users can also ask Spark to cache a dataset in memory so that it can be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures and are recomputed as needed.

The second abstraction in Spark is shared variables for use in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used by the function to every task, so no variables are shared. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables:

Broadcast variables: cached in memory on every node and used to distribute read-only values

Accumulators: variables that tasks can only add to, such as counters and sums

This guide shows these features through a series of examples. It helps to be familiar with Scala, especially its closure syntax. Note that Spark can be run interactively through the spark-shell interpreter, which is a convenient way to try the examples below.

Linking with Spark

To use Spark, you need to add Spark and its dependencies to your CLASSPATH. The easiest way is to run sbt/sbt assembly to package Spark and its dependencies into a single JAR at core/target/scala_2.9.1/spark-core-assembly-0.0.0.jar, and then add it to your CLASSPATH. Alternatively, you can publish Spark to your local Maven repository with sbt/sbt publish; it appears there as spark-core under the organization org.spark-project.

In addition, you will need to import some Spark classes and implicit conversions. Add the following lines to your program:

import spark.SparkContext
import SparkContext._

Initialize Spark

The first thing a Spark program must do is create a SparkContext object, which tells Spark how to access a cluster. This is usually done with the following constructor:

new SparkContext(master, jobName, [sparkHome], [jars])

The master parameter is a string specifying the Mesos cluster to connect to, or the special string "local" to run in local mode. jobName is a name for your job; when running on a cluster it is shown in the Mesos web UI. The last two parameters are used when deploying your code to a Mesos cluster and are described later.
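For example, a driver running in local mode might create its context as follows. This is a minimal sketch; the job name is an illustrative placeholder, not taken from the guide.

import spark.SparkContext
import SparkContext._

// Local mode needs no sparkHome or jars; "My Local Job" is a placeholder name.
val sc = new SparkContext("local", "My Local Job")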

In the Spark interpreter, a special SparkContext variable named sc has already been created for you, and building your own SparkContext will not take effect. Instead, set the MASTER environment variable before starting the shell to choose which master sc connects to:

MASTER=local ./spark-shell

Master Names

The master name can take one of three forms:

  • local: Run Spark locally with one worker (no parallelism).
  • local[K]: Run Spark locally with K worker processes (ideally set K to the number of CPU cores on the machine).
  • HOST:PORT: Connect Spark to the given Mesos master and run on the cluster. HOST is the hostname of the Mesos master and PORT is the port the master is configured to use.

Note: in earlier versions of Mesos (the old-mesos branch of Spark), you must instead use master@HOST:PORT.

Cluster Deployment

If you want your job to run on a cluster, you need to specify two additional parameters, as illustrated in the sketch after this list:
  • sparkHome: the Spark installation path on the cluster machines (it must be the same on all of them).
  • jars: a list of JAR files on the local machine containing your job's code and its dependencies, which Spark will deploy to all cluster nodes. You need to package your job into a set of JARs using your build system; for example, if you use sbt, the sbt-assembly plugin is a convenient way to turn your code and its dependencies into a single JAR file.
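A minimal sketch of a cluster deployment; the master address, installation path, and JAR name below are hypothetical placeholders to be replaced with your own values:

import spark.SparkContext
import SparkContext._

// "HOST:PORT", "/usr/local/spark", and the JAR path are placeholders.
val sc = new SparkContext(
  "HOST:PORT",                        // Mesos master to connect to
  "My Cluster Job",                   // job name shown in the Mesos web UI
  "/usr/local/spark",                 // sparkHome: Spark install path on every cluster machine
  Seq("target/my-job-assembly.jar"))  // jars: your job code and dependencies to ship to the nodes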

If some classes are shared across different jobs, you may need to copy them manually to the Mesos nodes and set the SPARK_CLASSPATH environment variable in conf/spark-env to point to them. For more information, see the Configuration guide.

Resilient Distributed Datasets (RDDs)

The core concept in Spark is the resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel. There are currently two types of RDDs: parallelized collections, which take an existing Scala collection and run parallel computations on it, and Hadoop datasets, which run functions on each record of a file in HDFS or any other storage system supported by Hadoop. Both types of RDDs are operated on in the same way.

Parallelized Collections

Parallelized collections are created by calling SparkContext's parallelize method on an existing Scala collection (any Seq object). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. The following example shows how to create a parallelized collection from the Spark interpreter:

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e

Once created, the distributed dataset (distData) can be operated on in parallel. For example, we can call distData.reduce(_ + _) to add up the elements of the array. Operations on distributed datasets are described later.

One important parameter when creating a parallelized collection is the number of slices to split the dataset into. In cluster mode, Spark runs one task per slice. Typically you want 2-4 slices for each CPU in your cluster (that is, 2-4 tasks per CPU). Normally Spark sets the number of slices automatically based on the cluster, but you can also set it manually as the second argument to parallelize (for example, sc.parallelize(data, 10)).

Hadoop Datasets

Spark can create distributed datasets from any file stored in HDFS or any other filesystem supported by Hadoop (including the local filesystem, Amazon S3, Hypertable, HBase, and so on). Spark supports text files, SequenceFiles, and any other Hadoop input format.

RDDs of text files can be created with SparkContext's textFile method. The method takes the URI of the file (either a local path on the machine, or an hdfs://, s3n://, kfs://, or other URI). Here is an example:

scala> val distFile = sc.textFile("data.txt")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08

Once created, distFile supports dataset operations. For example, we can add up the lengths of all lines with the following map and reduce operations:

distFile.map(_.size).reduce(_ + _)

The textFile method also takes an optional second argument controlling the number of splits for the file. By default Spark creates one split for each 64 MB block of the file, but you can request more splits by passing a larger value. Note that you cannot request fewer splits than there are blocks (just as a map task in Hadoop cannot process less than one block).
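For instance, to request more splits than the default, pass the desired count as the second argument (128 below is just an illustrative value, and it only takes effect if it is at least the number of blocks in the file):

val distFile = sc.textFile("data.txt", 128) // ask Spark for at least 128 splits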

For SequenceFiles, use SparkContext's sequenceFile[K, V] method, where K and V are the key and value types in the file. These must be subclasses of Hadoop's Writable interface, such as IntWritable and Text. In addition, Spark allows you to specify a few native types for common Writables; for example, sequenceFile[Int, String] automatically reads IntWritables and Texts.
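A brief sketch of reading a SequenceFile of IntWritable/Text records using that shorthand; the path is a hypothetical placeholder:

// "hdfs://namenode:9000/data/pairs" is a placeholder path.
val pairs = sc.sequenceFile[Int, String]("hdfs://namenode:9000/data/pairs")
pairs.take(5) // first few (Int, String) records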

Finally, for other Hadoop input formats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf along with the input format class, key class, and value class. Set these the same way you would for a Hadoop job with your input source.
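A rough sketch, assuming the input can be read with the classic org.apache.hadoop.mapred TextInputFormat; the path is a placeholder, and you would substitute the input format, key class, and value class that match your own source:

import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.hadoop.io.{LongWritable, Text}

// Configure the input source on the JobConf exactly as you would for a Hadoop job.
val conf = new JobConf()
FileInputFormat.setInputPaths(conf, "hdfs://namenode:9000/data/logs") // placeholder path

// The key and value classes must match the chosen input format.
val records = sc.hadoopRDD(conf, classOf[TextInputFormat],
                           classOf[LongWritable], classOf[Text])
records.map(_._2.toString).take(5) // peek at a few lines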

Distributed Data Set Operations

Distributed Data Sets support two types of operations:

Transformations, which create a new dataset from an existing dataset.

Actions, which return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each element of a dataset through a function and returns a new distributed dataset of the results. Reduce, on the other hand, is an action: it aggregates all the elements of the dataset with a function and returns the final result to the driver program (while the parallel reduceByKey returns a distributed dataset instead).

All transformations in Spark are lazy: they do not compute their results right away. Instead, Spark only remembers the transformations applied to the base dataset, and they are computed when an action requires a result to be returned to the driver program. This design lets Spark run more efficiently; for example, a dataset created through map can be consumed directly by reduce, so only the result of the reduce is returned to the driver rather than the whole mapped dataset.
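A small sketch of this behavior, reusing the data.txt example from above:

// Declaring transformations does no work yet.
val lines   = sc.textFile("data.txt")
val lengths = lines.map(_.size)

// The file is read and the map is applied only when an action runs.
val total = lengths.reduce(_ + _)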

One important operation Spark provides is caching. When you cache a distributed dataset, each node stores the slices of the dataset that it computes in memory and reuses them in other operations on that dataset. This makes subsequent computations much faster (often more than 10x). Caching is a key tool for building iterative algorithms with Spark and can also be used interactively from the interpreter.
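For example, to reuse the line-length dataset across several actions (a minimal sketch):

// cache() marks the dataset to be kept in memory once it has been computed.
val lengths = sc.textFile("data.txt").map(_.size).cache()

lengths.reduce(_ + _) // first action computes the dataset and caches it
lengths.count()       // later actions reuse the in-memory copy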

The following lists describe the transformations and actions currently supported:

Transformations

  • map(func): Returns a new distributed dataset formed by passing each element of the source through the function func.
  • filter(func): Returns a new dataset formed by selecting those elements of the source on which func returns true.
  • flatMap(func): Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a Seq rather than a single element).
  • sample(withReplacement, frac, seed): Randomly samples a fraction frac of the data, with or without replacement, using the given random seed.
  • union(otherDataset): Returns a new dataset containing the union of the elements of the source dataset and the argument.
  • groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default 8 parallel tasks are used for the grouping; you can pass an optional numTasks argument to set a different number of tasks depending on your data volume. (groupByKey combined with filter can play a role similar to Hadoop's Reduce function.)
  • reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function. As with groupByKey, the number of tasks is configurable through an optional second argument.
  • join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
  • groupWith(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. This operation is known as CoGroup in other frameworks.
  • cartesian(otherDataset): When called on datasets of type T and U, returns a dataset of (T, U) pairs: the Cartesian product of all pairs of elements.
  • sortByKey([ascendingOrder]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs sorted by key K. Ascending or descending order is determined by the boolean ascendingOrder argument. (Similar to the sort between the Map and Reduce phases of Hadoop, which sorts by key.)
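To illustrate how these transformations chain together, here is a minimal word-count sketch (data.txt is a placeholder input file):

val words  = sc.textFile("data.txt").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _) // (word, count) pairs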

Actions

  • reduce(func): Aggregates all elements of the dataset using the function func, which takes two arguments and returns one value. The function must be associative so that it can be computed correctly in parallel.
  • collect(): Returns all elements of the dataset as an array to the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of the data; calling collect on an entire large RDD can easily cause the driver program to run out of memory (OOM).
  • count(): Returns the number of elements in the dataset.
  • take(n): Returns an array with the first n elements of the dataset. Note that this is currently not executed in parallel on multiple nodes; instead, the machine running the driver program computes all the elements. (This can consume a lot of memory on the gateway machine, so use it with care.)
  • first(): Returns the first element of the dataset (similar to take(1)).
  • saveAsTextFile(path): Saves the elements of the dataset as text files in the given directory in the local filesystem, HDFS, or any other filesystem supported by Hadoop. Spark calls toString on each element to turn it into a line of text in the file.
  • saveAsSequenceFile(path): Saves the elements of the dataset in SequenceFile format to the given path in the local filesystem, HDFS, or any other filesystem supported by Hadoop. The elements of the RDD must be key-value pairs that implement Hadoop's Writable interface, or be implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double, String, etc.).
  • foreach(func): Runs the function func on each element of the dataset. This is usually done to update an accumulator variable or to interact with an external storage system.
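A short sketch exercising a few of these actions on a parallelized collection (the output directory name is hypothetical):

val nums = sc.parallelize(1 to 100)

nums.count()       // 100
nums.reduce(_ + _) // 5050
nums.take(3)       // Array(1, 2, 3)

// Writes one text file per slice under the given directory.
nums.saveAsTextFile("nums-output")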


Shared Variables

Broadcast Variables

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-a864-4c7d-b9bf-d87e1a4e787c)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
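Once created, the broadcast variable should be used in place of the original value inside closures, so the array is shipped to each node only once. A brief sketch with illustrative variable names:

val lookup  = sc.broadcast(Array(10, 20, 30))
val indices = sc.parallelize(Array(0, 1, 2))

// Tasks read the cached copy via .value rather than capturing the array itself.
indices.map(i => lookup.value(i)).collect() // Array(10, 20, 30)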

Accumulators

scala> val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Int = 10

More material is available in the official programming guide linked at the top of this article.
