The Programming Model in Spark


1. Basic Concepts in Spark

In Spark, the following basic concepts are used:
Application: a user program built on Spark, consisting of a driver program and multiple executors running on a cluster.
Driver program: runs the main() function of the application and creates the SparkContext. The SparkContext usually represents the driver program.
Executor: a process launched on a worker node for an application. It runs tasks and keeps data in memory or on disk. Each application has its own independent executors.
Cluster manager: an external service that acquires resources on the cluster (e.g. Spark Standalone, Mesos, YARN).
Worker node: any node in the cluster that can run application code.
Task: the unit of work sent to an executor.
Job: a parallel computation consisting of multiple tasks; typically one job is triggered by each Spark action.
Stage: each job is split into groups of tasks, each of which is called a stage (also called a TaskSet). This term often appears in the logs.
RDD: the basic computational unit in Spark. An RDD is created by parallelizing a Scala collection, reading a data set, or transforming another RDD with operators.


2. Spark Application Framework




A client Spark program (the driver program) operates the Spark cluster through a SparkContext object. The SparkContext is the single entry point for operations and scheduling; during its initialization it creates the DAGScheduler (job scheduling) and the TaskScheduler (task scheduling).

The DAGScheduler is the high-level, stage-based scheduling module (reference: Spark analysis of the DAGScheduler). DAG stands for directed acyclic graph: a graph of vertices and directed edges in which no path starting from a vertex ever leads back to that vertex. For each Spark job, the DAGScheduler computes multiple task stages with dependencies between them. Stages are usually divided at shuffle boundaries; shuffle-related transformations such as groupByKey and reduceByKey produce a new stage. Each stage is then divided into a concrete set of tasks, which is submitted as a TaskSet to the underlying task scheduling module for execution. The dependency between RDDs in different stages is a wide dependency. The TaskScheduler is responsible for actually launching tasks and for monitoring and reporting on their execution.
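As a hedged illustration (the input path below is hypothetical), the following word-count style job contains one shuffle transformation, reduceByKey, so the DAGScheduler divides it into two stages: one for the narrow transformations before the shuffle and one after it.

val lines  = sc.textFile("hdfs:///data/words.txt")   // narrow: one partition per block
val counts = lines.flatMap(_.split(" "))              // narrow dependency: same stage
                  .map(word => (word, 1))             // narrow dependency: same stage
                  .reduceByKey(_ + _)                 // shuffle (wide dependency): a new stage starts here
counts.collect()                                      // action: triggers one job, executed as two stages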

Creating a SparkContext generally takes the following steps:

a). Import the Spark classes and implicit conversions

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

b). Build the SparkConf object that holds the application information

val conf = new SparkConf().setAppName(appName).setMaster(masterUrl)

c). Use the SparkConf object to initialize the SparkContext

val sc = new SparkContext(conf)

d). Create RDDs, execute the corresponding transformations and actions, and obtain the final result.
e). Close the context.
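Putting steps a) through e) together, here is a minimal sketch of a self-contained Spark application (the application name, master URL, and input values are assumptions for illustration):

import org.apache.spark.{SparkContext, SparkConf}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // b) build the application information object
    //    local[2] is for local testing; on a cluster the master is usually passed via spark-submit
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[2]")
    // c) initialize the SparkContext
    val sc = new SparkContext(conf)
    // d) create an RDD, apply a transformation and an action
    val rdd = sc.parallelize(1 to 10)
    val sum = rdd.map(_ * 2).reduce(_ + _)
    println(s"sum = $sum")
    // e) close the context
    sc.stop()
  }
}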

After you have finished designing and writing your application, use spark-submit to submit the application's jar package. The spark-submit command-line template is as follows:

Submitting applications

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ...  # other options
  <application-jar> \
  [application-arguments]
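For example, a hypothetical submission of the SimpleApp sketch above to a standalone cluster might look like this (the class name, master host, and jar path are placeholders):

./bin/spark-submit \
  --class SimpleApp \
  --master spark://master-host:7077 \
  --deploy-mode client \
  /path/to/simple-app.jar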

The operating mode of Spark depends on the master URL passed to the SparkContext, which can take one of the following forms:

Master URL           Meaning
local                Run Spark locally with one worker thread (no parallelism at all).
local[*]             Run Spark locally with as many worker threads as there are logical cores on the machine.
local[K]             Run Spark locally with K worker threads (ideally, set K to the number of cores on the machine).
spark://HOST:PORT    Connect to the specified Spark standalone master. The default port is 7077.
yarn-client          Connect to a YARN cluster in client mode. The cluster location is found via the HADOOP_CONF_DIR environment variable.
yarn-cluster         Connect to a YARN cluster in cluster mode. The cluster location is found via the HADOOP_CONF_DIR environment variable.
mesos://HOST:PORT    Connect to the specified Mesos cluster. The default port is 5050.

The spark-shell automatically builds a SparkContext when it starts, named sc.


3. RDD Creation

All operations in Spark are carried out on resilient distributed datasets (RDDs): fault-tolerant collections of elements that can be operated on in parallel. RDDs are read-only, partitioned, fault-tolerant, efficient, not materialized until needed, cacheable, and record their dependencies on other RDDs.

There are currently two types of base RDDs:

Parallelized collections: take an existing Scala collection and run various parallel computations on it.

Hadoop datasets: run a function on each record of a file stored in HDFS or any other storage system supported by Hadoop.

Both types of RDD can be manipulated in the same way and can be transformed into further (child) RDDs, forming the lineage graph.


(1). Parallelized Collections
A parallelized collection is created by calling SparkContext's parallelize method on an existing Scala collection (a Seq object). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example:

val rdd = sc.parallelize(1 to 10)
splits the collection into slices based on the number of executors that can be started; each slice is processed by one task.

val rdd = sc.parallelize(1 to 10, 5)
explicitly specifies the number of partitions.
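A quick hedged check in the spark-shell (the comments show the values one would expect, not output captured from a real session):

val rdd = sc.parallelize(1 to 10, 5)
rdd.partitions.length   // 5: one task per partition when an action runs
rdd.reduce(_ + _)       // 55: the action triggers the actual computation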


(2). Hadoop Data Set

Spark can convert any storage resource supported by Hadoop into an RDD, such as local files (a shared network file system is required so that all nodes can access the path), HDFS, Cassandra, HBase, Amazon S3, and so on. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
a). Use the textFile() method to convert a local file or an HDFS file into an RDD.
Reading an entire directory is supported, and the files may be plain text or compressed (e.g. gzip; decompression and loading are performed automatically), for example textFile("file:///dfs/data").
Wildcard reads are also supported, for example:

val rdd1 = sc.textFile("file:///root/access_log/access_log*.filter")
val rdd2 = rdd1.map(_.split("\t")).filter(_.length == 6)
rdd2.count()
...
14/08/20 14:44:48 INFO HadoopRDD: Input split: file:/root/access_log/access_log.20080611.decode.filter:134217728+20705903
...

textFile() takes an optional second parameter, the number of slices. By default, Spark assigns one slice to each HDFS block. You can ask for more slices, but you cannot use fewer slices than the number of HDFS blocks.

b). Use wholeTextFiles() to read the small files in a directory; it returns (fileName, content) pairs.
c). Use the sequenceFile[K, V]() method to convert a SequenceFile into an RDD. SequenceFiles are flat files designed by Hadoop to store key-value pairs in binary form.
d). Use the SparkContext.hadoopRDD method to convert any other Hadoop input type into an RDD. In general, each HDFS block of the HadoopRDD becomes one RDD partition.
In addition, transformations can turn a HadoopRDD into, for example, a FilteredRDD (which depends on a single parent RDD) or a JoinedRDD (which depends on all of its parent RDDs).
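A short, hedged sketch of these creation methods in the spark-shell (all paths and the key/value types are assumptions for illustration):

val logs  = sc.textFile("hdfs:///logs/access_log", 8)                // a) text file, asking for at least 8 slices
val pages = sc.wholeTextFiles("hdfs:///logs/pages")                  // b) (fileName, content) pairs for each small file
val seq   = sc.sequenceFile[String, Int]("hdfs:///data/counts.seq")  // c) a SequenceFile read as (String, Int) pairs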


4. RDD Operations


RDDs support two types of operations:
Transformation: produces a new RDD from an existing one; transformations are evaluated lazily (deferred execution).
Action: computes a result from an RDD and returns it to the driver program (or writes it to storage); an action triggers the actual execution of a job.
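A minimal sketch of this lazy-evaluation behavior, assuming a spark-shell session with the SparkContext named sc (the values are illustrative):

val nums    = sc.parallelize(1 to 1000)
val doubled = nums.map(_ * 2)            // transformation: nothing is computed yet
val evens   = doubled.filter(_ % 4 == 0) // still lazy: only the lineage is recorded
evens.count()                            // action: a job is submitted and the tasks actually run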
