Basic instructions for Spark


1. About the Application

An application is the user program: it consists of the driver code running in the driver process and several executors running on different nodes.

An application is divided into multiple jobs. Each job consists of multiple RDDs and the actions applied to them, and is split into groups of tasks; each task group is called a stage.

The tasks of each stage are then distributed to multiple nodes and executed by the executors.

In the program, RDD transformations do not actually run when they are declared; the real work happens only when an action is invoked.
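
As a quick illustration of this laziness, here is a minimal sketch (the local master, file path, and names are illustrative assumptions, not from the original article):

import org.apache.spark.{SparkConf, SparkContext}

object LazinessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazinessSketch").setMaster("local[2]"))
    val lines  = sc.textFile("hdfs://names")                // transformation: nothing runs yet
    val pairs  = lines.map(name => (name.charAt(0), name))  // still only builds the lineage
    val counts = pairs.groupByKey().mapValues(_.size)       // still lazy
    counts.collect().foreach(println)                       // the action triggers the actual job
    sc.stop()
  }
}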

2. Program Execution Process

1) Build the Spark application's runtime environment, i.e. start the SparkContext. After startup it registers with the resource manager (standalone, where Spark's own master manages resources; Mesos; or YARN) and applies for executor resources (a configuration sketch follows this list).

2) The resource manager allocates executor resources and starts StandaloneExecutorBackend on each node (in standalone mode). The executors report their runtime status to the resource manager via heartbeats.

3) SparkContext builds the DAG from the user program, decomposes the DAG into stages (the division rule is to cut at wide dependencies), and sends each stage (a TaskSet) to the TaskScheduler. The number of tasks in a stage is determined by the number of partitions of the RDD. Executors request tasks from SparkContext; the TaskScheduler sends the tasks to the executors and also ships the application code to them (the driver exposes an HTTP service from which the executors fetch the code).

4) Tasks run on executors that are exclusive to this application, using multiple threads; the number of threads depends on the number of cores available to the executor.

5) The SparkContext should not run too far from the workers, because data is exchanged between them during execution.
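
The registration and resource request in step 1 roughly corresponds to configuration like the following sketch (the master URL, memory, and core values are illustrative assumptions, mirroring the spark-submit flags used later in this article):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("NameCountCh")              // application name shown by the master's web UI
  .setMaster("spark://xxxx:7077")         // standalone master; a YARN or Mesos URL would also work
  .set("spark.executor.memory", "512m")   // memory requested per executor
  .set("spark.cores.max", "2")            // total executor cores requested for this application
val sc = new SparkContext(conf)           // creating the context registers the application and requests executors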

3. DAG Scheduler

1) Stages are divided according to RDD dependencies. In a nutshell, if each partition of a child RDD depends on only one partition of its parent RDD (a narrow dependency), the RDDs stay in the same stage; if a child partition depends on multiple parent partitions (a wide dependency), a new stage begins. A wide dependency is where a shuffle occurs (see the sketch after this list).

2) When processing shuffle data fails, the DAG scheduler recomputes the earlier stages that produced it.

3) It constructs a DAG (directed acyclic graph) from the RDD lineage, then looks for the least expensive schedule and sends the stages to the TaskScheduler.
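
A wide dependency (and hence a stage boundary) can be seen directly in an RDD's lineage. A minimal sketch, assuming a SparkContext named sc already exists (the sample data is illustrative):

val words  = sc.parallelize(Seq("ah", "ppt", "anlly"))
val pairs  = words.map(name => (name.charAt(0), name))   // narrow dependency: stays in the same stage
val groups = pairs.groupByKey()                          // wide dependency: shuffle, a new stage begins
println(groups.toDebugString)                            // the printed lineage shows the ShuffledRDD boundary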

4. Task Scheduler

1) Saves and maintains all TaskSets.

2) When an executor sends a heartbeat to the driver, the TaskScheduler assigns tasks according to the executor's available resources, and retries failed tasks if retries are still allowed.
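
The retry behaviour mentioned in point 2 is governed by the standard configuration property spark.task.maxFailures. A hedged example of setting it (4 is Spark's usual default, chosen here only for illustration):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RetrySketch")
  .set("spark.task.maxFailures", "4")   // number of attempts allowed per task before the job is aborted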

5. How RDDs Run

1) Create an RDD from Spark's internal objects or from external sources such as Hadoop (HDFS).

2) Build the DAG.

3) Divide the job into tasks, execute them on multiple nodes, and collect the results.

Example: count the distinct names under each first letter:

sc.textFile("hdfs://names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()

Assume the file contains one name per line:

ah     --map-->  (a, ah)
anlly  --map-->  (a, anlly)   --groupByKey-->  (a, (ah, anlly))   --mapValues-->  (a, 2)
ppt    --map-->  (p, ppt)     --groupByKey-->  (p, (ppt))         --mapValues-->  (p, 1)

collect --> [(a, 2), (p, 1)]

1) Create the RDDs: the final collect() is an action and does not create an RDD; every other operation creates a new RDD.

2) Build the DAG: groupByKey() needs data from multiple partitions of the previous RDD (a wide dependency), so the job is divided into a new stage at that point.

3) Execute the tasks: each stage must wait for the previous stage to complete. Each stage is divided into separate tasks for execution, and each task consists of code plus data.

Assuming the names file in the example consists of four file blocks, the partitions of the HadoopRDD are automatically set to four, corresponding to the four blocks of data, and four tasks are created to process them. Each task then operates on its own piece of data. A simple simulation of the above example:

import org.apache.spark.{SparkConf, SparkContext}

object NameCountCh {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("NameCountCh")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))
      .map(name => (name.charAt(0), name))
      .groupByKey()
      .mapValues(names => names.toSet.size)
      .collect()
      .foreach(println)
  }
}
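
To verify how many partitions (and therefore tasks per stage) the input really produces, a couple of lines like the following could be added inside main. This is a sketch: getNumPartitions is a standard RDD method in newer Spark versions (rdd.partitions.length is the older equivalent), and the variable name is illustrative.

val names = sc.textFile(args(0))
println(s"input partitions: ${names.getNumPartitions}")   // equals the number of tasks in the first stage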

Actual execution process:

Execute the command: ./spark-submit --master spark://xxxx:7077 --class NameCountCh --executor-memory 512m --total-executor-cores 2 /data/spark/miaohq/scalatestapp/scalatest4.jar hdfs://spark29:9000/home/miaohq/testname.txt

1. The driver starts an HTTP service on a port:

2. The submitted jar file is placed on this web server so that executors can fetch it.

3. The application is registered and two executors are started.

4. DAG scheduling:

Complete the first stage:

Dispatch the second stage:

Complete the second stage and produce the output:

Doubts:

1. With such a small file, the process of partitioning the file cannot be observed. Also, if a certain number of execution cores is set, will there be that many executors, and if it exceeds the total available, will there be even more threads?

2. Why does one stage contain two tasks? By the principle above, the number of tasks should equal the number of partitions of the file, and the current test file is very small and should fit in a single partition. Is this related to the executors? Even with three execution cores configured, there are still only two tasks.

3. Why does the second stage start from mapValues? Shouldn't it start from groupByKey()?
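
One way to experiment with doubts 1 and 2 is to request a minimum number of partitions explicitly when reading the file; textFile takes an optional second argument, minPartitions (the value 4 here and the variable name are only illustrative):

val input = sc.textFile(args(0), 4)   // request at least 4 partitions even for a tiny file
println(input.partitions.length)      // each partition becomes one task in the first stage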

6. Execution of Spark under the Standalone Architecture

1. Standalone is a resource-scheduling framework implemented by Spark itself; it consists of client nodes, a master node, and worker nodes.

2. The driver can run on the master node or on the local client. When a job is submitted to Spark with the spark-shell interactive tool, the driver runs on the master node; when it is submitted with spark-submit or from a program that calls SparkConf.setMaster("spark://master:7077"), the driver runs on the client side.

3. When the driver runs on the client side, the execution process is as follows:

Description:

1) SparkContext connects to the master, registers, and requests resources (CPU and memory).

2) The master decides which workers to allocate resources on, based on the application's resource requirements and the workers' heartbeat reports, acquires the resources, and then starts StandaloneExecutorBackend there.

3) StandaloneExecutorBackend registers with SparkContext.

4) SparkContext sends the application code to StandaloneExecutorBackend and builds the DAG from the code. When an action is encountered, a job is generated; the job is then split into multiple stages according to the RDD dependencies within it, and the stages are submitted to the TaskScheduler.

5) StandaloneExecutorBackend obtains task information when it reports its status, calls the executor to run the tasks in multiple threads, and reports progress to SparkContext until the tasks are complete.

6) After all the tasks are complete, SparkContext unregisters from the master and releases the resources.
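
In code, the unregistration in step 6 corresponds to stopping the SparkContext explicitly; the NameCountCh example above could end with a line like this (a small addition, not part of the original listing):

sc.stop()   // unregister the application from the master and release the executors' resources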

Note: the images and content in this article are from http://www.cnblogs.com/shishanyuan
