Apache Spark Source 2 -- Job Submission and Operation
Reprinted from: http://www.cnblogs.com/hseagle/p/3673123.html
Overview
This article takes WordCount as an example, detailing the process by which Spark creates and runs a job, with a focus on process and thread creation.
Setting up the experimental environment
Make sure the following prerequisites are met before proceeding.
- Download the Spark 0.9.1 binary
- Install Scala
- Install SBT
- Install Java
Running spark-shell in single-machine mode (local mode)
Local mode is very simple to run: just execute the following command, assuming the current directory is $SPARK_HOME.
MASTER=local bin/spark-shell
"Master=local" means that it is currently running in stand-alone mode
Running in local cluster mode
Local cluster mode is a pseudo-cluster mode that simulates a standalone cluster in a single-machine environment. The startup sequence is as follows:
- Start the master
- Start the worker
- Start spark-shell
Starting the master
$SPARK_HOME/sbin/start-master.sh
Note the runtime output, which is saved by default in the $SPARK_HOME/logs directory.
The master mainly runs the class org.apache.spark.deploy.master.Master and starts a web UI listening on port 8080, as can be seen in the log output.
Modify Configuration
- Enter the $SPARK_HOME/conf directory
- Rename spark-env.sh.template to spark-env.sh
- Modify spark-env.sh by adding the following lines:
export SPARK_MASTER_IP=localhost
export SPARK_LOCAL_IP=localhost
Running the worker
bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -i 127.0.0.1 -c 1 -m 512M
Once the worker has started, it connects to the master. Open the master web UI to see the connected worker. The master web UI listens at http://localhost:8080.
Starting spark-shell
MASTER=spark://localhost:7077 bin/spark-shell
If all goes well, you will see the following message.
Created spark context. Spark context available as sc.
You can open localhost:4040 in your browser to see the following:
- Stages
- Storage
- Environment
- Executors
WordCount
Once the environment is ready, let's run the simplest example. Enter the following code in spark-shell:
scala> sc.textFile("README.md").filter(_.contains("Spark")).count
The code above counts the number of lines in README.md that contain "Spark".
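The same statement can also be split into named steps (the intermediate variable names are arbitrary; this is equivalent to the one-liner above and is shown only to make the boundary between lazy transformations and the action explicit):

scala> val file = sc.textFile("README.md")                // creates the initial RDD; nothing is computed yet
scala> val sparkLines = file.filter(_.contains("Spark"))  // transformation: a new RDD built on top of file, still lazy
scala> sparkLines.count                                   // action: only this call actually submits a job

Only the call to count submits a job; the first two lines merely record how the data is to be derived. It is precisely this action-triggered submission that the rest of the article traces through the scheduler.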
Detailed deployment process
The components of a Spark deployment environment are as follows.
- Driver Program: briefly, the WordCount statement entered in spark-shell corresponds to the driver program.
- Cluster Manager: corresponds to the master mentioned above; its role is primarily deployment management.
- Worker Node: the slave node, as opposed to the master. Executors run on the worker, and each executor can correspond to a thread. An executor handles two basic kinds of business logic: one is the driver program; the other is a submitted job, which is split into stages, each of which can run one or more tasks.
Notes: in cluster mode, the cluster manager runs in one JVM process and each worker runs in another JVM process. In local cluster mode these JVM processes are all on the same machine; in a real Standalone, Mesos, or YARN cluster, the workers and the master are distributed across different hosts.
Job generation and execution
The simplified process of job generation is as follows:
- The application first creates an instance of SparkContext, for example the instance sc
- The SparkContext instance is used to create an RDD
- After a series of transformations, the original RDD is converted into RDDs of other types
- When an action is applied to the transformed RDD, the runJob method of SparkContext is called
- The call to sc.runJob is the starting point of a chain reaction; the critical jumps happen here
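To make these steps concrete, here is a small self-contained toy model. It is not Spark's real code: every name in it (ToyContext, ToyRDD, LinesRDD) is invented for illustration. It only mimics how a transformation builds a new RDD that wraps its parent, while the action is the first call that reaches the context's runJob:

import scala.collection.Seq

object RddSideSketch {

  class ToyContext {
    // Stands in for SparkContext.runJob: apply a function to every partition.
    def runJob[T, U](rdd: ToyRDD[T], func: Iterator[T] => U): Seq[U] =
      rdd.partitions.map(p => func(rdd.compute(p)))
  }

  abstract class ToyRDD[T](val ctx: ToyContext) {
    def partitions: Seq[Int]
    def compute(partition: Int): Iterator[T]

    // Transformation: returns a new RDD that wraps this one; nothing runs yet.
    def filter(pred: T => Boolean): ToyRDD[T] = {
      val parent = this
      new ToyRDD[T](ctx) {
        def partitions = parent.partitions
        def compute(partition: Int) = parent.compute(partition).filter(pred)
      }
    }

    // Action: only here is the context's runJob finally called.
    def count(): Long =
      ctx.runJob(this, (iter: Iterator[T]) => iter.size.toLong).sum
  }

  // A source RDD over in-memory lines, standing in for sc.textFile(...).
  class LinesRDD(context: ToyContext, lines: Seq[String]) extends ToyRDD[String](context) {
    def partitions = Seq(0)
    def compute(partition: Int) = lines.iterator
  }

  def main(args: Array[String]): Unit = {
    val ctx   = new ToyContext
    val lines = new LinesRDD(ctx, Seq("Apache Spark", "hello world", "Spark shell"))
    println(lines.filter(_.contains("Spark")).count())   // prints 2
  }
}

The real RDD and SparkContext are of course far richer (partitioning, dependencies, serialization, fault tolerance), but the division of labour is the same: transformations record lineage, and the action hands the final RDD to runJob.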
The call path is roughly as follows
- sc.runJob -> DAGScheduler.runJob -> submitJob
- DAGScheduler::submitJob creates a JobSubmitted event and sends it to the inner class eventProcessActor
- eventProcessActor calls the processEvent handler after receiving JobSubmitted
- The job is converted into stages, the finalStage is generated and submitted for execution; the key call is submitStage
- submitStage computes the dependencies between stages; dependencies are divided into two kinds, wide dependencies and narrow dependencies
- If the current stage is found to have no missing dependencies, or all of its dependencies are already ready, its tasks are submitted
- Submitting the tasks is done by calling the function submitMissingTasks
- Which worker a task actually runs on is managed by the TaskScheduler, i.e. the submitMissingTasks above calls TaskScheduler::submitTasks
- In TaskSchedulerImpl, the corresponding backend is created according to Spark's current run mode; when running on a single machine this is LocalBackend
- LocalBackend receives the ReviveOffers event passed in by TaskSchedulerImpl
- ReviveOffers -> Executor.launchTask -> TaskRunner.run
Code snippet: Executor.launchTask
def launchTask(context: ExecutorBackend, taskId: Long, serializedTask: ByteBuffer) {
  val tr = new TaskRunner(context, taskId, serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
}
After all this chasing, the conclusion is that the final logical processing really happens inside the executor, in a TaskRunner.
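To tie the whole call path together, the following self-contained sketch mirrors only the shape of the chain (runJob -> DAG scheduling -> task scheduling -> local backend -> executor thread pool). It is a toy model, not Spark's implementation: every class in it is simplified and invented for illustration, and the JobSubmitted/eventProcessActor hop as well as the stage dependency analysis are omitted.

import java.util.concurrent.Executors

object CallChainSketch {

  // Stands in for a task produced by splitting a stage into pieces of work.
  final case class Task(id: Long, body: () => Unit)

  // Mirrors TaskRunner.run: this is where the task logic finally executes.
  class TaskRunner(task: Task) extends Runnable {
    override def run(): Unit = {
      println(s"[${Thread.currentThread().getName}] running task ${task.id}")
      task.body()
    }
  }

  // Mirrors Executor.launchTask: wrap the task in a runner and hand it to the
  // thread pool, so the actual work happens on an executor thread.
  class Executor {
    private val threadPool = Executors.newFixedThreadPool(2)
    def launchTask(task: Task): Unit = threadPool.execute(new TaskRunner(task))
    def shutdown(): Unit = threadPool.shutdown()
  }

  // Mirrors TaskSchedulerImpl + LocalBackend: submitted tasks are offered to
  // the single local executor.
  class TaskScheduler(executor: Executor) {
    def submitTasks(tasks: Seq[Task]): Unit = reviveOffers(tasks)
    private def reviveOffers(tasks: Seq[Task]): Unit = tasks.foreach(executor.launchTask)
  }

  // Mirrors DAGScheduler: turns a job into tasks and submits those whose
  // dependencies are ready (here: all of them, since stages are omitted).
  class DAGScheduler(taskScheduler: TaskScheduler) {
    def runJob(partitions: Range, func: Int => Unit): Unit =
      submitMissingTasks(partitions.map(p => Task(p, () => func(p))))
    private def submitMissingTasks(tasks: Seq[Task]): Unit =
      taskScheduler.submitTasks(tasks)
  }

  def main(args: Array[String]): Unit = {
    val executor  = new Executor
    val scheduler = new DAGScheduler(new TaskScheduler(executor))
    // Stands in for sc.runJob: one function applied to each of 4 partitions.
    scheduler.runJob(0 until 4, p => println(s"  work for partition $p"))
    executor.shutdown()
  }
}

What the sketch preserves is the conclusion above: whichever path the schedulers take, the task body ultimately runs inside a TaskRunner on an executor thread.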