Apache Spark Source Code Reading 2 -- Submitting and Running a Job


You are welcome to reprint this article; please credit the original author, huichiro.

Summary

This article uses wordcount as an example to describe in detail how a job is created and run in Spark, focusing on which processes and threads are created along the way.

Setting Up the Lab Environment

Before performing subsequent operations, make sure that the following conditions are met.

  1. Download the Spark 0.9.1 binary package
  2. Install Scala
  3. Install SBT
  4. Install Java
Start spark-shell in single-machine mode, that is, local mode

Local mode is the simplest to run; just execute the following command. Assume that the current directory is $SPARK_HOME.

MASTER=local bin/spark-shell

"Master = Local" indicates that the instance is currently running in standalone mode.

Run in local cluster mode

Local cluster mode is a pseudo-cluster mode: a standalone cluster is simulated in a single-machine environment. The startup sequence is as follows:

  1. Start master
  2. Start worker
  3. Start spark-shell
Start master
$SPARK_HOME/sbin/start-master.sh

Note the output of this command; logs are saved in the $SPARK_HOME/logs directory by default.

The master is essentially a running instance of the class org.apache.spark.deploy.master.Master, which starts a listener on port 8080, as shown in the log.

Modify configurations
  1. Go to the $SPARK_HOME/conf directory
  2. Rename spark-env.sh.template to spark-env.sh
  3. Edit spark-env.sh and add the following lines
export SPARK_MASTER_IP=localhost
export SPARK_LOCAL_IP=localhost
Start worker
bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -i 127.0.0.1  -c 1 -m 512M

After the worker starts, it connects to the master. Open the master's web UI and you can see the listening address of the connected worker. The master web UI is at http://localhost:8080.

Start spark-shell
MASTER=spark://localhost:7077 bin/spark-shell

If everything goes well, you will see the following prompt:

Created spark context..
Spark context available as sc.

Open http://localhost:4040 in a browser to view the following tabs:

  1. Stages
  2. Storage
  3. Environment
  4. Executors
Wordcount

Once the preceding environment is ready, run the simplest example by entering the following code in spark-shell:

scala> sc.textFile("README.md").filter(_.contains("Spark")).count

The code above counts the number of lines in README.md that contain the word "Spark".
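
Strictly speaking, the line above is a filtered line count. For comparison, a classic word count can be entered in spark-shell as follows; the intermediate variable names and the choice of printing ten results are my own and not from the original article.

val file   = sc.textFile("README.md")
val words  = file.flatMap(line => line.split(" "))   // split each line into words
val pairs  = words.map(word => (word, 1))            // pair each word with the count 1
val counts = pairs.reduceByKey(_ + _)                // sum the counts for each word
counts.take(10).foreach(println)                     // print ten (word, count) pairs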

Detailed description of deployment process

The Spark deployment environment consists of the following components:

  • Driver program: In short, the wordcount statement entered in spark-shell corresponds to the driver program (a minimal standalone driver sketch appears after the note below).
  • Cluster Manager: Corresponds to the master described above; its main role is deployment management.
  • Worker Node: A slave node relative to the master node; the executors run on it. An executor can be thought of as corresponding to a thread. An executor handles two basic kinds of business logic: one is the driver program, the other is the stages that a job is split into after submission, where each stage can run one or more tasks.

Note: In cluster mode, the Cluster Manager runs in one JVM process, while each worker runs in another JVM process. In local cluster mode, these JVM processes are all on the same machine; in a real standalone, Mesos, or YARN cluster, the worker and master nodes are distributed across different hosts.
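
To make the role of the driver program concrete, here is a minimal standalone driver sketch. The object name and the use of README.md as input are illustrative assumptions; it does the same work as the line typed into spark-shell, but as a self-contained application that reaches the Cluster Manager through the master URL.

import org.apache.spark.SparkContext

// A hypothetical standalone driver; it plays the same role as the wordcount
// statement typed into spark-shell.
object WordCountDriver {
  def main(args: Array[String]) {
    // spark://localhost:7077 points at the standalone master (the Cluster Manager).
    val sc = new SparkContext("spark://localhost:7077", "WordCountDriver")
    val count = sc.textFile("README.md").filter(_.contains("Spark")).count
    println("Lines containing Spark: " + count)
    sc.stop()
  }
}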

Job generation and running

A simple outline of how a job is generated is as follows (the sketch after this list maps the wordcount example onto these steps):

  1. First, the application creates a SparkContext instance; in spark-shell, for example, this instance is called sc.
  2. The SparkContext instance is used to create RDDs.
  3. Through a series of transformations, the original RDD is converted into other RDDs.
  4. When an action is applied to the transformed RDD, the runJob method of SparkContext is called.
  5. The call to sc.runJob is the starting point of the chain reaction that follows; this is where the key jump occurs.
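
The following annotated sketch maps the wordcount example onto these steps; the variable names are my own.

// Step 1 is already done for us: spark-shell creates the SparkContext instance sc.
val file  = sc.textFile("README.md")           // step 2: create an RDD from a text file
val lines = file.filter(_.contains("Spark"))   // step 3: a transformation yields a new RDD
val n     = lines.count                        // step 4: the action count() calls sc.runJob internally
println(n)                                     // step 5: sc.runJob is where the chain reaction described below begins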

The call path is roughly as follows (a conceptual sketch of steps 4 through 7 follows the list):

  1. sc.runJob -> DAGScheduler.runJob -> DAGScheduler.submitJob
  2. DAGScheduler.submitJob creates a JobSubmitted event and sends it to the embedded eventProcessActor.
  3. After eventProcessActor receives JobSubmitted, the processEvent handler function is called.
  4. The key step is the call to submitStage.
  5. In submitStage, the dependencies between stages are computed; dependencies come in two types: wide dependencies and narrow dependencies.
  6. If the computation finds that the current stage has no missing dependencies, or that all of its dependencies are already prepared, the tasks are submitted.
  7. Submitting the tasks is done by calling the function submitMissingTasks.
  8. The TaskScheduler manages which worker a task actually runs on; that is, submitMissingTasks above calls TaskScheduler.submitTasks.
  9. In TaskSchedulerImpl, the corresponding backend is created according to Spark's current running mode; when running on a single machine, a LocalBackend is created.
  10. LocalBackend receives the ReviveOffers event passed in by TaskSchedulerImpl.
  11. ReviveOffers -> Executor.launchTask -> TaskRunner.run
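
To make steps 4 through 7 concrete, here is a conceptual sketch of the recursive stage-submission logic. It is not the actual DAGScheduler source; Stage and submitTasks are simplified stand-ins that only mirror the control flow described in the list.

// Simplified stand-in for a scheduler stage: an id, parent stages, and a completion flag.
case class Stage(id: Int, parents: List[Stage], var done: Boolean = false)

def submitStage(stage: Stage, submitTasks: Stage => Unit) {
  // Parent stages that have not finished yet block the current stage.
  val missingParents = stage.parents.filterNot(_.done)
  if (missingParents.isEmpty) {
    submitTasks(stage)                                        // no pending dependencies: submit this stage's tasks
  } else {
    missingParents.foreach(p => submitStage(p, submitTasks))  // submit unfinished parents first
  }
}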

Code snippet: Executor.launchTask

 def launchTask(context: ExecutorBackend, taskId: Long, serializedTask: ByteBuffer) {
   val tr = new TaskRunner(context, taskId, serializedTask)  // wrap the serialized task in a TaskRunner
   runningTasks.put(taskId, tr)                              // register it in the map of running tasks
   threadPool.execute(tr)                                    // hand it to the executor's thread pool
 }

After chasing the call path this far, we can see that the final task logic actually runs inside the TaskRunner within the Executor.

The computation result is packaged into a MapStatus and then fed back to the DAGScheduler through a series of internal messages. This message path is not too complex, and you can trace it yourself if you are interested.

 
