Apache Spark Source Code Reading 2 -- Submitting and Running a Job


You are welcome to reprint this article; please credit the original author, huichiro.

Summary

This article uses wordcount as an example to describe in detail how a job is created and run in Spark, focusing on which processes and threads are created along the way.

Setting Up the Lab Environment

Before performing subsequent operations, make sure that the following conditions are met.

  1. Download the Spark 0.9.1 binary package
  2. Install Scala
  3. Install SBT
  4. Install Java
Start spark-shell in single-machine mode, that is, local mode

Local mode is the simplest to run; just execute the following command. Assume that the current directory is $SPARK_HOME.

MASTER=local bin/spark-shell

"Master = Local" indicates that the instance is currently running in standalone mode.

Run in local cluster mode

Local cluster mode is a pseudo-cluster mode: a standalone cluster is simulated in a single-machine environment. The startup sequence is as follows:

  1. Start master
  2. Start worker
  3. Start spark-shell
Start master
$SPARK_HOME/sbin/start-master.sh

Note the output of this command; logs are saved in the $SPARK_HOME/logs directory by default.

The master is essentially a running instance of the class org.apache.spark.deploy.master.Master, which starts a listener on port 8080, as shown in the log.

Modify configurations
  1. Go to the $SPARK_HOME/conf directory
  2. Rename spark-env.sh.template to spark-env.sh
  3. Edit spark-env.sh and add the following lines
export SPARK_MASTER_IP=localhost
export SPARK_LOCAL_IP=localhost
Start worker
bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -i 127.0.0.1  -c 1 -m 512M

After the worker starts, it connects to the master. Open the master's web UI and you can see the listening address of the connected worker. The master web UI is at http://localhost:8080.

Start spark-shell
MASTER=spark://localhost:7077 bin/spark-shell

If everything goes well, you will see the following prompt:

Created spark context..
Spark context available as sc.

Open http://localhost:4040 in a browser to view the following tabs:

  1. Stages
  2. Storage
  3. Environment
  4. Executors
Wordcount

Once the preceding environment is ready, run the simplest example by entering the following code in spark-shell:

scala> sc.textFile("README.md").filter(_.contains("Spark")).count

The code above counts the number of lines in README.md that contain the word "Spark".
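
Strictly speaking, the line above is a filtered line count. For comparison, a classic word count can be entered in spark-shell as follows; the intermediate variable names and the choice of printing ten results are my own and not from the original article.

val file   = sc.textFile("README.md")
val words  = file.flatMap(line => line.split(" "))   // split each line into words
val pairs  = words.map(word => (word, 1))            // pair each word with the count 1
val counts = pairs.reduceByKey(_ + _)                // sum the counts for each word
counts.take(10).foreach(println)                     // print ten (word, count) pairs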

Detailed description of deployment process

The Spark deployment environment consists of the following components:

  • Driver program: In short, the wordcount statement entered in spark-shell corresponds to the driver program (a minimal standalone driver sketch appears after the note below).
  • Cluster Manager: Corresponds to the master described above; its main role is deployment management.
  • Worker Node: A slave node relative to the master node; the executors run on it. An executor can be thought of as corresponding to a thread. An executor handles two basic kinds of business logic: one is the driver program, the other is the stages that a job is split into after submission, where each stage can run one or more tasks.

Note: In cluster mode, the Cluster Manager runs in one JVM process, while each worker runs in another JVM process. In local cluster mode, these JVM processes are all on the same machine; in a real standalone, Mesos, or YARN cluster, the worker and master nodes are distributed across different hosts.
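
To make the role of the driver program concrete, here is a minimal standalone driver sketch. The object name and the use of README.md as input are illustrative assumptions; it does the same work as the line typed into spark-shell, but as a self-contained application that reaches the Cluster Manager through the master URL.

import org.apache.spark.SparkContext

// A hypothetical standalone driver; it plays the same role as the wordcount
// statement typed into spark-shell.
object WordCountDriver {
  def main(args: Array[String]) {
    // spark://localhost:7077 points at the standalone master (the Cluster Manager).
    val sc = new SparkContext("spark://localhost:7077", "WordCountDriver")
    val count = sc.textFile("README.md").filter(_.contains("Spark")).count
    println("Lines containing Spark: " + count)
    sc.stop()
  }
}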

Job generation and running

A simple outline of how a job is generated is as follows (the sketch after this list maps the wordcount example onto these steps):

  1. First, the application creates a SparkContext instance; in spark-shell, for example, this instance is called sc.
  2. The SparkContext instance is used to create RDDs.
  3. Through a series of transformations, the original RDD is converted into other RDDs.
  4. When an action is applied to the transformed RDD, the runJob method of SparkContext is called.
  5. The call to sc.runJob is the starting point of the chain reaction that follows; this is where the key jump occurs.
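
The following annotated sketch maps the wordcount example onto these steps; the variable names are my own.

// Step 1 is already done for us: spark-shell creates the SparkContext instance sc.
val file  = sc.textFile("README.md")           // step 2: create an RDD from a text file
val lines = file.filter(_.contains("Spark"))   // step 3: a transformation yields a new RDD
val n     = lines.count                        // step 4: the action count() calls sc.runJob internally
println(n)                                     // step 5: sc.runJob is where the chain reaction described below begins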

The call path is roughly as follows (a conceptual sketch of steps 4 through 7 follows the list):

  1. sc.runJob -> DAGScheduler.runJob -> DAGScheduler.submitJob
  2. DAGScheduler.submitJob creates a JobSubmitted event and sends it to the embedded eventProcessActor.
  3. After eventProcessActor receives JobSubmitted, the processEvent handler function is called.
  4. The key step is the call to submitStage.
  5. In submitStage, the dependencies between stages are computed; dependencies come in two types: wide dependencies and narrow dependencies.
  6. If the computation finds that the current stage has no missing dependencies, or that all of its dependencies are already prepared, the tasks are submitted.
  7. Submitting the tasks is done by calling the function submitMissingTasks.
  8. The TaskScheduler manages which worker a task actually runs on; that is, submitMissingTasks above calls TaskScheduler.submitTasks.
  9. In TaskSchedulerImpl, the corresponding backend is created according to Spark's current running mode; when running on a single machine, a LocalBackend is created.
  10. LocalBackend receives the ReviveOffers event passed in by TaskSchedulerImpl.
  11. ReviveOffers -> Executor.launchTask -> TaskRunner.run
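
To make steps 4 through 7 concrete, here is a conceptual sketch of the recursive stage-submission logic. It is not the actual DAGScheduler source; Stage and submitTasks are simplified stand-ins that only mirror the control flow described in the list.

// Simplified stand-in for a scheduler stage: an id, parent stages, and a completion flag.
case class Stage(id: Int, parents: List[Stage], var done: Boolean = false)

def submitStage(stage: Stage, submitTasks: Stage => Unit) {
  // Parent stages that have not finished yet block the current stage.
  val missingParents = stage.parents.filterNot(_.done)
  if (missingParents.isEmpty) {
    submitTasks(stage)                                        // no pending dependencies: submit this stage's tasks
  } else {
    missingParents.foreach(p => submitStage(p, submitTasks))  // submit unfinished parents first
  }
}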

Code snippet: Executor.launchTask

 def launchTask(context: ExecutorBackend, taskId: Long, serializedTask: ByteBuffer) {
   val tr = new TaskRunner(context, taskId, serializedTask)  // wrap the serialized task in a TaskRunner
   runningTasks.put(taskId, tr)                              // register it in the map of running tasks
   threadPool.execute(tr)                                    // hand it to the executor's thread pool
 }

After chasing the call path this far, we can see that the final task logic actually runs inside the TaskRunner within the Executor.

The computation result is packaged into a MapStatus and then fed back to the DAGScheduler through a series of internal messages. This message path is not too complex, and you can trace it yourself if you are interested.

 
