This article takes WordCount as an example, detailing the process by which Spark creates and runs a job, with a focus on process and thread creation.
Setting up the experimental environment
Make sure the following prerequisites are met before proceeding.
1. Download the Spark 0.9.1 binary
2. Install Scala
3. Install SBT
4. Install Java
Running spark-shell in local mode
Local mode is the simplest to run; just execute the following command, assuming the current directory is $SPARK_HOME.
MASTER=local bin/spark-shell
"MASTER=local" means that it is running in single-machine (local) mode.
Running in local cluster mode
Local cluster mode is a pseudo-cluster mode that simulates a Standalone cluster on a single machine. The startup sequence is as follows:
1. Start the Master
2. Start the Worker
3. Start spark-shell
Start the Master
$SPARK_HOME/sbin/start-master.sh
Note the output at runtime; it is saved in the $SPARK_HOME/logs directory by default.
The Master mainly runs the class org.apache.spark.deploy.master.Master and starts listening on port 8080, as the startup log shows. [Figure: Master startup log]
Modify Configuration
1. Enter the $SPARK_HOME/conf directory
2. Rename spark-env.sh.template to spark-env.sh
3. Modify spark-env.sh to add the following:
export SPARK_MASTER_IP=localhost
export SPARK_LOCAL_IP=localhost
Running the Worker
bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -i 127.0.0.1 -c 1 -m 512M
Once the Worker has started, it connects to the Master. Open the Master web UI to see the connected Worker. The Master web UI listens at http://localhost:8080.
Start spark-shell
MASTER=spark://localhost:7077 bin/spark-shell
If all goes well, you will see the following message.
Created spark context. Spark context available as sc.
You can open http://localhost:4040 in your browser to see the following tabs:
1. Stages
2. Storage
3. Environment
4. Executors
WordCount
Now that the environment is ready, let's run the simplest example. Enter the following code in spark-shell:
scala> sc.textFile("README.md").filter(_.contains("Spark")).count
The line above counts the number of lines in README.md that contain "Spark".
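To make the transformation/action distinction explicit, the same computation can be broken into named steps. This is only a sketch for spark-shell and assumes README.md sits in the directory spark-shell was started from.

    val file = sc.textFile("README.md")                // lazily defines an RDD over the file; nothing is read yet
    val sparkLines = file.filter(_.contains("Spark"))  // transformation: still lazy, it only records the lineage
    sparkLines.count                                   // action: this is what actually triggers a job

Only the final count forces evaluation; the earlier lines merely build up the RDD lineage.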
Detailed deployment process
The components in the Spark deployment environment are shown in the figure below.
[Figure: components of a Spark deployment environment]
Driver Program: simply put, the WordCount statement entered in spark-shell corresponds to the driver program.
Cluster Manager: corresponds to the Master mentioned above; it mainly plays the role of deploy management.
Worker Node: the slave node, as opposed to the Master. Executors run on it, and each executor can correspond to a thread. An executor handles two basic kinds of business logic: one is the driver program; the other is a job that, after submission, is split into stages, where each stage can run one or more tasks.
Note: in cluster mode, the Cluster Manager runs in one JVM process and the Worker runs in another JVM process. In local cluster mode these JVM processes are all on the same machine; in a real Standalone, Mesos, or YARN cluster, the Workers and the Master are distributed across different hosts.
Job generation and execution
In outline, a job is generated as follows (a code sketch follows the list):
1. First, the application program creates an instance of SparkContext, e.g. the instance sc
2. The SparkContext instance is used to create an RDD
3. After a series of transformation operations, the original RDD is converted into RDDs of other types
4. When an action is applied to the transformed RDD, SparkContext's runJob method is called
5. The call to sc.runJob is the starting point of the whole chain of calls; the crucial jump happens here
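Below is a minimal standalone sketch of these steps. The object name, application name, and file path are made up for illustration; the original article only shows the spark-shell one-liner.

    import org.apache.spark.SparkContext

    object JobGenerationDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "JobGenerationDemo") // 1. create the SparkContext instance
        val lines = sc.textFile("README.md")                    // 2. use it to create an RDD
        val marked = lines.filter(_.contains("Spark"))          // 3. transformation: produces a new RDD
        val n = marked.count()                                  // 4. action: internally ends up calling sc.runJob
        println("lines containing Spark: " + n)
        sc.stop()
      }
    }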
The call path is roughly as follows
1. sc.runJob -> DAGScheduler.runJob -> submitJob
2. DAGScheduler::submitJob creates a JobSubmitted event and sends it to the inner eventProcessActor
3. After receiving JobSubmitted, eventProcessActor calls the processEvent handler function
4. The job is converted into stages; the finalStage is generated and submitted for execution, the key being the call to submitStage
5. submitStage computes the dependencies between stages; dependencies are divided into two kinds, wide dependencies and narrow dependencies (a concrete example follows this list)
6. If the computation finds that the current stage has no dependencies, or that all of its dependencies are already ready, the tasks are submitted
7. Submitting the tasks is done by calling the function submitMissingTasks
8. Which worker a task actually runs on is managed by the TaskScheduler; in other words, submitMissingTasks above calls TaskScheduler::submitTasks
9. In TaskSchedulerImpl the corresponding backend is created according to Spark's current run mode; when running on a single machine this is LocalBackend
10. LocalBackend receives the ReviveOffers event passed in by TaskSchedulerImpl
11. ReviveOffers -> executor.launchTask -> TaskRunner.run
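As an illustration of item 5, here is a hedged WordCount variant for spark-shell (not taken from the article): flatMap and map are narrow dependencies and stay in one stage, while the shuffle introduced by reduceByKey is a wide dependency that creates a stage boundary.

    // flatMap and map are narrow dependencies; reduceByKey introduces a shuffle (wide dependency),
    // so the DAGScheduler splits this job into two stages at that point
    val wordCounts = sc.textFile("README.md").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.count // the action that actually submits the job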
Code snippet: Executor.launchTask
  def launchTask(context: ExecutorBackend, taskId: Long, serializedTask: ByteBuffer) {
    val tr = new TaskRunner(context, taskId, serializedTask)
    runningTasks.put(taskId, tr)
    threadPool.execute(tr)
  }
Chasing the code this far shows that the final logical processing really happens inside a TaskRunner, within such an Executor.
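For readers less familiar with that idiom, below is a self-contained plain-Scala sketch of the same Runnable-plus-thread-pool pattern that launchTask relies on. Nothing here is Spark code; all names are made up. The task is wrapped in a Runnable, registered, and handed to a thread pool, so the real work runs on a pool thread.

    import java.util.concurrent.{ConcurrentHashMap, Executors}

    object LaunchTaskSketch {
      // Stand-in for Spark's TaskRunner: the work to do, expressed as a Runnable
      class TaskRunner(taskId: Long, payload: String) extends Runnable {
        override def run(): Unit =
          println("task " + taskId + " processing '" + payload + "' on " + Thread.currentThread().getName)
      }

      private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]() // bookkeeping, analogous to Executor.runningTasks
      private val threadPool = Executors.newCachedThreadPool()

      def launchTask(taskId: Long, payload: String): Unit = {
        val tr = new TaskRunner(taskId, payload)
        runningTasks.put(taskId, tr) // remember the task while it runs
        threadPool.execute(tr)       // run() executes asynchronously on a pool thread
      }

      def main(args: Array[String]): Unit = {
        (1L to 3L).foreach(id => launchTask(id, "work-" + id))
        threadPool.shutdown()
      }
    }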
The result of execution is wrapped into a MapStatus and then, through a series of internal message passing, fed back to the DAGScheduler. This message-passing path is not overly complex, and interested readers can trace it out for themselves.