Apache Spark Source Analysis-job submission and operation


This article takes WordCount as an example and walks through the process by which Spark creates and runs a job, with a focus on process and thread creation.

Setting up the experimental environment

Make sure the following prerequisites are met before proceeding:

1. Download the Spark 0.9.1 binary
2. Install Scala
3. Install SBT
4. Install Java

Starting spark-shell

Single-machine operation, i.e. local mode

Local mode is very simple to run. Just execute the following command, assuming the current directory is $SPARK_HOME:

    MASTER=local bin/spark-shell

"MASTER=local" means that it is currently running in single-machine (local) mode.

Local cluster mode

local-cluster mode is a pseudo-cluster mode that simulates a standalone cluster on a single machine. The startup sequence is as follows:

1. Start the master
2. Start the worker
3. Start spark-shell

Master

    $SPARK_HOME/sbin/start-master.sh

Note the output of this command; the logs are saved in the $SPARK_HOME/logs directory by default. The master mainly runs the class org.apache.spark.deploy.master.Master and starts listening on port 8080, as shown in the log.

Modify the configuration

1. Enter the $SPARK_HOME/conf directory
2. Rename spark-env.sh.template to spark-env.sh
3. Modify spark-env.sh and add the following lines:

    export SPARK_MASTER_IP=localhost
    export SPARK_LOCAL_IP=localhost

Running the worker

    bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -i 127.0.0.1 -c 1 -m 512M

Once the worker has started, it connects to the master. Open the master web UI to see the connected worker. The master web UI listens at http://localhost:8080.

Start spark-shell

    MASTER=spark://localhost:7077 bin/spark-shell

If all goes well, you will see messages like the following:

    Created spark context..
    Spark context available as sc.

You can open localhost:4040 in a browser to view the following tabs:

1. Stages
2. Storage
3. Environment
4. Executors

WordCount

After the environment is ready, let's run the simplest example in spark-shell. Enter the following code at the spark-shell prompt:

    scala> sc.textFile("README.md").filter(_.contains("Spark")).count

The code above counts the number of lines in README.md that contain "Spark".
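The example above counts lines rather than individual words. For comparison, a fuller per-word count can be sketched in spark-shell as follows; this snippet is an added illustration and is not part of the original walkthrough:

    scala> // Split lines into words, pair each word with 1, then sum per word.
    scala> val counts = sc.textFile("README.md").
         |   flatMap(_.split("\\s+")).
         |   map(word => (word, 1)).
         |   reduceByKey(_ + _)
    scala> counts.take(5)   // take is an action, so this line triggers the actual job

Until an action such as take or count is invoked, the transformations only build up the RDD lineage; this matches the job-generation steps described later, where an action is what causes SparkContext.runJob to be called.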


Detailed deployment process

The components of the Spark deployment environment are as follows.

  • Driver Program — briefly, the WordCount statement entered in spark-shell corresponds to the driver program; a standalone sketch of such a driver follows this list.
  • Cluster Manager — in the standalone deployment described here, this role is played by the master, which manages the cluster's resources.
  • Worker Node — the slave node, as opposed to the master. Executors run on the worker, and each executor can correspond to a thread. An executor handles two kinds of basic business logic: one is the driver program itself; the other is the job, which after submission is split into stages, each of which can run one or more tasks.
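To make the role of the driver program concrete, here is a minimal sketch of the spark-shell WordCount written as a standalone driver application. The object name SimpleWordCount and the choice of master URL are illustrative assumptions, not something taken from the original article.

    import org.apache.spark.{SparkConf, SparkContext}

    // A minimal standalone driver program (illustrative names). spark-shell does
    // the equivalent of this setup for you and exposes the context as `sc`.
    object SimpleWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("SimpleWordCount")
          .setMaster("spark://localhost:7077") // or "local" for local mode
        val sc = new SparkContext(conf)

        // Same logic as the spark-shell example: count lines containing "Spark".
        val count = sc.textFile("README.md").filter(_.contains("Spark")).count()
        println(s"Lines containing Spark: $count")

        sc.stop()
      }
    }

When this program runs, the count action is what triggers SparkContext.runJob, which is the starting point of the submission path traced in the next section.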
Notes: In cluster mode, the cluster manager runs in one JVM process while each worker runs in another JVM process. In a local cluster these JVM processes are on the same machine; in a real standalone, Mesos, or YARN cluster, the master and workers are distributed across different hosts.

Job generation and execution

The basic process of generating a job is as follows:

1. The application first creates an instance of SparkContext, such as the instance sc.
2. The SparkContext instance is used to create an RDD.
3. Through a series of transformation operations, the original RDD is converted into RDDs of other types.
4. When an action is applied to the converted RDD, SparkContext's runJob method is called.
5. The sc.runJob call is the starting point of a chain of reactions, and the critical transitions take place along it.

The call path is roughly as follows:

1. sc.runJob -> dagScheduler.runJob -> submitJob
2. DAGScheduler::submitJob creates a JobSubmitted event and sends it to the inner eventProcessActor.
3. After eventProcessActor receives JobSubmitted, it calls the processEvent handler function.
4. The job is converted into stages, and the finalStage is generated and submitted for execution; the key step is the call to submitStage.
5. The dependencies between stages are computed in submitStage; dependencies are divided into two kinds, wide dependencies and narrow dependencies.
6. If the computation finds that the current stage has no dependencies, or that all of its dependencies are already prepared, the tasks are submitted.
7. Submitting tasks is done by calling the function submitMissingTasks.
8. Which worker a task actually runs on is managed by the TaskScheduler, i.e. submitMissingTasks above calls TaskScheduler::submitTasks.
9. In TaskSchedulerImpl the corresponding backend is created according to Spark's current running mode; LocalBackend is created when running on a single machine.
10. LocalBackend receives the ReviveOffers event passed in by TaskSchedulerImpl.
11. reviveOffers -> executor.launchTask -> TaskRunner.run

Code snippet: Executor.launchTask

    def launchTask(context: ExecutorBackend, taskId: Long, serializedTask: ByteBuffer) {
      val tr = new TaskRunner(context, taskId, serializedTask)
      runningTasks.put(taskId, tr)
      threadPool.execute(tr)
    }

Having chased the code this far, the point is that the final logical processing really happens inside a TaskRunner within an executor. The result of the run is wrapped into a MapStatus and then fed back to the DAGScheduler through a series of internal message passing. This message-passing path is not too complex; interested readers can sketch it out for themselves.
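As a rough, self-contained analogue of the launchTask/TaskRunner pattern above (simplified and using made-up names, not the actual Spark classes), the idea is simply to wrap the work in a Runnable, record it as in flight, and hand it to a thread pool:

    import java.util.concurrent.{ConcurrentHashMap, Executors}

    // Simplified, hypothetical analogue of Executor.launchTask / TaskRunner.
    // The real TaskRunner deserializes the task, runs it, and reports the result
    // (e.g. a MapStatus) back through the ExecutorBackend; here only the
    // threading skeleton is kept.
    object MiniExecutor {
      private val threadPool   = Executors.newCachedThreadPool()
      private val runningTasks = new ConcurrentHashMap[Long, Runnable]()

      private class TaskRunner(taskId: Long, body: () => Unit) extends Runnable {
        override def run(): Unit =
          try body() finally runningTasks.remove(taskId) // deregister when done
      }

      def launchTask(taskId: Long, body: () => Unit): Unit = {
        val tr = new TaskRunner(taskId, body)
        runningTasks.put(taskId, tr)  // track the in-flight task
        threadPool.execute(tr)        // the work runs on a pool thread
      }
    }

The result handling and the MapStatus feedback to the DAGScheduler mentioned above sit on top of this same pattern.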
