1. Build the Spark application's runtime environment;
The driver program creates a SparkContext (the program that contains the SparkContext is called the driver program).
A Spark application runs as a set of independent executor processes on the cluster, coordinated by the SparkContext.
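The driver/executor topology above can be sketched with a toy model. This is not Spark's API: `ToyContext` is an invented stand-in for the SparkContext, and threads stand in for what are separate executor processes (JVMs) in real Spark.

```python
# Toy model (invented names, NOT Spark's API) of the topology described
# above: a driver holds a "context" that coordinates a set of executors.
from concurrent.futures import ThreadPoolExecutor

class ToyContext:
    """Hypothetical stand-in for SparkContext: coordinates executors."""
    def __init__(self, num_executors):
        # Real Spark executors are independent processes; threads are used
        # here only to keep the sketch self-contained and runnable.
        self.pool = ThreadPoolExecutor(max_workers=num_executors)

    def run(self, func, data):
        # The driver ships work out to executors and collects the results.
        return list(self.pool.map(func, data))

ctx = ToyContext(num_executors=4)
print(ctx.run(lambda x: x * x, range(5)))
```

The key property the sketch preserves is that the driver is the single coordination point: work originates there and results flow back there.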
2. The SparkContext applies to the cluster manager for the resources needed to run executors and starts StandaloneExecutorBackend; the executors then register with the SparkContext and request tasks from it;
The SparkContext connects to a cluster manager (Standalone, YARN, or Mesos), and the cluster manager allocates resources for the executors that run the application. Once the connection is established, the Spark application acquires executor processes on the cluster's nodes. Each application has its own independent executors; an executor is the worker process that actually runs on a worker node, computing and storing data for its application.
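The allocation step can be illustrated with a minimal sketch. All names here (`ToyClusterManager`, `register_app`) are invented for illustration; the point is only that executors granted to one application are never shared with another.

```python
# Hypothetical sketch of executor allocation: the cluster manager hands
# each application its own exclusive executor slots.
class ToyClusterManager:
    def __init__(self, total_executors):
        self.free = list(range(total_executors))   # free executor slot ids
        self.owned = {}                            # app name -> slot ids

    def register_app(self, app_name, num_executors):
        # Once an application connects, it acquires executors that no
        # other application shares for as long as it runs.
        granted = self.free[:num_executors]
        self.free = self.free[num_executors:]
        self.owned[app_name] = granted
        return granted

cm = ToyClusterManager(total_executors=6)
a = cm.register_app("app-A", 2)
b = cm.register_app("app-B", 3)
assert set(a).isdisjoint(b)   # executors are never shared across apps
```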
3. After the SparkContext obtains the executors, the application's code is sent to each executor;
4. The SparkContext builds the RDD DAG, breaks it into a DAG of stages, and submits the stages to the TaskScheduler, which then sends tasks to the executors to run;
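Step 4 can be sketched as follows. This is a simplified model, not Spark's DAGScheduler: it encodes an RDD lineage as a list of operations tagged narrow or wide, and cuts the lineage into stages at wide (shuffle) dependencies, which is the conceptual rule Spark uses.

```python
# Simplified sketch of stage splitting: cut an RDD lineage into stages at
# wide (shuffle) dependencies. The lineage encoding is invented here.
def split_into_stages(lineage):
    """lineage: list of ("narrow" | "wide", op_name), first op to last."""
    stages, current = [], []
    for dep, op in lineage:
        if dep == "wide" and current:
            stages.append(current)   # a shuffle boundary closes the stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

lineage = [("narrow", "textFile"), ("narrow", "map"),
           ("wide", "reduceByKey"), ("narrow", "map")]
print(split_into_stages(lineage))
# -> [['textFile', 'map'], ['reduceByKey', 'map']]
```

Each resulting stage is a pipeline of narrow operations that can run on one executor without moving data; tasks for a stage are then what the TaskScheduler hands to the executors.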
5. Tasks run on the executors, and all resources are released when they finish;
Features of the Spark running architecture:
1. Each application obtains its own exclusive executor processes, which stay up for the lifetime of the application and run tasks in multiple threads. This isolation mechanism has advantages on both the scheduling side (each driver schedules its own tasks) and the execution side (tasks from different applications run in different JVMs). It also means, however, that a Spark application cannot share data with other applications unless it writes to an external storage system, for example Tachyon or Shark Server;
2. Spark does not care which cluster manager runs underneath it. It only needs to acquire executors and keep communicating with them, because tasks ultimately run on the executors;
3. Keep the driver program as close as possible to the workers (the nodes running the executors), ideally in the same rack, because there is a large amount of communication between the SparkContext and the executors while the application runs. If you need to run against a remote cluster, it is better to submit the application to the cluster via RPC than to run the driver far away from the workers;
4. Tasks use data-locality and speculative-execution optimizations. For details, see http://spark.apache.org/docs/latest/cluster-overview.html.
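The first feature above says an executor runs its tasks as multiple threads inside one process. A minimal stdlib sketch (the task function and partition data are invented for illustration):

```python
# Sketch of one executor process running tasks on multiple threads.
from concurrent.futures import ThreadPoolExecutor
import threading

def task(partition):
    # Each task processes one partition; recording the thread name shows
    # that tasks share the executor process but run on separate threads.
    return (threading.current_thread().name, sum(partition))

with ThreadPoolExecutor(max_workers=4, thread_name_prefix="task") as executor:
    results = list(executor.map(task, [[1, 2], [3, 4], [5, 6]]))

sums = [s for _, s in results]
print(sums)  # [3, 7, 11]
```

Because the threads share one process, they also share that process's memory and JVM-level state in real Spark, which is exactly why tasks from different applications are kept in different executor processes.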