About Spark
Spark is a distributed computing framework for big data built around in-memory computation. By keeping intermediate data in memory, Spark improves real-time processing in big-data environments while maintaining high fault tolerance and high scalability.
In Spark, computation is carried out through RDDs (Resilient Distributed Datasets), whose partitions are spread across the cluster and processed in parallel. The RDD is the underlying abstraction through which Spark distributes both data and computation; a short sketch follows the property list below.
RDD Properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
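The sketch below shows how these five properties surface on a concrete RDD. It is a minimal example assuming a local cluster; the object name, app name, and sample data are placeholders, and the calls used (`getNumPartitions`, `dependencies`, `partitioner`, `preferredLocations`) are part of the public RDD API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddPropertiesDemo {
  def main(args: Array[String]): Unit = {
    // Minimal local context; all settings here are illustrative.
    val conf = new SparkConf().setAppName("rdd-properties").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    // A key-value RDD produced by a shuffle, so it carries a partitioner.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
      .reduceByKey(_ + _)

    println(pairs.getNumPartitions)                        // property 1: the list of partitions
    println(pairs.dependencies)                            // property 3: dependencies on parent RDDs
    println(pairs.partitioner)                             // property 4: Some(HashPartitioner(...)) for key-value RDDs
    println(pairs.preferredLocations(pairs.partitions(0))) // property 5: preferred locations (empty for in-memory data)

    // Property 2, the compute function for each split, only runs when an action is invoked.
    pairs.collect().foreach(println)

    sc.stop()
  }
}
```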
Spark terminology
Application: a user-defined Spark program. After the user submits it, Spark allocates resources to the application, then converts and executes the program.
Driver Program: runs the application's main() function and creates the SparkContext.
SparkContext: the main interface between user logic and the Spark cluster; it interacts with the Cluster Manager and is responsible for requesting compute resources, among other duties.
Cluster Manager: the resource manager, responsible for managing and scheduling cluster resources. Supported managers are Standalone, Mesos, and YARN. In Standalone mode, the Master node controls the entire cluster and monitors the Workers; in YARN mode, this role is filled by the ResourceManager.
Worker Node: a slave node, responsible for controlling a compute node and starting Executors or the Driver. In YARN mode, the NodeManager controls the compute nodes.
Executor: a process launched on a Worker node for an application. It runs tasks through a thread pool and manages the in-memory or on-disk storage of data. Each application has its own independent set of Executors.
RDD DAG: when an RDD meets an action operator, all the preceding operators are assembled into a directed acyclic graph (DAG), which Spark then converts into a job and submits to the cluster for execution. An application can contain more than one job.
Job: a unit of work triggered by an RDD graph, usually by a Spark action operator; the job is submitted to Spark via runJob() on the SparkContext.
Stage: each job is cut into stages at the wide dependencies of its RDDs; each stage contains a set of identical tasks, also called a TaskSet.
Task: the unit of execution. Each task corresponds to one partition and executes the operators of its stage over that partition of the RDD. Tasks are wrapped and then placed in an Executor's thread pool for execution.
DAGScheduler: builds a stage-based DAG from the job and submits each stage to the TaskScheduler.
TaskScheduler: dispatches tasks to Executors for execution.
SparkEnv: a thread-level context that stores references to the important runtime components.
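To tie these terms together, here is a hedged sketch of a small application: the driver runs main() and creates the SparkContext, transformations build up the RDD DAG, and the action at the end triggers a job that is cut into stages at the wide (shuffle) dependency. The object name and input path are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TerminologyDemo {
  // The driver program: it runs main() and creates the SparkContext.
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("terminology-demo").setMaster("local[2]"))

    // Transformations only build up the RDD DAG; nothing executes yet.
    val words  = sc.textFile("input.txt")     // hypothetical input path
      .flatMap(_.split("\\s+"))
    val counts = words.map(word => (word, 1))
      .reduceByKey(_ + _)                     // wide dependency -> stage boundary

    // The action triggers a job; the DAGScheduler cuts it into stages,
    // each stage becomes a TaskSet (one task per partition), and the
    // TaskScheduler dispatches those tasks to Executors.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```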
Spark Architecture
The client submits an application. A SparkContext is created in the user program, and the newly created SparkContext connects to the Cluster Manager according to the parameters set by the user during programming, or to the system's default configuration.
The Cluster Manager finds a Worker to start the Driver. The Driver requests resources from the Cluster Manager (or the resource manager), then converts the application into an RDD graph. The DAGScheduler turns the RDD graph into stages of TaskSets and submits them to the TaskScheduler, which in turn submits the tasks to Executors for execution.
When an Executor receives a task, it downloads the packages and libraries the task depends on at runtime, prepares the task's execution environment, and then runs the task in its thread pool. While running, the task reports its status and results back to the Driver.
The Driver handles the status updates it receives according to each task's running state. Tasks come in two kinds. ShuffleMapTasks re-shuffle the data: every stage except the last is a shuffle stage, and its results are saved in the Executor's local file system. The last stage consists of ResultTasks, which are responsible for producing the result data.
The Driver keeps dispatching tasks, sending them to Executors for execution, and stops when all tasks have completed correctly or when a task still fails after the retry limit is reached.
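As a rough illustration of the shuffle-map / result split described above, the lineage printed by `toDebugString` shows the stage boundary introduced by a wide dependency. This sketch assumes the SparkContext `sc` from the earlier examples; the variable names and data are placeholders.

```scala
// Assuming the SparkContext `sc` from the earlier sketches.
val clicks = sc.parallelize(Seq(("home", 1), ("cart", 1), ("home", 1)))

// reduceByKey introduces a wide (shuffle) dependency, so the job is cut into
// a shuffle map stage, whose output is written to the executors' local file
// systems, and a final result stage that produces the answer for the driver.
val totals = clicks.reduceByKey(_ + _)

println(totals.toDebugString)   // prints the lineage; indentation marks the shuffle boundary
totals.collect()                // the action that actually triggers the job
```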
Note: a Spark program completes its resource allocation at registration time. Starting Executors on the Workers and deciding how many cores to assign are done when the program initializes, not while the job is running. The program registers with the Master, and the Master assigns the resources, which are then managed by the CoarseGrainedSchedulerBackend subclass SparkDeploySchedulerBackend. After that, the Driver's DAG is partitioned by the DAGScheduler, and the TaskScheduler (TaskSchedulerImpl in standalone mode) obtains resources through the SchedulerBackend and assigns concrete tasks to concrete machines (Executors).
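Consistent with the note that resources are allocated when the application registers rather than per task, resource-related settings are fixed on the SparkConf before the SparkContext starts. The sketch below uses the standard SparkConf API; the master URL and numeric values are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Resource-related settings are read once, when the application registers with
// the master and Executors are launched, not while individual tasks are running.
val conf = new SparkConf()
  .setAppName("resource-demo")
  .setMaster("spark://master:7077")        // standalone master URL (placeholder)
  .set("spark.executor.memory", "2g")      // memory per Executor
  .set("spark.executor.cores", "2")        // cores per Executor
  .set("spark.cores.max", "8")             // total cores the app may claim (standalone)

val sc = new SparkContext(conf)
```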