Original link: http://www.raincent.com/content-85-11052-1.html
In the big data field, only by digging deep into data science and staying at the academic forefront can one stay ahead in the underlying algorithms and models, and thus hold a leading position. (Source: Canada Rice Valley Big Data)
Spark's academic genes gave it an edge in the big data field from the outset. Whether in performance or in the unity of its technical stack, its advantages over traditional Hadoop MapReduce are obvious. Spark offers an RDD-based integrated solution that unifies MapReduce, Streaming, SQL, Machine Learning, and Graph Processing on a single platform, exposing them through a consistent API and a common deployment scheme, which makes Spark's range of engineering applications much broader. This article is divided into the following chapters:
I. Definitions of Spark terminology
II. The basic flow of Spark execution
III. Characteristics of the Spark runtime architecture
IV. A perspective on Spark's core principles
I. Definitions of Spark terminology
1. Application: Spark application
Refers to a Spark application written by the user, containing both the driver code and the executor code that runs distributed across multiple nodes of the cluster.
A Spark application consists of one or more jobs, as shown in the following illustration:
2. Driver
The driver in Spark runs the main() function of the application and creates the SparkContext, whose purpose is to prepare the running environment for the Spark application. In Spark, the SparkContext is responsible for communicating with the Cluster Manager, applying for resources, and assigning and monitoring tasks. The driver closes the SparkContext once the executor side has finished. Usually the SparkContext is taken to represent the driver, as shown in the following illustration:
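As a minimal, hedged sketch of what this looks like in user code (the object name and application name are illustrative, not from the original article):

// A minimal driver sketch: main() creates the SparkContext,
// which negotiates resources with the cluster manager.
import org.apache.spark.{SparkConf, SparkContext}

object MyDriverApp {                          // illustrative name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyDriverApp")
    val sc = new SparkContext(conf)           // prepares the running environment
    // ... build RDDs and trigger actions here ...
    sc.stop()                                 // the driver closes the SparkContext when done
  }
}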
3. Cluster Manager: resource manager
Refers to an external service that acquires resources on the cluster. Common options are: Standalone, Spark's native resource manager, in which the Master allocates resources; Hadoop YARN, in which the ResourceManager allocates resources; and Mesos, in which the Mesos Master is responsible for resource management, as shown in the following illustration:
4. Executor
A process launched for an application on a worker node. It runs tasks and is responsible for keeping data in memory or on disk. Each application has its own independent set of executors, as shown in the following illustration:
5. Worker: Compute node
Any node in the cluster that can run application code, similar to the NodeManager node in YARN. In Standalone mode it refers to the worker nodes configured through the slaves file; in Spark on YARN mode it refers to the NodeManager node; and in Spark on Mesos mode it refers to the Mesos slave node. As shown in the following illustration:
6. RDD: Resilient Distributed Dataset
The Resilient Distributed Dataset is Spark's basic computing unit; it can be manipulated through a series of operators (mainly transformations and actions), as shown in the following illustration:
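For illustration only (the data and numbers below are made up), transformations build new RDDs lazily while an action returns a result to the driver:

// Transformations (map, filter) are lazy; the action (count) triggers execution.
val nums = sc.parallelize(1 to 10)        // assumes an existing SparkContext `sc`
val doubled = nums.map(_ * 2)             // transformation
val big = doubled.filter(_ > 10)          // transformation
val howMany = big.count()                 // action: returns 5 to the driver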
7. Narrow dependency
Each partition of the parent RDD is used by at most one partition of the child RDD. This appears either as one parent partition corresponding to one child partition, or as partitions of two parent RDDs corresponding to one child partition. As shown in the figure:
8. Wide dependency
Each partition of the parent RDD may be used by multiple partitions of the child RDD; a child partition typically depends on all partitions of the parent RDD. As shown in the figure:
Common narrow dependencies include: map, filter, union, mapPartitions, mapValues, and join where the parent RDDs are hash-partitioned (if the RDDs on which the join API is called were produced by a wide dependency, i.e. a shuffle, and the two RDDs being joined have the same number of partitions, so that the joined RDD also has that number of partitions, then the join API is a narrow dependency).
Common wide dependencies include: groupByKey, partitionBy, reduceByKey, and join where the parent RDDs are not hash-partitioned (in all other cases the join API is a wide dependency).
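A small hedged example (the data is invented) showing a narrow dependency (map) followed by a wide dependency (reduceByKey); toDebugString prints the lineage, and the shuffle it reports marks the stage boundary:

val words = sc.parallelize(Seq("a", "b", "a", "c"))   // assumes an existing `sc`
val pairs = words.map(w => (w, 1))                    // narrow dependency: map
val counts = pairs.reduceByKey(_ + _)                 // wide dependency: shuffle
println(counts.toDebugString)                         // lineage shows where the shuffle splits the stages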
9. DAG: directed acyclic graph
Directed Acyclic Graph, which reflects the dependencies among RDDs, as shown in the figure:
10. DAGScheduler: DAG scheduler
Divides the DAG into stages and submits each stage to the TaskScheduler in the form of a TaskSet; it is responsible for splitting a job into batches of stages with different dependencies. One of its most important duties is to compute the dependencies between jobs and tasks and to define the scheduling logic. It is instantiated during SparkContext initialization; one SparkContext corresponds to one DAGScheduler.
11. TaskScheduler: Task Scheduler
Submits the TaskSet to the workers (the cluster) to run and returns the results; it is responsible for the actual physical scheduling of each individual task. As shown in the figure:
12. Job
A parallel computation consisting of one or more scheduling stages (each made up of multiple tasks), usually triggered by a Spark action. A job contains multiple RDDs and the operations acting on those RDDs. As shown in the figure:
13. Stage: scheduling stage
The task set corresponding to one scheduling stage of a job. Each job is split into groups of tasks; each group is called a stage, or equivalently a TaskSet, so a job is divided into several stages. Stages come in two types: ShuffleMapStage and ResultStage. As shown in the figure:
14. TaskSet: Task Set
A set of related tasks with no shuffle dependencies among them. As shown in the figure:
Tips:
1) One stage creates one TaskSet;
2) One task is created for each RDD partition in the stage, and these tasks are packaged together as a TaskSet (see the small example after this list).
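As a concrete, hedged illustration of tip 2, the number of tasks in a stage equals the number of partitions of the RDD it processes:

// An RDD created with 4 partitions means each stage over it runs 4 tasks.
val data = sc.parallelize(1 to 100, numSlices = 4)    // assumes an existing `sc`
println(data.getNumPartitions)                        // 4 -> 4 tasks per stage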
15. Task
A unit of work sent to an executor; it is the smallest processing unit and operates on a single partition of the dataset. As shown in the figure:
The overall relationship is shown in the figure:
II. The basic flow of Spark execution
III. Characteristics of the Spark runtime architecture
1. Each application has dedicated executor processes
Each application gets its own dedicated executor processes, which stay alive for the lifetime of the application and run tasks in a multi-threaded fashion. Spark applications cannot share data across applications unless the data is written to an external storage system. As shown in the figure:
2. Support for multiple resource managers
Spark is agnostic to the resource manager: as long as it can acquire executor processes and these processes can keep communicating with one another, Spark can run. The supported resource managers include Standalone, Mesos, YARN, and EC2. As shown in the figure:
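As a hedged sketch, the same application can target different resource managers simply by changing the master URL (the host names and ports below are placeholders):

import org.apache.spark.SparkConf

// Only the master URL changes; the application code stays the same.
val conf = new SparkConf().setAppName("demo")
conf.setMaster("spark://master-host:7077")   // Standalone (placeholder host)
// conf.setMaster("yarn")                    // YARN
// conf.setMaster("mesos://mesos-host:5050") // Mesos (placeholder host)
// conf.setMaster("local[*]")                // local mode, useful for testing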
3. Submit jobs close to the cluster
The client that submits the SparkContext should be close to the worker nodes (the nodes running the executors), preferably in the same rack, because a great deal of information is exchanged between the SparkContext and the executors while a Spark application is running. If the job must run against a remote cluster, it is better to use RPC to submit it to the cluster than to run the SparkContext far away from the workers. As shown in the figure:
4. Move the computation to the data, not the data to the computation
Tasks use data-locality optimization and speculative execution. Key methods: taskIdToLocations, getPreferredLocations. As shown in the figure:
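These mechanisms are internal to the scheduler (taskIdToLocations and getPreferredLocations are not user-facing), but they can be tuned from user code; a hedged configuration sketch:

import org.apache.spark.SparkConf

// How long the scheduler waits for a data-local slot before falling back,
// and whether slow tasks are re-launched speculatively.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")        // 3s is the documented default, shown for illustration
  .set("spark.speculation", "true")        // enable speculative execution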
IV. A perspective on Spark's core principles
1. Calculation flow
2. Building the DAG from code
Spark Program
val lines1 = sc.textFile(inputPath1).map(...).map(...)
val lines2 = sc.textFile(inputPath2).map(...)
val lines3 = sc.textFile(inputPath3)
val dtinone1 = lines2.union(lines3)
val dtinone = lines1.join(dtinone1)
dtinone.saveAsTextFile(...)
dtinone.filter(...).foreach(...)
Spark computation happens at an RDD action; for all transformations before the action, Spark only records the trajectory by which the RDDs are generated, without triggering any real computation.
When the computation is actually triggered, the Spark kernel draws a directed acyclic graph of the computation paths, that is, the DAG.
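A hedged way to see that recorded lineage before anything runs is toDebugString, applied here to the dtinone RDD from the snippet above:

// Nothing has executed yet at this point; only the lineage has been recorded.
println(dtinone.toDebugString)   // prints the chain of RDD dependencies, i.e. the DAG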
3. The core algorithm for dividing the DAG into stages
One application, many jobs, many stages: a Spark application can trigger many jobs because of its different actions; one application may contain many jobs, and each job is composed of one or more stages. Later stages depend on earlier ones, and a stage only runs after the stages it depends on have finished computing.
Basis for division: stages are divided at wide dependencies. Operators such as reduceByKey and groupByKey produce wide dependencies and therefore become the split points.
Core algorithm: traverse from back to front, adding RDDs to the current stage at narrow dependencies and cutting a new stage at wide dependencies. The Spark kernel starts from the RDD that triggered the action, first creating a stage for that final RDD, and then keeps walking backwards; whenever it finds that an RDD has a wide dependency on some parent RDD, it creates a new stage for that parent RDD, which becomes the last RDD of the new stage. It continues in this way, stage by stage along the narrow and wide dependencies, until all RDDs have been traversed.
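A heavily simplified, hedged sketch of this backward traversal (toy types, not Spark's actual DAGScheduler code): walking back from the final RDD, narrow dependencies are folded into the current stage, while a wide dependency closes the stage and the traversal restarts from the shuffle's parent.

// Toy model of stage splitting: illustrates the idea only.
sealed trait Dep { def parent: ToyRdd }
case class NarrowDep(parent: ToyRdd) extends Dep
case class WideDep(parent: ToyRdd) extends Dep          // i.e. a shuffle
case class ToyRdd(name: String, deps: Seq[Dep])
case class Stage(rdds: Set[String], parents: Seq[Stage])

def buildStage(last: ToyRdd): Stage = {
  var members = Set.empty[String]
  var parents = Seq.empty[Stage]
  def visit(rdd: ToyRdd): Unit = {
    members += rdd.name
    rdd.deps.foreach {
      case NarrowDep(p) => visit(p)                      // narrow: same stage, keep walking back
      case WideDep(p)   => parents :+= buildStage(p)     // wide: cut here, the parent gets its own stage
    }
  }
  visit(last)
  Stage(members, parents)
}

// Example lineage: a --map(narrow)--> b --shuffle(wide)--> c
val a = ToyRdd("a", Nil)
val b = ToyRdd("b", Seq(NarrowDep(a)))
val c = ToyRdd("c", Seq(WideDep(b)))
println(buildStage(c))   // two stages, split at the wide dependency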
4. An analysis of dividing the DAG into stages
Data read from HDFS produces three different RDDs, and after a series of transformation operations the result is saved back to HDFS. In this DAG only the join operation is a wide dependency, so the Spark kernel uses it as the boundary when dividing stages from back to front. Note also that in Stage2 in the figure, the operations from map through union are all narrow dependencies, so these steps can be pipelined: a partition produced by the map operation does not have to wait for the whole RDD to be computed before continuing with the union operation, which greatly improves computational efficiency.
5. Related code
6. Submitting stages
The submission of a scheduling stage is ultimately converted into the submission of a task set. The DAGScheduler submits the task set through the TaskScheduler interface, which in turn causes the TaskScheduler to build a TaskSetManager instance to manage the lifecycle of that task set; at that point, as far as the DAGScheduler is concerned, the work of submitting the scheduling stage is done. The concrete TaskScheduler implementation, once it has computing resources, then schedules individual tasks to the corresponding executor nodes through the TaskSetManager.
7. Related Code
The TaskSetManager is responsible for managing a single TaskSet inside TaskSchedulerImpl, tracking each task and, if a task fails, retrying it until the maximum number of task retries is reached.
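A hedged, toy sketch of the retry policy just described (not Spark's actual TaskSetManager code): a failed task is resubmitted until it succeeds or the maximum number of attempts is exhausted.

// Toy retry loop illustrating "retry until the maximum number of failures is reached".
def runWithRetries(runTask: () => Boolean, maxTaskFailures: Int = 4): Boolean = {
  var attempts = 0
  var succeeded = false
  while (!succeeded && attempts < maxTaskFailures) {
    attempts += 1
    succeeded = runTask()      // in real Spark the launch is asynchronous, not a blocking loop
  }
  succeeded                    // false means the whole TaskSet (and hence the job) fails
}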
8. Monitoring jobs, tasks, and executors
The DAGScheduler monitors jobs and tasks: to ensure that interdependent scheduling stages can be scheduled and executed successfully, the DAGScheduler needs to monitor the scheduling stages of the current job and even individual tasks. It does this by exposing a series of callback functions to the TaskScheduler, covering the start and end of tasks and the failure of task sets; from this lifecycle information the DAGScheduler maintains the state of the job and its scheduling stages.
The DAGScheduler monitors the lifecycle of executors: the TaskScheduler notifies the DAGScheduler of executor lifecycle events through callback functions. If an executor crashes, the output of the ShuffleMapTasks in the corresponding scheduling stage's task set is marked as unavailable; this changes the state of the task set and causes the related computation tasks to be re-executed in order to regenerate the missing data.
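The DAGScheduler's internal callbacks are not a user-facing API, but the public SparkListener interface exposes similar lifecycle events; a hedged sketch of observing task completion and executor loss from application code:

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorRemoved, SparkListenerTaskEnd}

// Logs task and executor lifecycle events as they are reported to the driver.
class LifecycleLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"task finished in stage ${taskEnd.stageId}")
  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
    println(s"executor lost: ${removed.executorId}, reason: ${removed.reason}")
}

sc.addSparkListener(new LifecycleLogger)   // assumes an existing SparkContext `sc`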
9. Getting task execution results
Results go back to the DAGScheduler: when a task finishes executing in an executor, its result needs to be returned to the DAGScheduler in some form, and the form depends on the type of task.
Two kinds of results, intermediate and final: for tasks belonging to the FinalStage, what is returned to the DAGScheduler is the result of the computation itself; for the ShuffleMapTasks of intermediate scheduling stages, what is returned is a MapStatus containing the relevant storage information rather than the result itself, and that information is used by the tasks of the next scheduling stage as the basis for obtaining their input data.
Two types, DirectTaskResult and IndirectTaskResult: depending on the size of the result, what a ResultTask returns falls into two categories. If the result is small enough, it is carried directly in a DirectTaskResult object. If it exceeds a certain threshold, the DirectTaskResult is first serialized on the executor side and the serialized bytes are stored in the BlockManager as a data block; the BlockId returned by the BlockManager is then sent back to the TaskScheduler inside an IndirectTaskResult object. The TaskScheduler calls the TaskResultGetter to extract the BlockId from the IndirectTaskResult and finally fetches the corresponding DirectTaskResult through the BlockManager.
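A hedged, simplified sketch of that decision (toy types and a caller-supplied threshold; this is not Spark's actual serialization code):

// Toy illustration of the direct vs. indirect result path.
sealed trait ToyTaskResult
case class DirectResult(bytes: Array[Byte]) extends ToyTaskResult             // small: sent straight back
case class IndirectResult(blockId: String, size: Long) extends ToyTaskResult  // large: stored in the BlockManager

def packageResult(serialized: Array[Byte], maxDirectBytes: Long): ToyTaskResult =
  if (serialized.length <= maxDirectBytes)
    DirectResult(serialized)                            // returned to the driver as-is
  else {
    val blockId = s"taskresult_${serialized.hashCode}"  // placeholder id, not Spark's real block id scheme
    // real Spark would store the serialized bytes in the BlockManager here
    IndirectResult(blockId, serialized.length)
  }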
10. The overall picture of task scheduling