Analysis of the Spark Architecture (I): Overview of the Framework

Tags: shuffle, hadoop, mapreduce

1: Spark running modes

2: Explanation of some Spark terms

3: The basic running process of Spark

4: Basic flow of RDD operation

One: Spark running modes

Spark's running modes are varied and flexible. Deployed on a single machine, it can run in local mode or in pseudo-distributed mode; when deployed on a distributed cluster, there are several running modes to choose from, depending on the actual situation of the cluster. The underlying resource scheduling can rely on an external resource-scheduling framework or on Spark's built-in Standalone mode. Among the external resource-scheduling frameworks, the current implementations include the relatively stable Mesos mode and the Hadoop YARN mode, which is still under active development.

In practice, the running mode of a Spark application depends on the value of the MASTER environment variable passed to the SparkContext, and some modes also rely on auxiliary programs. The currently supported MASTER values consist of specific strings or URLs, as follows:

local[N]: local mode, using N threads

local-cluster[worker, core, memory]: pseudo-distributed mode; you can configure the number of virtual worker nodes to start, as well as the number of CPU cores and the amount of memory each worker node manages

spark://hostname:port: Standalone mode; Spark must be deployed on the relevant nodes, and the URL is the address and port of the Spark master host

mesos://hostname:port: Mesos mode; Spark and Mesos must be deployed on the relevant nodes, and the URL is the address and port of the Mesos host

yarn-standalone / yarn-cluster: YARN mode one; both the main program logic and the tasks run in the YARN cluster

yarn-client: YARN mode two; the main program logic runs locally, while the specific tasks run in the YARN cluster
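
As a hedged illustration of how one of these master values can be passed to the SparkContext programmatically (the master URL and application name below are placeholders, not values from the original text), a minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    // Pick the master that matches your deployment: local[N], spark://host:port,
    // mesos://host:port, or one of the YARN modes (usually set via spark-submit).
    val conf = new SparkConf()
      .setAppName("MasterUrlExample") // hypothetical application name
      .setMaster("local[2]")          // local mode with 2 threads, for illustration only
    val sc = new SparkContext(conf)

    // ... application logic would go here ...

    sc.stop()
  }
}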

Spark on YARN mode diagram (see the referenced article for a detailed explanation):


Two: Explanation of some Spark terms

Application: the concept of an Application in Spark is similar to that in Hadoop MapReduce. It refers to a Spark application written by the user, consisting of driver code and executor code that runs on multiple nodes in the cluster.

Driver Program: the driver in Spark runs the main() function of the Application and creates the SparkContext. The purpose of creating the SparkContext is to prepare the running environment for the Spark application. In Spark, the SparkContext is responsible for communicating with the Cluster Manager, applying for resources, and assigning and monitoring tasks. When the executor side has finished running, the driver is responsible for closing the SparkContext. The SparkContext is usually used to represent the driver.

The general form looks like this:

package thinkgamer

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length == 0) {
      System.err.println("Usage: WordCount <file1>")
      System.exit(1)
    }

    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // write your Spark code here

    sc.stop()
  }
}

Executor: a process launched for an Application on a worker node. It runs tasks and is responsible for keeping data in memory or on disk; each Application has its own independent set of Executors. In Spark on YARN mode, the process is named CoarseGrainedExecutorBackend, similar to YarnChild in Hadoop MapReduce. A CoarseGrainedExecutorBackend process has one and only one Executor object, which is responsible for wrapping each task into a TaskRunner and taking an idle thread from the thread pool to run it. The number of tasks each CoarseGrainedExecutorBackend can run in parallel depends on the number of CPU cores allocated to it.
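
As a minimal sketch of how executor resources can be requested (the memory and core values below are illustrative assumptions, not figures from the original text), the standard configuration properties spark.executor.memory and spark.executor.cores bound what each Executor gets and, therefore, how many tasks it can run in parallel:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the application name and resource values are placeholders.
val conf = new SparkConf()
  .setAppName("ExecutorConfigExample")
  .set("spark.executor.memory", "2g")  // memory per executor process
  .set("spark.executor.cores", "2")    // CPU cores per executor, i.e. parallel task slots
val sc = new SparkContext(conf)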

Cluster Manager: the external service that acquires resources on the cluster. Currently there are:

Standalone: Spark's native resource management, in which the Master is responsible for allocating resources;

Hadoop YARN: the ResourceManager in YARN is responsible for allocating resources;

Worker: any node in the cluster that can run Application code, similar to a NodeManager node in YARN. In Standalone mode this refers to the worker nodes configured through the slaves file; in Spark on YARN mode it refers to the NodeManager nodes.

Job: a parallel computation made up of multiple tasks, usually spawned by a Spark action; a job contains multiple RDDs and the various operations acting on those RDDs.

Stage: each job is split into multiple groups of tasks; each group is called a stage, also known as a TaskSet. A job is divided into several stages.

Task: a unit of work that is sent to an Executor.

Three: The basic running process of Spark

1: Spark's basic running flow, as shown in the following diagram:

(1): Build the running environment of the Spark application and start the SparkContext.

(2): The SparkContext requests Executor resources from the resource manager (which can be Standalone, Mesos, or YARN) and starts StandaloneExecutorBackend; the Executors then apply to the SparkContext for tasks.

(3): The SparkContext distributes the application code to the Executors.

(4): The SparkContext builds a DAG graph, decomposes the DAG into stages, and sends the TaskSets to the Task Scheduler, which in turn sends the tasks to the Executors to run.

(5): The tasks run on the Executors, and all resources are released when the run finishes.

2: Features of the Spark running architecture

(1): Each Application gets its own dedicated Executor processes, which stay alive for the whole lifetime of the Application and run tasks in a multi-threaded manner. This application-isolation mechanism is advantageous both from the scheduling point of view (each driver schedules its own tasks) and from the execution point of view (tasks from different Applications run in different JVMs). Of course, it also means that a Spark Application cannot share data with other Applications unless the data is written to an external storage system.

(2): Spark is agnostic to the resource manager; it only needs to be able to acquire Executor processes and keep communicating with them.

(3): The client that submits the SparkContext should be close to the worker nodes (the nodes running the Executors), preferably in the same rack, because a large amount of information is exchanged between the SparkContext and the Executors while a Spark application runs. If it must run against a remote cluster, it is better to use RPC to submit the SparkContext to the cluster than to run the SparkContext far away from the workers.

(4): Tasks use the optimization mechanisms of data locality and speculative execution.

3: DAGScheduler

The DAGScheduler transforms a Spark job into a stage DAG (Directed Acyclic Graph) and finds the least expensive scheduling method based on the relationships between the RDDs and between the stages. It then submits the stages to the TaskScheduler in the form of TaskSets. The following illustration shows the role of the DAGScheduler:
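
One convenient way to see the RDD lineage that the DAGScheduler works from is RDD.toDebugString, which prints the dependency chain behind a computation. A hedged sketch (the file path and transformations below are illustrative, and assume an existing SparkContext named sc, such as the one provided by spark-shell):

// Sketch: inspect the lineage (dependency chain) of an RDD.
// "hdfs://.../input.txt" is a placeholder path.
val lines  = sc.textFile("hdfs://.../input.txt")
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Prints the RDD lineage; the indentation shifts mark shuffle boundaries,
// which correspond to stage boundaries in the DAG.
println(counts.toDebugString)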

4: TaskScheduler

The DAGScheduler determines the preferred locations for tasks and passes them on to the lower-level TaskScheduler. In addition, the DAGScheduler handles failures caused by the loss of shuffle output, which may require the preceding stage to be resubmitted (task failures that are not caused by shuffle data loss are handled by the TaskScheduler).

The TaskScheduler maintains all TaskSets. When an Executor sends a heartbeat to the driver, the TaskScheduler assigns tasks according to the remaining resources. The TaskScheduler also tracks the running state of all tasks and retries failed tasks. The following figure shows the role of the TaskScheduler:
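
As a small, hedged example of the retry behaviour mentioned above (the values are illustrative), the number of times a task may fail before the whole job is given up on can be tuned with the standard spark.task.maxFailures property:

import org.apache.spark.SparkConf

// Sketch: raise the per-task failure limit from the default of 4 to 8.
val conf = new SparkConf()
  .setAppName("TaskRetryExample")      // hypothetical application name
  .set("spark.task.maxFailures", "8")  // failures tolerated per task before the job fails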

The concrete Task Scheduler differs across running modes:

(1): In Spark on Standalone mode it is TaskScheduler;

(2): In yarn-client mode it is YarnClientClusterScheduler;

(3): In yarn-cluster mode it is YarnClusterScheduler.

Four: Basic flow of RDD operation

So how does an RDD actually run in Spark? It can be roughly divided into the following three steps:

1: Create the RDD objects.

2: The DAGScheduler module steps in and computes the dependencies between the RDDs; these dependencies form a DAG.

3: Each job is divided into multiple stages. One of the main criteria for dividing stages is whether the input of the current computation step is already determined; if it is, the step is put into the same stage, which avoids the message-passing overhead between multiple stages.


Let's look at the following example, which counts the number of distinct names under each initial letter from A to Z, to see how an RDD runs.
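
A hedged sketch of what that pipeline might look like in code (it assumes an existing SparkContext named sc; the hdfs://names path comes from the text below, while the exact transformations are an assumption based on the description of four transformations followed by a collect):

// Sketch: count the number of distinct names per initial letter.
val names   = sc.textFile("hdfs://names")                // reads the name files (HadoopRDD underneath)
val keyed   = names.map(name => (name.charAt(0), name))  // key each name by its initial letter
val grouped = keyed.groupByKey()                         // wide dependency: forces a shuffle, hence a new stage
val counts  = grouped.mapValues(ns => ns.toSet.size)     // number of distinct names per letter
val result  = counts.collect()                           // the action that actually triggers the job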

Step 1: Create the RDDs. In the example above, apart from the final collect, which is an action and does not create an RDD, the first four transformations each create a new RDD. So the first step is to create all the RDDs (five RDDs internally).

Step 2: Create the execution plan. Spark pipelines operations as much as possible and divides the plan into stages based on whether the data needs to be reshuffled; for example, the groupBy() transformation splits the whole execution plan into two stages. Eventually a DAG (Directed Acyclic Graph) is produced as the logical execution plan.


Step 3: Schedule the tasks. Each stage is divided into different tasks, and each task is a combination of data and computation. All tasks of the current stage must finish before the next stage starts, because the first transformation of the next stage has to reshuffle the data, so all the result data of the current stage must be computed before it can continue.

Assuming there are four blocks under hdfs://names in this example, the HadoopRDD's partitions will consist of four partitions corresponding to the four blocks of data, and preferredLocations will indicate the best locations for the four blocks. Four tasks can then be created and dispatched to the appropriate cluster nodes.
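
As a hedged illustration (names refers to the RDD created from hdfs://names in the sketch above), the partitioning and the preferred locations can be inspected through the public RDD API:

// Sketch: inspect how the input was partitioned and where Spark
// would prefer to run the corresponding tasks.
println(s"number of partitions: ${names.partitions.length}")  // expected to be 4 if there are 4 blocks

names.partitions.foreach { p =>
  // preferredLocations returns the hosts that hold the data for this partition
  println(s"partition ${p.index} -> ${names.preferredLocations(p).mkString(", ")}")
}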


In the next article we will discuss the YARN framework and Spark's running modes: http://blog.csdn.net/gamer_gyt/article/details/51833681
