Only by knowing what the kernel architecture is based on can you understand why programs are written this way.
Decrypting the Spark kernel architecture by drawing it by hand
Validating the Spark kernel architecture with a case study
Spark architecture considerations
========== Several concepts of the Spark runtime ==========
When you download Spark and run it, it is basically in standalone mode. Once you master standalone mode, YARN and Mesos follow easily; unless otherwise specified, everything below assumes standalone mode.
Application = Driver + Executor. The Executor is what processes the data partitions; inside it, a thread pool processes the partitions concurrently.
Driver (the part of the code around SparkConf + SparkContext): the part of the program that contains the main method and creates environment objects such as the SparkContext. It is the core of the scheduling of the whole program's run, but it does not do resource scheduling; there is high-level scheduling and low-level scheduling. When the program runs, the Driver registers the current program with the Master, and the Master allocates the resources. There will then be a series of jobs whose stages are handed to the Task Scheduler and sent to the Executors; the results are sent back to the Driver, and finally the SparkContext is closed.
Executor: runs in a process opened for the current application on the node where a Worker resides; it executes tasks concurrently and reuses threads through a thread pool. The code is distributed across the cluster, but to run it, the code has to be sent over to the Executor. A Worker typically opens only one Executor, but several can also be configured. How many Executors to use will be explained when we cover tuning.
The following code runs in the Driver, which holds the DAG:
val conf = new SparkConf() // Create a SparkConf object
conf.setAppName("My first Spark app!") // Set the application name; you can see this name in the monitoring UI while the program runs
conf.setMaster("local") // For now the program runs locally, no Spark cluster installation needed
val sc = new SparkContext(conf) // Create a SparkContext object, passing in the SparkConf instance to customize the specific parameters and configuration of this Spark run
Then comes the RDD, which is the code the Executors execute; that is the line below:
val lines = sc.textFile("f:/installation file/operating system/spark-1.6.0-bin-hadoop2.6/README.md", 1)
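To see where the Driver code ends and the Executor work begins, here is a minimal continuation sketch; the word-counting logic, the action and the final sc.stop() are assumptions added for illustration, not part of the original snippet:

val words = lines.flatMap(_.split(" ")) // transformation: lazy, only extends the DAG
val total = words.count() // action: triggers a job; the tasks run in the Executors and the result returns to the Driver
println(s"Total words: $total")
sc.stop() // close the SparkContext when the program finishes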
========== Several concepts of the Spark cluster ==========
Cluster Manager: the external service for acquiring resources in the cluster. While running, the Spark application does not depend on the Cluster Manager: the application registers with the Cluster Manager (for example, registration with the Master succeeds), and at that point the Cluster Manager has already allocated the resources. Running the program afterwards does not need the Cluster Manager at all. This is a coarse-grained way of running.
The Master's machine should have a very good configuration.
Worker: the node that can actually run the operational code. The Worker itself does not run the program's code; it manages the usage of the current node's resources such as memory and CPU, and it accepts the Master's instructions to allocate computing resources (Executors, allocated in new processes), starting the new process through ExecutorRunner; inside that process is the Executor. The Worker is the foreman, the Cluster Manager is the project manager, and there are many hands doing the actual work under a Worker.
Will the Worker report the current machine's memory and CPU information to the Master when it sends a heartbeat?
No! The Worker's heartbeat to the Master contains only the WorkerID.
If not, how does the Master know about its resources?
It knows them when the application registers successfully and the Master allocates the resources.
Job: a parallel computation that includes a series of tasks. It generally needs to be triggered by an action. An action does not produce an RDD; what comes before the action are RDDs, produced by transformations, which are lazy.
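A minimal sketch of that laziness, assuming a SparkContext named sc already exists (the data and variable names are illustrative only):

val nums = sc.parallelize(1 to 100, 3) // build an RDD with 3 partitions; nothing runs yet
val doubled = nums.map(_ * 2) // transformation: lazy, only extends the lineage (DAG)
val evens = doubled.filter(_ % 4 == 0) // still lazy
val result = evens.count() // action: only this line actually triggers a job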
Spark is not fast merely because it is memory-based. The most fundamental reason is its scheduling, and then its fault tolerance; that is the essence. There is a lot of other content besides.
Narrow dependency: besides one-to-one dependencies there are range-level ones; the number of partitions is fixed and does not vary with the scale of the data. For example, if there were originally 3, then even if the data scale grows to 100, there are still 3 MapPartitionsRDD partitions.
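A small hedged illustration of the fixed partition count under a narrow dependency, again assuming an existing SparkContext sc:

val src = sc.parallelize(1 to 1000000, 3) // 3 partitions, regardless of how much data there is
val mapped = src.map(_ + 1) // narrow, one-to-one dependency: produces a MapPartitionsRDD
println(mapped.partitions.length) // still prints 3; the partition count did not change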
The internal computing logic of the tasks in a stage is exactly the same; only the data being computed differs. This is distributed parallel computing, the essential point of big data.
Is a partition always a fixed 128 MB? No, because, for example, the last record of a partition may span two blocks.
An application can have more than one job; usually one action triggers one job. In other situations, such as checkpoint, a job can also be triggered.
========== Decrypting the Spark kernel architecture with a drawing ==========
~~~ 1. Driver ~~~
The machine dedicated to submitting the Spark program (the Driver, whose core is the SparkContext) must be in the same network environment as the Spark cluster, and its configuration should be consistent with that of an ordinary Worker; because there is constant interaction, they need to be on the same network. That machine may also run other enterprise-level Java EE programs. It holds the application and its various external dependencies (such as .so files and so on). On this basis you use spark-submit to run the program, and you can configure various run-time parameters such as memory and cores. In a real production environment you do not run spark-submit by hand; generally you write a script and use crontab to automate the configuration and submission of the program. The machine that submits the Spark program must have Spark installed, except that the Spark installed here does not belong to the cluster!!! The Driver is normally a single point and does not get HA, while the Master does; however, in --supervise mode the Driver will be restarted automatically if it fails.
(There are two modes of running a Spark program, client mode and cluster mode. The default is client mode, because you can see a lot of log information directly; in actual use, however, cluster mode is chosen.)
The programs written earlier are submitted to the Master; because there are only 3 machines here, the Driver and the Workers are generally kept on separate machines.
~~~ 2. SparkContext ~~~
The three most important things it does: create the DAGScheduler, the TaskScheduler, and the SchedulerBackend.
During instantiation it registers the current program with the Master. The Master accepts the registration and, if there is no problem, assigns an AppID to the current program and allocates computing resources.
~~~ 3. Spark cluster ~~~
The Master accepts the job submitted by the current user and assigns resources on the Workers to the current program; by default, each Worker node allocates one Executor for the current program, and the tasks are executed concurrently through the thread pool inside each Executor.
In the teacher's slaves file, Worker1, 2, 3 and 4 are configured; that is what this refers to.
My own configuration is just the Master plus Worker1 and 2, because my computer's configuration is low.
Where the amount of resources the Master allocates comes from: 1. parameters set in spark-env.sh and spark-defaults.sh; 2. parameters passed to spark-submit; 3. parameters provided through SparkConf in the program.
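As a hedged sketch of the third source only: resource parameters can be supplied through SparkConf in the program (the property values below are assumptions for illustration):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("My first Spark app!")
  .set("spark.executor.memory", "2g") // memory per Executor (assumed value)
  .set("spark.cores.max", "4") // total cores the application may use on a standalone cluster (assumed value)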
The Worker starts the ExecutorBackend remotely through ExecutorRunner, a proxy object instance.
Inside the ExecutorBackend there is an Executor, which holds a thread pool (ThreadPool).
When work is actually done, each task is encapsulated by a TaskRunner.
TaskRunner implements the Runnable interface; a thread is taken from the ThreadPool to execute the task, and after execution the thread is returned to the pool for reuse.
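This is not Spark's actual source code, only a minimal analogue of the pattern just described: a Runnable task handed to a fixed thread pool, with the threads reused across tasks (all names below are made up for illustration):

import java.util.concurrent.Executors

object ThreadPoolSketch {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(4) // stands in for the Executor's thread pool
    val task: Runnable = new Runnable { // stands in for a TaskRunner wrapping one task
      override def run(): Unit = println(s"running on ${Thread.currentThread().getName}")
    }
    (1 to 8).foreach(_ => pool.execute(task)) // the 4 threads are reused across the 8 submissions
    pool.shutdown() // release the pool when all the work is done
  }
}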
At this point the resources have been allocated; then a job is triggered by an action, and the DAGScheduler comes into play.
~~~ 4. Job ~~~
In general, when a job is triggered through an action, the SparkContext uses the DAGScheduler to divide the DAG in the job into different stages. Inside each stage is a series of tasks whose internal logic is exactly the same but which work on different data; these tasks make up a TaskSet.
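A hedged sketch of that stage splitting, assuming an existing SparkContext sc: the reduceByKey below introduces a shuffle, so the DAGScheduler splits the single job into two stages.

val pairs = sc.parallelize(Seq("a", "b", "a", "c"), 2).map(word => (word, 1)) // logic of the first stage
val counts = pairs.reduceByKey(_ + _) // wide dependency: the shuffle marks a stage boundary
counts.collect() // one action, therefore one job, split by the DAGScheduler into two stages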
~~~ 5. TaskScheduler ~~~
The TaskScheduler and the SchedulerBackend are responsible for the concrete running of the tasks (complying with data locality).
~~~ 6. Task types ~~~
The tasks of the last stage are called ResultTasks and produce the job's result.
The tasks of all the previous stages are ShuffleMapTasks, which prepare data for the next stage; they are equivalent to the mappers in MapReduce, only a more refined and more efficient implementation.
The essence of the whole Spark run: the DAGScheduler divides the job into different stages and submits TaskSets to the TaskScheduler, which in turn submits them to the Executors (complying with data locality). Each task computes one partition of an RDD, executing on that partition the series of functions we defined within the same stage, and so on, until the entire program finishes running!
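Tying the pieces together, a minimal word-count sketch (the file path and variable names are assumptions): the tasks of the first stage run as ShuffleMapTasks, and the ResultTasks of the last stage send the result back to the Driver.

val text = sc.textFile("README.md") // illustrative path; one task per partition of the input
val counts = text.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) // first stage: ShuffleMapTasks
val top = counts.take(10) // last stage: ResultTasks return the result to the Driver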
[Figure: Spark kernel architecture diagram (http://s4.51cto.com/wyfs02/M01/7A/C3/wKioL1a0GIWBqQMtAAlClnFVfUw973.jpg)]
Teacher Liaoliang's card:
China's number one person in Spark
Sina Weibo: http://weibo.com/ilovepains
Public Number: Dt_spark
Blog: http://blog.sina.com.cn/ilovepains
Mobile: 18610086859
qq:1740415547
Email: [Email protected]
This article is from the "A Flower Proud of the Cold" blog; reprinting is declined!
Spark Kernel Architecture Decryption (DT Big Data Dream Factory)