A Perspective on Jobs from the Spark Architecture (DT Big Data DreamWorks)


Content:

1. Observe the Spark architecture through a case study

2. Manually draw the internal Spark architecture

3. Logical view resolution of a Spark job

4. Physical view resolution of a Spark job

A job is triggered either by an action or by a checkpoint.

========== The Spark Architecture Through a Case Study ============

Running jps shows the Master process. Its role is to manage the cluster's computing resources, mainly memory and CPU (disk and network may also be considered), and to accept job requests submitted by clients and allocate resources to them. Note: jobs are coarse-grained; resources are allocated when the job is submitted, and the job generally runs with those allocated resources unless something abnormal happens while it is running.

The Worker process is mainly responsible for the memory and CPU resource usage of the current node.

Spark is a master-slave distributed system.

Start spark-shell:

./spark-shell --master spark://master:7077,worker1:7077,worker2:7077

Refresh the Web console again


Why does it show 24 cores, and why is each node's memory 1024 MB?

There's a configuration inside spark-env.sh.

cat /usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_71
export SCALA_HOME=/usr/local/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.0/etc/hadoop
#export SPARK_MASTER_IP=master
export SPARK_WORKER_MEMORY=2G
export SPARK_EXCUTOR_MEMORY=2G
export SPARK_DRIVER_MEMORY=2G
export SPARK_WORKER_CORES=8
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=master:2181,worker1:2181,worker2:2181 -Dspark.deploy.zookeeper.dir=/spark"

Did starting spark-shell allocate resources?

Of course; in standalone mode resources are allocated in a coarse-grained way.

By default spark-shell has no jobs and no stages yet, but resources have already been allocated.

But the executors are already there:

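You can also confirm this from the shell itself. A minimal sketch (the exact values depend on the spark-env.sh settings above):

// Inside spark-shell: the executor memory setting this application registered with
sc.getConf.getOption("spark.executor.memory")
// Memory status of each registered block manager (the executors, plus the driver)
sc.getExecutorMemoryStatus.foreach(println)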

This also corresponds to the Spark cluster architecture diagram.


By default there is one executor per worker, but there can also be multiple executors; if a single executor does not keep CPU utilization high, you can configure several (see the configuration sketch below).

No matter how many jobs a Spark application has, the resources it uses are the ones allocated at registration time. Under the default resource allocation, each worker starts one ExecutorBackend for the current program, and by default that executor uses as many cores and as much memory as it can. If nothing limits this, then as soon as Spark is running the resources are filled up, which is why a resource manager such as YARN or Mesos is needed.
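If an application should not grab everything, the limits can be set in the application's own configuration. A minimal sketch using standard Spark 1.x properties (the application name and values here are only examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ResourceCappedApp")       // hypothetical application name
  .setMaster("spark://master:7077")
  .set("spark.cores.max", "6")           // cap on the total cores this application may take
  .set("spark.executor.memory", "1g")    // memory per executor instead of the worker's maximum
  .set("spark.executor.cores", "2")      // cores per executor; an 8-core worker can then host several executors
val sc = new SparkContext(conf)

With spark.executor.cores set below the worker's core count, a single worker can host more than one executor for the same application, which is the multi-executor case mentioned above.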


Tasks run on the executors.


How many tasks can run concurrently at any one time depends on the number of cores available to the current executor.

sc.textFile("/historyserverforspark/readme.md", 3).flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _, 1).cache

The degree of parallelism is also inherited

sc.textFile("/historyserverforspark/readme.md", 3).flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).sortByKey().cache


The parallelism of 3 has been inherited.
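You can verify the inheritance by checking the partition counts (a small sketch for spark-shell, using the same file as above; this assumes spark.default.parallelism has not been set explicitly):

val lines = sc.textFile("/historyserverforspark/readme.md", 3)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.partitions.length       // 3: inherited from the parent RDD
val single = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _, 1)
single.partitions.length       // 1: the explicit numPartitions argument overrides the inherited value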

========== Manually Draw the Spark Internal Architecture ============

Master, Driver, Worker, Executor

/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class com.dt.spark.SparkApps.cores.WordCount_Cluster --master spark://master:7077 /root/Documents/SparkApps/SparkAppsInJava.jar

Submitting the job: that is, the Driver is submitted through spark-submit.

Do the threads inside the Executor care what code they run? Threads are just computational resources, so tasks and threads are decoupled: a thread does not care what code the specific task is running, and therefore threads can be reused.

What do they do? They run the encapsulated task code through a runner interface.
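The same decoupling can be illustrated outside Spark with a plain JVM thread pool: the threads know nothing about the work, they just run whatever Runnable is handed to them, so the same threads can be reused for task after task. This is only a sketch of the pattern, not Spark's actual task runner:

import java.util.concurrent.Executors

// Four threads, loosely analogous to an executor that can run four tasks concurrently
val pool = Executors.newFixedThreadPool(4)

for (taskId <- 1 to 10) {
  pool.submit(new Runnable {
    // The "task" is just encapsulated code; the thread never needs to know what it does
    override def run(): Unit =
      println(s"task $taskId ran on thread ${Thread.currentThread().getName}")
  })
}

pool.shutdown()   // ten tasks were multiplexed over four reusable threads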


When the cluster starts, the Master process starts; it manages and assigns the resources of the whole cluster, receives job submissions, and allocates compute resources for each job. Each worker node, by default, starts one Worker process to manage the memory, CPU, and other computing resources of that node and to report to the Master that the worker is still working properly.

When a user program is submitted to the Master, the Master assigns an ID to the program and allocates compute resources for it; by default it launches one CoarseGrainedExecutorBackend process on each worker for the current application. By default that process maximizes its use of the memory and CPU on the node.

Each thread can be reused to execute multiple tasks.

Each application contains one Driver and multiple Executors; each task runs inside an Executor.

========== Logical View Resolution of a Spark Job ============

The entire cluster consists of the Master and the Worker nodes, a master-slave structure.

The Worker is the daemon on a worker node; each worker node runs a Worker process.

On receiving a command from the Master, the Worker process launches a CoarseGrainedExecutorBackend process for the currently running application.

Does the Worker process manage the compute resources? No. The Worker process only goes through the motions and merely looks as if it manages resources; the real management of resources is done by the Master. The Master manages the compute resources on every machine.

Inside the Driver's main method there is a SparkContext...
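A minimal driver program has the same shape as the WordCount_Cluster class submitted earlier. This is only an illustrative sketch (class name and logic are assumptions, not the course's actual source):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountDriver {                          // hypothetical class name
  def main(args: Array[String]): Unit = {
    // The SparkContext in the driver registers with the Master, builds the DAG of
    // transformations, and schedules the resulting tasks onto the executors
    val conf = new SparkConf().setAppName("WordCountDriver").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    sc.textFile("/historyserverforspark/readme.md")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}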


Within a stage the data flows through in a pipelined manner; one stage contains multiple transformations.
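The pipelining and the stage boundary are visible in the RDD's lineage: the narrow transformations stay in one stage, and the shuffle introduced by reduceByKey starts a new one. A sketch for spark-shell:

val counts = sc.textFile("/historyserverforspark/readme.md", 3)
  .flatMap(_.split(" "))     // narrow dependency: pipelined inside the same stage
  .map(word => (word, 1))    // narrow dependency: still the same stage
  .reduceByKey(_ + _)        // wide (shuffle) dependency: a new stage begins here
println(counts.toDebugString)  // the indentation marks the stage boundary at the ShuffledRDD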

========== Physical View Resolution of a Spark Job ============

Stage 5 is the mapper of Stage 6, and Stage 6 is the reducer of Stage 5.

Spark is a more refined and efficient concrete implementation of the MapReduce idea.

The tasks in the last stage are of the ResultTask type; the tasks in all preceding stages are of the ShuffleMapTask type.

The contents of a stage are always executed on executors.

And the stages are executed in order, from front to back.

One Spark application can generate many jobs from its different actions, and each job has at least one stage.
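For example, each action below triggers its own job, and each of those jobs has at least one stage (a sketch for spark-shell; the job numbers are what the web UI would show):

val data = sc.parallelize(1 to 1000, 4).map(_ * 2).cache()
data.count()                                               // action -> job 0, one stage
data.reduce(_ + _)                                         // action -> job 1, one stage
data.map(x => (x % 10, x)).reduceByKey(_ + _).collect()    // action -> job 2, two stages because of the shuffle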

The important role of this lecture is to consolidate the key points of the previous sessions and to open up the journey into Spark's internals that follows.

Teacher Liaoliang's card:

The number one Spark expert in China

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [Email protected]


This article is from the "A Flower Proud of the Cold" blog; reprinting is declined.
