A Perspective on Jobs from the Spark Architecture (DT Big Data DreamWorks)


Content:

1. Observe the Spark architecture through a case study

2. Manually draw the internal Spark architecture

3. Logical view resolution of a Spark job

4. Physical view resolution of a Spark job

A job is triggered either by an action or by a checkpoint.

========== The Spark Architecture Through a Case Study ============

Running jps shows the Master process. Its role is to manage the cluster's computing resources, mainly memory and CPU (disk and network may also be considered), and to accept job requests submitted by clients and allocate resources to them. Note: jobs are coarse-grained; resources are allocated when the job is submitted, and the job generally runs with those allocated resources unless something abnormal happens while it is running.

The Worker process is mainly responsible for the memory and CPU resource usage of the current node.

Spark is a master-slave distributed system.

Start spark-shell:

./spark-shell --master spark://master:7077,worker1:7077,worker2:7077

Refresh the Web console again


Why does it show 24 cores, and why is each node's memory 1024 MB?

There's a configuration inside spark-env.sh.

cat /usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_71
export SCALA_HOME=/usr/local/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.0/etc/hadoop
#export SPARK_MASTER_IP=master
export SPARK_WORKER_MEMORY=2G
export SPARK_EXCUTOR_MEMORY=2G
export SPARK_DRIVER_MEMORY=2G
export SPARK_WORKER_CORES=8
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=master:2181,worker1:2181,worker2:2181 -Dspark.deploy.zookeeper.dir=/spark"

Did starting spark-shell allocate resources?

Of course; in standalone mode resources are allocated in a coarse-grained way.

By default spark-shell has no jobs and no stages yet, but resources have already been allocated.

But the executors are already there:

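You can also confirm this from the shell itself. A minimal sketch (the exact values depend on the spark-env.sh settings above):

// Inside spark-shell: the executor memory setting this application registered with
sc.getConf.getOption("spark.executor.memory")
// Memory status of each registered block manager (the executors, plus the driver)
sc.getExecutorMemoryStatus.foreach(println)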

This also corresponds to the Spark cluster architecture diagram.


By default there is one executor per worker, but there can also be multiple executors; if a single executor does not keep CPU utilization high, you can configure several (see the configuration sketch below).

No matter how many jobs a Spark application has, the resources it uses are the ones allocated at registration time. Under the default resource allocation, each worker starts one ExecutorBackend for the current program, and by default that executor uses as many cores and as much memory as it can. If nothing limits this, then as soon as Spark is running the resources are filled up, which is why a resource manager such as YARN or Mesos is needed.
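If an application should not grab everything, the limits can be set in the application's own configuration. A minimal sketch using standard Spark 1.x properties (the application name and values here are only examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ResourceCappedApp")       // hypothetical application name
  .setMaster("spark://master:7077")
  .set("spark.cores.max", "6")           // cap on the total cores this application may take
  .set("spark.executor.memory", "1g")    // memory per executor instead of the worker's maximum
  .set("spark.executor.cores", "2")      // cores per executor; an 8-core worker can then host several executors
val sc = new SparkContext(conf)

With spark.executor.cores set below the worker's core count, a single worker can host more than one executor for the same application, which is the multi-executor case mentioned above.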


Tasks run on the executors.


How many tasks can run concurrently at any one time depends on the number of cores available to the current executor.

sc.textFile("/historyserverforspark/readme.md", 3).flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _, 1).cache

The degree of parallelism is also inherited

sc.textFile("/historyserverforspark/readme.md", 3).flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).sortByKey().cache


The parallelism of 3 has been inherited.
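You can verify the inheritance by checking the partition counts (a small sketch for spark-shell, using the same file as above; this assumes spark.default.parallelism has not been set explicitly):

val lines = sc.textFile("/historyserverforspark/readme.md", 3)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.partitions.length       // 3: inherited from the parent RDD
val single = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _, 1)
single.partitions.length       // 1: the explicit numPartitions argument overrides the inherited value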

========== Manually Draw the Spark Internal Architecture ============

Master, Driver, Worker, Executor

/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class com.dt.spark.SparkApps.cores.WordCount_Cluster --master spark://master:7077 /root/Documents/SparkApps/SparkAppsInJava.jar

Submitting the job: that is, the Driver is submitted through spark-submit.

Do the threads inside the Executor care what code they run? Threads are just computational resources, so tasks and threads are decoupled: a thread does not care what code the specific task is running, and therefore threads can be reused.

What do they do? They run the encapsulated task code through a runner interface.
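The same decoupling can be illustrated outside Spark with a plain JVM thread pool: the threads know nothing about the work, they just run whatever Runnable is handed to them, so the same threads can be reused for task after task. This is only a sketch of the pattern, not Spark's actual task runner:

import java.util.concurrent.Executors

// Four threads, loosely analogous to an executor that can run four tasks concurrently
val pool = Executors.newFixedThreadPool(4)

for (taskId <- 1 to 10) {
  pool.submit(new Runnable {
    // The "task" is just encapsulated code; the thread never needs to know what it does
    override def run(): Unit =
      println(s"task $taskId ran on thread ${Thread.currentThread().getName}")
  })
}

pool.shutdown()   // ten tasks were multiplexed over four reusable threads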


When the cluster starts, the Master process starts; it manages and assigns the resources of the whole cluster, receives job submissions, and allocates compute resources for each job. Each worker node, by default, starts one Worker process to manage the memory, CPU, and other computing resources of that node and to report to the Master that the worker is still working properly.

When a user program is submitted to the Master, the Master assigns an ID to the program and allocates compute resources for it; by default it launches one CoarseGrainedExecutorBackend process on each worker for the current application. By default that process maximizes its use of the memory and CPU on the node.

Each thread can be reused to execute multiple tasks.

Each application contains one Driver and multiple Executors; each task runs inside an Executor.

========== Logical View Resolution of a Spark Job ============

The entire cluster consists of the Master and the Worker nodes, a master-slave structure.

The Worker is the daemon on a worker node; each worker node runs a Worker process.

On receiving a command from the Master, the Worker process launches a CoarseGrainedExecutorBackend process for the currently running application.

Does the Worker process manage the compute resources? No. The Worker process only goes through the motions and merely looks as if it manages resources; the real management of resources is done by the Master. The Master manages the compute resources on every machine.

Inside the Driver's main method there is a SparkContext...
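A minimal driver program has the same shape as the WordCount_Cluster class submitted earlier. This is only an illustrative sketch (class name and logic are assumptions, not the course's actual source):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountDriver {                          // hypothetical class name
  def main(args: Array[String]): Unit = {
    // The SparkContext in the driver registers with the Master, builds the DAG of
    // transformations, and schedules the resulting tasks onto the executors
    val conf = new SparkConf().setAppName("WordCountDriver").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    sc.textFile("/historyserverforspark/readme.md")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}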


Within a stage the data flows through in a pipelined manner; one stage contains multiple transformations.
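The pipelining and the stage boundary are visible in the RDD's lineage: the narrow transformations stay in one stage, and the shuffle introduced by reduceByKey starts a new one. A sketch for spark-shell:

val counts = sc.textFile("/historyserverforspark/readme.md", 3)
  .flatMap(_.split(" "))     // narrow dependency: pipelined inside the same stage
  .map(word => (word, 1))    // narrow dependency: still the same stage
  .reduceByKey(_ + _)        // wide (shuffle) dependency: a new stage begins here
println(counts.toDebugString)  // the indentation marks the stage boundary at the ShuffledRDD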

========== Physical View Resolution of a Spark Job ============

Stage 5 is the mapper of Stage 6, and Stage 6 is the reducer of Stage 5.

Spark is a more refined and efficient concrete implementation of the MapReduce idea.

The tasks in the last stage are of the ResultTask type; the tasks in all preceding stages are of the ShuffleMapTask type.

The contents of a stage are always executed on executors.

And the stages are executed in order, from front to back.

One Spark application can generate many jobs from its different actions, and each job has at least one stage.
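For example, each action below triggers its own job, and each of those jobs has at least one stage (a sketch for spark-shell; the job numbers are what the web UI would show):

val data = sc.parallelize(1 to 1000, 4).map(_ * 2).cache()
data.count()                                               // action -> job 0, one stage
data.reduce(_ + _)                                         // action -> job 1, one stage
data.map(x => (x % 10, x)).reduceByKey(_ + _).collect()    // action -> job 2, two stages because of the shuffle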

The important role of this lecture is to consolidate the key points of the previous sessions and to open up the journey into Spark's internals that follows.

Teacher Liaoliang's card:

The number one Spark expert in China

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [Email protected]


This article is from the "A Flower Proud of the Cold" blog; reprinting is declined.
