Overview of Spark cluster mode
This article briefly reviews how Spark runs on clusters to make the components easier to understand.
Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver). In particular, to run on a cluster, the SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on worker nodes in the cluster, that is, the processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
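As a concrete illustration, the following minimal Scala driver is a sketch of this flow; the application name, master URL, and host name are placeholders, assuming a standalone cluster manager, and are not taken from this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver sketch. "spark://master-host:7077" is a hypothetical
// standalone-manager URL; it tells the SparkContext which cluster manager
// to contact for executors.
object SimpleDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SimpleDriver")
      .setMaster("spark://master-host:7077")

    val sc = new SparkContext(conf) // acquires executors on worker nodes

    // Transformations and actions are expressed here in the driver;
    // the resulting tasks are shipped to the executors to run.
    val evens = sc.parallelize(1 to 1000000).filter(_ % 2 == 0).count()
    println(s"Even numbers: $evens")

    sc.stop()
  }
}
```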
Note the following when using this architecture:
- Each application has its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. The advantage of this is that applications are isolated from each other, both in terms of scheduling (each driver schedules its own tasks) and execution (tasks from different applications run in different JVMs). However, this also means that data cannot be shared among different Spark applications (SparkContext instances) without writing it to an external storage system.
- Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these processes communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (for example, Mesos/YARN).
- Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same LAN. If you want to send requests to a remote cluster, it is better to open an RPC to the driver and have it submit operations from nearby, rather than running the driver far away from the worker nodes.
Spark currently supports three types of cluster managers:
- Standalone mode - a simple cluster manager included with Spark
- Apache Mesos mode - a general cluster manager that can also run Hadoop MapReduce and service applications
- Hadoop YARN mode - the resource manager in Hadoop 2.0
In fact, the Spark EC2 launch scripts make it easy to start a standalone-mode cluster on Amazon EC2 (Amazon Elastic Compute Cloud).
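From the driver's point of view, the choice of cluster manager is mostly a matter of the master URL passed to Spark. The sketch below shows typical URL forms for each manager; the host names and ports are placeholders.

```scala
import org.apache.spark.SparkConf

// Hypothetical host names and ports; only the URL scheme changes per manager.
val standalone = new SparkConf().setMaster("spark://master-host:7077") // standalone mode
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Apache Mesos
val yarn       = new SparkConf().setMaster("yarn-client")              // Hadoop YARN (simply "yarn" on newer releases)
val local      = new SparkConf().setMaster("local[4]")                 // local testing with 4 threads
```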
Shipping Code to the Cluster
The recommended way to ship code to a cluster is through the SparkContext constructor, which can take a list of JAR files (for Java/Scala) or .egg and .zip package files (for Python) to distribute to the worker nodes. You can also call SparkContext.addJar and addFile to dynamically add files to be sent.
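A sketch of both approaches is shown below, in Scala, with hypothetical file paths:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Paths below are hypothetical. JARs listed in the configuration are
// distributed to worker nodes when the executors start.
val conf = new SparkConf()
  .setAppName("ShipCodeExample")
  .setMaster("spark://master-host:7077")
  .setJars(Seq("/path/to/app-dependencies.jar"))

val sc = new SparkContext(conf)

// Files and JARs can also be added dynamically after the context exists.
sc.addJar("/path/to/extra-lib.jar")     // added to every executor's classpath
sc.addFile("/path/to/lookup-table.csv") // fetched on executors via SparkFiles.get("lookup-table.csv")
```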
Monitoring
Each driver program has a web UI, typically on port 4040, where you can see information about running tasks, executors, and storage usage. Simply enter http://<driver node>:4040 in a browser to access it. The monitoring guide also describes other monitoring options.
Task Scheduling
Spark gives control over resource allocation both across applications (at the level of the cluster manager) and within an application (if multiple computations are running on the same SparkContext). The job scheduling documentation describes this in more detail.
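As a rough sketch, the configuration below touches both levels; the values are placeholders, and spark.cores.max applies to standalone and Mesos deployments rather than YARN.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical values. spark.cores.max caps how many cores this application
// claims from the cluster manager (allocation across applications), while
// spark.scheduler.mode controls how concurrent jobs inside this SparkContext
// share its executors (allocation within the application).
val conf = new SparkConf()
  .setAppName("SchedulingExample")
  .set("spark.cores.max", "8")
  .set("spark.scheduler.mode", "FAIR")

val sc = new SparkContext(conf)
```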
Vocabulary
The following table summarizes the terms you will see used to refer to cluster concepts:
Term | Meaning
---- | ----
Application | A user program built on Spark, consisting of a driver program and executors on the cluster.
Driver | The process that runs the main function of the application and creates the SparkContext.
Cluster manager | An external service for acquiring resources on the cluster (for example, the standalone manager, Mesos, YARN).
Worker node | Any node in the cluster that can run application code.
Executor | A process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk. Each application has its own executors.
Task | A unit of work that is sent to one executor.
Job | A parallel computation consisting of multiple tasks, spawned in response to a Spark action (for example, save or collect). You will see this term in the driver's logs.
Stage | Each job is divided into smaller sets of tasks called stages (similar to the map and reduce stages in MapReduce). You will see this term in the driver's logs.
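To see how these terms relate, the short word-count sketch below (hypothetical input and output paths) spawns one job per action; the shuffle introduced by reduceByKey marks a stage boundary, and each stage runs one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("GlossaryDemo").setMaster("local[2]"))

val words  = sc.textFile("input.txt")                  // hypothetical input path
val counts = words.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)                  // shuffle: marks a stage boundary

counts.collect()                                       // action: spawns one job
counts.saveAsTextFile("output")                        // second action: spawns a second job

sc.stop()
```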