Overview of Spark cluster mode
This article briefly reviews how Spark runs on clusters to make the components easier to understand.
Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver). In particular, to run on a cluster, the SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on worker nodes in the cluster, that is, the processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
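As a concrete illustration, the following minimal Scala driver is a sketch of this flow; the application name, master URL, and host name are placeholders, assuming a standalone cluster manager, and are not taken from this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver sketch. "spark://master-host:7077" is a hypothetical
// standalone-manager URL; it tells the SparkContext which cluster manager
// to contact for executors.
object SimpleDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SimpleDriver")
      .setMaster("spark://master-host:7077")

    val sc = new SparkContext(conf) // acquires executors on worker nodes

    // Transformations and actions are expressed here in the driver;
    // the resulting tasks are shipped to the executors to run.
    val evens = sc.parallelize(1 to 1000000).filter(_ % 2 == 0).count()
    println(s"Even numbers: $evens")

    sc.stop()
  }
}
```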
Note the following when using this architecture:
- Each application has its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. The advantage of this is that applications are isolated from each other, both in terms of scheduling (each driver schedules its own tasks) and execution (tasks from different applications run in different JVMs). However, this also means that data cannot be shared among different Spark applications (SparkContext instances) without writing it to an external storage system.
- Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these processes communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (for example, Mesos/YARN).
- Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same LAN. If you want to send requests to a remote cluster, it is better to open an RPC to the driver and have it submit operations from nearby, rather than running the driver far away from the worker nodes.
Spark currently supports three types of cluster managers:
- Standalone mode - a simple cluster manager included with Spark
- Apache Mesos mode - a general cluster manager that can also run Hadoop MapReduce and service applications
- Hadoop YARN mode - the resource manager in Hadoop 2.0
In fact, the Spark EC2 launch scripts make it easy to start a standalone-mode cluster on Amazon EC2 (Amazon Elastic Compute Cloud).
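From the driver's point of view, the choice of cluster manager is mostly a matter of the master URL passed to Spark. The sketch below shows typical URL forms for each manager; the host names and ports are placeholders.

```scala
import org.apache.spark.SparkConf

// Hypothetical host names and ports; only the URL scheme changes per manager.
val standalone = new SparkConf().setMaster("spark://master-host:7077") // standalone mode
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Apache Mesos
val yarn       = new SparkConf().setMaster("yarn-client")              // Hadoop YARN (simply "yarn" on newer releases)
val local      = new SparkConf().setMaster("local[4]")                 // local testing with 4 threads
```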
Shipping Code to the Cluster
The recommended way to ship code to a cluster is through the SparkContext constructor, which can take a list of JAR files (for Java/Scala) or .egg and .zip package files (for Python) to distribute to the worker nodes. You can also call SparkContext.addJar and addFile to dynamically add files to be sent.
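A sketch of both approaches is shown below, in Scala, with hypothetical file paths:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Paths below are hypothetical. JARs listed in the configuration are
// distributed to worker nodes when the executors start.
val conf = new SparkConf()
  .setAppName("ShipCodeExample")
  .setMaster("spark://master-host:7077")
  .setJars(Seq("/path/to/app-dependencies.jar"))

val sc = new SparkContext(conf)

// Files and JARs can also be added dynamically after the context exists.
sc.addJar("/path/to/extra-lib.jar")     // added to every executor's classpath
sc.addFile("/path/to/lookup-table.csv") // fetched on executors via SparkFiles.get("lookup-table.csv")
```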
Monitoring
Each driver program has a web UI, typically on port 4040, where you can see information about running tasks, executors, and storage usage. Simply enter http://<driver node>:4040 in a browser to access it. The monitoring guide also describes other monitoring options.
Task Scheduling
Spark gives control over resource allocation both across applications (at the level of the cluster manager) and within an application (if multiple computations are running on the same SparkContext). The job scheduling documentation describes this in more detail.
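As a rough sketch, the configuration below touches both levels; the values are placeholders, and spark.cores.max applies to standalone and Mesos deployments rather than YARN.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical values. spark.cores.max caps how many cores this application
// claims from the cluster manager (allocation across applications), while
// spark.scheduler.mode controls how concurrent jobs inside this SparkContext
// share its executors (allocation within the application).
val conf = new SparkConf()
  .setAppName("SchedulingExample")
  .set("spark.cores.max", "8")
  .set("spark.scheduler.mode", "FAIR")

val sc = new SparkContext(conf)
```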
Vocabulary
The following table summarizes the terms you will see used to refer to cluster concepts:
Term | Meaning
---- | ----
Application | A user program built on Spark, consisting of a driver program and executors on the cluster.
Driver | The process that runs the main function of the application and creates the SparkContext.
Cluster manager | An external service for acquiring resources on the cluster (for example, the standalone manager, Mesos, YARN).
Worker node | Any node in the cluster that can run application code.
Executor | A process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk. Each application has its own executors.
Task | A unit of work that is sent to one executor.
Job | A parallel computation consisting of multiple tasks, spawned in response to a Spark action (for example, save or collect). You will see this term in the driver's logs.
Stage | Each job is divided into smaller sets of tasks called stages (similar to the map and reduce stages in MapReduce). You will see this term in the driver's logs.
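To see how these terms relate, the short word-count sketch below (hypothetical input and output paths) spawns one job per action; the shuffle introduced by reduceByKey marks a stage boundary, and each stage runs one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("GlossaryDemo").setMaster("local[2]"))

val words  = sc.textFile("input.txt")                  // hypothetical input path
val counts = words.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)                  // shuffle: marks a stage boundary

counts.collect()                                       // action: spawns one job
counts.saveAsTextFile("output")                        // second action: spawns a second job

sc.stop()
```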