Comparison of the Three Distributed Deployment Modes of Apache Spark

Source: Internet
Author: User
Tags: hadoop ecosystem

Apache Spark currently supports three distributed deployment modes: standalone, Spark On Mesos, and Spark On YARN. The first is similar to the pattern adopted by MapReduce 1.0, implementing fault tolerance and resource management inside the framework itself; the latter two represent the future direction of development, handing part of fault tolerance and resource management over to a unified resource management system: Spark runs on a general-purpose resource manager and can share a cluster with other computing frameworks such as MapReduce. The biggest benefit is lower operations and maintenance cost and higher resource utilization (resources are allocated on demand). This article describes the three deployment modes and compares their advantages and disadvantages.
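In practice, the deployment mode is selected by the master URL an application is given. Below is a minimal, non-authoritative sketch in Scala; the host names and ports are placeholders, and note that older Spark releases accept "yarn-client"/"yarn-cluster" as master URLs while newer ones use "yarn" plus a deploy mode.

```scala
import org.apache.spark.SparkConf

// Illustrative only: the master URL decides which cluster manager runs the application.
object DeploymentModes {
  def main(args: Array[String]): Unit = {
    val standalone = new SparkConf().setAppName("demo")
      .setMaster("spark://master.example.com:7077")       // standalone mode (placeholder host)
    val onMesos = new SparkConf().setAppName("demo")
      .setMaster("mesos://mesos-master.example.com:5050") // Spark On Mesos (placeholder host)
    val onYarn = new SparkConf().setAppName("demo")
      .setMaster("yarn-client")                           // Spark On YARN, client mode (older releases)
    Seq(standalone, onMesos, onYarn).foreach(c => println(c.get("spark.master")))
  }
}
```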

Standalone mode: this mode comes with a complete set of services and can be deployed on a cluster by itself, without relying on any other resource management system. To a certain extent, it is the basis of the other two modes. Drawing on how Spark itself evolved, we can outline a general way to develop a new computing framework: first design its standalone mode (to allow rapid development, fault tolerance for services such as the master and slaves need not be considered at first), then develop the corresponding wrappers to deploy the standalone-mode services onto a resource management system such as YARN or Mesos, which then takes responsibility for the fault tolerance of those services. Currently, Spark's standalone mode no longer has a single point of failure (SPOF); this is implemented with ZooKeeper, following an idea similar to the HBase master HA solution. Comparing Spark standalone with MapReduce, we find that the two are completely consistent in architecture:

1) Both consist of master/slave services, and the master was initially a single point of failure; this was later solved with ZooKeeper (the JobTracker in Apache MRv1 still has this single point of failure, although the CDH version resolves it);

2) Resources on each node are abstracted into coarse-grained slots, and the number of slots determines how many tasks can run on that node at the same time. The difference is that MapReduce divides slots into map slots and reduce slots, which can be used only by Map tasks and Reduce tasks respectively and cannot be shared between them; this is one reason for MapReduce's low resource utilization. Spark is more optimized here: it does not distinguish slot types, and a single kind of slot is available to every type of task. This improves resource utilization, but it is less flexible, since slot resources cannot be customized for different task types. In short, each approach has its own advantages and disadvantages. (A configuration sketch follows this list.)
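As a rough illustration of both points, the sketch below (a non-authoritative example with placeholder host names and values) connects a driver to a ZooKeeper-backed standalone cluster, using the comma-separated master list through which the client fails over to the active master, and configures the uniform, type-agnostic "slots" via executor cores.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("standalone-sketch")
      // With ZooKeeper-based HA, list every candidate master; the client
      // follows whichever master ZooKeeper has elected as active.
      .setMaster("spark://master1.example.com:7077,master2.example.com:7077")
      .set("spark.executor.memory", "4g") // memory per executor (placeholder value)
      .set("spark.executor.cores", "4")   // generic "slots" per executor
      .set("spark.task.cpus", "1")        // every task uses the same kind of slot, whatever its type
    val sc = new SparkContext(conf)
    // Map-like and reduce-like work share the same cores; there is no
    // dedicated map slot / reduce slot as in MapReduce 1.0.
    val counts = sc.parallelize(Seq("a b", "b c", "c a"))
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}
```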

Spark On Mesos mode. This is the mode adopted by many companies, and it is also the officially recommended one (naturally, because of the kinship between the two projects: Spark was originally developed with support for Mesos in mind). Precisely for this reason, Spark currently runs more flexibly and more naturally on Mesos than on YARN. In a Spark On Mesos environment, users can choose one of two scheduling modes to run their applications (for details, refer to Andrew Xia's "Mesos Scheduling Mode on Spark"):

1) Coarse-grained mode: the running environment of each application consists of one driver and several executors, where each executor occupies a number of resources and can run multiple tasks internally (corresponding to the number of "slots" it holds). Before any task of the application runs, all resources of this running environment must be requested up front, and the application keeps holding these resources for its whole run; they are reclaimed only after the program finishes. For example, when you submit an application you might request five executors, each occupying 5 GB of memory and 5 CPUs, with five slots inside each executor; Mesos must allocate the resources and start the executors before it schedules any task. In addition, while the program is running, the Mesos master and slaves know nothing about the state of the individual tasks inside the executors; the executors report task states directly to the driver through an internal communication mechanism. To a certain extent, each application can be thought of as using Mesos to build a virtual cluster for its own use (see the configuration sketch after this list).

2) Fine-grained mode: Spark On Mesos also provides a second scheduling mode, fine-grained mode, whose idea, similar to today's cloud computing, is to allocate resources on demand. As in coarse-grained mode, executors are started when the application starts, but each executor occupies only the resources needed for the tasks it is actually running; there is no need to reserve resources for tasks that may run in the future, and Mesos dynamically allocates resources to each executor. Each time some resources are allocated, a new task can run, and the corresponding resources are released as soon as a single task finishes. Each task reports its state to the Mesos slave and the Mesos master, which makes fine-grained management and fault tolerance easier. This scheduling mode is similar to that of MapReduce, where every task is completely independent; its advantage is convenient resource control and isolation, but its drawback is equally obvious: short jobs suffer from high latency.
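Switching between the two Mesos scheduling modes comes down to a single configuration flag. Here is a minimal sketch, assuming a placeholder Mesos master address and reusing the illustrative 5-executor / 5 GB / 5-CPU figures from the coarse-grained example above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MesosSketch {
  def main(args: Array[String]): Unit = {
    val coarse = new SparkConf()
      .setAppName("mesos-coarse-sketch")
      .setMaster("mesos://zk://zk1.example.com:2181/mesos") // placeholder Mesos master
      .set("spark.mesos.coarse", "true")   // coarse-grained: hold executors for the app's whole lifetime
      .set("spark.cores.max", "25")        // e.g. 5 executors x 5 CPUs
      .set("spark.executor.memory", "5g")  // 5 GB per executor

    val fine = new SparkConf()
      .setAppName("mesos-fine-sketch")
      .setMaster("mesos://zk://zk1.example.com:2181/mesos")
      .set("spark.mesos.coarse", "false")  // fine-grained: Mesos offers resources per task and
                                           // reclaims them as each task finishes

    // Pick whichever mode suits the workload: coarse-grained favors low
    // task-launch latency, fine-grained favors sharing and isolation.
    val sc = new SparkContext(coarse)
    println(sc.parallelize(1 to 1000).count())
    sc.stop()
  }
}
```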

Spark On YARN mode. This is the most promising deployment mode. However, limited by the current state of YARN development, only coarse-grained mode is supported for now. This is because a container's resources on YARN cannot be dynamically scaled: once a container is started, its available resources cannot change. Support for this is already on the YARN roadmap (see https://issues.apache.org/jira/browse/YARN-1197).
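A corresponding sketch for coarse-grained Spark On YARN, assuming HADOOP_CONF_DIR points at the cluster configuration and using placeholder sizes (the executor resources stay fixed for the whole run because YARN containers cannot be resized once started):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object YarnSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("yarn-sketch")
      .setMaster("yarn-client")             // client deploy mode; newer releases use "yarn" plus a deploy mode
      .set("spark.executor.instances", "5") // a fixed set of containers for the app's lifetime
      .set("spark.executor.memory", "4g")   // container size is fixed once the container starts
      .set("spark.executor.cores", "2")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).count())
    sc.stop()
  }
}
```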

In short, these three distributed deployment modes each have their own advantages and disadvantages, and which one to adopt usually has to be decided according to the company's own situation. When choosing a solution, you often need to consider the company's technical direction (whether to build on the Hadoop ecosystem or another ecosystem), the available server resources (if server resources are limited, the standalone mode need not be considered), and the pool of related technical talent.
