Apache Spark: a comparison of three distributed deployment modes

Spark currently supports three distributed deployment modes: standalone, Spark on Mesos, and Spark on YARN. The first is similar to the approach adopted by MapReduce 1.0, which implements fault tolerance and resource management internally. The latter two reflect the future trend: part of the fault tolerance and resource management is handed over to a unified resource management system. Spark then runs on top of a common resource management system and can share a cluster with other computing frameworks such as MapReduce. The greatest benefits are lower operation and maintenance costs and higher resource utilization (resources are allocated on demand). This article describes these three deployment modes and compares their advantages and disadvantages.

Standalone mode. This mode ships with a complete set of services and can be deployed on its own cluster without relying on any other resource management system. To a certain extent, it is the basis of the other two modes. Spark's development path suggests a general approach to developing a new computing framework: first design its standalone mode, where for the sake of rapid development you do not initially need to consider fault tolerance for the services (such as master/slave); then develop the corresponding wrappers so that the standalone-mode services can be deployed unchanged onto a resource management system such as YARN or Mesos, which then takes responsibility for service fault tolerance. At present, Spark in standalone mode has no single point of failure; this is implemented with ZooKeeper, using an idea similar to HBase's solution for the master single point of failure. Comparing Spark standalone with MapReduce shows that the two are architecturally identical in two respects:

1) Both are made up of master/slave services, and in both the master was at first a single point of failure that was later solved with ZooKeeper (the JobTracker in Apache MRv1 is still a single point of failure, but this has been resolved in the CDH version);

2) The resources on each node are abstracted into coarse-grained slots: a node can run as many tasks at the same time as it has slots. The difference is that MapReduce divides slots into map slots and reduce slots, which can be used only by Map Tasks and Reduce Tasks respectively and cannot be shared. This is one of the reasons for MapReduce's low resource utilization. Spark is better optimized in this respect: it does not distinguish between slot types. There is only one kind of slot, which any type of Task can use. This improves resource utilization but is less flexible, because slot resources cannot be customized for different types of Task. In short, both approaches have their own advantages and disadvantages.
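
The difference between typed and untyped slots can be illustrated with a small sketch. This is a toy model written for this comparison, not code from Spark or MapReduce; the Node class, the function names, and the slot counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A worker node with a fixed number of slots (hypothetical toy model)."""
    map_slots: int
    reduce_slots: int

    @property
    def total_slots(self) -> int:
        return self.map_slots + self.reduce_slots


def runnable_typed(node: Node, pending_map: int, pending_reduce: int) -> int:
    """MapReduce-style: map tasks fit only map slots, reduce tasks only reduce slots."""
    return min(pending_map, node.map_slots) + min(pending_reduce, node.reduce_slots)


def runnable_untyped(node: Node, pending_map: int, pending_reduce: int) -> int:
    """Spark-standalone-style: any pending task can take any free slot."""
    return min(pending_map + pending_reduce, node.total_slots)


# Assumed node: 4 map slots and 2 reduce slots, i.e. 6 slots in total.
node = Node(map_slots=4, reduce_slots=2)

# A map-heavy phase: 6 map tasks are waiting and no reduce tasks exist yet.
typed = runnable_typed(node, pending_map=6, pending_reduce=0)
untyped = runnable_untyped(node, pending_map=6, pending_reduce=0)
print(f"typed slots run {typed} tasks, untyped slots run {untyped} tasks")
```

In this model the typed node leaves its two reduce slots idle during the map-heavy phase, while the untyped node uses all six. What the model cannot express is the flexibility cost mentioned above: with a single slot type, every task gets the same slot size.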

Spark on Mesos mode. This is the mode many companies use, and it is the officially recommended one (one of the reasons, of course, is the kinship between the two projects). Because Spark was developed with Mesos support in mind from the start, running Spark on Mesos is currently more flexible and natural than running it on YARN. In a Spark on Mesos environment, users can currently choose one of two scheduling modes to run their applications (see Andrew Xia's "Mesos Scheduling Mode on Spark"):

1) Coarse-grained mode: The running environment of each application consists of one Driver and several Executors. Each Executor occupies a number of resources and can run multiple Tasks internally (as many as it has "slots"). Before any task of the application actually runs, all the resources of the running environment must be requested, and they remain occupied for the whole run even if they are not used; they are reclaimed only after the program finishes. For example, when you submit an application you might specify five executors, each with 5 GB of memory and 5 CPUs, and 5 slots per executor. Mesos must first allocate resources to the executors and start them, and only then does task scheduling begin. In addition, while the program runs, the Mesos master and slaves do not know how the tasks inside each executor are running; the executors report task status directly to the Driver through an internal communication mechanism. To some extent, each application can be thought of as using Mesos to set up a virtual cluster for its own use.

2) Fine-grained mode: Because coarse-grained mode wastes a lot of resources, Spark on Mesos also provides another scheduling mode, fine-grained mode, which resembles today's cloud computing in that resources are allocated on demand. As in coarse-grained mode, the executors are started when the application starts, but each executor occupies only the resources it needs to run itself, without regard to the tasks it will run in the future. Afterwards, Mesos dynamically allocates resources to each executor; each time some are allocated, a new task can run, and as soon as a single Task finishes, the corresponding resources are released. Each Task reports its status to the Mesos slave and the Mesos master, which allows finer-grained management and fault tolerance. This scheduling mode is similar to the MapReduce scheduling mode in that each Task is completely independent. The advantage is that resource control and isolation are convenient; the obvious disadvantage is that short jobs suffer high latency.
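
The trade-off between the two modes can be made concrete with a rough back-of-the-envelope model. The sketch below is a simplification written for this article, not Spark or Mesos code: it assumes every task needs one CPU, it ignores the small baseline resources that fine-grained executors keep for themselves, and the one-second per-task launch overhead is an invented number.

```python
import heapq


def makespan(durations, slots):
    """Greedy list scheduling: each task goes to the slot that frees up first."""
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


def coarse_grained(executors, cores_per_executor, task_seconds):
    """All cores are reserved up front and held until the application ends."""
    slots = executors * cores_per_executor
    finish = makespan(task_seconds, slots)
    reserved_cpu_seconds = slots * finish
    return finish, reserved_cpu_seconds


def fine_grained(max_cores, task_seconds, launch_overhead):
    """A core is held only while a task runs, but every task pays a launch cost."""
    padded = [t + launch_overhead for t in task_seconds]
    finish = makespan(padded, max_cores)
    reserved_cpu_seconds = sum(padded)
    return finish, reserved_cpu_seconds


# The example from the text: 5 executors with 5 CPUs each, i.e. 25 slots.
# Assumed workload: a short job of 10 tasks that each take 2 seconds.
tasks = [2.0] * 10
coarse_finish, coarse_reserved = coarse_grained(5, 5, tasks)
fine_finish, fine_reserved = fine_grained(25, tasks, launch_overhead=1.0)

print(f"coarse: done in {coarse_finish}s, {coarse_reserved} CPU-seconds reserved")
print(f"fine:   done in {fine_finish}s, {fine_reserved} CPU-seconds reserved")
```

Under these assumptions the coarse-grained run finishes sooner but reserves 25 CPUs for a job that only ever uses 10 of them, while the fine-grained run reserves less in total but takes longer because each short task pays the launch overhead.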

Spark on YARN mode. This is one of the most promising deployment modes. However, limited by the current state of YARN itself, only coarse-grained mode is supported at present. The reason is that Container resources on YARN cannot be scaled dynamically: once a Container has been started, the resources available to it cannot be changed. Support for this is already on the YARN roadmap (for details, see https://issues.apache.org/jira/browse/YARN-1197).
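
Because container sizes are fixed once started, the resources for a Spark on YARN application are declared up front when it is submitted. The sketch below assembles such a spark-submit command line, reusing the five-executor example from the Mesos section. The helper function, the jar name my-app.jar, and the class com.example.MyApp are hypothetical; the flags are written as in recent Spark releases, and older releases spelled the master differently (for example yarn-cluster), so check the documentation for your version.

```python
import shlex


def yarn_submit_args(app_jar, main_class, num_executors, executor_memory, executor_cores):
    """Build a spark-submit argument list with executor resources fixed at submit time."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--num-executors", str(num_executors),    # how many executors to request
        "--executor-memory", executor_memory,     # memory per executor, e.g. "5g"
        "--executor-cores", str(executor_cores),  # concurrent tasks per executor
        app_jar,
    ]


# Hypothetical application; the numbers mirror the example in the text.
args = yarn_submit_args("my-app.jar", "com.example.MyApp", 5, "5g", 5)
print(shlex.join(args))
```

The list form can be passed directly to subprocess.run on a machine where spark-submit and the Hadoop client configuration are available.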

In summary, these three distributed deployment modes each have their own advantages and disadvantages, and which one to adopt usually depends on the company's circumstances. When choosing, it is often necessary to consider the company's technology path (the Hadoop ecosystem or another ecosystem), its server resources (standalone mode should not be considered if resources are limited), and the technical expertise it has available.
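
From the application's point of view, the chosen deployment mode mostly shows up in the master URL handed to Spark. The helper below is a hypothetical convenience function, not part of Spark; the URL schemes follow Spark's documented master URL formats, 7077 and 5050 are the usual default ports for a standalone master and a Mesos master, and the host name is a placeholder.

```python
def master_url(mode, host=None, port=None):
    """Return the Spark master URL for a deployment mode (illustrative helper)."""
    if mode == "yarn":
        # The cluster location comes from the Hadoop configuration, not the URL.
        return "yarn"
    if host is None:
        raise ValueError(f"mode {mode!r} needs a master host")
    if mode == "standalone":
        return f"spark://{host}:{port or 7077}"
    if mode == "mesos":
        return f"mesos://{host}:{port or 5050}"
    raise ValueError(f"unknown deployment mode: {mode!r}")


# Placeholder host name, used only for illustration.
for mode in ("standalone", "mesos", "yarn"):
    print(mode, "->", master_url(mode, host="master.example.com"))
```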


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
