Apache Spark Quest: Comparing Three Distributed Deployment Modes

Tags: zookeeper, hadoop ecosystem

Apache Spark currently supports three distributed deployment modes: standalone, Spark on Mesos, and Spark on YARN. The first is similar to the pattern used in MapReduce 1.0, implementing fault tolerance and resource management internally. The latter two represent the direction of future development: part of the fault tolerance, and the resource management, are handed off to a unified resource management system. Letting Spark run on a general-purpose resource management system allows it to share a cluster with other computing frameworks, such as MapReduce; the greatest benefits are lower operational costs and higher resource utilization (through elastic resource allocation). This article describes the three deployment modes and compares their pros and cons.

Standalone mode. This mode ships with a complete set of services and can be deployed on a cluster by itself, without depending on any other resource management system. To some extent, it is the foundation of the other two modes. From Spark's development we can distill a general pattern for building a new computing framework: first design its standalone mode, initially skipping fault tolerance for the services themselves (such as master/slave) so that development moves quickly; then write the corresponding wrapper that deploys the standalone-mode services onto a resource management system such as YARN or Mesos, letting that system handle fault tolerance for the services. At present, Spark's standalone mode has no single point of failure; this is achieved with ZooKeeper, in a way similar to how the HBase master's single point of failure is resolved. Comparing Spark standalone with MapReduce, you will find the two are architecturally very similar:
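As a minimal sketch, the Scala snippet below shows how an application can target such a ZooKeeper-backed standalone cluster by listing every master in the URL, so the driver registers with whichever master is currently active. The hostnames are hypothetical; on the master daemons themselves, HA is enabled separately through the spark.deploy.recoveryMode and spark.deploy.zookeeper.url properties passed in SPARK_DAEMON_JAVA_OPTS.

    import org.apache.spark.{SparkConf, SparkContext}

    object StandaloneHaDemo {
      def main(args: Array[String]): Unit = {
        // List every standalone master; ZooKeeper elects the active one
        // and the driver fails over automatically if it dies.
        // "master1"/"master2" and port 7077 are placeholder values.
        val conf = new SparkConf()
          .setAppName("standalone-ha-demo")
          .setMaster("spark://master1:7077,master2:7077")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).sum()) // trivial sanity job
        sc.stop()
      }
    }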

1) Both are composed of master/slave services, and in both the master initially had a single point of failure that was later resolved with ZooKeeper (the JobTracker in Apache MRv1 still has this single-point problem, although the CDH version has resolved it);

2) In both, the resources on each node are abstracted into coarse-grained slots, and the number of slots determines how many tasks can run simultaneously. The difference is that MapReduce divides slots into map slots and reduce slots, which can be used only by map tasks and reduce tasks respectively and cannot be shared; this is one reason MapReduce's resource utilization is low. Spark is more optimized here: it does not distinguish slot types, and its single generic kind of slot can be used by tasks of any type, which improves resource utilization. The trade-off is less flexibility, since slot resources cannot be customized per task type. In short, each approach has its advantages and disadvantages, as the sketch below illustrates.
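A minimal Scala sketch of the generic-slot idea, using real Spark properties but made-up numbers and a placeholder master URL:

    import org.apache.spark.{SparkConf, SparkContext}

    object GenericSlotsDemo {
      def main(args: Array[String]): Unit = {
        // With spark.task.cpus = 1 (the default), every core granted to
        // this application acts as one generic slot: 10 cores means up to
        // 10 concurrent tasks of any kind, unlike MRv1's dedicated
        // map slots and reduce slots.
        val conf = new SparkConf()
          .setAppName("generic-slots-demo")
          .setMaster("spark://master1:7077")  // placeholder master URL
          .set("spark.cores.max", "10")       // total "slots" for this app
          .set("spark.executor.memory", "4g") // memory per executor
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }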

Spark on Mesos mode. This is the model many companies use, and it is the officially recommended one (one reason, of course, being the family ties: Spark and Mesos share their origins at Berkeley's AMPLab). Precisely because Mesos support was taken into account early in Spark's development, running on Mesos today is more flexible and more natural than running on YARN. Currently, in a Spark on Mesos environment, users can choose between two scheduling modes for their applications (see Andrew Xia's "Mesos scheduling mode on Spark"):

1) Coarse-grained mode: each application's runtime environment consists of one driver and several executors, where each executor occupies a number of resources and can run multiple tasks internally (corresponding to the number of "slots" it holds). Before any of the application's tasks actually run, all the resources of this runtime environment must be requested up front, and they remain occupied for the whole run even when idle; they are released only after the program finishes. For example, when submitting an application you might specify 5 executors, each consuming 5 GB of memory and 5 CPUs, with 5 slots inside each executor. Mesos must first allocate resources to the executors and start them before task scheduling begins. Also, while the program is running, the Mesos master and slaves are unaware of the tasks inside each executor; the executors report task status directly to the driver through an internal communication mechanism. To some extent, each application uses Mesos to build a virtual cluster for its own use (a configuration sketch for both modes follows after this list).

2) Fine-grained mode: because a large amount of resources can be wasted in coarse-grained mode, Spark on Mesos also provides a second scheduling mode, fine-grained mode, whose idea, much like today's cloud computing, is allocation on demand. As in coarse-grained mode, executors are started when the application starts, but each executor occupies only the resources it needs to run itself, without reserving anything for tasks that may run later; Mesos then allocates resources to executors dynamically, each allocation allowing one new task to run, and the corresponding resources are released as soon as that single task finishes. Each task reports its status to the Mesos slave and the Mesos master, which enables finer-grained management and fault tolerance. This resembles the MapReduce scheduling model, in which every task is completely independent; the advantage is easy resource control and isolation, but the obvious drawback is high latency for short jobs.
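Here is a minimal sketch of how the two Mesos modes are selected, using the real spark.mesos.coarse property; the master address and resource numbers are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object MesosModesDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("mesos-modes-demo")
          .setMaster("mesos://mesos-master:5050") // placeholder; zk:// URLs also work
          // Coarse-grained: grab up to spark.cores.max cores up front and
          // hold them for the application's whole lifetime.
          .set("spark.mesos.coarse", "true")
          .set("spark.cores.max", "25")       // e.g. 5 executors x 5 CPUs
          .set("spark.executor.memory", "5g") // 5 GB per executor, as above
        // For fine-grained mode, set spark.mesos.coarse to "false" instead:
        // Mesos then offers resources task by task and reclaims each core
        // as soon as its task finishes, at the cost of per-task launch latency.
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }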

Spark on YARN mode. This is one of the most promising deployment modes. However, limited by YARN's own development, only the coarse-grained mode is currently supported. This is because container resources on YARN cannot be scaled dynamically: once a container is started, its available resources can no longer change. Support for this is already planned on the YARN side (see https://issues.apache.org/jira/browse/YARN-1197).
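A minimal sketch of a coarse-grained Spark-on-YARN configuration follows; the property names are real Spark settings, the numbers are illustrative, and the "yarn" master URL follows newer Spark releases (older ones used "yarn-client"/"yarn-cluster"):

    import org.apache.spark.{SparkConf, SparkContext}

    object YarnCoarseDemo {
      def main(args: Array[String]): Unit = {
        // Executor sizes must be fixed up front: a YARN container cannot
        // grow or shrink after launch (YARN-1197), hence coarse-grained only.
        val conf = new SparkConf()
          .setAppName("yarn-coarse-demo")
          .setMaster("yarn") // requires HADOOP_CONF_DIR to point at the cluster config
          .set("spark.submit.deployMode", "client")
          .set("spark.executor.instances", "5") // illustrative numbers
          .set("spark.executor.cores", "5")
          .set("spark.executor.memory", "5g")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }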

In summary, each of these three distributed deployment modes has its pros and cons, and the choice usually depends on the company's circumstances. When selecting a solution, it is often necessary to weigh the company's technical direction (the Hadoop ecosystem versus other ecosystems), its server resources (if resources are limited, do not consider the standalone mode), and its reserve of relevant technical talent.

Original article; when reproducing, please attribute it to Dong's blog.

Article link: http://dongxicheng.org/framework-on-yarn/apache-spark-comparing-three-deploying-ways/
