[Series] Dr. Matei Zaharia (Ma Tie) - 2 Introduction

Last Update:2014-08-06 Source: Internet

Author: User

As the computing and I/O capabilities of a single machine can no longer keep up with growing data processing needs, more and more organizations need to scale their applications out to larger clusters. However, programming in a cluster environment raises the following challenges:

Parallel programming: parallelizing an application requires the support of a parallel programming model.
Fault tolerance and slow nodes (stragglers): both problems become more severe as the cluster grows.
Multi-user sharing: a shared cluster must scale computing resources elastically and limit interference between users.

As a result, many programming models have emerged. First, MapReduce made batch data processing simple and general, with fault tolerance handled automatically. However, it handles other types of workloads poorly, so a variety of specialized programming models emerged:

Pregel, for iterative graph algorithms
F1, for SQL queries
MillWheel, for continuous stream processing
Storm, Impala, Piccolo, GraphLab...

We believe that a general programming abstraction can be designed that handles not only the variety of current workloads but also new types of applications in the future. We propose the RDD (resilient distributed dataset), an efficient data sharing primitive that greatly improves generality. Frameworks built around RDDs have the following advantages over existing frameworks:

A single runtime system supports batch processing, iterative computation, stream processing, and interactive queries. This makes possible a large class of emerging applications that combine these computing models, with much higher performance than running each model on its own separate distributed system.
Provides strong fault tolerance and tolerance of slow nodes.
Performance can be more than 100 times better than MapReduce.
Supports multi-tenancy and elastic resource scheduling.

The Spark system built around RDDs is shown in Figure 1.
Figure 1 The Spark system
Problems with dedicated systems

Duplicated work: each system has to solve problems such as fault tolerance and load distribution on its own.
Composition: combining computations across different dedicated systems is very difficult.
Limited scope: if an application does not fit a system, either the application must be modified or a new system must be invented.
Resource sharing: it is very difficult for different computing systems to share cluster resources, because each system assumes that it owns the entire cluster.
Management and maintenance: each system has its own API, operating principles, deployment methods, and so on, all of which must be learned separately.

Based on these problems, we need a unified abstraction of cluster computing to improve cluster availability, performance, multi-user processing, and support for complex applications.
Resilient Distributed Datasets (RDDs)

A careful look at why MapReduce cannot support other types of applications shows that the problem comes down to the lack of an efficient data sharing mechanism between computation stages. RDDs were created to solve this problem. MapReduce and Dryad, for example, both structure a computation as a directed acyclic graph (DAG) of tasks. Apart from replicated files in the file system, these models provide no storage abstraction, so fault tolerance depends on copying large amounts of data across the network.

An RDD is a fault-tolerant distributed memory abstraction that avoids data replication. Each RDD remembers the graph of operations that was used to build it, so lost data can be recomputed after a fault, much as a batch system reruns failed tasks. As long as the operations that create RDDs are coarse-grained, this is much more efficient than replicating the data.

Why are RDDs so general? First, in terms of expressive power, RDDs can emulate any distributed system. Second, from a systems perspective, RDDs give applications enough control to optimize the cluster resources that are most likely to become computing bottlenecks. With these optimizations, RDD-based applications perform almost as well as dedicated systems.
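
The recovery idea described above can be illustrated with a small sketch. This is plain Python, not Spark: the ToyRDD class, from_chunks, and lose_partition are hypothetical names invented for illustration, and a "partition" is just a list held in memory. The sketch assumes that every dataset keeps a function that can rebuild any of its partitions from its parent, which plays the role of the remembered operation graph.

```python
class ToyRDD:
    """Hypothetical toy dataset that remembers how each partition is built."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self.compute = compute                 # lineage: partition index -> list of records
        self.cache = [None] * num_partitions   # in-memory copies, which may be lost
        self.computations = 0                  # how many partitions were built or rebuilt

    def partition(self, i):
        # Rebuild the partition from its lineage only if the in-memory copy is missing.
        if self.cache[i] is None:
            self.cache[i] = self.compute(i)
            self.computations += 1
        return self.cache[i]

    def map(self, f):
        # Coarse-grained: the same function is applied to every record of a partition.
        return ToyRDD(self.num_partitions,
                      lambda i: [f(x) for x in self.partition(i)])

    def filter(self, keep):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if keep(x)])

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i):
        # Simulate a node failure that wipes one in-memory partition.
        self.cache[i] = None


def from_chunks(chunks):
    # Stand-in for reading partitions from stable storage.
    return ToyRDD(len(chunks), lambda i: list(chunks[i]))


even_squares = (from_chunks([[1, 2, 3], [4, 5, 6]])
                .map(lambda x: x * x)
                .filter(lambda x: x % 2 == 0))
before = even_squares.collect()
even_squares.lose_partition(1)      # partition 1 is lost
after = even_squares.collect()      # only partition 1 is recomputed
print(before, after, even_squares.computations)
```

Only the lost partition is rebuilt, and nothing was replicated in advance. The sketch ignores distribution, scheduling, and wide dependencies, all of which a real system has to handle.
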
Models based on RDDs

Iterative algorithms: RDDs can express iterative algorithms.
Relational queries: supported by Shark (SQL).
MapReduce: RDDs can run MapReduce and Dryad applications.
Stream processing: the streaming model built on RDDs is called D-Streams (discretized streams).
Combined models: the models above can be combined on RDDs to build more complex applications.
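
As a rough illustration of the MapReduce entry in the list above, the following plain-Python sketch expresses a MapReduce job with two coarse-grained operators. The helpers flat_map and reduce_by_key are hypothetical and run on ordinary lists; they are named after Spark's flatMap and reduceByKey operators but do not use Spark. One simplification is assumed: the reducer is an associative function that combines two values, rather than a function over a key and its full list of values.

```python
from collections import defaultdict
from functools import reduce


def flat_map(records, f):
    # Apply f to every record and concatenate the resulting lists.
    return [out for rec in records for out in f(rec)]


def reduce_by_key(pairs, combine):
    # Group values by key, then fold each group with an associative function.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(combine, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    # A MapReduce job is a flat_map (map phase) followed by a reduce_by_key.
    return reduce_by_key(flat_map(records, mapper), reducer)


lines = ["spark builds on rdd", "rdd supports mapreduce", "spark supports streaming"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda a, b: a + b,
)
print(counts)
```

The same two operators can be chained with others such as filters or joins, which is the sense in which different models can be combined on one abstraction.
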

Comparison with dedicated systems

Figure 2 compares various programming models implemented on Spark with dedicated systems: code size on the left and performance on the right. In both expressiveness and running efficiency, RDDs achieve the intended goal.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
