[Series] Dr. Matei Zaharia, Part 2: Introduction

Source: Internet
Author: User

As the computing and I/O capabilities of a single machine can no longer meet growing data processing needs, more and more organizations need to scale their applications out to larger clusters. However, programming in a cluster environment raises the following challenges:

  1. Parallel programming: parallelizing an application requires the support of a parallel programming model.
  2. Fault tolerance and slow nodes (stragglers): both problems become severe as the cluster grows.
  3. Multi-user sharing: a shared cluster must scale computing resources elastically and limit interference between users.

As a result, many programming models have emerged. First, MapReduce made batch data processing simple and general, with fault tolerance handled automatically. However, it is hard to apply to other types of workloads, so a variety of dedicated programming models emerged:

  1. Pregel, for iterative graph algorithms
  2. F1, for SQL queries
  3. MillWheel, for continuous stream processing
  4. Storm, Impala, Piccolo, GraphLab, and others
We believe that a general programming abstraction can be designed, one that handles not only the variety of current workloads but also new application types in the future. We propose the RDD (resilient distributed dataset), an efficient data sharing primitive that greatly broadens the range of workloads a single model can cover. Frameworks built around RDDs have the following advantages over existing frameworks:
  1. A single runtime system supports batch processing, iterative computation, stream processing, and interactive queries. This makes possible a large number of emerging applications that combine these computing models, and such combined applications run much faster than they would across separate distributed systems.
  2. Fault tolerance and slow-node (straggler) tolerance are provided across all of these computing models.
  3. Performance is often around 100 times better than MapReduce.
  4. Multi-tenancy is supported, with elastic and flexible resource scheduling.
The Spark system built around RDDs is shown in Figure 1.
Figure 1: The Spark system
Problems with dedicated systems
  • Duplicated work: each system has to solve the same problems again, such as fault tolerance and load distribution.
  • Composition: combining computations that run in different dedicated systems is very difficult.
  • Limited scope: if an application does not fit a system, the only options are to modify the application or to invent a new system.
  • Resource sharing: it is very difficult for different computing systems to share resources and data, because each system assumes that it owns the entire cluster.
  • Management and maintenance: each system has its own API, operating principles, and deployment methods, all of which must be learned separately.
Given these problems, we need a unified abstraction for cluster computing that improves usability and performance, especially for complex applications and multi-user settings.
Resilient distributed datasets (RDDs)
A careful study of why MapReduce cannot support other types of applications shows that the problem comes down to the lack of an efficient data sharing mechanism between computation stages. RDDs were designed to solve this problem. Take MapReduce and Dryad as examples: both structure a computation as a directed acyclic graph (DAG) of tasks. Apart from a file system that keeps multiple copies of the data, these models provide no storage abstraction, so tolerating faults means copying large amounts of data over the network.
An RDD is a fault-tolerant distributed memory abstraction that does not rely on data replication. Instead, each RDD remembers the graph of operations used to create it (its lineage), as batch processing models do, and uses that graph to recompute data lost in a fault. As long as the operations that create RDDs are coarse-grained, that is, a single operation applies to many data elements, this method is much more efficient than copying the data.
Why are RDDs so general?
  1. In terms of expressive power, RDDs can emulate any distributed system.
  2. In terms of systems, RDDs give applications enough control to optimize the cluster resources most likely to become computing bottlenecks. With these resources optimized, RDD-based applications perform comparably to dedicated systems.
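The lineage idea described above can be illustrated with a small single-process sketch. This is not the Spark API: `ToyRDD`, `from_source`, `lose_partition`, and the other names are hypothetical and exist only for illustration. The sketch assumes the source data is stable and can always be re-read, and that every transformation is a coarse-grained `map` over whole partitions.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]],
                 parent: Optional["ToyRDD"] = None):
        self.num_partitions = num_partitions
        self._compute = compute                  # lineage: how to rebuild one partition
        self.parent = parent
        self._cache: Dict[int, List[int]] = {}   # partitions currently held in memory

    @classmethod
    def from_source(cls, chunks: List[List[int]]) -> "ToyRDD":
        # Assumption: stable input (such as a file) that can always be re-read.
        return cls(len(chunks), lambda i: list(chunks[i]))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Coarse-grained: one function applies to every element, so the
        # lineage is a single small record rather than a per-element log.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self)

    def partition(self, i: int) -> List[int]:
        # Recompute from lineage whenever the partition is not in memory.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one in-memory partition.
        self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_source([[1, 2], [3, 4], [5, 6]])
squares = source.map(lambda x: x * x)
before = squares.collect()

squares.lose_partition(1)   # "failure": partition 1 of squares is gone
after = squares.collect()   # partition 1 is rebuilt from lineage, nothing was replicated
```

No copy of `squares` is ever stored elsewhere; the lost partition is recovered by re-running the recorded `map` on the matching parent partition, which is the trade the text describes between remembering operations and copying data.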
RDD-based models
  • Iterative algorithms: RDDs can express iterative algorithms.
  • Relational queries: supported by Shark, which runs SQL on Spark.
  • MapReduce: RDDs support MapReduce and Dryad applications.
  • Stream processing: implemented on top of RDDs, in a model called D-Streams.
  • Combined models: because all of the above are built on RDDs, they can be combined to construct more complex applications.
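As a rough illustration of how several of the models listed above can share one set of coarse-grained dataset operations, the following sketch writes a MapReduce-style word count as two operations and then reuses that same batch job on small batches to imitate stream processing in the D-Stream spirit. It is plain Python over in-memory lists, not Spark code; `flat_map`, `reduce_by_key`, `word_count`, and `streaming_word_count` are hypothetical helper names chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def flat_map(records, fn):
    # "Map" phase: each input record yields zero or more output records.
    return [out for rec in records for out in fn(rec)]


def reduce_by_key(pairs, fn):
    # "Reduce" phase: group values by key, then fold each group with fn.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}


def word_count(lines):
    # Batch model: a MapReduce job written as two dataset operations.
    pairs = flat_map(lines, lambda line: [(w, 1) for w in line.split()])
    return reduce_by_key(pairs, lambda a, b: a + b)


def streaming_word_count(batches):
    # Streaming model: run the same batch job on each small batch and
    # merge its result into a running state.
    state = {}
    for batch in batches:
        for word, n in word_count(batch).items():
            state[word] = state.get(word, 0) + n
    return state


batch_result = word_count(["a b a", "b c"])
stream_result = streaming_word_count([["a b a"], ["b c"]])
```

Because the streaming version is only a loop around the batch version, the two can be mixed in one program, which is the kind of combination the last bullet refers to.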

Figure 2: Comparison with dedicated systems
Figure 2 compares the programming models built on Spark with dedicated systems, with code size on the left and performance on the right. As the figure shows, RDDs achieve the expected goal in both expressive power and runtime efficiency.
