A Brief Analysis of Spark's Design Ideas

Last Update: 2015-10-21 Source: Internet

Author: User


Spark is not rocket science! -- blogger

Readers familiar with distributed computing will know the concept of a DAG. I first came across DAGs while learning about MapReduce. (See the book "Big Data Day: Architecture and Algorithms" for more details; recommended reading.)

DAG stands for directed acyclic graph. The name itself tells you what it is: a graph whose edges have a direction and which contains no cycles. Going deeper, DAGs also involve more detailed design and implementation questions, such as the three-tier structure of a DAG computing system.
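To make the definition concrete, here is a minimal sketch in Python (3.9 or later, for the standard-library graphlib module). The step names and the helpers execution_order and is_dag are hypothetical and not taken from Spark or MapReduce. The graph maps each step to the steps it depends on; a valid execution order exists only when the graph has no cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical job graph: each key depends on the steps in its set.
dag = {
    "map": {"read"},
    "filter": {"map"},
    "reduce": {"filter"},
}


def execution_order(graph):
    """Return one valid execution order; raise CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


def is_dag(graph):
    """A directed graph is a DAG exactly when a topological order exists."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


order = execution_order(dag)
print(order)                               # dependencies always come first
print(is_dag({"a": {"b"}, "b": {"a"}}))    # a and b depend on each other
```

The second graph is rejected because a depends on b and b depends on a, which is exactly the "ring" that a DAG forbids.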

Another concept everyone should know is batch computing, also called batch processing. "Batch" here implies a large volume of data: a batch computing system is one that performs computations over a large volume of data.

Spark is a DAG-based batch computing system. MapReduce can also be regarded as a DAG-based batch computing framework.

In a computation over a directed acyclic graph, how is data transferred between two adjacent nodes? In the MapReduce framework, results are saved to disk and then transferred over the network to the next node. Saving results to disk raises a problem: disk read and write speed. Take an SSD as an example, at about 500 MB/s. That is already fast, yet DDR3 memory reads and writes at roughly 8 GB/s, a gap of about an order of magnitude. Network transmission between nodes can be raised to gigabit or even 10-gigabit speeds; with hardware support, that gives roughly 1 GB/s. Disk reads and writes can therefore be said to be the most time-consuming part.
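A rough back-of-the-envelope calculation shows what these figures imply. The sketch below uses the approximate throughputs quoted above; the 100 GB intermediate result is a hypothetical size chosen only for illustration, and real systems add latency, caching, and parallelism that this ignores.

```python
# Approximate throughputs from the text, in bytes per second.
SSD_BPS = 500 * 10**6       # about 500 MB/s
MEMORY_BPS = 8 * 10**9      # about 8 GB/s (DDR3)
NETWORK_BPS = 1 * 10**9     # about 1 GB/s (10-gigabit network)

# Hypothetical size of an intermediate result passed between two nodes.
DATASET_BYTES = 100 * 10**9  # 100 GB


def transfer_seconds(num_bytes, bytes_per_second):
    """Time to move num_bytes at a constant throughput, ignoring latency."""
    return num_bytes / bytes_per_second


for name, bps in [("SSD", SSD_BPS), ("memory", MEMORY_BPS), ("network", NETWORK_BPS)]:
    print(f"{name}: {transfer_seconds(DATASET_BYTES, bps):.1f} s")

# How many times faster memory is than the SSD under these figures.
speedup = MEMORY_BPS / SSD_BPS
print(f"memory is {speedup:.0f}x faster than the SSD")
```

Under these assumptions the SSD pass takes 200 s, the network transfer 100 s, and the memory pass 12.5 s, so the disk step dominates.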

Reducing the share of disk reads and writes is therefore an important way to speed up computation. This is why the Berkeley lab proposed the concept of the RDD (resilient distributed dataset). An excerpt:

RDDs retain the fault tolerance of data flow models such as MapReduce while allowing developers to perform in-memory computations on large clusters. Existing data flow systems handle two kinds of applications inefficiently: iterative algorithms, which are common in graph applications and machine learning, and interactive data mining tools. In both cases, keeping the data in memory can greatly improve performance. To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: an RDD is read-only and can only be created through bulk operations on other RDDs. Nonetheless, RDDs are expressive enough to represent many types of computation, including MapReduce and specialized iterative programming models (such as Pregel). Our implementation of RDDs is 20 times faster than Hadoop for iterative computation, and can also query a 1 TB dataset interactively within 5-7 seconds. -- translated from "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"
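The two restrictions in the excerpt (read-only, and created only through bulk operations on other RDDs) are what make cheap fault tolerance possible: a lost partition can be recomputed from its parent instead of being restored from a replica. The following is a toy, single-process sketch of that idea in plain Python. ToyRDD and all of its methods are hypothetical names for illustration; this is not Spark's implementation or API.

```python
class ToyRDD:
    """A read-only, partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._parent = parent   # lineage: the dataset this one was derived from
        self._fn = fn           # lineage: function applied to each parent partition
        self._cache = {}        # partition index -> tuple held "in memory"
        if partitions is not None:
            # Source dataset, standing in for data in stable storage.
            self._source = [tuple(p) for p in partitions]
            self.num_partitions = len(self._source)
        else:
            self._source = None
            self.num_partitions = parent.num_partitions

    # Bulk operations: each returns a NEW dataset and never mutates this one.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: tuple(f(x) for x in part))

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: tuple(x for x in part if pred(x)))

    def partition(self, i):
        """Return partition i, recomputing it from the lineage if it is not cached."""
        if i not in self._cache:
            if self._source is not None:
                self._cache[i] = self._source[i]
            else:
                self._cache[i] = self._fn(self._parent.partition(i))
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that drops one in-memory partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

before = squares_of_evens.collect()
squares_of_evens.lose_partition(1)     # "failure": partition 1 is gone from memory
after = squares_of_evens.collect()     # rebuilt from the lineage, not from a replica
print(before, after)
```

Only the lost partition is recomputed; partition 0 is served from the in-memory cache, which is the behavior the excerpt credits for both speed and fault tolerance.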

Beyond faster reads and writes, the more general DAG model also addresses two other problems of the MapReduce framework: each job can have only a single reduce step, and every job incurs its own startup overhead. Why? Because the DAG model is a generalization of map/reduce tasks. In Spark, a DAG batch computing system, there are many more transformations and actions to choose from, with map and reduce among them.
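As an illustration, the toy pipeline below chains two aggregation steps in one job, something that would take two separate jobs when each job allows only a single reduce. It is a plain-Python sketch, not Spark code: the Pipeline class is hypothetical, and its method names only loosely echo Spark's transformations (map, flatMap, reduceByKey) and actions (collect). Transformations are merely recorded; work happens when an action is called.

```python
from operator import add


class Pipeline:
    """Records transformations lazily; nothing runs until an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    def _then(self, step):
        # Return a new pipeline with one more recorded step (this one is unchanged).
        return Pipeline(self._source, self._steps + (step,))

    # Transformations: describe work, do not perform it.
    def map(self, f):
        return self._then(lambda rows: [f(r) for r in rows])

    def flat_map(self, f):
        return self._then(lambda rows: [x for r in rows for x in f(r)])

    def reduce_by_key(self, f):
        def step(rows):
            acc = {}
            for key, value in rows:
                acc[key] = f(acc[key], value) if key in acc else value
            return list(acc.items())
        return self._then(step)

    # Action: runs every recorded step in order and returns the result.
    def collect(self):
        rows = list(self._source)
        for step in self._steps:
            rows = step(rows)
        return rows


lines = ["a b a", "b c", "d"]

# First aggregation: how often does each word occur?
word_counts = (
    Pipeline(lines)
    .flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(add)
)

# Second aggregation in the same job: how many words share each count?
histogram = word_counts.map(lambda kv: (kv[1], 1)).reduce_by_key(add)

print(word_counts.collect())
print(sorted(histogram.collect()))
```

Both aggregations run inside one pipeline when collect is called, so the intermediate word counts never have to be written out and picked up by a second job.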


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
