A Brief Analysis of Spark's Design Ideas

Source: Internet
Author: User

Spark is not rocket science! -- blogger

Anyone familiar with distributed computing will know the concept of a DAG. I first came across DAGs while learning about MapReduce. (See the book "Big Data Day: Architecture and Algorithms" for more details; it is recommended reading.)

DAG stands for directed acyclic graph. You can work out what a DAG is from the name alone: directed, no cycles, a graph. The meaning is quite clear. In more complex DAG systems, deeper design and implementation details come into play, such as the three-tier structure of a DAG computing system.
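To make "directed" and "no cycles" concrete, here is a minimal sketch in Python using the standard-library `graphlib` module (Python 3.9+). The step names (`load`, `parse`, and so on) are hypothetical and only serve the illustration; they are not taken from any real system.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical small job: each key lists the steps it depends on.
dag = {
    "load": set(),
    "parse": {"load"},
    "filter": {"parse"},
    "count": {"parse"},
    "report": {"filter", "count"},
}

def execution_order(graph):
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())

def has_cycle(graph):
    """A graph with a cycle has no valid execution order, so it is not a DAG."""
    try:
        execution_order(graph)
        return False
    except CycleError:
        return True

order = execution_order(dag)
print(order)

# Making "load" depend on "report" closes a ring: load -> parse -> ... -> report -> load.
cyclic = dict(dag, load={"report"})
print(has_cycle(dag), has_cycle(cyclic))
```

Because the graph has no cycles, every step can be scheduled once all of its inputs are ready, which is what allows a computing system to execute it stage by stage.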

There is another concept everyone should know: batch computing, also called batch processing. "Batch" implies a huge amount of data, so a batch computing system is one that performs computation over a huge amount of data.

Spark is a DAG-based batch computing system. MapReduce can also be regarded as a DAG-based batch computing framework.

In a computation over a directed acyclic graph, how is data transferred between two adjacent nodes? In the MapReduce framework, results are saved to disk and then transferred over the network to the next node. Saving results to disk raises a problem: disk read/write speed. Taking an SSD as an example, the speed is about 500 MB/s, which is fast for a disk, while DDR3 memory reads and writes at roughly 8 GB/s. That gap is more than an order of magnitude. Network links between nodes can be upgraded to gigabit or even 10-gigabit, and with hardware support they reach roughly 1 GB/s. So disk reads and writes are the most time-consuming part.
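The comparison above can be worked through as simple arithmetic. The sketch below uses the rough throughput figures quoted in the text; the 100 GB intermediate result is a hypothetical size chosen for illustration, and latency, seek time, and contention are ignored.

```python
# Rough throughput figures quoted in the text, in bytes per second.
GB = 10**9
THROUGHPUT = {
    "SSD": 500 * 10**6,     # about 500 MB/s
    "network": 1 * GB,      # about 1 GB/s with suitable hardware
    "DDR3 memory": 8 * GB,  # about 8 GB/s
}

def transfer_seconds(size_bytes, bytes_per_second):
    """Time to move size_bytes at a sustained rate, ignoring latency and seeks."""
    return size_bytes / bytes_per_second

# Hypothetical intermediate result of 100 GB passed between two DAG nodes.
size = 100 * GB
times = {name: transfer_seconds(size, rate) for name, rate in THROUGHPUT.items()}
for name, seconds in times.items():
    print(f"{name}: {seconds:.1f} s")

# How many times slower the SSD path is than memory for the same data.
ssd_vs_memory = times["SSD"] / times["DDR3 memory"]
print(f"SSD is {ssd_vs_memory:.0f}x slower than memory")
```

Under these assumed figures the SSD is the slowest of the three paths, which is the point the paragraph makes.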

Reducing the share of disk reads and writes is therefore an important way to speed up computation. With this in mind, researchers at Berkeley proposed the concept of the RDD. A passage from their paper:

RDDs retain the fault tolerance of data flow models such as MapReduce while allowing developers to perform in-memory computations on large clusters. Existing data flow systems handle two kinds of applications inefficiently: iterative algorithms, which are common in graph applications and machine learning, and interactive data mining tools. In both cases, keeping the data in memory can greatly improve performance. To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: an RDD is read-only and can only be created through bulk operations on other RDDs. Nonetheless, RDDs are expressive enough to represent many types of computation, including MapReduce and specialized iterative programming models such as Pregel. Our implementation of RDDs can be 20 times faster than Hadoop for iterative computation and can query a 1 TB dataset interactively within 5-7 seconds. -- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (translated)
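The two restrictions in the passage (read-only, and created only through bulk operations on other RDDs) are what make cheap recovery possible: a lost dataset can be recomputed from the record of how it was derived. The following is a toy, single-process sketch of that idea in Python. `ToyRDD` and its methods are hypothetical names for illustration only; this is not Spark's API, and real RDDs are partitioned across a cluster.

```python
class ToyRDD:
    """Toy stand-in for an RDD: read-only, and derived only from another dataset.

    NOT Spark's API; names and structure are illustrative assumptions.
    """

    def __init__(self, source=None, parent=None, op=None):
        # Immutable copy of the input data (only set on a base dataset).
        self._source = tuple(source) if source is not None else None
        self._parent = parent  # lineage: the dataset this one was derived from
        self._op = op          # lineage: the bulk operation applied to the parent
        self._cache = None     # in-memory copy of the computed records

    def map(self, fn):
        # A bulk operation returns a new dataset; the parent is never modified.
        return ToyRDD(parent=self, op=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Serve from memory when possible; otherwise rebuild from lineage.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._op(self._parent.collect())
        return list(self._cache)

    def lose_data(self):
        # Simulate a failure that drops the in-memory records.
        self._cache = None


base = ToyRDD(source=range(1, 11))
squares = base.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

first = even_squares.collect()
even_squares.lose_data()
squares.lose_data()
recovered = even_squares.collect()  # recomputed from lineage, not from a replica
print(first == recovered)
```

Because no dataset is ever modified in place, replaying the recorded operations always yields the same records, so the sketch needs no replicated copy of the intermediate data.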


Besides the gain in read/write speed, the more general DAG model also solves two other problems of the MapReduce framework: each job can have only a single reduce, and each job incurs its own startup overhead. Why? The DAG model is a generalization of the map/reduce task. In Spark, a DAG batch computing system, there are many more transformations and actions to choose from, and map and reduce are just two of them.
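As an illustration of the single-reduce limitation: counting words and then grouping the words by their count needs two reduce steps. With one reduce per job, that means two chained jobs with an intermediate result handed over between them; in a DAG model it is a single job with two reduce stages. The pure-Python sketch below shows the shape of such a pipeline. It is not Spark code, and `map_stage` and `reduce_stage` are hypothetical helper names.

```python
from collections import defaultdict

def map_stage(records, fn):
    """Apply fn to every record; fn returns a list of (key, value) pairs."""
    return [pair for record in records for pair in fn(record)]

def reduce_stage(pairs, fn):
    """Group pairs by key, then fold each group's values with fn."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        acc = values[0]
        for value in values[1:]:
            acc = fn(acc, value)
        result[key] = acc
    return result

lines = ["spark is fast", "spark is general", "hadoop is batch"]

# Stage 1: map lines to (word, 1), then reduce by word to get word counts.
word_counts = reduce_stage(
    map_stage(lines, lambda line: [(w, 1) for w in line.split()]),
    lambda a, b: a + b,
)

# Stage 2: map (word, count) to (count, [word]), then reduce by count.
# Its input is the in-memory result of stage 1, with no intermediate file.
words_by_count = reduce_stage(
    map_stage(word_counts.items(), lambda kv: [(kv[1], [kv[0]])]),
    lambda a, b: a + b,
)
print(words_by_count)
```

The second stage consumes the first stage's output directly, which is the kind of chaining a DAG of transformations allows within one job.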
