"Spark" RDD mechanism implementation model


Origins of the RDD

The Resilient Distributed Dataset (RDD) is a simple extension and generalization of the MapReduce model. To support iterative, interactive, and streaming queries, RDDs must be able to share data efficiently between parallel computation stages. By combining an efficient data-sharing abstraction with MapReduce-like operations, RDDs allow this whole range of computations to run efficiently while capturing the key optimizations found in today's specialized systems.

An RDD can be understood as a special, fault-tolerant collection of records that is partitioned across the nodes of a cluster and operated on in parallel through functional programming operations. It presents a read-only, shared-memory abstraction: a new RDD can only be produced by transforming an existing RDD (or a stable data source), and its data can be loaded into memory for easy reuse. Its main characteristics are listed below; a short code sketch follows the list.

A. It is distributed: its partitions can be spread across many machines and computed on in parallel.
B. It is resilient (elastic): when memory runs low during a computation, partitions can be spilled to disk.
C. These restrictions (read-only data produced by coarse-grained transformations) greatly reduce the overhead of automatic fault tolerance.
D. In essence, it is a more general iterative parallel computing framework in which the user can explicitly keep intermediate results and reuse them freely in later computations.
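As a minimal sketch (not part of the original article), the Scala code below shows this abstraction in action: a base RDD is created from a (hypothetical) file data.txt, new RDDs are derived only through transformations, and the result is cached in memory so later actions reuse it instead of recomputing.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddSketch").setMaster("local[*]"))

        val lines  = sc.textFile("data.txt")            // base RDD from a (hypothetical) stable source
        val errors = lines.filter(_.contains("ERROR"))  // new RDD derived only by transformation
          .cache()                                      // keep it in memory for repeated reuse

        // Both actions reuse the cached partitions instead of rereading the file
        println(errors.count())
        println(errors.filter(_.contains("timeout")).count())

        sc.stop()
      }
    }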

The RDD is a fault-tolerant distributed storage abstraction that avoids replication. Instead, each RDD remembers the lineage graph of the operations used to build it, much like a batch computing model, so data lost to a failure can be recomputed efficiently. Because the operations that create an RDD are relatively coarse-grained, a single operation applies to many data elements, which is cheaper than replicating the data over the network. RDDs therefore fit today's broad range of data-parallel algorithms and processing models, all of which apply the same operation to many records.
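As a small illustration (an addition, not from the original text), Spark exposes this lineage through toDebugString; what it prints is the chain of coarse-grained operations that would be replayed to rebuild any lost partition, rather than a replica being fetched.

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LineageSketch").setMaster("local[*]"))

        // A chain of coarse-grained transformations; nothing is replicated.
        val squares = sc.parallelize(1 to 1000)
          .map(x => x * x)
          .filter(_ % 2 == 0)

        // Prints the lineage graph Spark would replay to rebuild a lost partition.
        println(squares.toDebugString)

        sc.stop()
      }
    }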

Models implemented by the RDD mechanism

The RDD mechanism implements a variety of models, including several existing cluster programming models as well as new applications that those models did not support. For these models, RDDs not only match the performance of the earlier systems but can also add properties that those systems lack, such as fault tolerance, straggler tolerance, and elasticity. We discuss four classes of models below.

Iterative algorithms

One of the most common workloads for which specialized systems have been built is iterative algorithms, such as those used in graph processing, numerical optimization, and machine learning. RDDs can express a wide variety of such models, including Pregel, iterative MapReduce models such as HaLoop and Twister, and deterministic versions of the GraphLab and PowerGraph models.
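As a minimal sketch of the pattern these models rely on (an illustration added here, with a hypothetical input file points.txt and a toy update rule), the working dataset is cached in memory once and then reused across iterations, each iteration being an ordinary RDD computation:

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

        // Hypothetical input: one "x y" pair of numbers per line
        val points = sc.textFile("points.txt")
          .map { line => val Array(x, y) = line.split(" ").map(_.toDouble); (x, y) }
          .cache()                          // keep the working set in memory across iterations
        val n = points.count()

        var w = 0.0                         // model parameter updated each iteration
        for (_ <- 1 to 10) {
          // Each iteration runs over the cached data; no reloading from disk
          val gradient = points.map { case (x, y) => (w * x - y) * x }.sum() / n
          w -= 0.1 * gradient
        }
        println(s"final w = $w")
        sc.stop()
      }
    }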

Relational queries

One of the earliest demands placed on MapReduce clusters was running SQL queries, both as long-running, multi-hour batch jobs and as ad-hoc interactive queries. This demand drove the development of many parallel database systems for commercial clusters.
Compared with parallel databases, MapReduce has serious shortcomings for interactive queries, for example its fault-tolerance model. We find that by implementing many common database engine techniques (for example, columnar processing) on top of RDD operations, comparable performance can be achieved.
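As an illustration outside the original text, relational queries over RDD-backed data are expressed in Spark through Spark SQL; the sketch below assumes a hypothetical JSON file users.json and runs an aggregate over it on the same engine used for batch jobs.

    import org.apache.spark.sql.SparkSession

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()

        // Hypothetical input with records like {"name": "a", "age": 30, "dept": "eng"}
        val users = spark.read.json("users.json")
        users.createOrReplaceTempView("users")

        // The query runs on the same engine and data representation as batch RDD jobs
        spark.sql("SELECT dept, AVG(age) AS avg_age FROM users GROUP BY dept").show()

        spark.stop()
      }
    }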

MapReduce

RDDs can execute MapReduce programs efficiently by providing a superset of MapReduce, and they also support more general DAG data-flow applications such as DryadLINQ.
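For example, the classic MapReduce word count maps directly onto RDD operations; the sketch below (an added illustration, assuming a hypothetical input file input.txt) plays the map side with flatMap/map and the reduce side with reduceByKey.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCountSketch").setMaster("local[*]"))

        sc.textFile("input.txt")                  // read input records
          .flatMap(_.split("\\s+"))               // "map" side: split into words
          .map(word => (word, 1))
          .reduceByKey(_ + _)                     // "reduce" side: aggregate by key
          .saveAsTextFile("counts")               // write results, as a MapReduce job would

        sc.stop()
      }
    }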

Streaming data processing

The biggest difference between Spark and specialized systems is that Spark also uses RDDs for streaming. Streaming data processing has long been studied in the database and systems communities, but doing it at large scale remains a challenge. Existing models do not handle the stragglers that frequently appear in large clusters, and they offer very limited ways to recover from failures, requiring either heavy replication or long, wasteful recovery times. In particular, existing systems are based on a continuous-operator model, in which long-lived, stateful operators process each record as it arrives. To recover a lost node, such a system must either keep two copies of each operator or replay the upstream data through a costly serial recovery process.
Spark proposes a new model, discretized streams (D-Streams), to address these problems. Instead of long-lived stateful operators, a D-Stream executes the streaming computation as a series of short, deterministic batch computations whose state is stored in RDDs. The D-Stream model achieves fast failure recovery by recomputing lost RDD partitions in parallel along the dependency graph, without requiring replication. In addition, it uses speculative execution to mitigate stragglers, for example by running speculative backup copies of slow tasks. Although dividing the computation into many small independent jobs adds some latency, we show that D-Streams can run with sub-second latency, match the per-node performance of previous systems, and scale linearly to 100 nodes. These strong recovery properties make D-Streams the first streaming model suited to large-scale clusters, and their RDD-based implementation lets applications combine streaming with batch and interactive queries efficiently.
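As a minimal sketch outside the original text, the DStream API in Spark Streaming expresses such a computation as a sequence of small batches (here one-second batches) over data arriving on a socket; the host and port are hypothetical.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
        // Each 1-second batch becomes a short, deterministic RDD computation
        val ssc = new StreamingContext(conf, Seconds(1))

        val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical text source
        lines.flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .print()                                            // emit the counts for each batch

        ssc.start()
        ssc.awaitTermination()
      }
    }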

Summary

Spark brings these models together, and RDDs also support new applications that existing systems cannot express. For example, many streaming applications also need to consult historical data; with RDDs, batch and streaming processing can be combined in the same program, sharing data and recovering from failures consistently across both models. Similarly, the operators of streaming applications often need to run ad-hoc queries over the current state of the stream; the RDDs inside a D-Stream can be queried as if they were static data. Practical applications such as online machine learning and video analytics illustrate these use cases. More generally, batch applications often need to combine several kinds of processing: an application might extract a dataset with SQL, train a machine learning model on that dataset, and then query the model. When such workflows are built by combining multiple separate systems, most of the time is spent on the I/O of the distributed file systems used to pass data between them, which makes them inefficient. With a system built on the RDD mechanism, these computations can run back-to-back in the same engine without that extra I/O.
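To make that combined workflow concrete (this example is an addition, not from the original article), the sketch below chains a SQL extraction, MLlib model training, and a query on the model inside one SparkSession; the file name, column names, and feature layout are hypothetical.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("PipelineSketch").master("local[*]").getOrCreate()

        // 1. Extract a dataset with SQL (hypothetical Parquet file and columns)
        spark.read.parquet("events.parquet").createOrReplaceTempView("events")
        val data = spark.sql("SELECT f1, f2, label FROM events WHERE label IS NOT NULL")

        // 2. Train a machine learning model on the extracted data, in the same engine
        val assembled = new VectorAssembler()
          .setInputCols(Array("f1", "f2"))
          .setOutputCol("features")
          .transform(data)
        val model = new LogisticRegression().setLabelCol("label").fit(assembled)

        // 3. Query the model without writing intermediate data to a distributed file system
        model.transform(assembled).select("label", "prediction").show()

        spark.stop()
      }
    }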

Reprints must credit the author, Jason Ding, and cite the source:
GitCafe blog home page (http://jasonding1354.gitcafe.io/)
GitHub blog home page (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu home page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to reach my blog home page.

Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.

"Spark" RDD mechanism implementation model

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.