Research on Spark distributed computing and RDD model


1 Background Introduction

Today's distributed computing frameworks, such as MapReduce and Dryad, provide high-level primitives that let users write parallel programs easily without worrying about task distribution and fault tolerance. However, these frameworks lack abstractions and support for distributed memory, which makes them less efficient and less powerful in some scenarios. The motivation for the RDD (Resilient Distributed Dataset) model comes mainly from two application scenarios:

Ø Iterative algorithms: iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression

Ø Interactive data mining tools: users run multiple ad hoc queries on the same subset of the data.

What these two scenarios have in common is the reuse of intermediate results across multiple computations or phases of a computation. Unfortunately, in current frameworks such as MapReduce, the only way to reuse data between computations is to save it to an external storage system, such as a distributed file system. This incurs substantial data replication, disk I/O, and serialization overhead, which can account for a large portion of total application execution time.

To solve this problem, researchers have developed specialized frameworks for applications that require such data reuse. For example, Pregel, an iterative graph computation framework, keeps intermediate results in memory. However, these frameworks support only a few specific computation patterns and do not provide a general abstraction for data reuse. Hence the RDD was born; its main features are:

Ø Efficient fault tolerance

Ø A parallel data structure whose intermediate results can be persisted in memory

Ø Control over data partitioning to optimize data placement

Ø A rich set of operations

The biggest challenge in designing RDDs is how to provide efficient fault tolerance. Existing in-memory storage abstractions on clusters, such as distributed shared memory, key-value stores, in-memory databases, and Piccolo, provide fine-grained updates to mutable state, for example cells in a database table. With such a design, fault tolerance requires either replicating data across cluster nodes or logging updates. Both approaches are expensive for data-intensive tasks, because large amounts of data must be copied between nodes, and network bandwidth is far lower than that of RAM.

Unlike these frameworks, RDDs provide coarse-grained transformations such as map, filter, and join, which apply the same operation to many data items. This enables efficient fault tolerance by logging only the coarse-grained transformations used to build a dataset: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just the lost data, avoiding the high cost of data replication.

Although an interface based on coarse-grained transformations may seem limited at first glance, RDDs are actually a good fit for many parallel applications, because these applications naturally apply the same operation to multiple data items. In fact, RDDs can efficiently express the programming models of many frameworks, such as MapReduce, DryadLINQ, SQL, Pregel, and HaLoop, as well as interactive data mining applications that those frameworks cannot handle.

2 RDD Introduction

2.1 Concepts

An RDD is a read-only, partitioned collection of records. In particular, an RDD has the following characteristics:

Ø Creation: an RDD can be created only through transformations (such as map/filter/groupBy/join, as distinct from actions), and only from two kinds of data sources: 1) data in stable storage; 2) other RDDs.

Ø Read-only: its state is immutable and cannot be modified.

Ø Partitioning: the elements of an RDD can be partitioned by key and stored on multiple nodes. During recovery, only the data of lost partitions is recomputed, without affecting the whole system.

Ø Lineage: an RDD carries enough information, called its lineage, about how it was derived from other RDDs.

Ø Persistence: RDDs that will be reused can be cached (e.g. kept in memory or spilled to disk).

Ø Lazy evaluation: like DryadLINQ, Spark computes RDDs lazily, which allows it to pipeline transformations.

Ø Operations: a rich set of actions, such as count/reduce/collect/save.

The difference between a transformation and an action is that the former produces a new RDD, while the latter simply returns the result of an operation on an RDD to the program without producing a new RDD.
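As a minimal illustration (not from the original article; it assumes an existing SparkContext named sc):

    val nums    = sc.parallelize(1 to 5)   // create an RDD from a local collection
    val doubled = nums.map(_ * 2)          // transformation: returns a new RDD, nothing is computed yet
    val total   = doubled.reduce(_ + _)    // action: triggers the computation and returns a value (30)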

2.2 Examples

Suppose a web service on a site is reporting errors, and we want to find the cause of the problem in terabytes of HDFS log files. With Spark we can load the log files into the RAM of the cluster's nodes and query them interactively. The following is a code example:
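The original listing is not reproduced in this text; the following is a reconstruction based on the description in the next paragraph and the log-mining example in the RDD paper (the HDFS path and the ERROR filter condition are placeholders):

    val lines  = sc.textFile("hdfs://...")                   // line 1: create an RDD from an HDFS file
    val errors = lines.filter(_.startsWith("ERROR"))         // line 2: derive a filtered RDD
    errors.cache()                                           // line 3: keep "errors" in memory

    // Query 1: count the lines in errors that mention MySQL
    errors.filter(_.contains("MySQL")).count()

    // Query 2: collect the third field (zero-based index 2) of the lines that mention HDFS
    errors.filter(_.contains("HDFS")).map(_.split('\t')(2)).collect()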

Line 1 creates an RDD from an HDFS file, and line 2 derives from it an RDD filtered by some condition. Line 3 caches the RDD errors in memory, while the first RDD, lines, does not reside in memory. This matters because errors may well be small enough to fit in memory, whereas the raw data is very large. Once cached, the errors data can be reused again and again. We perform two operations here: the first counts the total number of rows in errors that contain MySQL, and the second takes the third column of the lines that contain HDFS and saves it as a collection.

Note Spark's lazy evaluation, mentioned earlier: the Spark scheduler pipelines the filter and map transformations together and sends them to the nodes as a single unit of computation.

2.3 Advantages

The biggest difference between RDDs and DSM (distributed shared memory) is that RDDs can be created only through coarse-grained transformations, whereas DSM allows reads and writes to every memory location. Under this definition, DSM includes not only traditional shared-memory systems but also systems such as Piccolo and distributed databases that share state through a DHT (distributed hash table). RDDs therefore have the following advantages over DSM:

Ø Efficient fault tolerance: there is no checkpointing overhead, because data can be recovered through lineage. Moreover, recovery involves recomputing only the lost partitions, and this recomputation can run in parallel on different nodes without rolling back the whole system.

Ø Straggler mitigation: the immutability of RDDs lets the system run backup tasks, as MapReduce does, to mitigate slow nodes. This is hard to achieve in a DSM system, because multiple copies of the same task would access the same memory locations and interfere with each other.

Ø Bulk operations: tasks can be scheduled based on data locality, which improves performance.

Ø Graceful degradation: when memory is low, partitions that do not fit are spilled to disk, providing performance similar to existing data-parallel computing systems.

2.4 Application Scenarios

RDDs are best suited to batch applications that apply the same operation to every element of a dataset. In this case, an RDD simply records each transformation in its lineage graph and can recover lost partitions without logging large amounts of data. RDDs are not suitable for applications that require asynchronous, fine-grained updates to shared state, such as the storage system of a web application or an incremental web crawler. For such applications, database systems with transactional update logs and data checkpoints are more efficient.

3 RDD Representation

3.1 Inside the RDD

One challenge in using RDDs as an abstraction is choosing a representation that can track lineage across a wide range of transformations. In Spark, a simple graph-based representation is used that lets Spark support a large number of transformation types without adding special processing logic for each one, which greatly simplifies the system design.

In general, each RDD consists of five pieces of information: a set of data partitions; preferred locations for fast access based on data locality; dependencies on other RDDs; a function for computing its partitions; and metadata about whether it is hash/range partitioned. A conceptual sketch of this interface is shown below.
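The following sketch is for illustration only (the names are simplified and do not match Spark's actual internal classes; the placeholder types are hypothetical):

    // Hypothetical placeholder types, for illustration only.
    case class Partition(index: Int)
    trait Dependency
    trait Partitioner

    // The five pieces of information each RDD carries (cf. the common interface in the RDD paper).
    trait RDDInfo[T] {
      def partitions: Seq[Partition]                      // the set of data partitions
      def preferredLocations(p: Partition): Seq[String]   // nodes where p can be read fastest (data locality)
      def dependencies: Seq[Dependency]                   // parent RDDs this RDD was derived from
      def compute(p: Partition): Iterator[T]              // how to compute a partition from its parents
      def partitioner: Option[Partitioner]                // hash/range partitioning metadata, if any
    }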

For example, here is how several RDDs built into Spark fill in these five pieces of information:

Information / RDD  | HadoopRDD                     | FilteredRDD                  | JoinedRDD
Partitions         | One partition per HDFS block  | Same as parent RDD           | One partition per reduce task
PreferredLocations | HDFS block locations          | None (or ask the parent RDD) | None
Dependencies       | None (no parent RDD)          | One-to-one with parent RDD   | Shuffle on each parent RDD
Iterator (compute) | Read the corresponding block  | Filter the parent's data     | Join the shuffled data
Partitioner        | None                          | None                         | HashPartitioner

3.2 Working principle

Now that we understand the concept and internal representation of RDDs, how do they actually work? At a high level, there are three main steps: create the RDD objects; have the DAG scheduler create an execution plan; and have the task scheduler assign tasks and dispatch them to workers to run.

Let's see how RDDs work through the following example, which counts the number of distinct names under each initial letter from A to Z. A sketch of the code is shown below.
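The original listing is not included in the text; the following reconstruction (assuming an existing SparkContext sc and the hdfs://names path mentioned later in this section) uses four transformations followed by a collect action:

    sc.textFile("hdfs://names")                      // transformation 1: one partition per HDFS block
      .map(name => (name.charAt(0), name))           // transformation 2: key each name by its first letter
      .groupByKey()                                  // transformation 3: shuffle, grouping names by letter
      .mapValues(names => names.toSet.size)          // transformation 4: count distinct names per letter
      .collect()                                     // action: return the results to the driver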

Step 1: Create the RDDs. In the example above, apart from the final collect, which is an action and does not create an RDD, each of the first four transformations creates a new RDD. So the first step is to create all the RDDs (each with the five pieces of internal information described above).

Step 2: Create the execution plan. Spark pipelines operations as much as possible and splits the plan into stages wherever the data must be reorganized (shuffled), such as at the groupByKey() transformation in this example, which divides the whole execution plan into two stages. The result is a DAG (directed acyclic graph) that serves as the logical execution plan.

Step 3: Schedule the tasks. Each stage is divided into tasks, each of which is a combination of data and computation. All tasks in the current stage must complete before the next stage begins: because the first transformation in the next stage reorganizes the data, it has to wait until all of the current stage's results have been computed.

Suppose that in this example there are four file blocks under hdfs://names. Then HadoopRDD will have four entries in its partitions field, corresponding to the four blocks, and preferredLocations will indicate the preferred location of each block. Four tasks can then be created and dispatched to the appropriate cluster nodes.

3.3 Shuffle

(To be added: how the shuffle is performed)

3.4 Narrow and Wide Dependencies

An interesting question in designing the RDD interface is how to represent the dependencies between RDDs. In RDDs, dependencies fall into two types: narrow dependencies and wide dependencies. A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD. Conversely, a wide dependency means that a partition of the parent RDD is depended on by multiple partitions of the child RDD. For example, map produces a narrow dependency, while join produces wide dependencies (unless the parent RDDs are hash-partitioned).

This distinction is useful in two ways. First, narrow dependencies allow pipelined execution on a single node; for example, based on the one-to-one relationship, a map can run right after a filter. Second, narrow dependencies make failure recovery more efficient, because only the lost partitions of the parent RDD need to be recomputed. With a wide dependency, a single node failure may cause partitions from all parent RDDs to be lost, requiring complete re-execution. For wide dependencies, Spark therefore simplifies recovery by persisting the intermediate data on the nodes that hold each parent partition, much as MapReduce persists map outputs.
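As an illustrative sketch (not from the original article; it assumes an existing SparkContext sc), one can inspect an RDD's dependencies to see which kind an operation produced:

    import org.apache.spark.{Dependency, OneToOneDependency, ShuffleDependency}

    val nums = sc.parallelize(1 to 100, 4)

    val mapped  = nums.map(_ * 2)                         // narrow: each child partition uses one parent partition
    val grouped = nums.map(n => (n % 10, n)).groupByKey() // wide: child partitions read from all parent partitions

    def describe(deps: Seq[Dependency[_]]): Unit = deps.foreach {
      case _: OneToOneDependency[_]      => println("narrow (one-to-one) dependency")
      case _: ShuffleDependency[_, _, _] => println("wide (shuffle) dependency")
      case d                             => println(d.getClass.getSimpleName)
    }

    describe(mapped.dependencies)   // prints: narrow (one-to-one) dependency
    describe(grouped.dependencies)  // prints: wide (shuffle) dependency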

4 Internal Implementation

4.1 Scheduler

The Spark scheduler is similar to Dryad's, but additionally takes into account whether a persisted RDD partition is already in memory. Revisiting the earlier example: the scheduler builds a DAG of stages from the RDD lineage, where each stage contains as many pipelined transformations with narrow dependencies as possible; shuffle operations with wide dependencies form the stage boundaries; and the scheduler dispatches tasks to cluster nodes based on data locality.
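As a small illustration (reusing the name-counting example and assuming an existing SparkContext sc), toDebugString prints an RDD's lineage, and its indentation reflects the shuffle boundaries at which the scheduler splits stages:

    val counts = sc.textFile("hdfs://names")
      .map(name => (name.charAt(0), name))
      .groupByKey()
      .mapValues(_.toSet.size)

    println(counts.toDebugString)   // two indentation levels, split at the groupByKey shuffle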

4.2 Interpreter Integration

(To be added)

4.3 Memory Management

Spark supports three storage options for persisted RDDs: in-memory storage as deserialized Java objects, in-memory storage as serialized data, and on-disk storage. The first gives the fastest performance, because the JVM can access each RDD element natively. The second lets users choose a more memory-efficient representation than the Java object graph when space is limited, at some cost in access speed. The third is useful for RDDs that are too large to keep in memory but too expensive to recompute each time.
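As a sketch, these three options roughly correspond to Spark's storage levels (the RDD logs and its input path are hypothetical; an RDD's storage level can be set only once, so pick one):

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs://...")             // hypothetical input

    logs.persist(StorageLevel.MEMORY_ONLY)           // 1) deserialized Java objects in memory (fastest access)
    // logs.persist(StorageLevel.MEMORY_ONLY_SER)    // 2) serialized bytes in memory (more compact, extra CPU)
    // logs.persist(StorageLevel.DISK_ONLY)          // 3) on disk, for RDDs too large for memory but costly to recompute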

At the same time, when a new RDD partition is computed and there is not enough memory, Spark uses an LRU policy to evict an old partition, spilling it to disk if its storage level allows.

4.4 Checkpoint Support

Although an RDD's lineage can be used to recover data, recovery can be time-consuming when lineage chains are long. So it can be useful to checkpoint some RDDs to disk, such as the wide-dependency intermediate data mentioned earlier. For Spark, checkpoint support is straightforward because RDDs are immutable: an RDD can be written out in the background without pausing the whole system.
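A minimal sketch of checkpointing (assuming an existing SparkContext sc; the checkpoint directory path is a placeholder):

    sc.setCheckpointDir("hdfs://.../checkpoints")     // stable storage for checkpoint files

    val grouped = sc.textFile("hdfs://names")
      .map(name => (name.charAt(0), name))
      .groupByKey()                                   // wide dependency: a good candidate for checkpointing

    grouped.checkpoint()                              // mark for checkpointing before the first job runs
    grouped.count()                                   // an action triggers computation and the checkpoint write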

5 Advanced Features

(To be added: broadcast ...)

6 References

This article draws mainly from: 1) the RDD paper, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"; 2) Spark Summit slides: "A Deeper Understanding of Spark Internals" and "Introduction to Spark Internals". Interested readers can look these up on their own.

Original address: http://blog.csdn.net/dc_726/article/details/41381791
