"Spark" Elastic Distributed Data Set RDD overview


Resilient Distributed Dataset (RDD)

An RDD (Resilient Distributed Dataset) is Spark's most basic abstraction: an abstraction of distributed memory that lets you work with a distributed dataset the way you would operate on a local collection. The RDD is the core of Spark. It represents a dataset that is partitioned, immutable, and can be operated on in parallel; different dataset formats correspond to different RDD implementations, and an RDD must be serializable. An RDD can be cached in memory, so the result of each operation on the dataset can be kept in memory and the next operation can read its input directly from memory, eliminating most of MapReduce's disk I/O. For the iterative computations common in machine learning algorithms, and for interactive data mining, this yields a large gain in efficiency.
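
Below is a minimal sketch of this in-memory reuse pattern in Scala, assuming a local Spark installation; the application name, master URL, and data are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local example: build an RDD, cache it, and reuse it across
// two actions so the second pass reads from memory instead of recomputing.
val sc = new SparkContext(new SparkConf().setAppName("rdd-cache").setMaster("local[*]"))

val nums    = sc.parallelize(1 to 1000000)          // distributed, partitioned collection
val squares = nums.map(x => x.toLong * x).cache()   // keep the result in memory after the first action

println(squares.count())         // first action: computes the RDD and populates the cache
println(squares.reduce(_ + _))   // second action: served from the cached partitions

sc.stop()
```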

You can think of an RDD as one large collection whose data is loaded into memory so it can be reused many times. First, it is distributed: it can be spread across multiple machines for computation. Second, it is resilient (elastic): when a machine runs short of memory during processing, the RDD exchanges data with the disk, which reduces performance to some degree but ensures that the computation can continue.
RDD Features

An RDD is a distributed, read-only, partitioned collection object. These collections are resilient: if part of the dataset is lost, it can be rebuilt. RDDs offer automatic fault tolerance, location-aware scheduling, and scalability, and fault tolerance is the hardest of these to implement. Most distributed datasets achieve fault tolerance in one of two ways: data checkpoints or logging of data updates. For large-scale data analysis systems, checkpointing is expensive, mainly because of the cost of transferring large volumes of data between servers. Compared with logging every data update, an RDD supports only coarse-grained transformations, i.e. it records how it was derived from other RDDs (its lineage) so that lost partitions can be recovered by recomputation.
Its characteristics are:

  1. The data storage structure is immutable
  2. Distributed data operations are supported across the cluster
  3. Data records can be partitioned by key
  4. Coarse-grained transformation operations are provided
  5. Data is kept in memory to ensure low latency (see the sketch below)
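
The following sketch, assuming a local SparkContext, illustrates several of these properties: every transformation produces a new RDD (immutability), records can be partitioned by key, and results can be pinned in memory. The names and data are invented.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-features").setMaster("local[2]"))

val words  = sc.parallelize(Seq("spark", "rdd", "spark", "hadoop"))
val pairs  = words.map(w => (w, 1))                     // new RDD; `words` itself is unchanged
val byKey  = pairs.partitionBy(new HashPartitioner(2))  // records partitioned by key
val counts = byKey.reduceByKey(_ + _).cache()           // coarse-grained operation, kept in memory

counts.collect().foreach(println)
sc.stop()
```
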
The benefits of RDD
    • An RDD can only be created from persistent storage or through transformation operations, which makes fault tolerance more efficient than in distributed shared memory (DSM): a lost data partition can simply be recomputed from its lineage, without needing a dedicated checkpoint.
    • The immutability of RDDs enables Hadoop MapReduce-style speculative execution.
    • RDD partitioning can exploit data locality to improve performance, just as Hadoop MapReduce does.
    • RDDs are serializable and can automatically fall back to disk storage when memory runs out; with the RDD stored on disk, performance drops significantly, but it is still no worse than today's MapReduce (see the sketch after this list).
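
A minimal sketch of the spill-to-disk behaviour mentioned above, assuming an existing SparkContext `sc`; the HDFS path is hypothetical. MEMORY_AND_DISK keeps partitions in memory and writes the ones that do not fit to local disk instead of dropping them.

```scala
import org.apache.spark.storage.StorageLevel

val big = sc.textFile("hdfs:///data/input")    // hypothetical input path
  .flatMap(_.split("\\s+"))
  .persist(StorageLevel.MEMORY_AND_DISK)       // spill partitions to disk when memory is exhausted

println(big.count())   // partitions that did not fit in memory are read back from disk
```
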
RDD Programming Interface

An RDD supports two types of operations: transformations and actions. Their essential difference is:

A transformation returns another RDD. It follows a chained-call design: computing over one RDD transforms it into another RDD, which can in turn be transformed again, and this whole process is distributed. An action does not return an RDD. It returns an ordinary Scala collection, a single value, or nothing, and the result is either returned to the driver program or the RDD is written out to a file system.

Transformations return an RDD, e.g. map, filter, union;
Actions return a result to the driver or persist the RDD, e.g. count, collect, save (both are illustrated in the sketch below).
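
The following sketch, assuming an existing SparkContext `sc`, shows the split in practice: transformations are lazy and only describe a new RDD, while actions trigger execution and bring a result back to the driver or write it out. The data and output path are invented.

```scala
val lines    = sc.parallelize(Seq("a b", "b c", "c d"))
val words    = lines.flatMap(_.split(" "))      // transformation: RDD -> RDD, nothing runs yet
val filtered = words.filter(_ != "b")           // transformation: still no job submitted

val n   = filtered.count()                      // action: returns a Long to the driver
val all = filtered.collect()                    // action: returns a local Array[String]
filtered.saveAsTextFile("/tmp/filtered-words")  // action: writes the RDD to a file system
```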

RDD Dependency Relationships

Depending on the nature of the operation, different kinds of dependencies are produced. There are two types of dependencies between RDDs:

  • Narrow dependency (Narrow Dependencies)
    Each partition of the parent RDD is referenced by at most one partition of the child RDD. This appears as one parent partition corresponding to one child partition, or partitions of multiple parent RDDs corresponding to one child partition; in other words, one partition of a parent RDD can never correspond to multiple partitions of a child RDD. Operations such as map, filter, and union produce narrow dependencies.
  • Wide dependency (Wide Dependencies)
    A partition of the child RDD depends on multiple (or all) partitions of the parent RDD; that is, one partition of the parent RDD corresponds to multiple partitions of the child RDD. Operations such as groupByKey produce wide dependencies (see the sketch after this list).
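
The sketch below, assuming an existing SparkContext `sc` and toy data, shows how the two dependency types can be observed: mapValues keeps each child partition tied to a single parent partition, while groupByKey needs records from many parent partitions and therefore shuffles.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

val narrow = pairs.mapValues(_ * 10)   // narrow dependency: no shuffle
val wide   = pairs.groupByKey()        // wide dependency: data is shuffled across partitions

println(narrow.dependencies)   // OneToOneDependency on `pairs`
println(wide.dependencies)     // ShuffleDependency on `pairs`
```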

[Figure omitted: a solid blue box represents a partition, and a blue-edged rectangle represents an RDD.]

Stage DAG

When Spark submits a job, it generates multiple stages, and these stages depend on one another; the dependencies between stages form a DAG (directed acyclic graph).
For narrow dependencies, Spark tries to place as many RDD transformations as possible in the same stage. Wide dependencies usually imply a shuffle, so Spark defines such a stage as a ShuffleMapStage, which makes it easy to register the shuffle operation with the MapOutputTracker. Spark therefore typically uses the shuffle as the boundary between stages.
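
As a hedged illustration of the stage split, assuming an existing SparkContext `sc` and a made-up HDFS path: everything up to the shuffle fuses into one stage, and reduceByKey introduces a shuffle boundary that starts a new stage.

```scala
val report = sc.textFile("hdfs:///data/logs")     // hypothetical input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))            // narrow: stays in the first stage
  .reduceByKey(_ + _)                // wide: shuffle -> stage boundary (ShuffleMapStage)
  .map { case (w, c) => s"$w\t$c" }  // narrow: runs in the stage after the shuffle

println(report.toDebugString)        // the indentation in the output marks the shuffle/stage boundary
```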

RDD Data Storage Management

An RDD can be abstractly understood as a large array, but one that is distributed across the cluster. Each logical block of the RDD is called a partition.
During Spark execution, an RDD goes through one or more transformation operators and is finally triggered by an action operator. Each transformation logically turns the RDD into a new RDD, and the lineage between RDDs plays a very important role in fault tolerance. Both the input and the output of a transformation are RDDs. An RDD is divided into many partitions distributed across multiple nodes in the cluster. A partition is a logical concept: the old and new partitions before and after a transformation may physically be stored in the same piece of memory. This is an important optimization that prevents the unbounded growth of memory demand that functional immutability would otherwise cause. Some RDDs are intermediate results of a computation, so their partitions do not necessarily have corresponding data in memory or on disk; if the data is to be reused iteratively, the cache() function can cache it (a sketch of this pattern follows).
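
Below is a compact, simplified sketch of the iterative-reuse pattern (loosely modelled on PageRank), assuming an existing SparkContext `sc`; the page names, damping factor, and iteration count are invented. The links RDD is cached once and then scanned on every iteration without being recomputed from its lineage.

```scala
val links = sc.parallelize(Seq(("page1", "page2"), ("page2", "page1"))).cache()  // reused every iteration
var ranks = sc.parallelize(Seq(("page1", 1.0), ("page2", 1.0)))

for (_ <- 1 to 10) {
  // each iteration builds a new ranks RDD; the cached links RDD is read from memory
  val contribs = links.join(ranks).map { case (_, (dst, rank)) => (dst, rank * 0.85) }
  ranks = contribs.reduceByKey(_ + _).mapValues(_ + 0.15)
}

ranks.collect().foreach(println)
```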

In the original figure (not reproduced here), RDD1 contains 5 partitions (p1, p2, p3, p4, p5) stored across 4 nodes (Node1, Node2, Node3, Node4), and RDD2 contains 3 partitions (p1, p2, p3) distributed across 3 nodes (Node1, Node2, Node3).

Physically, an RDD object is essentially a metadata structure that stores the mapping between blocks and nodes, together with other metadata. An RDD is a group of partitions, and in the physical data store each partition of the RDD corresponds to one block; a block can be held in memory, or stored on disk when there is not enough memory.
Each block stores a subset of all the data items in the RDD. What is exposed to the user can be an iterator over a block (for example, the user can obtain a partition iterator through mapPartitions), or a single data item (for example, each data item can be computed on in parallel with the map function). The book this material comes from introduces the underlying implementation of data management in detail in later chapters.
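
A small sketch of the two access styles, assuming an existing SparkContext `sc` and invented data: mapPartitions hands the user one iterator per partition/block, while map is called once per data item.

```scala
val data = sc.parallelize(1 to 100, numSlices = 4)

val perPartitionSums = data.mapPartitions { iter =>   // one iterator per partition
  Iterator.single(iter.sum)
}
val perItem = data.map(_ * 2)                          // one call per data item

println(perPartitionSums.collect().mkString(", "))     // four partial sums, one per partition
```
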
If external storage such as HDFS is used as the input data source, the data is partitioned according to the data distribution policy in HDFS, and one block in HDFS corresponds to one Spark partition. Spark also supports repartitioning: Spark's default partitioner, or a user-defined one, determines which nodes the data blocks are distributed to. For example, hash partitioning (data items are hashed by key, and elements with the same hash value end up in the same partition) and range partitioning (data belonging to the same key range is placed in the same partition) are both supported.
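
The sketch below, assuming an existing SparkContext `sc` and toy data, repartitions a pair RDD with the two built-in partitioners named above: HashPartitioner sends keys with the same hash to the same partition, and RangePartitioner groups contiguous key ranges together.

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

val hashed = pairs.partitionBy(new HashPartitioner(4))         // same key hash -> same partition
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs)) // contiguous key ranges per partition

println(hashed.partitioner)        // Some(org.apache.spark.HashPartitioner@...)
println(ranged.getNumPartitions)   // 4 (or fewer if there are not enough distinct keys)
```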

For reprints, please credit the author, Jason Ding, and the source:
Gitcafe Blog (http://jasonding1354.gitcafe.io/)
GitHub Blog (http://jasonding1354.github.io/)
CSDN Blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Baidu to find my blog homepage

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.

"Spark" Elastic Distributed Data Set RDD overview

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.