Resilient Distributed Dataset (RDD)
The RDD (Resilient Distributed Dataset) is Spark's most basic abstraction: an abstraction of distributed memory that lets you work with a distributed dataset the way you would operate on a local collection. The RDD is the core of Spark. It represents a dataset that is partitioned, immutable, and can be operated on in parallel; different dataset formats correspond to different RDD implementations. An RDD must be serializable. An RDD can be cached in memory, so the result of each operation on the dataset can be kept in memory and fed directly into the next operation, eliminating the bulk of MapReduce's disk I/O. For iterative workloads such as common machine learning algorithms, and for interactive data mining, this yields a substantial efficiency gain.
You can think of an RDD as a large collection that loads all of its data into memory for convenient repeated reuse. First, it is distributed: it can be spread across multiple machines for computation. Second, it is resilient (elastic): if memory runs short during computation, it exchanges data with disk; this degrades performance to some extent, but ensures that the computation can continue.
RDD Features
An RDD is a distributed, read-only, partitioned collection. These collections are resilient: if part of the dataset is lost, it can be rebuilt. RDDs provide automatic fault tolerance, location-aware scheduling, and scalability. Fault tolerance is the hardest to implement; most distributed datasets take one of two approaches: data checkpoints or logging data updates. For large-scale data analysis systems, checkpointing is expensive, mainly because of the cost of transferring large volumes of data between servers. Compared with logging every fine-grained update, an RDD supports only coarse-grained transformations, recording how it was derived from other RDDs (its lineage) so that lost partitions can be recomputed.
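The lineage-based recovery described above can be sketched with a toy, in-process model. This is a hypothetical illustration of the idea, not Spark's implementation: each derived dataset keeps a pointer to its parent plus the function that produced it, so a lost partition is recomputed rather than restored from a checkpoint.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD's lineage idea (not Spark code)."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions   # list of lists (one list per partition)
        self.parent = parent           # lineage pointer to the parent dataset
        self.fn = fn                   # the coarse-grained transformation applied

    def map(self, f):
        # Produce a new dataset; the old one is never mutated (immutability).
        new_parts = [[f(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, parent=self, fn=f)

    def recompute_partition(self, i):
        # Recover a lost partition from the parent via the recorded function,
        # instead of reading it back from a checkpoint.
        return [self.fn(x) for x in self.parent.partitions[i]]


base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None               # simulate losing partition 1
print(doubled.recompute_partition(1))      # -> [6, 8]
```

Only the derivation (one function per transformation) is logged, which is why this style of fault tolerance is cheap compared with checkpointing the data itself.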
Its characteristics are:
- Immutable data storage structure
- Support for distributed operations across a cluster
- Data records can be partitioned by key
- Provides coarse-grained transformation operations
- Data is kept in memory to ensure low latency
Benefits of the RDD
- An RDD can only be created from persistent storage or through transformation operations, which makes fault tolerance more efficient than in distributed shared memory (DSM): lost data partitions can be recomputed from lineage alone, without specific checkpoints.
- The immutability of RDDs enables speculative execution, as in Hadoop MapReduce.
- The data-partitioning feature of RDDs can improve performance through data locality, as in Hadoop MapReduce.
- RDDs are serializable and can automatically degrade to disk storage when memory runs out. With RDDs stored on disk, performance drops greatly but is still no worse than today's MapReduce.
RDD Programming Interface
For an RDD, there are two kinds of operations: transformations and actions. Their essential differences are:
A transformation returns an RDD. It uses a chained-call design pattern: one RDD is computed and transformed into another RDD, which can in turn be transformed again, and this process is distributed. An action does not return an RDD: it returns an ordinary Scala collection, a value, or nothing, and the result either goes back to the driver program or the RDD is written to a file system.
Transformations (conversion operations) return an RDD, such as map, filter, union;
Actions return a result or persist the RDD, such as count, collect, save.
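The transformation/action split can be illustrated with plain Python iterators as a stand-in for RDDs (a rough analogy, not the Spark API): transformations only describe a lazy pipeline, while an action forces evaluation and returns a plain value to the caller.

```python
data = range(1, 6)

# "Transformations": nothing is computed yet; only a pipeline is described.
mapped = (x * 10 for x in data)            # analogous to map
filtered = (x for x in mapped if x > 20)   # analogous to filter

# "Action": iterating the pipeline materializes a concrete result,
# analogous to collect() returning a collection to the driver.
result = list(filtered)
print(result)                              # -> [30, 40, 50]

# Another "action": counting the elements, analogous to count().
count = sum(1 for _ in (x * 10 for x in range(1, 6)))
print(count)                               # -> 5
```

The key point mirrored here is that chaining transformations is cheap and composable; work happens only when an action demands a result.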
RDD Dependency Relationships
Depending on the nature of the operation, different dependencies may be generated. There are two types of dependencies between RDDs:
- Narrow dependency (Narrow Dependencies)
Each partition of the parent RDD is used by at most one partition of the child RDD. Put differently, one partition of the child RDD may correspond to one or more partitions of the parent RDD, but one partition of the parent RDD cannot correspond to multiple partitions of the child RDD. Operations such as map, filter, and union produce narrow dependencies;
- Wide dependency (Wide Dependencies)
A partition of the child RDD depends on several, or all, partitions of the parent RDD; that is, one partition of the parent RDD corresponds to multiple partitions of the child RDD. Operations such as groupByKey produce wide dependencies;
In the original figure (not reproduced here), a solid blue box represents a partition, and a blue-edged rectangle represents an RDD.
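The difference between the two dependency types can be made concrete by listing which parent partitions each child partition reads from. The sketch below is a toy illustration (not Spark internals), using integer keys and a deterministic `key % N` routing in place of a real hash partitioner:

```python
NUM_PARTS = 3
# Parent RDD: 3 partitions of (key, value) records.
parent = [[(0, "x"), (1, "y")], [(0, "z")], [(2, "w")]]

# Narrow dependency (map-like): child partition i is computed
# from parent partition i only.
narrow_deps = {i: {i} for i in range(NUM_PARTS)}

# Wide dependency (groupByKey-like): records are re-routed by key,
# so a child partition may read from every parent partition that
# holds a matching key.
wide_deps = {i: set() for i in range(NUM_PARTS)}
for src, part in enumerate(parent):
    for key, _ in part:
        dst = key % NUM_PARTS          # deterministic toy "hash"
        wide_deps[dst].add(src)

print(narrow_deps)   # each child partition depends on exactly one parent
print(wide_deps)     # child partition 0 depends on parents {0, 1}
```

Narrow dependencies keep the parent-to-child mapping one-to-one per partition, which is why they can be pipelined on one node; the fan-in visible in `wide_deps` is exactly what forces a shuffle.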
Stage DAG
When Spark submits a job, it generates multiple stages; the stages depend on one another, and these dependencies form a DAG (directed acyclic graph).
For narrow dependencies, Spark places as many RDD transformations as possible within the same stage. A wide dependency almost always implies a shuffle, so Spark defines such a stage as a ShuffleMapStage, which makes it convenient to register the shuffle operation with the MapOutputTracker. Spark therefore usually treats a shuffle as the boundary of a stage.
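The stage-cutting rule above can be sketched for a linear chain of operations. This is a hedged simplification of what the scheduler does (real stage construction walks the RDD dependency graph); the operation names and the wide-operation set are illustrative:

```python
# Shuffle-producing (wide-dependency) operations mark stage boundaries.
WIDE_OPS = {"groupByKey", "reduceByKey", "join"}

def split_into_stages(ops):
    """Cut a linear op chain into stages at shuffle boundaries."""
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE_OPS:        # shuffle ends the current stage
            stages.append(current)
            current = []
    if current:                   # trailing narrow ops form a final stage
        stages.append(current)
    return stages

pipeline = ["map", "filter", "groupByKey", "map", "reduceByKey", "map"]
print(split_into_stages(pipeline))
# -> [['map', 'filter', 'groupByKey'], ['map', 'reduceByKey'], ['map']]
```

Note how every narrow operation is pipelined into the stage that precedes the next shuffle, which is exactly why chains of map/filter cost no extra data movement.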
RDD Data Storage Management
The RDD can be abstractly understood as a large array, but one that is distributed across the cluster. Each logical fragment of the RDD is called a partition.
During Spark's execution, an RDD passes through transformation operators and is finally materialized when an action operator is triggered. Logically, each transformation converts one RDD into a new RDD, and the lineage between RDDs plays a very important role in fault tolerance. Both the input and the output of a transformation are RDDs. An RDD is divided into a number of partitions distributed across multiple nodes in the cluster. A partition is a logical concept: the old and new partitions before and after a transformation may physically be stored in the same piece of memory. This is an important optimization that prevents the unlimited growth of memory requirements that functional immutability would otherwise cause. Some RDDs are intermediate results of a computation, so their partitions do not necessarily have corresponding data in memory or on disk; if the data is to be reused iteratively, the cache() function can cache it.
In the original figure, RDD1 contains 5 partitions (p1 through p5) stored across 4 nodes (Node1 through Node4), and RDD2 contains 3 partitions (p1, p2, p3) distributed across 3 nodes (Node1, Node2, Node3).
Physically, the RDD object is essentially a metadata structure that stores the mapping between blocks and nodes, along with other metadata. An RDD is a set of partitions; in physical data storage, each partition of the RDD corresponds to a block. A block can be kept in memory, or stored on disk when memory is insufficient.
Each block stores a subset of all the data items in the RDD. What is exposed to the user can be an iterator over a block (for example, the user can obtain a per-partition iterator through mapPartitions), or an individual data item (for example, the map function computes each data item in parallel). This book provides a detailed introduction to the underlying implementation of data management in later chapters.
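The two access granularities just described can be sketched with plain Python lists standing in for partitions (a toy model, not the Spark API):

```python
partitions = [[1, 2, 3], [4, 5]]

# map-style access: the function sees one data item at a time.
per_item = [[x * x for x in p] for p in partitions]

# mapPartitions-style access: the function receives the whole
# partition iterator, so it can amortize per-partition work
# (here, a single running sum per partition).
def summarize(it):
    yield sum(it)

per_partition = [list(summarize(iter(p))) for p in partitions]

print(per_item)        # -> [[1, 4, 9], [16, 25]]
print(per_partition)   # -> [[6], [9]]
```

Per-partition iterators are the natural fit when setup cost (opening a connection, building a buffer) should be paid once per block rather than once per record.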
If external storage such as HDFS is used as the input data source, the data is partitioned according to the data-distribution policy in HDFS, with one block in HDFS corresponding to one partition in Spark. Spark also supports repartitioning: which nodes a data block is distributed to is determined by Spark's default partitioner or a user-defined one. For example, partitioning strategies such as hash partitioning (hashing the key of each data item, so elements with the same hash value land in the same partition) and range partitioning (data belonging to the same key range is placed in the same partition) are both supported.
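The two partitioning strategies can be sketched as follows. This uses a deterministic toy routing (`key % n`) in place of a real hash function; Spark's own partitioner classes differ in detail:

```python
def hash_partition(records, num_parts):
    """Route each (key, value) record by a toy hash of its key."""
    parts = [[] for _ in range(num_parts)]
    for key, value in records:
        parts[key % num_parts].append((key, value))  # same key -> same partition
    return parts

def range_partition(records, bounds):
    """Route records by key range; bounds are the lower limits of
    partitions 1..n, so len(bounds) + 1 partitions are produced."""
    parts = [[] for _ in range(len(bounds) + 1)]
    for key, value in records:
        idx = sum(1 for b in bounds if key >= b)     # find the key's range
        parts[idx].append((key, value))
    return parts

recs = [(1, "a"), (5, "b"), (9, "c"), (5, "d")]
print(hash_partition(recs, 3))
# -> [[(9, 'c')], [(1, 'a')], [(5, 'b'), (5, 'd')]]
print(range_partition(recs, [4, 8]))
# -> [[(1, 'a')], [(5, 'b'), (5, 'd')], [(9, 'c')]]
```

Hash partitioning balances keys roughly evenly, which suits aggregations; range partitioning keeps ordered neighborhoods together, which suits sorting and range queries.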
To reprint, please credit the author, Jason Ding, and indicate the provenance:
- GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
- GitHub blog homepage (http://jasonding1354.github.io/)
- CSDN blog (http://blog.csdn.net/jasonding1354)
- Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
- Search "jasonding1354" on Baidu to find my blog homepage
Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.
"Spark" Elastic Distributed Data Set RDD overview