Understanding RDD, the core of Spark

Unlike many specialized big-data processing platforms, Spark is built on the unified abstraction of the RDD, which makes it possible to handle different big-data processing scenarios in a fundamentally consistent way, including MapReduce, streaming, SQL, machine learning, and graph processing. This is what Matei Zaharia calls designing a "unified programming abstraction," and it is what makes Spark so fascinating.

To understand Spark, you need to understand the RDD. So what is an RDD?

RDD, short for Resilient Distributed Dataset, is a fault-tolerant, parallel data structure that lets users explicitly persist data to memory or disk and control how the data is partitioned. At the same time, the RDD provides a rich set of operations to manipulate that data. Among these operations, transformations such as map, flatMap, and filter implement the monad pattern and fit naturally with Scala's collection operations. In addition, the RDD provides convenient operations such as join, groupBy, and reduceByKey (note that reduceByKey is a transformation rather than an action) to support common data processing.
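
A minimal sketch of these operations (assuming sc is an existing SparkContext; the sample data and names are made up for illustration):

// Transformations such as flatMap, filter, map and reduceByKey build new RDDs lazily.
val lines  = sc.parallelize(Seq("spark builds on rdds", "rdds are resilient"))
val words  = lines.flatMap(_.split(" "))
val pairs  = words.filter(_.nonEmpty).map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)

// collect() is an action; only here does Spark actually run the computation.
counts.collect().foreach(println)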

Generally speaking, there are several common models for data processing: iterative algorithms, relational queries, MapReduce, and stream processing. For example, Hadoop MapReduce implements the MapReduce model and Storm implements the stream processing model. The RDD blends these four models, which is what allows Spark to be applied to such a wide range of big-data processing scenarios.

As a data structure, the RDD is essentially a read-only, partitioned collection of records. An RDD may contain multiple partitions, each of which is a fragment of the dataset. RDDs can depend on one another. If each partition of the parent RDD is used by at most one partition of the child RDD, the dependency is called a narrow dependency; if a partition can be relied upon by multiple child partitions, it is called a wide dependency. Different operations have different dependencies according to their characteristics. For example, a map operation produces a narrow dependency, while a join operation produces a wide dependency.
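
As a rough sketch (assuming sc is an existing SparkContext), the dependency type can be observed on an RDD through its dependencies field: a map keeps a narrow, one-to-one dependency on its parent, while reduceByKey introduces a ShuffleDependency, i.e. a wide dependency:

val nums   = sc.parallelize(1 to 100, numSlices = 4)
val mapped = nums.map(_ * 2)                                   // narrow dependency on nums
val byKey  = mapped.map(n => (n % 10, n)).reduceByKey(_ + _)   // wide dependency (shuffle)

// Prints OneToOneDependency for mapped and ShuffleDependency for byKey.
println(mapped.dependencies.map(_.getClass.getSimpleName).mkString(", "))
println(byKey.dependencies.map(_.getClass.getSimpleName).mkString(", "))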

Spark divides dependencies into narrow and wide for two reasons.

First, narrow dependencies allow multiple operations to be pipelined on the same cluster node, for example running a filter immediately after a map. Wide dependencies, by contrast, require all parent partitions to be available and may require a MapReduce-style shuffle to move data across nodes.

Second, consider failure recovery. Recovering from a failure under narrow dependencies is more efficient, because only the lost parent partitions need to be recomputed, and that recomputation can run in parallel on different nodes. Under wide dependencies, a single lost partition may involve partitions from all levels of parent RDDs. The following figure illustrates the difference between narrow dependencies and wide dependencies:

This figure comes from Matei Zaharia's dissertation, An Architecture for Fast and General Data Processing on Large Clusters. In the figure, each box represents an RDD and each shaded rectangle represents a partition.

How does the RDD guarantee efficient data processing?

The RDD provides two features, persistence and partitioning, which users can control through the persist and partitionBy functions. The partitioning and parallel-computing capabilities of the RDD (Spark defines the parallelize function on the SparkContext) allow Spark to make better use of scalable hardware resources. Combining partitioning with persistence makes it possible to process large amounts of data more efficiently. For example:

input.map(parseArticle _).partitionBy(partitioner).cache()

The partitionBy function takes a Partitioner object, for example:

val partitioner = new HashPartitioner(sc.defaultParallelism)
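
Putting the two lines above together, a fuller sketch might look like the following. Here input and parseArticle are hypothetical: partitionBy is only available on key-value RDDs, so parseArticle is assumed to parse each line into an (id, body) pair:

import org.apache.spark.HashPartitioner

// Hypothetical parser: turns a raw tab-separated line into an (articleId, body) pair.
def parseArticle(line: String): (String, String) = {
  val Array(id, body) = line.split("\t", 2)
  (id, body)
}

val partitioner = new HashPartitioner(sc.defaultParallelism)

// Partition by article id and keep the partitioned data in memory,
// so that later key-based operations avoid repeated shuffles.
val articles = input.map(parseArticle _)
  .partitionBy(partitioner)
  .cache()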

The RDD is essentially an in-memory dataset, and when an RDD is accessed, the pointer refers only to the part relevant to the operation. For example, consider a column-oriented data structure in which one array holds Int values and another holds Float values. If only the Int field is needed, the RDD pointer can access the Int array alone, avoiding a scan of the entire data structure.

RDD operations fall into two categories: transformations and actions. No matter how many transformations are applied, the RDD does not actually execute anything; computation is triggered only when an action is run. In the internal implementation of the RDD, the underlying interface is based on iterators, which makes data access more efficient and avoids the memory cost of materializing large numbers of intermediate results.
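
A small sketch of this laziness (assuming sc is an existing SparkContext): the two transformations below only record the computation, and nothing runs until the action at the end:

val data     = sc.parallelize(1 to 1000000)
val squared  = data.map(n => n * n)        // transformation: nothing executed yet
val filtered = squared.filter(_ % 2 == 0)  // transformation: still nothing executed

// Only this action triggers the pipeline; records stream through iterators
// without materializing the intermediate squared/filtered collections.
val total = filtered.count()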

In the implementation, each transformation has a corresponding RDD subclass; for example, the map operation returns a MappedRDD and flatMap returns a FlatMappedRDD. When we perform a map or flatMap operation, we simply pass the current RDD object to the constructor of the corresponding RDD subclass. For example:

def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

Each of these RDD subclasses defines a compute function. This function is triggered when an action is invoked, and the corresponding transformation is then applied through an iterator inside the function:

private[spark]
class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  // The partitions are taken directly from the parent RDD.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // compute is only invoked when an action runs; it lazily maps
  // the parent's iterator with f, one record at a time.
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).map(f)
}
RDD support for fault tolerance

Fault tolerance is typically supported in one of two ways: data replication or update logging. Both approaches are expensive for data-centric systems, because they require copying large amounts of data across the cluster network, and network bandwidth is far lower than that of memory.

The RDD is fault tolerant by design. First, it is an immutable dataset; second, it remembers the graph of operations that built it (its lineage), so when a worker executing a task fails, the lost partitions can be recomputed by replaying the operations recorded in that graph. This greatly reduces the cost of transferring data across the network compared with supporting fault tolerance through replication.
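
The lineage that an RDD remembers can be inspected with toDebugString; a minimal sketch (assuming sc is an existing SparkContext):

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the chain of RDDs (the lineage) that Spark would replay
// to rebuild lost partitions after a worker failure.
println(counts.toDebugString)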

However, in some scenarios Spark still needs to use logging to support fault tolerance. For example, in Spark Streaming, recovering the intermediate state of the computation after updating state, or after calling one of the window operations that Streaming provides, requires the checkpoint mechanism that Spark supplies, so that the operation can be recovered from a checkpoint.
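
At the RDD level (rather than through the Streaming-specific API), the checkpoint mechanism is exposed through setCheckpointDir and checkpoint. A minimal sketch, with a placeholder directory path:

// In practice the checkpoint directory should live on reliable storage such as HDFS.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val state = sc.parallelize(1 to 100)
  .map(n => (n % 10, n))
  .reduceByKey(_ + _)

// Mark the RDD for checkpointing; the data is written out when an action runs,
// truncating the lineage so that recovery does not have to replay the whole graph.
state.checkpoint()
state.count()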

For the RDD's wide dependencies, the most effective fault-tolerance approach is likewise to use the checkpoint mechanism. However, the latest version of Spark still does not appear to have introduced an automatic checkpointing mechanism.

Summary

The RDD is the core of Spark and the architectural foundation of the entire framework. Its characteristics can be summarized as follows:

  • It is an immutable data structure for storage.

  • It is a distributed data structure that spans the cluster.

  • The structure can be partitioned by the key of each data record.

  • It provides coarse-grained operations, and these operations support partitioning.

  • It stores data in memory, providing low latency.

Source: http://www.infoq.com/cn/articles/spark-core-rdd/
