Spark Learning: Understanding the RDD


Reposted from http://www.infoq.com/cn/articles/spark-core-rdd/. Thanks to Zhang Yicheng for his selfless sharing.

RDD, short for Resilient Distributed Dataset, is a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how the data is partitioned. The RDD also provides a rich set of operations for manipulating the data. Among these operations, transformations such as map, flatMap, and filter implement the monad pattern and fit nicely with Scala's collection operations. Beyond these, the RDD provides more convenient operations such as join, groupBy, and reduceByKey (note that reduceByKey is a transformation, not an action, since it lazily returns a new RDD) to support common data manipulation.
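As a quick illustration of these operations, here is a minimal sketch in Scala, assuming a SparkContext named sc (as provided by the Spark shell); the sample data is made up:

val lines = sc.parallelize(Seq("spark builds on rdds", "rdds are resilient"))
val words = lines.flatMap(line => line.split(" "))               // transformation: split each line into words
val longWords = words.filter(word => word.length > 4)            // transformation: keep only longer words
val counts = longWords.map(word => (word, 1)).reduceByKey(_ + _) // transformation: count occurrences per word
counts.collect().foreach(println)                                // action: triggers the actual computation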

Generally speaking, there are several common models for data processing: iterative algorithms, relational queries, MapReduce, and stream processing. For example, Hadoop MapReduce implements the MapReduce model, and Storm implements the stream processing model. The RDD blends these four models, so Spark can be applied to a wide variety of big data processing scenarios.

As a data structure, the RDD is essentially a read-only collection of partitioned records. An RDD can contain multiple partitions, each of which is a fragment of the dataset. RDDs can depend on one another. If each partition of the parent RDD is used by at most one partition of the child RDD, the dependency is called a narrow dependency; if multiple child partitions depend on it, it is called a wide dependency. Different operations have different dependencies according to their characteristics. For example, a map operation produces a narrow dependency, while a join operation generates wide dependencies.

Spark divides dependencies into narrow and wide for two reasons. First, narrow dependencies allow multiple operations to be pipelined on the same cluster node; for example, a filter can be executed immediately after a map. By contrast, a wide dependency requires all parent partitions to be available, which may force a MapReduce-like shuffle of data across nodes. Second, from the perspective of failure recovery: recovery under narrow dependencies is more efficient, because only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. Under wide dependencies, the loss of a single partition may involve recomputing multiple parent partitions. The figure below illustrates the difference between narrow and wide dependencies:

This figure is taken from Matei Zaharia's paper, An Architecture for Fast and General Data Processing on Large Clusters. In the diagram, each box represents an RDD, and each shaded rectangle represents a partition.
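To make the distinction concrete, here is a small sketch, again assuming a SparkContext named sc; per the discussion above, map yields a narrow dependency and join a wide one:

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val mapped = pairs.map { case (k, v) => (k, v.toUpperCase) } // narrow: each child partition reads one parent partition
val other = sc.parallelize(Seq((1, "x"), (2, "y")))
val joined = mapped.join(other) // wide: a child partition may read from many parent partitions (shuffle)
joined.collect().foreach(println)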

How does the RDD ensure data processing efficiency?

The RDD provides two key features, persistence and partitioning, which the user can control through the persist and partitionBy functions. The RDD's partitioning, combined with parallel computation (SparkContext defines the parallelize function for creating RDDs), enables Spark to make better use of scalable hardware resources. By combining partitioning with persistence, you can process massive amounts of data more efficiently. For example:

input.map(parseArticle _).partitionBy(partitioner).cache()

The partitionBy function takes a Partitioner object, for example:

val partitioner = new HashPartitioner(sc.defaultParallelism)
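Putting the two snippets together, a minimal end-to-end sketch might look like the following; parseArticle is a hypothetical parser that turns a line of input into a key-value pair, and the input path is likewise illustrative:

import org.apache.spark.HashPartitioner

// Hypothetical parser: split a tab-separated line into (key, value).
def parseArticle(line: String): (String, String) = {
  val fields = line.split("\t", 2)
  (fields(0), if (fields.length > 1) fields(1) else "")
}

val partitioner = new HashPartitioner(sc.defaultParallelism)
val input = sc.textFile("hdfs:///path/to/articles") // illustrative path
val articles = input.map(parseArticle _).partitionBy(partitioner).cache()
articles.count() // action: materializes the partitioned, cached RDD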

The RDD is essentially an in-memory dataset, and when an RDD is accessed, the pointer refers only to the part relevant to the operation. For example, consider a column-oriented data structure in which one column is implemented as an array of Int and another as an array of Float. If only the Int field is needed, the RDD can point at just the Int array, avoiding a scan of the entire structure. The RDD divides operations into two categories: transformations and actions. No matter how many transformations are applied, the RDD does not actually execute them; computation is triggered only when an action is invoked. Internally, the RDD's underlying interface is based on iterators, which makes data access more efficient and avoids the memory cost of materializing large intermediate results. In the implementation, each transformation returns a corresponding subclass of RDD; for example, a map operation returns a MappedRDD, and flatMap returns a FlatMappedRDD. When we perform a map or flatMap operation, we simply pass the current RDD object to the constructor of the corresponding RDD object. For example:

def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

These classes that inherit from RDD define a compute function. This function is triggered when an action is invoked; inside it, the corresponding transformation is applied through an iterator:

private[spark]
class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  // A MappedRDD has the same partitions as its parent RDD.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Apply f to the parent's iterator when an action triggers computation.
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).map(f)
}
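The laziness described above is easy to observe. A minimal sketch, assuming a SparkContext named sc:

val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2)  // returns a new RDD (a MappedRDD in older Spark versions); nothing runs yet
val big = doubled.filter(_ > 10)  // another lazy transformation chained onto the lineage
val result = big.collect()        // action: only now is compute invoked for each partition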
RDD support for fault tolerance

Fault tolerance is typically supported in one of two ways: data replication or logging. Both approaches are expensive for data-centric systems, because they require copying large amounts of data across the cluster network, and network bandwidth is far lower than that of memory. The RDD is inherently fault tolerant. First, it is an immutable dataset; second, it remembers the graph of operations that built it, so when a worker executing a task fails, the lost partitions can be recomputed from that graph. This eliminates the need for replication-based fault tolerance and reduces the cost of transferring data across the network. However, in some scenarios Spark still needs to use logging. For example, in Spark Streaming, when performing update operations on data or calling the window operations that streaming provides, the intermediate state of the execution must be recoverable. In this case, Spark's checkpoint mechanism is needed so that operations can be recovered from the checkpoint. For RDDs with wide dependencies, checkpointing is also the most effective fault-tolerance method. However, the latest version of Spark still does not appear to provide an automatic checkpointing mechanism.
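Although Spark does not checkpoint automatically, a checkpoint can be requested explicitly. A minimal sketch, assuming a SparkContext named sc and a writable checkpoint directory (the path here is illustrative):

sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path; prefer a reliable store such as HDFS
val wide = sc.parallelize(1 to 100).map(i => (i % 10, i)).reduceByKey(_ + _) // wide dependency (shuffle)
wide.checkpoint() // request a checkpoint; the lineage is truncated once it is materialized
wide.count()      // action: triggers computation and writes the checkpoint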

Summary

The RDD is the heart of Spark and the architectural foundation of the entire framework. Its characteristics can be summarized as follows:

    • It is an immutable (read-only) data structure
    • It is a distributed data structure that spans the cluster
    • It can be partitioned according to the keys of the data records
    • It provides coarse-grained operations, and these operations respect partitioning
    • It stores data in memory, providing low latency
