Spark RDD in Detail (1): RDD Principles

Tags: shuffle, hadoop, mapreduce, spark, rdd

About RDDs

Behind the cluster sits a very important distributed data abstraction: the Resilient Distributed Dataset (RDD). The RDD is Spark's most basic abstraction, an abstraction of distributed memory that lets you work with a distributed dataset the way you would operate on a local collection. The RDD is the core of Spark: it represents a dataset that is partitioned, immutable, and can be operated on in parallel, and different dataset formats correspond to different RDD implementations. An RDD must be serializable. An RDD can be cached in memory, so the result of each operation on the dataset can be kept in memory and the next operation can read its input directly from memory, eliminating the bulk of MapReduce's disk I/O. For iterative computations such as common machine learning algorithms, and for interactive data mining, the efficiency gain is substantial.

      (1) RDD features
      1) Creation: an RDD can only be created through transformations (such as map/filter/groupBy/join, as opposed to actions), from one of two sources: 1) data in stable storage; 2) other RDDs.
      2) Read-only: an RDD's state is immutable; it cannot be modified.
      3) Partitioning: the elements of an RDD can be partitioned by key and stored on multiple nodes. On recovery, only the data of the lost partitions is recomputed, without affecting the whole system.
      4) Lineage: an RDD records its ancestry, or lineage; it carries enough information about how it was derived from other RDDs.
      5) Persistence: an RDD that will be reused can be cached (for example in memory, or spilled to disk).
      6) Lazy evaluation: Spark defers the computation of RDDs so that transformations can be pipelined.
      7) Operations: a rich set of transformations and actions, such as count/reduce/collect/save.
      No matter how many transformation operations are applied, the RDD does not actually execute them (it only records the lineage); computation is triggered only when an action is executed, as the sketch below illustrates.
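
As a minimal sketch of this lazy behaviour (assuming spark-core on the classpath; the app name, master and data are illustrative only), the transformations below merely record lineage, and nothing runs until count is called:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("rdd-lazy-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val nums = sc.parallelize(1 to 100, 4)   // create an RDD with 4 partitions
    val evens = nums.filter(_ % 2 == 0)      // transformation: only lineage is recorded
    val squares = evens.map(n => n * n)      // transformation: still nothing computed
    squares.cache()                          // mark for reuse; materialized by the first action

    val total = squares.count()              // action: triggers the actual job
    println(s"count = $total")               // 50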

      (2) Benefits of RDDs
      1) An RDD can only be generated from persistent storage or through transformation operations. This makes fault tolerance cheaper than with distributed shared memory (DSM): lost data partitions can simply be recomputed from their lineage, without requiring specific checkpoints.
      2) The immutability of RDDs enables Hadoop MapReduce-style speculative execution.
      3) The data-partitioning property of RDDs can improve performance through data locality, just as in Hadoop MapReduce.
      4) RDDs are serializable and can automatically degrade to disk storage when memory is insufficient; with the RDD stored on disk, performance drops considerably but is still no worse than today's MapReduce.
      5) Batch operations: tasks can be assigned based on data locality, which improves performance.

      (3) The internal properties of an RDD
      Through the internal properties of an RDD, the user can obtain the corresponding metadata. This information can be used to support more complex algorithms or optimizations (a sketch of the corresponding interface follows this list).
      1) Partition list: lets you find all the partitions contained in an RDD and their locations.
      2) A function to compute each split: a user-defined function that is applied to each block (partition) of data.
      3) A dependency list on parent RDDs: provides support for fault tolerance by allowing backtracking to the parent RDDs.
      4) A partitioner that controls the partitioning strategy and number of partitions for key-value RDDs. The partitioning function determines how data records are distributed across partitions and nodes, reducing skew.
      5) An address list of preferred locations for each data partition (for example, the addresses of a data block on HDFS). If the data is replicated, the address list exposes all replica locations of a block, which supports load balancing and fault tolerance.
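
These five properties correspond roughly to the interface that RDD implementations expose in Spark's Scala code base. The trait below is a simplified paraphrase for illustration, not Spark's actual abstract class, although org.apache.spark.rdd.RDD declares members with very similar names and signatures:

    import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

    // Simplified paraphrase of the RDD contract; names mirror Spark's RDD class.
    trait SimpleRDD[T] {
      // 1) partition list
      protected def getPartitions: Array[Partition]
      // 2) function to compute each split (partition)
      def compute(split: Partition, context: TaskContext): Iterator[T]
      // 3) dependencies on parent RDDs (the basis of lineage and fault tolerance)
      protected def getDependencies: Seq[Dependency[_]]
      // 4) optional partitioner for key-value RDDs
      val partitioner: Option[Partitioner]
      // 5) preferred locations (e.g. HDFS block addresses) for each partition
      protected def getPreferredLocations(split: Partition): Seq[String]
    }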

(4) Storage and partitioning of RDDs
1) Users can choose different storage levels to store an RDD for reuse.
2) By default an RDD is stored in memory; when memory runs low, the RDD is spilled to disk.
3) When needed (for example for a hash partition), an RDD partitions its data across the cluster by each record's key, which makes joins between two datasets efficient.
The RDD defines its storage levels from combinations of the useDisk, useMemory, useOffHeap, deserialized, and replication parameters, as the sketch below shows.
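
The predefined levels in org.apache.spark.storage.StorageLevel are exactly such combinations. A brief sketch of choosing them with persist (assuming a SparkContext sc such as the one provided by spark-shell; the path is illustrative):

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs:///path/to/logs")    // illustrative path

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    logs.cache()

    // Each predefined level combines (useDisk, useMemory, useOffHeap, deserialized, replication):
    val words = logs.flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK)       // spill to disk when memory runs short

    val lower = logs.map(_.toLowerCase)
    lower.persist(StorageLevel.MEMORY_ONLY_SER)       // keep serialized objects in memory

    val errors = logs.filter(_.contains("ERROR"))
    errors.persist(StorageLevel.MEMORY_AND_DISK_2)    // additionally replicate each partition on 2 nodes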


(5) The fault-tolerance mechanism of RDDs

There are two ways to make a distributed dataset fault tolerant: data checkpointing and logging the updates made to the data. For an RDD, logging every update would be too expensive, so RDDs support only coarse-grained transformations: what is recorded is the single operation applied to a whole partition, and from these records Spark builds and stores the RDD's transformation sequence (its lineage). The lineage means that every RDD contains the information about how it was derived from other RDDs and how to reconstruct any of its partitions; for this reason the RDD fault-tolerance mechanism is also called "lineage" fault tolerance.

The biggest challenge in implementing lineage-based fault tolerance is expressing the dependency between a parent RDD and a child RDD. Dependencies come in two kinds, narrow and wide. Narrow dependency: each partition of the child RDD depends only on a small, fixed set of partitions of the parent RDD. Wide dependency: a partition of the child RDD may depend on all partitions of the parent RDD. For example, in a map transformation each partition of the child RDD depends only on the corresponding partition of the parent RDD; in a groupByKey transformation a partition of the child RDD depends on all partitions of the parent RDD, because a given key may appear in any of the parent's partitions.

Two properties distinguish the two kinds of dependency. First, with a narrow dependency a partition of the child RDD can be computed directly on one compute node from the corresponding parent partitions; with a wide dependency, all of the parent RDD's data must be computed first, then hash-partitioned and shipped to the corresponding nodes before the child RDD can be computed. Second, when data is lost, a narrow dependency only requires recomputing the lost partition, whereas a wide dependency may require recomputing all partitions of the ancestor RDDs to recover. So on a long lineage chain, especially one containing wide dependencies, it pays to set data checkpoints at appropriate points. These two properties also mean that the two kinds of dependency call for different task-scheduling and fault-recovery mechanisms.
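
A hedged sketch of cutting a long lineage with a checkpoint (assuming a SparkContext sc; paths and field layout are illustrative). A wide dependency such as reduceByKey followed by further processing is a typical candidate:

    // Checkpointing writes the RDD to reliable storage and truncates its lineage,
    // so recovery does not need to replay the whole chain of transformations.
    sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")    // illustrative directory

    val pairs = sc.textFile("hdfs:///path/to/events")     // illustrative path
      .map(line => (line.split(",")(0), 1))
    val counts = pairs.reduceByKey(_ + _)                 // wide dependency (shuffle)

    counts.cache()        // keep it in memory too, so the checkpoint write does not recompute it
    counts.checkpoint()   // materialized when the next action runs
    counts.count()        // action: triggers the job and the checkpoint write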

(6) Spark computation workflow
Figure 1-5 describes Spark's input, transformation at run time, and output. In the run phase an RDD is transformed by operators. Operators are functions defined on RDDs that transform and manipulate the data in the RDD.
• Input: when a Spark program runs, data enters Spark from an external data space (for example HDFS or a Scala collection). The data enters Spark's runtime data space, where it is turned into data blocks managed by the BlockManager.
• Run: once the input has formed an RDD, the data can be manipulated and the RDD converted into new RDDs via transformation operators such as filter; an action operator then triggers Spark to submit the job. If the data needs to be reused, it can be cached in memory with the cache operator.
• Output: when the program finishes, data leaves Spark's runtime space and is written either to distributed storage (for example saveAsTextFile writing to HDFS) or back into Scala data or collections (collect returns a Scala collection, count returns a Scala Long); see the sketch below.
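
A small sketch of the two output paths (assuming a SparkContext sc; paths are illustrative): one action writes RDD data to distributed storage, the others bring native Scala values back to the driver:

    val lines = sc.textFile("hdfs:///path/to/input")    // input: external storage -> RDD
    val short = lines.filter(_.length < 80)             // run: transformation, lineage only

    short.saveAsTextFile("hdfs:///path/to/output")      // output to distributed storage (HDFS)
    val local: Array[String] = short.collect()          // output: a native Scala array on the driver
    val n: Long = short.count()                         // output: a Scala Long on the driver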


Spark's core data model is the RDD, but RDD is an abstract class implemented by subclasses such as MappedRDD and ShuffledRDD. Spark maps common big-data operations onto these subclasses of RDD.


RDD Programming Model

Take a look at the code below: the textFile operator reads a log file from HDFS and returns "file" (an RDD); the filter operator keeps the lines containing "ERROR" and assigns them to "errors" (a new RDD); cache marks them for later reuse; and the count operator returns the number of lines in "errors". The RDD does not look much different from a Scala collection type, but its data and execution model are very different.
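
A reconstruction of that code from the description (the HDFS path is a placeholder):

    val file = sc.textFile("hdfs:///path/to/log")     // read the log file from HDFS
    val errors = file.filter(_.contains("ERROR"))     // keep only the lines containing "ERROR"
    errors.cache()                                    // cache for future reuse
    val n = errors.count()                            // action: the number of ERROR lines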


Figure 1 gives the RDD data model and maps the four operators used in the example above onto four operator types. A Spark program works in two spaces: the Spark RDD space and the Scala native data space. In the native data space, data appears as scalars (Scala primitive types, drawn as small orange squares), collection types (blue dashed boxes), and persistent storage (red cylinders).

This article describes RDD operations in terms of operators; an operator is a function defined on the RDD that transforms and manipulates the data in the RDD.


Figure 1: Switching between the two spaces, and the four different types of RDD operators

Input operators (orange arrows) pull data from a Scala collection type or from storage into the RDD space, turning it into an RDD (solid blue boxes). There are roughly two kinds of input operator: one for Scala collection types, such as parallelize, and one for stored data, such as textFile in the example above. The output of an input operator is an RDD in the Spark space.

Because of its functional semantics, an RDD produces a new RDD when a transformation operator (blue arrows) is applied; both the input and the output of a transformation operator are RDDs. An RDD is split into many partitions distributed across the nodes of the cluster; Figure 1 draws a partition as a small blue square. Note that a partition is a logical concept: the old and new partitions before and after a transformation may physically be the same piece of memory or storage. This is an important optimization that prevents the unbounded growth of memory demand that functional immutability would otherwise cause. Some RDDs are intermediate results of the computation, and their partitions do not necessarily have memory or storage backing them; if needed (for future reuse), you can call the cache operator (cache in the example, indicated by the grey arrow) to materialize the partitions (grey squares).

Some transformation operators treat the elements of the RDD as simple elements, and fall into the following categories:

    • One-to-one (element-wise) operators whose result keeps the RDD's partition structure unchanged, mainly map and flatMap (map followed by flattening into a one-dimensional RDD);

    • One-to-one operators whose result changes the RDD's partition structure, such as union (combining two RDDs into one) and coalesce (reducing the number of partitions);

    • Operators that select a subset of the input elements, such as filter, distinct (remove duplicate elements), subtract (elements present in this RDD but not in the other), and sample.

Another group of transformation operators works on key-value collections, and again falls into several categories:

    • Element-wise operations on a single RDD, such as mapValues (which keeps the source RDD's partitioning, unlike map);

    • Rearrangements of a single RDD, such as sort and partitionBy (consistent partitioning, which matters for data-locality optimization and is discussed later);

    • Regrouping and reducing a single RDD by key, such as groupByKey and reduceByKey;

    • Joining and regrouping two RDDs by key, such as join and cogroup.

The latter three categories involve rearranging data and are called shuffle operations. A sketch contrasting mapValues with map follows this list.
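
As a small illustration of the data-locality point above (a sketch assuming a SparkContext sc; the data is made up), mapValues preserves the parent's partitioner while map does not:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .partitionBy(new HashPartitioner(4))

    // mapValues cannot change keys, so Spark keeps the partitioner; a later
    // reduceByKey or join on the same keys needs no extra shuffle.
    val doubled = pairs.mapValues(_ * 2)
    println(doubled.partitioner)    // Some(...) -- partitioner preserved

    // map may change keys, so Spark drops the partitioner.
    val remapped = pairs.map { case (k, v) => (k, v * 2) }
    println(remapped.partitioner)   // None -- partitioner dropped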

The sequence of transformations from RDD to RDD takes place entirely in the RDD space. The important design here is lazy evaluation: no computation actually happens; it is only recorded, continuously, in metadata. The metadata structure is a DAG (directed acyclic graph) in which each "vertex" is an RDD (together with the operator that produces it) and each "edge" from a parent RDD to a child RDD represents the dependency between them. Spark gives this metadata DAG a cool name: lineage. This lineage is also the update log described in the fault-tolerance design above.

The lineage keeps growing until an action operator (the green arrows in Figure 1) is evaluated, at which point all the accumulated operators are executed at once. The input to an action operator is an RDD (together with all the RDDs it depends on along the lineage); its output is native data produced by the execution, which may be a Scala scalar, collection-type data, or storage. Whenever an operator's output is one of these types, the operator must be an action, and its effect is to return from the RDD space to the native data space.
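
The accumulated lineage can be inspected with the RDD's toDebugString method. A brief sketch (assuming a SparkContext sc; the data is made up); note that nothing in the chain has run yet when the string is printed:

    val words = sc.parallelize(Seq("spark rdd", "rdd lineage", "spark dag"))
      .flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Prints the chain of RDDs; the shuffle introduced by reduceByKey
    // marks where a stage boundary will be placed.
    println(counts.toDebugString)

    counts.collect()   // only now is the DAG scheduled and executed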


RDD execution logic: in a Spark application, the logical operations form a directed acyclic graph (DAG). Once an action operator is triggered, all the accumulated operators are assembled into a DAG, and the scheduler then schedules tasks over that graph. Spark schedules work differently from MapReduce: it creates stages based on the different kinds of dependency between RDDs, and a stage contains a series of functions executed as a pipeline. In the figure, A, B, C, D, E, F and G denote different RDDs, and the squares inside each box denote partitions. Data enters Spark from HDFS and forms RDD A and RDD C; RDD C goes through a map and becomes RDD D; RDD B and RDD F are joined into RDD G, and a shuffle takes place on the path from B to G. Finally, RDD G is saved back to HDFS with saveAsSequenceFile.
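
A rough sketch of a pipeline with the same shape (assuming a SparkContext sc; paths, keys and the exact operators are illustrative, chosen only to show where the shuffle falls):

    // Two inputs from HDFS (illustrative paths), turned into key-value RDDs.
    val b = sc.textFile("hdfs:///path/to/b").map(line => (line.split(",")(0), line))
    val c = sc.textFile("hdfs:///path/to/c").map(line => (line.split(",")(0), line))

    // A narrow transformation pipelines into the same stage as the read feeding it.
    val d = c.mapValues(_.toUpperCase)

    // join here is a wide dependency, so a shuffle separates the stages.
    val g = b.join(d)

    // Flatten the joined values and write the result back to HDFS.
    g.map { case (k, (bv, dv)) => (k, bv + "|" + dv) }
      .saveAsSequenceFile("hdfs:///path/to/output")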


RDD dependency relationships: dependencies between RDDs come in two kinds, narrow dependencies and wide dependencies. A narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD, as with map and filter. Correspondingly, a wide dependency means a partition of the parent RDD is depended on by multiple partitions of the child RDD, as with groupByKey and reduceByKey. In short: if a partition of the parent RDD is used by only one partition of one child RDD, the dependency is narrow; otherwise it is wide.
This division is useful in two ways. First, narrow dependencies allow pipelined execution on a single node; for example, thanks to the one-to-one relationship, a map can run immediately after a filter on the same partition. Second, narrow dependencies make failure recovery more efficient, because only the lost partitions of the parent RDD need to be recomputed; with a wide dependency, a single node failure may lose data drawn from every partition of the parent RDDs, requiring a complete re-execution. For wide dependencies, therefore, Spark simplifies failure recovery by persisting intermediate data on the nodes that hold each parent partition, much as MapReduce persists map output.
Special note: there are two cases for a join operation. If the join is performed so that each partition only needs to join with a known partition of the other input, the join is a narrow dependency; in the other case it is a wide dependency. The narrow case qualifies because the dependency is on a deterministic, fixed number of partitions. The conclusion: narrow dependency covers not only one-to-one dependencies but also dependencies on a fixed number of partitions (that is, the number of parent partitions depended on does not grow with the size of the RDD's data); see the sketch below.
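
A sketch of the two join cases (assuming a SparkContext sc; data and partition counts are made up). When both sides already share the same partitioner, Spark can join them without another shuffle, which is the narrow case described above:

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(4)

    // Co-partitioned inputs: both sides use the same partitioner.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p)
    val right = sc.parallelize(Seq((1, "x"), (3, "y"))).partitionBy(p)

    // Narrow case: each output partition joins with one known partition on each
    // side, so no additional shuffle is needed.
    val narrowJoin = left.join(right)

    // Wide case: the other side has no matching partitioner, so it must be shuffled.
    val other = sc.parallelize(Seq((1, "p"), (2, "q")))
    val wideJoin = left.join(other)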

How stages are divided, as shown in the figure:
Stage boundaries are determined by wide dependencies. When do wide dependencies arise? From operations such as reduceByKey and groupByKey.
1. Working backwards from the final RDD, break the graph at each wide dependency, and add each narrowly-dependent RDD to the current stage;
2. The number of tasks in a stage is determined by the number of partitions of the last RDD in that stage;
3. The tasks of the last stage are of type ResultTask, while the tasks of the earlier stages are of type ShuffleMapTask;
4. The operator that names the current stage must be the last computation step of that stage.

A note: the Mapper and Reducer operations of Hadoop MapReduce correspond roughly to the basic Spark operators map and reduceByKey. Inside a stage, operators are first merged: when the functional program is finally executed, its functions are expanded and the operators of a stage are fused into one large operator that contains all of the stage's computation logic over the data. Then, thanks to the lazy nature of transformations, the operators are first optimized by the Spark framework (the DAGScheduler) before the concrete operators are handed to the cluster's executors for computation.

How to operate on RDDs

(1) How an RDD is created
1) Created from input in the Hadoop file system (or other persistent storage systems compatible with Hadoop, such as Hive, Cassandra, or HBase), for example HDFS.
2) Obtained by transforming a parent RDD into a new RDD.
3) Created from a local collection, turned into a distributed RDD via parallelize or makeRDD (see the sketch below).
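
A short sketch of the three creation paths (assuming a SparkContext sc; paths and data are illustrative):

    // 1) From Hadoop-compatible storage
    val fromHdfs = sc.textFile("hdfs:///path/to/data")

    // 2) From a parent RDD, via a transformation
    val derived = fromHdfs.filter(_.nonEmpty)

    // 3) From a local collection, via parallelize or makeRDD
    val fromSeq = sc.parallelize(Seq(1, 2, 3, 4))
    val fromSeq2 = sc.makeRDD(Seq("a", "b", "c"))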

(2) The two kinds of RDD operators
RDDs support two kinds of operators: transformations and actions.
1) Transformation: transformations are computed lazily; converting one RDD into another RDD is not executed immediately, but deferred until an action actually triggers the computation.
2) Action: an action operator triggers Spark to submit a job and outputs the data out of the Spark system.

1. Transformation operators include, for example: map, filter, flatMap, union, coalesce, distinct, subtract, sample, mapValues, partitionBy, groupByKey, reduceByKey, join, cogroup.

2. Action operators include, for example: count, collect, reduce, saveAsTextFile, saveAsSequenceFile.


Summary: Spark provides a more optimized and more sophisticated execution flow than MapReduce. The reader should now have insight into how Spark runs and into the Spark operators, which makes the use of the API more intuitive. Spark provides a rich set of functional operators, which lays a solid foundation for building Spark's higher-level components. Later articles will describe the Spark operators in detail, with source code and examples.
