VII. What is an RDD?


RDD is an abstract class that defines methods such as map() and reduce(), but in practice a derived class that inherits from RDD typically implements two methods:

    • def getPartitions: Array[Partition]
    • def compute(thePart: Partition, context: TaskContext): NextIterator[T]

getPartitions() tells Spark how the input is partitioned.

compute() produces all the rows of a given partition ("row" is a loose term on my part; more precisely, it is the unit of data that the function processes).
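As an illustration, here is a hypothetical minimal RDD subclass (not from the original article; the class and field names are invented and the modern org.apache.spark API is assumed) that overrides exactly these two methods:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical example: an RDD that yields the integers 0 until n,
    // spread round-robin across numSlices partitions.
    class RangeSketchRDD(sc: SparkContext, n: Int, numSlices: Int)
      extends RDD[Int](sc, Nil) {

      private case class Slice(index: Int) extends Partition

      // Tell Spark how the input is partitioned.
      override protected def getPartitions: Array[Partition] =
        (0 until numSlices).map(i => Slice(i): Partition).toArray

      // Produce the elements of one partition.
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        (split.index until n by numSlices).iterator
    }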

Features of the RDD:

    1. It is an immutable, partitioned collection object distributed across cluster nodes.
    2. It is created through parallel transformations (such as map, filter, join, etc.).
    3. It is rebuilt automatically on failure.
    4. Its storage level (memory, disk, and so on) can be controlled for reuse.
    5. It must be serializable.
    6. It is statically typed.

An RDD is characterized by: A. partitions; B. dependencies (lineage); C. a compute function; D. preferred locations (data locality); E. a partitioning policy.

The benefits of RDD

    1. An RDD can only be created from persistent storage or through transformation operations. This makes fault tolerance more efficient than with distributed shared memory (DSM): lost data partitions can be recomputed from the RDD's lineage alone, without a dedicated checkpoint (a small example follows this list).
    2. The immutability of RDDs enables Hadoop MapReduce-style speculative execution.
    3. The data-partitioning feature of RDDs improves performance through data locality, just as in Hadoop MapReduce.
    4. RDDs are serializable and can automatically fall back to disk storage when memory is insufficient; when an RDD is stored on disk, performance drops considerably but is still no worse than today's MapReduce.
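As a small illustration (the file path is hypothetical and "sc" is assumed to be an existing SparkContext), the lineage that Spark records for recomputation can be inspected with toDebugString:

    val lines  = sc.textFile("hdfs://namenode:9000/data/input.txt")
    val errors = lines.filter(_.contains("ERROR")).map(_.toUpperCase)
    println(errors.toDebugString)  // prints the chain of parent RDDs (the lineage) used to rebuild lost partitions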

Storage and partitioning of RDD

    1. Users can choose different storage levels to store an RDD for reuse.
    2. An RDD is currently stored in memory by default, but when memory is insufficient it spills to disk.
    3. When partitioning is needed, an RDD can partition its data across the cluster by each record's key (for example, with a hash partitioner), which makes joining two datasets efficient (see the sketch after this list).
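Here is a sketch of that idea with hypothetical data (assuming a recent Spark version and an existing SparkContext named "sc"): hash-partitioning two key-value RDDs the same way lets the join reuse the co-partitioning instead of reshuffling both sides.

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(8)
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(partitioner)
    val orders = sc.parallelize(Seq((1, "book"),  (2, "pen"))).partitionBy(partitioner)
    val joined = users.join(orders)  // both inputs share the partitioner, so the join needs no extra shuffle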

Internal representation of the RDD

In its internal implementation, each RDD is characterized by five properties (a code sketch follows the list):

    1. A list of partitions (data blocks)
    2. A function for computing each split (how this RDD is computed from its parent RDD)
    3. A list of dependencies on parent RDDs
    4. A Partitioner for key-value RDDs (optional)
    5. A list of preferred locations for each data split, such as the addresses of data blocks on HDFS (optional)
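For reference, here is a condensed sketch (based on the modern org.apache.spark.rdd.RDD API; exact signatures vary across Spark versions, and the class name here is invented) of where these five properties appear in the RDD abstraction:

    import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

    abstract class SketchRDD[T] {
      protected def getPartitions: Array[Partition]                      // 1. partition list
      def compute(split: Partition, context: TaskContext): Iterator[T]   // 2. compute each split
      protected def getDependencies: Seq[Dependency[_]]                  // 3. dependencies on parent RDDs
      val partitioner: Option[Partitioner] = None                        // 4. optional partitioner (key-value RDDs)
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil // 5. optional preferred locations
    }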

The storage level of the RDD

The RDD provides 11 storage levels, built from combinations of four parameters: useDisk, useMemory, deserialized, and replication:

    val NONE = new StorageLevel(false, false, false)
    val DISK_ONLY = new StorageLevel(true, false, false)
    val DISK_ONLY_2 = new StorageLevel(true, false, false, 2)
    val MEMORY_ONLY = new StorageLevel(false, true, true)
    val MEMORY_ONLY_2 = new StorageLevel(false, true, true, 2)
    val MEMORY_ONLY_SER = new StorageLevel(false, true, false)
    val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, 2)
    val MEMORY_AND_DISK = new StorageLevel(true, true, true)
    val MEMORY_AND_DISK_2 = new StorageLevel(true, true, true, 2)
    val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false)
    val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, 2)
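A usage sketch (assuming the modern org.apache.spark.storage package path, a hypothetical file path, and an existing SparkContext "sc"): one of the levels above is chosen when persisting an RDD for reuse.

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs://namenode:9000/data/app.log")
    logs.persist(StorageLevel.MEMORY_AND_DISK_SER)
    logs.count()  // the first action computes the RDD and stores its partitions
    logs.count()  // later actions reuse the stored partitions instead of rereading HDFS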

The RDD defines a variety of operations; different types of data are abstracted by different RDD subclasses, and the different operations are implemented by those RDDs.

The creation of an RDD

There are two ways to create an RDD:

1. Create it from input in the Hadoop file system (such as HDFS) or another Hadoop-compatible storage system.

2. Transform a parent RDD to obtain a new RDD.

Here is how an RDD is generated from the Hadoop file system, for example: val file = spark.textFile("hdfs://..."). The file variable is an RDD (actually backed by a HadoopRDD instance), and the core code that generates it is as follows:

    // SparkContext creates the RDD from a file/directory and an optional number of splits.
    // Here we can see that Spark is much like Hadoop MapReduce: it requires an InputFormat
    // and key/value types; in fact Spark reuses Hadoop's InputFormat and Writable types.
    def textFile(path: String, minSplits: Int = defaultMinSplits): RDD[String] = {
      hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
        classOf[Text], minSplits).map(pair => pair._2.toString)
    }

    // Create a HadoopRDD from the Hadoop configuration, InputFormat, and so on.
    new HadoopRDD(this, conf, inputFormatClass, keyClass, valueClass, minSplits)

When the RDD is computed, it reads data from HDFS in almost the same way as Hadoop MapReduce does.
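An illustrative sketch of the same idea (the path is hypothetical and "sc" is an existing SparkContext): the input can be read directly through Hadoop's InputFormat and Writable types, which is essentially what HadoopRDD does for each partition when the RDD is computed.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    val pairs = sc.hadoopFile("hdfs://namenode:9000/data/input.txt",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    val lines = pairs.map(pair => pair._2.toString)  // the same pair._2.toString step as in textFile above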

Transformations and actions on the RDD

There are two kinds of computations on an RDD: transformations (whose return value is an RDD) and actions (whose return value is not an RDD).

Transformations (such as map, filter, groupBy, join, etc.) are lazy: an operation that produces one RDD from another is not executed immediately. When Spark encounters a transformation, it only records that the operation is needed; nothing is executed until an action actually starts the computation.

Actions (such as count, collect, save, and so on) return a result or write RDD data to the storage system. Actions are what trigger Spark to start computing.
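A small demonstration of this laziness (the paths are hypothetical and "sc" is an existing SparkContext):

    val lines   = sc.textFile("hdfs://namenode:9000/data/app.log")  // transformation: nothing runs yet
    val errors  = lines.filter(_.contains("ERROR"))                  // transformation: still nothing runs, only lineage is recorded
    val nErrors = errors.count()                                     // action: Spark now launches the actual computation
    errors.saveAsTextFile("hdfs://namenode:9000/data/errors")        // another action: writes the RDD data to storage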

