Spark internals: What the hell is an RDD?


The RDD is Spark's foundation and its most fundamental data abstraction. It was introduced in the paper http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf. If reading the English paper is too time-consuming, there is a Chinese summary at http://shiyanjun.cn/archives/744.html.

This article analyzes the implementation of the RDD based on that paper and the Spark source code.

First question: what is an RDD? RDD stands for Resilient Distributed Dataset. An RDD is a read-only, partitioned collection of records.

An RDD can only be created through deterministic operations on either data in stable physical storage or other existing RDDs. These deterministic operations are called transformations, for example map, filter, groupByKey, and join, to distinguish them from the other operations a developer can run on an RDD.

An RDD does not need to be materialized at all times. Instead, it carries information about how it was derived (that is, computed) from other RDDs, known as its lineage. With this information, the partitions of an RDD can always be recomputed from data in physical storage.

Let's take a look at an overview of the RDD's internal implementation, taken from the comment in the Spark source:

Internally, each RDD is characterized by five main properties:

 - A list of partitions
 - A function for computing each split
 - A list of dependencies on other RDDs
 - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

Each RDD has five basic properties (see the sketch after this list):

    1. A list of partitions (shards), the basic units that make up the dataset.
    2. A function for computing each partition.
    3. A list of dependencies on parent RDDs, which describes the lineage between RDDs.
    4. Optionally, a partitioner for key-value RDDs.
    5. Optionally, a list of preferred locations for accessing each partition. For an HDFS file, this is the location of the blocks where each partition resides.
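
To make these five properties concrete, here is a hedged paraphrase (not the verbatim Spark source; the exact signatures vary between versions) of how they surface as members of the org.apache.spark.rdd.RDD abstract class:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Each member corresponds to one of the five properties listed above.
trait RddShape[T] {
  def getPartitions: Array[Partition]                            // 1. the list of partitions
  def compute(split: Partition, ctx: TaskContext): Iterator[T]   // 2. how to compute each split
  def getDependencies: Seq[Dependency[_]]                        // 3. dependencies on parent RDDs (the lineage)
  def partitioner: Option[Partitioner]                           // 4. optional partitioner for key-value RDDs
  def getPreferredLocations(split: Partition): Seq[String]       // 5. optional preferred locations for each split
}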

org.apache.spark.rdd.RDD is an abstract class that defines the basic operations and properties of an RDD. These basic operations include map, filter, and persist. In addition, org.apache.spark.rdd.PairRDDFunctions defines operations on key-value RDDs, including groupByKey, join, reduceByKey, countByKey, saveAsHadoopFile, and so on. org.apache.spark.rdd.SequenceFileRDDFunctions contains saveAsSequenceFile, which applies to RDDs that can be written as SequenceFiles.
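
As a quick illustration, here is a minimal, self-contained sketch (the application name, master URL, and data are made up) showing the basic RDD operations alongside the key-value operations that come from PairRDDFunctions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // brings the PairRDDFunctions conversion into scope on older Spark versions

object RddApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-api-sketch").setMaster("local[*]"))

    // Basic operations defined directly on org.apache.spark.rdd.RDD
    val nums    = sc.parallelize(1 to 10)
    val doubled = nums.map(_ * 2).filter(_ > 5).persist()

    // Key-value operations provided by org.apache.spark.rdd.PairRDDFunctions
    val pairs = doubled.map(n => (n % 3, n))
    val sums  = pairs.reduceByKey(_ + _)

    sums.collect().foreach(println)
    sc.stop()
  }
}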

The RDD supports two kinds of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver after running a computation on the dataset. For example, map is a transformation: it passes each element of the dataset through a function and returns a new distributed dataset representing the result.

On the other hand, reduce is an action: it aggregates all the elements of the dataset using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy. In other words, they do not compute their results right away. Instead, they simply remember the transformations applied to the underlying dataset (such as a file).

The transformations are only actually computed when an action requires a result to be returned to the driver. This design lets Spark run more efficiently. For example, we can create a new dataset with map and use it in a reduce, so that only the result of the reduce is returned to the driver rather than the entire large intermediate dataset.
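
For instance, here is a minimal sketch of that map-then-reduce pattern (assuming a SparkContext named sc and a placeholder HDFS path):

val lines   = sc.textFile("hdfs://...")   // transformation: nothing is computed yet
val lengths = lines.map(_.length)         // transformation: still nothing is computed
val total   = lengths.reduce(_ + _)       // action: triggers the computation; only the final Int reaches the driver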

By default, each transformed RDD is recomputed every time you run an action on it. However, you can use the persist (or cache) method to keep an RDD in memory. In that case, Spark keeps the relevant elements around on the cluster, so the next time you query the RDD it can be accessed much faster. Persisting the dataset on disk, or replicating it across the cluster, is also supported.
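
For example (a minimal sketch, assuming a SparkContext named sc; note that an RDD's storage level can only be set once):

import org.apache.spark.storage.StorageLevel

val inMemory   = sc.textFile("hdfs://...").cache()                                // same as persist(StorageLevel.MEMORY_ONLY)
val spillable  = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when memory runs short
val replicated = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY_2)    // keep each partition on two nodes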

The following table lists the RDD transformations and actions available in Spark.

Each operation is shown together with its signature, with type parameters in square brackets. As mentioned earlier, transformations are lazy operations that define a new RDD, while actions launch a computation that returns a value to the user program or writes data to external storage.

Table 1: The RDD transformations and actions supported in Spark

Transformations:
    map(f: T => U)                 : RDD[T] => RDD[U]
    filter(f: T => Bool)           : RDD[T] => RDD[T]
    flatMap(f: T => Seq[U])        : RDD[T] => RDD[U]
    sample(fraction: Float)        : RDD[T] => RDD[T]                               (deterministic sampling)
    groupByKey()                   : RDD[(K, V)] => RDD[(K, Seq[V])]
    reduceByKey(f: (V, V) => V)    : RDD[(K, V)] => RDD[(K, V)]
    union()                        : (RDD[T], RDD[T]) => RDD[T]
    join()                         : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
    cogroup()                      : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (Seq[V], Seq[W]))]
    crossProduct()                 : (RDD[T], RDD[U]) => RDD[(T, U)]
    mapValues(f: V => W)           : RDD[(K, V)] => RDD[(K, W)]                     (preserves partitioning)
    sort(c: Comparator[K])         : RDD[(K, V)] => RDD[(K, V)]
    partitionBy(p: Partitioner[K]) : RDD[(K, V)] => RDD[(K, V)]

Actions:
    count()                        : RDD[T] => Long
    collect()                      : RDD[T] => Seq[T]
    reduce(f: (T, T) => T)         : RDD[T] => T
    lookup(k: K)                   : RDD[(K, V)] => Seq[V]                          (on hash/range-partitioned RDDs)
    save(path: String)             : outputs the RDD to a storage system, e.g., HDFS

Note that some operations are only available for key-value pairs, such as join.

In addition, the function names match those in Scala and other functional languages. For example, map is a one-to-one mapping, while flatMap maps each input to one or more outputs (similar to the map in MapReduce).
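
A small sketch of the difference (assuming a SparkContext named sc):

val phrases = sc.parallelize(Seq("to be", "or not"))
phrases.map(_.split(" ")).collect()      // Array(Array(to, be), Array(or, not)) -- one output per input
phrases.flatMap(_.split(" ")).collect()  // Array(to, be, or, not)               -- each input flattened into several outputs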

In addition to these operations, users can ask for an RDD to be persisted. They can also get an RDD's partition order, which is represented by a Partitioner class, and partition another RDD in the same way. Some operations automatically result in a hash- or range-partitioned RDD, such as groupByKey, reduceByKey, and sort.
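
For illustration, here is a minimal sketch (assuming a SparkContext named sc with the pair-RDD implicits in scope; the data and the choice of four partitions are made up):

import org.apache.spark.HashPartitioner

// Explicitly hash-partition one key-value RDD, then reuse its partitioner for another.
val visits = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
               .partitionBy(new HashPartitioner(4))
val names  = sc.parallelize(Seq(("a", "Alpha"), ("b", "Beta")))
               .partitionBy(visits.partitioner.get)       // partitioned the same way as visits

// Operations such as reduceByKey automatically produce a hash-partitioned RDD.
val counts = visits.reduceByKey(_ + _)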

Starting with an example

The following example, taken from the RDD paper, implements the logic for processing the error lines of a log file stored in HDFS.

val lines = spark.textFile("hdfs://...")           // lines is an org.apache.spark.rdd.MappedRDD
val errors = lines.filter(_.startsWith("ERROR"))   // errors is an org.apache.spark.rdd.FilteredRDD
errors.cache()                                     // persist in memory
errors.count()                                     // triggers the action: counts how many ERROR lines there are

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning HDFS as an array
// (assuming time is field number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()


Here, spark is an instance of org.apache.spark.SparkContext; essentially, a Spark application starts by defining a SparkContext. The definition of textFile is as follows:

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

hadoopFile creates an org.apache.spark.rdd.HadoopRDD, and calling map on the HadoopRDD produces a MappedRDD:
  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

errors.cache() does not run anything immediately either. Its effect is that once errors has been computed, the result is cached for use in future computations, which speeds up later actions on it.
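
For reference, cache is just a thin wrapper around persist with the default MEMORY_ONLY storage level; paraphrasing the RDD class (the exact source differs slightly between Spark versions):

  /** Persist this RDD with the default storage level (MEMORY_ONLY). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /** Persist this RDD with the default storage level (MEMORY_ONLY). */
  def cache(): this.type = persist()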

errors.count() triggers an action, at which point a job has to be submitted to the cluster:

  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum


After submission, SparkContext passes the runJob call on to the DAGScheduler. The DAGScheduler divides the current DAG into stages, generates a TaskSet for each stage, and submits the tasks via TaskScheduler.submitTasks, which in turn invokes the SchedulerBackend. The SchedulerBackend sends these tasks to the executors to run.

How are stages divided? How are tasks generated? These questions will be covered in the next post.

I have to go to work tomorrow, so I will rest early and stop here for today.

Copyright notice: This is an original article by the blog author and may not be reproduced without permission.

