Spark RDD (Resilient Distributed Dataset)

Source: Internet
Author: User


org.apache.spark.rdd
RDD
abstract class RDD[T] extends Serializable with Logging

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as Hadoop SequenceFiles. All of these operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
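
The operations named above can be tried out with a small local job. The following is a minimal sketch, not code from the original post: the object name, the local[2] master, the app name, and the sample data are all assumptions chosen for illustration. On Spark versions before 1.3, the pair-RDD implicits additionally require import org.apache.spark.SparkContext._.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of Spark itself.
object RddBasicsSketch {
  // Runs a few basic RDD operations and returns their collected results.
  def run(sc: SparkContext): (Array[Int], Map[String, Int]) = {
    // Operations defined on every RDD: map, filter, persist.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)
    val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0).persist()

    // groupByKey is defined in PairRDDFunctions. It is available here only
    // because this RDD has the element type (String, Int).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val sums = pairs.groupByKey().mapValues(_.sum)

    (evenSquares.collect(), sums.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    // Local mode with two worker threads (an assumption for this sketch).
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-basics-sketch")
    val sc = new SparkContext(conf)
    try {
      val (squares, sums) = run(sc)
      println(squares.mkString(", "))
      println(sums)
    } finally sc.stop()
  }
}
```

Calling groupByKey on the numbers RDD would not compile, because RDD[Int] has no implicit conversion to PairRDDFunctions.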


Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
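
Four of these five properties can be inspected directly through public members of RDD. The following is a minimal sketch under assumed names (the object name, the sample data, and the local[2] master are illustrative only). The compute function is called by the scheduler inside tasks, so the sketch does not invoke it.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical example object; not part of Spark itself.
object RddPropertiesSketch {
  // Returns (number of partitions, number of dependencies, whether a partitioner is set).
  def inspect(sc: SparkContext): (Int, Int, Boolean) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), numSlices = 2)
    // partitionBy comes from PairRDDFunctions and hash-partitions the pairs by key.
    val hashed = pairs.partitionBy(new HashPartitioner(3))

    // 1. A list of partitions.
    println(s"partitions: ${hashed.partitions.length}")
    // 3. A list of dependencies on other RDDs (here, a shuffle of `pairs`).
    hashed.dependencies.foreach { d =>
      println(s"dependency: ${d.getClass.getSimpleName} on ${d.rdd}")
    }
    // 4. An optional partitioner: None for `pairs`, Some(...) for `hashed`.
    println(s"partitioner before: ${pairs.partitioner}, after: ${hashed.partitioner}")
    // 5. Optional preferred locations for each split (usually empty for local data).
    hashed.partitions.foreach { p =>
      println(s"partition ${p.index} prefers: ${hashed.preferredLocations(p)}")
    }

    (hashed.partitions.length, hashed.dependencies.length, hashed.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-properties-sketch")
    val sc = new SparkContext(conf)
    try inspect(sc) finally sc.stop()
  }
}
```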


All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions. Refer to the Spark paper for more details on RDD internals.
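
As an illustration of overriding these functions, here is a minimal sketch of a custom RDD that generates a range of integers instead of reading from a real storage system. The names RangePartition, SimpleRangeRDD, and CustomRddSketch are hypothetical and exist only for this example. Only the two required methods, getPartitions and compute, are overridden; the dependency list is empty and the optional partitioner and preferred locations keep their defaults.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: covers the half-open integer range [start, end).
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical custom RDD yielding the integers in [0, n), split into numSlices partitions.
// Passing Nil as the second argument means it has no dependencies on other RDDs.
class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  // Property 1: the list of partitions.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // Property 2: the function that computes one split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

object CustomRddSketch {
  def main(args: Array[String]): Unit = {
    // Local mode with two worker threads (an assumption for this sketch).
    val conf = new SparkConf().setMaster("local[2]").setAppName("custom-rdd-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = new SimpleRangeRDD(sc, 10, 3)
      // The custom RDD inherits every standard operation, e.g. map and collect.
      println(rdd.map(_ * 2).collect().mkString(", "))
    } finally sc.stop()
  }
}
```

An RDD backed by a real storage system would typically also override getPreferredLocations so that tasks can be scheduled close to the data.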


Linear Supertypes
Logging, Serializable, Serializable, AnyRef, Any

Known Subclasses
CoGroupedRDD, EdgeRDD, EdgeRDDImpl, HadoopRDD, JdbcRDD, NewHadoopRDD, PartitionPruningRDD, ShuffledRDD, UnionRDD, VertexRDD, VertexRDDImpl

(To be continued)

When reprinting, please cite the original address: http://www.cnblogs.com/suanec/p/4772707.html
