RDD is the most basic and fundamental data abstraction in Spark. The paper at http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf introduces RDDs; if you find it too time-consuming to read the English paper, you can read this article instead.
This article analyzes the implementation of RDD based on that paper and on the Spark source code.
First, what is an RDD? A Resilient Distributed Dataset (RDD) is a read-only, partitioned collection of records. An RDD can only be created through deterministic operations on either data in stable physical storage or other existing RDDs. These deterministic operations are called transformations, such as map, filter, groupBy, and join (transformations are distinguished from the other kind of operations programmers perform on RDDs, namely actions).
An RDD does not need to be materialized. An RDD carries the information describing how it was derived (i.e., computed) from other RDDs, its lineage, so that the corresponding RDD partitions can always be recomputed from data in physical storage.
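For example, the lineage of an RDD can be inspected with toDebugString. The following is a minimal sketch (the application name, master URL, and variable names are illustrative, not from the original article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch; assumes a local Spark installation. Names are illustrative.
val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

val nums    = sc.parallelize(1 to 100, 4)   // base RDD with 4 partitions
val squares = nums.map(x => x * x)          // derived from nums by a transformation
val evens   = squares.filter(_ % 2 == 0)    // derived from squares

// toDebugString prints the lineage: the chain of parent RDDs from which
// any lost partition of `evens` can be recomputed.
println(evens.toDebugString)
```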
Let's take a look at the internal implementation overview of RDD:
```scala
/**
 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions
 *  - A function for computing each split
 *  - A list of dependencies on other RDDs
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
 */
```
Each RDD has five main attributes (a minimal sketch of a custom RDD exposing all five follows this list):
- A list of partitions; a partition is the basic unit of the dataset
- A function for computing each partition (split)
- A list of dependencies on parent RDDs, which describes the lineage between RDDs
- Optionally, a partitioner for key-value RDDs
- Optionally, a list of preferred locations for each partition; for an HDFS file, these are the locations of the blocks where each partition resides
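These five attributes correspond directly to members of the abstract RDD class: getPartitions, compute, the dependency list passed to the RDD constructor, partitioner, and getPreferredLocations. Below is a minimal, hypothetical sketch of a custom RDD that fills in all five (the names SimpleRangeRDD and RangePartition are invented for illustration and are not part of Spark):

```scala
import org.apache.spark.{Partition, Partitioner, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative partition type: each partition covers the range [start, end).
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Illustrative RDD that produces the numbers 0 until n, split into numSlices partitions.
class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {   // Nil: no parent RDDs, hence an empty dependency list

  // 1. A list of partitions
  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // 2. A function for computing each split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // 3. Dependencies on parent RDDs are declared via the `Nil` passed to the superclass above.

  // 4. Optionally, a partitioner (only meaningful for key-value RDDs, so None here)
  override val partitioner: Option[Partitioner] = None

  // 5. Optionally, preferred locations for each split (empty: no locality preference)
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}
```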
org.apache.spark.rdd.RDD is an abstract class that defines the basic operations and attributes of an RDD. These basic operations include map, filter, and persist. In addition, org.apache.spark.rdd.PairRDDFunctions defines the operations available on key-value RDDs, including groupByKey, join, reduceByKey, countByKey, saveAsHadoopFile, and so on. org.apache.spark.rdd.SequenceFileRDDFunctions contains saveAsSequenceFile for RDDs that can be written as SequenceFiles.
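As a small sketch of how this works in practice (assuming a SparkContext sc is already available; in Spark 1.x the implicit conversion to PairRDDFunctions is brought into scope via import org.apache.spark.SparkContext._), key-value operations such as reduceByKey become available on any RDD of pairs:

```scala
import org.apache.spark.SparkContext._   // implicit conversion RDD[(K, V)] => PairRDDFunctions

// Sketch: assumes a SparkContext `sc` already exists.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 3)))

// reduceByKey and groupByKey come from PairRDDFunctions, not from RDD itself.
val counts  = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 1)
val grouped = pairs.groupByKey()         // groups all values for each key

counts.collect().foreach(println)
```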
RDDs support two kinds of operations: transformations, which create a new dataset from an existing one, and actions, which run a computation on the dataset and return a value to the driver. For example, map is a transformation: it passes every element of the dataset through a function and returns a new distributed dataset representing the result. On the other hand, reduce is an action: it aggregates all the elements with some function and returns the final result to the driver program. (There is, however, a parallel reduceByKey that returns a distributed dataset.)
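A small sketch of the difference (assuming a SparkContext sc as before): map only builds a new RDD, while reduce actually runs a job and brings a single value back to the driver.

```scala
// Sketch: assumes a SparkContext `sc` already exists.
val data    = sc.parallelize(1 to 10)
val doubled = data.map(_ * 2)        // transformation: returns a new RDD, nothing runs yet
val sum     = doubled.reduce(_ + _)  // action: triggers a job and returns 110 to the driver

// reduceByKey, by contrast, is a transformation: the result is still a distributed RDD.
import org.apache.spark.SparkContext._
val wordPairs  = sc.parallelize(Seq(("spark", 1), ("rdd", 1), ("spark", 1)))
val wordCounts = wordPairs.reduceByKey(_ + _)   // RDD, not a local value
```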
All transformations in Spark are lazy; that is, they do not compute their results right away. Instead, they just remember the transformations applied to a base dataset (such as a file). The transformations are only actually run when an action requires a result to be returned to the driver. This design lets Spark run more efficiently: for example, we can create a dataset with map and then use it in a reduce, returning only the result of the reduce to the driver rather than the entire, larger mapped dataset.
By default, each transformed RDD is recomputed every time you run an action on it. However, you can use the persist (or cache) method to keep an RDD in memory; in that case, Spark keeps the relevant elements around on the cluster, so the next time you query this RDD it can be accessed much more quickly. Datasets can also be persisted on disk or replicated across the cluster.
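A small sketch of persist and cache (assuming a SparkContext sc; the HDFS path is elided just as in the paper's example):

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: assumes a SparkContext `sc` already exists.
val logs   = sc.textFile("hdfs://...")
val errors = logs.filter(_.startsWith("ERROR"))

errors.cache()                               // shorthand for persist(StorageLevel.MEMORY_ONLY)
errors.count()                               // first action: computes errors and caches its partitions
errors.filter(_.contains("MySQL")).count()   // reuses the cached partitions

// A different storage level: spill partitions to disk when memory is insufficient.
val upper = logs.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)
```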
The following table lists the RDD transformations and actions available in Spark. Each operation is shown with its type signature, where square brackets indicate type parameters. As mentioned above, transformations are lazy operations that define a new RDD, while actions start a computation and either return a value to the user program or write data to external storage.
Table 1: RDD transformations and actions supported by Spark
Transformations:
- map(f: T ⇒ U): RDD[T] ⇒ RDD[U]
- filter(f: T ⇒ Bool): RDD[T] ⇒ RDD[T]
- flatMap(f: T ⇒ Seq[U]): RDD[T] ⇒ RDD[U]
- sample(fraction: Float): RDD[T] ⇒ RDD[T] (deterministic sampling)
- groupByKey(): RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
- reduceByKey(f: (V, V) ⇒ V): RDD[(K, V)] ⇒ RDD[(K, V)]
- union(): (RDD[T], RDD[T]) ⇒ RDD[T]
- join(): (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
- cogroup(): (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
- crossProduct(): (RDD[T], RDD[U]) ⇒ RDD[(T, U)]
- mapValues(f: V ⇒ W): RDD[(K, V)] ⇒ RDD[(K, W)] (preserves partitioning)
- sort(c: Comparator[K]): RDD[(K, V)] ⇒ RDD[(K, V)]
- partitionBy(p: Partitioner[K]): RDD[(K, V)] ⇒ RDD[(K, V)]

Actions:
- count(): RDD[T] ⇒ Long
- collect(): RDD[T] ⇒ Seq[T]
- reduce(f: (T, T) ⇒ T): RDD[T] ⇒ T
- lookup(k: K): RDD[(K, V)] ⇒ Seq[V] (on hash/range-partitioned RDDs)
- save(path: String): outputs the RDD to a storage system, e.g., HDFS
Note that some operations are only available on key-value pairs, such as join. In addition, the function names match the APIs of Scala and other functional languages: for example, map is a one-to-one mapping, while flatMap maps each input to one or more outputs (similar to map in MapReduce).
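A quick sketch of that difference (assuming a SparkContext sc as before):

```scala
// Sketch: assumes a SparkContext `sc` already exists.
val lines = sc.parallelize(Seq("hello world", "spark rdd"))

val lengths = lines.map(_.length)          // exactly one output per input: 11, 9
val words   = lines.flatMap(_.split(" "))  // zero or more outputs per input:
                                           // "hello", "world", "spark", "rdd"
```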
In addition to these operations, users can ask for an RDD to be cached. They can also obtain an RDD's partitioning from its Partitioner and then partition another RDD in the same way. Some operations automatically produce a hash- or range-partitioned RDD, such as groupByKey, reduceByKey, and sort.
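A sketch of reusing a partitioner (assuming a SparkContext sc; the data is made up for illustration): both RDDs are partitioned with the same HashPartitioner, so a subsequent join does not need to re-shuffle the already partitioned data.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits

// Sketch: assumes a SparkContext `sc` already exists; the data is illustrative.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50), (1, 2.30)))

val partitioner = new HashPartitioner(4)
val usersByKey  = users.partitionBy(partitioner).cache()
val ordersByKey = orders.partitionBy(partitioner)   // partitioned the same way

// groupByKey, reduceByKey and sort also produce hash- or range-partitioned RDDs.
val joined = usersByKey.join(ordersByKey)   // co-partitioned join
```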
Starting from an example
The following example, taken from the RDD paper, implements the logic for processing ERROR lines in a log file stored in HDFS.
```scala
val lines = spark.textFile("hdfs://...")          // lines is an org.apache.spark.rdd.MappedRDD
val errors = lines.filter(_.startsWith("ERROR"))  // errors is an org.apache.spark.rdd.FilteredRDD
errors.cache()                                    // persist errors in memory
errors.count()                                    // trigger an action: count the errors, i.e. the number of ERROR lines

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
  .map(_.split('\t')(3))
  .collect()
```
The spark in this example is an instance of org.apache.spark.SparkContext. Basically, a Spark application starts by defining a SparkContext. textFile is defined as follows:
```scala
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
```
hadoopFile creates an org.apache.spark.rdd.HadoopRDD, and calling map on the HadoopRDD produces a MappedRDD:
```scala
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))
```
errors.cache() is not executed immediately either. It marks the RDD so that, once computed, its results are cached for later reuse, which speeds up subsequent computations.
errors.count() triggers an action, at which point a job has to be submitted to the cluster:
```scala
/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
```
After the job is submitted, SparkContext hands it to the DAGScheduler via runJob. The DAGScheduler splits the current DAG into stages, generates a TaskSet for each stage, and submits the tasks through the TaskScheduler's submitTasks, which in turn calls the SchedulerBackend; the SchedulerBackend sends these tasks to the executors for execution.
How are stages divided? How are tasks generated? That will be analyzed next. I have to go to work tomorrow, so I'll rest early tonight.