[Original] RDD topics


What is an RDD? What role does it play in Spark? How is it used?

1. What is an RDD?

(1) Why does RDD exist? Although traditional MapReduce offers automatic fault tolerance, load balancing, and scalability, its biggest disadvantage is its acyclic data-flow model, which forces a large amount of disk I/O during iterative computing. RDD is the abstraction designed to remove this shortcoming.

(2) Description of RDD. An RDD (Resilient Distributed Dataset) is the most important abstraction provided by Spark. It is a special collection with a fault-tolerance mechanism that can be distributed across cluster nodes and operated on in parallel through a set of functional operations. An RDD provides a read-only shared dataset that can only be created by transforming an existing RDD or by loading data into memory for reuse.
A. It is distributed: the data can be spread across multiple machines for computation.
B. It is elastic: when memory is insufficient during computation, data is exchanged with disk.
C. These restrictions greatly reduce the overhead of automatic fault tolerance.
D. Essentially, it is a more general iterative parallel computing framework in which users can control intermediate results and reuse them freely in subsequent computations.

(3) Fault-tolerance mechanism of RDD. There are two general ways to make a distributed dataset fault tolerant: data checkpoints and recording updates. RDD adopts the record-update approach, but because recording every fine-grained update would be too costly, RDD only supports coarse-grained transformations: only the single operation applied to a whole block is recorded, and the sequence of transformations (the lineage) that created an RDD is stored. Lineage means that each RDD carries information about how it was derived from other RDDs and therefore how to reconstruct any piece of its data. For this reason, RDD fault tolerance is also called "lineage" fault tolerance. The biggest challenge in implementing lineage-based fault tolerance is expressing the dependencies between a parent RDD and its child RDD.

(4) Internal design of an RDD. Each RDD contains the following four parts:
A. The source data blocks it is split into (the splits variable in the source code).
B. Its dependencies (the dependencies variable).
C. A compute function describing how this RDD is calculated from its parent RDD (iterator(split) and compute in the source code).
D. Metadata about how the data is partitioned and placed, such as partitioner and preferredLocations in the source code.
For example, an RDD created from a file in a distributed file system obtains its data blocks by splitting the file; it has no parent RDD, and its compute function simply reads each line of the file and returns it as an element of the RDD. An RDD obtained through the map function has the same data blocks as its parent RDD, and its compute function is the function applied to each element of the parent RDD.

2. The position and role of RDD in Spark.

(1) Why does Spark exist? Because traditional parallel computing models cannot efficiently handle iterative and interactive computing. Spark's mission is to solve these two problems; this is also the value of and reason for its existence.

(2) How does Spark handle iterative computing? The main idea is the RDD, which keeps all intermediate data in distributed memory. Iterative computing typically runs over the same dataset repeatedly, so keeping the data in memory greatly reduces I/O. This is also the core of Spark: in-memory computing.

(3) How does Spark support interactive computing? Spark is implemented in Scala and integrates tightly with it, so Spark can reuse the Scala interpreter directly, which makes operating on distributed datasets as easy as operating on local collections.

(4) What is the relationship between Spark and RDD? RDD can be understood as the abstraction for fault-tolerant, memory-based cluster computing, and Spark is the implementation of that abstraction.

3. How to operate on an RDD?

(1) How to obtain an RDD (see the sketch after this list):
A. Load it from a shared file system (such as HDFS).
B. Transform it from an existing RDD.
C. Parallelize an existing Scala collection (any Seq object) by calling SparkContext's parallelize method.
D. Change the persistence of an existing RDD. RDDs are lazy and short-lived by default (RDD persistence: cache it in memory, or save it to a distributed file system).

(2) Two kinds of operations on an RDD:
A. Actions: return a value to the driver program after running a computation on the dataset. For example, reduce aggregates all elements of the dataset with a function and returns the final result to the program.
B. Transformations: create a new dataset from an existing one; the computation returns a new RDD. For example, map passes each element of the dataset through a function and returns a new distributed dataset of the results.

(3) Details of the actions:
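Before the individual actions, here is a minimal sketch of the ways to obtain an RDD listed in (1) above. It also sets up the SparkContext `sc` that the snippets after each operation assume; the application name, local master, and HDFS path are placeholders, not values from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell the SparkContext is already available as `sc`;
// this local-mode setup is only needed to run the snippets standalone.
val conf = new SparkConf().setAppName("rdd-notes").setMaster("local[*]")
val sc   = new SparkContext(conf)

// A. From a shared file system such as HDFS (the path is a placeholder).
val lines = sc.textFile("hdfs:///tmp/input.txt")

// B. By transforming an existing RDD.
val lengths = lines.map(_.length)

// C. By parallelizing an existing Scala collection (any Seq).
val nums = sc.parallelize(1 to 10)

// D. By changing the persistence of an existing RDD: cache it in memory
//    (or save it to a distributed file system with saveAsTextFile).
nums.cache()
```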

reduce(func)

Aggregates all elements of the dataset using the function func, which takes two arguments and returns one value. The function must be associative so that it can be executed correctly in parallel.
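A short sketch, assuming the `sc` from the setup above; the data is made up for illustration.

```scala
// Sum all elements; `_ + _` is associative (and commutative), so partial
// results from different partitions can be combined in any order.
val nums = sc.parallelize(1 to 100)
val sum  = nums.reduce(_ + _)   // 5050
```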

collect()

Returns all elements of the dataset to the driver program as an array. This is usually useful after a filter or other operation that leaves only a small subset of the data; calling collect on an entire large RDD can easily cause the driver program to run out of memory (OOM).
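A sketch of the safe pattern, with made-up sample data:

```scala
val logs   = sc.parallelize(Seq("INFO start", "ERROR disk full", "INFO done"))
// Safe: filter first so only a small subset is pulled back to the driver.
val errors = logs.filter(_.startsWith("ERROR")).collect()
// Risky on a large RDD: logs.collect() copies every element into driver memory.
```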

count()

Returns the number of elements in the dataset.

take(n)

Returns an array with the first n elements of the dataset. Note: this operation is not executed in parallel on multiple nodes; instead, the elements are computed on the machine where the driver program runs.

(This increases memory pressure on the driver/gateway machine, so use it with caution.)

first()

Returns the first element of the dataset (similar to take(1)).
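A combined sketch of count, take, and first, assuming `sc` from the setup above:

```scala
val nums = sc.parallelize(1 to 1000)
nums.count()   // 1000               -- number of elements in the dataset
nums.take(5)   // Array(1, 2, 3, 4, 5) -- gathered on the driver, not in parallel
nums.first()   // 1                  -- equivalent to take(1)
```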

saveAsTextFile(path)

Saves the dataset's elements as a text file to the local file system, HDFS, or any other file system supported by Hadoop. Spark calls each element's toString method to convert it into a line of text in the file.

saveAsSequenceFile(path)

Saves the dataset's elements in SequenceFile format to the specified directory in the local file system, HDFS, or any other file system supported by Hadoop. The elements of the RDD must be key-value pairs that either implement Hadoop's Writable interface or can be implicitly converted to Writable (Spark includes conversions for basic types such as Int, Double, String, etc.).
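A sketch covering both save operations; the output paths are placeholders, and the String/Int pair types rely on Spark's built-in Writable conversions.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Each element is written via its toString, one line per element.
pairs.saveAsTextFile("hdfs:///tmp/pairs-text")

// Requires key/value types that are Writable or implicitly convertible;
// String and Int are covered by Spark's built-in conversions.
pairs.saveAsSequenceFile("hdfs:///tmp/pairs-seq")
```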

foreach(func)

Runs the function func on every element of the dataset. This is usually used to update an accumulator variable or to interact with an external storage system.
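A sketch of the accumulator pattern; it assumes the Spark 2.x accumulator API (sc.longAccumulator), which is not part of the original article.

```scala
// foreach runs on the executors, so a plain local variable in the driver
// would not see the updates; an accumulator collects them safely.
val nums  = sc.parallelize(1 to 100)
val total = sc.longAccumulator("total")
nums.foreach(x => total.add(x))
println(total.value)   // 5050
```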

 

(4) Details of the transformations:

map(func)

Returns a new distributed dataset formed by passing each element of the source dataset through the function func.

filter(func)

Returns a new dataset formed by selecting those elements of the source dataset for which func returns true.

flatMap(func)

Similar to map, but each input element can be mapped to zero or more output elements (so the return value of func is a Seq rather than a single element).
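A combined sketch of map, filter, and flatMap, with made-up sample data:

```scala
val lines  = sc.parallelize(Seq("to be or", "not to be"))
val upper  = lines.map(_.toUpperCase)       // exactly one output per input
val longer = lines.filter(_.length > 8)     // keep elements where func is true
val words  = lines.flatMap(_.split(" "))    // zero or more outputs per input
words.collect()   // Array(to, be, or, not, to, be)
```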

sample(withReplacement, fraction, seed)

Randomly samples a fraction of the data, with or without replacement, using the given random seed.
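A short sketch; the fraction and seed values are arbitrary.

```scala
// Each element is kept with probability `fraction`, so the result size is
// only approximately fraction * count; the seed makes the sample repeatable.
val nums   = sc.parallelize(1 to 1000)
val tenPct = nums.sample(withReplacement = false, fraction = 0.1, seed = 42L)
tenPct.count()   // roughly 100
```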

union(otherDataset)

Returns a new dataset formed by the union of the source dataset and the argument dataset.
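A short sketch with made-up data:

```scala
val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4, 5))
a.union(b).collect()   // Array(1, 2, 3, 3, 4, 5) -- duplicates are kept
```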

groupByKey([numTasks])

Called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default, 8 parallel tasks are used for grouping; you can pass an optional numTasks argument to set a different number of tasks depending on the data volume.

(Combining groupByKey with a follow-up operation such as filter can implement a Reduce function similar to Hadoop's.)
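A sketch of grouping followed by a reduce-like step; here a map over the grouped values is used to sum each key, which is one common way to get Hadoop-reduce-like behaviour after grouping (the data is made up).

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()             // ("a", [1, 3]), ("b", [2])
// Reduce-like step after grouping: sum each key's values.
val sums    = grouped.map { case (k, vs) => (k, vs.sum) }
sums.collect()   // Array((a,4), (b,2))
```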

reduceByKey(func, [numTasks])

Called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function. As with groupByKey, the number of tasks can be configured through the second optional parameter.
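A short sketch, reusing the made-up pair data from above:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
pairs.reduceByKey(_ + _, 4)          // same result, but with 4 reduce tasks
```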

join(otherDataset, [numTasks])

Called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs in which all pairs of elements for each key are brought together.
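A short sketch; the names and values are made up for illustration.

```scala
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Oslo")))
ages.join(cities).collect()
// Array((alice,(30,Paris)), (bob,(25,Oslo)))
```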

groupWith(otherDataset, [numTasks])

Called on datasets of type (K, V) and (K, W), returns a dataset whose elements are (K, Seq[V], Seq[W]) tuples. This operation is called cogroup in other frameworks.
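A sketch using cogroup, which in current Spark APIs is the name for this operation (groupWith being an alias); the data is made up.

```scala
val ages   = sc.parallelize(Seq(("alice", 30), ("alice", 31), ("bob", 25)))
val cities = sc.parallelize(Seq(("alice", "Paris")))
ages.cogroup(cities).collect()
// Conceptually: alice -> ([30, 31], [Paris]), bob -> ([25], [])
```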

cartesian(otherDataset)

Cartesian product. Called on datasets of type T and U, returns a dataset of (T, U) pairs covering every combination of elements from the two datasets.
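A short sketch with made-up data:

```scala
val letters = sc.parallelize(Seq("a", "b"))
val nums    = sc.parallelize(Seq(1, 2))
letters.cartesian(nums).collect()
// Array((a,1), (a,2), (b,1), (b,2)) -- every combination of the two datasets
```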

sortByKey([ascending])

Called on a dataset of type (K, V), returns the (K, V) pairs sorted by the key K. Ascending or descending order is determined by the boolean ascending argument.

(Similar to the sort-by-key step in the intermediate shuffle phase of Hadoop's Map-Reduce.)
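A short sketch, again with made-up pair data:

```scala
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
pairs.sortByKey().collect()          // Array((a,1), (b,2), (c,3))
pairs.sortByKey(ascending = false)   // descending order by key
```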

 
