[Bigdata] Spark RDD Summary

Source: Internet
Author: User
Tags: spark, rdd

1. What is an RDD?
The RDD (Resilient Distributed Dataset) is the core concept of Spark: a read-only, partitioned, elastic, distributed dataset whose data can be kept fully or partially in memory and reused across multiple computations.

2. Why was the RDD created?

(1) Traditional MapReduce has the advantages of automatic fault tolerance, load balancing, and scalability, but its biggest disadvantage is its acyclic data flow model, which forces a large amount of disk I/O in iterative computations. The RDD is the abstraction designed to address this shortcoming.

(2) The RDD is a special, fault-tolerant collection that can be distributed across the nodes of a cluster and operated on in parallel in a functional-programming style. An RDD can be understood as a read-only, shared-memory collection with a built-in fault-tolerance mechanism: it can only be produced by transforming an existing RDD, and its data can be loaded into memory for easy reuse.

A. It is distributed: it can be spread across multiple machines and computed in parallel.

B. It is elastic: when memory is insufficient during computation, it spills data to disk.

C. Its restrictions (read-only data, coarse-grained transformations) greatly reduce the overhead of automatic fault tolerance.

D. In essence, it is a more general iterative parallel computing framework in which users can explicitly control the intermediate results of a computation and reuse them freely in subsequent computations.

(3) The RDD fault-tolerance mechanism. There are two ways to make a distributed dataset fault tolerant: data checkpointing and recording the sequence of transformations that produced the RDD (its lineage).

Recording every fine-grained update is expensive. Therefore, the RDD only supports coarse-grained transformations: only the single operation performed on a whole block is recorded, and the resulting transformation sequence (lineage) of the RDD is created and stored. The lineage records, for each RDD, how it was derived from other RDDs and how to reconstruct a block of its data. For this reason the RDD fault-tolerance mechanism is also called "lineage" fault tolerance. The biggest challenge in implementing lineage-based fault tolerance is how to express the dependency between a parent RDD and a child RDD.

In fact, dependencies can be divided into two types, narrow dependencies and wide dependencies. Narrow dependency: each data block of the child RDD depends on only a small, fixed set of data blocks in the parent RDD. Wide dependency: a data block of the child RDD may depend on all the data blocks in the parent RDD. For example, in the map transformation each data block of the child RDD depends only on the corresponding data block of the parent RDD, while in the groupByKey transformation each data block of the child RDD depends on data blocks in all partitions of the parent RDD, because a given key may appear in any of them.

Two properties distinguish the two classes of dependency. First, with a narrow dependency a compute node can produce a data block of the child RDD directly from one data block of the parent RDD, whereas with a wide dependency all data blocks of the parent RDD must be computed first and the parent's outputs hashed and shuffled to the corresponding nodes before the child RDD can be computed. Second, when data is lost, a narrow dependency requires recomputing only the missing block, while a wide dependency requires recomputing all blocks of the ancestor RDDs to recover. Therefore, in a long lineage chain, especially one containing wide dependencies, data checkpoints should be set at appropriate points. These two properties also call for different task-scheduling and fault-recovery mechanisms for the two dependency types.
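The contrast can be seen directly in a small sketch. The following assumes a recent Spark with a SparkContext named sc (as in spark-shell); the data and variable names are illustrative only.

```scala
// Contrasting a narrow dependency (map) with a wide dependency (groupByKey),
// and inspecting the lineage ("descent") that Spark records for fault tolerance.
import org.apache.spark.{NarrowDependency, ShuffleDependency}

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val mapped  = pairs.map { case (k, v) => (k, v * 10) } // narrow: each child block reads one parent block
val grouped = pairs.groupByKey()                       // wide: a key may live in any parent block,
                                                       // so a shuffle across all parent blocks is needed

println(mapped.toDebugString)    // lineage with no shuffle boundary
println(grouped.toDebugString)   // lineage showing a shuffle boundary
println(mapped.dependencies.head.isInstanceOf[NarrowDependency[_]])         // true
println(grouped.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]) // true
```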

(4) The internal design of the RDD. Each RDD contains the following four parts:

A. The data blocks obtained by splitting the source data: the splits variable in the source code

B. Lineage information: the dependencies variable in the source code

C. A compute function describing how the RDD is calculated from its parent RDD: the iterator(split) and compute functions in the source code

D. Metadata about how blocks and data are placed, such as partitioner and preferredLocations in the source code

For example: (a) An RDD created from a file in a distributed file system has data blocks obtained by slicing the file; it has no parent RDD, and its compute function simply reads each line of the file and returns it as an element of the RDD. (b) An RDD obtained through the map function has the same data blocks as its parent RDD, and its compute function is the function applied to each element of the parent RDD.
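These four parts are visible through the public RDD API. A small sketch follows, assuming a SparkContext named sc; the HDFS path is a hypothetical placeholder.

```scala
// Surfacing the four parts of the RDD abstraction through the public API.
val lines   = sc.textFile("hdfs:///tmp/input.txt")  // blocks come from file splits; no parent RDD
val lengths = lines.map(_.length)                   // same partitioning as the parent;
                                                    // its compute function applies _.length per element

println(lengths.partitions.length)                          // A. the split/partition list
println(lengths.dependencies)                               // B. lineage information (a OneToOneDependency)
// C. compute/iterator(split) is invoked per partition on the executors, not from the driver
println(lengths.partitioner)                                // D. how (K, V) data is placed (None here)
println(lengths.preferredLocations(lengths.partitions(0)))  // D. preferred locations of the first block
```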


3. The position and role of the RDD in Spark

(1) Why does Spark exist? Because traditional parallel computing models cannot efficiently handle iterative computation and interactive computation, Spark's mission is to solve these two problems; this is the value of, and reason for, its existence.

(2) How does Spark handle iterative computation? The main idea is the RDD, which keeps all intermediate data in distributed memory. Iterative algorithms typically run repeatedly over the same dataset, so keeping that data in memory greatly reduces disk I/O. This is the core of Spark's approach: in-memory computing.
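A minimal sketch of the idea, assuming a SparkContext named sc and a hypothetical input path: the parsed dataset is cached once and then reused by every iteration instead of being re-read from disk.

```scala
// Cache a dataset in cluster memory and reuse it across iterations.
val points = sc.textFile("hdfs:///tmp/points.txt")   // hypothetical input path
  .map(_.split(",").map(_.toDouble))
  .cache()                                           // keep the parsed data in memory

var result = 0.0
for (i <- 1 to 10) {
  // each iteration reuses the cached partitions instead of re-reading and re-parsing the file
  result = points.map(p => p.sum * i).reduce(_ + _)
}
println(result)
```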

(3) How does Spark support interactive computation? Because Spark is implemented in Scala and is tightly integrated with it, Spark can reuse the Scala interpreter directly, making it easy to manipulate distributed datasets from the shell as if they were local collection objects.
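For example, the lines below can be typed directly into the Scala-based spark-shell (where sc is predefined), and the result comes back at the prompt as if a local collection had been queried; the values shown are just an illustration.

```scala
// Interactive exploration of a distributed dataset from the shell.
val nums = sc.parallelize(1 to 1000)
nums.filter(_ % 2 == 0).map(_ * 2).take(5)   // Array(4, 8, 12, 16, 20), returned immediately
```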

(4) What is the relationship between Spark and the RDD? The RDD is a fault-tolerant, memory-based abstraction for cluster computing, and Spark is the implementation of that abstraction.


4. How to operate on an RDD?

(1) How to obtain an RDD (a combined sketch follows this list)

A. Obtain it from a shared file system (e.g., HDFS)

B. Transform an existing RDD

C. Parallelize an existing Scala collection (any Seq object) by calling the parallelize method of SparkContext

D. Change the persistence of an existing RDD; RDDs are lazy and ephemeral by default. (RDD persistence: cache to keep it in memory; save to write it to a distributed file system.)
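A sketch combining the four options above, assuming a SparkContext named sc; the HDFS paths are placeholders.

```scala
// The four ways of obtaining or persisting an RDD listed above.
val fromHdfs    = sc.textFile("hdfs:///tmp/words.txt")   // A. from a shared file system
val transformed = fromHdfs.flatMap(_.split(" "))         // B. by transforming an existing RDD
val fromSeq     = sc.parallelize(Seq(1, 2, 3, 4), 2)     // C. parallelize a local Scala Seq
transformed.cache()                                      // D. persist: cache in memory ...
transformed.saveAsTextFile("hdfs:///tmp/words-out")      //    ... or save to a distributed file system
```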

(2) Two kinds of operations on an RDD (illustrated by the sketch after this list)

A. Actions: compute over the dataset and return a value to the driver program. For example, reduce aggregates all elements of the dataset with a function and returns the final result to the program.

B. Transformations: create a new dataset from an existing one and return a new RDD. For example, map passes each element of the dataset through a function and returns a new distributed dataset.
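A minimal sketch of the distinction, assuming a SparkContext named sc: transformations are lazy and only describe a new RDD, while actions trigger the computation and bring a value back to the driver.

```scala
// Transformation vs. action.
val data    = sc.parallelize(1 to 100)
val squared = data.map(x => x * x)    // transformation: nothing runs yet, a new RDD is returned
val total   = squared.reduce(_ + _)   // action: the job executes and a value comes back to the driver
println(total)                        // 338350
```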

(3) Actions in detail (a usage sketch follows the list):

reduce(func): Aggregates all elements of the dataset with the function func, which takes two arguments and returns one value. The function must be associative so that it can be executed correctly in parallel.
collect(): Returns all elements of the dataset to the driver program as an array. This is usually used after filter or another operation that returns a sufficiently small subset of the data; calling collect directly on a large RDD is likely to cause the driver program to OOM.
count(): Returns the number of elements in the dataset.
take(n): Returns an array consisting of the first n elements of the dataset. Note that this operation is not currently executed in parallel on multiple nodes; instead, the machine running the driver program computes all of the elements (memory pressure on the driver increases, so use with caution).
first(): Returns the first element of the dataset (similar to take(1)).
saveAsTextFile(path): Saves the elements of the dataset as text files to a local file system, HDFS, or any other Hadoop-supported file system. Spark calls each element's toString method and writes it as one line of the file.
saveAsSequenceFile(path): Saves the elements of the dataset in SequenceFile format to the specified directory on the local file system, HDFS, or any other Hadoop-supported file system. The elements of the RDD must be key-value pairs that implement Hadoop's Writable interface, or be implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double, and String).
foreach(func): Runs the function func on each element of the dataset. This is typically used to update an accumulator variable or to interact with an external storage system.
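A short sketch exercising these actions, assuming a SparkContext named sc; the output path is a placeholder.

```scala
// Actions: each of these triggers a job and returns (or writes) a result.
val nums = sc.parallelize(Seq(3, 1, 4, 1, 5, 9), 2)

nums.reduce(_ + _)                          // 23: aggregate with an associative function
nums.collect()                              // Array(3, 1, 4, 1, 5, 9): pull everything to the driver
nums.count()                                // 6
nums.take(3)                                // Array(3, 1, 4): assembled on the driver side
nums.first()                                // 3, same as take(1)(0)
nums.saveAsTextFile("hdfs:///tmp/nums-out") // one text line per element via toString
nums.foreach(x => println(x))               // runs on the executors, e.g. to update an accumulator
```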


(4) Transformations in detail (a usage sketch follows the list)

map(func): Returns a new distributed dataset formed by passing each element of the source through the function func.
filter(func): Returns a new dataset formed by selecting the source elements for which func returns true.
flatMap(func): Similar to map, but each input element is mapped to zero or more output elements (so func should return a Seq rather than a single element).
sample(withReplacement, fraction, seed): Randomly samples a fraction of the data using the given random seed.
union(otherDataset): Returns a new dataset that is the union of the source dataset and the argument.
groupByKey([numTasks]): Called on a dataset of (K, V) pairs; returns a dataset of (K, Seq[V]) pairs. Note: by default, 8 parallel tasks are used for grouping; you can pass the optional numTasks argument to set a different number of tasks depending on the data volume.
reduceByKey(func, [numTasks]): Called on a dataset of (K, V) pairs; returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function. As with groupByKey, the number of tasks can be configured through a second optional argument.
join(otherDataset, [numTasks]): Called on datasets of types (K, V) and (K, W); returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
groupWith(otherDataset, [numTasks]): Called on datasets of types (K, V) and (K, W); returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is called cogroup.
cartesian(otherDataset): Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs covering all combinations of elements.
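A short sketch exercising these transformations, assuming a SparkContext named sc; all data and variable names are illustrative.

```scala
// Transformations: each call returns a new RDD lazily; collect() at the end triggers execution.
val words = sc.parallelize(Seq("a", "bb", "ccc", "bb"))
val pairs = words.map(w => (w, 1))                   // map
val long  = words.filter(_.length > 1)               // filter
val chars = words.flatMap(_.toSeq)                   // flatMap: one word -> many characters
val some  = words.sample(false, 0.5, 42L)            // sample(withReplacement, fraction, seed)
val all   = words.union(long)                        // union

val counts  = pairs.reduceByKey(_ + _)               // reduceByKey: (word, total count)
val grouped = pairs.groupByKey()                     // groupByKey: (word, all the 1s for that word)
val other   = sc.parallelize(Seq(("bb", "x"), ("a", "y")))
val joined  = pairs.join(other)                      // join: (key, (count, tag)) for keys in both
val cogrp   = pairs.cogroup(other)                   // groupWith / cogroup: values from both sides per key
val cross   = words.cartesian(other)                 // Cartesian product of the two datasets

joined.collect().foreach(println)                    // e.g. (a,(1,y)), (bb,(1,x)), (bb,(1,x))
```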
