Spark RDD Operations (2)

Last Update:2016-09-02 Source: Internet

Author: User


Transformation operators on Value-type data can be divided into the following categories, according to the relationship between the input partitions and the output partitions of the RDD transformation operator.

1) Input partition and output partition one-to-one.

2) Input partition and output partition many-to-one.

3) Input partition and output partition many-to-many.

4) Output partition is a subset of the input partition.

5) A special type with respect to input and output partitions: the cache type. Cache operators cache RDD partitions.

1. Input partition and output partition one-to-one

(1) map

map transforms each data item of the original RDD into a new element through the user-defined function f. In the source code, the map operator amounts to constructing a new RDD: new MappedRDD(this, sc.clean(f)).

Each box in Figure 3-4 represents an RDD partition. The partition on the left is mapped to the new RDD partition on the right by the user-defined function f: T -> U. However, f is not applied to the data until an action operator is triggered; at that point it runs in the same stage as the other functions. For example, input V1 is transformed by f into output V'1.
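The lazy behaviour described above can be sketched as follows. This is a minimal illustration rather than code from the original article: it assumes Spark is on the classpath, and the local master, app name, and sample numbers are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("map-sketch")
val sc = SparkContext.getOrCreate(conf)

val numbers = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)
// map is a transformation: this line only records f, nothing is computed yet.
val doubled = numbers.map(_ * 2)
// collect() is an action: it triggers the job and applies f to every element.
val result = doubled.collect().toList

println(result) // List(2, 4, 6, 8)
sc.stop()
```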

(2) flatMap

flatMap converts each element of the original RDD into a collection of new elements through the function f, and then merges the elements of all resulting collections into a single collection. Internally it creates new FlatMappedRDD(this, sc.clean(f)).

In Figure 3-5, the outer box can be regarded as an RDD partition and each small box represents a collection. flatMap is applied to the partition; in the function f: T -> U passed to flatMap, T and U can be any data type. The data in the partition is converted to new data by the user-defined function f. V1, V2, and V3 sit in one collection as a single data item of the RDD; after conversion to V'1, V'2, and V'3, the collection is broken up and each of them becomes a separate data item of the new RDD.
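A small sketch of the flattening step, assuming Spark is on the classpath. The local master, app name, and the space-separated sample strings are illustrative assumptions, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("flatMap-sketch")
val sc = SparkContext.getOrCreate(conf)

// Each input item is one string holding several values.
val lines = sc.parallelize(Seq("v1 v2 v3", "u1 u2"), numSlices = 2)
// f returns a collection per item; flatMap merges all collections into one RDD.
val items = lines.flatMap(line => line.split(" ").toSeq).collect().toList

println(items) // List(v1, v2, v3, u1, u2)
sc.stop()
```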

(3) mapPartitions

mapPartitions obtains an iterator over each partition, and the function operates on all elements of the partition through that iterator. Internally it generates a MapPartitionsRDD. Each box in Figure 3-6 represents an RDD partition.

In Figure 3-6, the user filters all data in a partition with the function f(iter) => iter.filter(_ >= 3), which keeps the elements greater than or equal to 3. For the partition containing 1, 2, and 3, only the element 3 remains after filtering.
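The same filter can be tried out with a sketch like the one below. It assumes Spark is on the classpath; the local master, app name, and the six sample numbers are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("mapPartitions-sketch")
val sc = SparkContext.getOrCreate(conf)

// With two slices the partitions are (1, 2, 3) and (4, 5, 6).
val data = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), numSlices = 2)
// The function is called once per partition and receives that partition's iterator.
val kept = data.mapPartitions(iter => iter.filter(_ >= 3)).collect().toList

println(kept) // List(3, 4, 5, 6)
sc.stop()
```

Because the function runs once per partition rather than once per element, this operator is a common place to put per-partition setup work.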

(4) glom

glom turns each partition into an array. Internally it returns a GlommedRDD. Each box in Figure 3-7 represents an RDD partition. The figure shows that glom turns a partition containing V1, V2, and V3 into the array Array[(V1), (V2), (V3)].
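To see the per-partition arrays, a sketch along these lines may help. It assumes Spark is on the classpath, and the local master, app name, and sample numbers are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("glom-sketch")
val sc = SparkContext.getOrCreate(conf)

val data = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)
// glom() yields one array per partition; convert to lists for readable printing.
val perPartition = data.glom().collect().map(_.toList).toList

println(perPartition) // List(List(1, 2), List(3, 4))
sc.stop()
```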

2. Input partition and output partition many-to-one

(1) union

union requires that the two RDDs have the same element type. The returned RDD has the same element type as the RDDs being merged. union does not deduplicate, so all elements are kept; to remove duplicates, use distinct(). The ++ operator is equivalent to union.

In Figure 3-8, the two large boxes on the left represent the two RDDs, and the small boxes inside them represent RDD partitions. The large box on the right represents the merged RDD, and the small boxes inside it represent its partitions. The RDD containing V1, V2 ... U4 and the RDD containing V1, V8 ... U8 are merged so that all elements form one RDD. V1, V1, V2, and V8 form one partition, and the other elements are merged in the same way.
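The difference between union and union followed by distinct() can be sketched as below. Spark on the classpath is assumed; the local master, app name, and sample strings are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("union-sketch")
val sc = SparkContext.getOrCreate(conf)

val first = sc.parallelize(Seq("v1", "v2"))
val second = sc.parallelize(Seq("v1", "v8"))

// ++ is the operator form of union; the duplicate "v1" is kept.
val merged = (first ++ second).collect().toList.sorted
// distinct() removes the duplicate afterwards.
val deduplicated = first.union(second).distinct().collect().toList.sorted

println(merged)       // List(v1, v1, v2, v8)
println(deduplicated) // List(v1, v2, v8)
sc.stop()
```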

(2) cartesian

cartesian computes the Cartesian product of all elements of two RDDs. Internally it returns a CartesianRDD. In Figure 3-9, the two large boxes on the left represent the two RDDs, and the small boxes inside them represent RDD partitions. The large box on the right represents the resulting RDD, and the small boxes inside it represent its partitions.

For example, the Cartesian product pairs V1 with W1, W2, and Q5 from the other RDD to form (V1,W1), (V1,W2), and (V1,Q5).
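A minimal sketch of the pairing, assuming Spark is on the classpath. The local master, app name, and element names are illustrative; the result is collected into a set because the order of the pairs depends on partitioning.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("cartesian-sketch")
val sc = SparkContext.getOrCreate(conf)

val left = sc.parallelize(Seq("v1", "v2"))
val right = sc.parallelize(Seq("w1", "w2", "q5"))
// Every element of left is paired with every element of right: 2 * 3 = 6 pairs.
val pairs = left.cartesian(right).collect().toSet

println(pairs.size) // 6
sc.stop()
```

Since the output size is the product of the two input sizes, this operator can become expensive quickly on large RDDs.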

3. Input partition and output partition many-to-many

groupBy: A key is generated for each element by the user function, the data is converted to key-value format, and elements with the same key are then grouped together.
The function is implemented as follows.

① sc.clean() preprocesses the user function:
val cleanF = sc.clean(f)

② The data is mapped to key-value pairs, and finally grouped with groupByKey:

this.map(t => (cleanF(t), t)).groupByKey(p)

Here p determines the number of partitions and the partition function, and therefore the degree of parallelism. Each box in Figure 3-10 represents an RDD partition.

In Figure 3-10, elements with the same key are merged into one group. For example, V1 and V2 are merged into one key-value pair whose key is "V" and whose value is "V1, V2", forming (V, Seq(V1, V2)).
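The two steps above can be sketched from the user's side as follows. Spark on the classpath is assumed; the local master, app name, sample strings, the first-letter key function, and the choice of two output partitions are all assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("groupBy-sketch")
val sc = SparkContext.getOrCreate(conf)

val values = sc.parallelize(Seq("v1", "v2", "u1", "u2", "u3"), numSlices = 2)

// groupBy with a key function (first letter) and two output partitions.
val grouped = values.groupBy((s: String) => s.take(1), 2)
// The same grouping spelled out as map followed by groupByKey.
val spelledOut = values.map(s => (s.take(1), s)).groupByKey(2)

// Collect to the driver and sort each group so the two results can be compared.
val groupedMap = grouped.collect().map { case (k, vs) => (k, vs.toList.sorted) }.toMap
val spelledOutMap = spelledOut.collect().map { case (k, vs) => (k, vs.toList.sorted) }.toMap

println(groupedMap("v")) // List(v1, v2)
sc.stop()
```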

4. Output partition is a subset of the input partition

(1) filter

filter applies the function f to every element. Elements for which f returns true are kept in the RDD; elements for which it returns false are filtered out. The internal implementation is equivalent to generating new FilteredRDD(this, sc.clean(f)).

The following code is the internal implementation of the function.

def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))

Each box in Figure 3-11 represents an RDD partition. T can be of any type. The user-defined filter function f is applied to each data item, and the items for which it returns true are kept. For example, V2 and V3 are filtered out and V1 is kept; it is labeled V'1 to distinguish it.
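A short usage sketch of filter, assuming Spark is on the classpath. The local master, app name, sample numbers, and the odd-number predicate are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("filter-sketch")
val sc = SparkContext.getOrCreate(conf)

val data = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)
// Keep only the elements for which the predicate returns true.
val odd = data.filter(_ % 2 == 1).collect().toList

println(odd) // List(1, 3, 5)
sc.stop()
```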

(2) distinct

distinct removes duplicate elements from the RDD. Each box in Figure 3-12 represents an RDD partition.

The data is deduplicated by the distinct function. For example, of the duplicate data V1 and V1, only one copy of V1 is kept.
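As a quick illustration (Spark on the classpath assumed; the local master, app name, and sample strings are made up), deduplication looks like this. The result is sorted on the driver because distinct does not guarantee any ordering.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("distinct-sketch")
val sc = SparkContext.getOrCreate(conf)

val data = sc.parallelize(Seq("v1", "v1", "v2"), numSlices = 2)
// Only one copy of the duplicated "v1" survives.
val unique = data.distinct().collect().toList.sorted

println(unique) // List(v1, v2)
sc.stop()
```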

(3) subtract

subtract is equivalent to a set difference: it removes from RDD 1 all elements that are in the intersection of RDD 1 and RDD 2.

In Figure 3-13, the two large boxes on the left represent the two RDDs, and the small boxes inside them represent RDD partitions. The large box on the right represents the resulting RDD, and the small boxes inside it represent its partitions. V1 is in both RDDs, so by the rules of set difference the new RDD does not keep it. V2 is in the first RDD but not in the second, so the new RDD contains V2.
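The V1/V2 example can be reproduced with a sketch like the following. It assumes Spark is on the classpath; the local master, app name, and the extra element "v3" in the second RDD are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("subtract-sketch")
val sc = SparkContext.getOrCreate(conf)

val first = sc.parallelize(Seq("v1", "v2"))
val second = sc.parallelize(Seq("v1", "v3"))
// "v1" is in both RDDs and is dropped; "v2" is only in the first and is kept.
val difference = first.subtract(second).collect().toList

println(difference) // List(v2)
sc.stop()
```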

(4) sample

sample samples the elements of the RDD to obtain a subset of all elements. The user can choose whether to sample with replacement, the fraction to sample, and the random seed, which together determine how the sampling is done.

The internal implementation generates new SampledRDD(withReplacement, fraction, seed).

The function parameters are set as follows.

withReplacement=true: sampling with replacement;

withReplacement=false: sampling without replacement.

Each box in Figure 3-14 represents an RDD partition. The sample function samples 50% of the data: from V1, V2, U1, U2, U3, and U4, the data V1, U1, and U2 are sampled to form a new RDD.
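A hedged sketch of the three parameters in use, assuming Spark is on the classpath. The local master, app name, the range 1 to 100, and the seed value 42 are arbitrary choices for the example. Note that fraction is an expected proportion, so the sample size is only roughly 50% and no exact count is shown.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("sample-sketch")
val sc = SparkContext.getOrCreate(conf)

val data = sc.parallelize(1 to 100, numSlices = 4)

// Sampling without replacement, about half of the elements, with a fixed seed.
val sampled = data.sample(withReplacement = false, fraction = 0.5, seed = 42L).collect().toList
// Same RDD and same seed: the sample is reproducible.
val again = data.sample(withReplacement = false, fraction = 0.5, seed = 42L).collect().toList

println(s"sampled ${sampled.size} of 100 elements")
sc.stop()
```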

(5) takeSample

takeSample() works on the same principle as the sample function above, but instead of sampling by a fraction it samples a specified number of elements. The result is no longer an RDD: the sampled data is in effect collected with collect(), so the result is returned as an array on a single machine.

The boxes on the left in Figure 3-15 represent the partitions on the distributed nodes, and the box on the right represents the result array returned on a single machine. takeSample is set to sample one element, and the returned result is V1.
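The contrast with sample can be sketched like this, assuming Spark is on the classpath. The local master, app name, the range 1 to 100, the sample size of 3, and the seed are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("takeSample-sketch")
val sc = SparkContext.getOrCreate(conf)

val data = sc.parallelize(1 to 100, numSlices = 4)
// num is an exact count, and the result is a local Array rather than an RDD.
val picked: Array[Int] = data.takeSample(withReplacement = false, num = 3, seed = 42L)

println(picked.length) // 3
sc.stop()
```

Because the result is materialized as a local array, num should be kept small enough to fit in the memory of the machine that receives it.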

5. Cache type

(1) cache

cache caches the RDD elements from disk into memory, and is equivalent to persist(MEMORY_ONLY).

Each box in Figure 3-16 represents an RDD partition. On the left, the data partitions are stored on disk; the cache operator caches the data in memory.
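The equivalence with persist(MEMORY_ONLY) can be checked with a small sketch. Spark on the classpath is assumed; the local master, app name, and the range 1 to 10 are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("cache-sketch")
val sc = SparkContext.getOrCreate(conf)

// cache() only marks the RDD for caching; it returns the same RDD.
val data = sc.parallelize(1 to 10).cache()
val level = data.getStorageLevel
// The first action computes the partitions and stores them in memory.
val total = data.count()

println(level == StorageLevel.MEMORY_ONLY) // true
sc.stop()
```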

(2) persist

persist caches the RDD. Where the data is cached is determined by the StorageLevel enumeration type. Several combinations are available (see Figure 3-17): DISK stands for disk, MEMORY stands for memory, and SER indicates that the data is stored in serialized form.

The following is the function definition. StorageLevel is an enumeration type that represents the storage mode; the user can choose one as needed from Figure 3-17.

persist(newLevel: StorageLevel)

Figure 3-17 lists the modes that persist can use for caching. For example, MEMORY_AND_DISK_SER means the data can be stored in memory and on disk, in serialized form. The others follow the same pattern.

Each box in Figure 3-18 represents an RDD partition. DISK stands for storage on disk, and MEM stands for storage in memory. The data is initially all stored on disk and is cached into memory by persist(MEMORY_AND_DISK), but some partitions do not fit in memory. For example, in Figure 3-18 the partition containing V1, V2, and V3 is stored on disk, while the partition containing U1 and U2 is still stored in memory.
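How a storage level encodes the disk, memory, and serialization choices can be inspected with a sketch like this one. Spark on the classpath is assumed; the local master, app name, the range 1 to 10, and the choice of MEMORY_AND_DISK_SER are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: a local master with two threads; the app name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
val sc = SparkContext.getOrCreate(conf)

val data = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK_SER)
data.count() // the first action materializes the persisted partitions

val level = data.getStorageLevel
// (memory allowed, disk allowed, stored as deserialized objects)
val flags = (level.useMemory, level.useDisk, level.deserialized)

println(flags) // (true,true,false)
data.unpersist()
sc.stop()
```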

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
