Classification of the operators of Apache Spark


Spark operators can be broadly divided into the following two categories:

1) Transformation operators: a Transformation does not trigger job submission; it describes the intermediate processing steps of a job.

Transformations are deferred (lazy): converting one RDD into another does not execute immediately, and the operation is not actually triggered until an Action operator runs.

1. map operator
2. flatMap operator
3. mapPartitions operator
4. union operator
5. cartesian operator
6. groupBy operator
7. filter operator
8. sample operator
9. cache operator
10. persist operator
11. mapValues operator
12. combineByKey operator
13. reduceByKey operator
14. join operator

For details on what each Spark operator does, see http://www.cnblogs.com/zlslch/p/5723979.html

2) Action operators: such operators trigger SparkContext to submit a job.

An Action operator triggers Spark to submit a job (Job) and outputs the data out of the Spark system.

1. foreach operator
2. saveAsTextFile operator
3. collect operator
4. count operator


1. Transformation operators
(1) Map

map transforms each data item of the original RDD into a new element through the user-defined function f passed to map. In the source code, the map operator is equivalent to creating a new RDD: MappedRDD(this, sc.clean(f)).

Each box in Figure 1 represents an RDD partition; the partition on the left is mapped by the user-defined function f: T => U into the new RDD partition on the right. However, f does not actually operate on the data, together with the other functions of its stage, until an Action operator is triggered. In Figure 1, the data record V1 of the first partition is passed into f and output as the record V'1 of the converted partition.


Figure 1: map operator RDD transformation. Figure 2: flatMap operator RDD transformation.
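
As an illustrative sketch not taken from the original text, the following Scala snippet shows map in action. It assumes a local SparkContext named sc, created here for demonstration; the app name and input values are made up.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal local setup reused by the sketches in this article (hypothetical app name).
val conf = new SparkConf().setAppName("operator-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// f: T => U is applied to every element; nothing executes until an action fires.
val rdd = sc.parallelize(Seq(1, 2, 3))
val mapped = rdd.map(v => v * 10)          // transformation, recorded lazily
println(mapped.collect().mkString(", "))   // action triggers the job: 10, 20, 30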

(2) FlatMap
flatMap converts each element of the original RDD into new elements through the function f and merges the elements of every generated collection into a single collection; internally it creates FlatMappedRDD(this, sc.clean(f)).
Figure 2 shows an RDD partition under the flatMap operation. The function passed to flatMap is f: T => U, where T and U can be any data types; the data in the partition is converted into new data through the user-defined function f. The outer box can be viewed as an RDD partition, and each small box represents a collection. V1, V2, and V3 together form one data item of the RDD, perhaps stored as an array or other container; after conversion to V'1, V'2, V'3, the original array or container is broken apart, and the disassembled data become data items in the new RDD.
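
A minimal flatMap sketch, assuming the SparkContext sc from the map sketch above; the strings are invented for illustration:

// Each element maps to zero or more outputs, which are flattened into one RDD.
val lines = sc.parallelize(Seq("a b", "c d e"))
val words = lines.flatMap(line => line.split(" "))
println(words.collect().mkString(", "))   // a, b, c, d, e
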
(3) mapPartitions
The mapPartitions function obtains an iterator over each partition, so the whole partition's elements can be manipulated through that iterator inside the function. The internal implementation creates a MapPartitionsRDD. Each box in Figure 3 represents an RDD partition. The user filters all data in the partition through the function f: iter => iter.filter(_ >= 3), retaining the data greater than or equal to 3; for the partition containing 1, 2, and 3, only the element 3 remains after filtering.

Figure 3: mapPartitions operator RDD transformation
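
A mapPartitions sketch reproducing the filter-by-iterator example from the text, assuming the sc defined earlier:

// The function receives an iterator over a whole partition, so per-partition
// setup (e.g., opening a connection) could be done once before iterating.
val nums = sc.parallelize(1 to 6, numSlices = 2)
val kept = nums.mapPartitions(iter => iter.filter(_ >= 3))
println(kept.collect().sorted.mkString(", "))   // 3, 4, 5, 6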

(4) Union
The union function requires that the two RDDs' element data types be identical, and the returned RDD has the same element data type as the RDDs being merged. union does not deduplicate; it keeps all elements. To deduplicate, use distinct(). Spark also provides the more concise ++ operator, which is equivalent to a union operation.
The two large boxes on the left of Figure 4 represent two RDDs, and the small boxes inside them represent RDD partitions. The large box on the right represents the merged RDD, with its small boxes representing partitions. After merging, V1, V2, V3 ... V8 form one partition, and the other elements are merged in the same way.

Figure 4: union operator RDD transformation
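
A short union sketch, assuming the sc defined earlier; it also shows the ++ alias and distinct() mentioned above:

// union keeps duplicates; distinct() removes them; ++ is equivalent to union.
val left  = sc.parallelize(Seq(1, 2, 3))
val right = sc.parallelize(Seq(3, 4))
println(left.union(right).collect().sorted.mkString(", "))          // 1, 2, 3, 3, 4
println((left ++ right).distinct().collect().sorted.mkString(", ")) // 1, 2, 3, 4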

(5) Cartesian
cartesian performs a Cartesian product operation over all elements of the two RDDs; the internal implementation returns a CartesianRDD. The two large boxes on the left of Figure 5 represent two RDDs, with the small boxes inside representing RDD partitions; the large box on the right represents the resulting RDD, with its small boxes representing partitions.
For example, taking the Cartesian product of V1 with W1, W2, Q5 from the other RDD yields (V1, W1), (V1, W2), (V1, Q5).

Figure 5: cartesian operator RDD transformation
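
A cartesian sketch mirroring the V1/W1 example in the text, assuming the sc defined earlier:

// Every element of one RDD is paired with every element of the other.
val vs = sc.parallelize(Seq("V1"))
val ws = sc.parallelize(Seq("W1", "W2", "Q5"))
println(vs.cartesian(ws).collect().mkString(", "))
// (V1,W1), (V1,W2), (V1,Q5)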

(6) GroupBy
groupBy: a key is generated for each element through the user-supplied function, converting the data into key-value format, and then elements with the same key are placed into one group.
The function is implemented as follows:
1) Preprocess the user function:
val cleanF = sc.clean(f)
2) Apply the function in a map over the data, then perform the groupByKey grouping operation:
this.map(t => (cleanF(t), t)).groupByKey(p)
where p determines the number of partitions and the partition function, i.e., the degree of parallelism.

Each box in Figure 6 represents an RDD partition; elements with the same key are merged into one group. For example, V1 and V2 are merged under the key V, with value V1, V2, forming (V, Seq(V1, V2)).

Figure 6: groupBy operator RDD transformation
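
A groupBy sketch, assuming the sc defined earlier; the even/odd key function is invented for illustration:

// The key function runs on each element, then equal keys are grouped together.
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val grouped = data.groupBy(x => if (x % 2 == 0) "even" else "odd")
grouped.collect().foreach { case (k, group) => println(s"$k -> ${group.mkString(",")}") }
// odd -> 1,3,5
// even -> 2,4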

(7) Filter
The filter function filters the elements: the function f is applied to each element, elements for which it returns true remain in the RDD, and elements for which it returns false are filtered out. The internal implementation is equivalent to generating FilteredRDD(this, sc.clean(f)).
The following code is the internal implementation of the function:
def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))
Each box in Figure 7 represents an RDD partition, and T can be any type. The user-defined filter function f is applied to every data item, and the data items that satisfy the condition, i.e., return true, are preserved. For example, V2 and V3 are filtered out and V1 is retained, written V'1 to mark the distinction.
(8) sample
sample samples the elements of the RDD, obtaining a subset of all the elements. The user can set whether sampling is done with replacement, the sampling fraction, and the random seed, which together determine how the sampling is performed. The internal implementation generates SampledRDD(withReplacement, fraction, seed).
Function parameter settings:
- withReplacement = true: sampling with replacement.
- withReplacement = false: sampling without replacement.
Each box in Figure 8 is an RDD partition. The sample function samples 50% of the data; from V1, V2, U1, U2 ... U4, the sampled data V1 and U1, U2 form a new RDD.

Figure 7: filter operator RDD transformation. Figure 8: sample operator RDD transformation.
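
A combined filter and sample sketch, assuming the sc defined earlier. Note that sample's result size is only approximately fraction * N:

// filter keeps elements for which the predicate returns true.
val xs = sc.parallelize(1 to 10)
println(xs.filter(_ >= 3).collect().mkString(", "))   // 3, 4, ..., 10

// sample(withReplacement, fraction, seed); the seed makes the draw repeatable.
val sampled = xs.sample(withReplacement = false, fraction = 0.5, seed = 42L)
println(sampled.collect().mkString(", "))   // roughly half of the elements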

(9) Cache
cache caches the RDD's elements from disk into memory; it is equivalent to calling persist(MEMORY_ONLY).
Each box in Figure 9 represents an RDD partition: on the left the partitions' data is stored on disk, and the cache operator caches that data into memory.

Figure 9: cache operator RDD transformation

(10) persist
The persist function caches an RDD. How the data is cached is determined by the StorageLevel enumeration, whose levels combine several flags (see Figure 10): DISK means disk, MEMORY means memory (RAM), and SER means the data is serialized before storage.

The following is the function definition; StorageLevel is the enumeration representing the storage mode, which the user can choose as needed from Figure 10:
persist(newLevel: StorageLevel)
Figure 10 shows the modes in which the persist function can cache. For example, MEMORY_AND_DISK_SER means the data can be stored in both memory and disk and is stored in serialized form; the other levels are analogous.

Figure 10: persist operator RDD transformation

Each box in Figure 11 represents an RDD partition; DISK means stored on disk and MEM means stored in memory. The data initially all resides on disk; persist(MEMORY_AND_DISK) caches the data into memory, but some partitions cannot fit in memory, so the partitions containing V1, V2, and V3 are kept on disk.
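
A cache/persist sketch, assuming the sc defined earlier. The data volume here is trivial; the pattern matters, not the numbers:

import org.apache.spark.storage.StorageLevel

val derived = sc.parallelize(1 to 100).map(_ * 2)
derived.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk if memory is short
// derived.cache() would be the shorthand for persist(StorageLevel.MEMORY_ONLY)

println(derived.count())   // first action computes and populates the cache
println(derived.count())   // second action reuses the cached partitions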

(11) mapValues
mapValues performs a map operation on the Value in (Key, Value) pairs, not on the Key.

The boxes in Figure 12 represent RDD partitions. The function a => a + 2 applied to the key-value pair (V1, 1) adds 2 only to the value 1, returning the result 3.

Figure 11: persist operator RDD transformation. Figure 12: mapValues operator RDD transformation.
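
A mapValues sketch reproducing the a => a + 2 example, assuming the sc defined earlier:

// Only the value side of each (key, value) pair is transformed; keys are kept.
val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 1)))
println(pairs.mapValues(a => a + 2).collect().mkString(", "))
// (V1,3), (V2,3)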

(12) combineByKey
The following code is the definition of the combineByKey function:
combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)]

Parameter description:
- createCombiner: V => C. When no combiner C exists yet for a key, creates one from the value V, e.g., a Seq C containing V.
- mergeValue: (C, V) => C. When C already exists, merges a value into it, e.g., appends the item V to Seq C, or accumulates it.
- mergeCombiners: (C, C) => C. Merges two Cs into one.
- partitioner: Partitioner. The partitioner required for the shuffle.
- mapSideCombine: Boolean = true. To reduce the amount of data transferred, much of the combining can be done on the map side first; for example, all values with the same key within a partition can be accumulated before the shuffle.
- serializer: Serializer = null. Transport requires serialization, and the user can supply a custom serializer class.

For example, combineByKey can transform an RDD whose elements are (Int, Int) pairs into an RDD whose elements are of type (Int, Seq[Int]). The boxes in Figure 13 represent RDD partitions; through combineByKey, the data (V1, 2) and (V1, 1) are merged into (V1, Seq(2, 1)).

Figure 13: combineByKey operator RDD transformation
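
A combineByKey sketch matching the (Int, Seq[Int]) example above, assuming the sc defined earlier. This uses the overload without an explicit partitioner:

val nums2 = sc.parallelize(Seq((1, 2), (1, 1), (2, 5)))
val combined = nums2.combineByKey(
  (v: Int) => Seq(v),                       // createCombiner: first value for a key
  (c: Seq[Int], v: Int) => c :+ v,          // mergeValue: fold a value into C
  (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2  // mergeCombiners: merge partial Cs
)
combined.collect().foreach(println)   // (1,List(2, 1)), (2,List(5))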

(13) reduceByKey
reduceByKey is a simpler special case of combineByKey in which the combined type C equals the value type V (e.g., both Int): two values are merged into one value, such as by addition. Its createCombiner is therefore very simple, just returning V directly, and mergeValue and mergeCombiners share the same logic with no difference between them.
Function implementation:
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}
The boxes in Figure 14 represent RDD partitions. Records with the same key, (V1, 2) and (V1, 1), have their values added through the user-defined function (a, b) => a + b, producing the result (V1, 3).

Figure 14: reduceByKey operator RDD transformation
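
A reduceByKey sketch reproducing the (V1, 3) example, assuming the sc defined earlier:

// Values with the same key are merged pairwise with the given function.
val kv = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 4)))
println(kv.reduceByKey((a, b) => a + b).collect().mkString(", "))
// (V1,3), (V2,4)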

(14) join
join performs a cogroup operation on the two RDDs being connected, so that data items with the same key are placed in one partition. On the new RDD formed by cogroup, it takes the Cartesian product of the elements under each key, flattens the returned results into one set of tuples per key, and finally returns RDD[(K, (V, W))].
The following code is the function implementation of join; in essence it first co-partitions through the cogroup operator and then breaks the merged data apart with flatMapValues:
this.cogroup(other, partitioner).flatMapValues {
  case (vs, ws) => for (v <- vs; w <- ws) yield (v, w)
}
Figure 15 shows a join operation on two RDDs. The large boxes represent RDDs and the small boxes represent partitions within an RDD. The function pairs up elements with the same key; for example, for key V1 the connection produces tuples of the form (V1, (v, w)).

Figure 15: join operator RDD transformation
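
A join sketch, assuming the sc defined earlier; the user/score data is invented for illustration:

// Only keys present in both RDDs appear; values are paired as (V, W).
val names  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val scores = sc.parallelize(Seq((1, 90), (1, 75), (3, 60)))
println(names.join(scores).collect().mkString(", "))
// (1,(alice,90)), (1,(alice,75))  -- keys 2 and 3 have no partner and are dropped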

2. Action operators
In essence, an Action operator performs the runJob operation through SparkContext to submit a job, triggering execution of the RDD DAG.
  
For example, the code of the Action operator collect is as follows; interested readers can follow this entry point to dissect the source code:
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  /* Submit the job */
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
(1) foreach
foreach applies the function f to each element in the RDD; instead of returning an RDD or an Array, it returns Unit. Figure 16 shows the foreach operator applying a user-defined function to each data item. In this example the custom function is println(), and the console prints all the data items.
  

Figure 16: foreach operator RDD transformation
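
A foreach sketch, assuming the sc defined earlier. One caveat: println runs on the executors, so on a real cluster the output lands in the executor logs rather than the driver console (in local mode it prints as expected):

sc.parallelize(Seq(1, 2, 3)).foreach(x => println(x))   // returns Unit, not an RDD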

(2) saveAsTextFile
The function stores the RDD's data to a specified directory in HDFS. The following is the internal implementation of saveAsTextFile, which works by calling saveAsHadoopFile:
this.map(x => (NullWritable.get(), new Text(x.toString)))
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
Each element of the RDD is mapped to the pair (null, x.toString) and then written to HDFS.
The squares on the left of Figure 17 represent RDD partitions, and the squares on the right represent Blocks in HDFS. Through the function, each partition of the RDD is stored as a Block in HDFS.

  

Figure 17: saveAsHadoopFile operator RDD transformation
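
A saveAsTextFile sketch, assuming the sc defined earlier; the output path is hypothetical, and the directory must not already exist:

// Writes one part-NNNNN file per partition under the given directory.
sc.parallelize(Seq("a", "b", "c"), numSlices = 2)
  .saveAsTextFile("hdfs:///tmp/demo-output")   // hypothetical HDFS path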

(3) Collect
collect is equivalent to toArray; toArray is deprecated. collect returns the distributed RDD as a single stand-alone Scala Array, on which Scala's functional operations can then be used.
The squares on the left of Figure 18 represent RDD partitions, and the square on the right represents an array in single-machine memory. Through the function operation, the result is returned to the node where the Driver program runs and stored as an array.

Figure 18: collect operator RDD transformation

(4) Count
count returns the number of elements in the entire RDD.
The internal implementation of the function is:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
In Figure 19, the number of data items returned is 5; each box represents an RDD partition.

Figure 19: count operator RDD transformation
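
A closing sketch of the collect and count actions, assuming the sc defined earlier:

// Both are actions: each submits a job and returns a result to the driver.
val letters = sc.parallelize(Seq("A", "B", "C", "D", "E"))
val arr: Array[String] = letters.collect()   // whole dataset as a local array
println(arr.mkString(", "))                  // A, B, C, D, E
println(letters.count())                     // 5, as in Figure 19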

