1. Transformation operators / action operators
Spark operators can be divided into transformation (also called conversion) operators and action operators. An operator that does not trigger job submission and only performs intermediate processing is a transformation operator; conversely, an operator that, when the RDD is evaluated, triggers the SparkContext to submit a job is an action operator. Transformation operators execute lazily, while an action immediately executes the current operator together with all the operators leading up to it. Why does Spark adopt this approach? Spark does not run the chain of lazy operations right away; instead it builds a DAG (directed acyclic graph) from the execution plan, and only when an action operator is encountered is the DAG actually executed. The purpose is that pipeline optimization can be performed during DAG execution (pipeline optimization is described in a later section) to reduce shuffles, instead of executing and materializing the operators one by one, which improves Spark's execution efficiency. So which operators are actions and which are transformations? A quick sketch of the laziness itself comes first; the categories follow.
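As a minimal illustration of this laziness, assuming a local PySpark setup (the variable names and sample data here are invented for the sketch):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "lazy-demo")
    nums = sc.parallelize([1, 2, 3, 4])

    # Transformations: nothing executes yet; Spark only records the lineage (DAG).
    doubled = nums.map(lambda x: x * 2)
    evens = doubled.filter(lambda x: x % 4 == 0)

    # Action: only now does the SparkContext submit a job and run the DAG.
    print(evens.collect())  # [4, 8]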
1) Transformation operators on value-type data (a short sketch follows this list):
One-to-one type: map / flatMap / mapPartitions / glom (glom converts the elements of type T in each partition into an Array[T])
Many-to-one type: union (concatenates without deduplicating) / cartesian. These two operators are somewhat like their counterparts in MySQL: the resulting number of partitions is the sum (union) or the product (cartesian) of the two input RDDs' partition counts, and content-wise one accumulates elements while the other forms the product. cartesian turns plain RDDs into a pair RDD, which is very memory-hungry and should be used with caution. intersection: takes the intersection of the two RDDs and deduplicates.
Many-to-many type: the groupBy family, for example groupByKey, which redistributes data from multiple input partitions to multiple output partitions according to the key.
Output-partition-is-a-subset-of-the-input-partition type: filter (filtering) / distinct (deduplication) / subtract (rdd1.subtract(rdd2) keeps the elements that appear in rdd1 but not in rdd2) / sample and takeSample (takeSample returns an Array rather than an RDD, which makes it a bit different from the others).
Cache type: the cache / persist operators
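A quick sketch of these value-type transformations, reusing the sc from the sketch above (the sample data is invented, and the order of results after a shuffle may vary):

    rdd1 = sc.parallelize([1, 2, 3], 2)
    rdd2 = sc.parallelize([3, 4], 1)

    print(rdd1.map(lambda x: x * 10).collect())    # one-to-one: [10, 20, 30]
    print(rdd1.glom().collect())                   # elements per partition: [[1], [2, 3]]
    print(rdd1.union(rdd2).collect())              # no dedup: [1, 2, 3, 3, 4]
    print(rdd1.union(rdd2).getNumPartitions())     # 2 + 1 = 3 partitions
    print(rdd1.cartesian(rdd2).count())            # 3 * 2 = 6 pairs
    print(rdd1.intersection(rdd2).collect())       # deduplicated: [3]
    print(rdd1.filter(lambda x: x > 1).collect())  # output is a subset of input: [2, 3]
    print(rdd1.subtract(rdd2).collect())           # in rdd1 but not in rdd2: [1, 2]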
2) Transformation operators on key-value data (a short sketch follows this list):
One-to-one type: mapValues operator
Aggregation within a single RDD: reduceByKey / combineByKey (see https://blog.csdn.net/dapanbest/article/details/81096279) / partitionBy (repartitions the RDD by key)
Aggregation across two RDDs: the cogroup operator
Joins: the join operator / the leftOuterJoin and rightOuterJoin operators
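A corresponding sketch for the key-value transformations (again with invented pair data):

    pairs1 = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    pairs2 = sc.parallelize([("a", 10), ("c", 30)])

    print(pairs1.mapValues(lambda v: v + 1).collect())       # one-to-one on the values
    print(pairs1.reduceByKey(lambda x, y: x + y).collect())  # single-RDD aggregation: ("a", 4), ("b", 2)
    print(pairs1.join(pairs2).collect())                     # inner join: ("a", (1, 10)), ("a", (3, 10))
    print(pairs1.leftOuterJoin(pairs2).collect())            # keeps ("b", (2, None))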
3) Action operators (a short sketch follows this list)
No output: foreach operator
HDFS: saveAsTextFile (saves partition by partition in HDFS text style) / saveAsObjectFile (serializes the RDD's elements into objects and saves them)
Scala collections and values: collect
collectAsMap (applies to K-V elements; if a key appears more than once, the later value overwrites the earlier one)
reduce
lookup (applies to K-V elements: lookup(K) first checks whether the RDD has a partitioner; if it does, only the partition that holds K is searched, otherwise all partitions are scanned for the result)
count / top
reduce / fold: num = sc.parallelize([1,2,3,4]); sum = num.reduce(lambda x, y: x + y); num.fold(0, lambda x, y: x + y). Both require the elements to be of the same type; the difference is that fold also takes an initial value.
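A short sketch of these actions (PySpark; the data is invented, and note how the later value for "a" wins in collectAsMap):

    num = sc.parallelize([1, 2, 3, 4])
    kv = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

    print(num.reduce(lambda x, y: x + y))   # 10
    print(num.fold(0, lambda x, y: x + y))  # 10; 0 is the initial value
    print(num.count())                      # 4
    print(num.top(2))                       # [4, 3]
    print(kv.collectAsMap())                # {'a': 2, 'b': 3} -- later value overwrote ('a', 1)
    print(kv.lookup("a"))                   # [1, 2]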
2. Narrow dependency / wide dependency
The execution of the entire DAG is really the process of transforming one RDD into another, and in this process a relationship forms between a parent RDD and its child RDD: the child RDD's dependency on the parent. This dependency is realized by the child RDD recording its lineage back to the parent. Dependencies between RDDs fall into narrow dependencies and wide dependencies. Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD, that is, one-to-one or many-to-one. Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD, that is, one-to-many. Narrow dependencies can avoid shuffle, while wide dependencies require it, so narrow dependencies are also more efficient. Therefore, if the DAG contains a run of consecutive narrow dependencies, they can be fused and executed back to back without an intermediate shuffle, greatly improving efficiency; this optimization is called pipeline optimization. The key to improving performance is reducing shuffle. The sketch below shows how to inspect a lineage for its shuffle boundary; after that, let's sort out which operators are narrow dependencies and which are wide.
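A sketch of inspecting a lineage, reusing the sc from the earlier sketches (in the toDebugString output, each extra level of indentation marks a new stage, i.e. a shuffle boundary):

    kv = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

    # mapValues and filter are narrow dependencies: they pipeline within one stage.
    pipelined = kv.mapValues(lambda v: v * 2).filter(lambda p: p[1] > 0)

    # reduceByKey is a wide dependency: it introduces a shuffle and a new stage.
    shuffled = pipelined.reduceByKey(lambda x, y: x + y)

    print(shuffled.toDebugString().decode())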
Narrow dependencies: too many to list
Wide dependencies: the operators that cause a shuffle actually fall into a few types
The first type: repartitioning operators. Custom repartitioning of an RDD causes a shuffle; the representative operators are repartition and repartitionAndSortWithinPartitions, and coalesce behaves like repartition when its shuffle parameter is set to true.
The second type: byKey operations. These aggregate over the keys in each partition, so it must be guaranteed that all values for the same key anywhere in the cluster are processed on the same node, which causes a shuffle. The representative operators are reduceByKey, groupByKey, sortByKey, combineByKey, aggregateByKey, sortBy, takeOrdered, etc.
The third type: join operations, such as join, cogroup, etc. cogroup extracts the values for the same key from both RDDs and forms two iterables, one holding the values from the first RDD and the other holding the values from the second, then produces a tuple (key, (value1, value2)), where the key is the original key and each value* is the iterable of that RDD's values. The join operator is effectively the pairwise combination of value1 and value2: assuming the value1 iterable holds (1, 2) and the value2 iterable holds (2, 3), the value pairs in the join result are (1, 2), (1, 3), (2, 2), (2, 3). The sketch below reproduces this example.
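A sketch reproducing this worked example, with one key whose values are (1, 2) on the left and (2, 3) on the right (the sorting is only there to make the output deterministic):

    left = sc.parallelize([("k", 1), ("k", 2)])
    right = sc.parallelize([("k", 2), ("k", 3)])

    # cogroup: one iterable of values per side, under the shared key.
    grouped = left.cogroup(right).mapValues(lambda vs: (sorted(vs[0]), sorted(vs[1])))
    print(grouped.collect())  # [('k', ([1, 2], [2, 3]))]

    # join: the pairwise combinations of the two value iterables.
    print(sorted(left.join(right).collect()))
    # [('k', (1, 2)), ('k', (1, 3)), ('k', (2, 2)), ('k', (2, 3))]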