1. Transformation operators / action operators
Spark operators can be divided into transformation (also called conversion) operators and action operators. An operator that does not trigger job submission and only performs intermediate processing is a transformation operator; conversely, an operator that, when the RDD is evaluated, triggers the SparkContext to submit a job is an action operator. Transformation operators execute lazily, while an action immediately executes the current operator together with all the operators leading up to it. Why does Spark adopt this approach? Spark does not run the chain of lazy operations right away; instead it builds a DAG (directed acyclic graph) from the execution plan, and only when an action operator is encountered is the DAG actually executed. The purpose is that pipeline optimization can be performed during DAG execution (pipeline optimization is described in a later section) to reduce shuffles, instead of executing and materializing the operators one by one, which improves Spark's execution efficiency. So which operators are actions and which are transformations? A quick sketch of the laziness itself comes first; the categories follow.
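As a minimal illustration of this laziness, assuming a local PySpark setup (the variable names and sample data here are invented for the sketch):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "lazy-demo")
    nums = sc.parallelize([1, 2, 3, 4])

    # Transformations: nothing executes yet; Spark only records the lineage (DAG).
    doubled = nums.map(lambda x: x * 2)
    evens = doubled.filter(lambda x: x % 4 == 0)

    # Action: only now does the SparkContext submit a job and run the DAG.
    print(evens.collect())  # [4, 8]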
1) Transformation operators on value-type data (a short sketch follows this list):
One-to-one type: map / flatMap / mapPartitions / glom (glom converts the elements of type T in each partition into an Array[T])
Many-to-one type: union (concatenates without deduplicating) / cartesian. These two operators are somewhat like their counterparts in MySQL: the resulting number of partitions is the sum (union) or the product (cartesian) of the two input RDDs' partition counts, and content-wise one accumulates elements while the other forms the product. cartesian turns plain RDDs into a pair RDD, which is very memory-hungry and should be used with caution. intersection: takes the intersection of the two RDDs and deduplicates.
Many-to-many type: the groupBy family, for example groupByKey, which redistributes data from multiple input partitions to multiple output partitions according to the key.
Output-partition-is-a-subset-of-the-input-partition type: filter (filtering) / distinct (deduplication) / subtract (rdd1.subtract(rdd2) keeps the elements that appear in rdd1 but not in rdd2) / sample and takeSample (takeSample returns an Array rather than an RDD, which makes it a bit different from the others).
Cache type: the cache / persist operators
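A quick sketch of these value-type transformations, reusing the sc from the sketch above (the sample data is invented, and the order of results after a shuffle may vary):

    rdd1 = sc.parallelize([1, 2, 3], 2)
    rdd2 = sc.parallelize([3, 4], 1)

    print(rdd1.map(lambda x: x * 10).collect())    # one-to-one: [10, 20, 30]
    print(rdd1.glom().collect())                   # elements per partition: [[1], [2, 3]]
    print(rdd1.union(rdd2).collect())              # no dedup: [1, 2, 3, 3, 4]
    print(rdd1.union(rdd2).getNumPartitions())     # 2 + 1 = 3 partitions
    print(rdd1.cartesian(rdd2).count())            # 3 * 2 = 6 pairs
    print(rdd1.intersection(rdd2).collect())       # deduplicated: [3]
    print(rdd1.filter(lambda x: x > 1).collect())  # output is a subset of input: [2, 3]
    print(rdd1.subtract(rdd2).collect())           # in rdd1 but not in rdd2: [1, 2]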
2) Transformation operators on key-value data (a short sketch follows this list):
One-to-one type: mapValues operator
Aggregation within a single RDD: reduceByKey / combineByKey (see https://blog.csdn.net/dapanbest/article/details/81096279) / partitionBy (repartitions the RDD by key)
Aggregation across two RDDs: the cogroup operator
Joins: the join operator / the leftOuterJoin and rightOuterJoin operators
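A corresponding sketch for the key-value transformations (again with invented pair data):

    pairs1 = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    pairs2 = sc.parallelize([("a", 10), ("c", 30)])

    print(pairs1.mapValues(lambda v: v + 1).collect())       # one-to-one on the values
    print(pairs1.reduceByKey(lambda x, y: x + y).collect())  # single-RDD aggregation: ("a", 4), ("b", 2)
    print(pairs1.join(pairs2).collect())                     # inner join: ("a", (1, 10)), ("a", (3, 10))
    print(pairs1.leftOuterJoin(pairs2).collect())            # keeps ("b", (2, None))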
3) Action operators (a short sketch follows this list)
No output: foreach operator
HDFS: saveAsTextFile (saves partition by partition in HDFS text style) / saveAsObjectFile (serializes the RDD's elements into objects and saves them)
Scala collections and values: collect
collectAsMap (applies to K-V elements; if a key appears more than once, the later value overwrites the earlier one)
reduce
lookup (applies to K-V elements: lookup(K) first checks whether the RDD has a partitioner; if it does, only the partition that holds K is searched, otherwise all partitions are scanned for the result)
count / top
reduce / fold: num = sc.parallelize([1,2,3,4]); sum = num.reduce(lambda x, y: x + y); num.fold(0, lambda x, y: x + y). Both require the elements to be of the same type; the difference is that fold also takes an initial value.
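A short sketch of these actions (PySpark; the data is invented, and note how the later value for "a" wins in collectAsMap):

    num = sc.parallelize([1, 2, 3, 4])
    kv = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

    print(num.reduce(lambda x, y: x + y))   # 10
    print(num.fold(0, lambda x, y: x + y))  # 10; 0 is the initial value
    print(num.count())                      # 4
    print(num.top(2))                       # [4, 3]
    print(kv.collectAsMap())                # {'a': 2, 'b': 3} -- later value overwrote ('a', 1)
    print(kv.lookup("a"))                   # [1, 2]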
2. Narrow dependency / wide dependency
The execution of the entire DAG is really the process of transforming one RDD into another, and in this process a relationship forms between a parent RDD and its child RDD: the child RDD's dependency on the parent. This dependency is realized by the child RDD recording its lineage back to the parent. Dependencies between RDDs fall into narrow dependencies and wide dependencies. Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD, that is, one-to-one or many-to-one. Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD, that is, one-to-many. Narrow dependencies can avoid shuffle, while wide dependencies require it, so narrow dependencies are also more efficient. Therefore, if the DAG contains a run of consecutive narrow dependencies, they can be fused and executed back to back without an intermediate shuffle, greatly improving efficiency; this optimization is called pipeline optimization. The key to improving performance is reducing shuffle. The sketch below shows how to inspect a lineage for its shuffle boundary; after that, let's sort out which operators are narrow dependencies and which are wide.
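A sketch of inspecting a lineage, reusing the sc from the earlier sketches (in the toDebugString output, each extra level of indentation marks a new stage, i.e. a shuffle boundary):

    kv = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

    # mapValues and filter are narrow dependencies: they pipeline within one stage.
    pipelined = kv.mapValues(lambda v: v * 2).filter(lambda p: p[1] > 0)

    # reduceByKey is a wide dependency: it introduces a shuffle and a new stage.
    shuffled = pipelined.reduceByKey(lambda x, y: x + y)

    print(shuffled.toDebugString().decode())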
Narrow dependencies: too many to list
Wide dependencies: the operators that cause a shuffle actually fall into a few types
The first type: repartitioning operators. Custom repartitioning of an RDD causes a shuffle; the representative operators are repartition and repartitionAndSortWithinPartitions, and coalesce behaves like repartition when its shuffle parameter is set to true.
The second type: byKey operations. These aggregate over the keys in each partition, so it must be guaranteed that all values for the same key anywhere in the cluster are processed on the same node, which causes a shuffle. The representative operators are reduceByKey, groupByKey, sortByKey, combineByKey, aggregateByKey, sortBy, takeOrdered, etc.
The third type: join operations, such as join, cogroup, etc. cogroup extracts the values for the same key from both RDDs and forms two iterables, one holding the values from the first RDD and the other holding the values from the second, then produces a tuple (key, (value1, value2)), where the key is the original key and each value* is the iterable of that RDD's values. The join operator is effectively the pairwise combination of value1 and value2: assuming the value1 iterable holds (1, 2) and the value2 iterable holds (2, 3), the value pairs in the join result are (1, 2), (1, 3), (2, 2), (2, 3). The sketch below reproduces this example.
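A sketch reproducing this worked example, with one key whose values are (1, 2) on the left and (2, 3) on the right (the sorting is only there to make the output deterministic):

    left = sc.parallelize([("k", 1), ("k", 2)])
    right = sc.parallelize([("k", 2), ("k", 3)])

    # cogroup: one iterable of values per side, under the shared key.
    grouped = left.cogroup(right).mapValues(lambda vs: (sorted(vs[0]), sorted(vs[1])))
    print(grouped.collect())  # [('k', ([1, 2], [2, 3]))]

    # join: the pairwise combinations of the two value iterables.
    print(sorted(left.join(right).collect()))
    # [('k', (1, 2)), ('k', (1, 3)), ('k', (2, 2)), ('k', (2, 3))]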