Today, let's talk about the DAG in Spark and the contents of the RDD.
1. DAG (directed acyclic graph): it has direction and no closed loops, and it represents the flow of data. The boundary of a DAG is the execution of an action method.
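The point that the action marks the boundary can be illustrated with a toy (not Spark's actual API): transformations only record the plan, and nothing runs until an action such as collect() executes the recorded chain.

```python
# Hypothetical toy illustrating lazy evaluation: transformations only
# record the plan (the DAG); an action such as collect() runs it.
class ToyRDD:
    def __init__(self, data, plan=()):
        self.data, self.plan = data, plan

    def map(self, f):
        # Transformation: returns a new ToyRDD with an extended plan;
        # no computation happens here.
        return ToyRDD(self.data, self.plan + (f,))

    def collect(self):
        # Action: only now is the recorded chain actually executed.
        out = self.data
        for f in self.plan:
            out = [f(x) for x in out]
        return out

rdd = ToyRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
print(rdd.collect())  # [4, 6, 8]
```

Real Spark RDDs behave the same way: calling map builds up lineage, and the job is only submitted when an action fires.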
2. How a DAG is divided into stages. The basis for the split is the wide dependency: wherever a wide dependency appears (a shuffle, i.e. data is transferred over the network), the DAG is cut. A word count therefore has two stages: one before reduceByKey, and one from reduceByKey onward (Figure 1). When the upstream data is about to be submitted, we can think of a stage as being submitted, but strictly speaking what is submitted is a task set (a collection of tasks) whose tasks share the same business logic but process different data.
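The cutting rule can be sketched with a small helper (hypothetical, not Spark's DAGScheduler): walk the chain of operators and start a new stage at every wide dependency.

```python
# Hypothetical sketch: split a linear chain of transformations into
# stages at shuffle (wide-dependency) boundaries, as happens for the
# word count example above.
WIDE_OPS = frozenset({"reduceByKey", "groupByKey", "join"})

def split_into_stages(transformations, wide=WIDE_OPS):
    stages, current = [], []
    for op in transformations:
        if op in wide:
            # A wide dependency ends the current stage; the shuffle
            # sits between the two stages.
            stages.append(current)
            current = [op]
        else:
            current.append(op)
    stages.append(current)
    return stages

wordcount = ["textFile", "flatMap", "map", "reduceByKey", "saveAsTextFile"]
print(split_into_stages(wordcount))
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'saveAsTextFile']]
```

The output matches the description above: everything before reduceByKey is one stage, and reduceByKey onward is the second.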
3. Process
Building RDDs forms a DAG. When an action is encountered, the stages are submitted one by one: an upstream stage must finish before its output is handed to the downstream stage. The submission goes to the TaskScheduler. When the action method fires, the Master decides which Workers will execute the job, but when the tasks are actually dispatched it is the Driver that sends them to the Workers (more precisely, to the Executors inside the Workers, which run the real business logic). Once an Executor on a Worker has started, it has little further interaction with the Master.
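The "task set" idea can be made concrete with a toy model (hypothetical names, not Spark's internals): submitting a stage produces one task per partition, every task carrying the same logic but a different slice of the data.

```python
# Hypothetical sketch of a stage submitted as a task set: every task
# pairs the stage's shared business logic with one data partition.
def submit_stage(logic, partitions):
    # The "task set": one task per partition, identical logic in each.
    return [(logic, p) for p in partitions]

def run_on_executor(task):
    # An executor applies the task's logic to its own partition only.
    logic, partition = task
    return [logic(x) for x in partition]

partitions = [[1, 2], [3, 4], [5]]
task_set = submit_stage(lambda x: x * 10, partitions)
results = [run_on_executor(t) for t in task_set]
print(results)  # [[10, 20], [30, 40], [50]]
```

This is why the text says the tasks "have the same business logic but process different data": the logic is fixed per stage, and only the partition varies.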
4. Narrow dependency
The relationship between an RDD and the parent RDD(s) it depends on comes in two types: narrow dependency and wide dependency.
In a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD; it is like a function, where each y corresponds to exactly one x. In a wide dependency, a partition of the parent RDD is used by multiple partitions of the child RDD, like a relation where one y corresponds to multiple x values.
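The distinction can be checked mechanically. In the hypothetical sketch below, a dependency is described as a map from each parent partition to the child partitions that read it; it is narrow exactly when every parent partition feeds at most one child.

```python
# Hypothetical check: a dependency is narrow when every parent
# partition is read by at most one child partition; otherwise it is
# wide and requires a shuffle.
def is_narrow(parent_to_children):
    return all(len(children) <= 1 for children in parent_to_children.values())

# map/filter: each parent partition feeds exactly one child partition
map_dep = {0: [0], 1: [1], 2: [2]}
# groupByKey: each parent partition scatters records to many children
shuffle_dep = {0: [0, 1], 1: [0, 1], 2: [0, 1]}

print(is_narrow(map_dep))      # True
print(is_narrow(shuffle_dep))  # False
```

Operations like map, filter, and union are narrow in this sense; groupByKey and reduceByKey are wide because each parent partition contributes records to many child partitions.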
5. Generation of the DAG
A DAG (directed acyclic graph) is formed by applying a series of transformations to the original RDDs. The DAG is divided into different stages according to the dependencies between the RDDs. For narrow dependencies, the partition transformations are pipelined within a single stage; for wide dependencies, because of the shuffle, the parent RDD's processing must be completed before the next computation can begin, so wide dependencies are the basis for dividing stages.
We generally treat join as a wide dependency, but when the two inputs are already partitioned the same way (co-partitioned), the join can be treated as a narrow dependency.
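The co-partitioned case can be sketched with plain Python (hypothetical helpers, not Spark's join): if both inputs were hash-partitioned with the same partitioner, child partition i only needs partition i of each parent, so the join is narrow and no shuffle is required.

```python
# Hypothetical sketch: a join over two inputs that share the same hash
# partitioner is narrow, because matching keys already live in
# partitions with the same index.
def hash_partition(pairs, n):
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n].append((k, v))
    return parts

def copartitioned_join(parts_a, parts_b):
    out = []
    # zip pairs partition i with partition i: a one-to-one (narrow)
    # dependency, with no data movement across partitions.
    for pa, pb in zip(parts_a, parts_b):
        lookup = {}
        for k, v in pb:
            lookup.setdefault(k, []).append(v)
        out.append([(k, (v, w)) for k, v in pa for w in lookup.get(k, [])])
    return out

a = hash_partition([("a", 1), ("b", 2)], 2)
b = hash_partition([("a", 10), ("b", 20)], 2)
flat = sorted(p for part in copartitioned_join(a, b) for p in part)
print(flat)  # [('a', (1, 10)), ('b', (2, 20))]
```

If the inputs had been partitioned differently, keys could land in mismatched partitions and a shuffle would be unavoidable, which is why the general case of join is wide.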