The RDD and DAG in Spark

Source: Internet
Author: User
Tags: shuffle

Today, let's talk about the DAG in Spark and how it relates to RDDs.

1. DAG (directed acyclic graph): it has direction and no closed loops, and it represents the flow of data. The boundary of a DAG is set by the execution of an action method.


2. How stages are divided in a DAG. The basis for cutting a stage is a wide dependency, i.e. a shuffle (when data must be transmitted over the network). A wordcount job therefore has two stages: one before reduceByKey and one after reduceByKey (Figure 1). When the upstream data is about to be committed, we can think of the stage as being submitted, but strictly speaking we are submitting a task set (a collection of tasks) whose tasks share the same business logic while processing different pieces of the data.
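As a rough illustration of the idea above, here is a toy sketch that cuts a linear lineage into stages at each wide dependency. This is not Spark's actual DAGScheduler; the `lineage` data and `split_stages` function are invented for this example:

```python
# A linear lineage for a wordcount-style job: each op is
# (name, is_wide_dependency). Only reduceByKey shuffles.
lineage = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),   # wide: data crosses the network
    ("saveAsTextFile", False),
]

def split_stages(ops):
    """Start a new stage at every wide (shuffle) dependency."""
    stages, current = [], []
    for name, wide in ops:
        if wide and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

stages = split_stages(lineage)
print(stages)
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'saveAsTextFile']]
```

As in the wordcount example, the lineage splits into exactly two stages at reduceByKey.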

3. Process

Building RDDs forms a DAG. When an action is encountered, the earlier stage is submitted first, and only when its submission completes is its output handed to the downstream stage; the task sets are then handed to the TaskScheduler. When the action method is triggered, the master decides which workers will execute the tasks, but when the tasks are actually dispatched it is the driver that passes them to the workers (in fact, they are passed to the executors inside the workers, and the executors run the real business logic). Once an executor inside a worker has started, it no longer has much to do with the master.
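To illustrate that nothing runs until the action fires, here is a minimal, invented `ToyRDD` class (not Spark's API) in which transformations only record lineage and `collect` triggers the actual computation:

```python
# Minimal sketch of lazy evaluation: transformations only record
# lineage; nothing is computed until an action is called.
class ToyRDD:
    def __init__(self, data, fn=None, parent=None):
        self.data, self.fn, self.parent = data, fn, parent

    def map(self, fn):
        # Transformation: return a new node in the lineage, compute nothing.
        return ToyRDD(None, fn, self)

    def collect(self):
        # Action: walk the lineage back to the source and actually run it.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10)
# At this point nothing has been computed; collect() runs the whole lineage.
print(rdd.collect())   # [10, 20, 30]
```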

4. Narrow and wide dependencies

The dependencies between an RDD and the parent RDD(s) it relies on come in two different types: narrow dependency and wide dependency.


The distinction is based on partitioning. If each partition of the parent RDD is used by at most one partition of the child RDD, that is a narrow dependency; it is like a function in which each y corresponds to exactly one x. A wide dependency is the opposite: a partition of the parent RDD is consumed by multiple partitions of the child RDD, which is like one y corresponding to multiple x values.
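The partition-level difference can be sketched in plain Python (the function names here are made up for illustration, not Spark's API): a narrow operation like map keeps child partition i dependent only on parent partition i, while a groupByKey-style wide operation must read every parent partition:

```python
# Two parent partitions of (key, value) pairs.
parent = [[("a", 1), ("b", 1)], [("a", 2), ("c", 3)]]

def narrow_map(partitions, fn):
    # Narrow: child partition i is built only from parent partition i.
    return [[fn(kv) for kv in part] for part in partitions]

def wide_group_by_key(partitions, num_out=2):
    # Wide: every output partition may need records from EVERY parent
    # partition, so records are redistributed (shuffled) by key.
    out = [dict() for _ in range(num_out)]
    for part in partitions:            # must read all parent partitions
        for k, v in part:
            out[hash(k) % num_out].setdefault(k, []).append(v)
    return out

mapped = narrow_map(parent, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(parent)
print(mapped)   # [[('a', 2), ('b', 2)], [('a', 4), ('c', 6)]]
```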

5. Generation of the DAG

A DAG (directed acyclic graph) is formed by applying a series of transformations to the original RDDs. The DAG is divided into different stages according to the dependencies between the RDDs: for a narrow dependency, the partition transformations are handled inside a single stage; for a wide dependency, because of the shuffle, the next computation can only begin after the parent RDD has been fully processed, so wide dependencies are the basis for dividing stages.

We generally think of join as a wide dependency, but for a join whose inputs are already partitioned appropriately (co-partitioned with the same partitioner), the join can be regarded as a narrow dependency.
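Here is a toy sketch of why the co-partitioned case is narrow (the helper names are invented, not Spark's API): when both sides were partitioned with the same partitioner, output partition i needs only input partition i from each side, so no shuffle is required:

```python
def partition_by(pairs, num_parts, part_fn):
    # Split (key, value) pairs into partitions using a shared partitioner.
    parts = [[] for _ in range(num_parts)]
    for k, v in pairs:
        parts[part_fn(k)].append((k, v))
    return parts

part_fn = lambda k: k % 2               # the shared "partitioner"

left = partition_by([(1, "a"), (2, "b"), (3, "c")], 2, part_fn)
right = partition_by([(1, "x"), (3, "y")], 2, part_fn)

# Both sides used the same partitioner, so any key lands in the same
# partition index on each side: joining partition i with partition i
# is enough -- a narrow, shuffle-free join.
joined = []
for lp, rp in zip(left, right):
    rmap = dict(rp)
    joined += [(k, (v, rmap[k])) for k, v in lp if k in rmap]

print(joined)   # [(1, ('a', 'x')), (3, ('c', 'y'))]
```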
