Today, let's talk about the DAG in Spark and the contents of the RDD.
1. DAG (directed acyclic graph): it has direction and no closed loops, and it represents the flow of data. The boundary of a DAG is the execution of an action method.
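The point that the action marks the boundary can be illustrated with a toy (not Spark's actual API): transformations only record the plan, and nothing runs until an action such as collect() executes the recorded chain.

```python
# Hypothetical toy illustrating lazy evaluation: transformations only
# record the plan (the DAG); an action such as collect() runs it.
class ToyRDD:
    def __init__(self, data, plan=()):
        self.data, self.plan = data, plan

    def map(self, f):
        # Transformation: returns a new ToyRDD with an extended plan;
        # no computation happens here.
        return ToyRDD(self.data, self.plan + (f,))

    def collect(self):
        # Action: only now is the recorded chain actually executed.
        out = self.data
        for f in self.plan:
            out = [f(x) for x in out]
        return out

rdd = ToyRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
print(rdd.collect())  # [4, 6, 8]
```

Real Spark RDDs behave the same way: calling map builds up lineage, and the job is only submitted when an action fires.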
2. How a DAG is divided into stages. The basis for the split is the wide dependency: wherever a wide dependency appears (a shuffle, i.e. data is transferred over the network), the DAG is cut. A word count therefore has two stages: one before reduceByKey, and one from reduceByKey onward (Figure 1). When the upstream data is about to be submitted, we can think of a stage as being submitted, but strictly speaking what is submitted is a task set (a collection of tasks) whose tasks share the same business logic but process different data.
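The cutting rule can be sketched with a small helper (hypothetical, not Spark's DAGScheduler): walk the chain of operators and start a new stage at every wide dependency.

```python
# Hypothetical sketch: split a linear chain of transformations into
# stages at shuffle (wide-dependency) boundaries, as happens for the
# word count example above.
WIDE_OPS = frozenset({"reduceByKey", "groupByKey", "join"})

def split_into_stages(transformations, wide=WIDE_OPS):
    stages, current = [], []
    for op in transformations:
        if op in wide:
            # A wide dependency ends the current stage; the shuffle
            # sits between the two stages.
            stages.append(current)
            current = [op]
        else:
            current.append(op)
    stages.append(current)
    return stages

wordcount = ["textFile", "flatMap", "map", "reduceByKey", "saveAsTextFile"]
print(split_into_stages(wordcount))
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'saveAsTextFile']]
```

The output matches the description above: everything before reduceByKey is one stage, and reduceByKey onward is the second.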
3. Process
Building RDDs forms a DAG. When an action is encountered, the stages are submitted one by one: an upstream stage must finish before its output is handed to the downstream stage. The submission goes to the TaskScheduler. When the action method fires, the Master decides which Workers will execute the job, but when the tasks are actually dispatched it is the Driver that sends them to the Workers (more precisely, to the Executors inside the Workers, which run the real business logic). Once an Executor on a Worker has started, it has little further interaction with the Master.
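The "task set" idea can be made concrete with a toy model (hypothetical names, not Spark's internals): submitting a stage produces one task per partition, every task carrying the same logic but a different slice of the data.

```python
# Hypothetical sketch of a stage submitted as a task set: every task
# pairs the stage's shared business logic with one data partition.
def submit_stage(logic, partitions):
    # The "task set": one task per partition, identical logic in each.
    return [(logic, p) for p in partitions]

def run_on_executor(task):
    # An executor applies the task's logic to its own partition only.
    logic, partition = task
    return [logic(x) for x in partition]

partitions = [[1, 2], [3, 4], [5]]
task_set = submit_stage(lambda x: x * 10, partitions)
results = [run_on_executor(t) for t in task_set]
print(results)  # [[10, 20], [30, 40], [50]]
```

This is why the text says the tasks "have the same business logic but process different data": the logic is fixed per stage, and only the partition varies.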
4. Narrow dependency
The relationship between an RDD and the parent RDD(s) it depends on comes in two types: narrow dependency and wide dependency.
In a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD; it is like a function, where each y corresponds to exactly one x. In a wide dependency, a partition of the parent RDD is used by multiple partitions of the child RDD, like a relation where one y corresponds to multiple x values.
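The distinction can be checked mechanically. In the hypothetical sketch below, a dependency is described as a map from each parent partition to the child partitions that read it; it is narrow exactly when every parent partition feeds at most one child.

```python
# Hypothetical check: a dependency is narrow when every parent
# partition is read by at most one child partition; otherwise it is
# wide and requires a shuffle.
def is_narrow(parent_to_children):
    return all(len(children) <= 1 for children in parent_to_children.values())

# map/filter: each parent partition feeds exactly one child partition
map_dep = {0: [0], 1: [1], 2: [2]}
# groupByKey: each parent partition scatters records to many children
shuffle_dep = {0: [0, 1], 1: [0, 1], 2: [0, 1]}

print(is_narrow(map_dep))      # True
print(is_narrow(shuffle_dep))  # False
```

Operations like map, filter, and union are narrow in this sense; groupByKey and reduceByKey are wide because each parent partition contributes records to many child partitions.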
5. Generation of the DAG
A DAG (directed acyclic graph) is formed by applying a series of transformations to the original RDDs. The DAG is divided into different stages according to the dependencies between the RDDs. For narrow dependencies, the partition transformations are pipelined within a single stage; for wide dependencies, because of the shuffle, the parent RDD's processing must be completed before the next computation can begin, so wide dependencies are the basis for dividing stages.
We generally treat join as a wide dependency, but when the two inputs are already partitioned the same way (co-partitioned), the join can be treated as a narrow dependency.
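The co-partitioned case can be sketched with plain Python (hypothetical helpers, not Spark's join): if both inputs were hash-partitioned with the same partitioner, child partition i only needs partition i of each parent, so the join is narrow and no shuffle is required.

```python
# Hypothetical sketch: a join over two inputs that share the same hash
# partitioner is narrow, because matching keys already live in
# partitions with the same index.
def hash_partition(pairs, n):
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n].append((k, v))
    return parts

def copartitioned_join(parts_a, parts_b):
    out = []
    # zip pairs partition i with partition i: a one-to-one (narrow)
    # dependency, with no data movement across partitions.
    for pa, pb in zip(parts_a, parts_b):
        lookup = {}
        for k, v in pb:
            lookup.setdefault(k, []).append(v)
        out.append([(k, (v, w)) for k, v in pa for w in lookup.get(k, [])])
    return out

a = hash_partition([("a", 1), ("b", 2)], 2)
b = hash_partition([("a", 10), ("b", 20)], 2)
flat = sorted(p for part in copartitioned_join(a, b) for p in part)
print(flat)  # [('a', (1, 10)), ('b', (2, 20))]
```

If the inputs had been partitioned differently, keys could land in mismatched partitions and a shuffle would be unavoidable, which is why the general case of join is wide.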