Spark Streaming source code interpretation: a thorough study and reflection on the RDD generation life cycle

Source: Internet
Author: User
Tags: spark, rdd

Contents of this issue:

    • A thorough study of the relationship between DStream and RDD
    • A thorough study of how the RDD is generated in Spark Streaming

  

The questions raised:

1. How is the RDD generated, and what does its generation depend on?

2. Is its execution any different from an RDD on Spark Core?

3. How is it dealt with after the run completes?

Why there is a 3rd question: because Spark Streaming has its own trigger conditions; a window, for example, keeps producing RDDs as it slides.

Considered at the most basic level, the RDD is still an ordinary object. If an RDD is produced every second, memory cannot possibly hold them all, so how are they managed after each batch's processing completes?
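Before tracing the source code, it helps to keep a concrete program in mind. The following is a minimal sketch of a Spark Streaming application of the kind this article dissects; the host, port, and durations are arbitrary placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLifeCycleDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingLifeCycleDemo")
    // Batch interval of 1 second: each DStream template stamps out one RDD per second.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999) // the first DStream: from the data source
    val words = lines.flatMap(_.split(" ")).map((_, 1))
    // A 10-second window sliding every 2 seconds keeps producing new RDDs as it slides.
    val counts = words.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(2))
    counts.print() // an output operation: internally this creates a ForEachDStream

    ssc.start()
    ssc.awaitTermination()
  }
}
```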

  

First, the InputDStream flow of an entire Spark Streaming application, traced through the source code.

  There are two ways a ForEachDStream is produced:

1. One is a DStream action: this both produces the job and executes it.

2. foreachRDD also produces a ForEachDStream; if there is no action-level operation inside foreachRDD, the job will not be executed.

So a ForEachDStream does not necessarily trigger job execution. Saying that it is what triggers job creation is also false, because creating a job requires the framework timer's time together with the business logic code.

  The relationship between ForEachDStream and the job:

1. A ForEachDStream is not actually tied to job execution; it does not necessarily trigger the execution of a job.

2. "Whenever there is a ForEachDStream, a job is produced" is false as well: even without a ForEachDStream, jobs keep being produced.

Job generation has nothing to do with the business logic code; it depends only on the framework's scheduling: the framework's timer generates a job at every batch interval.
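As a toy model (this is not Spark's actual code), the sketch below illustrates the point: a recurring timer, not the business logic, is what emits the "generate jobs" event at every batch interval. In Spark this corresponds to the JobGenerator posting a GenerateJobs(time) event to its event loop:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Toy model of the framework timer: every batch interval it fires and emits
// a "generate jobs" event, regardless of what the business logic contains.
object TimerDrivenJobGeneration {
  def main(args: Array[String]): Unit = {
    val batchIntervalMs = 1000L
    val timer = Executors.newSingleThreadScheduledExecutor()
    timer.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // In Spark: post GenerateJobs(time), which leads to graph.generateJobs(time).
        println(s"GenerateJobs event for batch time ${System.currentTimeMillis()}")
      }
    }, 0L, batchIntervalMs, TimeUnit.MILLISECONDS)

    Thread.sleep(5000L) // let a few batch intervals elapse
    timer.shutdown()
  }
}
```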

  

  

  foreachRDD is the "back door" of Spark Streaming, because it operates on the RDD directly; although it is encapsulated as a ForEachDStream, it really manipulates the RDD inside the stream, while still generating a DStream itself. In Spark Streaming's logical operations, what we appear to operate on are DStreams, but operating on a DStream is in fact operating on RDDs: a DStream is a template for a set of RDDs, and each DStream behind depends on the DStream in front of it.
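A short sketch of what this looks like in user code (writeToSink and the println sink are illustrative stand-ins): the map inside foreachRDD is lazy and runs nothing by itself; only the action at the end triggers execution of a job for the batch:

```scala
import org.apache.spark.streaming.dstream.DStream

// Sketch: foreachRDD hands us each batch's RDD directly -- the "back door".
def writeToSink(counts: DStream[(String, Int)]): Unit = {
  counts.foreachRDD { rdd =>
    val formatted = rdd.map { case (word, n) => s"$word -> $n" } // lazy: no job yet
    // Without an action-level operation here, nothing would execute for this batch.
    formatted.foreachPartition { partition =>
      partition.foreach(println) // stand-in for writing to an external store
    }
  }
}
```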

Why does the DStream behind depend on the DStream in front of it? Every DStream depends on other DStreams, except the first one, which is generated directly by the data source.
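The relevant declarations, paraphrased and abridged here from Spark's DStream.scala (a sketch, not an exact quote), look like this:

```scala
import scala.collection.mutable.HashMap
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}

// Sketch of org.apache.spark.streaming.dstream.DStream, abridged: every
// concrete DStream declares its parents and how to build one RDD for a
// given batch time; the RDDs it has generated are cached, keyed by time.
abstract class DStream[T](ssc: StreamingContext) extends Serializable {
  /** Interval at which this DStream generates an RDD. */
  def slideDuration: Duration
  /** The list of parent DStreams on which this DStream depends. */
  def dependencies: List[DStream[_]]
  /** Generates the RDD for the given batch time. */
  def compute(validTime: Time): Option[RDD[T]]
  /** RDDs already generated, keyed by batch time: the template's output. */
  protected var generatedRDDs = new HashMap[Time, RDD[T]]()
}
```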

From how the DStream produces its output: the DStream is a template for RDDs, and the RDDs are generated from it as a function of time.

To study how the RDD is generated, look at the whole chain of DStream operations: somewhere there must be a place that triggers the creation of the RDD. We follow the source code to trace how the RDD is generated.

  

  The life cycle of the RDD: everything is back-to-front dependent; each step produces a DStream instance, and the DStream is the template for the RDD.

  Why does the DStream depend on the front from the back? The DStream must be backward-to-forward dependent, for three purposes:

1. It represents the business logic operations at the Spark Streaming level.

2. Its purpose is to generate RDDs accordingly, and the RDD itself depends from back to front.

3. The DStream is lazy, and that laziness is built on the back-to-front dependency.

The most important reason is the 2nd: the dependencies among DStreams must be kept strictly consistent with the dependencies among the RDDs, since the RDDs are generated from them at each time interval.

  

  

  Process summary:

Understood at the generation level, each job corresponds to an RDD, namely the last RDD of the DStream chain; the last RDD depends on the ones in front of it, so from the last RDD alone all the RDDs can be derived.

Every DStream instance has a generatedRDDs member, a HashMap keyed by time. In practice we only need to focus on the last one, because the actual computation is pushed forward from the back.

Logical level: there is one DStream object after another; operations such as map produce new DStream objects, and each DStream template produces a series of RDDs over time: as each time instant arrives and is injected, an RDD is produced.

Actual execution: a Spark Streaming job looks at the last DStream and traces the RDD dependencies from back to front; the result is equivalent to a matrix, the RDD lineage plus a time dimension.

  

How generatedRDDs is obtained:

There is a getOrCompute method in DStream that generates the RDD for a given time; the RDD may be obtained from the cache (at the configured storage level), or computed.
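Continuing the sketch of the DStream class above, here is getOrCompute, paraphrased and abridged from the Spark source (details such as time validation and checkpointing are simplified away):

```scala
// Sketch of DStream.getOrCompute, abridged: the RDD for a batch time is
// either fetched from the generatedRDDs cache or computed and then cached.
final def getOrCompute(time: Time): Option[RDD[T]] = {
  generatedRDDs.get(time).orElse {
    val rddOption = compute(time)       // delegate to the concrete DStream
    rddOption.foreach { newRDD =>
      // (The real code also applies the storage level and checkpointing here.)
      generatedRDDs.put(time, newRDD)   // cache it for this batch time
    }
    rddOption
  }
}
```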

  

If a DStream has no dependencies, it must be self-reliant and produce its RDD by itself, directly from the data source:
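A receiver-based input DStream is the typical case; the following is paraphrased from Spark's ReceiverInputDStream (a sketch, and blockIdsFor is a hypothetical stand-in for the real block lookup):

```scala
// The head of the chain has no parent DStreams, so it builds its RDDs
// itself, from the blocks the receiver has stored for this batch time.
override def dependencies: List[DStream[_]] = List() // self-reliant

override def compute(validTime: Time): Option[RDD[T]] = {
  val blockIds = blockIdsFor(validTime) // hypothetical: blocks received for this batch
  Some(new BlockRDD[T](ssc.sparkContext, blockIds))
}
```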

  

The map operation on a DStream carries the dependency (on its parent), and getOrCompute produces the RDD. What look like many different DStreams are all in fact DStreams: the DStream is the logical-level presentation, and everything is pushed forward from the back.

map ultimately operates on the RDD; the computation of a DStream is really a computation over RDDs.
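This can be seen in MappedDStream, lightly abridged here from the Spark source (treat it as a sketch):

```scala
import scala.reflect.ClassTag

// MappedDStream, abridged: map on a DStream only records the parent
// dependency; computing it means fetching the parent's RDD for the batch
// time via getOrCompute and applying the RDD-level map to it.
class MappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    mapFunc: T => U
  ) extends DStream[U](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.map[U](mapFunc))
  }
}
```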

  

getOrCompute returns an RDD; the other case is ForEachDStream:
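ForEachDStream, abridged from the Spark source (a sketch): its compute returns no RDD at all; instead it overrides generateJob, wrapping the user's foreachFunc applied to the parent's RDD for this batch time into a Job object (created, but not yet executed):

```scala
class ForEachDStream[T: ClassTag](
    parent: DStream[T],
    foreachFunc: (RDD[T], Time) => Unit
  ) extends DStream[Unit](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  // No RDD of its own: this DStream exists only to produce jobs.
  override def compute(validTime: Time): Option[RDD[Unit]] = None

  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) => Some(new Job(time, () => foreachFunc(rdd, time)))
      case None => None
    }
  }
}
```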

  

generateJob is controlled by the scheduler:
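At each batch time the scheduler walks the output streams of the DStreamGraph; abridged from the Spark source (a sketch):

```scala
// DStreamGraph.generateJobs, abridged: the scheduler asks every registered
// output stream (each ForEachDStream) for a job at the given batch time.
def generateJobs(time: Time): Seq[Job] = {
  this.synchronized {
    outputStreams.flatMap { outputStream =>
      outputStream.generateJob(time) // Some(job) or None for this batch
    }
  }
}
```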

  

generateJobs calls into each output DStream, which then dispatches to that DStream's generateJob:
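The default DStream.generateJob, abridged from the Spark source (a sketch): it dispatches to getOrCompute for the batch time and, if an RDD comes back, wraps it into a Job whose body simply runs the RDD:

```scala
def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => {
        // A no-op function per partition: running the job forces the RDD.
        val emptyFunc = (iterator: Iterator[T]) => {}
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}
```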

  

  Note:

      • Source: Liaoliang (Spark release version customization course)
      • Sina Weibo: http://www.weibo.com/ilovepains

  
