Contents of this issue:
- A thorough study of the relationship between DStream and RDD
- A thorough study of how RDDs are generated in Spark Streaming
The following questions are raised:
1. How is an RDD generated, and what does its generation depend on?
2. Is its execution any different from that of an RDD in Spark Core?
3. How do we deal with the RDD after it has been processed?
Why the third question matters: Spark Streaming comes with its own trigger conditions; a window, for example, keeps producing RDDs every time it slides.
Thinking about it at the most basic level, the RDD is still the fundamental object, and a new RDD is produced every second (every batch interval); memory cannot hold them all indefinitely, so how are they managed after each batch finishes processing?
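To make the window case concrete, here is a minimal sketch (the socket source, port and durations are only illustrative, not taken from the original post): with a 1-second batch interval, the window below produces a new windowed RDD every 5 seconds as it slides.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("WindowSketch")
        val ssc = new StreamingContext(conf, Seconds(1))       // one batch, hence one RDD, per second
        val lines = ssc.socketTextStream("localhost", 9999)    // hypothetical text source
        val windowed = lines.window(Seconds(10), Seconds(5))   // a new windowed RDD every 5-second slide
        windowed.count().print()                               // output operation, registers a ForEachDStream
        ssc.start()
        ssc.awaitTermination()
      }
    }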
First, tracing the InputDStream of a running Spark Streaming application through the source code.
There are two ways a ForEachDStream is produced:
1. A DStream output operation (an action such as print), which both produces the ForEachDStream and leads to job execution.
2. foreachRDD also produces a ForEachDStream; if there is no action-level operation inside the foreachRDD body, no job work is actually executed.
The statement "a ForEachDStream does not necessarily trigger job execution, but it does trigger job creation" is false, because job creation comes from the framework timer's batch time together with the business logic code, not from the ForEachDStream itself.
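As a small sketch (assuming a DStream[String] named lines like the one in the window example above), both calls below register a ForEachDStream as an output stream, but only the first contains an RDD action, so only it does real work when its job runs:

    lines.foreachRDD { rdd =>
      rdd.foreach(println)      // action on the RDD: work is executed when the job runs
    }

    lines.foreachRDD { rdd =>
      rdd.map(_.toUpperCase)    // transformation only: lazy, nothing is ever executed
    }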
The relationship between ForEachDStream and the job:
1. ForEachDStream is not really tied to job execution; it does not necessarily trigger a job to be executed.
2. "A job is produced whenever there is a ForEachDStream" is also false; jobs keep being produced even in the absence of a ForEachDStream.
Job generation has nothing to do with the business logic code; it is purely framework scheduling: the framework's timer generates a job at each batch time.
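A minimal, self-contained sketch of that idea (purely illustrative; the real mechanism is the RecurringTimer inside JobGenerator in org.apache.spark.streaming.scheduler): the timer ticks once per batch interval, and jobs are generated for that batch time regardless of what the user code looks like.

    object TimerDrivenJobs {
      def main(args: Array[String]): Unit = {
        val batchIntervalMs = 1000L                     // illustrative 1-second batch interval
        var batchTime = System.currentTimeMillis()
        for (_ <- 1 to 3) {                             // three ticks of the "timer"
          Thread.sleep(batchIntervalMs)
          batchTime += batchIntervalMs
          // In Spark Streaming this tick posts a GenerateJobs(time) event,
          // which ends up calling DStreamGraph.generateJobs(time).
          println(s"timer tick -> generate jobs for batch time $batchTime")
        }
      }
    }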
foreachRDD is the back door of Spark Streaming into the RDD world, because it operates on the RDD directly; it is still wrapped as a ForEachDStream, so inside the stream you are manipulating the RDD directly while a DStream is still being generated. In Spark Streaming the logical operations we see are operations on DStreams, but operating on a DStream is in fact operating on RDDs: a DStream is a template for a set of RDDs, and each DStream depends on the DStream in front of it.
Why does a later DStream depend on the one before it? The source code shows the following:
Every DStream depends on other DStreams, except the first one, which is generated directly from the data source.
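A simplified sketch of how those dependencies are declared (abridged from the general shape of DStream, InputDStream and MappedDStream in the Spark Streaming source; the *Sketch names are not the real class names):

    // Every DStream declares the DStreams it depends on.
    abstract class DStreamSketch[T] {
      def dependencies: List[DStreamSketch[_]]
    }

    // The input DStream sits at the head of the chain: it has no parent,
    // because its data comes straight from the external source.
    class InputDStreamSketch[T] extends DStreamSketch[T] {
      override def dependencies: List[DStreamSketch[_]] = List()
    }

    // A transformed DStream (e.g. the result of map) depends on its parent DStream.
    class MappedDStreamSketch[T, U](parent: DStreamSketch[T]) extends DStreamSketch[U] {
      override def dependencies: List[DStreamSketch[_]] = List(parent)
    }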
From the way a DStream is produced we can see that the DStream is a template for RDDs: the concrete RDDs are generated from it as a function of time.
To study how the RDD is generated, we look at the whole chain of DStream operations; somewhere there must be a place that triggers RDD creation, so we follow the source code to track how the RDD comes into being.
The life cycle of the RDD: everything depends from back to front; each step produces a DStream instance, and the DStream is the template for the RDD.
Why does a DStream depend on the one in front of it, from back to front? A DStream must be backward-to-forward dependent, for three reasons:
1. It represents the business logic at the Spark Streaming level.
2. The purpose is to generate RDDs according to this chain, and RDDs themselves depend from back to front.
3. A DStream is lazy, and that laziness is built on top of the back-to-front dependency.
The most important reason is the second one: the dependency chain of the DStreams must stay strictly consistent with the dependency chain of the RDDs, since the RDDs are generated from it at each batch interval.
Process summary:
Looked at from the generation side, each job corresponds to one RDD, namely the last RDD of the DStream operation chain. That last RDD depends on the RDDs in front of it, so holding the last RDD is enough to derive all of them.
Every DStream instance has a generatedRDDs member, a HashMap keyed by batch time; in practice we only need to focus on the last DStream, because the actual computation is pushed from back to front.
Logical level: there is one DStream object after another; operations such as map produce new DStream objects, and the DStream template produces a series of RDDs over time; as each time instant is injected, an RDD is produced.
Actual execution: Spark Streaming looks at the last DStream and walks the RDD dependencies from back to front, which is equivalent to a matrix of RDDs with a time dimension added on top of the space dimension.
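To illustrate the logical level (assuming the same hypothetical lines stream as above), the chain below only builds DStream templates; no RDD exists until the timer injects a batch time:

    // Each operation returns a new DStream that depends on the previous one.
    // Nothing is computed here: RDDs appear only when a batch time is injected.
    val words  = lines.flatMap(_.split(" "))   // FlatMappedDStream, depends on lines
    val pairs  = words.map(w => (w, 1))        // MappedDStream, depends on words
    val counts = pairs.reduceByKey(_ + _)      // ShuffledDStream under the hood, depends on pairs
    counts.print()                             // ForEachDStream: the last DStream, registered as an output stream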
How generatedRDDs is obtained:
DStream has a getOrCompute method that produces the RDD for a given batch time; the RDD may come from the generatedRDDs cache, or be computed if it is not there yet.
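An abridged sketch of that logic (simplified from the general shape of DStream.getOrCompute; time validation, call sites, persistence and checkpointing are omitted, and DStreamCore is not the real class name):

    import scala.collection.mutable.HashMap
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    abstract class DStreamCore[T] {
      // batch time -> the RDD already generated for that batch
      private val generatedRDDs = new HashMap[Time, RDD[T]]()

      // Implemented by each concrete DStream: how to build the RDD for one batch time.
      def compute(validTime: Time): Option[RDD[T]]

      // Check the cache first; otherwise compute the RDD and remember it under its batch time.
      def getOrCompute(time: Time): Option[RDD[T]] = {
        generatedRDDs.get(time).orElse {
          val rddOption = compute(time)
          rddOption.foreach(rdd => generatedRDDs.put(time, rdd))
          rddOption
        }
      }
    }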
If a DStream has no dependency (it is the first, input DStream), it is self-reliant: it computes its RDD directly from the data source.
A mapped DStream, by contrast, has a dependency: its RDD is produced through the parent's getOrCompute. What look like many different DStreams are all just DStreams; the DStream is only the logical-level presentation, and everything is pushed forward from the back.
The map ultimately operates on the RDD, so computing a DStream is really computing the RDD.
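Building on the DStreamCore sketch above, this is roughly the shape of MappedDStream: the DStream-level map becomes an RDD-level map at each batch time (MappedDStreamOutline is not the real class name).

    import scala.reflect.ClassTag

    class MappedDStreamOutline[T: ClassTag, U: ClassTag](
        parent: DStreamCore[T],
        mapFunc: T => U) extends DStreamCore[U] {

      // Ask the parent for the RDD of this batch time, then map over that RDD.
      override def compute(validTime: Time): Option[RDD[U]] =
        parent.getOrCompute(validTime).map(_.map(mapFunc))
    }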
getOrCompute returns the RDD (wrapped in an Option); the other half of the story is ForEachDStream:
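An abridged sketch of how the output DStream turns that RDD into a job (simplified from the general shape of ForEachDStream.generateJob; Job here is the streaming scheduler's internal job class, and call-site and metadata handling are omitted):

    // For each batch time, get the parent's RDD and wrap the user's foreachFunc
    // around it as the body of a Job. Nothing runs until the scheduler executes the Job.
    override def generateJob(time: Time): Option[Job] = {
      parent.getOrCompute(time) match {
        case Some(rdd) =>
          val jobFunc = () => foreachFunc(rdd, time)   // the code passed to foreachRDD
          Some(new Job(time, jobFunc))
        case None => None
      }
    }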
generateJob is driven by the scheduler:
The JobGenerator's generateJobs call goes to the DStreamGraph, which dispatches generateJob to each output DStream:
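An abridged sketch of that dispatch (simplified from the general shape of DStreamGraph.generateJobs; synchronization and call-site bookkeeping are omitted): the timer supplies the batch time, and every registered output stream, i.e. every ForEachDStream, is asked to generate its job for that time.

    // At each batch time, ask every output stream for a job. This is why job
    // generation is driven by the framework timer rather than by the user code.
    def generateJobs(time: Time): Seq[Job] = {
      outputStreams.flatMap { outputStream =>
        outputStream.generateJob(time)    // each ForEachDStream produces at most one job
      }
    }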
Note:
- Data from: Liaoliang (Spark release version customization course)
- Sina Weibo: http://www.weibo.com/ilovepains
A thorough study and reflection on the RDD generation life cycle, from a reading of the Spark Streaming source code.