Main contents of this section:
1. A thorough study of the relationship between DStream and RDD
2. A thorough study of how RDDs are generated in Spark Streaming
Three key questions to think about for RDDs in Spark Streaming:
1. How exactly does a DStream generate RDDs?
2. What is the relationship between DStream and RDD?
3. How are the RDDs handled after the operations on them have run?
The RDD itself is the basic object, and a new RDD object is produced at every fixed time interval. As time passes these objects accumulate, and if they are not managed they will cause a memory overflow. So once the RDD operations for a batchDuration have been performed, the RDDs need to be managed.
For this reason, when studying RDDs in Spark Streaming, the full life cycle of the RDD is especially important: how it is generated, how it runs, and what happens to it after it has run.
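The memory concern above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's code: the class `ToyDStream`, its `remember_duration` parameter, and the use of plain lists in place of RDDs are all assumptions made for illustration. It keeps one "RDD" per batch time and drops the ones that fall outside a retention window.

```python
class ToyDStream:
    """Toy stand-in for a DStream: one 'RDD' (a plain list) per batch time."""

    def __init__(self, remember_duration):
        self.remember_duration = remember_duration  # hypothetical retention window
        self.generated_rdds = {}                    # batch time -> "RDD"

    def generate(self, time):
        # One "RDD" is produced per batch; the contents are made up for the demo.
        self.generated_rdds[time] = [f"record@{time}"]

    def clear_metadata(self, time):
        # Drop every "RDD" old enough to fall outside the retention window.
        cutoff = time - self.remember_duration
        for old in [t for t in self.generated_rdds if t <= cutoff]:
            del self.generated_rdds[old]


stream = ToyDStream(remember_duration=2)
for now in range(1, 6):      # five batches, one per simulated second
    stream.generate(now)
    stream.clear_metadata(now)

remaining = sorted(stream.generated_rdds)
print(remaining)
```

Without the `clear_metadata` call the dictionary would grow by one entry per batch for as long as the stream runs.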
Source code interpretation:
Tip: broadcast variables and accumulators are not as simple as they seem; in practice, very complex algorithms can be implemented with them.
Look at the logic of the code; the logic expresses an idea. For the socketTextStream call in the code above, ask yourself: how is the data input? How is it processed? Where does the data come from?
After the data is obtained, a series of transformations is applied, followed by a final foreachRDD operation.
1. With foreachRDD you define the action directly: you write the function that processes the RDD yourself.
2. From the RDD's point of view, the DStream print function is in fact translated into a foreachRDD operation that prints:
Performing an action on an RDD does not produce a new RDD. DStream corresponds to this exactly: performing an action on a DStream does not produce a new DStream.
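The relationship between print and foreachRDD can be sketched with a toy model. This is an illustration in plain Python under stated assumptions, not Spark's implementation: `ToyDStream`, `foreach_rdd` and `print_first` are hypothetical names, and a list stands in for an RDD. The point is only that the print-style operation is a thin wrapper that registers a function through the foreachRDD-style operation and returns no new stream.

```python
class ToyDStream:
    """Toy stream that only records its registered output operations."""

    def __init__(self):
        self.output_ops = []

    def foreach_rdd(self, func):
        # Registers an output operation. Nothing is returned, so no new
        # stream is produced, mirroring an action on an RDD.
        self.output_ops.append(func)

    def print_first(self, num=10):
        # A print-style operation expressed purely in terms of foreach_rdd.
        def show(rdd, time):
            print(f"Time: {time}")
            for item in rdd[:num]:
                print(item)

        return self.foreach_rdd(show)


stream = ToyDStream()
result = stream.print_first(num=2)

# Simulate the framework invoking the registered operation for one batch.
stream.output_ops[0](["a", "b", "c"], 1000)
```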
ForEachDStream is a transformation-level operation. Over the whole Spark Streaming run, ForEachDStream does not necessarily trigger job execution, but it does trigger the creation of the job.
When jobs are generated is driven by a timer, according to the business logic code; ForEachDStream has nothing to do with that timing.
1. ForEachDStream is unrelated to job execution and does not trigger it.
2. It is wrong to say that a job runs whenever ForEachDStream executes; job execution is dispatched only by the framework.
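The separation described in the two points above (registering an output operation, generating jobs on a timer, and executing them later) can be sketched as a toy model. Everything here is an assumption for illustration: `ToyStreamingContext`, `on_timer_tick` and `run_pending` are hypothetical names, the timer is simulated by direct calls, and lists stand in for RDDs.

```python
class ToyStreamingContext:
    """Toy model: registration, job generation and job execution are separate."""

    def __init__(self):
        self.output_ops = []  # functions registered via foreach_rdd
        self.jobs = []        # (batch_time, callable) pairs, created but not run

    def foreach_rdd(self, func):
        # Registration only: no job exists yet.
        self.output_ops.append(func)

    def on_timer_tick(self, time, rdd):
        # Called by the (simulated) timer once per batch interval.
        # A job is created for each output operation, but not executed.
        for func in self.output_ops:
            self.jobs.append((time, lambda f=func, r=rdd, t=time: f(r, t)))

    def run_pending(self):
        # The scheduler, not the output operation, decides when jobs run.
        results = [job() for _, job in self.jobs]
        self.jobs.clear()
        return results


ctx = ToyStreamingContext()
ctx.foreach_rdd(lambda rdd, time: (time, sum(rdd)))
jobs_after_registration = len(ctx.jobs)

ctx.on_timer_tick(1, [1, 2, 3])
ctx.on_timer_tick(2, [10])
jobs_after_ticks = len(ctx.jobs)

results = ctx.run_pending()
print(jobs_after_registration, jobs_after_ticks, results)
```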
The RDD operations written inside foreachRDD are not executed unless they include an action.
foreachRDD is the back door of Spark Streaming: it lets you operate directly on the RDD, with those operations wrapped inside foreachRDD.
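The lazy behaviour mentioned above can be shown with a small stand-in. This is a sketch in plain Python, not the Spark API: `LazyRDD` is a hypothetical class whose `map` only records the function and whose `collect` plays the role of an action.

```python
class LazyRDD:
    """Toy lazy collection: transformations are recorded, actions execute them."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = tuple(ops)

    def map(self, func):
        # Transformation: returns a new LazyRDD and runs nothing.
        return LazyRDD(self.data, self.ops + (func,))

    def collect(self):
        # Action: only now are the recorded functions applied.
        out = []
        for item in self.data:
            for func in self.ops:
                item = func(item)
            out.append(item)
        return out


calls = []

def traced_increment(x):
    calls.append(x)  # records that the function really ran
    return x + 1


rdd = LazyRDD([1, 2, 3]).map(traced_increment)
calls_before_action = list(calls)  # still empty: map alone did no work
result = rdd.collect()
print(calls_before_action, result)
```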
Summary:
All the logical operations in Spark Streaming are operations on DStreams, and an operation on a DStream is in fact an operation on RDDs: the DStream is the template for the RDDs.
Each DStream depends on the DStream before it:
The DStream map operation maps to the RDD map operation:
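The idea that a mapped stream depends on its parent, and that a map on the stream becomes a map on each batch's RDD, can be sketched as follows. This is a toy model in plain Python, not Spark's classes: `InputStream` and `MappedStream` are hypothetical names, lists stand in for RDDs, and unlike a real RDD the list comprehension runs eagerly.

```python
class InputStream:
    """Toy source stream: a fixed table of batch time -> records."""

    def __init__(self, batches):
        self.batches = batches

    def compute(self, time):
        return self.batches.get(time)  # None when there is no batch


class MappedStream:
    """Toy mapped stream: holds a reference to its parent and a function."""

    def __init__(self, parent, func):
        self.parent = parent
        self.func = func

    def compute(self, time):
        # Ask the parent for the batch first, then apply the function to it.
        parent_rdd = self.parent.compute(time)
        if parent_rdd is None:
            return None
        return [self.func(x) for x in parent_rdd]


lines = InputStream({1: ["a b", "c"], 2: ["d e f"]})
word_counts = MappedStream(lines, lambda line: len(line.split()))
scaled = MappedStream(word_counts, lambda n: n * 10)

batch1 = scaled.compute(1)   # walks scaled -> word_counts -> lines
missing = scaled.compute(3)  # no batch at time 3
print(batch1, missing)
```

Calling `compute` on the last stream is enough: each stream pulls from the one before it, so the whole chain is reached from the end.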
On what basis does a DStream produce RDDs? On the batch interval. Study how the RDDs are generated, and see how the operations on a DStream trigger the creation of an RDD.
RDDs are generated per time instance, aligned with batchDuration. For example, if the timer interval is 1 second, one RDD is generated every second.
Each such RDD corresponds to a job, because it is the last RDD produced by the DStream operations for that time interval. Later RDDs depend on earlier RDDs, so following the dependencies from back to front yields the entire dependency chain.
From the official documentation:
The computation is derived from back to front, so it only needs a handle to the last RDD. Based on the time, the RDD dependencies are traced from back to front, which gives the corresponding spatial relationship.
How is the generated RDD obtained?
Each RDD corresponds to a batchDuration. DStream has a getOrCompute method that, for a given batch time, returns the RDD for that batch, either from the cache or by computing it.
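The cache-or-compute behaviour described above can be sketched as a toy model. This is plain Python under stated assumptions, not Spark's source: `ToyDStream`, `get_or_compute` and the `compute_calls` counter are hypothetical, and a list stands in for an RDD.

```python
class ToyDStream:
    """Toy stream that computes each batch at most once and caches the result."""

    def __init__(self, compute):
        self._compute = compute   # function: batch time -> "RDD" or None
        self.generated_rdds = {}  # batch time -> cached "RDD"
        self.compute_calls = 0    # counts real computations, for the demo

    def get_or_compute(self, time):
        # Cache hit: return the RDD already generated for this batch time.
        if time in self.generated_rdds:
            return self.generated_rdds[time]
        # Cache miss: compute it, remember it, and return it.
        self.compute_calls += 1
        rdd = self._compute(time)
        if rdd is not None:
            self.generated_rdds[time] = rdd
        return rdd


stream = ToyDStream(lambda t: [t, t * 2])
first = stream.get_or_compute(5)
second = stream.get_or_compute(5)  # served from the cache
print(first, stream.compute_calls)
```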
Here the rdd variable is produced but not executed. It exists only at the logical level of the code, so it can be managed and optimized at the framework level.
Note: Spark Streaming actually generates an RDD even when no data has arrived, so the source code can be modified here to improve performance.
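Modifying the source is one option. A lighter-weight alternative, shown here only as a hedged sketch, is to guard the per-batch handler so that empty batches are skipped at the application level. The handler `handle_batch` and the `batches` table are hypothetical, and lists stand in for RDDs; this is not the source change the note refers to.

```python
processed = []

def handle_batch(rdd, time):
    """Process one batch; return False when the batch was skipped as empty."""
    if not rdd:  # empty batch: do no work at all
        return False
    processed.append((time, len(rdd)))
    return True


# Simulated batches: the one at time 2 arrived with no data.
batches = {1: ["a", "b"], 2: [], 3: ["c"]}
handled = [handle_batch(batches[t], t) for t in sorted(batches)]
print(handled, processed)
```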
Thanks to teacher Liaoliang for sharing this knowledge.
Spark Release Note 8: Interpreting the full life cycle of the Spark Streaming RDD