Contents of this issue:
- updateStateByKey explained
- mapWithState explained
Why Spark Streaming needs state management:
01, Spark Streaming divides work by batch duration: each batch duration produces one job. To meet business needs, however,
we often have to compute over the data of the last hour or the last week. Because that time span is far longer than one batch duration, maintaining state across batches is unavoidable.
02, Spark Streaming provides several functions for state management. The typical ones are updateStateByKey and mapWithState, which carry out the core steps.
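The two points above can be illustrated with a small plain-Python model. This is only a sketch and does not use Spark: the names `run_batch_job` and `run_stateful` and the sample batches are assumptions made for illustration. It shows that a job for one batch sees only that batch, so any result spanning many batches needs a value carried from one batch to the next.

```python
from collections import Counter

def run_batch_job(batch):
    """One job per batch duration: it sees only that batch's records."""
    return Counter(batch)

def run_stateful(batches):
    """Carry a running count across batches; the carried value is the state."""
    state = Counter()
    for batch in batches:
        state.update(run_batch_job(batch))  # fold this batch into the state
    return state

# Three hypothetical batches of words, one per batch duration.
batches = [["a", "b"], ["a"], ["b", "b"]]
per_batch = [run_batch_job(b) for b in batches]  # independent per-batch results
totals = run_stateful(batches)                   # result spanning all batches
```

Here `per_batch` holds three unrelated counts, while `totals` depends on every batch seen so far, which is the situation state management addresses.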
First, updateStateByKey:
It updates the state on top of the existing historical data, according to the updateFunc function, and returns a DStream.
That DStream then keeps generating RDDs, batch after batch.
How each RDD is generated and computed:
For every batch, the incoming data is grouped by key K together with all of the existing state data:
Pros: The computation is straightforward. Whenever the state RDD has to be computed, it is computed with a single cogroup of the previous state and the new data.
Cons: Performance. Every batch has to scan all of the data, which ends up as a CoGroupedRDD, so processing becomes slower and slower as the amount of data grows.
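The cost described above can be sketched with a plain-Python model of one batch. This is not Spark's implementation: `cogroup`, `update_state_by_key`, and `running_sum` are hypothetical helpers written for illustration. The update function mirrors the shape of updateFunc (new values for the key plus the previous state, returning the new state or nothing to drop the key), and the `calls` counter shows that every key is visited in every batch, whether or not new data arrived for it.

```python
def cogroup(state, batch):
    """Group previous state and new values under every key seen in either."""
    grouped = {key: ([], [old]) for key, old in state.items()}
    for key, value in batch:
        grouped.setdefault(key, ([], []))[0].append(value)
    return grouped

def update_state_by_key(state, batch, update_func):
    """Model of one batch: update_func runs for every key, old or new."""
    calls = 0
    new_state = {}
    for key, (values, olds) in cogroup(state, batch).items():
        calls += 1
        result = update_func(values, olds[0] if olds else None)
        if result is not None:  # returning None drops the key from the state
            new_state[key] = result
    return new_state, calls

def running_sum(new_values, old_total):
    """Example update function: add this batch's values to the old total."""
    return sum(new_values) + (old_total or 0)

state, calls1 = update_state_by_key({}, [("a", 1), ("b", 2)], running_sum)
state, calls2 = update_state_by_key(state, [("a", 5)], running_sum)
```

In the second batch only key "a" received data, yet the update function still ran for "b" as well. With millions of keys in the state, that full scan per batch is the slowdown the text refers to.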
Second, mapWithState:
It also returns a DStream. State updates and the maintenance of historical state are keyed by K, and the update function, the timeout, the initial state, and so on are supplied through StateSpec (which wraps the update function).
Updating or deleting state is like working with records in a table: the key selects which record of the historical data is operated on, and State plays the role of the table name or index through which you get and update data and maintain status.
Every partition is represented by one MapWithStateRDDRecord, whose underlying data structure is a StateMap, and the state is maintained per key K.
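The table analogy can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's code: the `State` class here is a hypothetical stand-in that copies only the exists/get/update/remove operations, a plain dict stands in for the per-partition StateMap, and timeouts and initial state are left out.

```python
class State:
    """Handle onto one key's entry in a partition's state map (a model)."""
    def __init__(self, state_map, key):
        self._map, self._key = state_map, key

    def exists(self):
        return self._key in self._map

    def get(self):
        return self._map[self._key]

    def update(self, value):
        self._map[self._key] = value

    def remove(self):
        self._map.pop(self._key, None)

def map_with_state(state_map, batch, mapping_func):
    """Model of one batch on one partition: only incoming keys are touched."""
    mapped = []
    for key, value in batch:
        mapped.append(mapping_func(key, value, State(state_map, key)))
    return mapped

def running_sum(key, value, state):
    """Example mapping function: keep a per-key running total in the state."""
    total = value + (state.get() if state.exists() else 0)
    state.update(total)
    return (key, total)

state_map = {"a": 1, "b": 2}  # hypothetical state left by earlier batches
mapped = map_with_state(state_map, [("a", 5)], running_sum)
```

Only the record for "a" is read and written; "b" stays in the map untouched. The work per batch therefore follows the size of the incoming data rather than the size of the whole state, which is the contrast with the cogroup approach of updateStateByKey.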
Note:
- Source: Liaoliang (Spark release version customization)
- Sina Weibo: http://www.weibo.com/ilovepains
Spark Streaming source code interpretation: state management with updateStateByKey and mapWithState explained