Since the project's computation modules rely on Spark, Spark has to work with data of different sizes and shapes while keeping data transformation and model computation as stable as possible. This is also the bottleneck that elemental currently needs to optimize. Here we discuss some of the problems encountered in the following scenarios:
- The data is too large to be cached in memory.
- A DataFrame that has gone through many transformations produces an overly long physical plan once an action is triggered.
- A DataFrame or RDD that has gone through many transformations generates an overly long DAG once an action is triggered.

Theory

Cache
When we cache a DataFrame, the LogicalPlan built up to that point is executed once. The result of each partition is stored in CachedBatch format and collected into the cachedData list, and the corresponding RDD becomes a persisted RDD. After the DataFrame is cached, subsequent computations reuse the cached data, so for small data sizes, simply caching the DataFrame or RDD solves the problems listed above. For larger data we can also choose to cache to disk.

Checkpoint
When we call checkpoint on an RDD, we only attach a mark indicating that the RDD should be checkpointed. During a subsequent action, after runJob has computed the RDD, doCheckpoint is invoked and performs the actual checkpointing of the marked RDD; in this process the RDD lineage is in fact computed a second time. For a DataFrame, the checkpoint parameter eager defaults to true, so the call immediately runs a simple count action on the corresponding internal RDD. This completes the checkpoint of the DataFrame's data and, of course, cleans up the preceding lineage, reducing the complexity of the DAG and the PhysicalPlan.

Explore
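To ground the exploration below, here is a minimal sketch of the two mechanisms just described, assuming an existing SparkSession named spark; the checkpoint directory and data are illustrative:

import org.apache.spark.storage.StorageLevel

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")   // hypothetical path

val df = spark.range(0, 1000).toDF("id")

// Cache: the current plan is executed once and each partition's result is kept
// by the cache manager; later actions reuse it instead of recomputing.
df.persist(StorageLevel.DISK_ONLY)
df.count()                           // materializes the cache

// Checkpoint: with the default eager = true, an internal count runs immediately
// and the pre-order lineage is cut off.
val checkpointed = df.checkpoint()
checkpointed.explain(true)           // the plan now starts from the checkpointed data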
The test steps are as follows:
val df1 = df.withColumn("idtype", df("id") % 9)   // per the logical plan below: idtype = id % 9
val df2 = df1.groupBy("idtype").sum("double")
val df3 = df2.withColumn(...)                     // adds the rst column shown in the plan (arguments elided in the source)
We use the count action as the sample for the DAG and plan analysis (for the checkpoint case, the count is performed after df1, df2, and df3 have been checkpointed).
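The comparison can be reproduced roughly as sketched below; in the checkpoint variant each step is rebuilt on top of its checkpointed predecessor (column expressions are taken from the plans, everything else is an assumption):

// Without checkpoint: print the plans and run the action.
df3.explain(true)   // parsed / analyzed / optimized logical plans plus the physical plan
df3.count()

// With checkpoint: checkpoint each intermediate DataFrame and build the next step on it,
// so every subsequent plan starts from the checkpointed LogicalRDD.
val df1c = df.withColumn("idtype", df("id") % 9).checkpoint()
val df2c = df1c.groupBy("idtype").sum("double").checkpoint()
// df3 is then derived from df2c in the same way, and explained / counted as above.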
1. DataFrame checkpoint comparison
Without checkpoint, the logical plans are:
Aggregate [count(1) AS count#22L]
+- Project [id#3, double#4, plusone#5, (id#3 % 9) AS idtype#10]
   +- LogicalRDD [id#3, double#4, plusone#5]

Aggregate [count(1) AS count#41L]
+- Aggregate [idtype#10], [idtype#10, sum(cast(double#4 as bigint)) AS sum(double)#33L]
   +- Project [id#3, double#4, plusone#5, (id#3 % 9) AS idtype#10]
      +- LogicalRDD [id#3, double#4, plusone#5]

Aggregate [count(1) AS count#56L]
+- Project [idtype#10, sum(double)#33L, (cast(sum(double)#33L as double) / cast(... as double)) AS rst#46]
   +- Aggregate [idtype#10], [idtype#10, sum(cast(double#4 as bigint)) AS sum(double)#33L]
      +- Project [id#3, double#4, plusone#5, (id#3 % 9) AS idtype#10]
         +- LogicalRDD [id#3, double#4, plusone#5]
Since each DataFrame depends on the previous one, the plans above are as expected: each plan carries the full lineage of its predecessors.
After using checkpoint:
Aggregate [count(1) AS count#83L]
+- Project [id#64, double#65, plusone#66, (id#64 % 9) AS idtype#71]
   +- LogicalRDD [id#64, double#65, plusone#66]

Aggregate [count(1) AS count#108L]
+- Aggregate [idtype#71], [idtype#71, sum(cast(double#65 as bigint)) AS sum(double)#100L]
   +- LogicalRDD [id#64, double#65, plusone#66, idtype#71]

Aggregate [count(1) AS count#129L]
+- Project [idtype#71, sum(double)#100L, (cast(sum(double)#100L as double) / cast(... as double)) AS rst#119]
   +- LogicalRDD [idtype#71, sum(double)#100L]
Clearly the LogicalPlan has been cut short, and the corresponding PhysicalPlan is reduced accordingly.
As for the DAG changes, we only list df3 here, which is enough to illustrate the point.
Figure 1: DAG of df3.count without checkpoint
Figure 2: DAG of df3.count with df2 checkpointed
Comparing the two, the number of stages is reduced (Figure 1 has only 3 stages after PhysicalPlan optimization, although the LogicalPlan already corresponds to 4). Figure 1 starts from the very first df, while Figure 2 starts directly from the previously checkpointed df2.
2. RDD checkpoint comparison
Performing similar operations with RDDs, the DAG is reduced in the same way. Here we compare the recursive dependency information of the RDDs:
(4) MapPartitionsRDD[64] at map at alextestjob.scala:115 []
 |  ShuffledRDD[63] at groupBy at alextestjob.scala:115 []
 +-(4) MapPartitionsRDD[62] at groupBy []
    |  MapPartitionsRDD[61] at map at alextestjob.scala:109 []
    |  ParallelCollectionRDD[60] at parallelize at alextestjob.scala:106 []

(4) MapPartitionsRDD[71] at map at alextestjob.scala:141 []
 |  ReliableCheckpointRDD[72] at count at alextestjob.scala:147 []
This is the toDebugString comparison for rdd2 with and without checkpoint; a sketch of how such output can be produced follows.
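The data, partition count, and call sites below are illustrative (not the original alextestjob.scala); spark is an assumed SparkSession and the checkpoint directory is hypothetical:

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

val rdd1 = spark.sparkContext.parallelize(0 until 1000, 4).map(i => (i % 9, i))
val rdd2 = rdd1.groupBy(_._1).map { case (k, vs) => (k, vs.size) }

println(rdd2.toDebugString)   // full lineage: MapPartitionsRDD <- ShuffledRDD <- ... <- ParallelCollectionRDD

rdd2.checkpoint()
rdd2.count()                  // the action triggers doCheckpoint and writes a ReliableCheckpointRDD

println(rdd2.toDebugString)   // lineage now stops at the ReliableCheckpointRDD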
3. DataFrame checkpoint inside a loop body
Here we test with the following logic:
var df = ...   // the initial DataFrame (initialization elided in the source)
(0 until 5).foreach { idx =>
  df = df.withColumn(s"addcol_$idx", df.col("id") + idx)
}
LogicalPlan comparison:
'Project [*, (id#97 + 4) AS idtype_4#134]
+- Project [id#97, double#98, plusone#99, idtype_0#104, idtype_1#110, idtype_2#117, (id#97 + 3) AS idtype_3#125]
   +- Project [id#97, double#98, plusone#99, idtype_0#104, idtype_1#110, (id#97 + 2) AS idtype_2#117]
      +- Project [id#97, double#98, plusone#99, idtype_0#104, (id#97 + 1) AS idtype_1#110]
         +- Project [id#97, double#98, plusone#99, (id#97 + 0) AS idtype_0#104]
            +- LogicalRDD [id#97, double#98, plusone#99]
After checkpointing on each iteration:
'Project [*, (id#3 + 4) AS idtype_4#75]
+- LogicalRDD [id#3, double#4, plusone#5, idtype_0#15, idtype_1#27, idtype_2#41, idtype_3#57]
This reduction is of practical significance for model calculations that iterate many times and would otherwise keep growing the DAG; a sketch of the checkpointed loop follows.
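A sketch of the checkpointed loop, assuming an existing SparkSession spark with a checkpoint directory already set; the double and plusone columns merely mimic the schema seen in the plans:

import org.apache.spark.sql.functions.col

var df = spark.range(0, 1000).toDF("id")
  .withColumn("double", col("id") * 2)
  .withColumn("plusone", col("id") + 1)

(0 until 5).foreach { idx =>
  df = df.withColumn(s"addcol_$idx", df.col("id") + idx)
  df = df.checkpoint()   // cut the lineage so the next iteration starts from a LogicalRDD
}

df.explain(true)   // the plan stays flat instead of stacking five nested Projects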
4. Checkpoint vs. cache (DISK_ONLY)
A cache that uses only DISK_ONLY can be understood as roughly a localCheckpoint process.

Conclusion

Both cache and checkpoint essentially persist part of the intermediate results to reduce repeated computation in later steps. A cache tends to hold more frequently used data of smaller size, preferably in memory, while checkpoint has no such constraint on data size. A checkpoint computes the data twice, which is an extra cost; a DISK_ONLY cache computes it once to disk, but the cached result can only be used by the program running on that driver. It is in fact used this way in elemental; one only needs to consider whether the disk space on the driver's machine is sufficient. Checkpointing to Alluxio, on the other hand, makes unified management convenient. The bigger advantages of checkpoint are in Spark Streaming and in recoverability. Whether an RDD that has been marked for checkpoint is actually checkpointed depends on whether an action is performed after the mark. For example:
rdd.checkpoint()
rdd.count()
This checkpoints successfully, but:
rdd.count()
rdd.checkpoint()
rdd.count()
This cannot be checkpointed. Therefore, call checkpoint and then run an action immediately. The recommended pattern is:
rdd.checkpoint()
rdd.persist(StorageLevel.DISK_ONLY)
rdd.count()   // the action materializes both the cache and the checkpoint
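Putting the pieces together, an end-to-end sketch under the assumptions above (the Alluxio URI and the data are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
val sc = spark.sparkContext

// Checkpointing to a shared store such as Alluxio keeps checkpoint data under unified management.
sc.setCheckpointDir("alluxio://alluxio-master:19998/checkpoints")   // hypothetical URI

val rdd = sc.parallelize(0 until 1000, 4).map(i => (i % 9, i))

rdd.checkpoint()                      // only marks the RDD; nothing is written yet
rdd.persist(StorageLevel.DISK_ONLY)   // lets the checkpoint writer read cached data instead of recomputing
rdd.count()                           // the action materializes the cache and triggers doCheckpoint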