Spark: Fast Big Data Analytics
8.4 Key Performance Considerations
Degree of parallelism
Logically, an RDD is a single collection of objects. During physical execution, however, the RDD is divided into a set of partitions, each holding a subset of the total data. When Spark schedules and runs a job, it creates one task for each partition, and by default each task requires one core in the cluster to execute. Spark automatically infers a suitable degree of parallelism for an RDD, and this is sufficient for most use cases. An input RDD generally chooses its degree of parallelism based on the underlying storage system: for example, an RDD that reads data from HDFS creates one partition for each block of the file on HDFS. An RDD derived from shuffling another RDD takes the same degree of parallelism as its parent RDD.
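To make partition-count inheritance concrete, here is a minimal sketch (the object name ParallelismDemo and the local[2] master are illustrative assumptions, not from the text): it builds an input RDD with an explicit number of partitions and shows that a narrow transformation such as map() keeps the parent's degree of parallelism.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismDemo {
  // Returns (parent partition count, child partition count) for an RDD
  // built with an explicit numSlices. map() is a narrow transformation,
  // so the child RDD inherits the parent's degree of parallelism.
  def partitionCounts(numSlices: Int): (Int, Int) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("ParallelismDemo"))
    val input = sc.parallelize(1 to 100, numSlices) // explicit parallelism
    val doubled = input.map(_ * 2)                  // no shuffle involved
    (input.partitions.length, doubled.partitions.length)
  }
}
```

Both counts come back equal to the requested numSlices, confirming that narrow dependencies simply reuse the parent's partitioning.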
The degree of parallelism affects performance in two ways. First, if the degree of parallelism is too low, the Spark cluster will leave resources idle. For example, if your application has 1000 cores available but runs only 30 tasks, you should increase the degree of parallelism to take advantage of more of them. Conversely, if the degree of parallelism is too high, the small overhead associated with each partition accumulates and becomes significant. Signs that parallelism is too high include tasks that complete almost instantly (within milliseconds), or tasks that read and write no data at all.
Spark provides two ways to tune the degree of parallelism of an operation. The first is to pass a degree of parallelism as a parameter to a shuffle operation, which sets the number of partitions of the shuffled RDD. The second is to repartition an existing RDD into more or fewer partitions. Repartitioning is done with repartition(), which shuffles the RDD randomly and divides it into the requested number of partitions. If you know you are only reducing the number of partitions, you can use coalesce() instead; it is more efficient than repartition() because it avoids shuffling the data. Whenever you believe the current parallelism is too high or too low, you can use these methods to readjust the partitioning.
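The two tuning methods can be sketched as follows. This is a hypothetical snippet (the object name and sample data are assumptions, not from the text): it passes an explicit partition count to a shuffle operation, reduceByKey(), and then reduces the partition count of the result with coalesce().

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleParallelismDemo {
  // Returns (partitions after shuffle, partitions after coalesce).
  def shufflePartitionCounts(): (Int, Int) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("ShuffleParallelismDemo"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
    // Method 1: give the shuffle operation an explicit degree of parallelism.
    val reduced = pairs.reduceByKey(_ + _, 10)
    // Method 2: repartition an existing RDD; coalesce() only reduces the
    // partition count, so it can avoid a full shuffle.
    val fewer = reduced.coalesce(2)
    (reduced.partitions.length, fewer.partitions.length)
  }
}
```

The shuffled RDD ends up with 10 partitions regardless of its parent's 4, while coalesce() merges those 10 down to 2 without another shuffle.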
For example, suppose we read a large amount of data from S3 and immediately run a filter() operation that discards most of the dataset. By default, filter() returns an RDD with the same number of partitions as its parent, so it can produce many empty partitions or partitions holding only a small amount of data. In this situation you can improve the application's performance by coalescing into an RDD with fewer partitions.
def testCoalesce = {
  val conf = new SparkConf().setMaster("local").setAppName("testCoalesce")
  val sc = new SparkContext(conf)
  val input = sc.parallelize(1 to 9999, 1000)
  logger.warn(s"RDD[input] partitionCount[${input.partitions.length}]")
  val test = input.filter(x => x % 2015 == 0)
  logger.warn(s"RDD[test] partitionCount[${test.partitions.length}]")
  val test2 = test.coalesce(2, true).cache()
  logger.warn(s"RDD[test2] partitionCount[${test2.partitions.length}]")
  val result = test2.collect()
  logger.warn(s"result[${result.mkString(",")}]")
  Thread.sleep(Int.MaxValue)
}
Execution results (stage progress bars omitted):

00:47:21 831 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:19): RDD[input] partitionCount[1000]
00:47:22 009 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:22): RDD[test] partitionCount[1000]
00:47:22 122 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:25): RDD[test2] partitionCount[2]
00:47:30 466 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:29): result[2015,4030,6045,8060]
You can see the task execution statistics by opening http://localhost:4040/jobs/.