RDD Degree of Parallelism: Key Performance Considerations


Spark: Fast Big Data Analysis

8.4 Key Performance Considerations

Degree of parallelism

Logically, an RDD is a collection of objects. During physical execution, the RDD is divided into a series of partitions,

each of which is a subset of the entire data set. When Spark schedules and runs a job, it creates

one task for the data in each partition, and by default each task requires one compute node in the cluster to execute.

Spark automatically infers an appropriate degree of parallelism for each RDD, which is sufficient for most use cases.

An input RDD typically chooses its degree of parallelism based on its underlying storage system. For example, an input RDD that reads data from HDFS

creates one partition for each file block on HDFS. An RDD derived from another RDD (for example, after a shuffle)

will take the same degree of parallelism as its parent RDD.
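As a minimal sketch of both behaviors (the HDFS path below is hypothetical, and the partition count depends on the file's block layout):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PartitionCountDemo")
    val sc = new SparkContext(conf)

    // Reading from HDFS: one partition is created per file block.
    // (hdfs://namenode:8020/data/logs.txt is a hypothetical path.)
    val input = sc.textFile("hdfs://namenode:8020/data/logs.txt")
    println(s"input partitions: ${input.partitions.length}")

    // A derived RDD keeps its parent's degree of parallelism by default.
    val lengths = input.map(_.length)
    println(s"lengths partitions: ${lengths.partitions.length}")

    sc.stop()
  }
}
```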

The degree of parallelism affects the performance of the program in two ways.

First, when the degree of parallelism is too low, the Spark cluster will have idle resources.

For example, if your application has 1000 compute nodes available but runs only 30 tasks, you should increase the degree of parallelism

to take advantage of more compute nodes.

Second, when the degree of parallelism is too high, the small overhead associated with each partition accumulates. Signs that parallelism is too high include

tasks that complete almost instantly (within milliseconds), or tasks that read and write no data at all.

Spark provides two methods for tuning the degree of parallelism of an operation.

The first is to specify the degree of parallelism for the shuffled RDD by passing a parameter to the shuffle operation.
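For example, shuffle operations such as reduceByKey() accept an optional numPartitions argument that sets the degree of parallelism of the resulting RDD (the sample data and the value 100 below are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleParallelismDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("ShuffleParallelismDemo"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Pass the degree of parallelism directly to the shuffle operation:
    // the resulting RDD will have 100 partitions.
    val counts = pairs.reduceByKey((x, y) => x + y, 100)
    println(counts.partitions.length)  // 100

    sc.stop()
  }
}
```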

The second applies to any existing RDD, which can be repartitioned to obtain more or fewer partitions.

Repartitioning is performed with repartition(), which shuffles the RDD randomly and divides it into the requested number of partitions.

If you know you only want to reduce the number of partitions, you can use the coalesce() operation instead. It is more efficient than repartition() because it avoids shuffling the data.

If you think the current degree of parallelism is too high or too low, you can use these methods to readjust the partitioning.
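A minimal sketch of both operations (the partition counts here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("RepartitionDemo"))
    val rdd = sc.parallelize(1 to 1000, 100)

    // repartition() performs a full shuffle; it can increase or decrease partitions.
    val more = rdd.repartition(500)
    println(more.partitions.length)  // 500

    // coalesce() avoids a shuffle, so it can only decrease partitions.
    val fewer = rdd.coalesce(10)
    println(fewer.partitions.length)  // 10

    sc.stop()
  }
}
```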

For example, suppose we read a large amount of data from S3 and then immediately apply a filter() operation that removes

most of the data. By default, filter() returns an RDD with the same number of partitions as its parent, which produces many empty partitions

or partitions holding only a small amount of data. In this case, you can improve the application's performance by coalescing the RDD into fewer partitions.

import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

def testCoalesce = {
  val logger = LoggerFactory.getLogger("TestCoalesce")
  val conf = new SparkConf().setMaster("local").setAppName("TestCoalesce")
  val sc = new SparkContext(conf)
  val input = sc.parallelize(1 to 9999, 1000)
  logger.warn(s"RDD[input] partitionCount[${input.partitions.length}]")
  val test = input.filter { x => x % 2015 == 0 }
  logger.warn(s"RDD[test] partitionCount[${test.partitions.length}]")
  val test2 = test.coalesce(2, true).cache()
  logger.warn(s"RDD[test2] partitionCount[${test2.partitions.length}]")
  val result = test2.collect()
  logger.warn(s"Result[${result.mkString(",")}]")
  Thread.sleep(Int.MaxValue)  // keep the application alive so the web UI at :4040 stays available
}

  

Execution Results

00:47:21 831 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:19): RDD[input] partitionCount[1000]
00:47:22 009 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:22): RDD[test] partitionCount[1000]
00:47:22 122 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:25): RDD[test2] partitionCount[2]
[Stage 0 progress output omitted: (58 + 1) / 1000 ... (953 + 1) / 1000]
00:47:30 466 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:29): Result[2015,4030,6045,8060]
You can see the task execution statistics by opening http://localhost:4040/jobs/.

