Spark: Fast Big Data Analytics
8.4 Key Performance Considerations
Degree of parallelism
Logically, an RDD is a single collection of objects. During physical execution, however, the RDD is divided into a set of partitions, each holding a subset of the total data. When Spark schedules and runs a job, it creates one task for each partition, and by default each task requires one core in the cluster to execute. Spark automatically infers a suitable degree of parallelism for an RDD, and this is sufficient for most use cases. An input RDD generally chooses its degree of parallelism based on the underlying storage system: for example, an RDD that reads data from HDFS creates one partition for each block of the file on HDFS. An RDD derived from shuffling another RDD takes the same degree of parallelism as its parent RDD.
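To make partition-count inheritance concrete, here is a minimal sketch (the object name ParallelismDemo and the local[2] master are illustrative assumptions, not from the text): it builds an input RDD with an explicit number of partitions and shows that a narrow transformation such as map() keeps the parent's degree of parallelism.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismDemo {
  // Returns (parent partition count, child partition count) for an RDD
  // built with an explicit numSlices. map() is a narrow transformation,
  // so the child RDD inherits the parent's degree of parallelism.
  def partitionCounts(numSlices: Int): (Int, Int) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("ParallelismDemo"))
    val input = sc.parallelize(1 to 100, numSlices) // explicit parallelism
    val doubled = input.map(_ * 2)                  // no shuffle involved
    (input.partitions.length, doubled.partitions.length)
  }
}
```

Both counts come back equal to the requested numSlices, confirming that narrow dependencies simply reuse the parent's partitioning.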
The degree of parallelism affects performance in two ways. First, if the degree of parallelism is too low, the Spark cluster will leave resources idle. For example, if your application has 1000 cores available but runs only 30 tasks, you should increase the degree of parallelism to take advantage of more of them. Conversely, if the degree of parallelism is too high, the small overhead associated with each partition accumulates and becomes significant. Signs that parallelism is too high include tasks that complete almost instantly (within milliseconds), or tasks that read and write no data at all.
Spark provides two ways to tune the degree of parallelism of an operation. The first is to pass a degree of parallelism as a parameter to a shuffle operation, which sets the number of partitions of the shuffled RDD. The second is to repartition an existing RDD into more or fewer partitions. Repartitioning is done with repartition(), which shuffles the RDD randomly and divides it into the requested number of partitions. If you know you are only reducing the number of partitions, you can use coalesce() instead; it is more efficient than repartition() because it avoids shuffling the data. Whenever you believe the current parallelism is too high or too low, you can use these methods to readjust the partitioning.
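The two tuning methods can be sketched as follows. This is a hypothetical snippet (the object name and sample data are assumptions, not from the text): it passes an explicit partition count to a shuffle operation, reduceByKey(), and then reduces the partition count of the result with coalesce().

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleParallelismDemo {
  // Returns (partitions after shuffle, partitions after coalesce).
  def shufflePartitionCounts(): (Int, Int) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("ShuffleParallelismDemo"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
    // Method 1: give the shuffle operation an explicit degree of parallelism.
    val reduced = pairs.reduceByKey(_ + _, 10)
    // Method 2: repartition an existing RDD; coalesce() only reduces the
    // partition count, so it can avoid a full shuffle.
    val fewer = reduced.coalesce(2)
    (reduced.partitions.length, fewer.partitions.length)
  }
}
```

The shuffled RDD ends up with 10 partitions regardless of its parent's 4, while coalesce() merges those 10 down to 2 without another shuffle.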
For example, suppose we read a large amount of data from S3 and immediately run a filter() operation that discards most of the dataset. By default, filter() returns an RDD with the same number of partitions as its parent, so it can produce many empty partitions or partitions holding only a small amount of data. In this situation you can improve the application's performance by coalescing into an RDD with fewer partitions.
def testCoalesce = {
  val conf = new SparkConf().setMaster("local").setAppName("testCoalesce")
  val sc = new SparkContext(conf)
  val input = sc.parallelize(1 to 9999, 1000)
  logger.warn(s"RDD[input] partitionCount[${input.partitions.length}]")
  val test = input.filter(x => x % 2015 == 0)
  logger.warn(s"RDD[test] partitionCount[${test.partitions.length}]")
  val test2 = test.coalesce(2, true).cache()
  logger.warn(s"RDD[test2] partitionCount[${test2.partitions.length}]")
  val result = test2.collect()
  logger.warn(s"result[${result.mkString(",")}]")
  Thread.sleep(Int.MaxValue)
}
Execution results (stage progress bars omitted):

00:47:21 831 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:19): RDD[input] partitionCount[1000]
00:47:22 009 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:22): RDD[test] partitionCount[1000]
00:47:22 122 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:25): RDD[test2] partitionCount[2]
00:47:30 466 [main] WARN test.scala.spark.TestSpark2$.testCoalesce (TestSpark2.scala:29): result[2015,4030,6045,8060]
You can see the task execution statistics by opening http://localhost:4040/jobs/.