Improving the performance of Spark by partitioning

Tags: shuffle

At Sortable, much of the data processing work is done with Spark. While using Spark, the team discovered a technique that can improve the performance of a Spark job: changing the number of partitions in the data. This article walks through an example and describes in detail how to do it.

Find prime numbers

For example, suppose we need to find all prime numbers from 2 to 2000000. It is natural to first find all the non-prime (composite) numbers; every remaining number is a prime we are looking for.

We first iterate over every number between 2 and 2000000, and for each one compute all of its multiples that are less than or equal to 2000000. The result contains many duplicates (for example, 6 is a multiple of both 2 and 3), but that does not affect correctness.
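Before running this at scale in the Spark shell, here is a minimal plain-Scala sketch of the same idea on a tiny range (the bound smallN and the value 30 are chosen only for illustration):

// Sketch (no Spark): generate all composite numbers up to smallN, then
// everything left over in 2..smallN is prime.
val smallN = 30
val composites = (2 to smallN).flatMap(x => (2 to smallN / x).map(_ * x)).toSet
val primes = (2 to smallN).filterNot(composites.contains)
// primes -> 2, 3, 5, 7, 11, 13, 17, 19, 23, 29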

We run the calculation in the Spark shell:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val n = 2000000
n: Int = 2000000

scala> val composite = sc.parallelize(2 to n, 8).map(x => (x, (2 to (n / x)))).flatMap(kv => kv._2.map(_ * kv._1))
composite: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at flatMap at <console>:29

scala> val prime = sc.parallelize(2 to n, 8).subtract(composite)
prime: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at subtract at <console>:31

scala> prime.collect()
res0: Array[Int] = Array(563249, 17, 281609, 840761, 1126513, 1958993, 840713, 1959017, 41, 281641, 1681513, 1126441, 73, 1126457, 89, 840817, 97, 1408009, 113, 137, 1408241, 563377, 1126649, 281737, 281777, 840841, 1408217, 1681649, 281761, 1408201, 1959161, 1408177, 840929, 563449, 1126561, 193, 1126577, 1126537, 1959073, 563417, 233, 281849, 1126553, 563401, 281833, 241, 563489, 281, 281857, 257, 1959241, 313, 841081, 337, 1408289, 563561, 281921, 353, 1681721, 409, 281993, 401, 1126897, 282001, 1126889, 1959361, 1681873, 563593, 433, 841097, 1959401, 1408417, 1959313, 1681817, 457, 841193, 449, 563657, 282089, 282097, 1408409, 1408601, 1959521, 1682017, 841241, 1408577, 569, 1408633, 521, 841273, 1127033, 841289, 617, 1408529, 1959457, 563777, 841297, 1959473, 577, 593, 563809, 601, ...

The answer looks correct, but let's take a look at the performance of this program. In the Spark UI we can see that Spark used three stages for the whole computation, and the DAG (directed acyclic graph) visualization of the job in the UI shows the RDD computations that make up each stage.
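If the UI is not at hand, the same structure can be read from the RDD lineage; this is a minimal sketch, and the exact formatting of the output depends on the Spark version:

scala> prime.toDebugString

Each shuffle boundary in the printed lineage (shown as a new indentation level) corresponds to a stage of the job.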

In Spark, a new stage is created whenever a job needs to move data between partitions (in Spark terminology, moving data between partitions is a shuffle). Each partition in a Spark stage is computed by one task, and these tasks are responsible for transforming the data in one RDD's partitions into the partitions of another RDD. Let's take a quick look at the tasks of stage 0:

We are most interested in two columns: Duration and Shuffle Write Size / Records. sc.parallelize(2 to n, 8) produced 1,999,999 records, the written records are evenly distributed across the 8 partitions, and each task takes about the same amount of time to compute, so there is nothing wrong with this stage.
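If you prefer to check the distribution from code rather than from the UI, a small sketch like the following counts the records held by each partition (partitionSizes is our own name, not part of the article's code):

val partitionSizes = sc.parallelize(2 to n, 8).mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size))).collect()
// expect 8 pairs, each holding roughly 250,000 records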

Stage 1 is the more important one because it runs the map and flatMap transformations, so let's look at how it performs:

As you can see, this stage is not working well because the workload is not balanced across the tasks: 93% of the data is concentrated in a single task, and that task takes 14s to compute, while the next slowest task takes 1s. We provided 8 cores for the computation, but 7 of them sit idle for about 13s waiting for the stage to complete. This is a very inefficient use of resources.


Why does this happen?

When we run sc.parallelize(2 to n, 8), Spark divides the data into 8 groups using a partitioner, most likely a range partitioner: 2-250000 goes into the first partition, 250001-500000 into the second, and so on. Our map function then turns each number into a (key, value) pair, and the size of the value varies enormously with the key (the smaller the key, the larger the value). Each value is the list of numbers we need to multiply by the key to get all of its multiples below 2000000, so more than half of the pairs (every key greater than 1000000) have an empty value, while key 2 has the largest value of all, containing every number from 2 to 1000000. That is why the first partition holds almost all of the data and takes the longest to compute, while the last four partitions hold hardly any data.
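To see how lopsided the value sizes are, a quick back-of-the-envelope check in plain Scala is enough (the helper valueSize is ours, just for illustration):

// size of the value list (2 to n/x) produced for a given key x, with n = 2000000
def valueSize(x: Int, n: Int = 2000000): Int = math.max(0, n / x - 1)
valueSize(2)        // 999999: key 2 carries nearly half of the whole range
valueSize(250000)   // 7: keys in the middle carry only a handful of multiples
valueSize(1000001)  // 0: every key above 1000000 has an empty value list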

How to Solve

We can repartition the data. Calling .repartition(numPartitions) on an RDD causes Spark to trigger a shuffle and redistribute the data across the number of partitions we specify, so let's try adding this to our code.

Other than adding .repartition(8) between the .map and .flatMap calls, the code does not change. Our RDD still has 8 partitions, but the data is now redistributed across those partitions. The modified code is as follows:

/**
 * User: 过往记忆 (iteblog)
 * Date: 2016-06-24, 21:16
 * Blog: http://www.iteblog.com
 * This post: http://www.iteblog.com/archives/1695
 * Public account: iteblog_hadoop
 */
val composite = sc.parallelize(2 to n, 8)
  .map(x => (x, (2 to (n / x))))
  .repartition(8)
  .flatMap(kv => kv._2.map(_ * kv._1))

The new DAG visualization looks more complex than before because the repartition operation introduces a shuffle, which adds an extra stage.

Stage 0 is the same as before. The new stage 1 looks much like stage 0: each task processes about 250,000 records and takes about 1s. Stage 2 is the more important one, and it looks like this:

As you can see, the new stage 2 performs much better than the old stage 1. It processes exactly the same data as the old stage 1 did, but this time each task takes about 5s and every core is used effectively.

The final stage of both versions of the code takes about 6s to run, so the first version runs in roughly 0.5 + 14 + 6 = ~21s, while after redistributing the data the run time is about 0.5 + 1 + 5 + 6 = ~13s. Although the modified code has to do some extra work (redistributing the data), the change reduces the overall run time because it lets us use our resources much more efficiently.

Of course, if your goal is simply to find prime numbers, there are far more efficient algorithms than the one shown here. The point of this article is how important it is to think about how your Spark data is distributed. Adding a .repartition call increases the total amount of work Spark has to do, but the benefit can significantly outweigh the cost.

This article was translated from: Improving Spark Performance With Partitioning

This blog post, unless otherwise stated, is original content.
Please respect the original and credit it when reproducing: reproduced from 过往记忆 (http://www.iteblog.com/).
Original post: "Improving the performance of Spark by partitioning" (http://www.iteblog.com/archives/1695)
