Spark Performance Optimization: Development Tuning (spark-rdd)

Source: Internet
Author: User
Tags: class operator, connection pooling, serialization, shuffle

Spark Source Analysis

Reproduced from:

http://blog.sina.com.cn/s/articlelist_2628346427_2_1.html

http://blog.sina.com.cn/s/blog_9ca9623b0102webd.html


Spark Performance Optimization: Development Tuning (reprinted 2016-05-15 12:58:17)

Development tuning means knowing Spark's basic development principles, including RDD lineage design, rational use of operators, and optimization of special operations.

Principle 1: Do not create duplicate RDDs for the same data.

Principle 2: Reuse the same RDD as much as possible.
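A minimal sketch of principles 1 and 2 (the HDFS path is illustrative): build one RDD per distinct data source and reuse it for every downstream computation, instead of creating it twice:

    // Bad: the same file would be read and parsed twice.
    // val rdd1 = sc.textFile("hdfs:///data/input.txt")
    // val rdd2 = sc.textFile("hdfs:///data/input.txt")

    // Good: create the RDD once, then reuse it.
    val rdd = sc.textFile("hdfs:///data/input.txt")
    rdd.map(_.length).count()
    rdd.filter(_.nonEmpty).count()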

Principle 3: Persist RDDs that are used multiple times.

When an RDD will be used multiple times, persist it; Spark then saves the RDD's data to memory or disk according to the persistence level you choose, so later uses read the cached data instead of recomputing the lineage.

Spark's persistence levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, ...

A level with the _2 suffix replicates all the data and sends a copy to another node; this replication and network transfer adds significant performance overhead. Do not use these levels unless your job has high-availability requirements.
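A minimal sketch of principle 3 (the path and level are illustrative): persist before the second use, for example with MEMORY_AND_DISK_SER:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input.txt")
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

    rdd.map(_.length).count()      // first action computes and caches the partitions
    rdd.filter(_.nonEmpty).count() // second action reads from the cache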

Principle 4: Avoid shuffle operators where possible.
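One common way to follow this principle, sketched under the assumption that rdd2 is small enough to fit in driver and executor memory: replace a join (which shuffles both sides) with a broadcast of the small RDD plus a map over the large one:

    // rdd1 is large, rdd2 is small; both are (key, value) pair RDDs.
    val smallTable = rdd2.collectAsMap()    // bring the small side to the driver
    val smallBc = sc.broadcast(smallTable)  // one copy per executor, not per task

    // Same result shape as rdd1.join(rdd2) for matching keys, but no shuffle of rdd1.
    val joined = rdd1.flatMap { case (key, value) =>
      smallBc.value.get(key).map(other => (key, (value, other)))
    }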

Principle 5: Use map-side pre-aggregation for shuffle operations.

So-called map-side pre-aggregation aggregates values for the same key locally on each node, similar to the local combiner in MapReduce. Where possible, use the reduceByKey or aggregateByKey operator instead of groupByKey, because groupByKey performs no pre-aggregation.
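A minimal word-count sketch of the difference (the input path is illustrative): reduceByKey combines values for each key locally before the shuffle, while groupByKey ships every raw pair across the network and only then aggregates:

    val pairs = sc.textFile("hdfs:///data/words.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Pre-aggregates (word, count) locally on each node before shuffling.
    val counts = pairs.reduceByKey(_ + _)

    // No pre-aggregation: every single (word, 1) pair crosses the network.
    val countsSlow = pairs.groupByKey().mapValues(_.sum)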

Principle 6: Use high-performance operators.

Use reduceByKey/aggregateByKey instead of groupByKey.

Use mapPartitions instead of plain map: when one function call can process all the data of a partition without causing an OOM, performance is higher because any per-call setup cost is paid once per partition rather than once per element.
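A small sketch of the trade-off: mapPartitions hands the function an iterator over a whole partition; the risk is an OOM if the function materializes the entire partition at once:

    // map: the lambda runs once per element.
    val lengths = rdd.map(line => line.length)

    // mapPartitions: the lambda runs once per partition.
    // Keep the iterator lazy; calling iter.toList here could OOM on big partitions.
    val lengths2 = rdd.mapPartitions { iter =>
      iter.map(line => line.length)
    }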

Use foreachPartition instead of foreach: when writing to MySQL, for example, foreachPartition lets you create one connection (or draw one from a connection pool) per partition rather than per record, greatly reducing connection overhead.
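A hedged sketch of the MySQL case (the JDBC URL, credentials, and table are illustrative; real code would batch the inserts and handle errors): one connection per partition instead of one per record:

    import java.sql.DriverManager

    rdd.foreachPartition { iter =>
      // One connection for the whole partition, not one per element.
      val conn = DriverManager.getConnection(
        "jdbc:mysql://host:3306/db", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO t (value) VALUES (?)")
      iter.foreach { value =>
        stmt.setString(1, value)
        stmt.executeUpdate()
      }
      stmt.close()
      conn.close()
    }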

Use coalesce after filter: once a filter has dropped a large fraction of the data, use the coalesce operator to manually reduce the RDD's partition count and compact the remaining data into fewer partitions.
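A minimal sketch (the partition counts are illustrative): when a filter discards most of the data, the surviving rows are spread thinly over the original partitions, so compact them:

    // Suppose rdd has 100 partitions and the filter keeps roughly 10% of the rows.
    val filtered = rdd.filter(_.contains("ERROR"))

    // Shrink to 10 partitions; coalesce avoids a full shuffle when
    // only reducing the partition count.
    val compacted = filtered.coalesce(10)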

Use repartitionAndSortWithinPartitions instead of repartition plus a sort: if you need to sort after repartitioning, use the repartitionAndSortWithinPartitions operator directly; it performs the repartitioning (shuffle) and the sort in a single shuffle operation instead of two separate passes.
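A sketch with an illustrative pair RDD whose keys have an ordering: pass a Partitioner, and the sort happens inside the same shuffle that does the repartitioning:

    import org.apache.spark.HashPartitioner

    // One shuffle: repartition into 10 partitions and sort each partition by key.
    val repartitionedSorted =
      pairRdd.repartitionAndSortWithinPartitions(new HashPartitioner(10))

    // The naive alternative pays for the shuffle first and sorts afterwards:
    // pairRdd.repartition(10).mapPartitions(it => it.toArray.sortBy(_._1).iterator)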

Principle 7: Broadcast large variables.

When an operator function uses an external variable, Spark by default serializes a copy of the variable over the network for every task, so each task holds its own copy. Transmitting many copies of a large variable over the network is a significant performance cost, and the extra memory those copies consume in each node's executors triggers frequent GC, which can greatly hurt performance.

Broadcasting the variable guarantees that only one copy resides in each executor's memory, and all tasks running in that executor share that copy.
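A minimal sketch (the lookup table here is a tiny stand-in for a genuinely large variable): wrap the variable with sc.broadcast and read it through .value inside the operator:

    // Stand-in for a large table; without broadcast it would be
    // serialized into every single task.
    val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)

    val lookupBc = sc.broadcast(lookup)
    val resolved = rdd.map { key =>
      // All tasks in an executor share the executor's single copy.
      lookupBc.value.getOrElse(key, -1)
    }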

Principle 8: Use Kryo to optimize serialization performance.

In Spark, there are three main areas involved in serialization:

1. When an external variable is used in an operator function, the variable is serialized and then transmitted over the network.

2. When a custom type is used as an RDD's generic type (such as a JavaRDD of a custom class), all objects of that custom type are serialized, so the custom class must implement the Serializable interface.

3. When using a serializing persistence level (such as MEMORY_ONLY_SER), Spark serializes each partition of the RDD into one large byte array.

Set the serializer to KryoSerializer:

    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Register the custom types to be serialized:

    conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

Principle 9: Optimize data structures.

In Java, three kinds of types are especially memory-intensive:

1. Objects: every Java object carries extra information such as an object header and references, so it occupies additional memory.

2. Strings: each string holds a character array plus extra information such as its length.

3. Collection types, such as HashMap and LinkedList: collection types typically use inner classes (such as Map.Entry) to wrap their elements.

Therefore, in Spark code, try to avoid these three kinds of data structures: prefer strings over objects, prefer primitive types (such as int and long) over strings, and prefer arrays over collection types.
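A small sketch of the array-versus-collection point: on the JVM, a Scala List[Int] boxes every element into an object, while an Array[Int] stores raw 4-byte ints back to back:

    // Memory-heavy: each element is boxed, and each list cell is an object.
    val asList: List[Int] = List(1, 2, 3)

    // Leaner: a primitive int[] with no per-element object overhead.
    val asArray: Array[Int] = Array(1, 2, 3)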
