Spark Performance Optimization: Development Tuning (spark-rdd)

Source: Internet
Author: User
Tags: class operator, connection pooling, serialization, shuffle

Spark Source Analysis

Reproduced from:

http://blog.sina.com.cn/s/articlelist_2628346427_2_1.html

http://blog.sina.com.cn/s/blog_9ca9623b0102webd.html


Spark Performance Optimization: Development Tuning (reprinted 2016-05-15 12:58:17)

Development tuning means knowing Spark's basic development principles, including RDD lineage design, rational use of operators, and optimization of special operations.

Principle 1: Do not create duplicate RDDs for the same data.

Principle 2: Reuse the same RDD as much as possible.
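A minimal sketch of principles 1 and 2 (the HDFS path is illustrative): build one RDD per distinct data source and reuse it for every downstream computation, instead of creating it twice:

    // Bad: the same file would be read and parsed twice.
    // val rdd1 = sc.textFile("hdfs:///data/input.txt")
    // val rdd2 = sc.textFile("hdfs:///data/input.txt")

    // Good: create the RDD once, then reuse it.
    val rdd = sc.textFile("hdfs:///data/input.txt")
    rdd.map(_.length).count()
    rdd.filter(_.nonEmpty).count()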

Principle 3: Persist RDDs that are used multiple times.

When an RDD will be used multiple times, persist it; Spark then saves the RDD's data to memory or disk according to the persistence level you choose, so later uses read the cached data instead of recomputing the lineage.

Spark's persistence levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, ...

A level with the _2 suffix replicates all the data and sends a copy to another node; this replication and network transfer adds significant performance overhead. Do not use these levels unless your job has high-availability requirements.
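A minimal sketch of principle 3 (the path and level are illustrative): persist before the second use, for example with MEMORY_AND_DISK_SER:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input.txt")
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

    rdd.map(_.length).count()      // first action computes and caches the partitions
    rdd.filter(_.nonEmpty).count() // second action reads from the cache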

Principle 4: Avoid shuffle operators where possible.
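One common way to follow this principle, sketched under the assumption that rdd2 is small enough to fit in driver and executor memory: replace a join (which shuffles both sides) with a broadcast of the small RDD plus a map over the large one:

    // rdd1 is large, rdd2 is small; both are (key, value) pair RDDs.
    val smallTable = rdd2.collectAsMap()    // bring the small side to the driver
    val smallBc = sc.broadcast(smallTable)  // one copy per executor, not per task

    // Same result shape as rdd1.join(rdd2) for matching keys, but no shuffle of rdd1.
    val joined = rdd1.flatMap { case (key, value) =>
      smallBc.value.get(key).map(other => (key, (value, other)))
    }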

Principle 5: Use map-side pre-aggregation for shuffle operations.

So-called map-side pre-aggregation aggregates values for the same key locally on each node, similar to the local combiner in MapReduce. Where possible, use the reduceByKey or aggregateByKey operator instead of groupByKey, because groupByKey performs no pre-aggregation.
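A minimal word-count sketch of the difference (the input path is illustrative): reduceByKey combines values for each key locally before the shuffle, while groupByKey ships every raw pair across the network and only then aggregates:

    val pairs = sc.textFile("hdfs:///data/words.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Pre-aggregates (word, count) locally on each node before shuffling.
    val counts = pairs.reduceByKey(_ + _)

    // No pre-aggregation: every single (word, 1) pair crosses the network.
    val countsSlow = pairs.groupByKey().mapValues(_.sum)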

Principle 6: Use high-performance operators.

Use reduceByKey/aggregateByKey instead of groupByKey.

Use mapPartitions instead of plain map: when one function call can process all the data of a partition without causing an OOM, performance is higher because any per-call setup cost is paid once per partition rather than once per element.
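A small sketch of the trade-off: mapPartitions hands the function an iterator over a whole partition; the risk is an OOM if the function materializes the entire partition at once:

    // map: the lambda runs once per element.
    val lengths = rdd.map(line => line.length)

    // mapPartitions: the lambda runs once per partition.
    // Keep the iterator lazy; calling iter.toList here could OOM on big partitions.
    val lengths2 = rdd.mapPartitions { iter =>
      iter.map(line => line.length)
    }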

Use foreachPartition instead of foreach: when writing to MySQL, for example, foreachPartition lets you create one connection (or draw one from a connection pool) per partition rather than per record, greatly reducing connection overhead.
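A hedged sketch of the MySQL case (the JDBC URL, credentials, and table are illustrative; real code would batch the inserts and handle errors): one connection per partition instead of one per record:

    import java.sql.DriverManager

    rdd.foreachPartition { iter =>
      // One connection for the whole partition, not one per element.
      val conn = DriverManager.getConnection(
        "jdbc:mysql://host:3306/db", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO t (value) VALUES (?)")
      iter.foreach { value =>
        stmt.setString(1, value)
        stmt.executeUpdate()
      }
      stmt.close()
      conn.close()
    }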

Use coalesce after filter: once a filter has dropped a large fraction of the data, use the coalesce operator to manually reduce the RDD's partition count and compact the remaining data into fewer partitions.
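A minimal sketch (the partition counts are illustrative): when a filter discards most of the data, the surviving rows are spread thinly over the original partitions, so compact them:

    // Suppose rdd has 100 partitions and the filter keeps roughly 10% of the rows.
    val filtered = rdd.filter(_.contains("ERROR"))

    // Shrink to 10 partitions; coalesce avoids a full shuffle when
    // only reducing the partition count.
    val compacted = filtered.coalesce(10)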

Use repartitionAndSortWithinPartitions instead of repartition plus a sort: if you need to sort after repartitioning, use the repartitionAndSortWithinPartitions operator directly; it performs the repartitioning (shuffle) and the sort in a single shuffle operation instead of two separate passes.
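A sketch with an illustrative pair RDD whose keys have an ordering: pass a Partitioner, and the sort happens inside the same shuffle that does the repartitioning:

    import org.apache.spark.HashPartitioner

    // One shuffle: repartition into 10 partitions and sort each partition by key.
    val repartitionedSorted =
      pairRdd.repartitionAndSortWithinPartitions(new HashPartitioner(10))

    // The naive alternative pays for the shuffle first and sorts afterwards:
    // pairRdd.repartition(10).mapPartitions(it => it.toArray.sortBy(_._1).iterator)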

Principle 7: Broadcast large variables.

When an operator function uses an external variable, Spark by default serializes a copy of the variable over the network for every task, so each task holds its own copy. Transmitting many copies of a large variable over the network is a significant performance cost, and the extra memory those copies consume in each node's executors triggers frequent GC, which can greatly hurt performance.

Broadcasting the variable guarantees that only one copy resides in each executor's memory, and all tasks running in that executor share that copy.
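A minimal sketch (the lookup table here is a tiny stand-in for a genuinely large variable): wrap the variable with sc.broadcast and read it through .value inside the operator:

    // Stand-in for a large table; without broadcast it would be
    // serialized into every single task.
    val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)

    val lookupBc = sc.broadcast(lookup)
    val resolved = rdd.map { key =>
      // All tasks in an executor share the executor's single copy.
      lookupBc.value.getOrElse(key, -1)
    }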

Principle 8: Use Kryo to optimize serialization performance.

In Spark, there are three main areas involved in serialization:

1. When an external variable is used in an operator function, the variable is serialized and then transmitted over the network.

2. When a custom type is used as an RDD's generic type (such as a JavaRDD of a custom class), all objects of that custom type are serialized, so the custom class must implement the Serializable interface.

3. When using a serializing persistence level (such as MEMORY_ONLY_SER), Spark serializes each partition of the RDD into one large byte array.

Set the serializer to KryoSerializer:

    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Register the custom types to be serialized:

    conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

Principle 9: Optimize data structures.

In Java, three kinds of types are especially memory-intensive:

1. Objects: every Java object carries extra information such as an object header and references, so it occupies additional memory.

2. Strings: each string holds a character array plus extra information such as its length.

3. Collection types, such as HashMap and LinkedList: collection types typically use inner classes (such as Map.Entry) to wrap their elements.

Therefore, in Spark code, try to avoid these three kinds of data structures: prefer strings over objects, prefer primitive types (such as int and long) over strings, and prefer arrays over collection types.
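A small sketch of the array-versus-collection point: on the JVM, a Scala List[Int] boxes every element into an object, while an Array[Int] stores raw 4-byte ints back to back:

    // Memory-heavy: each element is boxed, and each list cell is an object.
    val asList: List[Int] = List(1, 2, 3)

    // Leaner: a primitive int[] with no per-element object overhead.
    val asArray: Array[Int] = Array(1, 2, 3)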
