Spark Source Analysis
Reposted from:
http://blog.sina.com.cn/s/articlelist_2628346427_2_1.html
http://blog.sina.com.cn/s/blog_9ca9623b0102webd.html
Spark Performance Optimization: Development Tuning (reposted 2016-05-15 12:58:17)
Development tuning means knowing Spark's basic development principles, including RDD lineage design, sensible use of operators, and optimization of special operations.
Principle 1: Do not create duplicate RDDs for the same data.
Principle 2: Reuse the same RDD wherever possible.
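A minimal sketch of principles 1 and 2, assuming a SparkContext sc as provided by spark-shell; the HDFS path is hypothetical:

// Anti-pattern: two textFile calls over the same file create two RDDs,
// so the data is read and parsed twice.
// val rdd1 = sc.textFile("hdfs:///data/access.log")
// val rdd2 = sc.textFile("hdfs:///data/access.log")

// Better: create the RDD once and reuse it for every downstream operation.
val logs = sc.textFile("hdfs:///data/access.log")
val totalLines = logs.count()
val errorLines = logs.filter(_.contains("ERROR")).count()

Note that without persistence each action still re-reads the file; principle 3 below addresses that.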
Principle 3: Persist RDDs that are used multiple times.
When an RDD is used multiple times, persist it; Spark then saves the RDD's data to memory or disk according to the persistence level you choose.
Spark's persistence levels include: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, ...
Levels with the _2 suffix replicate all of the data and send the copy to other nodes; this replication and network transfer carry a significant performance overhead, so do not use these levels unless the job requires high availability.
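A minimal sketch of principle 3, again assuming a SparkContext sc and a hypothetical HDFS path; MEMORY_AND_DISK_SER is shown as one common choice of level, not the only one:

import org.apache.spark.storage.StorageLevel

// Persist the RDD because it is used by more than one action below.
val logs = sc.textFile("hdfs:///data/access.log")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

logs.filter(_.contains("ERROR")).count()  // first action reads the file and caches the data
logs.filter(_.contains("WARN")).count()   // second action is served from the persisted copy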
Principle 4: Avoid shuffle operators wherever possible.
Principle 5: Use map-side pre-aggregation for shuffle operations.
Map-side pre-aggregation means aggregating values for the same key locally on each node, similar to the local combiner in MapReduce. Where possible, use the reduceByKey or aggregateByKey operator instead of groupByKey, because groupByKey performs no pre-aggregation.
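A minimal sketch contrasting the two operators, assuming a SparkContext sc; the tiny data set is made up for illustration:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

// groupByKey shuffles every (key, value) pair across the network before summing.
val withoutPreAgg = pairs.groupByKey().mapValues(_.sum)

// reduceByKey first combines values for the same key on each node (map-side
// pre-aggregation), so much less data is shuffled.
val withPreAgg = pairs.reduceByKey(_ + _)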
Principle 6: Use high-performance operators.
Use reduceByKey/aggregateByKey instead of groupByKey.
Use mapPartitions instead of plain map: when a function can process all the data of one partition at once without causing an OOM, performance is higher because per-element call overhead is avoided.
Use foreachPartition instead of foreach: when writing to MySQL, for example, foreachPartition lets each partition share one database connection instead of opening one per record (illustrated in the sketch after this list).
Use coalesce after filter: after filtering out a large fraction of the data, use the coalesce operator to manually reduce the RDD's partition count so the remaining data is not spread across many nearly empty partitions.
Use repartitionAndSortWithinPartitions instead of repartition plus a sort: if you need to sort after repartitioning, it is recommended to use the repartitionAndSortWithinPartitions operator directly, which sorts within each partition while the shuffle is being performed, doing the shuffle and the sort in a single pass.
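A minimal sketch combining several of these operators (filter followed by coalesce, then foreachPartition for per-partition connection reuse); the input data and the createConnection and insertRecord helpers are hypothetical stand-ins for real database code:

val records = sc.parallelize(Seq((1, "a"), (2, ""), (3, "c")))

records.filter(_._2.nonEmpty)       // drop the empty records first...
  .coalesce(2)                      // ...then shrink the partition count to match the smaller data
  .foreachPartition { iter =>
    val conn = createConnection()   // hypothetical: open one MySQL connection per partition
    iter.foreach { case (id, value) => insertRecord(conn, id, value) }  // hypothetical insert
    conn.close()
  }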
Principle 7: Broadcast large external variables.
When an operator function uses an external variable, Spark by default ships a copy of that variable over the network to every task, so each task holds its own copy. Transmitting many copies of a large variable over the network is expensive, and the extra memory those copies occupy in each node's executors causes frequent GC, both of which can greatly hurt performance.
Broadcasting the variable ensures that only one copy resides in the memory of each executor, and all tasks running in that executor share that single copy.
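A minimal sketch of principle 7, assuming a SparkContext sc; the lookup map is made up and would in practice be much larger:

// A read-only lookup table used inside an operator function.
val countryByCode = Map("CN" -> "China", "US" -> "United States")

// Broadcast it once: each executor then holds a single copy shared by all of its tasks.
val bc = sc.broadcast(countryByCode)

val codes = sc.parallelize(Seq("CN", "US", "CN"))
val names = codes.map(code => bc.value.getOrElse(code, "unknown"))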
Principle 8: Use Kryo to optimize serialization performance.
In Spark, serialization is mainly involved in three places:
1. When an operator function uses an external variable, the variable is serialized and then transmitted over the network.
2. When a custom type is used as the generic type of an RDD (such as JavaRDD), all objects of the custom type are serialized, so the custom class must implement the Serializable interface.
3. When a serialized persistence level is used (such as MEMORY_ONLY_SER), Spark serializes each partition of the RDD into one large byte array.
Set the serializer to KryoSerializer:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Register the custom types to be serialized:
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
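Putting the two settings together, a minimal sketch with hypothetical class names:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical custom types registered with Kryo.
case class MyClass1(id: Int, name: String)
case class MyClass2(values: Array[Double])

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

val sc = new SparkContext(conf)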
Principle 9: Optimize data structures.
In Java, three kinds of types are particularly memory-intensive:
1. Objects: every Java object carries extra information such as an object header and references, so it occupies additional memory.
2. Strings: every string contains a character array plus extra information such as its length.
3. Collection types, such as HashMap and LinkedList: collections typically use internal classes to wrap their elements, such as Map.Entry.
When writing Spark code, try to avoid these three kinds of data structures: use strings instead of objects, use primitive types (such as Int and Long) instead of strings, and use arrays instead of collection types.
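A minimal sketch contrasting the two styles; the record type and values are made up for illustration:

// Object-heavy style: one object per record plus boxed keys and values in a Map.
case class Point(id: Int, x: Double, y: Double)
val asObjects: Map[Int, Point] = Map(1 -> Point(1, 0.5, 1.5), 2 -> Point(2, 2.0, 3.0))

// Leaner style: parallel primitive arrays indexed by position, with no per-record
// object headers and no boxing.
val ids = Array(1, 2)
val xs  = Array(0.5, 2.0)
val ys  = Array(1.5, 3.0)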