Liaoliang on Spark Performance Optimization (Part 4)


Content:

1. Serialization

2. JVM performance tuning

========== Spark Performance Tuning: Serialization ==========

1. Why serialize? The most important reasons are: memory space is limited (serialization reduces GC pressure and helps avoid full GC as much as possible; once a full GC occurs, the entire task is stopped), and serialization reduces the pressure of disk IO and network IO;

2. When do serialization and deserialization have to happen? They occur whenever there is disk IO or network traffic, and there are two other scenarios where serialization and deserialization are especially important to consider:

1) Persisting (and checkpointing) must take serialization and deserialization into account; for example, when caching to memory, only about 60% of the JVM's allocated memory is used for the cache, so a good serialization mechanism is critical;

2) When programming, a function passed to an operator must be serialized and deserialized if it references an object defined outside the function (i.e. in the driver);

For example:

val person = new Person()

rdd.map(item => person.add(item))

Is it OK to write it like this? (This makes a good interview question!)

Written this way, the code will not work: person is created in the driver, and the closure that references it must be shipped to the executors, so it must be serialized and deserialized.

Using Kryo, the right way to write it is this:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

conf.registerKryoClasses(Array(classOf[Person]))

val person = new Person()

rdd.map(item => person.add(item))

In fact, in terms of efficiency, using a broadcast variable would be even more efficient; here we are only talking about serialization and deserialization!
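As a rough sketch of the broadcast version (reusing the Person example and assuming an existing SparkContext sc):

val personBroadcast = sc.broadcast(new Person())    // ship the object to the executors once
rdd.map(item => personBroadcast.value.add(item))    // reference the broadcast copy inside the closure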

3. It is strongly recommended to use the Kryo serializer for serialization and deserialization. Spark does not use Kryo by default; instead it uses Java's built-in ObjectInputStream and ObjectOutputStream (mostly for convenience and generality). By default, if the data elements in an RDD are of a custom type, that type must implement Serializable; you can also implement the Externalizable interface to hand-roll a more efficient Java serialization format. But with the default ObjectInputStream and ObjectOutputStream, the serialized data can consume a large amount of memory, disk, or network bandwidth, and serialization and deserialization consume comparatively more CPU;
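For example, a minimal sketch of a custom RDD element type under the default Java serializer (the fields are only illustrative):

class Person(var name: String, var age: Int) extends Serializable  // implementing java.io.Externalizable instead allows a hand-tuned, more efficient format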

4. Strongly prefer the Kryo serialization mechanism. Under Spark, Kryo is far more space-efficient than the default Java serialization (close to 10x space savings) and consumes less CPU, so I personally strongly recommend using Kryo in all circumstances whenever possible;

5. Two ways to use Kryo:

1) Configure it in spark-defaults.conf (a sample entry is shown after the SparkConf snippet below);

2) Configure it in the program's SparkConf:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
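For the first way, the corresponding entry in spark-defaults.conf would look roughly like this:

spark.serializer        org.apache.spark.serializer.KryoSerializer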

Using Kryo makes serialization faster, lowers the storage footprint, and improves overall performance;

6. The Scala types commonly used in Spark are automatically registered with Kryo by AllScalaRegistrar;

7. Custom types must be registered with the serializer explicitly, as in the example in point 2) above;

8. When Kryo serializes, the default buffer size is 2MB; you can adjust it according to your workload, for example by setting spark.kryoserializer.buffer to 10MB, because the objects being serialized can sometimes be larger than the default buffer;
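A rough sketch of raising the buffer on the SparkConf (the property name changed across Spark versions, so check your release):

conf.set("spark.kryoserializer.buffer.mb", "10")   // older 1.x releases take the size in MB
conf.set("spark.kryoserializer.buffer", "10m")     // later releases take a size string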

9. When using Kryo, it is strongly recommended to register your classes (with their full package and class names); otherwise every serialized record carries a copy of the complete package and class name, which unnecessarily consumes memory;

========== Spark JVM Performance Tuning ==========

1. The good news is that Spark's Tungsten project is designed to address JVM performance issues. The bad news is that, at least before Spark 2.0 (that is, in Spark 1.6.0 and earlier versions), Tungsten is not yet stable or complete and can only be used in certain circumstances; in most cases we do not use Tungsten, so we must still focus on JVM performance tuning;

2. The key to JVM performance tuning is tuning GC! Why is GC so important? Mainly because Spark relies heavily on persisting RDDs, and the performance cost of GC is proportional to the amount of data;

3. As a first step, prefer arrays and strings where possible, and use Kryo for serialization so that each partition becomes a single byte array;
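For instance, a rough illustration of that preference (names are only illustrative):

val ids: Array[Int] = Array(1, 2, 3)  // a primitive array: compact, no per-element object headers
// rather than, say, a List[java.lang.Integer], which boxes every element and adds object overhead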

4. There are two ways to monitor GC:

1) Configuration: set

spark.executor.extraJavaOptions

to -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps (a sketch of setting this follows the list below);


2) The Spark UI.
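A rough sketch of the first option, setting the flags on the SparkConf (they can equally be passed via --conf to spark-submit); the GC output then appears in each executor's stdout log in its work directory:

conf.set("spark.executor.extraJavaOptions",
  "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps")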

5. By default, Spark uses 60% of the executor heap to cache RDD contents, which means tasks only have the remaining 40% of the space while executing; if that space is not enough, (frequent) GC will be triggered.

You can set the spark.memory.fraction parameter to adjust how the space is used, for example reducing the cache space so that tasks have more room to create objects and complete their computations;

Again, it is strongly recommended to serialize the RDD cache with Kryo, so that tasks can be given more space to finish their computations successfully (and frequent GC is avoided);
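A rough sketch of both adjustments, reusing the conf and rdd from the earlier snippets (0.4 is only an illustrative value):

conf.set("spark.memory.fraction", "0.4")    // shrink the cache/execution pool, leaving more heap for task objects

import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // cache partitions as serialized byte arrays (using Kryo if configured)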

6. Full GC occurs when the old generation fills up, and the old generation mostly holds relatively long-lived objects (objects that have already survived GC). A full GC stops all application threads while the objects in the old generation are collected and compacted, which seriously hurts performance. At this point you can consider:

1) Setting the spark.memory.fraction parameter to adjust the use of space, giving the young generation more room for short-lived objects;

2) Using -Xmn to adjust the Eden area: estimate the size of the objects and data the RDDs manipulate; data on HDFS generally grows to about 3 times its original size after decompression, so size Eden from the data. If there are 10 tasks and each task processes 128MB of HDFS data, set -Xmn to roughly 10*128*3*4/3 MB (a worked example follows this list);

3) -XX:SurvivorRatio;

4) -XX:NewRatio;

SurvivorRatio and NewRatio normally do not need to be adjusted; only touch them if you understand the JVM very well.
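A worked example of the -Xmn estimate from item 2): with 10 tasks per executor, 128MB of HDFS data per task, and roughly 3x expansion after decompression, the live young-generation data is about 10 * 128 * 3 = 3840MB; multiplying by 4/3 (the extra third roughly allowing for the survivor spaces) gives about 5120MB, so -Xmn would be set to around 5120m, for example via spark.executor.extraJavaOptions=-Xmn5120m.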

But once the data volume reaches the PB level, this is no longer something you can ignore, and you must study the JVM in depth!

Teacher Liaoliang's contact card:

The first person of Spark in China

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: Dt_spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

qq:1740415547

Email: [Email protected]


This article is from the "A Flower Proud of the Cold" blog; reprinting is declined.

