Spark Performance Optimizations

Source: Internet
Author: User
Tags: cassandra

1. Memory

spark.storage.memoryFraction: the fraction of executor memory reserved for Spark's cache. The default is 0.6.

spark.shuffle.memoryFraction: the fraction of executor memory available to running tasks for the objects they create during shuffles, such as aggregation buffers. The default is 0.2.

A common case where these two parameters need tuning is working with a relational database.

Spark can read from a relational database through JDBC, but if the read is not partitioned, the entire table is loaded onto a single node. Before running such a job, check how much data the table holds and how Spark's memory parameters are set on the cluster.
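One way to spread a JDBC read across executors is to give Spark a numeric column and bounds to partition on. The sketch below only builds the option map. The option names (partitionColumn, lowerBound, upperBound, numPartitions) are those of Spark's JDBC data source, while the URL, table, column, and bounds are placeholders assumed purely for illustration.

```python
def jdbc_read_options(url, table, column, lower, upper, num_partitions):
    """Build the option map for a partitioned Spark JDBC read."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,   # numeric column used to split the read
        "lowerBound": str(lower),    # bounds only shape the split ranges;
        "upperBound": str(upper),    # they do not filter rows
        "numPartitions": str(num_partitions),
    }


# Hypothetical connection details, for illustration only.
options = jdbc_read_options(
    "jdbc:postgresql://db.example.com:5432/sales", "orders", "id", 1, 20_000_000, 10
)

# With a running session object named `spark`, the map could be passed as:
#   df = spark.read.format("jdbc").options(**options).load()
```

With ten partitions, each task reads roughly one tenth of the key range, so no single node has to hold the whole table.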

Suppose executor memory is set to 2 GB. The memory available under spark.shuffle.memoryFraction is then about 2 GB × 0.2 = 400 MB. Assuming 50,000 rows occupy 1 MB, roughly 400 × 50,000 = 20 million rows can be read.

If a single node reads more than 20 million rows and cannot process them quickly enough, an out-of-memory (OOM) error is very likely.
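The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch that follows the article's own simplifications: 2 GB is rounded to 2000 MB, and the figure of 50,000 rows per MB is the article's assumption, not a measured value. The function name is hypothetical.

```python
def rows_that_fit(executor_memory_mb, memory_fraction, rows_per_mb):
    """Estimate how many rows fit in the given fraction of executor memory."""
    budget_mb = executor_memory_mb * memory_fraction
    return int(round(budget_mb * rows_per_mb))


# Default fraction of 0.2 on a 2000 MB executor.
print(rows_that_fit(2000, 0.2, 50_000))  # 20000000

# Raising the fraction to 0.4 doubles the budget.
print(rows_that_fit(2000, 0.4, 50_000))  # 40000000
```

Real row sizes vary widely by schema, so it is worth measuring rows per MB on a sample of the actual table before relying on the result.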

Memory settings

# Raise moderately (default 0.2)
spark.shuffle.memoryFraction 0.4

# Lower moderately (default 0.6)
spark.storage.memoryFraction 0.4

2. Enable external sort

spark.sql.planner.externalSort true

3. Change the serializer

spark.serializer org.apache.spark.serializer.KryoSerializer

4. Limit the number of cores per application

spark.cores.max 15

5. Default parallelism

spark.default.parallelism 90
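The value 90 pairs with the core limit of 15 from the previous section, which works out to six tasks per core. The article does not say how 90 was chosen, so the helper below is a hypothetical sketch that assumes parallelism is sized as a multiple of the available cores. Spark's tuning guide suggests 2 to 3 tasks per CPU core as a general starting point.

```python
def default_parallelism(total_cores, tasks_per_core):
    """Size spark.default.parallelism as a multiple of the core count."""
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("both arguments must be positive")
    return total_cores * tasks_per_core


print(default_parallelism(15, 6))  # 90, the value used in this article
print(default_parallelism(15, 3))  # 45, upper end of the 2-3 tasks-per-core guideline
```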

6. Add common third-party libraries to the classpath

spark.executor.extraClassPath /opt/spark/current/lib/sqljdbc41.jar:/opt/spark/current/lib/postgresql-9.4-1202-jdbc41.jar:spark-cassandra-connector-full.jar

spark.driver.extraClassPath /opt/spark/current/lib/sqljdbc41.jar:/opt/spark/current/lib/postgresql-9.4-1202-jdbc41.jar:spark-cassandra-connector-full.jar
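The settings from all six sections can be collected in one place and written out as "key value" lines, which is the layout spark-defaults.conf uses. The sketch below is one possible way to do that. The helper names are hypothetical, and the values and jar paths are simply the ones listed in this article.

```python
def classpath(jars):
    """Join jar paths with ':' (the classpath separator on Linux and macOS)."""
    return ":".join(jars)


def render_conf(settings):
    """Render settings as 'key value' lines, one per property."""
    return "\n".join(f"{key} {value}" for key, value in settings.items())


jars = [
    "/opt/spark/current/lib/sqljdbc41.jar",
    "/opt/spark/current/lib/postgresql-9.4-1202-jdbc41.jar",
    "spark-cassandra-connector-full.jar",
]

settings = {
    "spark.shuffle.memoryFraction": "0.4",
    "spark.storage.memoryFraction": "0.4",
    "spark.sql.planner.externalSort": "true",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.cores.max": "15",
    "spark.default.parallelism": "90",
    "spark.executor.extraClassPath": classpath(jars),
    "spark.driver.extraClassPath": classpath(jars),
}

conf_text = render_conf(settings)
print(conf_text)
```

Keeping the jar list in one variable means the executor and driver classpaths cannot drift apart.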
