Spark Performance Optimizations

1. Memory
spark.storage.memoryFraction: as the name suggests, the fraction of executor memory used for Spark's cache (cached RDDs); the default is 0.6.
spark.shuffle.memoryFraction: the fraction of executor memory available for objects created while tasks process RDDs during shuffles; the default is 0.2.
A common scenario where these two parameters matter is working with a relational database.
Spark can read a relational database through JDBC, but if the read is not partitioned, all of the data is pulled onto a single node. It is strongly recommended to check the size of the table against the memory settings of the Spark cluster before reading it.
Suppose executor memory is set to 2 GB. The memory governed by spark.shuffle.memoryFraction is then roughly 2 GB * 0.2 = 400 MB. If 50,000 rows take about 1 MB, that memory can hold roughly 400 * 50,000 = 20 million rows.
Reading more than about 20 million rows onto a single node without processing them promptly makes an OOM very likely.
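One way to avoid that single-node bottleneck is to partition the JDBC read. The sketch below uses the Spark 1.x Scala API; the JDBC URL, table name and partition column are made-up examples, not values from this article.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("jdbc-read"))
val sqlContext = new SQLContext(sc)

// Hypothetical connection details; replace with your own.
val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:postgresql://dbhost:5432/mydb",
  "dbtable"         -> "big_table",
  "partitionColumn" -> "id",        // numeric column used to split the read
  "lowerBound"      -> "1",
  "upperBound"      -> "20000000",
  "numPartitions"   -> "50"         // 50 concurrent partitioned reads instead of one
)).load()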
Memory settings
spark.shuffle.memoryFraction 0.4 # raise this moderately
spark.storage.memoryFraction 0.4 # lower this moderately
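The same two properties can also be set when building the SparkConf; a minimal sketch (the application name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("jdbc-import")
  .set("spark.shuffle.memoryFraction", "0.4") // raised from the 0.2 default
  .set("spark.storage.memoryFraction", "0.4") // lowered from the 0.6 default
val sc = new SparkContext(conf)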
2. Enable external sort
spark.sql.planner.externalSort true
3. Change the serializer
spark.serializer org.apache.spark.serializer.KryoSerializer
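Kryo is usually faster and more compact when the classes that get shuffled or cached are registered with it. A sketch, where MyRecord stands in for your own data class:

import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)  // placeholder for your own class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))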
4. Limit the number of cores per application
spark.cores.max 15
5. Default parallelism
spark.default.parallelism 90
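This value controls how many partitions shuffle operations such as reduceByKey use when no explicit count is passed; a common rule of thumb is two to three tasks per CPU core in the cluster. A small sketch (numbers illustrative, sc is an existing SparkContext):

val pairs = sc.parallelize(1 to 1000000).map(x => (x % 100, 1))

// With spark.default.parallelism = 90, this shuffle produces 90 partitions.
val counts = pairs.reduceByKey(_ + _)

// The parallelism can also be overridden per operation.
val counts2 = pairs.reduceByKey(_ + _, 180)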
6. Add third-party libraries to the classpath
spark.executor.extraClassPath /opt/spark/current/lib/sqljdbc41.jar:/opt/spark/current/lib/postgresql-9.4-1202-jdbc41.jar:spark-cassandra-connector-full.jar
spark.driver.extraClassPath /opt/spark/current/lib/sqljdbc41.jar:/opt/spark/current/lib/postgresql-9.4-1202-jdbc41.jar:spark-cassandra-connector-full.jar
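The executor classpath can also be set from the application itself, as sketched below with the same jar paths; the driver classpath, however, is best left in spark-defaults.conf or passed with --driver-class-path, since the driver JVM is already running by the time the application's SparkConf is read.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraClassPath",
    "/opt/spark/current/lib/sqljdbc41.jar:" +
    "/opt/spark/current/lib/postgresql-9.4-1202-jdbc41.jar:" +
    "spark-cassandra-connector-full.jar")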