Spark SQL Performance Optimization
==> cache data in memory
---> Performance tuning is primarily about putting data into memory and operating on it there
---> Usage examples:
// read data from the Oracle database and generate a DataFrame
val oracleDF = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@192.168.10.100:1521/orcl.example.com")
  .option("dbtable", "scott.emp")
  .option("user", "scott")
  .option("password", "tiger")
  .load()

// register the DataFrame as a table
oracleDF.registerTempTable("emp")

// execute the query and monitor the execution time through the Web Console
spark.sql("select * from emp").show

// cache the table, query it twice, and compare execution times through the Web Console
spark.sqlContext.cacheTable("emp")
spark.sql("select * from emp").show
spark.sql("select * from emp").show

// clear the cache
spark.sqlContext.clearCache()
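---> For reference, a minimal sketch of the same cache-then-clear flow without the Oracle dependency, assuming a local Spark 2.x session (the test rows, column names, and app name below are made up for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()
import spark.implicits._

// build a small test DataFrame in place of the Oracle table and register it as a view
val empDF = Seq((7369, "SMITH", 800), (7839, "KING", 5000)).toDF("empno", "ename", "sal")
empDF.createOrReplaceTempView("emp")

// cache the table, run the same query twice, then clear the cache;
// the second run should be served from the in-memory columnar cache
spark.catalog.cacheTable("emp")
spark.sql("select * from emp").show()
spark.sql("select * from emp").show()
spark.catalog.clearCache()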
==> Optimization-related parameters (a configuration sketch follows this list)
---> spark.sql.inMemoryColumnarStorage.compressed
---- Default value: true
---- When set to true, Spark SQL automatically selects a compression encoding for each column based on data statistics
---> spark.sql.inMemoryColumnarStorage.batchSize
---- Default value: 10000
---- Controls the batch size for columnar caching; larger batches improve memory utilization and compression, but also bring a risk of OOM (out of memory)
---> spark.sql.files.maxPartitionBytes
---- Default value: 128M
---- The maximum number of bytes to pack into a single partition when reading files
---> spark.sql.files.openCostInBytes
---- Default value: 4M
---- The estimated cost of opening a file, measured as the number of bytes that could be scanned in the same time. Used when packing multiple files into a partition; it is better to over-estimate this value, so that partitions with small files will be faster than partitions with bigger files (which are scheduled first)
---> spark.sql.autoBroadcastJoinThreshold
---- Default value: 10M
---- Configures the maximum size in bytes of a table that will be broadcast to all worker nodes when performing a join; setting this value to -1 disables broadcasting
---- NOTE: Currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run
---> spark.sql.shuffle.partitions
---- Default value: 200
---- Configures the number of partitions to use when shuffling data for joins or aggregations
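---> As a consolidated illustration of the parameters above, the sketch below shows how they might be set on a running Spark 2.x session named spark; the specific values are illustrative assumptions, not recommendations:
// adjust SQL tuning parameters at runtime on an existing SparkSession (example values only)
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")   // 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", "4194304")       // 4 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")       // -1 disables broadcast joins
spark.conf.set("spark.sql.shuffle.partitions", "200")

// the same keys can also be supplied at submission time, e.g.
//   spark-submit --conf spark.sql.shuffle.partitions=200 ...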