Query optimization is a core component of traditional database systems, where the technology is already mature. In addition to query optimization, Spark SQL also optimizes storage. Some of Spark SQL's optimization strategies are examined below.
(1) In-memory columnar storage and in-memory cached tables
Spark SQL can convert data to a columnar format via cacheTable while loading it into memory for caching. cacheTable is effectively an in-memory materialized view for a distributed cluster: because the data is cached, iterative or interactive queries no longer have to read from HDFS, and reading directly from memory greatly reduces I/O overhead. The advantage of columnar storage is that Spark SQL reads only the columns the user needs, rather than reading every column on each access as row storage does. This significantly shrinks the in-memory cached data, makes more efficient use of the in-memory cache, and reduces network transfer and I/O overhead. Because values of the same data type are stored contiguously, columnar data also lends itself well to serialization and compression, further reducing the memory footprint.
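The difference between the two layouts can be sketched as follows. This is an illustrative simplification, not Spark's actual in-memory format; the table and column names are invented for the example.

```python
# Illustrative sketch (not Spark's implementation): contrast a
# row-oriented and a column-oriented layout for a cached table.
rows = [
    (1, "Alice", "math"),
    (2, "Bob", "physics"),
    (3, "Carol", "math"),
]

# Row store: each record is stored together, so reading one column
# still touches (and deserializes) every full record.
def row_scan_class(rows):
    return [r[2] for r in rows]

# Column store: each column is stored contiguously; a query that
# needs only "class" reads just that one array.
columns = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "class": [r[2] for r in rows],
}

def column_scan_class(columns):
    return columns["class"]  # other columns are never touched

print(row_scan_class(rows) == column_scan_class(columns))  # True
```

Both scans return the same answer, but the columnar scan reads only the one array it needs, which is why a columnar cache holds more useful data per byte of memory.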
(2) Columnar storage compression
To reduce memory and disk consumption, Spark SQL applies a number of compression strategies to its in-memory columnar data. Its compression support is much richer than Shark's: for example, it supports PassThrough, RunLengthEncoding, DictionaryEncoding, BooleanBitSet, IntDelta, LongDelta, and several other compression schemes. This significantly reduces memory footprint as well as network transfer and I/O overhead.
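Two of the schemes named above can be sketched in a few lines. This is a simplified illustration of the general techniques, not Spark's internal encoder; the sample column is invented.

```python
# Illustrative sketch: run-length encoding collapses consecutive
# repeated values, and dictionary encoding replaces repeated strings
# with small integer codes. (Simplified, not Spark's code.)
from itertools import groupby

def run_length_encode(col):
    # groupby with no key groups equal consecutive values together.
    return [(value, len(list(run))) for value, run in groupby(col)]

def dictionary_encode(col):
    dictionary = {}
    codes = []
    for value in col:
        # Assign each distinct value the next integer code on first sight.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

col = ["math", "math", "math", "physics", "physics", "math"]
print(run_length_encode(col))  # [('math', 3), ('physics', 2), ('math', 1)]
dictionary, codes = dictionary_encode(col)
print(codes)                   # [0, 0, 0, 1, 1, 0]
```

Both schemes work precisely because columnar storage puts values of one type next to each other: runs and repeated values are common within a column, while a row store interleaves unrelated types and destroys those patterns.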
(3) Logical query optimization
Spark SQL supports logical query optimization methods such as column pruning, predicate pushdown, and attribute merging (Figure 1). Column pruning avoids reading unnecessary attribute columns, reducing data transfer and computation overhead; it is applied while the query optimizer transforms the logical plan.
Figure 1 Logical Query optimization
A logical optimization example is described below:

SELECT class FROM (SELECT id, name, class FROM STUDENT) s WHERE s.id = 1

Catalyst pushes the predicate id = 1 down so that the selection runs first, filtering out most of the data, and uses attribute merging so that the final projection is performed only once, keeping only the class column.
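The effect of the optimized plan can be sketched with plain lists. This is an illustration of what predicate pushdown and projection merging achieve, not Catalyst itself; the rows are invented sample data mirroring the example query.

```python
# Illustrative sketch of the plans for the example query above.
student = [
    {"id": 1, "name": "Alice", "class": "math"},
    {"id": 2, "name": "Bob",   "class": "physics"},
    {"id": 3, "name": "Carol", "class": "math"},
]

# Naive plan, following the query text literally: first project
# id, name, class for every row, then filter, then project class.
intermediate = [{"id": r["id"], "name": r["name"], "class": r["class"]}
                for r in student]
naive = [row["class"] for row in intermediate if row["id"] == 1]

# Optimized plan: filter first (predicate pushdown), then a single
# projection keeping only "class" (column pruning + merged projections).
optimized = [r["class"] for r in student if r["id"] == 1]

print(naive == optimized)  # True: same answer, far less data carried
```

The results are identical, but the optimized plan never materializes the unused id and name columns and discards non-matching rows before any projection work is done.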
(4) Join optimization
Spark SQL draws deeply on the query optimization techniques of traditional databases, while also adapting and extending them for the distributed environment. Its join optimization supports a variety of join algorithms, already richer than Shark's, and many of Shark's original elements are gradually being migrated over, for example BroadcastHashJoin, BroadcastNestedLoopJoin, HashJoin, LeftSemiJoin, and so on.
Take BroadcastHashJoin as an example of the underlying idea. BroadcastHashJoin turns the small table into a broadcast variable and sends it to every node, avoiding shuffle overhead, and then performs a hash join within each partition. This combines the idea of Hive's map-side join with the hash join algorithm used in traditional DBMSs.
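The idea can be sketched as follows. Partitioning and broadcasting are simulated with plain lists here; this is an illustration of the algorithm, not Spark's implementation, and the tables are invented sample data.

```python
# Illustrative sketch of the BroadcastHashJoin idea: build a hash
# table from the small (broadcast) table once, then probe it inside
# each partition of the large table, so no shuffle is needed.
small_table = [(1, "math"), (2, "physics")]   # (id, class) - small side
large_partitions = [                          # (id, name) - large side
    [(1, "Alice"), (3, "Carol")],             # partition 0
    [(2, "Bob"), (1, "Dave")],                # partition 1
]

# "Broadcast" step: every partition receives the same hash map.
broadcast = {sid: cls for sid, cls in small_table}

def hash_join_partition(partition, broadcast):
    # Probe the broadcast hash table locally within the partition;
    # rows without a match are dropped (inner-join semantics).
    return [(pid, name, broadcast[pid])
            for pid, name in partition if pid in broadcast]

joined = [row for part in large_partitions
          for row in hash_join_partition(part, broadcast)]
print(joined)
# [(1, 'Alice', 'math'), (2, 'Bob', 'physics'), (1, 'Dave', 'math')]
```

Because every partition already holds the full small table, no rows of the large table ever have to move between partitions, which is exactly the shuffle cost the broadcast avoids.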
As Spark SQL develops, more query optimization strategies will be added. Spark SQL will also support a Shark-like server with JDBC interfaces, along with more persistence layers such as NoSQL stores and traditional DBMSs. A powerful structured big data query engine is on the rise.