In-depth understanding of SparkSQL execution plans and optimization

1.1 Running environment description

1.1.1 Hardware and software environment

- Host operating system: Windows 64-bit, dual-core with 4 threads, 2.2 GHz clock speed, 10 GB of memory

- Virtualization software: VMware Workstation 9.0.0 build-812388

- Virtual machine operating system: CentOS 6.5 64-bit, single core

- Virtual machine runtime environment:

JDK: 1.7.0_55 (64-bit)

Hadoop: 2.2.0 (64-bit compilation required)

Scala: 2.10.4

Spark: 1.1.0 (compilation required)

Hive: 0.13.1 (for source code compilation, see 1.2)

1.1.2 cluster network environment

Only one machine, hadoop1, is required for this experiment. Its network configuration is as follows:

Serial number: 1

IP address: 192.168.0.61

Machine name: hadoop1

Type: NN/DN

User name: hadoop

Directories: /app (application path), /app/scala-..., /app/hadoop, /app/complied

1.2 Compile Hive

1.2.1 Download the Hive source code package

Here we use version hive-0.13.1, which needs to be downloaded from the Apache archive server:

1.2.2 upload the Hive source code package

Upload the downloaded apache-hive-0.13.1-src.tar.gz package to the /home/hadoop/upload directory using the SSH Secure File Transfer tool (see section 1.3.1 of Spark Compilation and Deployment (Part 1) -- Basic Environment Setup).

1.2.3 decompress and move to the compiling directory

Go to the upload directory and run the following commands to decompress the Hive source package:

$ cd /home/hadoop/upload

$ tar -zxf apache-hive-0.13.1-src.tar.gz

Rename it and move it to the /app/complied directory:

$ sudo mv apache-hive-0.13.1-src /app/complied/hive-0.13.1-src

$ ll /app/complied

1.2.4 compile Hive

Compiling the Hive source code requires downloading dependency packages from the Internet, so the machine must stay connected to the network throughout the compilation. Compile by running the following commands:

$ cd /app/complied/hive-0.13.1-src/

$ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

$ mvn -Phadoop-2,dist -Dmaven.test.skip=true clean package

The compilation may slow down or be interrupted; if that happens, simply restart it and the build resumes from where it stopped. The total time depends heavily on network speed; with a fast connection it takes about one hour (this is the time of the final successful run after several attempts). The final build artifact is $HIVE_HOME/packaging/target/apache-hive-0.13.1-bin.tar.gz.

Run the following command to check the size of the compiled directory; it is about 353.6 MB:

$ du -s /app/complied/hive-0.13.1-src

[Note] The compiled Hive package is available in this series' companion resources at /install/6.hive-0.13.1-src.tar.gz and can be used directly.

1.3 Run hive-console for the first time

1.3.1 Obtain the Spark source code

The first time you run hive-console, the Spark source code needs to be compiled. For how to obtain the Spark source code, see Spark Compilation and Deployment (Part 2) -- Spark Compilation and Installation.

1.3.2 Configure /etc/profile environment variables

Step 1: Open the /etc/profile file with the following command:

$ sudo vi /etc/profile

Step 2: Set the following parameters:

export HADOOP_HOME=/app/hadoop/hadoop-2.2.0

export HIVE_HOME=/app/complied/hive-0.13.1-src

export HIVE_DEV_HOME=/app/complied/hive-0.13.1-src

Step 3: Make the configuration take effect and verify it:

$ source /etc/profile

$ echo $HIVE_DEV_HOME

1.3.3 run sbt for compilation

Running hive/console does not require starting Spark, but you do need to go to the Spark root directory and use sbt/sbt hive/console to compile on the first run; afterwards hive-console can be started directly. Compiling the Spark source code requires downloading dependency packages from the Internet, so the machine must stay connected to the network throughout the compilation. The compilation commands are as follows:

$ cd /app/complied/spark-1.1.0-hive

$ sbt/sbt hive/console

The compilation takes a long time and may slow down or be interrupted; if that happens, simply restart it and the build resumes from where it stopped. The total duration depends heavily on network speed.

Run the following command to check the size of the compiled directory; it is about 267.9 MB:

$ du -s /app/complied/spark-1.1.0-hive

[Note] The compiled Spark for hive-console package is available in this series' companion resources at /install/6.spark-1.1.0-hive.tar.gz and can be used directly.

1.4 Use hive-console

1.4.1 Start hive-console

Go to the spark root directory and run the following command to start hive-console:

$ cd /app/complied/spark-1.1.0-hive

$ sbt/sbt hive/console

1.4.2 auxiliary commands Help and Tab

Use :help to view the help contents.

scala> :help

Use the Tab key to view all available commands and functions.

1.4.3 Common Operations

First, define a Person class with name, age, and state fields, register an RDD of Person objects as the people table (which loads the data), and then define a query over it.

scala> case class Person(name: String, age: Int, state: String)

scala> sparkContext.parallelize(Person("Michael", 29, "CA") :: Person("Andy", 30, "NY") :: Person("Justin", 19, "CA") :: Person("Justin", 25, "CA") :: Nil).registerTempTable("people")

scala> val query = sql("select * from people")
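
Because the result of sql(...) is a SchemaRDD, ordinary RDD operations can be mixed with the relational DSL. A minimal sketch, assuming the Spark 1.1 SchemaRDD DSL and the columns defined in the Person class above:

scala> // keep only the Californian rows and project the name column using the DSL
scala> query.where('state === "CA").select('name).collect()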

1.4.3.1 View the schema of the query

scala> query.printSchema

scala> query.collect()

1.4.3.2 View the overall execution plan of the query

scala> query.queryExecution

1.4.3.3 View the Unresolved LogicalPlan of the query

scala> query.queryExecution.logical

1.4.3.4 View the Analyzed LogicalPlan of the query

scala> query.queryExecution.analyzed

1.4.3.5 View the optimized LogicalPlan

scala> query.queryExecution.optimizedPlan

1.4.3.6 View the physical plan

scala> query.queryExecution.sparkPlan

1.4.3.7 View the RDD conversion process

scala> query.toDebugString

1.4.5 run plans for different data sources

The common operations above used data that came from an RDD. SparkSQL can also read from several other data sources: jsonFile, parquetFile, and Hive.

1.4.5.1 reading data in Json format

Step 1: Json Test Data

JSON files support nested structures, and SparkSQL can read such nested data with jsonFile, as in the JSON document below. The file can be found in the companion resources under /data/class6; for the following test, place it in the /home/hadoop/upload/class6 directory.

{
  "fullname": "Sean Kelly",
  "org": "SK Consulting",
  "emailaddrs": [
    {"type": "work", "value": "kelly@seankelly.biz"},
    {"type": "home", "pref": 1, "value": "kelly@seankelly.tv"}
  ],
  "telephones": [
    {"type": "work", "pref": 1, "value": "+1 214 555 1212"},
    {"type": "fax", "value": "+1 214 555 1213"},
    {"type": "mobile", "value": "+1 214 555 1214"}
  ],
  "addresses": [
    {"type": "work", "format": "us",
     "value": "1234 Main St\nSpringfield, TX 78080-1216"},
    {"type": "home", "format": "us",
     "value": "5678 Main St\nSpringfield, TX 78080-1316"}
  ],
  "urls": [
    {"type": "work", "value": "http://seankelly.biz/"},
    {"type": "home", "value": "http://seankelly.tv/"}
  ]
}

Step 2: Read Json data

Use jsonFile to read the data and register it as the table jsonPerson, then define a query jsonQuery:

scala> jsonFile("/home/hadoop/upload/class6/nestjson.json").registerTempTable("jsonPerson")

scala> val jsonQuery = sql("select * from jsonPerson")

Step 3: View the jsonQuery schema

scala> jsonQuery.printSchema

Step 4: View the entire jsonQuery execution plan

scala> jsonQuery.queryExecution
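
Because the schema is nested, individual sub-fields can also be queried. A sketch, with field names taken from the JSON above; the bracket/dot syntax for array elements may need adjusting to the SQL dialect in use:

scala> val nestedQuery = sql("select fullname, emailaddrs[0].value from jsonPerson")

scala> nestedQuery.collect()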

1.4.5.2 read data in Parquet format

The Parquet data is stored in the companion resources under /data/class6/wiki_parquet; for the following test, place it in the /home/hadoop/upload/class6 directory.

Step 1: Read the Parquet data

Read the Parquet file and register it as the table parquetWiki, then define a query parquetQuery:

scala> parquetFile("/home/hadoop/upload/class6/wiki_parquet").registerTempTable("parquetWiki")

scala> val parquetQuery = sql("select * from parquetWiki")

An error message is printed here, but it does not affect usage.

Step 2: View the parquetQuery schema

scala> parquetQuery.printSchema

Step 3: View the entire parquetQuery execution plan

scala> parquetQuery.queryExecution

Step 4: Sample the query results

scala> parquetQuery.takeSample(false, 10, 2)

1.4.5.3 read hive Built-in Test Data

The TestHive class predefines a large number of Hive 0.13 test tables, such as src and sales, which can be used directly in hive-console; they are loaded the first time hive-console references them. Next we use the sales table to look at its schema and overall execution plan.

Step 1: Read the test data and define hiveQuery:

scala> val hiveQuery = sql("select * from sales")

Step 2: View the hiveQuery schema:

scala> hiveQuery.printSchema

Step 3: View the entire hiveQuery execution plan:

scala> hiveQuery.queryExecution

Step 4: Execution plans of other SQL statements:

scala> val hiveQuery = sql("select * from (select * from src limit 5) a limit 3")

scala> val hiveQuery = sql("select * FROM (select * FROM src) a")

scala> hiveQuery.where('key === 100).queryExecution.toRdd.collect

1.4.6 Run plans for different queries

1.4.6.1 Aggregate queries

scala> sql("select name, age, state as location from people").queryExecution

scala> sql("select name from (select name, state as location from people) a where location = 'CA'").queryExecution

scala> sql("select sum(age) from people").queryExecution

scala> sql("select sum(age) from people").toDebugString

scala> sql("select state, avg(age) from people group by state").queryExecution

scala> sql("select state, avg(age) from people group by state").toDebugString

 

1.4.6.2 Join operation

scala> sql("select a.name, b.name from people a join people b where a.name = b.name").queryExecution

scala> sql("select a.name, b.name from people a join people b where a.name = b.name").toDebugString

1.4.6.3 Distinct operation

scala> sql("select distinct a.name, b.name from people a join people b where a.name = b.name").queryExecution

scala> sql("select distinct a.name, b.name from people a join people b where a.name = b.name").toDebugString

1.4.7 Optimization

1.4.7.1 CombineFilters

CombineFilters merges filters; it applies when a query contains multiple filters, as in the following query:

sql("select name from (select * from people where age >= 19) a where a.age < 30").queryExecution

In the query above, the Optimized phase merges the filters age >= 19 and age < 30 into ((age >= 19) && (age < 30)). Another optimization also takes place here: projection pruning. The subquery selects all columns of the table while the outer query only uses the name column, so the optimized subquery reads only the name column.
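
To see the merged filter directly, print only the optimized plan of the same query:

sql("select name from (select * from people where age >= 19) a where a.age < 30").queryExecution.optimizedPlan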

1.4.7.2 PushPredicateThroughProject

PushPredicateThroughProject pushes the filter predicate down below the projection, as in the following query:

sql("select name from (select name, state as location from people) a where location = 'CA'").queryExecution

1.4.7.3 ConstantFolding

ConstantFolding evaluates constant expressions at optimization time. For example:

sql("select name, 1 + 2 from people").queryExecution
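
Printing only the optimized plan of this query should show that 1 + 2 has been folded into the literal 3:

sql("select name, 1 + 2 from people").queryExecution.optimizedPlan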

2. SparkSQL Optimization

Spark is a fast in-memory computing framework and also a parallel computing framework. When optimizing its performance, besides the well-known barrel principle, we also need to consider Amdahl's law for parallel computation.

The barrel principle, also known as the short-board theory, says that the amount of water a bucket can hold is determined not by the tallest stave in its wall but by the shortest. Applied to performance optimization, it means that the overall performance of a system is bounded by its worst-performing component. For example, if a system has ample memory and CPU resources but slow disk I/O, its overall performance is governed by the slow disk rather than by the strong CPU or memory; in that case, further tuning of memory or CPU is pointless, and only improving the disk I/O will raise overall performance.

Amdahl's law, an empirical law of computer science named after Gene Amdahl, describes how much a workload can be sped up by executing part of it in parallel on multiple processors. For a fixed total workload, the speedup S is the ratio of the execution time before parallelization to the execution time after it. If F denotes the serial fraction of the problem (the part that cannot be parallelized), 1 - F the parallelizable fraction, and p the number of processors, then S = 1 / (F + (1 - F) / p). As p grows, S approaches the upper bound 1 / F; no matter how many processors we add, the speedup can never exceed this value.
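
For example, if the serial fraction of a job is F = 0.1 and p = 3 cores are available, the speedup is 1 / (0.1 + 0.9 / 3) = 2.5; even with an unlimited number of cores the speedup could never exceed 1 / 0.1 = 10.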

As a Spark component, SparkSQL optimization must take both principles into account: make full use of the hardware resources, and make good use of parallel computation in the distributed system. Because the test environment is limited, this article does not provide detailed experimental data and explains these points only in theory.

2.1 Parallelism

SparkSQL runs on a cluster by splitting a query into a large number of tasks and distributing them to the cluster nodes. Usually the number of tasks is much larger than the cluster's parallelism. For example, the queries in chapters 6 and 7 use the default spark.sql.shuffle.partitions during the shuffle, that is, 200 partitions and therefore 200 tasks:

However, the experiment's cluster can only run three tasks concurrently, that is, only three tasks run at the same time:

So running those 200 tasks takes 200 / 3 ≈ 67 waves. How can the number of waves be reduced? By increasing the parallelism of the query as much as possible. The parallelism of a query is determined by the cluster's processing capacity and by its currently available capacity.

- For a Spark Standalone cluster, the processing capacity is determined by the SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES parameters in conf/spark-env.sh; SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES must not exceed the node's actual number of physical CPU cores;

- The effective processing capacity refers to the currently idle resources in the cluster, normally the --total-executor-cores specified with spark-submit or spark-shell. Usually this parameter is not specified; in that case the Spark Standalone cluster allocates all idle cores to the query, and during task round-robin scheduling it also hands cores freed by other Spark applications to running queries.

In summary, the query parallelism of SparkSQL is mainly related to the number of cores in the cluster; configuring each node's cores sensibly improves the cluster's parallelism and query efficiency. A configuration sketch follows.
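
As an illustration only (the values are examples, not recommendations), a worker node with 4 physical cores could be configured in conf/spark-env.sh as shown below; an individual query can additionally be capped with --total-executor-cores at submission time:

export SPARK_WORKER_INSTANCES=2   # two worker instances on this node
export SPARK_WORKER_CORES=2       # 2 cores each; 2 * 2 = 4 does not exceed the physical core count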

2.2 Efficient Data Format

An efficient data format speeds up data reading and reduces memory consumption. Efficient data formats include the following:

2.2.1 Data locality

The essence of a distributed computing system is to move computation rather than data, but in practice data movement always occurs unless a copy of the data is stored on every node of the cluster. Moving data from one node to another for computation consumes both network I/O and disk I/O and lowers overall efficiency. Apart from optimizing the algorithm itself (that is, modifying Spark's internals, which is rather difficult), the way to improve data locality is to set a reasonable number of data replicas; finding that number requires tuning the configuration and observing the running state over a long period to arrive at an empirical value. A scheduler-level knob is also sketched after the list below.

The locality levels can be seen on the Spark web UI's Stage monitoring page:

- PROCESS_LOCAL means reading data cached on the local node;

- NODE_LOCAL means reading data from the local node's disk;

- ANY means reading data from a non-local node;

- In general, PROCESS_LOCAL > NODE_LOCAL > ANY, so it is best for data to be read at PROCESS_LOCAL or NODE_LOCAL level; PROCESS_LOCAL is also related to caching.
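
If many tasks end up at the ANY level, one scheduler-level knob worth trying is spark.locality.wait (a standard Spark property): giving the scheduler a little more time often lets it find a more local slot. A sketch for conf/spark-defaults.conf, with an illustrative value (milliseconds in Spark 1.x):

# wait up to 6 seconds for a more data-local slot before falling back to a less local one
spark.locality.wait    6000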

 

2.2.2 suitable data types

It is also worth defining appropriate data types for the data being queried. A column whose values fit in a tinyint does not need to be declared as int: a tinyint value occupies 1 byte while an int occupies 4 bytes, so once the data is cached the memory consumption grows several times over. In SparkSQL, choosing appropriate data types saves limited memory resources.
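
A minimal sketch in the console, assuming the mapping of Scala Byte to a 1-byte column and Int to a 4-byte column:

scala> // log levels 0-9 fit comfortably in a Byte; declaring the field as Int would cost 3 extra bytes per row in the cache
scala> case class LogRecord(level: Byte, message: String)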

2.2.3 suitable data Columns

When writing SQL, list only the columns that are actually needed, for example select a, b from tbl rather than select * from tbl; this reduces both disk I/O and memory consumption when the data is cached.

2.2.4 excellent data storage format

Queries still need to read files from the file system, and a better storage format speeds up data reading. Looking at SparkSQL Stages, data reading often accounts for a large share of the time. sqlContext supports textFile, SequenceFile, ParquetFile, and jsonFile; hiveContext additionally supports AvroFile, ORCFile, Parquet File, and various compression codecs. Testing and choosing a storage format suited to your workload helps improve SparkSQL query efficiency.
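
For example, a table that is queried repeatedly can be rewritten once as Parquet and then read back in columnar form; a sketch using the Spark 1.1 SchemaRDD API, with an illustrative output path:

scala> sql("select * from people").saveAsParquetFile("/home/hadoop/upload/class6/people_parquet")

scala> parquetFile("/home/hadoop/upload/class6/people_parquet").registerTempTable("peopleParquet")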

2.3 memory usage

The most delicate part of a Spark application is its memory usage; this is where "the devil is in the details" shows most clearly. Spark has many memory configuration items, of which the following are the most important:

- SPARK_WORKER_MEMORY: configured together with SPARK_WORKER_INSTANCES in conf/spark-env.sh to make full use of a node's memory; SPARK_WORKER_INSTANCES * SPARK_WORKER_MEMORY must not exceed the node's own memory capacity;

- executor-memory: the amount of memory requested when submitting a Spark application with spark-shell or spark-submit; it should not exceed the node's SPARK_WORKER_MEMORY;

- spark.storage.memoryFraction: the fraction of the application's requested memory that can be used for caching;

- spark.shuffle.memoryFraction: the fraction of the application's requested memory that can be used for shuffle.

In practice, adjust the last two parameters according to the memory consumption of typical queries. In addition, there are a few suggestions for SparkSQL:

- Cache tables or query results that are used frequently, and do not cache tables that are used only once (see the sketch after this list);

- For join operations, give priority to caching the smaller table;

- Watch the Stage monitoring page and think about how to get more tasks running at PROCESS_LOCAL locality;

- Watch the Storage monitoring page and think about how to raise the fraction of data that is cached.
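
A minimal sketch of the caching workflow in hive-console, using the people table from the earlier examples:

scala> cacheTable("people")                            // mark the table for in-memory columnar caching (materialized on first use)
scala> sql("select avg(age) from people").collect()    // this query is served from the cache once it is built
scala> uncacheTable("people")                          // release the memory when the table is no longer needed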

2.4 An appropriate number of tasks

For SparkSQL, another important parameter is the number of tasks during the shuffle, adjusted through spark.sql.shuffle.partitions; the basic guideline is the processing capacity of the cluster and the volume of data to be processed. Spark's default is 200. With too many tasks there is a lot of startup overhead; with too few, each task processes too much data and takes too long, which easily produces stragglers.
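
A minimal sketch of adjusting it from the console; the value is illustrative and should be matched to the cluster's capacity and the data volume:

scala> setConf("spark.sql.shuffle.partitions", "6")    // setConf comes from the SQLContext members imported in hive-console
scala> sql("select state, avg(age) from people group by state").collect()   // this shuffle now uses 6 partitions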

2.5 Other suggestions

There are many more optimization points, most of them matters of detail; here is a brief list:

- For faster expression evaluation, set spark.sql.codegen to true (see the sketch after this list);

- For results computed over large datasets, do not use collect(): collect() returns the results to the driver and can easily exhaust the driver's memory; generally write the output directly to a distributed file system;

- For worker skew, setting spark.speculation=true relaunches slow tasks elsewhere, effectively sidelining consistently underperforming nodes;

- For data skew, use intermediate steps such as aggregation caching, and analyse each case on its own;

- Use appropriate serialization and compression schemes;

- Make good use of the cluster monitoring system to keep the cluster running in a reasonable and stable state;

- Be good at resolving bottlenecks: observe the tasks in a Stage, find the most time-consuming ones, identify the cause, and improve them.
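
A sketch of the codegen switch from the console; spark.speculation is normally set in conf/spark-defaults.conf or passed with --conf at submission time rather than from the console:

scala> setConf("spark.sql.codegen", "true")   // compile expression evaluation into JVM bytecode (experimental in Spark 1.1)
scala> sql("select state, avg(age) from people group by state").collect()   // queries planned after this point can use generated code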
