Maxcomputesql Performance Tuning

Source: Internet
Author: User
Tags joins

Reprinted from Xiaorui

Some users (especially the external output) when using Maxcompute (original ODPs), due to the use of the product level and level of understanding of the level of implementation, resulting in the implementation of the task is too long, the occupation of more cluster resources, serious will lead to failure, not only need to invest support students to help solve, Also affect the user's normal business. The method of improving the performance of merged parts is convenient to support users to query and optimize SQL and improve the efficiency. Some of the original tuning settings, such as Mapjoin, PPD predicate push-down, and so on, have realized the automatic tuning in the continuously evolving products. Different stages of product tuning parameters and details will be inconsistent, but familiar with the tuning ideas and methods can be done extrapolate, and gradually in-depth.

I. Holistic approach

Overall, tuning from the bottom to the SQL can have multiple levels of tuning, with the continuous improvement of products, some tuning has been achieved automation. If you are familiar with the common tuning mechanisms and execution principles, you can improve performance by developing execution SQL, business logic, and related parameter setting adjustments to achieve less.
1. Hardware and operating system level tuning: including disk I/O tuning (multiplexing, etc.), network tuning (buffer size, number of connections amplification, etc.), memory tuning (virtual memory settings, memory control, etc.);
2. Distributed computing platform and storage plane tuning: Storage format settings, compression format settings, RPC call settings, connection number control settings, scheduling mechanism settings, block and shard settings, execution resource settings, etc.
3. Business logic level and parameter adjustment, in addition to the overall implementation of the tuning, the different types of operation of the parameter-level tuning, for aggregation, connection, read-write and other modifications to different SQL or set different parameters can greatly improve performance;
4. SQL level and application level tuning, refactoring SQL writing, merging SQL, size table connection modified to Mapjoin, etc., in odps2.0 has been processed in the automatic Mapjoin, etc., currently not upgraded users and external output users still need to modify SQL to support;
Different levels of tuning and advantages and disadvantages see below:

Two. Scenario and Tuning Enumeration

1. Size Table Association modified to Mapjoin, add mapjoin hint
Methods and Precautions:

    • Total memory consumed by all specified small tables does not exceed 512M
    • When multiple table joins, the leftmost two tables cannot be mapjoin at the same time
    • Different association modes (Left/right/inner), which are required for the order of the tables
      • Left OUTER join table must be a large table
      • Right outer join table must be a large table
      • Inner join around table can be a large table
    • Full outer join cannot be used directly with Mapjoin, modified to Mapjoin + UNION ALL
    • Automapjoin is already supported on the latest release version, which automatically optimizes the small table to mapjoin based on the size of the join table, and the SQL that does not display an increase in hint and the intermediate results for small tables.

2. Data skew
Data Skew performance:
Task progress is maintained at 99% for a long time, viewing monitoring, only a small number of reduce subtasks are not completed
The difference between the number of single reduce records and the average number of records, the longest time is longer than the average length
Optimization Methods and Precautions:

    • Aggregation skew, setting parameters: Set Odps.sql.groupby.skewindata=true
    • Associative skew, setting parameters: Set ODPS.SQL.SKEWINFO=TAB1: (Col1,col2) [(V1,V2), (v3,v4),... | (V1,V2), (V3,V4),...]; Set Odps.sql.optimize.skewjoin=true
    • Associative tilt, simultaneous size table, consider modifying to Mapjoin
    • Specific problem specific analysis: Because the aggregation key value NULL caused by the data skew, you can modify null to random values, scatter data distribution to different instance execution

3. Too many small files-dynamic partition generation, reduce the generation of small files
Methods and Precautions:

    • Setting Parameters: Set Odps.sql.reshuffle.dynamicpt=true
    • Static partitioning is recommended when you can use static partitioning
    • Opening this parameter option increases the first level of reduce and consumes additional resources
    • Be aware of the data skew that may occur during the use of parameters
    • It is recommended to close this parameter when the number of dynamic partitions is low (False)

4. Too many small files to merge
Methods and Precautions:

    • To merge Small file methods:
      • Alter MERGE mode: ALTER TABLE ... [Partition] Merge smallfiles;
      • SQL Merge mode: odps.task.merge.enable=true
    • Adjust parameters to control the effect of merging
      • Allow cross-path merge (false): Odps.merge.cross.paths
      • Small file Threshold (32M): Odps.merge.smallfile.filesize.threshold
      • Merge output file size (500M): Odps.merge.maxmerged.filesize.threshold
      • Single instance allowable number of merged files: odps.merge.max.filenumber.per.instance
      • Combined maximum number of small files (50000): Odps.merge.max.filenumber.per.instance
      • Time-out for small file merges (600S): Odps.task.merge.wait.for.fuxi.timeout
      • The maximum number of partitions that can be merged in a single commit sql: Odps.merge.max.partition.count
      • If both Sessiion and project have the above parameters set, the session setting will prevail

5. Map End Read-write
Scenarios and Optimizations:
Read the same physical table multiple times, perform different operations, write multiple tables, consider the connection and difference between Multiinsert, and whether it is appropriate to make changes
Creating temporal tables for parallelization of temporal tables

    • Advantages: Significant savings in cluster compute resources and disk I/O resources
    • Cons: Write more times, may lead to performance problems, affecting the overall execution efficiency of the task

6. Partition cropping
Scenario: Fact table many partitions, large amount of data
Optimization: Avoid the whole table scan, reduce the waste of resources, pay attention to the partition cropping has no effect, see the following considerations: from the table design, use attention, try to let the partition cut into effect

    • Partition clipping takes effect when there is UDF on the partition column in the filter condition
    • When a table is associated, the associated condition contains a partition column:
      • Occurs in the on condition, the partition clipping takes effect
      • Occurs in the Where condition, the primary table partition is clipped and the remaining may fail

7. SQL Merge
1. Read the same data multiple times with large amount of source data data, poor performance, high cost
2. Unify the business process before and after SQL or statistics multiple metrics, filter different data SQL
Optimization methods and considerations: by modifying SQL, combined into 1 SQL execution, minimizing the number of reads of the same data source, to achieve a scan to calculate multiple basic statistics, one scan, processing multiple filter conditions; The following adjustments are enumerated:

    • Case ... When ... : Merging the associations of different subqueries of the same data source
    • Dynamic partitioning & Multi-Channel insertion: Insert member statistics that meet different criteria into a different table or table partition
    • Pre-and post-process SQL merge for 1 SQL executions

8. Using window functions to optimize SQL
Window functions:
1. Flexible analysis and processing work is possible
2. Use partition by to open windows, order by sort
3. The window range can be specified with rows
4. Rich open Window function

Optimization and Precautions: reasonable use of window functions, you can reduce the number of joins, improve performance; No window functions are needed to write complex SQL functions, and the open window function can efficiently execute the desired results.

Three. Relevant parameter meaning and default value

Welcome to JoinMaxComputeNail Group Discussion

To read the original, click

Maxcomputesql Performance Tuning

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.