Abstract: Reproduced from Xiaorui some users (especially the external output) when using Maxcompute (original ODPs), due to the use of the product level and the level of understanding of the implementation of different degrees, resulting in the implementation of the task is too long, the occupation of more cluster resources; serious causes failure, Not only need to input support students to help solve the problem, but also affect the user's normal business. Consolidation part performance improvement method side
Reprinted from Xiaorui
Some users (especially the external output) when using Maxcompute (original ODPs), due to the use of the product level and level of understanding of the level of implementation, resulting in the implementation of the task is too long, the occupation of more cluster resources, serious will lead to failure, not only need to invest support students to help solve, Also affect the user's normal business. The method of improving the performance of merged parts is convenient to support users to query and optimize SQL and improve the efficiency. Some of the original tuning settings, such as Mapjoin, PPD predicate push-down, and so on, have realized the automatic tuning in the continuously evolving products. Different stages of product tuning parameters and details will be inconsistent, but familiar with the tuning ideas and methods can be done extrapolate, and gradually in-depth.
I. Holistic approach
Overall, tuning from the bottom to the SQL can have multiple levels of tuning, with the continuous improvement of products, some tuning has been achieved automation. If you are familiar with the common tuning mechanisms and execution principles, you can improve performance by developing execution SQL, business logic, and related parameter setting adjustments to achieve less.
- Hardware and operating system level tuning: including disk I/O tuning (multiplexing, etc.), network tuning (buffer size, number of connections amplification, etc.), memory tuning (virtual memory settings, memory control, etc.);
- Distributed computing platform and storage plane tuning: Storage format settings, compression format settings, RPC call settings, connection number control settings, scheduling mechanism settings, block and shard settings, execution resource settings, etc.
- Business logic level and parameter adjustment, in addition to the overall implementation of the tuning, the different types of operation of the parameter-level tuning, for aggregation, connection, read-write and other changes to different SQL or set different parameters can greatly improve performance;
- SQL level and application level tuning, refactoring SQL writing, merging SQL, size table connection modified to Mapjoin, etc., in the odps2.0 has been processed in the automatic mapjoin, and so on, the user is not upgraded and external output users still need to modify SQL to support;
Different levels of tuning and advantages and disadvantages see below:
Two. Scenario and Tuning Enumeration
1. Size Table Association modified to Mapjoin, add mapjoin hint
Methods and Precautions:
Total memory consumed by all specified small tables does not exceed 512M
When multiple table joins, the leftmost two tables cannot be mapjoin at the same time
Different association modes (Left/right/inner), which are required for the order of the tables
Left OUTER join table must be a large table
Right outer join table must be a large table
Inner join around table can be a large table
Full outer join cannot be used directly with Mapjoin, modified to Mapjoin + UNION ALL
Automapjoin is already supported on the latest release version, which automatically optimizes the small table to mapjoin based on the size of the join table, and the SQL that does not display an increase in hint and the intermediate results for small tables.
2 Data skew
Data Skew performance:
Task progress is maintained at 99% for a long time, viewing monitoring, only a small number of reduce subtasks are not completed
The difference between the number of single reduce records and the average number of records, the longest time is longer than the average length
Optimization Methods and Precautions:
Aggregation skew, setting parameters: Set Odps.sql.groupby.skewindata=true
Associative skew, setting parameters: Set ODPS.SQL.SKEWINFO=TAB1: (Col1,col2) [(V1,V2), (v3,v4),... | (V1,V2), (V3,V4),...]; Set Odps.sql.optimize.skewjoin=true
Associative tilt, simultaneous size table, consider modifying to Mapjoin
Specific problem specific analysis: Because the aggregation key value NULL caused by the data skew, you can modify null to random values, scatter data distribution to different instance execution
3. Map End Read-write
Scenarios and Optimizations:
Read the same physical table multiple times, perform different operations, write multiple tables, consider the connection and difference between Multiinsert, and whether it is appropriate to make changes
Creating temporal tables for parallelization of temporal tables
Precautions:
Advantages: Significant savings in cluster compute resources and disk I/O resources
Cons: Write more times, may lead to performance problems, affecting the overall execution efficiency of the task
4 Partition Clipping
Scenario: Fact table many partitions, large amount of data
Optimization: Avoid the whole table scan, reduce the waste of resources, pay attention to the partition cropping has no effect, see the following considerations: from the table design, use attention, try to let the partition cut into effect
Precautions:
Partition clipping takes effect when there is UDF on the partition column in the filter condition
When a table is associated, the associated condition contains a partition column:
Occurs in the on condition, the partition clipping takes effect
Occurs in the Where condition, the primary table partition is clipped and the remaining may fail
5. SQL Merge
Scene:
- Reads the same data multiple times and has a large amount of source data data, poor performance, and high cost
- Unify the business process before and after SQL or statistics multiple metrics, filter different data SQL
Optimization methods and considerations: by modifying SQL, combined into 1 SQL execution, minimizing the number of reads of the same data source, to achieve a scan to calculate multiple basic statistics, one scan, processing multiple filter conditions; The following adjustments are enumerated:
Case ... When ... : Merging the associations of different subqueries of the same data source
Dynamic partitioning & Multi-Channel insertion: Insert member statistics that meet different criteria into a different table or table partition
Pre-and post-process SQL merge for 1 SQL executions
6 Optimizing SQL Using Window functions
Window functions:
- can perform flexible analysis and processing work
- Use partition by to open windows, order by sort
- You can specify the window range with rows
- Rich window-opening functions
Optimization and Precautions: reasonable use of window functions, you can reduce the number of joins, improve performance; No window functions are needed to write complex SQL functions, and the open window function can efficiently execute the desired results.
Original link
Read more about dry goods, please scan the following two-dimensional code:
Maxcomputesql Performance Tuning