Sharing Some ETL Tuning Experience


Original link: http://www.transwarp.cn/news/detail?id=173

ETL is an important step in building a data warehouse: through this process, users extract the data they need and load it into the warehouse according to a defined model. Because ETL is a necessary part of building a data warehouse, its efficiency affects the construction of the whole warehouse, so tuning it effectively is highly important. In practice, we usually suggest breaking ETL business tuning into several lines of attack, to keep the tuning thorough and orderly, avoid omissions, and maximize ETL execution efficiency.

We will cover the means of ETL business tuning across two articles. This article introduces the first three: checking that resources are configured effectively; collecting data characteristics and determining partitioning and bucketing; and collecting and monitoring task runtime behavior. For each step, it explains the tuning principles and points of attention, proposes corresponding solutions and approaches, and then uses case studies to deepen the reader's understanding.

1. Check that resources are configured effectively

For resources, the general principle is to allocate them as effectively as possible for the given scenario. In a POC, the maximum configuration can be considered; in an implementation project, the settings can be somewhat more conservative to ensure system stability.

If the cluster is used primarily for batch processing, we generally recommend giving Inceptor about 50% of the CPU resources, with each individual executor using no more than about 10 cores. On well-provisioned hardware, such as servers with more than 40 CPU cores, a multi-executor deployment is recommended: for example, 2 executors with 10 cores each will perform better than 1 executor with 20 cores.

If the cluster hosts other services at the same time, assign as many compute resources to Inceptor as the other services' requirements allow. Added compute resources typically translate into near-linear gains in analysis performance.

2. Collect data characteristics and determine partitioning and bucketing

Note that a reasonable DDL design is a critical step for batch-processing projects and must be determined through comprehensive analysis of the business and the data characteristics. We recommend that developers devote enough time and energy to DDL design and fully validate the design's effectiveness.

1. Partitioning

Partition field selection

The general principle is to choose the partitioning field according to the system's business type. Generally, the data in a fact table all carry a time attribute, and report business runs statistical analysis over some time range, so partitioning by the time field is a common choice. If the business instead analyzes mostly by department, partitioning by department code or geographic code is more appropriate. Staying close to the business characteristics is therefore the first rule of partition selection.
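For illustration, a minimal sketch of a time-partitioned fact table in Hive-style DDL, which Inceptor accepts (the table and column names are hypothetical):

    CREATE TABLE fact_orders (
        order_id  BIGINT,
        dept_code STRING,
        amount    DECIMAL(18,2)
    )
    PARTITIONED BY (order_month STRING)   -- e.g. '2016-01'
    STORED AS ORC;

    -- A report over a time range then reads only the matching partitions:
    SELECT dept_code, SUM(amount) AS total_amount
    FROM fact_orders
    WHERE order_month BETWEEN '2016-01' AND '2016-03'
    GROUP BY dept_code;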

In addition, the distribution characteristics of the data are an important factor in choosing the partitioning field. Suppose the user intends to partition by some field A, but A's distribution is extremely skewed: say A has 1,000 distinct values, yet 50% of the rows hold the value 0. Partitioning on A would then put 50% of the table's data into a single partition and make the SQL business inefficient, so A is not a reasonable choice.

However, as long as the distribution is not absolutely skewed, range partitions of different widths can effectively work around the skew, so the field can still be used for partitioning. For example, with enterprise user data, if the 2016 user volume is 3 times that of 2015, we can make each month of 2016 one partition and every three months of 2015 one partition, rather than splitting the range into equal intervals. Another common case is partitioning by geography: place each high-traffic region in its own partition and consolidate the low-traffic regions, so that data volume is roughly balanced across partitions and no partition ends up with little or no data. If some region's data is still particularly skewed, consider splitting it further or changing the partition field.
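As a sketch of the uneven-range idea, assuming Inceptor-style range-partition DDL (the exact syntax may differ by version; all names are hypothetical): 2015 gets quarterly partitions and 2016 monthly ones, so each partition holds a comparable amount of data:

    CREATE TABLE user_events (
        user_id  BIGINT,
        event_ts TIMESTAMP
    )
    PARTITIONED BY RANGE (event_date DATE) (
        PARTITION p2015q1  VALUES LESS THAN ('2015-04-01'),
        PARTITION p2015q2  VALUES LESS THAN ('2015-07-01'),
        PARTITION p2015q3  VALUES LESS THAN ('2015-10-01'),
        PARTITION p2015q4  VALUES LESS THAN ('2016-01-01'),
        PARTITION p2016m01 VALUES LESS THAN ('2016-02-01'),
        PARTITION p2016m02 VALUES LESS THAN ('2016-03-01')
        -- ...and so on, one partition per month of 2016
    )
    STORED AS ORC;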

Number of partitions

Choosing the number of partitions requires weighing the data characteristics comprehensively: once the partition field is chosen, determine the partition count from the range and characteristics of the data.

The number of partitions should not be too small; it must be decided by the business characteristics, ensuring that partitions do not mix too much hot and cold data. For example, if most SQL business operates on 2-3 months of data, we should try to make each partition cover about 3 months. If we instead made each partition cover a year, every SQL run would also read the other 9 months of data, and that resource and I/O overhead is unnecessary.

In practice, the flaw we see more often in DDL designs is too many partitions, such as single partitions holding no more than 1 GB of data, or partitioning by day; both are inappropriate. Too many partitions lead to excessive use of system resources: for a full-table-scan business, Inceptor launches (number of partitions × number of buckets) tasks to complete the SQL job, so a large partition or bucket count lets a single job occupy cluster resources for a long time. In general we recommend keeping the number of partitions within a few dozen.

To reiterate: do not partition by day.

2. Bucketing

Selection of bucket fields

The general principle is to pick a field with a high degree of dispersion as the bucket field. This can be judged from collected data characteristics such as the number of distinct values: the higher the count, the stronger the candidate. When selecting the bucket field, take care that records remain evenly distributed across buckets, avoiding data skew.
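A quick way to gauge dispersion, sketched with the hypothetical names from above, is to compare each candidate's distinct count against the row count:

    SELECT COUNT(*)                  AS total_rows,
           COUNT(DISTINCT order_id)  AS distinct_orders,
           COUNT(DISTINCT dept_code) AS distinct_depts
    FROM fact_orders;

A column whose distinct count approaches the row count (order_id here) spreads records evenly; a low-cardinality column such as dept_code would pile most rows into a few buckets and skew them.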

Number of buckets

The standard for the number of buckets varies slightly by storage type. For a common ORC table, keep each bucket within about 200 MB; for an ORC transaction table the standard is tightened, limiting each bucket file to about 100 MB and roughly a million records.

Also, when deciding the number of buckets, consider whether the table is already partitioned. For a partitioned table, estimate the bucket count from the size and record count of a single partition, not of the whole table. For example, in one project we found tables that were both partitioned and bucketed with far too many buckets, leaving each bucket file only a few dozen KB. Individual tasks then executed very inefficiently, the excessive overall task count wasted system resources, and concurrency could not rise. The fix was to cut the number of buckets by an order of magnitude based on the actual data volume.
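As a sketch of the sizing rule (all figures hypothetical): if each partition of an ORC transaction table holds about 3 GB, the 100 MB-per-bucket target gives 3 GB ÷ 100 MB ≈ 30 buckets:

    -- With 12 monthly partitions, a full scan spawns at most
    -- 12 × 30 = 360 map tasks, which stays in a manageable range.
    CREATE TABLE txn_orders (
        order_id BIGINT,
        amount   DECIMAL(18,2)
    )
    PARTITIONED BY (order_month STRING)
    CLUSTERED BY (order_id) INTO 30 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');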

When bucketing, you should also pay attention to the following:

Most transactional business is built on range-bucketed tables. In that case, the number of map tasks = number of partitions × number of buckets, so the two counts must be considered together: the map task count cannot be allowed to grow too large (say, thousands or tens of thousands), or individual tasks become inefficient, system resources are consumed, and system concurrency drops.

3. Task runtime collection and monitoring

Monitoring is done mainly from the following three aspects on the 4040 web interface:
1) Number of tasks

Both too many and too few tasks are problems; the task counts can be checked by sorting on the 4040 page.

For the too-many case, such as a stage with several thousand tasks, check whether that count is reasonable; the inflated task counts caused by over-partitioning and over-bucketing described above are one example. Also, if the SQL filters on the partition field but the task count observed on 4040 is still large, check whether partition pruning actually succeeded.
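One way to check, sketched against the hypothetical table above, is to inspect the query plan and confirm that only the matching partition is scanned:

    EXPLAIN
    SELECT COUNT(*)
    FROM fact_orders
    WHERE order_month = '2016-01';

If the plan shows every partition being read, the predicate is not being pushed down, for instance because the partition column is wrapped in a function or an implicit cast.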

Conversely, too few tasks make each individual task too large and too slow to execute, and the design needs adjusting.

2) Stragglers (long tails) caused by task skew.

The common cause is skew on certain keys, such as null values or a hotspot region; in other, rarer cases a window function makes a single task too large and creates a straggler. For example, in one customer business we found that a ROW_NUMBER window function without PARTITION BY forced a single task to perform a full global sort. Once the problem is located, the corresponding fix can be applied.

Case Study:

Observed on the 4040 page, task 0 of one stage ran for more than 10 minutes. After ruling out the classic null-value skew on task 0, the cause was finally located: the ROW_NUMBER window function had no PARTITION BY, producing a global sort.
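The pattern, sketched with the hypothetical names from above: without PARTITION BY, ROW_NUMBER must sort the entire input in a single task; when the business only needs a rank per group, adding PARTITION BY lets each group be ranked in parallel:

    -- Problematic: one task performs a global sort.
    SELECT user_id, event_ts,
           ROW_NUMBER() OVER (ORDER BY event_ts) AS rn
    FROM user_events;

    -- Fixed: ranking runs independently, and concurrently, per user.
    SELECT user_id, event_ts,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts) AS rn
    FROM user_events;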

3) The number of concurrently executing stages.

In general, given the CPU resources and the tables' DDL design, we can roughly determine the reasonable range for the concurrent task count on the 4040 page. For example, if the current Inceptor deployment has 400 cores and a typical SQL operates on one partition of 20 buckets, the number of concurrently running stages we expect to see is total cores ÷ average tasks per SQL, here 400 ÷ 20 = 20. If the observed concurrency is lower than expected, some checking is needed. The 4040 page also shows when each stage is submitted and when it finishes executing, and this information can be monitored to confirm that execution is reasonable.

Case Study:

In one test scenario, the interval between the end of one stage and the submission of the next exceeded 3 minutes. Analysis of hive-server2.log revealed abnormal lock timeouts; after the lock contention and related problems were avoided, the gap disappeared and system performance and concurrency improved greatly.

4. Summary

This article introduced three lines of attack for ETL business tuning and analyzed the specific considerations and tuning principles in each step.

In ETL business tuning, grasp the key points of each tuning step and master the principles behind it. On top of the general principles, analyze each concrete problem on its own terms: understand the issues each step may surface and their causes, troubleshoot promptly, and apply the tuning effort where it matters most.

In the next article we will work through several more cases and introduce the remaining approaches, to help readers fully master the concrete steps of ETL business tuning, deepen their understanding, and know which means to apply when.
