Introduction to the approximate algorithms in the Computecolstats UDF

Source: Internet
Author: User

One, preface

Table and column statistics have a significant impact on the results of the CBO, so it is extremely important to collect them efficiently and accurately. Efficiency and accuracy pull against each other, however: more accurate statistics usually require more computation. What we can do is find a better balance between the two. The following sections describe some of the approximate algorithms currently used in Computecolstats.

Two, statistics collected

Currently, the following statistics are primarily collected for columns:

Cntrows: The total number of rows in the column, including NULL values

Avgcollen: Average length of column

Maxcollen: Maximum length of column

MinValue: Minimum value of column

MaxValue: The maximum value of the column

Numnulls: Number of NULL values in column

Numfalses: If Boolean, the number of false values

Numtrues: If Boolean, the number of true values

CountDistinct: Number of distinct values

TOPK: The K most frequent values, an indicator of data skew

In general, apart from CountDistinct and TOPK, these statistics are not very resource-intensive to compute (MinValue and MaxValue require a large number of comparisons and so do consume some resources). The problem is concentrated in CountDistinct and TOPK, and the approximate algorithms described below mainly target these two.
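To make the contrast concrete, the inexpensive statistics can all be gathered in one pass with a handful of counters. The following is a minimal Python sketch, not the Computecolstats implementation: the class and function names are hypothetical, the length of a value is assumed to be the length of its string form, and the average length is assumed to be taken over non-NULL values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BaseColStats:
    """Counters for the cheap per-column statistics (hypothetical container)."""
    cnt_rows: int = 0
    num_nulls: int = 0
    num_trues: int = 0
    num_falses: int = 0
    total_len: int = 0
    max_col_len: int = 0
    min_value: Optional[object] = None
    max_value: Optional[object] = None

    @property
    def avg_col_len(self) -> float:
        # Assumption: average length is taken over non-NULL values only.
        non_null = self.cnt_rows - self.num_nulls
        return self.total_len / non_null if non_null else 0.0


def collect_base_stats(values: Iterable[object]) -> BaseColStats:
    """Single pass over one column; values are assumed to share one type."""
    stats = BaseColStats()
    for v in values:
        stats.cnt_rows += 1
        if v is None:
            stats.num_nulls += 1
            continue
        if isinstance(v, bool):
            # Boolean columns only need the true/false counters.
            if v:
                stats.num_trues += 1
            else:
                stats.num_falses += 1
            continue
        length = len(str(v))  # assumption: length of the string form
        stats.total_len += length
        stats.max_col_len = max(stats.max_col_len, length)
        # One comparison each for min and max per row.
        if stats.min_value is None or v < stats.min_value:
            stats.min_value = v
        if stats.max_value is None or v > stats.max_value:
            stats.max_value = v
    return stats
```

Each row costs a constant number of operations and the state is a few integers, whereas an exact distinct count or exact top-K would need memory proportional to the number of distinct values.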

Three, CountDistinct implementation

Algorithm: Flajolet-Martin

See the paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.3869&rep=rep1&type=pdf

Brief introduction

Given n objects, if the longest run of consecutive 0 bits at the end (or beginning) of their hash values has length m, then the number of unique objects can be estimated as 2^m.

Suppose there is a very good hash function that maps an object to a binary number 0101... and scatters the results evenly over the binary space. If there are 8 unique objects, then after hashing, probability says about 4 of the hash values should end in 0, 2 of those 4 should end in 00, and 1 of those 2 should end in 000.

To reduce the error, use multiple independent hash functions: compute the longest run of 0 bits for each hash function, then average the results.

The number of hash functions largely determines both the efficiency and the accuracy of the Flajolet-Martin algorithm. Test results for different numbers of hash functions are given in later sections.
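The description above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used in Computecolstats: the function names are hypothetical, the independent hash functions are simulated with salted BLAKE2b from the standard library, NULL values are skipped, and the estimate is the plain 2^m of the text with m averaged over the hash functions, without the bias-correction constant from the paper.

```python
import hashlib
from typing import Iterable


def trailing_zeros(x: int, width: int = 64) -> int:
    """Number of consecutive 0 bits at the low end of x."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def hash64(value: str, seed: int) -> int:
    """64-bit hash; each seed acts as a separate hash function (assumption)."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def fm_estimate(values: Iterable[object], num_hashes: int = 32) -> float:
    """Estimate the number of distinct non-NULL values as 2^(mean of m)."""
    max_zeros = [0] * num_hashes
    seen_any = False
    for v in values:
        if v is None:
            continue  # assumption: NULL is not counted as a distinct value
        seen_any = True
        text = str(v)
        for i in range(num_hashes):
            z = trailing_zeros(hash64(text, i))
            if z > max_zeros[i]:
                max_zeros[i] = z
    if not seen_any:
        return 0.0
    # Average the per-hash-function maxima, then exponentiate.
    mean_m = sum(max_zeros) / num_hashes
    return 2.0 ** mean_m
```

For example, `fm_estimate(f"user_{i}" for i in range(2000))` should land within a small constant factor of 2000. Repeating values does not change the result, since only the per-hash maxima are kept, and the work per row grows in proportion to `num_hashes`.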

Four, TOPK implementation

Algorithm: Space-Saving

Pseudo code:
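As a stand-in, here is a minimal Python sketch of the standard Space-Saving algorithm (Metwally et al.). It may differ from the variant used in Computecolstats; the function name and the `capacity` parameter are hypothetical, and the minimum counter is found by a linear scan for brevity rather than with the stream-summary structure of the original algorithm.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def space_saving(
    stream: Iterable[Hashable], capacity: int
) -> List[Tuple[Hashable, int, int]]:
    """Return (item, estimated_count, max_overestimate), most frequent first."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    counts: Dict[Hashable, int] = {}
    errors: Dict[Hashable, int] = {}
    for item in stream:
        if item in counts:
            # Already monitored: just count it.
            counts[item] += 1
        elif len(counts) < capacity:
            # Free slot: start monitoring with an exact count.
            counts[item] = 1
            errors[item] = 0
        else:
            # No free slot: evict the item with the smallest counter and
            # let the newcomer inherit that counter plus one.
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[item] = floor + 1
            errors[item] = floor  # the count may be overestimated by this much
    return sorted(
        ((item, c, errors[item]) for item, c in counts.items()),
        key=lambda entry: -entry[1],
    )
```

With `capacity` counters, any value that occurs in more than 1/capacity of the rows is guaranteed to be among the monitored items, which is why the output can serve as a data-skew indicator. The membership test on each row is a key comparison, so long column values make this step more expensive.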

Five, basic performance test

Conclusion:

1, Base stats also affect performance, mainly through the calculation of MinValue and MaxValue, especially when Collen is long

2, In general, distinct is slower than TOPK, except when Collen is long, because TOPK is comparison-based

3, As the number of columns increases, the time spent collecting stats increases linearly

4, The distinct calculation is hash-based and the TOPK calculation is comparison-based, so the former is not sensitive to Collen

Six, execution efficiency test for different numbers of hash functions

Conclusion:

Execution time grows roughly linearly with the number of hash functions

Seven, accuracy test for different numbers of hash functions

Conclusion:

Once the number of hash functions reaches 32, the accuracy basically meets the requirements

Eight, summary of the tests on different numbers of hash functions

Conclusion: Use 32 hash functions to calculate distinct, balancing execution efficiency and accuracy

Nine, choosing the sample algorithm

1, Necessity:

Based on the execution efficiency tests above, sampling is required to avoid too much impact on the task

2, Requirements for the sample algorithm:

Efficient, random

3, Choice of sample:

Use the built-in sample function

Assume that the data distribution is random.

4, Effects of sample:

No effect on some stats, such as Avgcollen, Maxcollen, MinValue, MaxValue.

Some effect on some stats, such as Cntrows, Numnulls, Numfalses, Numtrues, TOPK

The impact on CountDistinct is relatively large, and CountDistinct is also the more important statistic, so it requires special attention

5, Handling CountDistinct after sampling:

Predict the CountDistinct of the complete data from the CountDistinct of the sample, by sampling points within the sample and fitting

The basic idea:

The hope is that by sampling points within the sample, we can use those points to trace the shape of all the data and so predict the distinct count of the complete data with reasonable accuracy. This is only a hope: when the sample is relatively small, the pattern seen in the sample can differ substantially from that of the complete data, and the error is then relatively large.
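One way to picture the fitting step is sketched below in Python. The article does not say which curve is fitted, so the power-law model d = a * n^b, fitted by least squares in log-log space, is purely an assumption for illustration, as are all function names. The sketch also assumes the sample is in random order and counts distinct values exactly with a set where a real pipeline would use the approximate estimator.

```python
import math
from typing import Hashable, List, Sequence, Tuple


def distinct_growth_points(
    sample: Sequence[Hashable], num_points: int = 8
) -> List[Tuple[int, int]]:
    """(rows_seen, distinct_seen) at evenly spaced prefixes of the sample."""
    points: List[Tuple[int, int]] = []
    seen = set()
    step = max(1, len(sample) // num_points)
    for i, value in enumerate(sample, start=1):
        seen.add(value)
        if i % step == 0 or i == len(sample):
            points.append((i, len(seen)))
    return points


def fit_power_law(points: Sequence[Tuple[int, int]]) -> Tuple[float, float]:
    """Least-squares fit of log(d) = log(a) + b * log(n); returns (a, b)."""
    if len(points) < 2:
        raise ValueError("need at least two points to fit")
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(d) for _, d in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def extrapolate_distinct(
    sample: Sequence[Hashable], total_rows: int, num_points: int = 8
) -> float:
    """Predict the distinct count of the full data from the sample's growth curve."""
    points = distinct_growth_points(sample, num_points)
    a, b = fit_power_law(points)
    predicted = a * total_rows ** b
    sample_distinct = points[-1][1]
    # The true value lies between the sample's distinct count and the row count.
    return min(float(total_rows), max(float(sample_distinct), predicted))
```

For a column of unique keys the fitted exponent is 1 and the prediction scales with the row count; for a constant column the exponent is 0 and the prediction stays at 1. Columns between these extremes are where a small sample can misrepresent the shape of the full data.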

Ten, execution efficiency test for different sample ratios

Once the sample ratio drops below 1/100, execution time changes very little, because at that point most of the cost is in reading the data rather than in the distinct calculation.

Eleven, accuracy test for different sample ratios

Some tests were run on the columns project_name and odps_inst_id of the table Meta.m_fuxi_instance, as shown above. A ratio of 1/50 appears to give acceptable results.

It is worth adding that distinct does not need to be exactly right. An error within a factor of 10 is currently acceptable, and this is the premise that allows us to sample to improve efficiency.

Twelve, calculation results with a sample ratio of 1/25 as an example

Both the execution time and the accuracy basically meet current needs

Thirteen, follow-up work

Improving accuracy is one of the next tasks. The key is how to find more representative points in the sample from which to predict the shape of all the data. We should be prepared, however, for the possibility that no such method exists for some scenarios, in which case a certain range of error has to be accepted.
