1. Preface
Table and column statistics have a significant impact on the results of the CBO, so it is extremely important to be able to collect statistics efficiently and accurately. But efficiency and accuracy pull in opposite directions: more accurate statistics usually require more computation, so the best we can do is find a good balance between the two. The following sections describe some of the approximate algorithms currently used in ComputeColStats.
2. What is collected
Currently, the following statistics are collected for each column:
cntRows: total number of rows in the column, including NULL values
avgColLen: average length of the column values
maxColLen: maximum length of the column values
minValue: minimum value of the column
maxValue: maximum value of the column
numNulls: number of NULL values in the column
numFalses: for a Boolean column, the number of false values
numTrues: for a Boolean column, the number of true values
countDistinct: number of distinct values
topK: the K most frequent values, a signal of data skew
In general, none of these statistics except countDistinct and topK is particularly resource-intensive (minValue and maxValue do require a large number of comparisons and consume a fair amount of resources as well), so the problem is mainly concentrated on countDistinct and topK. The approximate algorithms described below target these two.
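To make the list above concrete, here is a minimal single-pass sketch in Python (an illustration only, not the actual ComputeColStats implementation; countDistinct is kept exact here, which is precisely the expensive part the following sections replace with an approximation):

```python
# Single-pass collection of the per-column statistics listed above.
# Illustrative sketch only; not the production ComputeColStats code.
def collect_col_stats(values):
    stats = {
        "cnt_rows": 0, "num_nulls": 0,
        "num_trues": 0, "num_falses": 0,
        "min_value": None, "max_value": None,
        "max_col_len": 0, "total_len": 0,
    }
    distinct = set()  # exact; at scale this is what Flajolet-Martin replaces
    for v in values:
        stats["cnt_rows"] += 1          # NULLs are counted in cnt_rows
        if v is None:
            stats["num_nulls"] += 1
            continue
        if isinstance(v, bool):         # Boolean-only counters
            if v:
                stats["num_trues"] += 1
            else:
                stats["num_falses"] += 1
        s = str(v)
        stats["total_len"] += len(s)
        if len(s) > stats["max_col_len"]:
            stats["max_col_len"] = len(s)
        if stats["min_value"] is None or v < stats["min_value"]:
            stats["min_value"] = v
        if stats["max_value"] is None or v > stats["max_value"]:
            stats["max_value"] = v
        distinct.add(v)
    non_null = stats["cnt_rows"] - stats["num_nulls"]
    stats["avg_col_len"] = stats["total_len"] / non_null if non_null else 0
    stats["count_distinct"] = len(distinct)
    return stats
```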
3. CountDistinct implementation
Algorithm: Flajolet-Martin
Paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.3869&rep=rep1&type=pdf
Brief introduction:
For n objects, if the maximum length of the run of consecutive 0s at the end (or beginning) of their hash values is m, then the number of unique objects can be estimated as 2^m.
Suppose there is a very good hash function that maps each object to a binary number (0101...) spread evenly over the binary space. If there are 8 unique objects, then after hashing all of them, by probability about 4 of the hash values should end in 0, about 2 of those should end in 00, and about 1 should end in 000.
Using multiple independent hash functions, each one tracks the maximum run of trailing 0 bits, and the results are averaged to reduce the error.
The number of hash functions largely determines both the efficiency and the accuracy of the Flajolet-Martin algorithm; test results for different numbers of hash functions follow below.
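The scheme above can be sketched as follows (a minimal illustration: the affine "hash functions" are stand-ins for the real ones, and, as in the description above, the raw 2^m estimate is used without any bias correction):

```python
import random

def trailing_zeros(x, width=32):
    """Length of the run of 0 bits at the end of x."""
    if x == 0:
        return width
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n

def fm_estimate(values, num_hashes=32, seed=0):
    """Flajolet-Martin sketch: average the max trailing-zero run over
    several hash functions, then estimate distinct count as 2^mean."""
    rng = random.Random(seed)
    # Each "hash function" is an affine map h(x) = (a*x + b) mod 2^32, a odd.
    # These are illustrative stand-ins, not production-quality hashes.
    params = [(rng.randrange(1, 1 << 32) | 1, rng.randrange(1 << 32))
              for _ in range(num_hashes)]
    max_r = [0] * num_hashes      # max trailing-zero run per hash function
    for v in values:
        h0 = hash(v) & 0xFFFFFFFF
        for i, (a, b) in enumerate(params):
            h = (a * h0 + b) & 0xFFFFFFFF
            r = trailing_zeros(h)
            if r > max_r[i]:
                max_r[i] = r
    mean_r = sum(max_r) / num_hashes   # average exponents to reduce error
    return 2 ** mean_r                 # the 2^m estimate described above
```

Note the cost structure this implies: every row is hashed `num_hashes` times, which is why (as the tests below show) running time grows roughly linearly with the number of hash functions.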
4. TopK implementation
Algorithm: Space-Saving
Pseudo code:
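Since the pseudo-code figure is not reproduced here, the following is a straightforward Python rendering of the published Space-Saving algorithm: keep at most k counters; a new, unmonitored item evicts the item with the smallest count and inherits that count (recorded as its maximum possible overestimation error).

```python
def space_saving(stream, k):
    """Space-Saving top-k sketch. Returns [(item, (count, error)), ...]
    sorted by count, most frequent first."""
    counters = {}  # item -> (count, error)
    for item in stream:
        if item in counters:
            cnt, err = counters[item]
            counters[item] = (cnt + 1, err)
        elif len(counters) < k:
            counters[item] = (1, 0)            # free slot: exact count
        else:
            # Evict the item with the smallest count; the newcomer
            # inherits that count as its possible overestimation.
            victim = min(counters, key=lambda it: counters[it][0])
            min_cnt, _ = counters.pop(victim)
            counters[item] = (min_cnt + 1, min_cnt)
    return sorted(counters.items(), key=lambda kv: -kv[1][0])
```

An item that stays monitored from its first appearance has an exact count (error 0); any reported count overestimates the true frequency by at most its recorded error.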
5. Basic performance tests
Conclusions:
1. The base stats also affect performance, mainly through the computation of minValue and maxValue, especially when colLen is long.
2. In general, distinct is slower than topK, except when colLen is long, because the topK computation is comparison-based.
3. As the number of columns increases, the time spent collecting stats grows linearly.
4. The distinct computation is hash-based while the topK computation is comparison-based, so the former is not sensitive to colLen.
6. Execution-efficiency tests with different numbers of hash functions
Conclusion:
Execution time grows roughly linearly with the number of hash functions.
7. Accuracy tests with different numbers of hash functions
Conclusion:
Once the number of hash functions reaches 32, the accuracy basically meets the requirements.
8. Summary of the hash-function-count tests
Conclusion: use 32 hash functions to compute distinct, balancing execution efficiency and accuracy.
9. Choice of sampling algorithm
1. Necessity: based on the execution-efficiency tests above, sampling is a must in order to avoid impacting the task too much.
2. Requirements on the sampling algorithm: efficiency and randomness.
3. Choice of sampling: use the built-in sample function, assuming the data distribution is random.
Effects of 4,sample:
Has no effect on some stats, such as Avgcollen,maxcollen,minvalue,maxvalue.
Some effects on some stats, such as Cntrows, NUMNULLS,NUMFALSES,NUMTRUES,TOPK
The impact on COUNTDISTINCT is relatively large, and countdistinct is also more important, requiring special attention
5. Handling countDistinct after sampling: predict the countDistinct of the complete data from the countDistinct of the sample, by sampling and fitting.
The basic idea is as follows:
The hope is that the sampled points sketch the shape of the full data set, so that the distinct count of the complete data can be predicted reasonably accurately. This is a nice wish; when the sample is relatively small, there are cases where the pattern in the sample differs greatly from the shape of the complete data, and then the error will be relatively large.
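The text does not spell out the fitting formula it uses. As one illustrative approach (a swap-in, not necessarily what ComputeColStats does), the Chao1 estimator extrapolates the full-data distinct count from how many values occur exactly once (f1) or exactly twice (f2) in the sample, on the intuition that many singletons in the sample imply many values still unseen:

```python
from collections import Counter

def chao1_estimate(sample):
    """Chao1 estimate of the full-data distinct count from a sample.
    Illustrative stand-in for the fitting step described above."""
    freq = Counter(sample)
    d = len(freq)                               # distinct values in the sample
    f1 = sum(1 for c in freq.values() if c == 1)  # values seen exactly once
    f2 = sum(1 for c in freq.values() if c == 2)  # values seen exactly twice
    if f2 == 0:
        return d + f1 * (f1 - 1) / 2.0          # bias-corrected variant
    return d + (f1 * f1) / (2.0 * f2)
```

Like any such extrapolation, this fails in exactly the way the paragraph above warns about: when the sample's frequency pattern does not resemble the full data's, the estimate can be far off.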
10. Execution-efficiency tests with different sample ratios
Once the sampling ratio drops below 1/100 the time differences are small; at that point most of the cost is in reading the data, not in the distinct computation.
11. Accuracy tests with different sample ratios
Some tests were run on the columns project_name and odps_inst_id of the table Meta.m_fuxi_instance, as shown above. A ratio of 1/50 appears to give acceptable results.
One more point: the distinct count does not need to be exactly right; for now a gap within a factor of 10 is acceptable, and this tolerance is the premise that lets us use sampling to improve efficiency.
12. Example results at a sample ratio of 1/25
The execution time and accuracy basically meet current requirements.
13. Follow-up work
Improving accuracy is one of the next things to do; the key is how to find sample points that are more representative of the shape of the full data. That said, we should be prepared for the possibility that, in some scenarios, no such method exists and a certain range of error simply has to be accepted.
(Original title: An introduction to the approximate algorithms in the ComputeColStats UDF)