A first attempt at statistics histograms in MySQL 8.0


A histogram describes the distribution of data in a table column, bucketed by proportion according to some rule. One of its most important uses is to estimate, from the query predicates, how many rows will match, which is a key input to SQL execution-plan generation.

In versions prior to MySQL 8.0, MySQL kept only simple column statistics and had no histograms, and statistics without histograms are of limited value for selectivity estimation.
One of the new features of MySQL 8.0 is support for histogram statistics. The concept itself is an old one, so I took some time to make a concrete attempt at how to use it.

I have previously written about statistics in MSSQL; the underlying principles here are the same.

As usual, let's go straight to an example: build the table and generate test data to set up a test environment.

CREATE TABLE test
(
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100),
    create_date DATETIME,
    INDEX (create_date DESC)
);

DELIMITER $$

USE `db01`$$

DROP PROCEDURE IF EXISTS `insert_test_data`$$

CREATE DEFINER=`root`@`%` PROCEDURE `insert_test_data`()
BEGIN
    DECLARE v_loop INT;
    SET v_loop = 100000;
    WHILE v_loop > 0 DO
        INSERT INTO test(name, create_date)
        VALUES (UUID(), DATE_ADD(NOW(), INTERVAL -RAND()*100000 MINUTE));
        SET v_loop = v_loop - 1;
    END WHILE;
END$$

DELIMITER ;


Unlike in MSSQL, statistics histograms in MySQL are not tied to an index and must be created separately, with the following syntax:

-- Create a statistics histogram on one or more columns
ANALYZE TABLE test UPDATE HISTOGRAM ON create_date, name WITH 32 BUCKETS;
-- Drop the statistics histogram on a column
ANALYZE TABLE test DROP HISTOGRAM ON create_date;

1. Statistics can be created on more than one column at a time; the system builds a separate histogram for each listed column. The histograms do not depend on any index, which differs from MSSQL (although MSSQL can also create statistics independently of an index).
2. The WITH n BUCKETS clause is optional; the default is 100 buckets and the allowed range is 1-1024. This differs from MSSQL, where a histogram has at most about 200 steps.
3. In general, for large tables, whether the data is repetitive or not, the larger the bucket count, the more detailed the histogram's description of the distribution.
4. The actual statistics are stored in information_schema.column_statistics, but they are not very readable as-is; you can parse them yourself into whatever format you prefer.
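As a quick way to confirm which histograms exist and with what settings, the metadata can be read straight from information_schema.column_statistics. A minimal sketch, assuming the db01 schema and test table used in this article:

```sql
-- list existing histograms and their basic properties
SELECT schema_name,
       table_name,
       column_name,
       HISTOGRAM->>'$."histogram-type"'              AS histogram_type,
       HISTOGRAM->>'$."number-of-buckets-specified"' AS buckets_specified,
       HISTOGRAM->>'$."last-updated"'                AS last_updated
FROM   information_schema.column_statistics
WHERE  schema_name = 'db01'
AND    table_name  = 'test';
```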

As with statistics in SQL Server, accuracy is theoretically proportional to the sampling granularity (the number of BUCKETS), and of course the cost of generating the statistics grows accordingly.

As for how the bucket count relates to the sampling percentage, and what the overall cost trade-off is, I have not yet found any relevant documentation.

The following is the histogram created by ANALYZE TABLE test UPDATE HISTOGRAM ON create_date WITH 4 BUCKETS;

You can see that the HISTOGRAM field holds a JSON-formatted string that is hard to read directly.

Recall the histogram output of DBCC SHOW_STATISTICS in SQL Server, formatted like the following, where the data distribution looks clear and intuitive.

So I wrote a converter for the MySQL histogram format, which simply parses the JSON in the HISTOGRAM field of the information_schema.column_statistics table.

Below is a simple stored procedure for parsing the histogram JSON data:


DELIMITER $$

USE `db01`$$

DROP PROCEDURE IF EXISTS `parse_column_statistics`$$

CREATE DEFINER=`root`@`%` PROCEDURE `parse_column_statistics`(
    IN `p_schema_name` VARCHAR(200),
    IN `p_table_name` VARCHAR(200),
    IN `p_column_name` VARCHAR(200)
)
BEGIN
    DECLARE v_histogram TEXT;
    -- get the buckets array from the HISTOGRAM JSON
    SELECT HISTOGRAM->>'$."buckets"' INTO v_histogram
    FROM   information_schema.column_statistics
    WHERE  schema_name = p_schema_name
    AND    table_name  = p_table_name
    AND    column_name = p_column_name;
    -- remove the first and last [ and ] char
    SET v_histogram = SUBSTRING(v_histogram, 2, LENGTH(v_histogram)-2);

    DROP TABLE IF EXISTS t_buckets;
    CREATE TABLE t_buckets (
        id INT AUTO_INCREMENT PRIMARY KEY,
        buckets_content VARCHAR(500)
    );
    -- split by "]," and get single bucket content
    WHILE (INSTR(v_histogram, '],') > 0) DO
        INSERT INTO t_buckets(buckets_content)
        SELECT SUBSTRING(v_histogram, 1, INSTR(v_histogram, '],'));
        SET v_histogram = SUBSTRING(v_histogram, INSTR(v_histogram, '],')+2, LENGTH(v_histogram));
    END WHILE;
    INSERT INTO t_buckets(buckets_content)
    SELECT v_histogram;
    -- get the basic statistics data
    WITH cte AS
    (
        SELECT
            HISTOGRAM->>'$."last-updated"' AS last_updated,
            HISTOGRAM->>'$."number-of-buckets-specified"' AS number_of_buckets_specified
        FROM  information_schema.column_statistics
        WHERE schema_name = p_schema_name
        AND   table_name  = p_table_name
        AND   column_name = p_column_name
    )
    SELECT
        CASE WHEN id = 1 THEN p_schema_name ELSE '' END AS schema_name,
        CASE WHEN id = 1 THEN p_table_name  ELSE '' END AS table_name,
        CASE WHEN id = 1 THEN p_column_name ELSE '' END AS column_name,
        CASE WHEN id = 1 THEN last_updated  ELSE '' END AS last_updated,
        CASE WHEN id = 1 THEN number_of_buckets_specified ELSE '' END AS number_of_buckets_specified,
        id AS buckets_specified_index,
        buckets_content
    FROM cte, t_buckets;
END$$

DELIMITER ;


With it, the earlier result is converted into the following format.
The histogram here was deliberately generated with 4 buckets, so it should be simple enough; anyone familiar with MSSQL histograms should see its meaning at a glance (the test table holds 400,000 rows).
Take the first bucket for example: ["2018-06-15 04:57:48.000000", "2018-07-02 15:13:04.000000", 0.25, 95311]
Clearly, "2018-06-15 04:57:48.000000" and "2018-07-02 15:13:04.000000" correspond to the lower and upper bounds of a step in a SQL Server histogram. The 0.25 is the cumulative fraction of rows up to and including this bucket (25%), and 95311 is the number of distinct values within the bucket.
By the last bucket, the cumulative fraction necessarily reaches 1, i.e. 100%.
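Since MySQL 8.0 ships JSON_TABLE, the same four bucket fields can also be unpacked without a stored procedure. A sketch, assuming an equi-height histogram on create_date in db01 (singleton histograms use a different bucket layout, so some columns would come back NULL for them):

```sql
-- expand each equi-height bucket into lower bound, upper bound,
-- cumulative row fraction and number of distinct values
SELECT b.lower_bound,
       b.upper_bound,
       b.cum_freq,
       b.ndv
FROM   information_schema.column_statistics cs,
       JSON_TABLE(cs.HISTOGRAM, '$.buckets[*]'
           COLUMNS (
               lower_bound VARCHAR(64) PATH '$[0]',
               upper_bound VARCHAR(64) PATH '$[1]',
               cum_freq    DOUBLE      PATH '$[2]',
               ndv         BIGINT      PATH '$[3]'
           )) AS b
WHERE  cs.schema_name = 'db01'
AND    cs.table_name  = 'test'
AND    cs.column_name = 'create_date';
```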

It is important to note that the histogram's last-updated time is stored in UTC, not the server's local time.

It is also important to note that histograms in MySQL 8.0 are broadly consistent with SQL Server's: both are sampled estimates over a single column. However, MySQL histograms have nothing like SQL Server's column-selectivity (density) information.
That said, density information is itself of limited use: in SQL Server a composite index keeps both fields' densities in a single statistics object, and unless both fields happen to be evenly distributed at the same time, the multi-column density is not a very meaningful reference.
This is also why estimates on composite indexes cannot be made much more accurate.
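One way to watch the histogram at work is EXPLAIN: for a predicate on a non-indexed column such as name, the filtered column of the plan should reflect the histogram-based selectivity rather than a fixed default guess. A sketch; the literal below is an arbitrary illustrative value:

```sql
-- with a histogram on name, the optimizer can estimate how many
-- rows match this range predicate even though name has no index
EXPLAIN SELECT * FROM test WHERE name < '8';
```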

Remaining questions

I wrote a little about MySQL statistics before, but that was under MySQL 5.7, where the histogram concept does not exist.

The variable that triggers statistics updates there is SET GLOBAL innodb_stats_on_metadata = 1, but in my tests it did not cause the histograms to be updated.
innodb_stats_on_metadata in MySQL 5.7 governs the index statistics, whereas histograms are a separate mechanism (in MySQL 8.0 there is no fixed relationship between histograms and indexes).
Moreover, after repeated tests, I found no obvious relationship between the bucket count and the time needed to generate the histogram; it also remains unclear how the number of buckets relates to the sampling percentage.

About resource consumption when generating histograms

Histogram generation is a resource-intensive process. While repeatedly testing histogram creation, Zabbix showed noticeable CPU usage on the server; of course, I only observed CPU usage here.

So, as good as histograms are, before deploying them at scale you should consider carefully when to perform updates and how to trigger them.
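A simple starting point, sketched below, is to read each histogram's last-updated timestamp (in UTC) from information_schema.column_statistics and re-run ANALYZE ... UPDATE HISTOGRAM for columns whose statistics look stale; the 32-bucket value is just an illustrative choice:

```sql
-- see when each histogram in the schema was last refreshed
SELECT table_name,
       column_name,
       HISTOGRAM->>'$."last-updated"' AS last_updated_utc
FROM   information_schema.column_statistics
WHERE  schema_name = 'db01';

-- refresh a stale histogram manually (e.g. from a scheduled job)
ANALYZE TABLE test UPDATE HISTOGRAM ON create_date WITH 32 BUCKETS;
```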

This is just a superficial first attempt; there are inevitably many places where my understanding falls short.


