Deep understanding of Oracle histogram ____oracle

Source: Internet
Author: User
Tags create index

A histogram is a method of classifying storage by the frequency at which data appears. In Oracle, histograms are used to describe the distribution of column data in a table. Each SQL is optimized before it is executed then the optimizer of an optimal execution plan in the optimizer should know the details of the underlying object referenced in the SQL statement .

The object that the histogram describes includes the number of different values in the column and the frequency with which they occur. It is not feasible to store each of the different values and how often it appears, especially for large tables with tens of thousands of different values. Oracle uses histograms to store useful information about the distribution of data in a column and the Oracle CBO uses histogram information to compute an optimal execution plan.

CBO and histogram histograms
Estimate the percentage of rows returned from a row source this is the selection rate, the selection rate plays an important role in CBO's query optimization. The selection rate ranges from 0 to 1. Roughly speaking, if only a small number of row records are satisfied with predicate conditions, then the CBO prefers to use index scans, If the predicate condition is to fetch a large amount of data from the table, then the CBO prefers to use full table scans. For example, the following query gets all employee information Deptno equals 10 If you return a small number of record queries, you will prefer to use an index scan:
SELECT * from EMP where deptno=10;

To evaluate the selection rate (or, in other words, to calculate the optimal execution plan), the CBO uses various forms of statistical information, configuration parameters, and so on. The CBO collects the following statistics in terms of the columns in the table:
The number of different values in the column is NDV
Minimum/maximum value in a column
Number of NULL values in the column
Data distribution or histogram information

When there is no histogram, the optimizer uses the number of different values in the columns that are recorded in the base table. Calculates statistics for the minimum/maximum value in the column and the number of NULL values in the column. Use these information optimizer to assume that the data is evenly distributed between the minimum and maximum values in the column, or that each of the different values in the column is the same number of occurrences.
Here is a list of the following. Create a test table T1 it has 10000 rows of records, two columns, and column all_distinct contains different values ranging from 1 to 10000. Column skew the values for the first 10 rows are 1 to 10, and the remaining 9990 rows are 10000.

 [Oracle@jingyong ~]$ sqlplus/as sysdba sql*plus:release 11.2.0.1.0 on Sat 4 Production 06:05:14 2014  HT (c) 1982, 2009, Oracle.


All rights reserved. Connected to:oracle Database 11g Enterprise Edition release 11.2.0.1.0-production with the partitioning, OLAP, Data Min ing and real application testing options sql> CREATE table t1 as select RowNum all_distinct,10000 skew from dual Conne CT by Level < = 10000; Table created. sql> Update T1 set skew=all_distinct where rownum< = 10; Ten rows updated.

Sql> commit;

Commit complete.

      Sql> Select Skew,count (*) from the T1 group by the skew order by skew;          SKEW COUNT (*)--------------------1 1 2 1 3 1 4
        1 5 1 6 1 7 1 8 1 9 1

Ten 1 10000 9990 rows selected. 

Using Dbms_stata.gather_table_stats to collect statistics is the syntax for generating a histogram that is controlled by a parameter method_opt is composed of several parts. The first two parts are mandatory:
For all [INDEXED | HIDDEN] COLUMNS [Size_clause]
For COLUMNS [size clause] column [Size_clause] [, column ...]

The main part of the METHOD_OPT syntax controls which this column will collect the column's statistics (min,max,ndv,nulls). The default is for all columns, which collects basic column statistics on all columns in the table, including hidden columns.

For the all indexed columns collects column statistics for only those Liegen that contain indexes.

For all hidden Columns will only collect column statistics for which virtual columns. This means that the real column does not generate column statistics when statistics are collected on the table. This value cannot be used for common statistical collection. It can only be used to create a new virtual column in a table when the statistics for the base table column are collected precisely. Use it when you collect column statistics for a new virtual column .

Note If the column is not in the Statistics collection list, then only the average length of the column is collected.

Size is used to specify the number of buckets for the histogram size {integer | REPEAT | AUTO | Skewonly}
Auto: Collect histograms based on column usage information (sys.col_usage$) and whether there is data skew
Integer: artificial Specifies the number of buckets to create a histogram is 1 to 254, if size 1 means not creating a histogram
Repeat: histograms are only regenerated for columns that already have histograms. In the case of a partitioned table, repeat ensures that the histogram is regenerated for columns that have histograms at the global level. This is not recommended. The number of buckets in the current histogram will be the maximum number of buckets used to regenerate the histogram. The number of buckets in the current histogram is 5, then the maximum number of histograms generated is 5, the straightforward point is to refresh the existing histogram of the column statistics.
Skewonly: Automatically create histograms of skew columns for any data distribution

Now to collect statistics on table T1 without creating histograms

sql> exec dbms_stats.gather_table_stats (user, ' T1 ',method_opt=> ' for all columns size 1 ');

Pl/sql procedure successfully completed.

Sql> Select column_name,num_distinct,density from User_tab_col_statistics where
  2  table_name= ' T1 ';

column_name                    num_distinct    density
----------------------------------------------------
All_ DISTINCT                          10000      . 0001
SKEW                                     11.090909091

If there is no histogram, the density statistic of the column represents its selection rate, which is calculated by 1/num_distinct=1/11=0.09090901. In the case of a histogram, Density calculations depend on the type of histogram and Oracle version. The range of density values is between 0 and 1. When a query uses this column as a predicate condition, the optimizer will use the density statistics of this column to evaluate the number of rows that will be returned. So cardinality ( Cardinality) =selectivity (selection) * Number of rows (table rows)

Here's how to evaluate the cardinality of a column in a predicate condition where the data distribution is skewed without histograms:

Sql> explain plan for SELECT * from T1 where skew=1;

explained.

Sql> select * FROM table (dbms_xplan.display); Plan_table_output--------------------------------------------------------------------------------Plan Hash value : 3617692013--------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU) |
Time |   --------------------------------------------------------------------------
| 0 |      SELECT STATEMENT |   |  909 |     6363 | 7 (15) |  00:00:01 | |* 1 | TABLE ACCESS full|   T1 |  909 |     6363 | 7 (15) |
00:00:01 | --------------------------------------------------------------------------predicate information (identified by Operation ID):---------------------------------------------------plan_table_output----------------------------- ---------------------------------------------------1-filter ("SKEW" =1) sql> explain plan for SELECT * from t1

where skew=10000; Explained.

Sql> select * FROM table (dbms_xplan.display); Plan_table_output--------------------------------------------------------------------------------Plan Hash value : 3617692013--------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU) |
Time |   --------------------------------------------------------------------------
| 0 |      SELECT STATEMENT |   |  909 |     6363 | 7 (15) |  00:00:01 | |* 1 | TABLE ACCESS full|   T1 |  909 |     6363 | 7 (15) |
00:00:01 | --------------------------------------------------------------------------predicate information (identified by Operation ID):---------------------------------------------------plan_table_output-----------------------------
 ---------------------------------------------------1-filter ("SKEW" =10000)

Because Oracle assumes that the data in the column skew is evenly distributed so the cardinality evaluates to cardinality=density*num_rows=0.09090901*10000=909.09, rounding is 909 lines. But we know skew=. 1 records have only 1 lines and skew=10000 records have 9990 rows. This assumption inevitably leads to the wrong execution plan. For example, if we create a B-tree index on a column skew, Oracle uses an index scan on the predicate skew=10000 line and returns 909 rows of records.

Sql> CREATE index Skew_idx on T1 (skew);

Index created.

sql> exec dbms_stats.gather_index_stats (user, ' skew_idx ');

Pl/sql procedure successfully completed.

Sql> explain plan for SELECT * from T1 where skew=10000;

explained.
Sql> select * FROM table (dbms_xplan.display); Plan Hash value:3994350891-------------------------------------------------------------------------------------- --
| Id | Operation | Name | Rows | Bytes | Cost (%CPU) |
Time |   ----------------------------------------------------------------------------------------
| 0 |          SELECT STATEMENT |   |  909 |     6363 | 4 (0) |   00:00:01 | |  1 | TABLE ACCESS by INDEX rowid|   T1 |  909 |     6363 | 4 (0) |   00:00:01 | |* 2 | INDEX RANGE SCAN |   Skew_idx |       909 |     | 2 (0) |
00:00:01 | ----------------------------------------------------------------------------------------Predicate information ( identified by Operation ID):---------------------------------------------------2-access ("SKEW" =10000)
 

Because we know that there is no additional information about the data distribution, the CBO assumes that the data in the column is evenly distributed between the minimum and maximum values, so the wrong execution plan is selected.

Oracle Histogram
Once you create a histogram on a column, it tells the CBO how often the data appears. So in the above example, if you create a histogram on the column skew, it tells you that the value of the skew=1 is only once, and the skew=10000 value appears 9,990 times. So it allows the optimizer to choose the optimal execution plan.

There are two types of histograms in Oracle. The first is that Oracle chooses each different value in the storage column and how often it appears. This is called the width balance histogram or the frequency histogram. This is an efficient and possible way to have a small number of different values for a column. However, it is not possible to store each of the different values and the frequency of their occurrence when the column has a large number of different values. Of course, in the case of unlimited resources (storage space and computational power and parsing time), You can store the frequency of each different value in any case to provide the final information to the optimizer. But in the real world this is impossible. So Oracle uses a highly balanced histogram to store such data. Oracle automatically determines the type of histogram to be created based on the number of different values in the column , the information described by different types of histograms is different.

Frequency histogram (frequence histograms)
The different values in the frequency histogram column are divided into the same number of buckets. Each bucket is stored in the same value, which means that the number of buckets in the frequency histogram equals the number of different values in the column. BUCKETS=NDV

The following chart represents the data distribution of the column skew. The following information can be seen from the chart:
There are 11 barrels on the x-axis, each barrel represents a different value.
The y-axis shows how often each different value appears. Skew 1 to 10 of the frequency is 1, the value of 10000 of the frequency is 9990
By looking at such information, it's easy to say how often a particular value appears.

Next, create a frequency histogram for column skew and see how the data is stored in the data dictionary view. Now use ' for column column_name size n ' to create a histogram of the specified number of buckets for the parameter method_opt.

 sql> exec dbms_stats.gather_table_stats (user, ' T1 ',method_opt=> ' for columns size 11 ');


Pl/sql procedure successfully completed. Sql> Select Column_name,endpoint_number,endpoint_value from User_tab_histograms where 2 table_name= ' T1 ' and column

_name= ' SKEW ';              column_name endpoint_number endpoint_value-----------------------------------------SKEW 1                       1 SKEW 2 2 SKEW 3 3 SKEW                       4 4 SKEW 5 5 SKEW 6 6 SKEW                       7 7 SKEW 8 8 SKEW 9 9 SKEW Ten SKEW 10000 10000 

The first statement creates a histogram with 11 buckets for the column skew. Because we know that the column skew has 11 different values. The second statement shows the histogram data stored in the data dictionary view. The information stored in the histogram depends on the number of buckets in the histogram that are less than the number of different values in the column or the equivalence can be explained differently. This means that the information stored in the histogram depends on the type of histogram that can be interpreted differently. The following explains the information that the frequency histogram represents.

Endpoint_value Displays the true column value, Endpoint_number shows the cumulative number of rows or the cumulative frequency. To calculate the frequency of a particular column value, use the Endpoint_number value associated with it to subtract its cumulative value.
For example, for a value of Endpoint_value 5, its endpoint_number is 5, the previous endpoint_number is 4, and the skew=5 record is only 5-4 = 1 rows. Similar to Endpoint_ Value 10000 is endpoint_number to 10000 its previous endpoint_number is 10, so skew=10000 's record has 10000=10=9990 line.

Use the following SQL to explain the histogram information stored in the data dictionary:

 sql> Select Endpoint_value as Column_value, 2 Endpoint_number as cummulative_frequency, 3 endpoint_number- Lag (endpoint_number,1,0) over (order by Endpoint_number) as frequency 4 from User_tab_histograms 5 where table_name

= ' T1 ' and column_name= ' SKEW ';                     Column_value cummulative_frequency FREQUENCY-------------------------------------------1                     1 1 2 2 1 3 3 1 4          4 1 5 5 1 6 6                     1 7 7 1 8 8 1 9 9 1 1 10000 10000 9990 

Storing the total or cumulative frequency in place of a single frequency in a range scan is particularly useful for predicate cardinality such as where skew< =10 is available.
now because we've created a histogram for more skew to see what the previous query is different:

Sql> Select Column_name,num_distinct,density,histogram from user_tab_col_statistics where table_name= ' T1 '; column_name num_distinct density histogram------------------------------------------------------     -------------all_distinct 10000.0001 NONE SKEW 11

.00005 FREQUENCY sql> explain plan for SELECT * from T1 where skew=10000;

explained.

Sql> select * FROM table (dbms_xplan.display); Plan_table_output---------------------------------------------------------------------------Plan Hash Value: 3617692013--------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU) |
Time |   --------------------------------------------------------------------------
| 0 |      SELECT STATEMENT |  | 9990 |     69930 | 7 (15) |  00:00:01 | |* 1 | TABLE ACCESS full|  T1 | 9990 |     69930 | 7 (15) |
00:00:01 | --------------------------------------------------------------------------predicate information (identified by



Operation ID):---------------------------------------------------1-filter ("SKEW" =10000) rows selected.

Sql> explain plan for SELECT * from T1 where skew=1;

explained.

Sql> select * FROM table (dbms_xplan.display); Plan_table_output----------------------------------------------------------------------------------------Plan Hash value:3994350891----------------------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU) |
Time |   ----------------------------------------------------------------------------------------
| 0 |          SELECT STATEMENT |     |     1 |     7 | 2 (0) |   00:00:01 | |  1 | TABLE ACCESS by INDEX rowid|     T1 |     1 |     7 | 2 (0) |   00:00:01 | |* 2 | INDEX RANGE SCAN |     Skew_idx |       1 |     | 1 (0) |
00:00:01 | ----------------------------------------------------------------------------------------Predicate information ( identified by Operation ID):---------------------------------------------------2-access ("SKEW" =1) rows Selecte

D. The optimizer now selects a full table scan for predicate skew=10000 and can accurately calculate its cardinality 9990. Note that the density of the skew column now becomes 0.00005 1/(2*num_rows) or 0.5/num_rows. A highly balanced histogram (height-balanced histograms) Oracle assigns a bucket to each of the different values in the frequency histogram, whereas the maximum number of buckets is 254, so if the columns in the table have a large number of different values (more than 254), a highly balanced histogram will be created.

. In the height-balanced histogram, because our different values exceed the number of buckets, Oracle first sorts the column data and then classifies the dataset by bucket and all other buckets contain the same number of values except that the last bucket may contain less data than the other buckets (

That's why it's called a highly egalitarian histogram. This is a separate statement used to create a highly balanced histogram. When the number of buckets requested is less than the number of different values in the column, Oracle creates a highly balanced histogram and this means that Endpoint_value and Endpoint_
Number is not the same. To explain this type of histogram information, first look at an example of a column with 23 values and 9 different values. Suppose we specify that the number of buckets for the histogram is 5, and the following chart shows how the data is stored in the histogram: based on the chart above, the following information can be obtained:
The number of buckets in the histogram is smaller than the number of different values in the column because we specify that the number of buckets for the histogram is 5, so the entire dataset is allocated to each bucket in the same size except for the last bucket (where there are only 3 values).
The endpoints in each bucket and the first in the bucket are marked because they have special meaning.
Data 3 is marked red, it is a special case its endpoint appears in multiple barrels. The chart below is another way to display the histogram: use 5 buckets with 23 values that means that every barrel except the last bucket has 3 values.There are 5 values. This is actually the way Oracle stores highly balanced histogram information in the data dictionary view. Because bucket 1 and 2 use 3 as a endpoint,oracle in order to save space will not store bucket 1. So when the bucket is merged, only a single entry is stored. Let's create a highly balanced histogram for column skew, which lets the bucket count less than the number of different values in the column 11:
Sql> Select  column_name,endpoint_number,endpoint_value from
  2    user_tab_histograms where table_name= ' T1 ' and column_name= ' SKEW ';

column_name     endpoint_number endpoint_value
-----------------------------  --------------
SKEW                         0               1
SKEW                         5           10000

Here buckets 1 to 5 are used 10000 as its endpoint so bucket 1 to 4 in order to save space is not stored. The following query can be used to show the number of buckets and its endpoinit value

Sql> SELECT Bucket_number, Max (skew) as Endpoint_value
  2   from (
  3   SELECT Skew, Ntile (5) Over ( Order by skew) as Bucket_number
  4 from   T1)
  5   GROUP by Bucket_number
  6   ORDER by bucket_ number;

Bucket_number endpoint_value
---------------------------
            1          10000
            2          10000
            3          10000
            4          10000
            5          10000

Here Ntile (5) is an analytic function that divides an ordered dataset into 5 buckets.

So in short, in a highly balanced histogram, the data is divided into buckets that contain the same data except for the last bucket. The maximum value in each bucket is recorded as endpoint_value and the minimum value in the first bucket is also recorded (bucket 0). Endpoint_ Number stands for buckets. Once the data has been recorded in the bucket, 2 types of data will be identified:
Non popular values and popular values.

Popular values are those that appear multiple times as endpoint value. For example, in the previous example, 3 is a popular value, in the example above skew 10000 is a popular Value.non Value is what is not present as endpoint values or only one occurrence of endpoint values. Popular value and non popular value are not fixed it depends on the size of the histogram bucket, Changing the size of the bucket will show different popular values.

Summary:
The number of different values in a column is less than the number of buckets of the histogram: when the number of different values is less than the number of buckets, the Endpoint_value column contains the different values themselves, and the Endpoint_number column contains the cumulative number of rows that are less than the column values. (Frequency straight Square chart)

The number of different values in a column is greater than the number of buckets in the histogram: when the number of different values is greater than the number of buckets, Endpoint_number contains the Bucekt ID and Endpoint_value is the maximum value in each bucket displayed. Bucket 0 is a special it shows the minimum value in the column ( Highly balanced straight square chart).


This article turns from: http://blog.itpub.net/26015009/viewspace-1072732/



Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.