Interpreting histogram information
From:
Interpreting histogram information (Doc ID 72539.1)
Applies to:
Oracle Database - Enterprise Edition - Version 7.3.0.0 and later
Oracle Database - Standard Edition - Version 7.3.0.0 and later
Oracle Database - Personal Edition - Version 7.3.0.0 and later
Information in this document applies to any platform.
Objective:
How the histogram information is stored and how it is interpreted.
Scope:
Other useful histogram references:
Document 1445372.1 Histograms: An Overview (10g and Above)
Details:
A histogram is a mechanism used to store details of the data in a column. This information is used by the CBO (Cost Based Optimizer) to determine the optimal access path for a query statement.
When there is no histogram, all the information the optimizer relies on is the high and low values of a column, the number of distinct values for that column, the number of null values for that column, and the total number of rows in the table.
(The high and low values of the columns are actually stored in RAW format and are therefore not particularly useful when inspected directly.) This and other information can be queried from the dictionary views.
Without a histogram, the optimizer assumes that the data is evenly distributed and, for an equality predicate, calculates the selectivity (column selectivity) as 1/NDV (Number of Distinct Values).
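As a quick illustration (a hedged sketch, using the HTAB1 example table that is built in the Setup section later in this note), the values the optimizer falls back on, and the 1/NDV selectivity derived from them, can be seen in the dictionary:

-- Sketch only: assumes the HTAB1 table from the Setup section below has been
-- created and analyzed. 1/NUM_DISTINCT is the selectivity used for an
-- equality predicate when no histogram exists.
select column_name,
       num_distinct,
       round(1 / num_distinct, 6) as equality_selectivity
from   user_tab_columns
where  table_name = 'HTAB1'
and    num_distinct > 0;

For column B this gives 1/10 = 0.1, i.e. an estimate of about 1,000 of the 10,000 rows for any equality predicate on B, regardless of how skewed the data actually is.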
When a histogram exists, more information about the distribution of the row data is available.
When the data distribution of a column is uneven (that is, the column data is highly skewed), Oracle can store a histogram for the column to give better selectivity estimates. This produces better execution plans than the standard statistics (high and low values plus the number of distinct values) alone.
In terms of implementation, we can choose to store each distinct value together with the number of rows for that value. This works well for columns with very few distinct values, and in that case a 'width balanced' histogram is used.
As the number of distinct values grows and the volume of data to store becomes too high, a different method is needed to store the histogram data. At this point height balanced histograms can be chosen.
Using the above two methods, column histograms provide an efficient and compact way to represent the data distribution. When a histogram is built, the information stored depends on whether the number of distinct values is less than or equal to the number of buckets (default 75, maximum 254), and it is interpreted differently in each case:
If the number of distinct values is less than or equal to the number of histogram buckets (up to 254 buckets), then a FREQUENCY histogram is built.
If the number of distinct values is greater than the number of histogram buckets, a HEIGHT BALANCED histogram is built. (A quick way to check which type was created is shown in the sketch below.)
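A quick way to check which kind of histogram was actually created (a sketch, not part of the original note, again using the HTAB1 example table from later in this note) is to query the HISTOGRAM column of USER_TAB_COLUMNS, which in 10g and later reports NONE, FREQUENCY or HEIGHT BALANCED:

-- Sketch only: shows the histogram type and bucket count per column.
select column_name, num_distinct, num_buckets, histogram
from   user_tab_columns
where  table_name = 'HTAB1';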
Frequency histogram
Frequency histograms use buckets to record the number of rows for each distinct value.
Height Balanced Histogram
A height balanced histogram is built by splitting the data into different buckets, with each bucket containing the same number of column values. The highest value (or endpoint) of each bucket is recorded, and the lowest value of the column is recorded in bucket number 0.
Once the data is stored in buckets, we can identify two types of values: non-popular values and popular values.
Non-popular values are those that do not occur multiple times as bucket endpoints (they appear at most once as an endpoint).
Popular values occur multiple times as bucket endpoints (they appear more than once).
We can use the popular and non-popular values to provide various statistics. Since we know how many values there are in a bucket, we can use this information to estimate the total number of rows covered by popular and non-popular values.
- The selectivity for a popular value can be obtained by calculating the proportion of bucket endpoints filled by that popular value.
- The selectivity for non-popular values can now be calculated as 1 / number of non-popular bucket endpoints, so we can be more accurate about selectivities than the original 1/NDV, because we have removed the popular values from the equation.
How histograms are used
Histograms are used to obtain better selectivity estimates for column predicates.
Where there are fewer distinct values than buckets, the selectivity is calculated directly, since we have accurate row information for each value. For the case where there are more distinct values than buckets, the following outlines how these selectivities are obtained.
Equality predicate selectivity is calculated from:
- Popular value:
  Number of buckets for the value / total number of buckets
- Non-popular value:
  Density. See:
  Document 43041.1 Query Optimizer: What is Density?
Less than < (the same principle applies for > and >=):
- All values:
  Buckets with endpoints < value / total number of buckets
(A worked example follows.)
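As a rough worked example (based on the 8-bucket HEIGHT BALANCED histogram gathered later in this note, where the value 5 fills 7 of the 8 bucket endpoints): the equality selectivity for the popular value 5 is 7/8 = 0.875, giving an estimated cardinality of roughly 0.875 * 10000 = 8750 rows, while a non-popular value such as 3 falls back to density (see Document 43041.1).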
Histogram Examples
Table HTAB1

SQL> desc htab1
 Name                    Null?    Type
 ----------------------- -------- ----------------
 A                                NUMBER(6)
 B                                NUMBER(6)
Column A contains unique values from 1 to 10000.
Column B contains 10 distinct values.
The value '5' occurs 9991 times.
The values '1, 2, 3, 4, 9996, 9997, 9998, 9999, 10000' occur only once each.
i.e.
SQL> select b, count(*) from htab1 group by b order by b;

         B   COUNT(*)
---------- ----------
         1          1
         2          1
         3          1
         4          1
         5       9991
      9996          1
      9997          1
      9998          1
      9999          1
     10000          1

10 rows selected.
There is an index on column B.
Statistics are gathered without histograms (SIZE 1 means one bucket, i.e. no histogram) using:
exec dbms_stats.gather_table_stats(NULL, 'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE 1');
Setup:
drop table htab1;
create table htab1 (a number, b number);
insert into htab1 (a, b) values (1,1);
insert into htab1 (a, b) values (2,2);
insert into htab1 (a, b) values (3,3);
insert into htab1 (a, b) values (4,4);
insert into htab1 (a, b) values (9996,9996);
insert into htab1 (a, b) values (9997,9997);
insert into htab1 (a, b) values (9998,9998);
insert into htab1 (a, b) values (9999,9999);
insert into htab1 (a, b) values (10000,10000);
commit;
begin
  for i in 5 .. 9995 loop
    insert into htab1 (a, b) values (i, 5);
    if (mod(i, 100) = 0) then
      commit;
    end if;
  end loop;
  commit;
end;
/
commit;
create index htab1_b on htab1 (b);
exec dbms_stats.gather_table_stats(NULL, 'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE 1');
alter session set optimizer_dynamic_sampling = 0;
Function to convert RAW data into numeric data:
create or replace function raw_to_number (my_input raw) return number as
  my_output number;
begin
  dbms_stats.convert_raw_value(my_input, my_output);
  return my_output;
end;
/
This results in statistics as follows:
column column_name  format a5 heading COL
column num_distinct format 99990
column low_value    format 99990
column high_value   format 99990
column density      format 99990
column num_nulls    format 99990
column num_buckets  format 99990
column sample_size  format 99990

select column_name, num_distinct, raw_to_number(low_value) low, raw_to_number(high_value) high,
       density, num_nulls, num_buckets, last_analyzed, sample_size, histogram
from   user_tab_columns
where  table_name = 'HTAB1';

COL   NUM_DISTINCT    LOW   HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ------ ------ ------- --------- ----------- -------------------- ----------- ---------
A            10000      1  10000       0         0           1 31-JAN-2013 09:32:08       10000 NONE
B               10      1  10000       0         0           1 31-JAN-2013 09:32:08       10000 NONE

select lpad(table_name,10) tab, lpad(column_name,5) col, endpoint_number, endpoint_value
from   user_histograms
where  table_name = 'HTAB1'
order  by col, endpoint_number;

TAB        COL   ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----- --------------- --------------
HTAB1      A                   0              1
HTAB1      A                   1          10000
HTAB1      B                   0              1
HTAB1      B                   1          10000
From the above you can see that the statistics gathering has not created a histogram. There is a single bucket and a low ENDPOINT_NUMBER for each column (you will always get 2 entries in USER_HISTOGRAMS for each column, for the low and high values respectively).
Test Queries:
To replicate the tests you will need to disable optimizer dynamic sampling:
alter session set optimizer_dynamic_sampling = 0;
See:
Document 336267.1 Optimizer Dynamic sampling (optimizer_dynamic_sampling)
Without histograms, both queries do an INDEX RANGE SCAN because the optimizer believes that the data is uniformly distributed in column B, and so each predicate will return 1/10th of the rows because there are 10 distinct values:
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |  1111 |  6666 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| HTAB1   |  1111 |  6666 |     5   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | HTAB1_B |  1111 |       |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------
In fact it is preferable to use a FULL TABLE SCAN for the select where b=5 and index lookups for the others.
Gathering histogram Statistics
If we collect histogram statistics with the recommended settings:
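The exact gather command is not reproduced in this translation. An assumed example that is consistent with the statistics shown below (column A keeps a single bucket, column B gets a FREQUENCY histogram) would be something like:

-- Assumption: SIZE AUTO lets Oracle decide the bucket count based on data
-- distribution and column usage; the note's original command may differ.
exec dbms_stats.gather_table_stats(NULL, 'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE AUTO');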
The b=5 query now does a FULL TABLE SCAN:
SQL> select * from htab1 where b=5;

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |  9991 | 69937 |     7   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| HTAB1 |  9991 | 69937 |     7   (0)| 00:00:01 |
---------------------------------------------------------------------------
The query where B is 3 still uses an index:
SQL> select * from htab1 where b=3;

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     1 |     7 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| HTAB1   |     1 |     7 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | HTAB1_B |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------
This is because a FREQUENCY histogram has been created:
COL   NUM_DISTINCT    LOW   HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ------ ------ ------- --------- ----------- -------------------- ----------- ---------
A            10000      1  10000       0         0           1 31-JAN-2013 09:58:01       10000 NONE
B               10      1  10000       0         0          10 31-JAN-2013 09:58:01       10000 FREQUENCY

TAB        COL   ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----- --------------- --------------
HTAB1      A                   0              1
HTAB1      A                   1          10000
HTAB1      B                   1              1
HTAB1      B                   2              2
HTAB1      B                   3              3
HTAB1      B                   4              4
HTAB1      B                9995              5
HTAB1      B                9996           9996
HTAB1      B                9997           9997
HTAB1      B                9998           9998
HTAB1      B                9999           9999
HTAB1      B               10000          10000

12 rows selected.
On column B there are now 10 buckets, matching up with the ten distinct values.
The ENDPOINT_VALUE shows the column value and the ENDPOINT_NUMBER shows the cumulative number of rows. So to get the number of rows for ENDPOINT_VALUE 2: it has an ENDPOINT_NUMBER of 2 and the previous ENDPOINT_NUMBER is 1, hence the number of rows with the value 2 is 2 - 1 = 1. Another example is ENDPOINT_VALUE 5. Its ENDPOINT_NUMBER is 9995 and the previous bucket's ENDPOINT_NUMBER is 4, so 9995 - 4 = 9991 rows contain the value 5.
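To make the cumulative arithmetic explicit, the per-value row counts can be derived from the FREQUENCY histogram with a query along these lines (a sketch using the LAG analytic function; it is not part of the original note):

-- Sketch only: subtract the previous cumulative ENDPOINT_NUMBER from the
-- current one to get the row count for each value.
select endpoint_value as column_value,
       endpoint_number
         - lag(endpoint_number, 1, 0) over (order by endpoint_number) as num_rows
from   user_histograms
where  table_name  = 'HTAB1'
and    column_name = 'B'
order  by endpoint_number;

For the value 5 this returns 9995 - 4 = 9991, matching the calculation above.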
Frequency histograms work fine with a low number of distinct values, but when the number exceeds the maximum number of buckets, a bucket cannot be created for each value. In this case the optimizer creates height balanced histograms.
Height Balanced histograms
You can demonstrate this situation by forcing the optimizer to create fewer buckets than the number of distinct values, i.e. using 8 buckets for 10 distinct values:
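The gather command is again not reproduced in this translation. An assumed call that produces the 8-bucket histogram on column B shown below would be:

-- Assumption: gather a histogram with 8 buckets on column B only, leaving
-- column A without a histogram as in the output below.
exec dbms_stats.gather_table_stats(NULL, 'HTAB1', method_opt => 'FOR COLUMNS B SIZE 8');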
So now we have gathered a HEIGHT BALANCED histogram for column B:
COL   NUM_DISTINCT    LOW   HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ------ ------ ------- --------- ----------- -------------------- ----------- ---------------
A            10000      1  10000       0         0           1 31-JAN-2013 09:58:01       10000 NONE
B               10      1  10000       0         0           8 31-JAN-2013 09:59:09       10000 HEIGHT BALANCED

TAB        COL   ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----- --------------- --------------
HTAB1      A                   0              1
HTAB1      A                   1          10000
HTAB1      B                   0              1
HTAB1      B                   7              5
HTAB1      B                   8          10000
Notice that there are now 8 buckets against column B.
Oracle puts the same number of values in each bucket and records the endpoint of each bucket.
With HEIGHT BALANCED histograms, the ENDPOINT_NUMBER is the actual bucket number and ENDPOINT_VALUE is the endpoint value of the bucket, determined by the column value.
From the above, bucket 0 holds the low value for the column.
Because buckets 1-7 have the same endpoint, Oracle does not store all these rows, to save space. In effect we have: bucket 1 with an endpoint of 5, bucket 2 with an endpoint of 5, bucket 3 with an endpoint of 5, bucket 4 with an endpoint of 5, bucket 5 with an endpoint of 5, bucket 6 with an endpoint of 5, bucket 7 with an endpoint of 5 and bucket 8 with an endpoint of 10000. So bucket 1 contains values between 1 and 5, and bucket 8 contains values between 5 and 10000.
All buckets contain the same number of values (which is why they are called height-balanced histograms), except that the last bucket may have fewer values than the other buckets.
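To tie this back to the popular-value selectivity formula above, the number of bucket endpoints spanned by each value, and the resulting cardinality estimate, can be derived with a query along these lines (a sketch, not part of the original note):

-- Sketch only: BUCKETS_FOR_VALUE is the gap in ENDPOINT_NUMBER. The estimate
-- buckets_for_value / num_buckets * num_rows is only meaningful for popular
-- values (buckets_for_value > 1); non-popular values use density instead.
select h.col_value,
       h.buckets_for_value,
       round(h.buckets_for_value / c.num_buckets * t.num_rows) as estimated_rows
from  (select endpoint_value as col_value,
              endpoint_number
                - lag(endpoint_number, 1, 0) over (order by endpoint_number) as buckets_for_value
       from   user_histograms
       where  table_name  = 'HTAB1'
       and    column_name = 'B') h,
       user_tab_columns c,
       user_tables      t
where  c.table_name  = 'HTAB1'
and    c.column_name = 'B'
and    t.table_name  = 'HTAB1'
order  by h.col_value;

Here the value 5 spans 7 of the 8 buckets, giving an estimate of roughly 7/8 * 10000 = 8750 rows.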
Storing Character Values in histograms
For character columns, Oracle only stores the first 32 bytes of any string. (There are also limits on numeric columns, but these are less frequently an issue since the majority of numbers are not large enough to encounter any problems.) See:
Document 212809.1 Limitations of the Oracle Cost Based Optimizer
Any predicate that contains strings longer than 32 characters will not use histogram information, and the selectivity will be 1 / number of distinct values. Data in histogram endpoints is normalized to double precision floating point arithmetic.
For Example
SQL> select * from example;

A
----------
a
b
c
d
e
e
e
e
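The note does not show how the EXAMPLE table was created. A hedged sketch that reproduces this data and gathers a histogram on it might be:

-- Assumption: the exact original setup is not shown in the note.
create table example (a varchar2(10));
insert into example values ('a');
insert into example values ('b');
insert into example values ('c');
insert into example values ('d');
insert into example values ('e');
insert into example values ('e');
insert into example values ('e');
insert into example values ('e');
commit;
exec dbms_stats.gather_table_stats(NULL, 'EXAMPLE', method_opt => 'FOR ALL COLUMNS SIZE 254');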
The table contains 5 distinct values: there is one occurrence each of 'a', 'b', 'c' and 'd', and there are 4 occurrences of 'e'. Once a histogram has been created, looking in USER_HISTOGRAMS:
TABLE      COL   ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----- --------------- --------------
EXAMPLE    A                   1     5.0365E+35
EXAMPLE    A                   2     5.0885E+35
EXAMPLE    A                   3     5.1404E+35
EXAMPLE    A                   4     5.1923E+35
EXAMPLE    A                   8     5.2442E+35
So:
ENDPOINT_VALUE 5.0365E+35 represents 'a'
ENDPOINT_VALUE 5.0885E+35 represents 'b'
ENDPOINT_VALUE 5.1404E+35 represents 'c'
ENDPOINT_VALUE 5.1923E+35 represents 'd'
ENDPOINT_VALUE 5.2442E+35 represents 'e'
Then, if you look at the cumulative values for ENDPOINT_NUMBER, the counts for the corresponding ENDPOINT_VALUEs are correct: 1, 1, 1, 1 and 4 occurrences respectively.
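In 10g and later, USER_HISTOGRAMS also has an ENDPOINT_ACTUAL_VALUE column which, when populated for character data, shows the readable string alongside the normalized numeric endpoint (a sketch, not discussed in the original note; depending on version and data it may be NULL):

-- Sketch only: compare the numeric endpoint with the actual character value.
select endpoint_number, endpoint_value, endpoint_actual_value
from   user_histograms
where  table_name  = 'EXAMPLE'
and    column_name = 'A'
order  by endpoint_number;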
"Translate from MOS article" to interpret histogram information