What seems like ages ago, I listed 8 things you may not have known about indexes. Although I've since written about a number of the 8 items, I've yet to address the last item listed:
8. An index can potentially be the most efficient and effective way to retrieve anything between 0% and 100% of the data from a table.
A few recent posts on OTN reminded me that perhaps it's about time I wrote something on this topic.
Generally, the question that's commonly asked is at what point, or at what percentage of data, does Oracle no longer consider the use of an index and judge the full table scan (FTS) to be the most efficient method to retrieve the data from a table.
Basically, what's the "magic number"? Is it 1% of the data, 2%, 5%, 7.5%, 15%, 42%, 50%?
The answer, unfortunately, is that there is no such magic number or percentage; it all depends. The way I often answer this question is by simply stating that I can very easily come up with a scenario where a FTS is the most cost-effective method to retrieve 1% of the data. Equally, I can very easily come up with a scenario where an index is the most cost-effective method to retrieve 99% of the data.
Like I said, there is no magic number; it depends entirely on a whole list of different factors and variables.
To start, I thought I might go through an example of how a 1% cardinality result is best achieved via a FTS, highlighting why and how the cost based optimizer (CBO) comes to such a decision.
I'll use a simple little scenario with nice simple numbers to make the mathematics nice and easy to follow.
OK, let's assume we have a table that has 10,000,000 rows. The table uses 100,000 table blocks to store these rows, so we have on average 100 rows per block. With an 8K block size, we're basically looking at a table with an average row size of about 80 bytes.
Let's say this table has an associated index with approximately 20,000 leaf blocks required to store the index entries for a particular column, and the index has a blevel of 2 (or a height of 3). This basically means we can store approximately 500 index entries per leaf block and the average index entry is about 16 bytes or so in length.
The indexed column has 100 distinct values which are evenly distributed such that each distinct value has approximately 100,000 occurrences. The column has no null values.
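For anyone who wants to check the scenario's numbers, here is a minimal sketch of the arithmetic behind them. The figures come straight from the description above; the variable names are mine, and dividing the raw 8K block size by the row count ignores block overhead, which is why the results only approximate the "about 80 bytes" and "about 16 bytes" quoted.

```python
# Scenario figures from the text; names are illustrative only.
TABLE_ROWS = 10_000_000
TABLE_BLOCKS = 100_000
LEAF_BLOCKS = 20_000
DISTINCT_VALUES = 100
BLOCK_SIZE = 8 * 1024  # 8K block, block overhead ignored (assumption)

rows_per_block = TABLE_ROWS // TABLE_BLOCKS      # average rows per table block
avg_row_bytes = BLOCK_SIZE / rows_per_block      # rough average row size

# No nulls in the column, so every row has an index entry.
entries_per_leaf = TABLE_ROWS // LEAF_BLOCKS     # average entries per leaf block
avg_entry_bytes = BLOCK_SIZE / entries_per_leaf  # rough average entry size

rows_per_value = TABLE_ROWS // DISTINCT_VALUES   # rows matching one value
selectivity = 1 / DISTINCT_VALUES                # fraction of rows for one value

print(rows_per_block, round(avg_row_bytes))
print(entries_per_leaf, round(avg_entry_bytes))
print(rows_per_value, selectivity)
```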
Let's say we write a query based on the indexed column and we're interested in just one of the possible 100 values or approximately 1% of the data in total. For example:
SELECT * FROM bowie_table WHERE code = 'abcde';
Does the CBO choose the index or does it choose the FTS?
Well, let's first cost the index access path.
We begin by reading the root block and the intermediate branch block for a cost of 2.
We also need to read approximately 1% of all the index leaf blocks in order to access all the index entries of interest. So that's 20,000 (leaf blocks) x 0.01 = 200 leaf blocks in total.
So the total cost of reading just the index is 202.
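The index-only portion of the cost can be sketched as a tiny function. This is just the simplified model used in this post (blevel plus the fraction of leaf blocks visited), not Oracle's actual costing code, and the function name is hypothetical. I use `Fraction` for the selectivity purely to keep the arithmetic exact.

```python
from fractions import Fraction
import math

def index_scan_cost(blevel, leaf_blocks, selectivity):
    """Simplified model: branch-level reads plus the leaf blocks visited."""
    return blevel + math.ceil(leaf_blocks * selectivity)

# blevel 2, 20,000 leaf blocks, 1 value out of 100
index_cost = index_scan_cost(2, 20_000, Fraction(1, 100))
print(index_cost)  # 202
```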
Next comes the interesting bit. How many of the 100,000 table blocks do we need to access in order to read just 1% of the data (i.e. 100,000 rows)?
Well, the answer depends entirely on the clustering factor of the index or, to put it another way, on how well ordered the rows in the table are in relation to the index. If the index column values of interest are all very well clustered together in the table, then we can access the required rows by visiting fewer blocks than if the index column values are evenly and randomly distributed throughout the table.
In fact, in the worst possible case scenario, if the clustering factor is appalling and has a value close to the number of rows in the table (10,000,000), we may actually need to visit each and every block in the table, as each block has an average of 100 rows and we want on average 1%, or one, of these rows from each and every table block.
In the best possible case scenario, with the column values perfectly clustered together and with a clustering factor approaching the number of blocks in the table (100,000), we may get away with only having to visit 1% of all the table blocks, or just 1,000 of them.
So the clustering factor is a crucial variable in how costly it would be to read the table via the index. The actual table access costs are therefore simply calculated as the selectivity of the query (0.01 in our case) multiplied by the clustering factor of the associated index.
In this example, the clustering factor is indeed appalling with a value of 10,000,000 and the table access costs are therefore calculated as 0.01 x 10,000,000 = 100,000.
So the total cost of using the index is 202 (for the index related costs) + 100,000 (to access the rows from the table) = 100,202 in total.
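Here's a minimal sketch of the table access part of the calculation, again using only the simplified model described above (selectivity multiplied by clustering factor). The function name is mine. For comparison I've also run the best-case clustering factor of 100,000 mentioned earlier, which is simply the same formula applied to that figure.

```python
from fractions import Fraction
import math

SELECTIVITY = Fraction(1, 100)  # 1 value out of 100, evenly distributed
INDEX_ONLY_COST = 202           # blevel 2 + 200 leaf blocks, from the text

def table_access_cost(clustering_factor, selectivity):
    """Simplified model: table block visits = selectivity x clustering factor."""
    return math.ceil(clustering_factor * selectivity)

# Worst case from the text: clustering factor close to the number of rows.
worst_total = INDEX_ONLY_COST + table_access_cost(10_000_000, SELECTIVITY)

# Best case from the text: clustering factor close to the number of blocks.
best_total = INDEX_ONLY_COST + table_access_cost(100_000, SELECTIVITY)

print(worst_total)  # 100202
print(best_total)   # 1202
```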
So what are the costs associated with the FTS?
Well, the FTS has a number of advantages over the index scan. Firstly, as Oracle needs to process all the blocks, it can retrieve all the necessary rows by reading a specific table block just the once. With the index scan, however, Oracle may need to access a specific table block multiple times in some scenarios.
Secondly, as Oracle knows it has to read each and every block, Oracle can do so with a larger "bite of the pie" each time via multiblock reads, knowing it's not wasting resources as all blocks need to be processed anyway. Index access paths perform single block I/Os, whereas a FTS can read multiple blocks per I/O. In this specific example, let's assume the effective multiblock read value is 10. Remember, we want to keep the arithmetic nice and simple...
Finally, a FTS can be performed in parallel, even if the table itself isn't partitioned, which means the overall response times can be further improved and the CBO can reduce its "costs" accordingly. In this example, we won't worry about parallel query.
So the cost of a FTS in our example is basically 1 (for the segment header) + 100,000 (table blocks) / 10 (the effective multiblock read value) = 1 + 10,000 = 10,001.
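The FTS side of the ledger, sketched the same way. This mirrors only the simplified formula above (one read for the segment header plus one I/O per multiblock read); the function name is hypothetical, and nothing here accounts for system statistics or parallel query.

```python
import math

def full_table_scan_cost(table_blocks, multiblock_read_value):
    """Simplified model: segment header read plus one I/O per multiblock read."""
    return 1 + math.ceil(table_blocks / multiblock_read_value)

fts_cost = full_table_scan_cost(100_000, 10)
print(fts_cost)  # 10001
```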
So that's roughly an overall cost of 100,202 for the index vs. 10,001 for the FTS.
The results are not even close, with the FTS winning hands down, and that's for just 1% of the data...
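To get a feel for how much this outcome hinges on the clustering factor, here's a small sketch that re-runs the same simplified model for a few clustering factor values, keeping everything else in the scenario fixed. The intermediate values (500,000 and 1,000,000) are ones I've picked purely for illustration, and the function names are hypothetical. A tie is arbitrarily reported as FTS here; that's a choice made for this sketch, not a statement about what the CBO does.

```python
from fractions import Fraction
import math

SELECTIVITY = Fraction(1, 100)  # 1 value out of 100
INDEX_ONLY_COST = 202           # blevel 2 + 200 leaf blocks, from the text
FTS_COST = 10_001               # 1 + 100,000 / 10, from the text

def index_path_cost(clustering_factor):
    """Index cost plus table visits (selectivity x clustering factor)."""
    return INDEX_ONLY_COST + math.ceil(clustering_factor * SELECTIVITY)

def cheaper_path(clustering_factor):
    """Return the cheaper access path under this simplified model."""
    return "INDEX" if index_path_cost(clustering_factor) < FTS_COST else "FTS"

for cf in (100_000, 500_000, 1_000_000, 10_000_000):
    print(f"{cf:>10,}  {index_path_cost(cf):>7,}  {cheaper_path(cf)}")

# Clustering factor at which both paths cost the same in this model.
break_even_cf = int((FTS_COST - INDEX_ONLY_COST) / SELECTIVITY)
print(break_even_cf)
```

Same table, same index, same 1% of the data, and the cheaper path flips purely on how well the rows are clustered.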
A couple of final little points for now.
Firstly, the cost of reading just 1 block (for the single block index reads) vs. 10 blocks (for the multiblock FTS reads) may actually differ somewhat, as multiblock reads are doing more "work" with their associated I/O. By default, with no parameters set and with no system statistics, the CBO will cost each I/O as being the same. More about how to possibly adjust this another time.
Also, by default the CBO will assume all associated I/Os are physical I/Os and will cost them accordingly, even if the buffer cache hit ratio (BCHR) is nice and high and the index access path in question might be accessed within (say) a nested loop join, where the likelihood of the index related I/Os in particular being cached is very high. More on this at another time as well.
But for now, just note how in this relatively trivial example, the following factors came into play when determining the potential costs of this query:
- Selectivity of the query
- Data distribution with regard to the actual occurrences of the required data
- Number of table blocks (below the high water mark)
- Number of leaf blocks
- Index height
- Average number of rows per table block
- Average number of leaf entries per leaf block
- Clustering factor
- Caching characteristics of index and table
- Effective multiblock read count
- Relative cost of single vs. multiblock I/Os
- Parallelism
All of which contribute to make any single "magic number" by which Oracle will no longer consider using an index but another fairy tale in the Oracle book of myths and folklore...