How the data in the data block is stored by the influence factors of the full scan

Source: Internet
Author: User
Tags create index
On the one hand, the percentage of rows returned is small or full scan is possible, and on the other hand, the percentage of rows returned is large, and it is possible that full scans should not be used.

Here's an example of how the optimizer chooses to scan differently, even though the percentages of the data returned are small, but the data is stored in a different way. First, you create two of the same tables with 10000 rows of data, and then give each table the execution plan for a query. Note that in this example, the query returns only 300 rows (3% of the total data) also selects a full table scan.

To illustrate, I test when returned 100 rows, the optimizer did not select full scan, but use a bitmap index scan, I will increase the amount of data returned to test.

Start testing: Table 1 data is stored sequentially, and table 2 is in random storage.
1. Table statement:
> Table 1:
--Test how the data is stored (sequentially stored)

CREATE TABLE seq_scan_store_1 (
   ID serial,sid integer,   
   name1 character varying, name2 character
   ,
   name3 character varying,
   name4 character varying, name5 character
);
--Indexing on the SID
CREATE INDEX seq_scan_store_1_sid_idx on seq_scan_store_1 USING btree (SID);

> table 2:
--Test data storage mode (random storage)
CREATE TABLE seq_scan_store_2 (
   ID serial,sid integer,
   name1 character varying, name2 character
   ,
   name3 character varying,
   name4 character varying, name5 character
); 
--Indexing on the SID
CREATE INDEX seq_scan_store_2_sid_idx on seq_scan_store_2 USING btree (SID);

2. Start inserting data into two tables: I inserted the data by writing two function.
> function1:
Create or Replace function func_seq_scan_1 () returns void as $$
       begin for
          i-1..10000 loop
              insert INTO seq_s Can_store_1 (SID,NAME1,NAME2,NAME3,NAME4,NAME5) VALUES (i/100, ' Zhansang ' | | (i/100), ' Lisi ' | | (i/100), ' Wangwu ' | | (i/100), ' Zhaoliu ' | | (i/100), ' Wangba ' | | (i/100));
          End Loop;
       End
$ $language plpgsql;
> function2:
Create or Replace function func_seq_scan_2 () returns void as $$
      begin for
      i-1..10000 loop
          insert INTO seq_s Can_store_2 (SID,NAME1,NAME2,NAME3,NAME4,NAME5) VALUES (mod (i,100), ' Zhansang ' | | (i/100), ' Lisi ' | | (i/100), ' Wangwu ' | | (i/100), ' Zhaoliu ' | | (i/100), ' Wangba ' | | (i/100));
      End Loop;
     End
$ $language plpgsql;

Note that the two function is different in the Insert Sid field.


> Start inserting data:

Select Func_seq_scan_1 (); 
Select Func_seq_scan_2 ();


How the current data is stored in the datasheet is shown in the following illustration:

Table 1: Sequential storage



Table 2: Random Storage



3. View two tables for the same query interpretation plan:

> Table 1:


The reason is that sid<3 rows are physically stored in only a few blocks of data files.


> table 2:


The reason is that sid<3 rows are physically stored in almost every block of data in the data file.

4. Summary:
The optimizer did not choose the same interpretation plan for the two same queries. Because the storage in these two tables is different, for table 1, you can get the target data only by accessing a few blocks of data, so using indexes is a good choice, but for table 2, Its data is physically stored in all the data blocks of the table, for its queries that almost need to read all of the data blocks to get the 300 rows of data we need, the optimizer calculates that using an index scan to read each block of data in a table may be more readable than using a full table scan to read all the blocks of data. Then it takes a long time to discard the unwanted data in each block directly.

In fact, using the index, to read the index block first, and then read the block, if 300 rows of data, read at least 300 index blocks, and then read 300 blocks (the following is a detailed description of why at least 300 index blocks), while the full table scan is just reading blocks of data, The optimizer can determine that the latter may take less time.

Once you know how the data is stored, you may understand why the optimizer always goes against its will and does not use indexes, but uses index scans.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.