How the data in the data block is stored by the influence factors of the full scan

Last Update:2018-07-26 Source: Internet

Author: User

Tags create index

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

On the one hand, the percentage of rows returned is small or full scan is possible, and on the other hand, the percentage of rows returned is large, and it is possible that full scans should not be used.

Here's an example of how the optimizer chooses to scan differently, even though the percentages of the data returned are small, but the data is stored in a different way. First, you create two of the same tables with 10000 rows of data, and then give each table the execution plan for a query. Note that in this example, the query returns only 300 rows (3% of the total data) also selects a full table scan.

To illustrate, I test when returned 100 rows, the optimizer did not select full scan, but use a bitmap index scan, I will increase the amount of data returned to test.

Start testing: Table 1 data is stored sequentially, and table 2 is in random storage.
1. Table statement:
> Table 1:
--Test how the data is stored (sequentially stored)

CREATE TABLE seq_scan_store_1 (
   ID serial,sid integer,   
   name1 character varying, name2 character
   ,
   name3 character varying,
   name4 character varying, name5 character
);

--Indexing on the SID

CREATE INDEX seq_scan_store_1_sid_idx on seq_scan_store_1 USING btree (SID);

> table 2:
--Test data storage mode (random storage)

CREATE TABLE seq_scan_store_2 (
   ID serial,sid integer,
   name1 character varying, name2 character
   ,
   name3 character varying,
   name4 character varying, name5 character
);

--Indexing on the SID

CREATE INDEX seq_scan_store_2_sid_idx on seq_scan_store_2 USING btree (SID);

2. Start inserting data into two tables: I inserted the data by writing two function.
> function1:

Create or Replace function func_seq_scan_1 () returns void as $$
       begin for
          i-1..10000 loop
              insert INTO seq_s Can_store_1 (SID,NAME1,NAME2,NAME3,NAME4,NAME5) VALUES (i/100, ' Zhansang ' | | (i/100), ' Lisi ' | | (i/100), ' Wangwu ' | | (i/100), ' Zhaoliu ' | | (i/100), ' Wangba ' | | (i/100));
          End Loop;
       End
$ $language plpgsql;

> function2:

Create or Replace function func_seq_scan_2 () returns void as $$
      begin for
      i-1..10000 loop
          insert INTO seq_s Can_store_2 (SID,NAME1,NAME2,NAME3,NAME4,NAME5) VALUES (mod (i,100), ' Zhansang ' | | (i/100), ' Lisi ' | | (i/100), ' Wangwu ' | | (i/100), ' Zhaoliu ' | | (i/100), ' Wangba ' | | (i/100));
      End Loop;
     End
$ $language plpgsql;

Note that the two function is different in the Insert Sid field.

> Start inserting data:

Select Func_seq_scan_1 ();

Select Func_seq_scan_2 ();

How the current data is stored in the datasheet is shown in the following illustration:

Table 1: Sequential storage

Table 2: Random Storage

3. View two tables for the same query interpretation plan:

> Table 1:

The reason is that sid<3 rows are physically stored in only a few blocks of data files.

> table 2:

The reason is that sid<3 rows are physically stored in almost every block of data in the data file.

4. Summary:
The optimizer did not choose the same interpretation plan for the two same queries. Because the storage in these two tables is different, for table 1, you can get the target data only by accessing a few blocks of data, so using indexes is a good choice, but for table 2, Its data is physically stored in all the data blocks of the table, for its queries that almost need to read all of the data blocks to get the 300 rows of data we need, the optimizer calculates that using an index scan to read each block of data in a table may be more readable than using a full table scan to read all the blocks of data. Then it takes a long time to discard the unwanted data in each block directly.

In fact, using the index, to read the index block first, and then read the block, if 300 rows of data, read at least 300 index blocks, and then read 300 blocks (the following is a detailed description of why at least 300 index blocks), while the full table scan is just reading blocks of data, The optimizer can determine that the latter may take less time.

Once you know how the data is stored, you may understand why the optimizer always goes against its will and does not use indexes, but uses index scans.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More