PostgreSQL Tutorial (10): Performance Tuning Tips


First, the use of EXPLAIN:

PostgreSQL devises a query plan for every query it receives, and choosing the plan that best matches the query structure and the properties of the data is critical for good performance. PostgreSQL includes a planner whose job is to find an optimal plan, and we can use the EXPLAIN command to see what plan the planner creates for any query.
The query plan PostgreSQL produces is a plan tree consisting of one or more plan nodes. The nodes at the bottom of the tree are table scan nodes, which return raw rows from a table; different scan node types represent different table access methods, such as sequential scans, index scans, and bitmap index scans. If the query requires joining, aggregation, sorting, or other operations on the raw rows, there will be additional nodes above the scan nodes to perform them, and since these operations can usually be done in more than one way, different node types can appear at those positions as well. EXPLAIN outputs one line of information for each node in the plan tree, showing the basic node type plus the cost estimates the planner made for executing that node. The first line (the topmost node) contains the estimated total execution cost of the plan; it is this number that the planner seeks to minimize.
Here is a simple example, as follows:


EXPLAIN SELECT * FROM tenk1;

                        QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)

The numbers quoted by EXPLAIN are:
1. Estimated start-up cost (the time expended before the output scan can begin, e.g. the time spent sorting in a sort node).
2. Estimated total cost.
3. Estimated number of rows output by this plan node.
4. Estimated average width (in bytes) of the rows output by this plan node.
Costs are measured in units of disk page fetches; a value of 1.0 represents one sequential disk page read. The cost of an upper-level node includes the cost of all its child nodes. The rows value is not the number of rows the node processes or scans, but the number it emits, which is usually less. The row count estimated for the topmost node should be close to the number of rows the query actually returns.
Now let's run the following query against the system catalog pg_class:

SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
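In the example being discussed the output would look roughly like this (these are the figures quoted in the next paragraph):

 relpages | reltuples
----------+-----------
      358 |     10000
(1 row)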

From the query result you can see that tenk1 occupies 358 disk pages and contains 10,000 rows, but to reproduce the cost estimate we still need to know the value of another system parameter:

postgres=# SHOW cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)

Cost = 358 (disk pages, each charged as one sequential page fetch of 1.0) + 10000 (rows) * 0.01 (cpu_tuple_cost) = 458
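The per-page factor of 1.0 is itself a configurable parameter on reasonably recent PostgreSQL versions (an assumption about the server you are running); it can be inspected the same way:

postgres=# SHOW seq_page_cost;
 seq_page_cost
---------------
 1
(1 row)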

Let's look at a query plan with a where condition.

EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 7000;

                        QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=7033 width=244)
   Filter: (unique1 < 7000)

The EXPLAIN output shows the WHERE clause being applied as a "Filter" condition attached to the Seq Scan node: the node scans every row of the table, checks each one against the condition, and outputs only the rows that pass. The estimated number of output rows has been reduced because of the WHERE clause, but the scan still has to visit all 10,000 rows, so the cost has not gone down; in fact it has gone up slightly (from 458.00 to 483.00) to reflect the extra CPU time spent checking the condition.
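Where those extra 25 cost units come from can be reconstructed, assuming the default cpu_operator_cost of 0.0025 (the parameter that charges for each operator evaluation):

-- a back-of-the-envelope check of the 483.00 estimate
--    358  pages * 1.0    (sequential page fetches)           = 358
-- + 10000  rows * 0.01   (cpu_tuple_cost)                    = 100
-- + 10000  rows * 0.0025 (cpu_operator_cost, assumed default) =  25
--                                                      total  = 483
SHOW cpu_operator_cost;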
These numbers are only estimates, and they may change slightly after each run of the ANALYZE command, because the statistics ANALYZE produces are computed from a random sample of the table.
If we make the condition above more restrictive, we get a different query plan, such as:

EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=2.37..232.35 rows=106 width=244)
   Recheck Cond: (unique1 < 100)
   ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
         Index Cond: (unique1 < 100)


Here the planner has decided on a two-step plan: the bottom plan node visits an index to find the locations of the rows matching the index condition, and the upper plan node then fetches those rows from the table itself. Fetching rows individually is much more expensive than reading them sequentially, but because not all of the table's disk pages have to be visited, this is still cheaper than a sequential scan. The reason for the two plan levels is that the upper node sorts the row locations produced by the index into physical order before reading them, which minimizes the cost of the separate page fetches. The "bitmap" mentioned in the node names is the mechanism that does the sorting.

Now let's make the WHERE condition even more restrictive, such as:


EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 3;

                                   QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using tenk1_unique1 on tenk1  (cost=0.00..10.00 rows=2 width=244)
   Index Cond: (unique1 < 3)


In this plan the table rows are fetched in index order, which makes each of them more expensive to read, but so few rows are expected that the extra step of sorting the row locations into physical order would not pay off.
Now let's add another condition to the WHERE clause, such as:

EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 3 AND stringu1 = 'xxx';

                                   QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using tenk1_unique1 on tenk1  (cost=0.00..10.01 rows=1 width=244)
   Index Cond: (unique1 < 3)
   Filter: (stringu1 = 'xxx'::name)

The added condition stringu1 = 'xxx' reduces the estimated number of output rows, but not the cost, because we still have to visit the same set of rows. Notice that it is not treated as an index condition (the index is only on unique1); instead it is applied as a filter on the rows retrieved via the index.
If several of the columns referenced in the WHERE clause have indexes on them, the planner may use an AND or OR combination of the indexes, such as:

EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=11.27..49.11 rows=11 width=244)
   Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
   ->  BitmapAnd  (cost=11.27..11.27 rows=11 width=0)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
               Index Cond: (unique1 < 100)
         ->  Bitmap Index Scan on tenk1_unique2  (cost=0.00..8.65 rows=1042 width=0)
               Index Cond: (unique2 > 9000)

This plan has to visit both indexes, so it is not necessarily a win compared to using just one index and treating the other condition as a filter.
Now let's look at the query plan for a join on indexed columns, such as:

EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=2.37..553.11 rows=106 width=488)
   ->  Bitmap Heap Scan on tenk1 t1  (cost=2.37..232.35 rows=106 width=244)
         Recheck Cond: (unique1 < 100)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
               Index Cond: (unique1 < 100)
   ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..3.01 rows=1 width=244)
         Index Cond: ("outer".unique2 = t2.unique2)

From the plan (Nested Loop) you can see that the query uses a nested loop join. The outer scan is the same bitmap scan we saw before, so its cost and row count are unchanged: the unique1 < 100 clause is applied at that node, while the t1.unique2 = t2.unique2 clause has no effect there and therefore does not change the outer scan's row count. For the inner scan, however, the unique2 value of the current outer row is plugged into the inner index scan, producing an index condition of the form t2.unique2 = constant. So the inner scan gets essentially the same plan and cost as, say, EXPLAIN SELECT * FROM tenk2 WHERE unique2 = 42. The cost of the loop node is then set on the basis of the cost of the outer scan, plus one repetition of the inner scan for each outer row (here 106 * 3.01), plus a little CPU time for join processing.
If we do not want the planner to use a nested loop for this query, we can disable that plan type with the following setting, such as:

SET enable_nestloop = off;
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Hash Join  (cost=232.61..741.67 rows=106 width=488)
   Hash Cond: ("outer".unique2 = "inner".unique2)
   ->  Seq Scan on tenk2 t2  (cost=0.00..458.00 rows=10000 width=244)
   ->  Hash  (cost=232.35..232.35 rows=106 width=244)
         ->  Bitmap Heap Scan on tenk1 t1  (cost=2.37..232.35 rows=106 width=244)
               Recheck Cond: (unique1 < 100)
               ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
                     Index Cond: (unique1 < 100)

This plan still uses the same index scan to pull the expected 100 or so qualifying rows out of tenk1 and store them in an in-memory hash table; it then does a sequential scan of tenk2, probing the hash table for each tenk2 row to look for possible matches of t1.unique2 = t2.unique2. Reading tenk1 and building the hash table make up the entire start-up cost of this hash join, because no output rows can be produced until we begin reading tenk2.

In addition, we can use the EXPLAIN ANALYZE command to check the accuracy of the planner's estimates. This command actually executes the query and then displays the true elapsed time accumulated within each plan node, alongside the estimated costs that a plain EXPLAIN shows, such as:


EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=2.37..553.11 rows=106 width=488) (actual time=1.392..12.700 rows=100 loops=1)
   ->  Bitmap Heap Scan on tenk1 t1  (cost=2.37..232.35 rows=106 width=244) (actual time=0.878..2.367 rows=100 loops=1)
         Recheck Cond: (unique1 < 100)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0) (actual time=0.546..0.546 rows=100 loops=1)
               Index Cond: (unique1 < 100)
   ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..3.01 rows=1 width=244) (actual time=0.067..0.078 rows=1 loops=100)
         Index Cond: ("outer".unique2 = t2.unique2)
 Total runtime: 14.452 ms

Note that the "actual time" value is measured in milliseconds that are real, and that "cost" estimates are calculated in terms of the number of disk page reads, so they are likely to be inconsistent. What we need to focus on, however, is whether the ratios of the two sets of data are consistent.

In some query plans a subplan node may well be executed more than once; in the nested loop plan above, for example, the inner index scan is executed once per outer row. In such cases the "loops" value reports the total number of executions of the node, and the actual time and rows shown are averages per execution. This is done so that the actual figures remain comparable with the cost estimates. To get the total time spent in such a node, multiply the displayed values by the "loops" value.
The "Total runtime" shown by EXPLAIN ANALYZE includes executor start-up and shut-down time, as well as the time spent processing the result rows, but it does not include parsing, rewriting, or planning time.
Finally, keep in mind that EXPLAIN results obtained on a small test table do not necessarily carry over to a large production table. The planner's cost calculations are not linear, so it may well choose a different plan for a larger or smaller table. An extreme example is a table that occupies only one disk page: on such a table you will almost always get a sequential scan plan, whether or not there is a usable index, because the planner knows it will take one disk page read to process the table in any case, so there is no value in spending additional page reads to consult an index.

Second, bulk data insertion:

There are several ways to optimize bulk inserts of data.

1. Turn off autocommit:

When bulk-inserting data, if each row is committed individually under autocommit, then in the event of a system failure not only can the consistency of the bulk load not be guaranteed, but the many separate commits also severely hurt overall insert performance. The fix is to turn off autocommit: issue a BEGIN TRANSACTION command before the inserts start, and issue a single COMMIT after all the insert operations have completed.
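A minimal sketch of this pattern; the table and values are placeholders used only for illustration:

BEGIN TRANSACTION;                       -- one explicit transaction for the whole batch
INSERT INTO my_table (id, name) VALUES (1, 'one');
INSERT INTO my_table (id, name) VALUES (2, 'two');
-- ... many more INSERT statements ...
COMMIT;                                  -- a single commit at the end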

2. Use COPY:

Use COPY to load all the rows in one command, instead of a series of INSERT commands. The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but incurs far less overhead for large data loads. Because COPY is a single command, there is no need to turn off autocommit when using it to populate a table.
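For example, assuming a data file /tmp/my_table.csv and the placeholder table used above, the load could look like this (the file path is read by the server process, so it must be accessible on the database server):

COPY my_table (id, name) FROM '/tmp/my_table.csv' WITH CSV;

-- from psql, the client-side equivalent reads the file on the client machine:
-- \copy my_table (id, name) FROM '/tmp/my_table.csv' WITH CSV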

3. Remove indexes:

If you are loading a freshly created table, the fastest approach is to create the table, bulk-load its data with COPY, and only then create whatever indexes the table needs, roughly as sketched below. Building an index on existing data is faster than maintaining it incrementally as each row is inserted. Of course, while the indexes are missing, the performance of other queries on the table will suffer, and uniqueness constraints enforced by unique indexes will not be checked.
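A sketch of that order of operations, again with placeholder names:

CREATE TABLE my_table (id integer, name text);

COPY my_table (id, name) FROM '/tmp/my_table.csv' WITH CSV;   -- bulk load first

CREATE INDEX my_table_name_idx ON my_table (name);            -- build indexes afterwards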

4. Remove foreign key constraints:
As with indexes, a foreign key constraint can be checked "in bulk" more efficiently than row by row. So we can drop the foreign key constraints first, load the data, and then re-create the constraints.
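Sketched with hypothetical table and constraint names:

ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;

COPY orders FROM '/tmp/orders.csv' WITH CSV;

ALTER TABLE orders
    ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);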

5. Increase maintenance_work_mem:
Temporarily increasing the maintenance_work_mem configuration parameter while loading large amounts of data can improve performance. A larger setting speeds up CREATE INDEX and ALTER TABLE ADD FOREIGN KEY commands, which helps if you follow the two tips above; it does not do much for the COPY operation itself.
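For example (the value shown is purely illustrative; size it to the memory actually available on your server):

SET maintenance_work_mem = '512MB';   -- affects only the current session
CREATE INDEX my_table_name_idx ON my_table (name);
RESET maintenance_work_mem;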

6. Increase checkpoint_segments:
Temporarily increasing the checkpoint_segments configuration parameter can also speed up large data loads. Loading a large amount of data into PostgreSQL causes checkpoints to occur more often than the normal checkpoint frequency (set by checkpoint_timeout), and at every checkpoint all dirty data must be flushed to disk. By temporarily raising checkpoint_segments you can effectively reduce the number of checkpoints during the load.
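checkpoint_segments is set in postgresql.conf and takes effect after a configuration reload; the value below is only an illustration. (Note that on PostgreSQL 9.5 and later this parameter no longer exists and was replaced by max_wal_size.)

# postgresql.conf -- reload the server configuration afterwards, e.g. with pg_ctl reload
checkpoint_segments = 32        # default is 3; raise temporarily for the bulk load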

7. Run ANALYZE afterwards:
After adding or updating a large amount of data, run the ANALYZE command right away, so that the planner has up-to-date statistics for the table. If there are no statistics, or the statistics are stale, the planner may choose a poor query plan, and queries against the table will perform badly.
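For example, right after the load (my_table is again a placeholder):

ANALYZE my_table;   -- refresh planner statistics for just this table
-- or run ANALYZE with no argument to re-analyze every table in the database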
