PostgreSQL CPU Full (100%) Performance Analysis and Optimization
Transferred from: https://help.aliyun.com/knowledge_detail/43562.html
In database operations, a DBA often runs into an urgent problem: CPU utilization suddenly hits 100% and the business stalls. When the CPU is full, it is usually necessary to start from the backend database, trace back to the specific SQL, and finally locate the problem at the business layer. The following describes a concrete approach to this problem.
Check changes in the number of connections
When CPU utilization reaches 100%, the first suspicion is that active connections have spiked during a business peak and the database does not have enough spare resources to handle them. We need to check whether the number of active connections at the time of the problem is much higher than usual. For RDS for PG, the number of connections to the database can be seen in the monitoring information in the console. The current number of active connections can also be obtained directly from the database with the following query:
select count(*) from pg_stat_activity where state not like '%idle';
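As a rough reference point, the total connection count can also be compared with the configured connection limit (a minimal sketch; what counts as abnormal depends on your own workload):

show max_connections;
select count(*) as total, count(*) filter (where state not like '%idle') as active
from pg_stat_activity;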
Track Slow SQL
If the change in the number of active connections is within the normal range, then there is a high probability that poorly performing SQL was being executed in large volume at that time. Because RDS keeps slow SQL logs, we can use those logs to locate the time-consuming SQL for further analysis. But usually when the problem occurs, the whole system stalls, all SQL slows down, and a large number of slow SQL statements may be recorded, which makes it hard to find the culprit. Here we introduce several ways to track down the slow SQL when a problem occurs.
1. The first approach is to use the pg_stat_statements plugin to locate slow SQL; the steps are as follows.
1.1. If the plugin has not been created yet, create it manually. We want to make use of the statistics collected by the plugin and the database system (such as accumulated SQL execution time); this information is cumulative and includes history. To make it easier to troubleshoot the current CPU-full problem, we first reset the counters.
create extension pg_stat_statements;
select pg_stat_reset();
select pg_stat_statements_reset();
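Note that pg_stat_statements only collects statistics when it is listed in shared_preload_libraries (changing this requires a server restart; on RDS for PG it is usually preconfigured). You can verify the setting with:

show shared_preload_libraries;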
1.2. Wait for a while (for example, 1 minute) so that the counters accumulate enough information.
1.3. Query the most time-consuming SQL (which is typically the direct cause of the problem).
select * from pg_stat_statements order by total_time desc limit 5;
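A narrower projection is often easier to read. The sketch below assumes a pg_stat_statements version that still exposes total_time (in milliseconds); on PostgreSQL 13 and later the column is named total_exec_time:

select query, calls, total_time, total_time / calls as avg_time_ms
from pg_stat_statements
order by total_time desc
limit 5;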
1.4. Query the SQL that reads the buffer the most times. Such SQL may be scanning data that has no index, resulting in excessive buffer reads and high CPU consumption.
select * from pg_stat_statements order by shared_blks_hit+shared_blks_read desc limit 5;
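The same view can also be used to compute a per-statement buffer hit ratio, which helps distinguish statements that read a lot from cache from those that read a lot from disk (a sketch using the standard pg_stat_statements buffer columns):

select query, calls, shared_blks_hit, shared_blks_read,
       round(shared_blks_hit::numeric
             / nullif(shared_blks_hit + shared_blks_read, 0), 2) as hit_ratio
from pg_stat_statements
order by shared_blks_hit + shared_blks_read desc
limit 5;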
2. The second approach is to use the following query against the pg_stat_activity view to find SQL that has been executing for a long time and never finishes. These SQL statements are directly suspected of causing the CPU to be full.
select datname, usename, client_addr, application_name, state,
       backend_start, xact_start, xact_stay, query_start, query_stay,
       replace(query, chr(10), ' ') as query
from (
    select pgsa.datname as datname,
           pgsa.usename as usename,
           pgsa.client_addr as client_addr,
           pgsa.application_name as application_name,
           pgsa.state as state,
           pgsa.backend_start as backend_start,
           pgsa.xact_start as xact_start,
           extract(epoch from (now() - pgsa.xact_start)) as xact_stay,
           pgsa.query_start as query_start,
           extract(epoch from (now() - pgsa.query_start)) as query_stay,
           pgsa.query as query
    from pg_stat_activity as pgsa
    where pgsa.state != 'idle'
      and pgsa.state != 'idle in transaction'
      and pgsa.state != 'idle in transaction (aborted)'
) idleconnections
order by query_stay desc
limit 5;
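If only a quick look is needed, a shorter variant (a minimal sketch over the same pg_stat_activity columns) is often enough:

select pid, usename, datname, state,
       now() - xact_start  as xact_stay,
       now() - query_start as query_stay,
       replace(query, chr(10), ' ') as query
from pg_stat_activity
where state not in ('idle', 'idle in transaction', 'idle in transaction (aborted)')
order by query_stay desc nulls last
limit 5;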
3. The third method is to look for tables with missing indexes, starting from the table scan (sequential scan) statistics on the data tables. If a table is missing an index and most of the hot data is in memory (for example, 8 GB of memory with 6 GB of hot data), the database can only do table scans and has to churn through a large number of irrelevant records in memory, which is CPU-intensive. In particular, for tables with a large number of records (such as more than 100,000 rows, as in the query below), a single table scan consumes a lot of CPU (basically saturating one core), and with many connections (for example, several hundred), the CPU is quickly exhausted.
3.1. Use the following query to find the tables with the most sequential scans:
select * from pg_stat_user_tables where n_live_tup > 100000 and seq_scan > 0 order by seq_tup_read desc limit 10;
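Comparing sequential and index scan counters side by side makes missing indexes easier to spot (a sketch over the same pg_stat_user_tables view):

select relname, n_live_tup, seq_scan, seq_tup_read, idx_scan
from pg_stat_user_tables
where n_live_tup > 100000 and seq_scan > 0
order by seq_tup_read desc
limit 10;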
3.2. Query the currently running slow queries that access the tables found above (the 10-second threshold in the query below can be adjusted):
select * from pg_stat_activity where query ilike '%<table name>%' and now() - query_start > interval '10 seconds';
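Before adding an index, it is also worth checking which indexes the suspect table already has. pg_indexes is a standard catalog view; 'orders' below is a hypothetical table name:

select indexname, indexdef from pg_indexes where tablename = 'orders';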
3.3. You can also locate queries that involve these tables through the pg_stat_statements plugin:
select * from pg_stat_statements where query ilike '%<table>%' order by shared_blks_hit+shared_blks_read desc limit 3;
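For example, with a hypothetical table name orders substituted for <table>, a narrower projection might look like:

select query, calls, shared_blks_hit, shared_blks_read
from pg_stat_statements
where query ilike '%orders%'
order by shared_blks_hit + shared_blks_read desc
limit 3;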
Working with Slow SQL
For the slow SQL found with the methods above, the most urgent step is to cancel or kill it so that the business can recover first:
select pg_cancel_backend(pid) from pg_stat_activity where query like '%<query text>%' and pid != pg_backend_pid();
select pg_terminate_backend(pid) from pg_stat_activity where query like '%<query text>%' and pid != pg_backend_pid();
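If many backends are stuck on the same pattern, the pid list can also be selected by runtime rather than by query text (a sketch; the 5-minute threshold is an arbitrary example to adjust for your workload):

select pg_cancel_backend(pid)
from pg_stat_activity
where state = 'active'
  and now() - query_start > interval '5 minutes'
  and pid != pg_backend_pid();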
If these SQL statements are genuinely needed by the business, they need to be optimized. There are a few common approaches:
1. Run ANALYZE <table> or VACUUM ANALYZE <table> on the tables involved in the query to update their statistics, which makes the query plan more accurate. Note that, to avoid impacting the business, it is best to run this during off-peak hours.
2. Run explain <query text> or explain (buffers true, analyze true, verbose true) <query text> to view the SQL execution plan (note that the former does not actually execute the SQL, while the latter actually executes it and provides detailed execution information), and create indexes on the tables where table scans appear (see the sketch after this list).
3. Rewriting the SQL, such as removing unnecessary subqueries, rewriting with UNION ALL, or fixing the join order with explicit JOIN clauses, is a way to optimize the SQL further; it is not covered in depth here.
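As a combined sketch of points 1 and 2, using a hypothetical table orders and column customer_id (substitute your own table, columns, and slow query):

-- 1. refresh table statistics, preferably during off-peak hours
vacuum analyze orders;

-- 2. inspect the plan; the first form only plans, the second actually runs the query
explain select * from orders where customer_id = 42;
explain (buffers true, analyze true, verbose true) select * from orders where customer_id = 42;

-- if the plan shows a sequential scan on a selective predicate, create an index
-- (concurrently avoids blocking writes, but cannot run inside a transaction block)
create index concurrently idx_orders_customer_id on orders (customer_id);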