PostgreSQL performance optimization http://blog.sina.com.cn/s/blog_681cd80d0100md5o.html
Every time I read the settings of postgresql, it seems that I forgot it every time. It will take a while. I finally decided to start my blog today.
PostgreSQL optimization and Maintenance
1. Hardware
The most important thing about databases is I/O. So everything starts with I/O.
Raid: Needless to say, the database is placed on raid 10, and the read-only backup database can be set to RAID 0. It doesn't matter if it goes down. Remember: the database is random read.
RAID card selection:
Raid cards must be equipped with batteries before they can be written into the cache with power (bbu). raid will return the hard disk write success (no need to wait)
1. Areca
2. LSI (true LSI, re-brand do not)
3. HP P400 and above series
Hard Disk selection:
First Choice is SAS: 15 k rpm each SAS can provide about 25 Mbit/s of random write. That is to say, in the raid10 setting, four hard disks are required if 50 MB/S random write is required.
The Frugal choice is: SATA can use several hard disks (one times the number of SAS) to achieve a speed close to SAS in raid 10. Even if Sata is bought twice as much as SAS, the price is still cheaper than SAS.
You can also buy a product, such as Compaq's MSA 70 (P800 battery backed raid control)
CPU: 64-bit
Cache: The bigger the cache, the better. (currently, my PC has 3 MB of cache)
Core: the more cores, the better (PostgreSQL runs the CPU after all). We recommend that you have at least 4 cores.
Ram: at least 4 GB. Usually use 16-64 GB RAM
2. OS (system)
Available systems:
1. Debian stable
2. centos
3. Ubuntu lts
4. Red Hat
5. SuSE Enterprise
If you plan to pay for the service, you can choose the three options: Canonical, Novell, and RedHat.
If you are not buying any services, you can use Debian, centos, and Ubuntu lts.
Here, I still think that the system uses Red Hat (centos if you don't pay for it). After all, people are enterprise-level Big Brother and cannot go wrong.
* Now centos can also buy services.
Unavailable system: for example, fedora (RedHat QA) ubuntu (non-lts)
Scheduler:
Added GRUB: Elevator = deadline
The RedHat icon shows that deadline is the best choice for databases.
File System)
The options are: ext2, ext3, and ext4. Why do we only consider these? Because the database is still the first stable, kernel developers do less about the file system in theory.
Wal: Put ext2 because Wal itself has journal and does not need to use ext3 (ext2 is much faster)
Data: ext3
Block Size: Postgres itself is an 8 K block size. Therefore, the file system uses an 8 K block size. In this way, the system performance can be improved.
Ext4: the release time is not long enough.
Partition)
Postgres and system OS should be in different partitions
System (OS): the system should be placed in an independent RAID 1.
Database (ipvs data): the database should be placed on an independent raid 10. If raid has a battery, Set Data = writeback when mounting.
For independent database partitions, you are not allowed to record the file time (all data is stored). Therefore, the noatime option is required during the Mount process, which can save the Update Time (timestamp).
Wal log (xlogs): the log on the independent RAID 1 (ext2 System) is sequential write, so the general hard disk (SATA) speed is enough, there is no need to waste SAS on the log
PostgreSQL log (logs): it can be directly thrown to syslog. It is best to set a separate file name in syslog. conf. For example, use local2 for PostgreSQL.
Local2. *-/var/log/S/Postgres. Log
Remember to log to async so that it will not be stuck on the log I/O, and remember to set logrotate and create path)
Ext2 vs ext3 performance test:
HP dl585
4 dual core 8222 Processors
64 GB RAM
(2) msa70 direct attached storage arrays.
25 spindles in each array (RAID 10)
HP P800 Controller
6 disk in raid 10 on Embedded Controller
Xlog with ext3: AVG = 87418.44 KB/sec
Xlog with ext2: AVG = 115375.34 KB/sec
3. Memory (memory usage)
Shared buffer cache
Working memory
Maintenance memory
Shared Buffers
Fixed memory to be reached during ipvs startup. Each allocation is 8 K. Ipvs does not directly perform hard disk read/write, but puts the items in the hard disk into the shared buffers, then changes the shared buffers, and flush to the hard disk.
Generally, shared buffers is set to about 25%-40% of the memory (available memory.
In the system (OS), remember to set the kernel. shmmax value (/etc/sysctl. conf)
Kernel. shmmax determines the maximum number of shared memory that can be called by the process. The simple calculation method is
Kernel. shmmax = ipvs shared_buffers + 32 MB
Enough Space (or out of memory) should be reserved. PostgreSQL uses some other memory in addition to the shared buffer, such as max_connections and max_locks_pre_transaction.
Working memory
This is the memory required for running tasks in ipvs, such as hashed (aggregates, hash joins) sort (order by, distinct, etc.) in the memory, it can be ensured that ipvs can complete these tasks completely in the memory without spitting data back to the hard disk for swap. However, if it is set to too large, the memory used by ipvs will be larger than the memory of the actual machine. At this time, it will go to the hard disk swap. (Reduced performance)
Working memory is the setting of per connection and per sort. Therefore, be very careful when setting. For example, if you set working memory to 32 MB, the following example is as follows:
Select * from lines, lineitems
Where lines. lineid = lineitems. lineid
And lineid = 6
Order by Baz;
64 MB memory can be used here.
Hashjoin between lines and lineitems (32 MB)
Order by Baz (32 MB)
Pay attention to the number of queries in which order by or join is used.
If there are 100 connections at the same time, 100 connection x 64 MB = 6400 MB (6 GB) memory
Generally, the working mem should not be too large, 2-4 MB is enough.
In Versions later than ipvs 8.3, working mem can be set in Query
Query:
Begin;
Set work_mem to '128mb ';
Select * From Foo order by bar;
Insert into Foo values ('bar ');
Reset work_mem;
Commit;
Function:
Create Function return_foo () returns setof text
$ Select * From Foo order by bar; $
Set work_mem to '128mb'
Language 'SQL'
Postgres is not officially recommended (but supported) to change work_mem In the PostgreSQL. conf file and then HUP (the database should have no interruptions)
Use explain analyze to check whether enough work_mem exists.
Sort (cost = 0. 02 .. 0.03 rows = 1 width = 0) (actual time = 2270. 744 .. 22588.341 rows = 1000000 loops = 1)
Sort key: (generate_series (1, 1000000 ))
Sort method: External merge Disk: 13696kb
-> Result (cost = 0. 00 .. 0.01 rows = 1 width = 0) (actual time = 0. 006 .. 144.720 rows = 1000000 loops = 1)
Total runtime: 3009.218 MS
(5 rows)
The above query analysis shows that 13 MB of data needs to be taken from the hard disk. Therefore, this query should be set work_mem to 16 MB to ensure performance.
Maintenance memory (memory maintenance)
Maintenance_work_mem determines the memory size that can be called during system maintenance.
This can also be set at any time in the query.
This memory is only used in vacuum, create index, reindex, and other system maintenance commands. System maintenance means that calling the hard disk swap greatly reduces the system performance. Generally, when maintenance_work_mem exceeds 1 GB, there is no actual performance increase (if the memory is sufficient, it is set to 1 GB)
Background writer (bgwriter)
Function:
Responsible for regularly writing dirty shared buffers in the shared buffer cache
Benefits:
A. Reduce the system flush shared buffers to the hard disk (already done by bgwriter)
B. the I/O burst is not displayed in the checkpoint, because dirty buffers has been flushed into the hard disk in the background.
Disadvantages:
Because flush disk is always behind the scenes, you will see how to add the average hard disk I/O (I/O surge better than checkpoint)
Settings:
Bgwriter_delay:
Sleep between rounds. Default 200 (adjusted based on machine and data)
Bgwriter_lru_maxpages:
Determine how much data is written by bgwriter each time. If the actual data is greater than the setting, the remaining data will be completed by the Postgres process. The data written by the server Porcess itself may cause certain performance degradation. To determine that all data is written by bgwriter, set the value here to-1.
Bgwriter_lru_multiplier:
The calculation method is used to determine how much data should be written by bgwriter. Here we can keep the built-in 2.0.
Calculate the I/O of bgwriter:
1000/bgwriter_delay * bgwriter_lru_maxpages * 8192 = actual I/O
(8192 is the 8 K block of ipvs)
For example:
1000/200*100*8192 = 4096000 = 4000 KB
Bgwrater can be monitored using pg_stat_bgwriter. To observe the running status of bgwrater, clear the old stat information first.
If bgwriter is set too large (too many tasks), it will affect the front-end performance (server). However, if the system (server) is used as the buffer flush, it will also affect the performance. So the best setting here is to find an optimal balance point by observing pg_stat_bgwriter.
WAL (write ahead log)
All write actions in ipvs are written to Wal before execution. This ensures data accuracy and integrity. When the database crashes midway through, Postgres can recover to the status before the crash through Wal without any data errors.
WAL is written back to the hard disk in two cases.
1. Commit. When commit data is written, WAL is forcibly written back to the hard disk (flush) and all the items before the commit are flush in Wal.
2. The wal writer process will write back regularly.
Fsync vs async
The default function of ipvs is fsync. That is to say, ipvs will wait for the data to be written to the hard disk before returning a successful signal to the query. If you set sync = No to disable fsync, s will not wait for Wal to write the hard disk, and directly return the query success. Generally, this will increase the performance by 15-25%.
However, if the system crashes (power-off, Postgres), you may lose the last transcation. However, this will not cause data structure problems in your system. (No data packet upt) if it is acceptable to lose 1-2 pieces of data in case of system problems, the performance improvement of 25% is considerable.
Wal settings:
Fsync can be set to on or off.
Wal_sync_method:
In Linux, fdatasync is used. Others... I don't know. I should check the system's file parameters.
Full_page_writes:
When it is enabled, Postgres writes each disk page to Wal for the first change to the page after the checkpoint. In this way, the page is only half written when the system is powered off. Enabling this option ensures the integrity of the page image.
When it is closed, performance will increase. Especially when RAID cards with batteries are used, the risk is lower. This option is the underlying risk in exchange for performance. You can disable it.
Wal_buffers:
The storage size of Wal. The default value is 64 KB. Experiments show that setting this value between 256 kb and 1 MB will improve the performance.
Wal_writer_delay
The interval at which Wal checks Wal data (write-back. The value is Millisecond (milliseconds)
Checkpoints
Make sure that the data is written back to the hard disk. Dirty data page will be flushed back to the hard disk.
Checkpoint is triggered by the following three conditions (if bgwriter is set, it will help to write data in the background, so there will be no short-term high I/O checkpoint)
1. Arrive at the set Wal segments
2. The specified timeout is reached.
3. the user issues the checkpoint command
If the checkpoint running frequency is higher than the checkpint_warning value. S will be recorded in the log. By observing the log, you can determine the setting of checkpoint_segments.
The addition of cehckpoint_segments or checkpoint_timeout can improve the efficiency. The only downside is that if the system crashes, it takes a little more time to reply when the system restarts (the database cannot be used during the system startup reply period). In view of the fact that ipvs rarely fails, this can be set to a very long time (one day)
Settings:
The maximum number of Wal logs in checkpoint_segments. When it is reached, the checkpoint will be triggered, which is usually set to 30.
Checkpoint_timeout is usually set to 15-20 minutes. It does not matter if it can be set to 1 day.
Checkpoint_completion_target. The built-in value is 0.5, which means that each checkpoint is expected to be completed within half of the time before the next checkpoint is completed (it sounds a bit confusing)
Checkpoint_warning if the checkpint speed is faster than this time, it is recorded in the log. Built-in: 30 seconds
The perfect setting in theory is that your backend never writes back to the hard disk. Everything is written by the background. This requires the adjustment of bgwriter, checkpoints and Wal to an optimal balance. Of course, this is the perfect ideal... Continue. Haha
4. Maintenance-keep the smile of Postgres
Database maintenance is required. Basic Maintenance
Vacuum
When deleting data, the database only records this data and does not really delete the data. Therefore, vacuum is required at this time. vacuum will clear the data marked as "not. Note that vacuum does not clear the index. When the data changes more than 75%, you need to re-create the index. Postgres 8.4 index can be rebuilt using cluster much faster. In ipvs. X, vacuum = cluster is no different (the cluster is retained to be compatible with earlier commands)
Full vacuum
This will implement the exclusive lock. The difference between vacuum and full vacuum is that vacuum marks the space marked as "not" as available for reuse (recycling), while full vacuum deletes the space (returned to the system OS) therefore, after vacuum, the occupied space displayed by your ipvs on the hard disk will not be reduced, but full vacuum will reduce the occupied space of the hard disk. It is not recommended to use full vacuum. The first is unnecessary, and the second exclusive lock is not fun.
Analyze
Analyze will update the statistical information (Statistics) of all the query best solutions, and SQL prepared statement is determined by this statistical information. Therefore, after a certain amount of data changes in the database (for example, more than 10%), analyze should be performed. Strictly speaking, this is something that should be done frequently and is part of the normal maintenance of the database. Another important thing is that after upload data (Restore Type) is completed, remember to perform analyze (the restore will not automatically Do It For You)
When creating a new table, or adding an index to the table, or reindex the table, or restore data into the database, you need to manually run analyze. Analyze directly affects default_statistics_target data.
Autovacuum
According to official documents of Postgres, autovacuum became more practical after 8.3 (launched in 8.1) Because before 8.3, autovacuum can only make one table in one database at a time. Versions later than 8.3 can be used as multiple databases and multiple tables.
Set
Log_autovacuum_min_duration:
-1 is disabled. 0 indicates all logs.> 0 indicates that logs are logged out of this time. For example, if it is set to 30, all logs that exceed 30 ms will be recorded.
Autovacuum_max_workers:
The autovacuum process that is enabled at the same time. Generally, do not set it to too high. You can set it to 3.
Autovacuum_naptime:
During database check, the default value is 1 minute and no changes are required.
Autovacuum_vacuum_threshold:
Autovacuum is triggered only when there are at least N rows of records. That is, if the data change is lower than this value, autovacuum will not run. Default is 50
Autovacuum_analyze_threshold:
The lowest value of the analyze operation, which is the same as above
Autovacuum_vacuum_scale_factor:
The formula for calculating the percentage in the table (vacuum if the percentage exceeds a certain percentage) is 20% (0.2)
Autovacuum_analyze_scale_factor:
Same as above, but it is set by analyze.
Autovacuum_freeze_max_age:
Autovacuum starting from Max Xid
Autovacuum_vacuum_cost_delay:
Delay .. If the system loads other things, vacuum can be slowed down to ensure the operation of other things. Here, delay is used to limit
Autovacuum_vacuum_cost_limit:
In the same way, it also imposes restrictions. Here, we use cost to restrict limit.
Cluster
Cluster is similar to vacuum full. We recommend that you use cluster instead of vacuum full. Like vacuum full, cluster will overwrite the table and remove all dead row. The exclusive lock is also required.
Truncate
Turncat deletes all data in a table and does not cause any dead row (delete will cause dead row). turncate can also be used to recreate a table.
Begin;
Lock Foo in access exclusive mode;
Create Table bar as select * From Foo;
Turncate Foo;
Insert into Foo (select * from bar );
Commit;
In this way, the table foo is cleared again.
Reindex
Re-create Index
5. Others
Planner:
Statistics directly determines the planner result. When using planner, make sure that statistics is accurate (analyze)
Default_statistics_target:
Set the analyze analysis value. This can be set and changed at any time in the query.
Set default_statistics_target to 100;
Analyze verbose mytable;
Info: analyzing "aweber_shoggoth.mytable"
Info: "mytable": Scanned 30000 of 1448084 pages, containing 1355449 live rows and 0 dead rows; 30000 rows in sample, 65426800 estimated total rows
Analyze
Set default_statistics_target to 300;
Analyze verbose mytable;
Info: analyzing "aweber_shoggoth.mytable"
Info: "mytable": Scanned 90000 of 1448084 pages, containing 4066431 live rows and 137 dead rows; 90000 rows in sample, 65428152 estimated total rows
Analyze
Set statistics per column set different statistics values for different columns
Alter table Foo alter column bar set statistics 120
When to add statistics
Run a query as expain analyze.
This will show the example:
-> Seq scan on bar (cost = 0.00-52.00 rows = 52 width = 2 (actual time = 0. 007 .. 1.894 rows = 3600 loops = 1)
The number of rows here should be similar to the actual number of rows.
Seq_page_cost
The cost when the planner performs Sequential Scan. The default value is 1. If the memory, cache, and shared buffer are set correctly. So the default value is too low and can be increased.
Random_page_cost
Planner is the value of random page fetch. The default value is 4.0. If the memory, cache, and shared buffer are set correctly, this value is too high and can be reduced.
Seq_page_cost and random_page_cost can be set to the same value. Then test the performance and appropriately reduce the value of random_page_cost.
Cpu_operator_cost
The default value is 0.0025. the test value is usually set to 0.5.
Set cpu_operator_cost to 0.5;
Explain analyze select ....
Cpu_tuple_cost
The default value is 0.01, which is usually set to 0.5.
Set cpu_tuple_cost to 0.5;
Explain analyze select...
Effective_cache
Should be followed as much as possible to the system free acceptable size (the larger the better)
Total used free shared buffer cached
Mem: XXXX yyyyy ZZZ AAAA BBBB CCCC
The calculation method is as follows:
Required tive_cache = cached x 50% + shared
Here, 50% can be adjusted between 40% and 70% Based on the server's busy degree.
Monitoring Method:
Explain analyze;
Set inclutive_cache_size = new value;
Explain analyze;
Reset inclutive_cache_size;
Try out the most suitable value, and you can change the PostgreSQL. conf file to a fixed value.
Natural vs primary key
The primary key consumes about 20% of the performance compared with natural because it needs to be joined. So we try our best to make primary on the natural key.
Btree vs hash
Btree is faster than hash, so do not use Hash
Gin vs gist