Hardware Performance Tuning of Open-Source PostgreSQL (1)
PostgreSQL is an object-relational database developed by a group of developers collaborating over the Internet. It is open source and can serve as a replacement for commercial databases such as Oracle and Informix.
PostgreSQL was originally developed at the University of California, Berkeley. In 1996, a group of developers began maintaining the database over the Internet, using email to share ideas and file servers to share code. PostgreSQL is now comparable to commercial databases in functionality, performance, and reliability. It supports transactions, views, stored procedures, and referential integrity constraints, and it offers many programming interfaces, including ODBC, Java (JDBC), Tcl/Tk, PHP, Perl, and Python. With its pool of Internet developers, PostgreSQL has enormous room for growth.
Performance Concepts

There are two ways to optimize database performance. One is to improve the database's use of the CPU, memory, and disk; the other is to optimize the queries sent to the database. This article discusses the hardware side. Query optimization is done with SQL commands such as CREATE INDEX, VACUUM, VACUUM FULL, ANALYZE, CLUSTER, and EXPLAIN; these are covered in my book PostgreSQL: Introduction and Concepts, http://momjian.us/main/writings/pgsql/aw_pgsql_book.
To understand hardware performance issues, it helps to understand what happens inside a computer. Simply put, a computer can be thought of as a central processing unit (CPU) surrounded by storage. On the same chip as the CPU are registers, which hold intermediate results, pointers, and counters. Surrounding the CPU is its cache, which holds the most recently accessed information. Beyond the CPU cache is a large amount of random-access memory (RAM), which holds running programs and data. Beyond RAM sits the hard disk, which stores even more information. The hard disk is the only area that stores information permanently, so it is the only place where data survives after the computer is shut down. In summary, these are the storage areas surrounding the CPU:
Storage area      Typical capacity
CPU registers     bytes
CPU cache         kilobytes
RAM               megabytes
Hard disk         gigabytes
Notice that storage capacity increases as distance from the CPU increases. Ideally, a huge amount of permanent storage would sit right next to the CPU, but that would be slow and expensive. In practice, the most frequently used information is kept near the CPU, and less frequently used information is kept farther away and brought to the CPU when needed.
Keeping Data Close to the CPU
Moving data between storage areas happens automatically. Compilers determine which data is kept in registers. CPU chip logic determines which data is kept in the CPU cache. The operating system controls which data is kept in RAM and swapped with the disk.
Database administrators have little control over the CPU registers and cache. To improve database performance, the realistic lever is to increase the amount of useful data held in memory, thereby reducing disk access.
This may sound easy, but it is not, because a computer's memory holds many things:
Running programs
Program data and stacks
The PostgreSQL shared buffer cache
The kernel disk buffer cache
Proper performance tuning increases the amount of database information in memory without adversely affecting the rest of the system.
PostgreSQL Shared Buffer Cache
PostgreSQL does not access the disk directly. Instead, all database activity goes through the PostgreSQL shared buffer cache; PostgreSQL backend processes read and write data blocks between this cache and the disk.
When a backend needs a table block, it first checks the shared buffer cache. If the block is already there, processing continues immediately. If not, the block is requested from the operating system, which supplies it either from the kernel disk cache or by reading it from the disk itself. Either way, the request is expensive compared with a cache hit.
By default, PostgreSQL allocates 1000 shared buffers, each 8 kilobytes in size. Increasing the number of buffers raises the likelihood that backends find the blocks they need in the cache, avoiding expensive requests to the operating system. The number of buffers can be set with a postmaster command-line flag or with the shared_buffers value in the configuration file postgresql.conf.
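The memory footprint of the shared buffer cache is simple arithmetic; a minimal sketch (not PostgreSQL source code; the 8 kB block size is the classic default, but it is a compile-time setting and may differ in a given build):

```python
# Back-of-the-envelope estimate of RAM consumed by the shared buffer cache.
BLOCK_SIZE = 8 * 1024  # bytes per shared buffer (assumed default block size)

def shared_buffer_bytes(shared_buffers: int, block_size: int = BLOCK_SIZE) -> int:
    """Total memory used by `shared_buffers` buffers of `block_size` bytes each."""
    return shared_buffers * block_size

# The default of 1000 buffers consumes only about 8 MB:
print(shared_buffer_bytes(1000))  # 8192000 bytes, roughly 7.8 MB
```

This makes clear why the default is conservative: on a machine with hundreds of megabytes of RAM, 1000 buffers is a small fraction of available memory.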
How Big Is Too Big?

You might be tempted to give all of your memory to the PostgreSQL shared buffer cache. If you do, the kernel and other programs will have no memory left. The ideal shared buffer cache size is therefore as large as possible without adversely affecting the rest of the system.
To understand these adverse effects, you need to understand how UNIX manages memory. If there were enough memory to hold all programs and data, no management would be needed. The problem is that memory is limited, so the kernel pages parts of it out to disk; this is what is called swapping. The principle is to move memory pages that have not been used recently out to the swap area on disk, an operation called a swap pageout. Pageouts are not themselves a problem, because they happen during periods of program inactivity. The problem arises when those pages must be brought back in, a swap pagein: the page has to be moved from the swap area back into memory, and the program that needs it must suspend execution until the pagein completes.
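Swap activity can also be checked programmatically; a minimal sketch, assuming the standard Linux /proc/vmstat format, where pswpin and pswpout count swap pageins and pageouts (the same activity vmstat and sar report):

```python
# Parse swap pagein/pageout counters from /proc/vmstat-style text.
# A pure function over the text makes the logic easy to test without
# depending on the operating system.
def swap_counters(vmstat_text: str) -> dict:
    """Return the pswpin/pswpout counters found in /proc/vmstat output."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name in ("pswpin", "pswpout"):
            counters[name] = int(value)
    return counters

sample = "pgpgin 12345\npgpgout 6789\npswpin 3\npswpout 42\n"
print(swap_counters(sample))  # {'pswpin': 3, 'pswpout': 42}
```

In practice you would read the file twice, some seconds apart, and look at the deltas: a steadily increasing pswpin is the warning sign, since pageins stall running programs.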
Swap activity can be monitored with system analysis tools such as vmstat and sar, which show whether there is enough memory for the system to operate comfortably. Be careful not to confuse swap pageins with ordinary pageins: ordinary pageins read pages from the filesystem as part of normal system operation. If your tools cannot display swap pageins specifically, heavy swap pageout activity is also a sign, since pages swapped out will likely need to be paged back in.
Effects of Cache Size

You may wonder why cache size matters so much. First, imagine that the PostgreSQL shared buffer cache is large enough to hold an entire table. Repeated sequential scans of that table require no disk activity, because all the data is already in the cache. Now suppose the cache is one block smaller than the table. A sequential scan loads every block into the cache until the last one; when that last block is needed, the oldest block, the first block of the table, is evicted. When the next sequential scan starts, the first block is no longer in the cache, and loading it evicts the now-oldest block, the second block, which is exactly the block the scan needs next. This continues to the end of the table, so every access of every subsequent scan misses the cache. This is an extreme example, but it shows that shrinking the cache by a single block can change its efficiency from 100% to 0%, and that finding the right cache size can dramatically affect performance.
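The thrashing effect described above can be demonstrated with a small simulation; a sketch using a plain least-recently-used cache (an assumption for illustration; PostgreSQL's actual buffer replacement policy has varied across versions):

```python
from collections import OrderedDict

def scan_hit_rate(table_blocks: int, cache_blocks: int, scans: int = 2) -> float:
    """Sequentially scan a table of `table_blocks` blocks through an LRU
    cache holding `cache_blocks` blocks; return the hit rate of the last scan."""
    cache = OrderedDict()  # block id -> None, kept in least-recently-used order
    hits = misses = 0
    for scan in range(scans):
        if scan == scans - 1:        # only count the final scan
            hits = misses = 0
        for block in range(table_blocks):
            if block in cache:
                hits += 1
                cache.move_to_end(block)       # mark as most recently used
            else:
                misses += 1
                cache[block] = None
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hits / (hits + misses)

# A table that fits entirely in the cache is re-scanned at a 100% hit rate;
# a table just one block larger thrashes the LRU cache down to 0%.
print(scan_hit_rate(table_blocks=100, cache_blocks=100))  # 1.0
print(scan_hit_rate(table_blocks=101, cache_blocks=100))  # 0.0
```

The single-block difference between the two calls reproduces the 100%-to-0% cliff described in the text.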
Proper Sizing of the Shared Buffer Cache

Ideally, the PostgreSQL shared buffer cache will be:
Large enough to hold commonly accessed tables
Small enough to avoid swap pagein activity
Keep in mind that the postmaster allocates all shared memory when it starts, and this area stays the same size even if no one is accessing the database. Some operating systems page out unreferenced shared memory, while others lock shared memory into RAM; locked shared memory is preferred, because the shared buffer cache then never contributes to swapping. The PostgreSQL Administrator's Guide has information about kernel configuration for various operating systems: http://developer.postgresql.org/docs/developers/kernel-resources.html.