PostgreSQL source analysis of shared buffer and disk files

Last Update:2015-08-05 Source: Internet

Author: User

Tags postgresql

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

We know that the information in the PostgreSQL database is ultimately written to the persistent device. So how does PostgreSQL store the information organization on disk? Bruce Momjian has a slide "insider PostgreSQL Shared Memory", inside the picture is very intuitive description, shared buffer,page, the relationship between disk files, please see. The next few blogs, from different levels, tell about the memory associated with PostgreSQL storage:

The bottom-left corner is the page's organization. PostgreSQL 8K is a page that writes relation corresponding disk file from share buffer, or reads 8K to shared buffer from the relation corresponding disk file. The shared buffers is a set of 8K pages that act as a cache. For the relation of the database, a record (item or tuple), different size, does not occupy exactly 8K of space, probably only dozens of bytes, so how to store multiple records into the 8K shared buffer, which is the form of the page, I'll introduce it in another blog post.
For Linux we know that reading a file will first read the contents of the disk into memory, write the file will first write to the cache, the cache is marked as dirty, at the appropriate time to write to disk. For this unfamiliar, you can read one of my previous blog file and page cache some things, PostgreSQL shared buffers to relation file in disk is equivalent to the Linux page cache to file In disk.

To view/Set the shared buffers size:
First of all, PostgreSQL buffers how big, how many 8KB buffers, of course, this is configurable, we can see the configuration by the following methods:

Show Shared_buffers

Or:

Select Name,unit,setting,current_setting (name) from pg_settings where name = ' Shared_buffers ';

The above is the view, how to modify it? You need to modify the configuration file postgresql.conf:

[Email protected]:/usr/pgdata# cat postgresql.conf | grep ^shared_buffers
shared_buffers = 24MB # min 128kB

We can change the shared_buffers to a different value, as to how much value is reasonable, depending on your hardware environment, such as your hardware is very strong, 16GB of memory, then this value set to 24MB is too stingy. As for the shared buffers is reasonable, there are many words on the internet, some say that the amount of memory 10%~15%, some say the total memory of 25%, fortunately PostgreSQL provides some performance measure tools, So that we can monitor the performance of PostgreSQL operation, we can adjust the size of this shared buffers according to the performance statistics of PostgreSQL.
However, shared buffer is allocated as shared memory, and if the value configured in the configuration file exceeds the operating system's maximum limit for share memory, the PostgreSQL initialization fails. For example, I will postgresql.conf in shared_buffers = 64MB, resulting in a startup failure as shown in:

The reason is kernel Shmmax max is only 32MB, below I view and modify to 512MB

After you've changed, you can start PostgreSQL, and we can see that Shared_buffers has become 64MB:

manu_db=# show Shared_buffers;
Shared_buffers
----------------
64MB
(1 row)

Simple content is over, we need to drill down into the principles of Code Analysis Shared buffers, how to organize memory, how to allocate it, how to page replacement, Find the answers in the source code. Detailed content, I intend to introduce in the next post, because the principle part itself will have a lot of content, will inevitably lead to my this article is relatively long. The rest of this article is about how the in-memory shared buffer learns about the corresponding disk file. Because the 8K content in the shared buffer will eventually sync to the disk file. PostgreSQL corresponds to a shared buffer in memory and a file on the disk.

shared buffer correspondence to relation disk File
First diagram of this article, The upper part is about the structure of the shared buffer, which is divided into two parts
1 buffer,n 8K blocks of red fruit, each of which holds a 8K of content read from the relation corresponding disk file.
&NBSP;2 manages the structure of buffer, also n, there are several buffer, there are several management structures. of Course, the management structure occupies much less memory space than the red fruit buffer, otherwise the memory utilization is too low.
This is the time to initialize, allocate space for these two parts:

Bufferdescriptors = (BUFFERDESC *)
Shmeminitstruct ("Buffer descriptors",
Nbuffers * sizeof (BUFFERDESC), &founddescs);
Bufferblocks = (char *)
Shmeminitstruct ("Buffer Blocks",
Nbuffers * (Size) Blcksz, &FOUNDBUFS);

This management buffer structure called BUFFERDESC, my IQ is not high, but also know that the corresponding buffer is not used, corresponding to which disk file of the 8K block, in order to deal with concurrency, there will definitely be a lock. Let's look at the definition of this structure:

typedef struct SBUFDESC
{
Buffertag tag; /* ID of page contained in buffer */
Bufflags flags; /* See bit definitions above */
UInt16 Usage_count; /* Usage counter for Clock sweep code */
unsigned refcount; /* # backends holding pins on buffer */
int wait_backend_pid; /* Backend PID of pin-count waiter */
slock_t Buf_hdr_lock; /* Protects the above fields * *
int buf_id; /* buffer ' s index number (from 0) */
int freenext; /* link in freelist chain */
Lwlockid Io_in_progress_lock; /* To-wait for I/O to complete */
Lwlockid Content_lock; /* To lock access to buffer contents */
} Bufferdesc;

OK, we return to the question of our initial relationship, the current shared buffer and which db, which Table,which type (later interpreted type), which file which 8KB block corresponds. The tag field of the first Buffertag type determines the corresponding relationship:

typedef enum FORKNUMBER
{
Invalidforknumber =-1,
Main_forknum = 0,
Fsm_forknum,
Visibilitymap_forknum,
Init_forknum
/*
* Note:if Add a new fork, change max_forknum below and update the
* Forknames Array in catalog.c
*/
} Forknumber;
typedef struct RELFILENODE
{
Oid Spcnode; /* tablespace */
Oid Dbnode; /* Database */
Oid Relnode; /* Relation */
} Relfilenode;
/*
* Buffer tag identifies which disk block the buffer contains.
*
* Note:the Buffertag data must is sufficient to determine where to write the
* block, without reference to Pg_class or pg_tablespace entries. It ' s
* Possible that the backend flushing the buffer doesn ' t even believe the
* Relation is visible yet (it XACT may has started before the XACT that
* created the rel). The storage Manager must is able to cope anyway.
*
* Note:if there ' s any pad bytes in the struct, Init_buffertag would have
* To was fixed to zero them, since this struct is used as a hash key.
*/
typedef struct BUFTAG
{
Relfilenode Rnode; /* Physical Relation identifier */
Forknumber Forknum;
Blocknumber Blocknum; /* Blknum relative to begin of Reln */
} Buffertag;

We can see the rnode in Buffertag, which is characterized by which relation. The type of the Rnode is the Relfilenode type, including the database space/database/relation, from top to bottom level three structure, the only one relation that determines the PostgreSQL. For relation, there is not only one type of disk file,

-RW-------1 Manu Manu 270336 June 3 21:31 11785
-RW-------1 Manu Manu 24576 June 3 21:31 11785_FSM
-RW-------1 Manu Manu 8192 June 3 21:31 11785_VM

As shown in Figure 11785 corresponds to a relation, but there are three types of disk space, including two files for FSM and VM suffixes. Let's take a look at Forknumber's notes:

/*
* The physical storage of a relation consists of one or more forks. The
* Main fork is all created, but addition to that there can be
* Additional forks for storing various metadata. Forknumber is used when
* We need to refer to a specific fork in a relation.
*/

Main_forknum type always exists, but some relation still exist fsm_forknum and visibilitymap_forknum two kinds of files, these two kinds of I know now unknown, I will not nonsense.
Let's take it slow, let's put down Blocknum this member variable, step too big easy to pull the egg, we first find the disk corresponding file according to Rnode+forknum?
The thing to look for in a disk file is RELPATH This macro is implemented by calling Relpathbackend:

char *
Relpathbackend (Relfilenode Rnode, Backendid backend, Forknumber forknum)
{
if (Rnode.spcnode = = globaltablespace_oid)
{
...
}
else if (Rnode.spcnode = =defaulttablespace_oid)
{
Pathlen = 5 + oidchars + 1 + oidchars + 1 + forknamechars + 1;
Path = (char *) palloc (Pathlen);
if (forknum! = main_forknum)
snprintf (Path, Pathlen, "base/%u/%u_%s",
Rnode.dbnode, Rnode.relnode,
Forknames[forknum]);
Else
snprintf (Path, Pathlen, "base/%u/%u",
Rnode.dbnode, Rnode.relnode);
}
Else
{
...
}
}

Because we are pg_default, so we go defaulttablespace_oid this branch. Determines that we are in the base directory, the OID of the db (that is, Buffertag->rnode->dbnode) is 16384 determines the Base/16384/,buffertag->rnode->relnode + Buffertag->forknum decided whether it was base/16384/16385 or BASE/16384/16385_FSM or BASE/16384/16385_VM.

Find the basic end of the file, however, some of the relation is larger, more records, will cause the disk file is very large, in order to prevent the file system to the disk file size limit caused by the write failure, PostgreSQL did a segmented mechanism. Take my friends as an example, if the record continues to insert, the last friends corresponding disk file 16385 is getting larger, when more than 1G, PostgreSQL will create a new disk file called 16385.1, more than 2G when PostgreSQL again segmented , create a new file 16385.2. This 1G is a block size = 8KB and blocks per segment of large relation=128k (each) jointly determined.

The definition in the source code has a comment that explains a lot of things:

/* Relseg_size is the maximum number of blocks allowed in one disk file. Thus,
The maximum size of a single file is relseg_size * BLCKSZ; Relations bigger
than that is divided into multiple files. Relseg_size * Blcksz must be
Less than your OS ' limit on file size. This is often 2 GB or 4GB in a
32-bit operating system, unless you has large file support enabled. By
Default, we make the limit 1 GB to avoid any possible integer-overflow
Problems within the OS. A limit smaller than necessary only means we divide
A large relation into more chunks than necessary, so it seems best to err
In the direction of a small limit. A power-of-2 value is recommended to
Save a few cycles in MD.C, but was not absolutely required. changing
Relseg_size requires an initdb. */
#define Relseg_size 131072

Of course, the value of this 128K is the default value, we compile the PostgreSQL phase configure, you can specify other values through--with-segsize, but this I did not try.
Considering the segment, the real disk file name FullPath is on the horizon:
If segmented, after the name obtained by RelPath, add the segment number Segno, if the segment number is 0, then FullPath is the RelPath.

Static char *
_mdfd_segpath (smgrrelation Reln, Forknumber forknum, Blocknumber segno)
{
Char *path,
*fullpath;
Path = RelPath (Reln->smgr_rnode, forknum);
if (Segno > 0)
{
/* Is sure we have enough space for the '. Segno ' * *
FullPath = (char *) palloc (strlen (path) + 12);
sprintf (FullPath, "%s.%u", Path, Segno);
Pfree (path);
}
Else
FullPath = path;
return fullpath;
}

How do you judge Segno? This is too easy, (buffertag->rnode->blocknum/relseg_size).
OK, with the corresponding relationship between the 8K block in the shared buffer and the relation disk file, we can safely tell the contents of the shared buffer. Tragedy Ah, the article wrote for a long time.
Reference documents:
1 PostgreSQL Performance Tuning
2 PostgreSQL 9.1.9 Source Code
3 Bruce Momjian's insider PostgreSQL shared memory

PostgreSQL source analysis of shared buffer and disk files

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More