PostgreSQL Source Code Analysis: Shared Buffers and Disk Files

Last updated: 2015-08-05 Source: Internet


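As we know, the data in a PostgreSQL database must eventually be written to persistent storage. So how does PostgreSQL organize that data on disk? Bruce Momjian has a slide deck, "Inside PostgreSQL Shared Memory", with a diagram that shows very intuitively how shared buffers, pages, and disk files relate to one another. Over the next few posts I will cover PostgreSQL's storage-related memory from several angles.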

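The lower-left corner of that diagram shows how a page is organized. PostgreSQL uses 8 KB pages: a page is written from shared buffers to the relation's disk file, or 8 KB is read from the relation's disk file into shared buffers. Shared buffers is a set of 8 KB pages that acts as a cache. For a relation, a record (an Item, also called a Tuple) varies in size and will not happen to fill exactly 8 KB; it may be only a few dozen bytes. How multiple records are packed into one 8 KB shared buffer is the page layout, which I will cover in another post.
On Linux, as we know, reading a file first brings the on-disk contents into memory, and writing a file first writes to the cache, marks the cached page dirty, and flushes it to disk at a suitable time. If you are not familiar with this, see my earlier post on files and the page cache. In PostgreSQL, shared buffers is to a relation's file on disk what the Linux page cache is to a file on disk.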

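Viewing and setting the size of shared buffers:
The first thing to ask is how large shared buffers is in PostgreSQL, that is, how many 8 KB buffers there are. This is configurable, and we can view the current setting as follows: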

show shared_buffers

Or:

select name, unit, setting, current_setting(name) from pg_settings where name = 'shared_buffers';

That covers viewing. To change the value, edit the configuration file postgresql.conf:

/usr/pgdata# cat postgresql.conf | grep ^shared_buffers
shared_buffers = 24MB # min 128kB

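We can change shared_buffers to another value. What value is reasonable depends on your hardware: if your machine is powerful, say with 16 GB of memory, then 24MB is far too stingy. There are many opinions online about the right size, some saying 10% to 15% of total memory and others 25%. Fortunately PostgreSQL provides performance measurement tools that let us monitor how a running PostgreSQL performs, so in practice we can raise or lower shared_buffers based on PostgreSQL's performance statistics.
There is a catch, though. Shared buffers is allocated as shared memory, so if the configured value exceeds the operating system's limit on shared memory, PostgreSQL fails to initialize. For example, when I set shared_buffers = 64MB in postgresql.conf, startup failed.

The cause was that the kernel's SHMMAX was only 32 MB, so I checked it and raised it to 512 MB.

After that change PostgreSQL starts, and we can see that shared_buffers is now 64MB: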

manu_db=# show shared_buffers ;
shared_buffers
----------------
64MB
(1 row)
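
As a quick sanity check on these numbers, a shared_buffers setting can be converted into a count of 8 KB buffers. This is only a minimal sketch that assumes the default block size of 8192 bytes; buffers_for_bytes is a hypothetical helper name, not a PostgreSQL function.

```c
#include <stdint.h>

/* Assumed default PostgreSQL block size (BLCKSZ), in bytes. */
#define BLCKSZ 8192

/*
 * Hypothetical helper: number of whole BLCKSZ-sized buffers that fit
 * in a shared_buffers setting expressed in bytes.
 */
uint64_t
buffers_for_bytes(uint64_t shared_buffers_bytes)
{
    return shared_buffers_bytes / BLCKSZ;
}
```

Under that assumption, 24MB works out to 3072 buffers and 64MB to 8192 buffers.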

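That is the end of the simple part. To understand how shared buffers works (how memory is organized, how it is allocated, how page replacement is done), we need to dig into the source code. I plan to cover those details in the next post, because the internals alone are a lot of material and would make this post too long. In the rest of this post I want to explain how a shared buffer in memory knows which disk file it corresponds to, since the 8 KB contents of a shared buffer are eventually synced to a disk file. How does PostgreSQL map a shared buffer in memory to a particular file on disk?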

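The mapping between shared buffers and a relation's disk files
The upper half of the first diagram in this post shows the structure of shared buffers, which has two parts:
1 The raw buffers: N 8 KB blocks, each holding the contents of some 8 KB block read from a relation's disk file.
2 The structures that manage the buffers, also N of them: one management structure per buffer. Of course, the management structures take far less memory than the raw buffers; otherwise memory utilization would be too low.
At initialization, space is allocated for the two parts: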

BufferDescriptors = (BufferDesc *)
    ShmemInitStruct("Buffer Descriptors",
                    NBuffers * sizeof(BufferDesc), &foundDescs);
BufferBlocks = (char *)
    ShmemInitStruct("Buffer Blocks",
                    NBuffers * (Size) BLCKSZ, &foundBufs);
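
The relationship between the two arrays can be sketched in a few lines. This is a simplified illustration rather than the PostgreSQL code: MiniDesc, MiniPool, pool_init, and pool_block are made-up names, and plain calloc stands in for ShmemInitStruct. The point is that descriptor i and block i are tied together only by the index.

```c
#include <stdlib.h>

#define BLCKSZ 8192   /* assumed default block size, in bytes */

/* Hypothetical, stripped-down descriptor: only the index is kept. */
typedef struct MiniDesc
{
    int buf_id;       /* index of the matching block, from 0 */
} MiniDesc;

typedef struct MiniPool
{
    int       nbuffers;
    MiniDesc *descs;  /* nbuffers small management structs */
    char     *blocks; /* nbuffers * BLCKSZ bytes of page data */
} MiniPool;

/* Allocate both arrays; returns 0 on success, -1 on failure. */
int
pool_init(MiniPool *pool, int nbuffers)
{
    pool->nbuffers = nbuffers;
    pool->descs = calloc((size_t) nbuffers, sizeof(MiniDesc));
    pool->blocks = calloc((size_t) nbuffers, BLCKSZ);
    if (pool->descs == NULL || pool->blocks == NULL)
    {
        free(pool->descs);
        free(pool->blocks);
        pool->descs = NULL;
        pool->blocks = NULL;
        return -1;
    }
    for (int i = 0; i < nbuffers; i++)
        pool->descs[i].buf_id = i;
    return 0;
}

/* Find the 8 KB block that a given descriptor manages. */
char *
pool_block(const MiniPool *pool, const MiniDesc *desc)
{
    return pool->blocks + (size_t) desc->buf_id * BLCKSZ;
}
```

Keeping the descriptors in their own array means code that only inspects buffer metadata never has to touch the much larger page data.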

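The structure that manages a buffer is called BufferDesc. Even without looking, we can guess that it records whether the buffer is in use and which 8 KB block of which disk file it holds, and that it has locks to deal with concurrency. Here is the definition: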

typedef struct sbufdesc
{
    BufferTag   tag;                 /* ID of page contained in buffer */
    BufFlags    flags;               /* see bit definitions above */
    uint16      usage_count;         /* usage counter for clock sweep code */
    unsigned    refcount;            /* # of backends holding pins on buffer */
    int         wait_backend_pid;    /* backend PID of pin-count waiter */
    slock_t     buf_hdr_lock;        /* protects the above fields */
    int         buf_id;              /* buffer's index number (from 0) */
    int         freeNext;            /* link in freelist chain */
    LWLockId    io_in_progress_lock; /* to wait for I/O to complete */
    LWLockId    content_lock;        /* to lock access to buffer contents */
} BufferDesc;

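OK, back to the question we started with: which database, which table, which type (explained below), and which 8 KB block of which file does this shared buffer correspond to? The first field, tag, of type BufferTag, determines that mapping: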

typedef enum ForkNumber
{
    InvalidForkNumber = -1,
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VISIBILITYMAP_FORKNUM,
    INIT_FORKNUM
    /*
     * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
     * forkNames array in catalog.c
     */
} ForkNumber;

typedef struct RelFileNode
{
    Oid spcNode; /* tablespace */
    Oid dbNode;  /* database */
    Oid relNode; /* relation */
} RelFileNode;

/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries. It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel). The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
    RelFileNode rnode;    /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;
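
Because the tag is used as a hash key, two tags should compare equal exactly when they name the same block of the same fork of the same relation. The sketch below shows one way to build and compare tags under that constraint. The Mini-prefixed types and the helpers tag_init and tag_equal are hypothetical names for illustration; Oid and BlockNumber are assumed to be 32-bit unsigned integers, and the fork number is reduced to a plain int.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t Oid;          /* assumed to be 32-bit unsigned */
typedef uint32_t BlockNumber;  /* assumed to be 32-bit unsigned */

typedef struct MiniRelFileNode
{
    Oid spcNode;               /* tablespace */
    Oid dbNode;                /* database */
    Oid relNode;               /* relation */
} MiniRelFileNode;

typedef struct MiniBufferTag
{
    MiniRelFileNode rnode;
    int             forkNum;   /* 0 = main, 1 = fsm, 2 = vm */
    BlockNumber     blockNum;
} MiniBufferTag;

/*
 * Zero the whole struct first so that any padding bytes start out as
 * zero, then fill in the fields.
 */
void
tag_init(MiniBufferTag *tag, Oid spc, Oid db, Oid rel,
         int fork, BlockNumber blk)
{
    memset(tag, 0, sizeof(*tag));
    tag->rnode.spcNode = spc;
    tag->rnode.dbNode = db;
    tag->rnode.relNode = rel;
    tag->forkNum = fork;
    tag->blockNum = blk;
}

/* Field-by-field comparison: same relation, same fork, same block. */
int
tag_equal(const MiniBufferTag *a, const MiniBufferTag *b)
{
    return a->rnode.spcNode == b->rnode.spcNode &&
           a->rnode.dbNode == b->rnode.dbNode &&
           a->rnode.relNode == b->rnode.relNode &&
           a->forkNum == b->forkNum &&
           a->blockNum == b->blockNum;
}
```

Two tags built for block 7 of the main fork of the same relation compare equal, while changing only the fork number or the block number makes them differ.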

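We can see that rnode in BufferTag identifies the relation. Its type is RelFileNode, which holds tablespace, database, and relation, a three-level hierarchy from top to bottom that uniquely identifies a relation in PostgreSQL. A relation does not have just one kind of disk file, though: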

-rw------- 1 manu manu 270336 Jun 3 21:31 11785
-rw------- 1 manu manu 24576 Jun 3 21:31 11785_fsm
-rw------- 1 manu manu 8192 Jun 3 21:31 11785_vm

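As shown, 11785 corresponds to a relation, but there are three kinds of file on disk, including two with the _fsm and _vm suffixes. Let's look at the comment on ForkNumber: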

/*
* The physical storage of a relation consists of one or more forks. The
* main fork is always created, but in addition to that there can be
* additional forks for storing various metadata. ForkNumber is used when
* we need to refer to a specific fork in a relation.
*/

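A file of type MAIN_FORKNUM always exists, but some relations also have FSM_FORKNUM and VISIBILITYMAP_FORKNUM files. I do not know these two well yet, so I will not speculate.
Let's take it slowly and set the blockNum member aside for now; too big a step invites a stumble. First, how do we find the disk file from rnode + forkNum?
That lookup is done by the relpath macro, which calls relpathbackend: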

char *
relpathbackend(RelFileNode rnode, BackendId backend, ForkNumber forknum)
{
    int   pathlen;
    char *path;

    if (rnode.spcNode == GLOBALTABLESPACE_OID)
    {
        ...
    }
    else if (rnode.spcNode == DEFAULTTABLESPACE_OID)
    {
        /* "base/" + database OID + "/" + relation OID + "_" + fork name + NUL */
        pathlen = 5 + OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1;
        path = (char *) palloc(pathlen);
        if (forknum != MAIN_FORKNUM)
            snprintf(path, pathlen, "base/%u/%u_%s",
                     rnode.dbNode, rnode.relNode,
                     forkNames[forknum]);
        else
            snprintf(path, pathlen, "base/%u/%u",
                     rnode.dbNode, rnode.relNode);
    }
    else
    {
        ...
    }
    return path;
}

Because we are in pg_default, we take the DEFAULTTABLESPACE_OID branch, which puts us under the base directory. The database OID (BufferTag.rnode.dbNode) is 16384, which gives base/16384/, and BufferTag.rnode.relNode together with BufferTag.forkNum decides whether the file is base/16384/16385, base/16384/16385_fsm, or base/16384/16385_vm.
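
The default-tablespace branch is small enough to try out in isolation. The following is a minimal sketch of it, not the PostgreSQL function itself: mini_relpath and the fork constants are hypothetical names, the caller supplies the buffer instead of palloc, and the suffix table is assumed to match the forkNames array.

```c
#include <stdio.h>

/* Fork numbers in the same order as the ForkNumber enum. */
enum { MAIN_FORK = 0, FSM_FORK = 1, VM_FORK = 2, INIT_FORK = 3 };

/* Assumed fork name suffixes, indexed by fork number. */
static const char *const fork_names[] = { "main", "fsm", "vm", "init" };

/*
 * Hypothetical helper: build the path, relative to the data directory,
 * of one fork of a relation stored in the default tablespace.
 * Returns what snprintf returns: the length the full path needs,
 * not counting the terminating NUL.
 */
int
mini_relpath(char *buf, size_t buflen,
             unsigned db_oid, unsigned rel_oid, int fork)
{
    /* The main fork has no suffix; every other fork gets "_name". */
    if (fork == MAIN_FORK)
        return snprintf(buf, buflen, "base/%u/%u", db_oid, rel_oid);
    return snprintf(buf, buflen, "base/%u/%u_%s",
                    db_oid, rel_oid, fork_names[fork]);
}
```

Calling it with database 16384 and relation 16385 yields base/16384/16385 for the main fork and base/16384/16385_fsm or base/16384/16385_vm for the other two.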

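That mostly settles finding the file. However, some relations are large and hold many records, which makes the disk file huge. To keep file system limits on file size from causing write failures, PostgreSQL splits relations into segments. Take my friends table as an example: as records keep being inserted, its disk file 16385 keeps growing. When it exceeds 1 GB, PostgreSQL creates a file named 16385.1, and when it exceeds 2 GB, PostgreSQL adds another segment, 16385.2. The 1 GB figure is determined jointly by the block size of 8 KB and the number of blocks per segment of a large relation, 128K (131072).

The definition in the source has a comment above it that explains a lot: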

/* RELSEG_SIZE is the maximum number of blocks allowed in one disk file. Thus,
   the maximum size of a single file is RELSEG_SIZE * BLCKSZ; relations bigger
   than that are divided into multiple files. RELSEG_SIZE * BLCKSZ must be
   less than your OS' limit on file size. This is often 2 GB or 4GB in a
   32-bit operating system, unless you have large file support enabled. By
   default, we make the limit 1 GB to avoid any possible integer-overflow
   problems within the OS. A limit smaller than necessary only means we divide
   a large relation into more chunks than necessary, so it seems best to err
   in the direction of a small limit. A power-of-2 value is recommended to
   save a few cycles in md.c, but is not absolutely required. Changing
   RELSEG_SIZE requires an initdb. */
#define RELSEG_SIZE 131072
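
The 1 GB figure follows directly from the two constants, assuming the default block size of 8 KB:

```latex
\[
\text{RELSEG\_SIZE} \times \text{BLCKSZ}
  = 131072 \times 8192
  = 2^{17} \times 2^{13}
  = 2^{30}\ \text{bytes}
  = 1\ \text{GiB}
\]
```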

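Of course, 128K is only the default. When running configure before building PostgreSQL, a different value can be specified with --with-segsize, though I have not tried it.
Taking segments into account, the real disk file name, fullpath, is now within reach:
if the relation is segmented, the segment number segno is appended to the name obtained from relpath; if the segment number is 0, fullpath is simply the relpath described earlier.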

static char *
_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
{
    char *path,
         *fullpath;

    path = relpath(reln->smgr_rnode, forknum);
    if (segno > 0)
    {
        /* be sure we have enough space for the '.segno' */
        fullpath = (char *) palloc(strlen(path) + 12);
        sprintf(fullpath, "%s.%u", path, segno);
        pfree(path);
    }
    else
        fullpath = path;
    return fullpath;
}

How do we determine segno? That is easy: (BufferTag.blockNum / RELSEG_SIZE).
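
Putting the pieces together, a block number can be turned into a segment file name and a byte offset inside that file. This is a rough sketch under the default constants (8 KB blocks, 131072 blocks per segment); block_segno, block_offset, and mini_segpath are hypothetical helper names, and the caller supplies the output buffer.

```c
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ      8192     /* assumed default block size, in bytes */
#define RELSEG_SIZE 131072   /* default number of blocks per segment file */

/* Which segment file holds a given block of the relation. */
uint32_t
block_segno(uint32_t blocknum)
{
    return blocknum / RELSEG_SIZE;
}

/* Byte offset of the block inside its segment file. */
uint64_t
block_offset(uint32_t blocknum)
{
    return (uint64_t) BLCKSZ * (blocknum % RELSEG_SIZE);
}

/*
 * Hypothetical helper: append ".segno" to a relation path for every
 * segment except the first, which keeps the bare relation path.
 */
int
mini_segpath(char *buf, size_t buflen, const char *relpath, uint32_t segno)
{
    if (segno > 0)
        return snprintf(buf, buflen, "%s.%u", relpath, (unsigned) segno);
    return snprintf(buf, buflen, "%s", relpath);
}
```

For example, block 131073 falls in segment 1 at byte offset 8192, so for the relation path base/16384/16385 it would live in base/16384/16385.1.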
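OK, now that we have covered how an 8 KB block in shared buffers maps to a relation's disk file, we can move on to shared buffers itself with peace of mind. Phew, this post took a long time to write.
References:
1 PostgreSQL performance tuning
2 PostgreSQL 9.1.9 Source Code
3 Bruce Momjian, Inside PostgreSQL Shared Memory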
