References

《PostgreSQL数据库内核分析》 (PostgreSQL Database Kernel Analysis), 彭智勇 / 彭煜玮: pp. 99-101

Overview

In PostgreSQL, every operation on tables, tuples, indexes, and so on takes place in the buffer pool, and data moves in and out of the pool in units of disk blocks: a block that needs to be accessed is read into the pool by smgrread, and smgrwrite writes buffer-pool data back to disk. The memory slot that holds a disk block loaded into the pool is called a buffer, and the collection of buffers makes up the buffer pool.
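For reference, the storage-manager entry points look roughly like this in the PostgreSQL 10 era (declarations paraphrased from smgr.h; check your source tree for the exact signatures):

/* Paraphrased from src/include/storage/smgr.h (PostgreSQL 10 era). */
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
                     BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
                      BlockNumber blocknum, char *buffer, bool skipFsync);
/* smgrextend() adds a new block at the end of the relation */
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                       BlockNumber blocknum, char *buffer, bool skipFsync);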

PostgreSQL has two kinds of buffer pools: the shared buffer pool and the local buffer pool. The shared buffer pool is the working area for ordinary tables, while the local buffer pool serves only temporary tables, which are visible to the local backend alone. This article covers only the shared buffer pool.

Buffers in the buffer pool are managed through two mechanisms:

  1. pin

    Before a process accesses a buffer, it pins the buffer; the number of pins is kept in the buffer's refcount field. A nonzero refcount means some process is currently accessing the buffer, and the buffer cannot be replaced while that is the case.

  2. lock

    The lock mechanism guards concurrent access to a buffer's contents: a process takes an EXCLUSIVE lock to write the buffer and a SHARE lock to read it. For example, an INSERT must take an EXCLUSIVE lock on the buffer after obtaining it (the locking happens in RelationGetBufferForTuple; see the insert path for details). A minimal usage sketch follows this list.
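As a concrete illustration, the usual pin-then-lock pattern in backend code looks roughly like this (a minimal sketch using the public bufmgr API; rel and blkno are assumed to be in scope):

Buffer      buf;
Page        page;

buf = ReadBuffer(rel, blkno);           /* pin: refcount is incremented */
LockBuffer(buf, BUFFER_LOCK_SHARE);     /* content_lock in shared mode for reading */

page = BufferGetPage(buf);
/* ... read tuples on the page ... */

UnlockReleaseBuffer(buf);               /* drop the content lock, then the pin */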

Initializing the shared buffer pool

Initialization of the shared buffer pool is done by InitBufferPool. Shared buffer pool management uses a global array, BufferDescriptors, whose elements have type BufferDesc, to manage the buffers in the pool, plus a global pointer, BufferBlocks, that stores the starting address of the buffer pool.

First, the definition of BufferDesc:

typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;           /* link in freelist chain */

    LWLock      content_lock;       /* to lock access to buffer contents */
} BufferDesc;

The fields:

  • tag: identifies the physical block held in this buffer; it is defined as follows:

    typedef struct buftag
    {
        RelFileNode rnode;      /* tablespace OID, database OID, and relation OID */
        ForkNumber  forkNum;    /* enum marking what kind of file block the buffer holds */
        BlockNumber blockNum;   /* block number */
    } BufferTag;
    

    The tag uniquely identifies a physical block; note, a physical block! (The tag comes up again in the buffer-loading flow below.)

  • buf_id: the buffer's index number. buf_id uniquely identifies a buffer, and all kinds of buffer operations use it.

    Both shared and local buffers use buf_id, but their numbering rules differ: shared buffers are numbered from 0 upward, one by one, while local buffers are numbered from -2 downward.

    /* local buffers are numbered starting from -2 */
    #define LocalBufHdrGetBlock(bufHdr) LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
    
  • state: packs flags, refcount, and usagecount into a single atomic word (the bit layout is sketched after this list):

    • flags: status bits, e.g. whether the buffer is dirty.
    • refcount: the number of processes currently referencing this buffer; it is modified by pin operations.
    • usagecount: a recent-usage counter, used for buffer replacement.
  • wait_backend_pid: records the PID of a backend waiting to modify the buffer.

  • freeNext: if this buffer is on the free list, freeNext points to the next free buffer.

  • content_lock: taken whenever a process accesses the buffer's contents (LW_SHARED for reads, LW_EXCLUSIVE for writes); it prevents the data inconsistency that conflicting concurrent access to a buffer would otherwise cause.
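How the three values share a single 32-bit word can be seen from the masks in buf_internals.h (PostgreSQL 10 era, paraphrased):

/* Layout of BufferDesc.state: bits 0-17 refcount, bits 18-21 usagecount,
 * bits 22-31 flags (from src/include/storage/buf_internals.h). */
#define BUF_REFCOUNT_ONE        1
#define BUF_REFCOUNT_MASK       ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK     0x003C0000U
#define BUF_USAGECOUNT_ONE      (1U << 18)
#define BUF_USAGECOUNT_SHIFT    18
#define BUF_FLAG_MASK           0xFFC00000U

#define BUF_STATE_GET_REFCOUNT(state)   ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) \
    (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)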

Buffer operations

As mentioned above, shared buffer pool management uses two globals: the BufferDesc array BufferDescriptors and the BufferBlocks pointer. What is the relationship between them, and how do you convert between the two?

First, BufferDescriptors is an array with N elements, where N is the number of buffers in the pool (1000 by default in the version discussed here). BufferBlocks is a contiguous region of memory of size BLCKSZ * N, so it too can be viewed as an array of N elements, each element being one buffer.

Each BufferDesc records its own index in BufferDescriptors in its buf_id member, i.e.

BufferDesc == BufferDescriptors[BufferDesc->buf_id].

So, given a buf_id, you can fetch the BufferDesc from BufferDescriptors and also fetch the actual buffer from BufferBlocks. See the following macros:

/* Return the Buffer number for a descriptor; later operations are based on it */
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)

/* Get a BufferDesc out of BufferDescriptors */
#define GetBufferDescriptor(id) (&BufferDescriptors[(id)].bufferdesc)

/* Get the actual buffer out of BufferBlocks */
#define BufferGetPage(buffer) ((Page) BufferGetBlock(buffer))

/* Is this a local buffer? */
#define BufferIsLocal(buffer)   ((buffer) < 0)

#define BufferGetBlock(buffer) \
( \
    AssertMacro(BufferIsValid(buffer)), \
    BufferIsLocal(buffer) ? \
        LocalBufferBlockPointers[-(buffer) - 1] \
    : \
        (Block) (BufferBlocks + ((Size) ((buffer) - 1)) * BLCKSZ) \
)

GetBufferDescriptor is called with the BufferDesc array index itself, but BufferGetBlock must be called with the value returned by BufferDescriptorGetBuffer, i.e. the array index plus 1. The offset exists because Buffer number 0 is reserved for InvalidBuffer, so valid shared-buffer numbers start at 1 (and local buffers use negative numbers).
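Putting the macros together, the round trip from a descriptor to its data page looks like this (a minimal sketch; assumes a valid shared buffer at index 0):

BufferDesc *bufHdr = GetBufferDescriptor(0);            /* descriptor at array index 0 */
Buffer      buffer = BufferDescriptorGetBuffer(bufHdr); /* == 1; 0 is reserved for InvalidBuffer */
Page        page   = BufferGetPage(buffer);             /* BufferBlocks + (1 - 1) * BLCKSZ */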

The main work of InitBufferPool

InitBufferPool mainly does three things:

  1. Initialize BufferDescriptors.

  2. Initialize BufferBlocks.

  3. Initialize the buffer hash table.

    The hash table is initialized by calling InitBufTable from within StrategyInitialize. Its role is explained below under shared buffer loading. A condensed sketch of the whole routine follows this list.
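The sketch below paraphrases InitBufferPool from buf_init.c (PostgreSQL 10 era); lock initialization, the already-initialized checks, and checkpoint bookkeeping are omitted for brevity:

/* Condensed paraphrase of InitBufferPool; not the verbatim source. */
void
InitBufferPool(void)
{
    bool        foundBufs, foundDescs;
    int         i;

    /* 1 & 2: the descriptor array and the data area both live in shared memory */
    BufferDescriptors = (BufferDescPadded *)
        ShmemInitStruct("Buffer Descriptors",
                        NBuffers * sizeof(BufferDescPadded), &foundDescs);
    BufferBlocks = (char *)
        ShmemInitStruct("Buffer Blocks",
                        NBuffers * (Size) BLCKSZ, &foundBufs);

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *buf = GetBufferDescriptor(i);

        CLEAR_BUFFERTAG(buf->tag);      /* no block loaded yet */
        pg_atomic_init_u32(&buf->state, 0);
        buf->buf_id = i;
        buf->freeNext = i + 1;          /* chain every buffer into the freelist */
    }

    /* correct the last entry of the freelist */
    GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;

    /* 3: initialize the replacement strategy, including the buffer hash table */
    StrategyInitialize(!foundDescs);
}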

Loading a shared buffer (shared buffer lookup)

When PostgreSQL reads or writes a physical block, the block must first be read into the shared buffer pool; the data is then read or written through the buffer. The process of bringing a physical block into the shared buffer pool is called shared buffer loading.

ReadBuffer_common is the common routine for all buffers: it defines the read path shared by local and shared buffers. The code is as follows:

static Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
                  BlockNumber blockNum, ReadBufferMode mode,
                  BufferAccessStrategy strategy, bool *hit)
{
    BufferDesc *bufHdr;
    Block       bufBlock;
    bool        found;
    bool        isExtend;
    bool        isLocalBuf = SmgrIsTemp(smgr);

    *hit = false;

    /* Make sure we will have room to remember the buffer pin */
    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

    isExtend = (blockNum == P_NEW);

    TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
                                       smgr->smgr_rnode.node.spcNode,
                                       smgr->smgr_rnode.node.dbNode,
                                       smgr->smgr_rnode.node.relNode,
                                       smgr->smgr_rnode.backend,
                                       isExtend);

    /* Substitute proper block number if caller asked for P_NEW */
    if (isExtend)
        blockNum = smgrnblocks(smgr, forkNum);

    if (isLocalBuf)
    {
        bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
        if (found)
            pgBufferUsage.local_blks_hit++;
        else
            pgBufferUsage.local_blks_read++;
    }
    else
    {
        /*
         * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
         * not currently in memory.
         */
        bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
                             strategy, &found);
        if (found)
            pgBufferUsage.shared_blks_hit++;
        else
            pgBufferUsage.shared_blks_read++;
    }

    /* At this point we do NOT hold any locks. */

    /* if it was already in the buffer pool, we're done */
    if (found)
    {
        if (!isExtend)
        {
            /* Just need to update stats before we exit */
            *hit = true;
            VacuumPageHit++;

            if (VacuumCostActive)
                VacuumCostBalance += VacuumCostPageHit;

            TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
                                              smgr->smgr_rnode.node.spcNode,
                                              smgr->smgr_rnode.node.dbNode,
                                              smgr->smgr_rnode.node.relNode,
                                              smgr->smgr_rnode.backend,
                                              isExtend,
                                              found);

            /*
             * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
             * locked on return.
             */
            if (!isLocalBuf)
            {
                if (mode == RBM_ZERO_AND_LOCK)
                    LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
                                  LW_EXCLUSIVE);
                else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
                    LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
            }

            return BufferDescriptorGetBuffer(bufHdr);
        }

        /*
         * We get here only in the corner case where we are trying to extend
         * the relation but we found a pre-existing buffer marked BM_VALID.
         * This can happen because mdread doesn't complain about reads beyond
         * EOF (when zero_damaged_pages is ON) and so a previous attempt to
         * read a block beyond EOF could have left a "valid" zero-filled
         * buffer.  Unfortunately, we have also seen this case occurring
         * because of buggy Linux kernels that sometimes return an
         * lseek(SEEK_END) result that doesn't account for a recent write. In
         * that situation, the pre-existing buffer would contain valid data
         * that we don't want to overwrite.  Since the legitimate case should
         * always have left a zero-filled buffer, complain if not PageIsNew.
         */
        bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
        if (!PageIsNew((Page) bufBlock))
            ereport(ERROR,
                    (errmsg("unexpected data beyond EOF in block %u of relation %s",
                            blockNum, relpath(smgr->smgr_rnode, forkNum)),
                     errhint("This has been seen to occur with buggy kernels; consider updating your system.")));

        /*
         * We *must* do smgrextend before succeeding, else the page will not
         * be reserved by the kernel, and the next P_NEW call will decide to
         * return the same page.  Clear the BM_VALID bit, do the StartBufferIO
         * call that BufferAlloc didn't, and proceed.
         */
        if (isLocalBuf)
        {
            /* Only need to adjust flags */
            uint32      buf_state = pg_atomic_read_u32(&bufHdr->state);

            Assert(buf_state & BM_VALID);
            buf_state &= ~BM_VALID;
            pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
        }
        else
        {
            /*
             * Loop to handle the very small possibility that someone re-sets
             * BM_VALID between our clearing it and StartBufferIO inspecting
             * it.
             */
            do
            {
                uint32      buf_state = LockBufHdr(bufHdr);

                Assert(buf_state & BM_VALID);
                buf_state &= ~BM_VALID;
                UnlockBufHdr(bufHdr, buf_state);
            } while (!StartBufferIO(bufHdr, true));
        }
    }

    /*
     * if we have gotten to this point, we have allocated a buffer for the
     * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
     * if it's a shared buffer.
     *
     * Note: if smgrextend fails, we will end up with a buffer that is
     * allocated but not marked BM_VALID.  P_NEW will still select the same
     * block number (because the relation didn't get any longer on disk) and
     * so future attempts to extend the relation will find the same buffer (if
     * it's not been recycled) but come right back here to try smgrextend
     * again.
     */
    Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));   /* spinlock not needed */

    bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);

    if (isExtend)
    {
        /* new buffers are zero-filled */
        MemSet((char *) bufBlock, 0, BLCKSZ);
        /* don't set checksum for all-zero page */
        smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);

        /*
         * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
         * although we're essentially performing a write. At least on linux
         * doing so defeats the 'delayed allocation' mechanism, leading to
         * increased file fragmentation.
         */
    }
    else
    {
        /*
         * Read in the page, unless the caller intends to overwrite it and
         * just wants us to allocate a buffer.
         */
        if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
            MemSet((char *) bufBlock, 0, BLCKSZ);
        else
        {
            instr_time  io_start,
                        io_time;

            if (track_io_timing)
                INSTR_TIME_SET_CURRENT(io_start);

            smgrread(smgr, forkNum, blockNum, (char *) bufBlock);

            if (track_io_timing)
            {
                INSTR_TIME_SET_CURRENT(io_time);
                INSTR_TIME_SUBTRACT(io_time, io_start);
                pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
                INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
            }

            /* check for garbage data */
            if (!PageIsVerified((Page) bufBlock, blockNum))
            {
                if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
                {
                    ereport(WARNING,
                            (errcode(ERRCODE_DATA_CORRUPTED),
                             errmsg("invalid page in block %u of relation %s; zeroing out page",
                                    blockNum,
                                    relpath(smgr->smgr_rnode, forkNum))));
                    MemSet((char *) bufBlock, 0, BLCKSZ);
                }
                else
                    ereport(ERROR,
                            (errcode(ERRCODE_DATA_CORRUPTED),
                             errmsg("invalid page in block %u of relation %s",
                                    blockNum,
                                    relpath(smgr->smgr_rnode, forkNum))));
            }
        }
    }

    /*
     * In RBM_ZERO_AND_LOCK mode, grab the buffer content lock before marking
     * the page as valid, to make sure that no other backend sees the zeroed
     * page before the caller has had a chance to initialize it.
     *
     * Since no-one else can be looking at the page contents yet, there is no
     * difference between an exclusive lock and a cleanup-strength lock. (Note
     * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
     * they assert that the buffer is already valid.)
     */
    if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
        !isLocalBuf)
    {
        LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
    }

    if (isLocalBuf)
    {
        /* Only need to adjust flags */
        uint32      buf_state = pg_atomic_read_u32(&bufHdr->state);

        buf_state |= BM_VALID;
        pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
    }
    else
    {
        /* Set BM_VALID, terminate IO, and wake up any waiters */
        TerminateBufferIO(bufHdr, false, BM_VALID);
    }

    VacuumPageMiss++;
    if (VacuumCostActive)
        VacuumCostBalance += VacuumCostPageMiss;

    TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
                                      smgr->smgr_rnode.node.spcNode,
                                      smgr->smgr_rnode.node.dbNode,
                                      smgr->smgr_rnode.node.relNode,
                                      smgr->smgr_rnode.backend,
                                      isExtend,
                                      found);

    return BufferDescriptorGetBuffer(bufHdr);
}

The code is long, so we will cover only the most important parts. When a shared buffer is requested, the function first calls BufferAlloc to obtain one. Before reading BufferAlloc, let us pose a question and then debug with it in mind: if two processes need to load the same physical block at the same time, how is the block kept from being loaded twice? To investigate, set up the following test:

  1. Create a table.

    create table t1(a int);
    
  2. Insert one row, so that the table contains one physical block.

    insert into t1 values(1);
    
  3. Restart the database, so that the block produced in step 2 is guaranteed not to be in the shared buffer pool.

  4. Set a breakpoint in BufferAlloc.

  5. Open two client connections to PostgreSQL and run the query.

    select * from t1;
    

Remember the hash table initialized in InitBufferPool? It now takes center stage. The hash table acts as a buffer dictionary: its key is a physical block's BufferTag and its value is a buffer's buf_id. BufferAlloc runs in the following steps:

  1. Build a BufferTag from the tablespace OID, database OID, relation OID, and so on of the table the block belongs to (see INIT_BUFFERTAG). As noted above, the BufferTag uniquely identifies a physical block, so it can be used as the key of a hash-table lookup. If a buf_id is found, the requested block has already been loaded into the buffer pool and is returned directly (in the form of a BufferDesc).
  2. If the tag is not in the hash table, a free buffer must be found to hold the file block: if a free buffer exists it is returned; otherwise the replacement mechanism evicts a victim. (The hash-table API involved is sketched right after this list.)
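For reference, these are the buf_table.c entry points BufferAlloc uses (declarations paraphrased from buf_internals.h, PostgreSQL 10 era):

uint32  BufTableHashCode(BufferTag *tagPtr);    /* hash code of a tag */
int     BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
                        /* returns the buf_id, or -1 if the tag is absent */
int     BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
                        /* returns -1 on success, or the existing buf_id on collision */
void    BufTableDelete(BufferTag *tagPtr, uint32 hashcode);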

Here is BufferAlloc:

static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
            BlockNumber blockNum,
            BufferAccessStrategy strategy,
            bool *foundPtr)
{
    BufferTag   newTag;         /* identity of requested block */
    uint32      newHash;        /* hash value for newTag */
    LWLock     *newPartitionLock;   /* buffer partition lock for it */
    BufferTag   oldTag;         /* previous identity of selected buffer */
    uint32      oldHash;        /* hash value for oldTag */
    LWLock     *oldPartitionLock;   /* buffer partition lock for it */
    uint32      oldFlags;
    int         buf_id;
    BufferDesc *buf;
    bool        valid;
    uint32      buf_state;

    /* create a tag so we can lookup the buffer */
    INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);

    /* determine its hash code and partition lock ID */
    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    /*
     * see if the block is in the buffer pool already
     * step 1: check whether the physical block is already in a buffer.
     */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    if (buf_id >= 0)
    {
        /*
         * Found it.  Now, pin the buffer so no one can steal it from the
         * buffer pool, and check to see if the correct data has been loaded
         * into the buffer.
         */
        buf = GetBufferDescriptor(buf_id);

        valid = PinBuffer(buf, strategy);

        /* Can release the mapping lock as soon as we've pinned it */
        LWLockRelease(newPartitionLock);

        *foundPtr = TRUE;

        if (!valid)
        {
            /*
             * We can only get here if (a) someone else is still reading in
             * the page, or (b) a previous read attempt failed.  We have to
             * wait for any active read attempt to finish, and then set up our
             * own read attempt if the page is still not BM_VALID.
             * StartBufferIO does it all.
             */
            if (StartBufferIO(buf, true))
            {
                /*
                 * If we get here, previous attempts to read the buffer must
                 * have failed ... but we shall bravely try again.
                 */
                *foundPtr = FALSE;
            }
        }

        return buf;
    }

    /*
     * Didn't find it in the buffer pool.  We'll have to initialize a new
     * buffer.  Remember to unlock the mapping lock while doing the work.
     */
    LWLockRelease(newPartitionLock);

    /*
     * Loop here in case we have to try another victim buffer
     * step 2: get a free buffer.
     */
    for (;;)
    {
        /*
         * Ensure, while the spinlock's not yet held, that there's a free
         * refcount entry.
         */
        ReservePrivateRefCountEntry();

        /*
         * Select a victim buffer.  The buffer is returned with its header
         * spinlock still held!
         */
        buf = StrategyGetBuffer(strategy, &buf_state);

        Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);

        /* Must copy buffer flags while we still hold the spinlock */
        oldFlags = buf_state & BUF_FLAG_MASK;

        /* Pin the buffer and then release the buffer spinlock */
        PinBuffer_Locked(buf);

        /*
         * If the buffer was dirty, try to write it out.  There is a race
         * condition here, in that someone might dirty it after we released it
         * above, or even while we are writing it out (since our share-lock
         * won't prevent hint-bit updates).  We will recheck the dirty bit
         * after re-locking the buffer header.
         */
        if (oldFlags & BM_DIRTY)
        {
            /*
             * We need a share-lock on the buffer contents to write it out
             * (else we might write invalid data, eg because someone else is
             * compacting the page contents while we write).  We must use a
             * conditional lock acquisition here to avoid deadlock.  Even
             * though the buffer was not pinned (and therefore surely not
             * locked) when StrategyGetBuffer returned it, someone else could
             * have pinned and exclusive-locked it by the time we get here. If
             * we try to get the lock unconditionally, we'd block waiting for
             * them; if they later block waiting for us, deadlock ensues.
             * (This has been observed to happen when two backends are both
             * trying to split btree index pages, and the second one just
             * happens to be trying to split the page the first one got from
             * StrategyGetBuffer.)
             */
            if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
                                         LW_SHARED))
            {
                /*
                 * If using a nondefault strategy, and writing the buffer
                 * would require a WAL flush, let the strategy decide whether
                 * to go ahead and write/reuse the buffer or to choose another
                 * victim.  We need lock to inspect the page LSN, so this
                 * can't be done inside StrategyGetBuffer.
                 */
                if (strategy != NULL)
                {
                    XLogRecPtr  lsn;

                    /* Read the LSN while holding buffer header lock */
                    buf_state = LockBufHdr(buf);
                    lsn = BufferGetLSN(buf);
                    UnlockBufHdr(buf, buf_state);

                    if (XLogNeedsFlush(lsn) &&
                        StrategyRejectBuffer(strategy, buf))
                    {
                        /* Drop lock/pin and loop around for another buffer */
                        LWLockRelease(BufferDescriptorGetContentLock(buf));
                        UnpinBuffer(buf, true);
                        continue;
                    }
                }

                /* OK, do the I/O */
                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
                                                          smgr->smgr_rnode.node.spcNode,
                                                          smgr->smgr_rnode.node.dbNode,
                                                          smgr->smgr_rnode.node.relNode);

                FlushBuffer(buf, NULL);
                LWLockRelease(BufferDescriptorGetContentLock(buf));

                ScheduleBufferTagForWriteback(&BackendWritebackContext,
                                              &buf->tag);

                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
                                                         smgr->smgr_rnode.node.spcNode,
                                                         smgr->smgr_rnode.node.dbNode,
                                                         smgr->smgr_rnode.node.relNode);
            }
            else
            {
                /*
                 * Someone else has locked the buffer, so give it up and loop
                 * back to get another one.
                 */
                UnpinBuffer(buf, true);
                continue;
            }
        }

        /*
         * To change the association of a valid buffer, we'll need to have
         * exclusive lock on both the old and new mapping partitions.
         */
        if (oldFlags & BM_TAG_VALID)
        {
            /*
             * Need to compute the old tag's hashcode and partition lock ID.
             * XXX is it worth storing the hashcode in BufferDesc so we need
             * not recompute it here?  Probably not.
             */
            oldTag = buf->tag;
            oldHash = BufTableHashCode(&oldTag);
            oldPartitionLock = BufMappingPartitionLock(oldHash);

            /*
             * Must lock the lower-numbered partition first to avoid
             * deadlocks.
             */
            if (oldPartitionLock < newPartitionLock)
            {
                LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            }
            else if (oldPartitionLock > newPartitionLock)
            {
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
                LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
            }
            else
            {
                /* only one partition, only one lock */
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            }
        }
        else
        {
            /* if it wasn't valid, we need only the new partition */
            LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            /* remember we have no old-partition lock or tag */
            oldPartitionLock = NULL;
            /* this just keeps the compiler quiet about uninit variables */
            oldHash = 0;
        }

        /*
         * Try to make a hashtable entry for the buffer under its new tag.
         * This could fail because while we were writing someone else
         * allocated another buffer for the same block we want to read in.
         * Note that we have not yet removed the hashtable entry for the old
         * tag.
         */
        buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

        if (buf_id >= 0)
        {
            /*
             * Got a collision. Someone has already done what we were about to
             * do. We'll just handle this as if it were found in the buffer
             * pool in the first place.  First, give up the buffer we were
             * planning to use.
             */
            UnpinBuffer(buf, true);

            /* Can give up that buffer's mapping partition lock now */
            if (oldPartitionLock != NULL &&
                oldPartitionLock != newPartitionLock)
                LWLockRelease(oldPartitionLock);

            /* remaining code should match code at top of routine */

            buf = GetBufferDescriptor(buf_id);

            valid = PinBuffer(buf, strategy);

            /* Can release the mapping lock as soon as we've pinned it */
            LWLockRelease(newPartitionLock);

            *foundPtr = TRUE;

            if (!valid)
            {
                /*
                 * We can only get here if (a) someone else is still reading
                 * in the page, or (b) a previous read attempt failed.  We
                 * have to wait for any active read attempt to finish, and
                 * then set up our own read attempt if the page is still not
                 * BM_VALID.  StartBufferIO does it all.
                 */
                if (StartBufferIO(buf, true))
                {
                    /*
                     * If we get here, previous attempts to read the buffer
                     * must have failed ... but we shall bravely try again.
                     */
                    *foundPtr = FALSE;
                }
            }

            return buf;
        }

        /*
         * Need to lock the buffer header too in order to change its tag.
         */
        buf_state = LockBufHdr(buf);

        /*
         * Somebody could have pinned or re-dirtied the buffer while we were
         * doing the I/O and making the new hashtable entry.  If so, we can't
         * recycle this buffer; we must undo everything we've done and start
         * over with a new victim buffer.
         */
        oldFlags = buf_state & BUF_FLAG_MASK;
        if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
            break;

        UnlockBufHdr(buf, buf_state);
        BufTableDelete(&newTag, newHash);
        if (oldPartitionLock != NULL &&
            oldPartitionLock != newPartitionLock)
            LWLockRelease(oldPartitionLock);
        LWLockRelease(newPartitionLock);
        UnpinBuffer(buf, true);
    }

    /*
     * Okay, it's finally safe to rename the buffer.
     *
     * Clearing BM_VALID here is necessary, clearing the dirtybits is just
     * paranoia.  We also reset the usage_count since any recency of use of
     * the old content is no longer relevant.  (The usage_count starts out at
     * 1 so that the buffer can survive one clock-sweep pass.)
     *
     * Make sure BM_PERMANENT is set for buffers that must be written at every
     * checkpoint.  Unlogged buffers only need to be written at shutdown
     * checkpoints, except for their "init" forks, which need to be treated
     * just like permanent relations.
     */
    buf->tag = newTag;
    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
                   BUF_USAGECOUNT_MASK);
    if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
    else
        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

    UnlockBufHdr(buf, buf_state);

    if (oldPartitionLock != NULL)
    {
        BufTableDelete(&oldTag, oldHash);
        if (oldPartitionLock != newPartitionLock)
            LWLockRelease(oldPartitionLock);
    }

    LWLockRelease(newPartitionLock);

    /*
     * Buffer contents are currently invalid.  Try to get the io_in_progress
     * lock.  If StartBufferIO returns false, then someone else managed to
     * read it before we did, so there's nothing left for BufferAlloc() to do.
     */
    if (StartBufferIO(buf, true))
        *foundPtr = FALSE;
    else
        *foundPtr = TRUE;

    return buf;
}

Now back to the question posed earlier: if two processes need to load the same physical block simultaneously, how is the block kept from being loaded twice? While debugging we find that, because the database was just restarted, the block is certainly not in the pool, so BufTableLookup in step 1 returns -1 and both processes move on to step 2. At this point both processes have obtained a victim buffer! Each then calls BufTableInsert to insert its buffer into the hash table (BufferTag as key, buf_id as value). But before inserting, each process must acquire the mapping partition lock in exclusive mode (the LWLockAcquire calls just before BufTableInsert in the code above), so the two processes are serialized. BufTableInsert returns -1 if no entry with the same BufferTag existed before the insert, and otherwise returns the buf_id of the existing entry. Since the BufTableInsert calls are serialized and both processes are loading the same physical block, the process that runs BufTableInsert second necessarily finds the BufferTag already in the hash table. That process then gives up the buffer it had obtained and, much as in step 1, directly returns the buffer found in the hash table. This is how duplicate loading of a physical block is prevented.

Now consider another question: what if two processes are loading different physical blocks but obtain the same victim buffer? Buffers are obtained by StrategyGetBuffer, which executes in the following steps:

  1. If there is a free buffer, take one. Otherwise go to step 2.
  2. Use the replacement mechanism to evict a buffer.

The code:

BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
    BufferDesc *buf;
    int         bgwprocno;
    int         trycounter;
    uint32      local_buf_state;    /* to avoid repeated (de-)referencing */

    /*
     * If given a strategy object, see whether it can select a buffer. We
     * assume strategy objects don't need buffer_strategy_lock.
     */
    if (strategy != NULL)
    {
        buf = GetBufferFromRing(strategy, buf_state);
        if (buf != NULL)
            return buf;
    }

    /*
     * If asked, we need to waken the bgwriter. Since we don't want to rely on
     * a spinlock for this we force a read from shared memory once, and then
     * set the latch based on that value. We need to go through that length
     * because otherwise bgprocno might be reset while/after we check because
     * the compiler might just reread from memory.
     *
     * This can possibly set the latch of the wrong process if the bgwriter
     * dies in the wrong moment. But since PGPROC->procLatch is never
     * deallocated the worst consequence of that is that we set the latch of
     * some arbitrary process.
     */
    bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
    if (bgwprocno != -1)
    {
        /* reset bgwprocno first, before setting the latch */
        StrategyControl->bgwprocno = -1;

        /*
         * Not acquiring ProcArrayLock here which is slightly icky. It's
         * actually fine because procLatch isn't ever freed, so we just can
         * potentially set the wrong process' (or no process') latch.
         */
        SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
    }

    /*
     * We count buffer allocation requests so that the bgwriter can estimate
     * the rate of buffer consumption.  Note that buffers recycled by a
     * strategy object are intentionally not counted here.
     */
    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);

    /*
     * First check, without acquiring the lock, whether there's buffers in the
     * freelist. Since we otherwise don't require the spinlock in every
     * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
     * uselessly in most cases. That obviously leaves a race where a buffer is
     * put on the freelist but we don't see the store yet - but that's pretty
     * harmless, it'll just get used during the next buffer acquisition.
     *
     * If there's buffers on the freelist, acquire the spinlock to pop one
     * buffer of the freelist. Then check whether that buffer is usable and
     * repeat if not.
     *
     * Note that the freeNext fields are considered to be protected by the
     * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
     * manipulate them without holding the spinlock.
     *
     * step 1: get a free buffer.
     */
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        while (true)
        {
            /*
             * Acquire the spinlock to remove element from the freelist
             * (take the lock)
             */
            SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

            if (StrategyControl->firstFreeBuffer < 0)
            {
                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
                break;
            }

            buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
            Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

            /* Unconditionally remove buffer from freelist */
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;

            /*
             * Release the lock so someone else can access the freelist while
             * we check out this buffer.
             */
            SpinLockRelease(&StrategyControl->buffer_strategy_lock);

            /*
             * If the buffer is pinned or has a nonzero usage_count, we cannot
             * use it; discard it and retry.  (This can only happen if VACUUM
             * put a valid buffer in the freelist and then someone else used
             * it before we got to it.  It's probably impossible altogether as
             * of 8.3, but we'd better check anyway.)
             */
            local_buf_state = LockBufHdr(buf);
            if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
                && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
            {
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
            UnlockBufHdr(buf, local_buf_state);
        }
    }

    /*
     * Nothing on the freelist, so run the "clock sweep" algorithm
     * step 2: use the replacement mechanism to evict a buffer.
     */
    trycounter = NBuffers;
    for (;;)
    {
        buf = GetBufferDescriptor(ClockSweepTick());

        /*
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; decrement the usage_count (unless pinned) and keep scanning.
         * (take the header lock)
         */
        local_buf_state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
        {
            if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
            {
                local_buf_state -= BUF_USAGECOUNT_ONE;
                trycounter = NBuffers;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
        }
        else if (--trycounter == 0)
        {
            /*
             * We've scanned all the buffers without making any state changes,
             * so all the buffers are pinned (or were when we looked at them).
             * We could hope that someone will free one eventually, but it's
             * probably better to fail than to risk getting stuck in an
             * infinite loop.
             */
            UnlockBufHdr(buf, local_buf_state);
            elog(ERROR, "no unpinned buffers available");
        }
        UnlockBufHdr(buf, local_buf_state);
    }
}

Note that both step 1 and step 2 acquire a lock (the buffer_strategy_lock spinlock and the buffer header lock, respectively), so two processes cannot end up with the same buffer.

Shared buffer replacement policies

The number of buffers in the pool is fixed at initialization (defined by NBuffers; 1000 by default in the version discussed here) and never changes afterward. Under continuous operation the pool can therefore run out of free buffers, at which point some recently unused buffers must be replaced in order to load the requested file blocks.

PostgreSQL provides two replacement policies: the normal replacement policy and the buffer-ring replacement policy. In the StrategyGetBuffer code above, the ring policy is implemented by GetBufferFromRing, called from the if (strategy != NULL) block at the top of the function; the rest of the function implements the normal policy. The two policies are described in turn below:

The normal replacement policy

The two steps of the normal policy were already outlined above; here they are in more detail.

Taking a free buffer when one exists

The buffer pool maintains a FreeList, a singly linked list. Buffers on the FreeList are chained through the freeNext field of their descriptors, and the BufferStrategyControl structure records the first and last elements of the list. When a buffer is no longer in use it is returned to the FreeList (see the StrategyFreeBuffer sketch after the structure below); when a free buffer is needed, one is taken from the head. BufferStrategyControl is defined as follows:

typedef struct
{
    /* Spinlock: protects the values below */
    slock_t     buffer_strategy_lock;

    /*
     * Clock sweep hand: index of next buffer to consider grabbing. Note that
     * this isn't a concrete buffer - we only ever increase the value. So, to
     * get an actual buffer, it needs to be used modulo NBuffers.
     */
    pg_atomic_uint32 nextVictimBuffer;

    int         firstFreeBuffer;    /* Head of list of unused buffers */
    int         lastFreeBuffer;     /* Tail of list of unused buffers */

    /*
     * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
     * when the list is empty)
     */

    /*
     * Statistics.  These counters should be wide enough that they can't
     * overflow during a single bgwriter cycle.
     */
    uint32      completePasses; /* Complete cycles of the clock sweep */
    pg_atomic_uint32 numBufferAllocs;   /* Buffers allocated since last reset */

    /*
     * Bgworker process to be notified upon activity or -1 if none. See
     * StrategyNotifyBgWriter.
     */
    int         bgwprocno;
} BufferStrategyControl;
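Returning a buffer to the FreeList is done by StrategyFreeBuffer (freelist.c, lightly paraphrased); note that it pushes onto the head of the list:

void
StrategyFreeBuffer(BufferDesc *buf)
{
    SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

    /*
     * It is possible that we are told to put something in the freelist that
     * is already in it; don't screw up the list if so.
     */
    if (buf->freeNext == FREENEXT_NOT_IN_LIST)
    {
        buf->freeNext = StrategyControl->firstFreeBuffer;
        if (buf->freeNext < 0)
            StrategyControl->lastFreeBuffer = buf->buf_id;
        StrategyControl->firstFreeBuffer = buf->buf_id;
    }

    SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}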

Replacing a buffer via the replacement mechanism

The replacement mechanism is in fact a simple clock-sweep algorithm. The main flow (a simplified sketch of ClockSweepTick follows this list):

  1. Initialize trycounter = NBuffers.
  2. Locate the buffer indicated by the nextVictimBuffer field (initially 0).
  3. Increment nextVictimBuffer; if it pointed at the last buffer in the pool, wrap it back to 0.
  4. If the buffer found in step 2 has refcount 0:
    a. If its usagecount is nonzero, decrement usagecount and reset trycounter to NBuffers.
    b. Otherwise take this buffer and return it.
  5. If the buffer found in step 2 has a nonzero refcount, decrement trycounter; if trycounter reaches 0, raise an error.
  6. Return to step 2.
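The clock hand itself is advanced by ClockSweepTick. A simplified sketch (the real function in freelist.c also maintains completePasses for the bgwriter when the counter wraps around):

/* Simplified sketch of ClockSweepTick; not the full wraparound handling. */
static inline uint32
ClockSweepTick(void)
{
    uint32      victim;

    /* atomically advance the hand; the counter only ever increases */
    victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    /* the hand is a virtual position: reduce it modulo NBuffers */
    return victim % NBuffers;
}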

To make this clearer, here is that part of StrategyGetBuffer (the clock-sweep loop) again, annotated with the step numbers above:

trycounter = NBuffers;      /* step 1 */
for (;;)
{
    /* steps 2-3 */
    buf = GetBufferDescriptor(ClockSweepTick());

    /*
     * If the buffer is pinned or has a nonzero usage_count, we cannot use
     * it; decrement the usage_count (unless pinned) and keep scanning.
     */
    local_buf_state = LockBufHdr(buf);

    /* step 4: is refcount 0? */
    if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
    {
        /* is usagecount 0? */
        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
        {
            /* usagecount is nonzero: decrement it and reset trycounter to NBuffers */
            local_buf_state -= BUF_USAGECOUNT_ONE;
            trycounter = NBuffers;
        }
        else
        {
            /* usagecount is 0: take this buffer and return it */
            if (strategy != NULL)   /* if a ring strategy is in use, add the buffer to the ring */
                AddBufferToRing(strategy, buf);
            *buf_state = local_buf_state;
            return buf;
        }
    }
    else if (--trycounter == 0)
    {
        /* step 5 */
        /*
         * We've scanned all the buffers without making any state changes,
         * so all the buffers are pinned (or were when we looked at them).
         * We could hope that someone will free one eventually, but it's
         * probably better to fail than to risk getting stuck in an
         * infinite loop.
         */
        UnlockBufHdr(buf, local_buf_state);
        elog(ERROR, "no unpinned buffers available");
    }
    UnlockBufHdr(buf, local_buf_state);
    /* step 6: continue the loop */
}

The core idea:
Whatever the database, the heart of cache replacement is swapping out the pages that are accessed least often. PostgreSQL expresses a page's access frequency through usage_count: the count starts at 1 when a block is loaded (so the buffer survives one clock-sweep pass, per the comment in BufferAlloc above) and each pin of the page increments it, up to a small cap. The more frequently a page is accessed, the larger its usage_count, the longer the clock sweep takes to drive it to 0, and the less easily the page is evicted. (See the sketch below.)
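This is where the increment happens, condensed from PinBuffer in bufmgr.c (the real code loops on a compare-and-swap and skips the bump when a ring strategy is in use; only the usage_count handling is shown):

/* Condensed sketch of the usage_count bump inside PinBuffer. */
uint32      old_buf_state = pg_atomic_read_u32(&buf->state);
uint32      buf_state = old_buf_state;

buf_state += BUF_REFCOUNT_ONE;          /* the pin itself */

/* bump usage_count, but never beyond the cap (BM_MAX_USAGE_COUNT, 5) */
if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
    buf_state += BUF_USAGECOUNT_ONE;

pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state, buf_state);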

The buffer-ring replacement policy

The buffer ring is an optimization of the normal policy. Consider the following scenario: several processes are performing routine operations on the database when one of them launches a full-table scan. The scan touches a large number of physical blocks, but each block exactly once. Under the normal policy, the scan would flood the buffer pool with pages that will be used only once, evicting many pages that would be used repeatedly. That plainly defeats the purpose of the buffer pool, which is to reduce I/O. The ring's basic idea is to allocate a fixed number of buffers and perform replacement among them first; only when none of them can be replaced does the normal policy take over. The ring is controlled by the BufferAccessStrategy structure, defined as follows:

typedef struct BufferAccessStrategyData
{
    /* Overall strategy type (which ring policy) */
    BufferAccessStrategyType btype;
    /* Number of elements in buffers[] array (the ring size) */
    int         ring_size;

    /*
     * Index of the "current" slot in the ring, ie, the one most recently
     * returned by GetBufferFromRing.
     */
    int         current;

    /*
     * True if the buffer just returned by StrategyGetBuffer had been in the
     * ring already (i.e. was taken directly from the ring).
     */
    bool        current_was_in_ring;

    /*
     * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
     * have not yet selected a buffer for this ring slot.  For allocation
     * simplicity this is palloc'd together with the fixed fields of the
     * struct.  (Stores the numbers of the buffers added to the ring.)
     */
    Buffer      buffers[FLEXIBLE_ARRAY_MEMBER];
}   BufferAccessStrategyData;
typedef struct BufferAccessStrategyData *BufferAccessStrategy;
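How big is the ring? GetAccessStrategy (freelist.c) picks a size per workload type; paraphrased:

/* Paraphrased from GetAccessStrategy (freelist.c, PostgreSQL 10 era). */
switch (btype)
{
    case BAS_NORMAL:
        return NULL;                            /* no ring: normal policy */
    case BAS_BULKREAD:
        ring_size = 256 * 1024 / BLCKSZ;        /* 256 KB, e.g. large seq scans */
        break;
    case BAS_BULKWRITE:
        ring_size = 16 * 1024 * 1024 / BLCKSZ;  /* 16 MB, e.g. COPY IN */
        break;
    case BAS_VACUUM:
        ring_size = 256 * 1024 / BLCKSZ;        /* 256 KB */
        break;
}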

The ring policy is implemented in GetBufferFromRing, which proceeds in three steps:

  1. Advance the current pointer in the strategy to the next element of its buffers array (the candidate next buffer); if current pointed at the last element of buffers, reset it to 0 (the first element).

  2. Examine the element current now points at. If it holds InvalidBuffer, the ring has not filled up yet and no buffer has been recorded in this slot. In that case, set the strategy's current_was_in_ring field to false and return NULL.

    On seeing the NULL return value, the caller of GetBufferFromRing (StrategyGetBuffer) obtains a free buffer using the normal replacement policy and adds that buffer to the ring via AddBufferToRing.

  3. If the element holds a valid buffer number, check that buffer's refcount and usagecount. If refcount is 0 and usagecount <= 1 (accessed at most once recently, and that access was most likely the current process's own during the full-table scan), evict this buffer and return it. Otherwise the buffer is still in use by, or was recently used by, other processes; as in step 2, the caller falls back to the normal replacement policy to obtain a free buffer.

In short: look at the buffer in the slot after the current pointer; if it is a valid buffer that no process is using and that has been accessed at most once recently, return it; otherwise fall back to the normal replacement policy.

The code:

static BufferDesc *
GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
{
    BufferDesc *buf;
    Buffer      bufnum;
    uint32      local_buf_state;    /* to avoid repeated (de-)referencing */

    /* Advance to next ring slot */
    if (++strategy->current >= strategy->ring_size)
        strategy->current = 0;

    /*
     * If the slot hasn't been filled yet, tell the caller to allocate a new
     * buffer with the normal allocation strategy.  He will then fill this
     * slot by calling AddBufferToRing with the new buffer.
     */
    bufnum = strategy->buffers[strategy->current];
    if (bufnum == InvalidBuffer)
    {
        strategy->current_was_in_ring = false;
        return NULL;
    }

    /*
     * If the buffer is pinned we cannot use it under any circumstances.
     *
     * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
     * since our own previous usage of the ring element would have left it
     * there, but it might've been decremented by clock sweep since then). A
     * higher usage_count indicates someone else has touched the buffer, so we
     * shouldn't re-use it.
     */
    buf = GetBufferDescriptor(bufnum - 1);
    local_buf_state = LockBufHdr(buf);
    if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
        && BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
    {
        strategy->current_was_in_ring = true;
        *buf_state = local_buf_state;
        return buf;
    }
    UnlockBufHdr(buf, local_buf_state);

    /*
     * Tell caller to allocate a new buffer with the normal allocation
     * strategy.  He'll then replace this ring element via AddBufferToRing.
     */
    strategy->current_was_in_ring = false;
    return NULL;
}

AddBufferToRing

AddBufferToRing was mentioned above; here is its implementation:

static void
AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
{
    strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}

The function is trivial: it stores the buffer's Buffer number (buf_id + 1, produced by BufferDescriptorGetBuffer) into the ring slot that current points at.
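To close the loop, here is a hedged sketch of how a caller uses a ring end to end. rel and nblocks are assumed to be in scope, and a real sequential scan passes the strategy through the heap-scan machinery rather than calling ReadBufferExtended directly:

/* Sketch: read a relation block by block through a 256 KB bulk-read ring. */
BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
BlockNumber blkno;

for (blkno = 0; blkno < nblocks; blkno++)
{
    Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                         RBM_NORMAL, strategy);

    /* ... examine the page ... */
    ReleaseBuffer(buf);
}

FreeAccessStrategy(strategy);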
