References

《PostgreSQL数据库内核分析》 (PostgreSQL Database Kernel Analysis), 彭智勇 / 彭煜玮: pp. 99-101

Overview

In PostgreSQL, every operation on tables, tuples, indexes, and so on takes place in the buffer pool, and data moves in and out of the pool in units of disk blocks: a block that needs to be accessed is read into the pool by smgrread, and smgrwrite writes buffer-pool data back to disk. The memory slot that holds a disk block loaded into the pool is called a buffer, and the collection of buffers makes up the buffer pool.
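For reference, the storage-manager entry points look roughly like this in the PostgreSQL 10 era (declarations paraphrased from smgr.h; check your source tree for the exact signatures):

/* Paraphrased from src/include/storage/smgr.h (PostgreSQL 10 era). */
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
                     BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
                      BlockNumber blocknum, char *buffer, bool skipFsync);
/* smgrextend() adds a new block at the end of the relation */
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                       BlockNumber blocknum, char *buffer, bool skipFsync);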

PostgreSQL has two kinds of buffer pools: the shared buffer pool and the local buffer pool. The shared buffer pool is the working area for ordinary tables, while the local buffer pool serves only temporary tables, which are visible to the local backend alone. This article covers only the shared buffer pool.

Buffers in the buffer pool are managed through two mechanisms:

  1. pin

    Before a process accesses a buffer, it pins the buffer; the number of pins is kept in the buffer's refcount field. A nonzero refcount means some process is currently accessing the buffer, and the buffer cannot be replaced while that is the case.

  2. lock

    The lock mechanism guards concurrent access to a buffer's contents: a process takes an EXCLUSIVE lock to write the buffer and a SHARE lock to read it. For example, an INSERT must take an EXCLUSIVE lock on the buffer after obtaining it (the locking happens in RelationGetBufferForTuple; see the insert path for details). A minimal usage sketch follows this list.
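As a concrete illustration, the usual pin-then-lock pattern in backend code looks roughly like this (a minimal sketch using the public bufmgr API; rel and blkno are assumed to be in scope):

Buffer      buf;
Page        page;

buf = ReadBuffer(rel, blkno);           /* pin: refcount is incremented */
LockBuffer(buf, BUFFER_LOCK_SHARE);     /* content_lock in shared mode for reading */

page = BufferGetPage(buf);
/* ... read tuples on the page ... */

UnlockReleaseBuffer(buf);               /* drop the content lock, then the pin */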

Initializing the shared buffer pool

Initialization of the shared buffer pool is done by InitBufferPool. Shared buffer pool management uses a global array, BufferDescriptors, whose elements have type BufferDesc, to manage the buffers in the pool, plus a global pointer, BufferBlocks, that stores the starting address of the buffer pool.

First, the definition of BufferDesc:

typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;           /* link in freelist chain */

    LWLock      content_lock;       /* to lock access to buffer contents */
} BufferDesc;

The fields:

  • tag: identifies the physical block held in this buffer; it is defined as follows:

    typedef struct buftag
    {
        RelFileNode rnode;      /* tablespace OID, database OID, and relation OID */
        ForkNumber  forkNum;    /* enum marking what kind of file block the buffer holds */
        BlockNumber blockNum;   /* block number */
    } BufferTag;
    

    The tag uniquely identifies a physical block; note, a physical block! (The tag comes up again in the buffer-loading flow below.)

  • buf_id: the buffer's index number. buf_id uniquely identifies a buffer, and all kinds of buffer operations use it.

    Both shared and local buffers use buf_id, but their numbering rules differ: shared buffers are numbered from 0 upward, one by one, while local buffers are numbered from -2 downward.

    /* local buffers are numbered starting from -2 */
    #define LocalBufHdrGetBlock(bufHdr) LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
    
  • state: packs flags, refcount, and usagecount into a single atomic word (the bit layout is sketched after this list):

    • flags: status bits, e.g. whether the buffer is dirty.
    • refcount: the number of processes currently referencing this buffer; it is modified by pin operations.
    • usagecount: a recent-usage counter, used for buffer replacement.
  • wait_backend_pid: records the PID of a backend waiting to modify the buffer.

  • freeNext: if this buffer is on the free list, freeNext points to the next free buffer.

  • content_lock: taken whenever a process accesses the buffer's contents (LW_SHARED for reads, LW_EXCLUSIVE for writes); it prevents the data inconsistency that conflicting concurrent access to a buffer would otherwise cause.
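How the three values share a single 32-bit word can be seen from the masks in buf_internals.h (PostgreSQL 10 era, paraphrased):

/* Layout of BufferDesc.state: bits 0-17 refcount, bits 18-21 usagecount,
 * bits 22-31 flags (from src/include/storage/buf_internals.h). */
#define BUF_REFCOUNT_ONE        1
#define BUF_REFCOUNT_MASK       ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK     0x003C0000U
#define BUF_USAGECOUNT_ONE      (1U << 18)
#define BUF_USAGECOUNT_SHIFT    18
#define BUF_FLAG_MASK           0xFFC00000U

#define BUF_STATE_GET_REFCOUNT(state)   ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) \
    (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)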

Buffer operations

As mentioned above, shared buffer pool management uses two globals: the BufferDesc array BufferDescriptors and the BufferBlocks pointer. What is the relationship between them, and how do you convert between the two?

First, BufferDescriptors is an array with N elements, where N is the number of buffers in the pool (1000 by default in the version discussed here). BufferBlocks is a contiguous region of memory of size BLCKSZ * N, so it too can be viewed as an array of N elements, each element being one buffer.

Each BufferDesc records its own index in BufferDescriptors in its buf_id member, i.e.

BufferDesc == BufferDescriptors[BufferDesc->buf_id].

So, given a buf_id, you can fetch the BufferDesc from BufferDescriptors and also fetch the actual buffer from BufferBlocks. See the following macros:

/* Return the Buffer number for a descriptor; later operations are based on it */
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)

/* Get a BufferDesc out of BufferDescriptors */
#define GetBufferDescriptor(id) (&BufferDescriptors[(id)].bufferdesc)

/* Get the actual buffer out of BufferBlocks */
#define BufferGetPage(buffer) ((Page) BufferGetBlock(buffer))

/* Is this a local buffer? */
#define BufferIsLocal(buffer)   ((buffer) < 0)

#define BufferGetBlock(buffer) \
( \
    AssertMacro(BufferIsValid(buffer)), \
    BufferIsLocal(buffer) ? \
        LocalBufferBlockPointers[-(buffer) - 1] \
    : \
        (Block) (BufferBlocks + ((Size) ((buffer) - 1)) * BLCKSZ) \
)

GetBufferDescriptor is called with the BufferDesc array index itself, but BufferGetBlock must be called with the value returned by BufferDescriptorGetBuffer, i.e. the array index plus 1. The offset exists because Buffer number 0 is reserved for InvalidBuffer, so valid shared-buffer numbers start at 1 (and local buffers use negative numbers).
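Putting the macros together, the round trip from a descriptor to its data page looks like this (a minimal sketch; assumes a valid shared buffer at index 0):

BufferDesc *bufHdr = GetBufferDescriptor(0);            /* descriptor at array index 0 */
Buffer      buffer = BufferDescriptorGetBuffer(bufHdr); /* == 1; 0 is reserved for InvalidBuffer */
Page        page   = BufferGetPage(buffer);             /* BufferBlocks + (1 - 1) * BLCKSZ */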

The main work of InitBufferPool

InitBufferPool mainly does three things:

  1. Initialize BufferDescriptors.

  2. Initialize BufferBlocks.

  3. Initialize the buffer hash table.

    The hash table is initialized by calling InitBufTable from within StrategyInitialize. Its role is explained below under shared buffer loading. A condensed sketch of the whole routine follows this list.
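The sketch below paraphrases InitBufferPool from buf_init.c (PostgreSQL 10 era); lock initialization, the already-initialized checks, and checkpoint bookkeeping are omitted for brevity:

/* Condensed paraphrase of InitBufferPool; not the verbatim source. */
void
InitBufferPool(void)
{
    bool        foundBufs, foundDescs;
    int         i;

    /* 1 & 2: the descriptor array and the data area both live in shared memory */
    BufferDescriptors = (BufferDescPadded *)
        ShmemInitStruct("Buffer Descriptors",
                        NBuffers * sizeof(BufferDescPadded), &foundDescs);
    BufferBlocks = (char *)
        ShmemInitStruct("Buffer Blocks",
                        NBuffers * (Size) BLCKSZ, &foundBufs);

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *buf = GetBufferDescriptor(i);

        CLEAR_BUFFERTAG(buf->tag);      /* no block loaded yet */
        pg_atomic_init_u32(&buf->state, 0);
        buf->buf_id = i;
        buf->freeNext = i + 1;          /* chain every buffer into the freelist */
    }

    /* correct the last entry of the freelist */
    GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;

    /* 3: initialize the replacement strategy, including the buffer hash table */
    StrategyInitialize(!foundDescs);
}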

Loading a shared buffer (shared buffer lookup)

When PostgreSQL reads or writes a physical block, the block must first be read into the shared buffer pool; the data is then read or written through the buffer. The process of bringing a physical block into the shared buffer pool is called shared buffer loading.

ReadBuffer_common is the common routine for all buffers: it defines the read path shared by local and shared buffers. The code is as follows:

static Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
                  BlockNumber blockNum, ReadBufferMode mode,
                  BufferAccessStrategy strategy, bool *hit)
{
    BufferDesc *bufHdr;
    Block       bufBlock;
    bool        found;
    bool        isExtend;
    bool        isLocalBuf = SmgrIsTemp(smgr);

    *hit = false;

    /* Make sure we will have room to remember the buffer pin */
    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

    isExtend = (blockNum == P_NEW);

    TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
                                       smgr->smgr_rnode.node.spcNode,
                                       smgr->smgr_rnode.node.dbNode,
                                       smgr->smgr_rnode.node.relNode,
                                       smgr->smgr_rnode.backend,
                                       isExtend);

    /* Substitute proper block number if caller asked for P_NEW */
    if (isExtend)
        blockNum = smgrnblocks(smgr, forkNum);

    if (isLocalBuf)
    {
        bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
        if (found)
            pgBufferUsage.local_blks_hit++;
        else
            pgBufferUsage.local_blks_read++;
    }
    else
    {
        /*
         * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
         * not currently in memory.
         */
        bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
                             strategy, &found);
        if (found)
            pgBufferUsage.shared_blks_hit++;
        else
            pgBufferUsage.shared_blks_read++;
    }

    /* At this point we do NOT hold any locks. */

    /* if it was already in the buffer pool, we're done */
    if (found)
    {
        if (!isExtend)
        {
            /* Just need to update stats before we exit */
            *hit = true;
            VacuumPageHit++;

            if (VacuumCostActive)
                VacuumCostBalance += VacuumCostPageHit;

            TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
                                              smgr->smgr_rnode.node.spcNode,
                                              smgr->smgr_rnode.node.dbNode,
                                              smgr->smgr_rnode.node.relNode,
                                              smgr->smgr_rnode.backend,
                                              isExtend,
                                              found);

            /*
             * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
             * locked on return.
             */
            if (!isLocalBuf)
            {
                if (mode == RBM_ZERO_AND_LOCK)
                    LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
                                  LW_EXCLUSIVE);
                else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
                    LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
            }

            return BufferDescriptorGetBuffer(bufHdr);
        }

        /*
         * We get here only in the corner case where we are trying to extend
         * the relation but we found a pre-existing buffer marked BM_VALID.
         * This can happen because mdread doesn't complain about reads beyond
         * EOF (when zero_damaged_pages is ON) and so a previous attempt to
         * read a block beyond EOF could have left a "valid" zero-filled
         * buffer.  Unfortunately, we have also seen this case occurring
         * because of buggy Linux kernels that sometimes return an
         * lseek(SEEK_END) result that doesn't account for a recent write. In
         * that situation, the pre-existing buffer would contain valid data
         * that we don't want to overwrite.  Since the legitimate case should
         * always have left a zero-filled buffer, complain if not PageIsNew.
         */
        bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
        if (!PageIsNew((Page) bufBlock))
            ereport(ERROR,
                    (errmsg("unexpected data beyond EOF in block %u of relation %s",
                            blockNum, relpath(smgr->smgr_rnode, forkNum)),
                     errhint("This has been seen to occur with buggy kernels; consider updating your system.")));

        /*
         * We *must* do smgrextend before succeeding, else the page will not
         * be reserved by the kernel, and the next P_NEW call will decide to
         * return the same page.  Clear the BM_VALID bit, do the StartBufferIO
         * call that BufferAlloc didn't, and proceed.
         */
        if (isLocalBuf)
        {
            /* Only need to adjust flags */
            uint32      buf_state = pg_atomic_read_u32(&bufHdr->state);

            Assert(buf_state & BM_VALID);
            buf_state &= ~BM_VALID;
            pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
        }
        else
        {
            /*
             * Loop to handle the very small possibility that someone re-sets
             * BM_VALID between our clearing it and StartBufferIO inspecting
             * it.
             */
            do
            {
                uint32      buf_state = LockBufHdr(bufHdr);

                Assert(buf_state & BM_VALID);
                buf_state &= ~BM_VALID;
                UnlockBufHdr(bufHdr, buf_state);
            } while (!StartBufferIO(bufHdr, true));
        }
    }

    /*
     * if we have gotten to this point, we have allocated a buffer for the
     * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
     * if it's a shared buffer.
     *
     * Note: if smgrextend fails, we will end up with a buffer that is
     * allocated but not marked BM_VALID.  P_NEW will still select the same
     * block number (because the relation didn't get any longer on disk) and
     * so future attempts to extend the relation will find the same buffer (if
     * it's not been recycled) but come right back here to try smgrextend
     * again.
     */
    Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));   /* spinlock not needed */

    bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);

    if (isExtend)
    {
        /* new buffers are zero-filled */
        MemSet((char *) bufBlock, 0, BLCKSZ);
        /* don't set checksum for all-zero page */
        smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);

        /*
         * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
         * although we're essentially performing a write. At least on linux
         * doing so defeats the 'delayed allocation' mechanism, leading to
         * increased file fragmentation.
         */
    }
    else
    {
        /*
         * Read in the page, unless the caller intends to overwrite it and
         * just wants us to allocate a buffer.
         */
        if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
            MemSet((char *) bufBlock, 0, BLCKSZ);
        else
        {
            instr_time  io_start,
                        io_time;

            if (track_io_timing)
                INSTR_TIME_SET_CURRENT(io_start);

            smgrread(smgr, forkNum, blockNum, (char *) bufBlock);

            if (track_io_timing)
            {
                INSTR_TIME_SET_CURRENT(io_time);
                INSTR_TIME_SUBTRACT(io_time, io_start);
                pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
                INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
            }

            /* check for garbage data */
            if (!PageIsVerified((Page) bufBlock, blockNum))
            {
                if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
                {
                    ereport(WARNING,
                            (errcode(ERRCODE_DATA_CORRUPTED),
                             errmsg("invalid page in block %u of relation %s; zeroing out page",
                                    blockNum,
                                    relpath(smgr->smgr_rnode, forkNum))));
                    MemSet((char *) bufBlock, 0, BLCKSZ);
                }
                else
                    ereport(ERROR,
                            (errcode(ERRCODE_DATA_CORRUPTED),
                             errmsg("invalid page in block %u of relation %s",
                                    blockNum,
                                    relpath(smgr->smgr_rnode, forkNum))));
            }
        }
    }

    /*
     * In RBM_ZERO_AND_LOCK mode, grab the buffer content lock before marking
     * the page as valid, to make sure that no other backend sees the zeroed
     * page before the caller has had a chance to initialize it.
     *
     * Since no-one else can be looking at the page contents yet, there is no
     * difference between an exclusive lock and a cleanup-strength lock. (Note
     * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
     * they assert that the buffer is already valid.)
     */
    if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
        !isLocalBuf)
    {
        LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
    }

    if (isLocalBuf)
    {
        /* Only need to adjust flags */
        uint32      buf_state = pg_atomic_read_u32(&bufHdr->state);

        buf_state |= BM_VALID;
        pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
    }
    else
    {
        /* Set BM_VALID, terminate IO, and wake up any waiters */
        TerminateBufferIO(bufHdr, false, BM_VALID);
    }

    VacuumPageMiss++;
    if (VacuumCostActive)
        VacuumCostBalance += VacuumCostPageMiss;

    TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
                                      smgr->smgr_rnode.node.spcNode,
                                      smgr->smgr_rnode.node.dbNode,
                                      smgr->smgr_rnode.node.relNode,
                                      smgr->smgr_rnode.backend,
                                      isExtend,
                                      found);

    return BufferDescriptorGetBuffer(bufHdr);
}

The code is long, so we will cover only the most important parts. When a shared buffer is requested, the function first calls BufferAlloc to obtain one. Before reading BufferAlloc, let us pose a question and then debug with it in mind: if two processes need to load the same physical block at the same time, how is the block kept from being loaded twice? To investigate, set up the following test:

  1. Create a table.

    create table t1(a int);
    
  2. Insert one row, so that the table contains one physical block.

    insert into t1 values(1);
    
  3. Restart the database, so that the block produced in step 2 is guaranteed not to be in the shared buffer pool.

  4. Set a breakpoint in BufferAlloc.

  5. Open two client connections to PostgreSQL and run the query.

    select * from t1;
    

Remember the hash table initialized in InitBufferPool? It now takes center stage. The hash table acts as a buffer dictionary: its key is a physical block's BufferTag and its value is a buffer's buf_id. BufferAlloc runs in the following steps:

  1. Build a BufferTag from the tablespace OID, database OID, relation OID, and so on of the table the block belongs to (see INIT_BUFFERTAG). As noted above, the BufferTag uniquely identifies a physical block, so it can be used as the key of a hash-table lookup. If a buf_id is found, the requested block has already been loaded into the buffer pool and is returned directly (in the form of a BufferDesc).
  2. If the tag is not in the hash table, a free buffer must be found to hold the file block: if a free buffer exists it is returned; otherwise the replacement mechanism evicts a victim. (The hash-table API involved is sketched right after this list.)
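For reference, these are the buf_table.c entry points BufferAlloc uses (declarations paraphrased from buf_internals.h, PostgreSQL 10 era):

uint32  BufTableHashCode(BufferTag *tagPtr);    /* hash code of a tag */
int     BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
                        /* returns the buf_id, or -1 if the tag is absent */
int     BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
                        /* returns -1 on success, or the existing buf_id on collision */
void    BufTableDelete(BufferTag *tagPtr, uint32 hashcode);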

Here is BufferAlloc:

static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
            BlockNumber blockNum,
            BufferAccessStrategy strategy,
            bool *foundPtr)
{
    BufferTag   newTag;         /* identity of requested block */
    uint32      newHash;        /* hash value for newTag */
    LWLock     *newPartitionLock;   /* buffer partition lock for it */
    BufferTag   oldTag;         /* previous identity of selected buffer */
    uint32      oldHash;        /* hash value for oldTag */
    LWLock     *oldPartitionLock;   /* buffer partition lock for it */
    uint32      oldFlags;
    int         buf_id;
    BufferDesc *buf;
    bool        valid;
    uint32      buf_state;

    /* create a tag so we can lookup the buffer */
    INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);

    /* determine its hash code and partition lock ID */
    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    /*
     * see if the block is in the buffer pool already
     * step 1: check whether the physical block is already in a buffer.
     */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    if (buf_id >= 0)
    {
        /*
         * Found it.  Now, pin the buffer so no one can steal it from the
         * buffer pool, and check to see if the correct data has been loaded
         * into the buffer.
         */
        buf = GetBufferDescriptor(buf_id);

        valid = PinBuffer(buf, strategy);

        /* Can release the mapping lock as soon as we've pinned it */
        LWLockRelease(newPartitionLock);

        *foundPtr = TRUE;

        if (!valid)
        {
            /*
             * We can only get here if (a) someone else is still reading in
             * the page, or (b) a previous read attempt failed.  We have to
             * wait for any active read attempt to finish, and then set up our
             * own read attempt if the page is still not BM_VALID.
             * StartBufferIO does it all.
             */
            if (StartBufferIO(buf, true))
            {
                /*
                 * If we get here, previous attempts to read the buffer must
                 * have failed ... but we shall bravely try again.
                 */
                *foundPtr = FALSE;
            }
        }

        return buf;
    }

    /*
     * Didn't find it in the buffer pool.  We'll have to initialize a new
     * buffer.  Remember to unlock the mapping lock while doing the work.
     */
    LWLockRelease(newPartitionLock);

    /*
     * Loop here in case we have to try another victim buffer
     * step 2: get a free buffer.
     */
    for (;;)
    {
        /*
         * Ensure, while the spinlock's not yet held, that there's a free
         * refcount entry.
         */
        ReservePrivateRefCountEntry();

        /*
         * Select a victim buffer.  The buffer is returned with its header
         * spinlock still held!
         */
        buf = StrategyGetBuffer(strategy, &buf_state);

        Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);

        /* Must copy buffer flags while we still hold the spinlock */
        oldFlags = buf_state & BUF_FLAG_MASK;

        /* Pin the buffer and then release the buffer spinlock */
        PinBuffer_Locked(buf);

        /*
         * If the buffer was dirty, try to write it out.  There is a race
         * condition here, in that someone might dirty it after we released it
         * above, or even while we are writing it out (since our share-lock
         * won't prevent hint-bit updates).  We will recheck the dirty bit
         * after re-locking the buffer header.
         */
        if (oldFlags & BM_DIRTY)
        {
            /*
             * We need a share-lock on the buffer contents to write it out
             * (else we might write invalid data, eg because someone else is
             * compacting the page contents while we write).  We must use a
             * conditional lock acquisition here to avoid deadlock.  Even
             * though the buffer was not pinned (and therefore surely not
             * locked) when StrategyGetBuffer returned it, someone else could
             * have pinned and exclusive-locked it by the time we get here. If
             * we try to get the lock unconditionally, we'd block waiting for
             * them; if they later block waiting for us, deadlock ensues.
             * (This has been observed to happen when two backends are both
             * trying to split btree index pages, and the second one just
             * happens to be trying to split the page the first one got from
             * StrategyGetBuffer.)
             */
            if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
                                         LW_SHARED))
            {
                /*
                 * If using a nondefault strategy, and writing the buffer
                 * would require a WAL flush, let the strategy decide whether
                 * to go ahead and write/reuse the buffer or to choose another
                 * victim.  We need lock to inspect the page LSN, so this
                 * can't be done inside StrategyGetBuffer.
                 */
                if (strategy != NULL)
                {
                    XLogRecPtr  lsn;

                    /* Read the LSN while holding buffer header lock */
                    buf_state = LockBufHdr(buf);
                    lsn = BufferGetLSN(buf);
                    UnlockBufHdr(buf, buf_state);

                    if (XLogNeedsFlush(lsn) &&
                        StrategyRejectBuffer(strategy, buf))
                    {
                        /* Drop lock/pin and loop around for another buffer */
                        LWLockRelease(BufferDescriptorGetContentLock(buf));
                        UnpinBuffer(buf, true);
                        continue;
                    }
                }

                /* OK, do the I/O */
                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
                                                          smgr->smgr_rnode.node.spcNode,
                                                          smgr->smgr_rnode.node.dbNode,
                                                          smgr->smgr_rnode.node.relNode);

                FlushBuffer(buf, NULL);
                LWLockRelease(BufferDescriptorGetContentLock(buf));

                ScheduleBufferTagForWriteback(&BackendWritebackContext,
                                              &buf->tag);

                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
                                                         smgr->smgr_rnode.node.spcNode,
                                                         smgr->smgr_rnode.node.dbNode,
                                                         smgr->smgr_rnode.node.relNode);
            }
            else
            {
                /*
                 * Someone else has locked the buffer, so give it up and loop
                 * back to get another one.
                 */
                UnpinBuffer(buf, true);
                continue;
            }
        }

        /*
         * To change the association of a valid buffer, we'll need to have
         * exclusive lock on both the old and new mapping partitions.
         */
        if (oldFlags & BM_TAG_VALID)
        {
            /*
             * Need to compute the old tag's hashcode and partition lock ID.
             * XXX is it worth storing the hashcode in BufferDesc so we need
             * not recompute it here?  Probably not.
             */
            oldTag = buf->tag;
            oldHash = BufTableHashCode(&oldTag);
            oldPartitionLock = BufMappingPartitionLock(oldHash);

            /*
             * Must lock the lower-numbered partition first to avoid
             * deadlocks.
             */
            if (oldPartitionLock < newPartitionLock)
            {
                LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            }
            else if (oldPartitionLock > newPartitionLock)
            {
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
                LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
            }
            else
            {
                /* only one partition, only one lock */
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            }
        }
        else
        {
            /* if it wasn't valid, we need only the new partition */
            LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            /* remember we have no old-partition lock or tag */
            oldPartitionLock = NULL;
            /* this just keeps the compiler quiet about uninit variables */
            oldHash = 0;
        }

        /*
         * Try to make a hashtable entry for the buffer under its new tag.
         * This could fail because while we were writing someone else
         * allocated another buffer for the same block we want to read in.
         * Note that we have not yet removed the hashtable entry for the old
         * tag.
         */
        buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

        if (buf_id >= 0)
        {
            /*
             * Got a collision. Someone has already done what we were about to
             * do. We'll just handle this as if it were found in the buffer
             * pool in the first place.  First, give up the buffer we were
             * planning to use.
             */
            UnpinBuffer(buf, true);

            /* Can give up that buffer's mapping partition lock now */
            if (oldPartitionLock != NULL &&
                oldPartitionLock != newPartitionLock)
                LWLockRelease(oldPartitionLock);

            /* remaining code should match code at top of routine */

            buf = GetBufferDescriptor(buf_id);

            valid = PinBuffer(buf, strategy);

            /* Can release the mapping lock as soon as we've pinned it */
            LWLockRelease(newPartitionLock);

            *foundPtr = TRUE;

            if (!valid)
            {
                /*
                 * We can only get here if (a) someone else is still reading
                 * in the page, or (b) a previous read attempt failed.  We
                 * have to wait for any active read attempt to finish, and
                 * then set up our own read attempt if the page is still not
                 * BM_VALID.  StartBufferIO does it all.
                 */
                if (StartBufferIO(buf, true))
                {
                    /*
                     * If we get here, previous attempts to read the buffer
                     * must have failed ... but we shall bravely try again.
                     */
                    *foundPtr = FALSE;
                }
            }

            return buf;
        }

        /*
         * Need to lock the buffer header too in order to change its tag.
         */
        buf_state = LockBufHdr(buf);

        /*
         * Somebody could have pinned or re-dirtied the buffer while we were
         * doing the I/O and making the new hashtable entry.  If so, we can't
         * recycle this buffer; we must undo everything we've done and start
         * over with a new victim buffer.
         */
        oldFlags = buf_state & BUF_FLAG_MASK;
        if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
            break;

        UnlockBufHdr(buf, buf_state);
        BufTableDelete(&newTag, newHash);
        if (oldPartitionLock != NULL &&
            oldPartitionLock != newPartitionLock)
            LWLockRelease(oldPartitionLock);
        LWLockRelease(newPartitionLock);
        UnpinBuffer(buf, true);
    }

    /*
     * Okay, it's finally safe to rename the buffer.
     *
     * Clearing BM_VALID here is necessary, clearing the dirtybits is just
     * paranoia.  We also reset the usage_count since any recency of use of
     * the old content is no longer relevant.  (The usage_count starts out at
     * 1 so that the buffer can survive one clock-sweep pass.)
     *
     * Make sure BM_PERMANENT is set for buffers that must be written at every
     * checkpoint.  Unlogged buffers only need to be written at shutdown
     * checkpoints, except for their "init" forks, which need to be treated
     * just like permanent relations.
     */
    buf->tag = newTag;
    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
                   BUF_USAGECOUNT_MASK);
    if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
    else
        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

    UnlockBufHdr(buf, buf_state);

    if (oldPartitionLock != NULL)
    {
        BufTableDelete(&oldTag, oldHash);
        if (oldPartitionLock != newPartitionLock)
            LWLockRelease(oldPartitionLock);
    }

    LWLockRelease(newPartitionLock);

    /*
     * Buffer contents are currently invalid.  Try to get the io_in_progress
     * lock.  If StartBufferIO returns false, then someone else managed to
     * read it before we did, so there's nothing left for BufferAlloc() to do.
     */
    if (StartBufferIO(buf, true))
        *foundPtr = FALSE;
    else
        *foundPtr = TRUE;

    return buf;
}

Now back to the question posed earlier: if two processes need to load the same physical block simultaneously, how is the block kept from being loaded twice? While debugging we find that, because the database was just restarted, the block is certainly not in the pool, so BufTableLookup in step 1 returns -1 and both processes move on to step 2. At this point both processes have obtained a victim buffer! Each then calls BufTableInsert to insert its buffer into the hash table (BufferTag as key, buf_id as value). But before inserting, each process must acquire the mapping partition lock in exclusive mode (the LWLockAcquire calls just before BufTableInsert in the code above), so the two processes are serialized. BufTableInsert returns -1 if no entry with the same BufferTag existed before the insert, and otherwise returns the buf_id of the existing entry. Since the BufTableInsert calls are serialized and both processes are loading the same physical block, the process that runs BufTableInsert second necessarily finds the BufferTag already in the hash table. That process then gives up the buffer it had obtained and, much as in step 1, directly returns the buffer found in the hash table. This is how duplicate loading of a physical block is prevented.

Now consider another question: what if two processes are loading different physical blocks but obtain the same victim buffer? Buffers are obtained by StrategyGetBuffer, which executes in the following steps:

  1. If there is a free buffer, take one. Otherwise go to step 2.
  2. Use the replacement mechanism to evict a buffer.

The code:

BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
    BufferDesc *buf;
    int         bgwprocno;
    int         trycounter;
    uint32      local_buf_state;    /* to avoid repeated (de-)referencing */

    /*
     * If given a strategy object, see whether it can select a buffer. We
     * assume strategy objects don't need buffer_strategy_lock.
     */
    if (strategy != NULL)
    {
        buf = GetBufferFromRing(strategy, buf_state);
        if (buf != NULL)
            return buf;
    }

    /*
     * If asked, we need to waken the bgwriter. Since we don't want to rely on
     * a spinlock for this we force a read from shared memory once, and then
     * set the latch based on that value. We need to go through that length
     * because otherwise bgprocno might be reset while/after we check because
     * the compiler might just reread from memory.
     *
     * This can possibly set the latch of the wrong process if the bgwriter
     * dies in the wrong moment. But since PGPROC->procLatch is never
     * deallocated the worst consequence of that is that we set the latch of
     * some arbitrary process.
     */
    bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
    if (bgwprocno != -1)
    {
        /* reset bgwprocno first, before setting the latch */
        StrategyControl->bgwprocno = -1;

        /*
         * Not acquiring ProcArrayLock here which is slightly icky. It's
         * actually fine because procLatch isn't ever freed, so we just can
         * potentially set the wrong process' (or no process') latch.
         */
        SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
    }

    /*
     * We count buffer allocation requests so that the bgwriter can estimate
     * the rate of buffer consumption.  Note that buffers recycled by a
     * strategy object are intentionally not counted here.
     */
    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);

    /*
     * First check, without acquiring the lock, whether there's buffers in the
     * freelist. Since we otherwise don't require the spinlock in every
     * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
     * uselessly in most cases. That obviously leaves a race where a buffer is
     * put on the freelist but we don't see the store yet - but that's pretty
     * harmless, it'll just get used during the next buffer acquisition.
     *
     * If there's buffers on the freelist, acquire the spinlock to pop one
     * buffer of the freelist. Then check whether that buffer is usable and
     * repeat if not.
     *
     * Note that the freeNext fields are considered to be protected by the
     * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
     * manipulate them without holding the spinlock.
     *
     * step 1: get a free buffer.
     */
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        while (true)
        {
            /*
             * Acquire the spinlock to remove element from the freelist
             * (take the lock)
             */
            SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

            if (StrategyControl->firstFreeBuffer < 0)
            {
                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
                break;
            }

            buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
            Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

            /* Unconditionally remove buffer from freelist */
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;

            /*
             * Release the lock so someone else can access the freelist while
             * we check out this buffer.
             */
            SpinLockRelease(&StrategyControl->buffer_strategy_lock);

            /*
             * If the buffer is pinned or has a nonzero usage_count, we cannot
             * use it; discard it and retry.  (This can only happen if VACUUM
             * put a valid buffer in the freelist and then someone else used
             * it before we got to it.  It's probably impossible altogether as
             * of 8.3, but we'd better check anyway.)
             */
            local_buf_state = LockBufHdr(buf);
            if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
                && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
            {
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
            UnlockBufHdr(buf, local_buf_state);
        }
    }

    /*
     * Nothing on the freelist, so run the "clock sweep" algorithm
     * step 2: use the replacement mechanism to evict a buffer.
     */
    trycounter = NBuffers;
    for (;;)
    {
        buf = GetBufferDescriptor(ClockSweepTick());

        /*
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; decrement the usage_count (unless pinned) and keep scanning.
         * (take the header lock)
         */
        local_buf_state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
        {
            if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
            {
                local_buf_state -= BUF_USAGECOUNT_ONE;
                trycounter = NBuffers;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
        }
        else if (--trycounter == 0)
        {
            /*
             * We've scanned all the buffers without making any state changes,
             * so all the buffers are pinned (or were when we looked at them).
             * We could hope that someone will free one eventually, but it's
             * probably better to fail than to risk getting stuck in an
             * infinite loop.
             */
            UnlockBufHdr(buf, local_buf_state);
            elog(ERROR, "no unpinned buffers available");
        }
        UnlockBufHdr(buf, local_buf_state);
    }
}

Note that both step 1 and step 2 acquire a lock (the buffer_strategy_lock spinlock and the buffer header lock, respectively), so two processes cannot end up with the same buffer.

Shared buffer replacement policies

The number of buffers in the pool is fixed at initialization (defined by NBuffers; 1000 by default in the version discussed here) and never changes afterward. Under continuous operation the pool can therefore run out of free buffers, at which point some recently unused buffers must be replaced in order to load the requested file blocks.

PostgreSQL provides two replacement policies: the normal replacement policy and the buffer-ring replacement policy. In the StrategyGetBuffer code above, the ring policy is implemented by GetBufferFromRing, called from the if (strategy != NULL) block at the top of the function; the rest of the function implements the normal policy. The two policies are described in turn below:

The normal replacement policy

The two steps of the normal policy were already outlined above; here they are in more detail.

Taking a free buffer when one exists

The buffer pool maintains a FreeList, a singly linked list. Buffers on the FreeList are chained through the freeNext field of their descriptors, and the BufferStrategyControl structure records the first and last elements of the list. When a buffer is no longer in use it is returned to the FreeList (see the StrategyFreeBuffer sketch after the structure below); when a free buffer is needed, one is taken from the head. BufferStrategyControl is defined as follows:

typedef struct
{
    /* Spinlock: protects the values below */
    slock_t     buffer_strategy_lock;

    /*
     * Clock sweep hand: index of next buffer to consider grabbing. Note that
     * this isn't a concrete buffer - we only ever increase the value. So, to
     * get an actual buffer, it needs to be used modulo NBuffers.
     */
    pg_atomic_uint32 nextVictimBuffer;

    int         firstFreeBuffer;    /* Head of list of unused buffers */
    int         lastFreeBuffer;     /* Tail of list of unused buffers */

    /*
     * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
     * when the list is empty)
     */

    /*
     * Statistics.  These counters should be wide enough that they can't
     * overflow during a single bgwriter cycle.
     */
    uint32      completePasses; /* Complete cycles of the clock sweep */
    pg_atomic_uint32 numBufferAllocs;   /* Buffers allocated since last reset */

    /*
     * Bgworker process to be notified upon activity or -1 if none. See
     * StrategyNotifyBgWriter.
     */
    int         bgwprocno;
} BufferStrategyControl;
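Returning a buffer to the FreeList is done by StrategyFreeBuffer (freelist.c, lightly paraphrased); note that it pushes onto the head of the list:

void
StrategyFreeBuffer(BufferDesc *buf)
{
    SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

    /*
     * It is possible that we are told to put something in the freelist that
     * is already in it; don't screw up the list if so.
     */
    if (buf->freeNext == FREENEXT_NOT_IN_LIST)
    {
        buf->freeNext = StrategyControl->firstFreeBuffer;
        if (buf->freeNext < 0)
            StrategyControl->lastFreeBuffer = buf->buf_id;
        StrategyControl->firstFreeBuffer = buf->buf_id;
    }

    SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}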

Replacing a buffer via the replacement mechanism

The replacement mechanism is in fact a simple clock-sweep algorithm. The main flow (a simplified sketch of ClockSweepTick follows this list):

  1. Initialize trycounter = NBuffers.
  2. Locate the buffer indicated by the nextVictimBuffer field (initially 0).
  3. Increment nextVictimBuffer; if it pointed at the last buffer in the pool, wrap it back to 0.
  4. If the buffer found in step 2 has refcount 0:
    a. If its usagecount is nonzero, decrement usagecount and reset trycounter to NBuffers.
    b. Otherwise take this buffer and return it.
  5. If the buffer found in step 2 has a nonzero refcount, decrement trycounter; if trycounter reaches 0, raise an error.
  6. Return to step 2.
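The clock hand itself is advanced by ClockSweepTick. A simplified sketch (the real function in freelist.c also maintains completePasses for the bgwriter when the counter wraps around):

/* Simplified sketch of ClockSweepTick; not the full wraparound handling. */
static inline uint32
ClockSweepTick(void)
{
    uint32      victim;

    /* atomically advance the hand; the counter only ever increases */
    victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    /* the hand is a virtual position: reduce it modulo NBuffers */
    return victim % NBuffers;
}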

To make this clearer, here is that part of StrategyGetBuffer (the clock-sweep loop) again, annotated with the step numbers above:

trycounter = NBuffers;      /* step 1 */
for (;;)
{
    /* steps 2-3 */
    buf = GetBufferDescriptor(ClockSweepTick());

    /*
     * If the buffer is pinned or has a nonzero usage_count, we cannot use
     * it; decrement the usage_count (unless pinned) and keep scanning.
     */
    local_buf_state = LockBufHdr(buf);

    /* step 4: is refcount 0? */
    if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
    {
        /* is usagecount 0? */
        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
        {
            /* usagecount is nonzero: decrement it and reset trycounter to NBuffers */
            local_buf_state -= BUF_USAGECOUNT_ONE;
            trycounter = NBuffers;
        }
        else
        {
            /* usagecount is 0: take this buffer and return it */
            if (strategy != NULL)   /* if a ring strategy is in use, add the buffer to the ring */
                AddBufferToRing(strategy, buf);
            *buf_state = local_buf_state;
            return buf;
        }
    }
    else if (--trycounter == 0)
    {
        /* step 5 */
        /*
         * We've scanned all the buffers without making any state changes,
         * so all the buffers are pinned (or were when we looked at them).
         * We could hope that someone will free one eventually, but it's
         * probably better to fail than to risk getting stuck in an
         * infinite loop.
         */
        UnlockBufHdr(buf, local_buf_state);
        elog(ERROR, "no unpinned buffers available");
    }
    UnlockBufHdr(buf, local_buf_state);
    /* step 6: continue the loop */
}

The core idea:
Whatever the database, the heart of cache replacement is swapping out the pages that are accessed least often. PostgreSQL expresses a page's access frequency through usage_count: the count starts at 1 when a block is loaded (so the buffer survives one clock-sweep pass, per the comment in BufferAlloc above) and each pin of the page increments it, up to a small cap. The more frequently a page is accessed, the larger its usage_count, the longer the clock sweep takes to drive it to 0, and the less easily the page is evicted. (See the sketch below.)
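This is where the increment happens, condensed from PinBuffer in bufmgr.c (the real code loops on a compare-and-swap and skips the bump when a ring strategy is in use; only the usage_count handling is shown):

/* Condensed sketch of the usage_count bump inside PinBuffer. */
uint32      old_buf_state = pg_atomic_read_u32(&buf->state);
uint32      buf_state = old_buf_state;

buf_state += BUF_REFCOUNT_ONE;          /* the pin itself */

/* bump usage_count, but never beyond the cap (BM_MAX_USAGE_COUNT, 5) */
if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
    buf_state += BUF_USAGECOUNT_ONE;

pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state, buf_state);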

The buffer-ring replacement policy

The buffer ring is an optimization of the normal policy. Consider the following scenario: several processes are performing routine operations on the database when one of them launches a full-table scan. The scan touches a large number of physical blocks, but each block exactly once. Under the normal policy, the scan would flood the buffer pool with pages that will be used only once, evicting many pages that would be used repeatedly. That plainly defeats the purpose of the buffer pool, which is to reduce I/O. The ring's basic idea is to allocate a fixed number of buffers and perform replacement among them first; only when none of them can be replaced does the normal policy take over. The ring is controlled by the BufferAccessStrategy structure, defined as follows:

typedef struct BufferAccessStrategyData
{
    /* Overall strategy type (which ring policy) */
    BufferAccessStrategyType btype;
    /* Number of elements in buffers[] array (the ring size) */
    int         ring_size;

    /*
     * Index of the "current" slot in the ring, ie, the one most recently
     * returned by GetBufferFromRing.
     */
    int         current;

    /*
     * True if the buffer just returned by StrategyGetBuffer had been in the
     * ring already (i.e. was taken directly from the ring).
     */
    bool        current_was_in_ring;

    /*
     * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
     * have not yet selected a buffer for this ring slot.  For allocation
     * simplicity this is palloc'd together with the fixed fields of the
     * struct.  (Stores the numbers of the buffers added to the ring.)
     */
    Buffer      buffers[FLEXIBLE_ARRAY_MEMBER];
}   BufferAccessStrategyData;
typedef struct BufferAccessStrategyData *BufferAccessStrategy;
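How big is the ring? GetAccessStrategy (freelist.c) picks a size per workload type; paraphrased:

/* Paraphrased from GetAccessStrategy (freelist.c, PostgreSQL 10 era). */
switch (btype)
{
    case BAS_NORMAL:
        return NULL;                            /* no ring: normal policy */
    case BAS_BULKREAD:
        ring_size = 256 * 1024 / BLCKSZ;        /* 256 KB, e.g. large seq scans */
        break;
    case BAS_BULKWRITE:
        ring_size = 16 * 1024 * 1024 / BLCKSZ;  /* 16 MB, e.g. COPY IN */
        break;
    case BAS_VACUUM:
        ring_size = 256 * 1024 / BLCKSZ;        /* 256 KB */
        break;
}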

The ring policy is implemented in GetBufferFromRing, which proceeds in three steps:

  1. Advance the current pointer in the strategy to the next element of its buffers array (the candidate next buffer); if current pointed at the last element of buffers, reset it to 0 (the first element).

  2. Examine the element current now points at. If it holds InvalidBuffer, the ring has not filled up yet and no buffer has been recorded in this slot. In that case, set the strategy's current_was_in_ring field to false and return NULL.

    On seeing the NULL return value, the caller of GetBufferFromRing (StrategyGetBuffer) obtains a free buffer using the normal replacement policy and adds that buffer to the ring via AddBufferToRing.

  3. If the element holds a valid buffer number, check that buffer's refcount and usagecount. If refcount is 0 and usagecount <= 1 (accessed at most once recently, and that access was most likely the current process's own during the full-table scan), evict this buffer and return it. Otherwise the buffer is still in use by, or was recently used by, other processes; as in step 2, the caller falls back to the normal replacement policy to obtain a free buffer.

In short: look at the buffer in the slot after the current pointer; if it is a valid buffer that no process is using and that has been accessed at most once recently, return it; otherwise fall back to the normal replacement policy.

The code:

static BufferDesc *
GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
{
    BufferDesc *buf;
    Buffer      bufnum;
    uint32      local_buf_state;    /* to avoid repeated (de-)referencing */

    /* Advance to next ring slot */
    if (++strategy->current >= strategy->ring_size)
        strategy->current = 0;

    /*
     * If the slot hasn't been filled yet, tell the caller to allocate a new
     * buffer with the normal allocation strategy.  He will then fill this
     * slot by calling AddBufferToRing with the new buffer.
     */
    bufnum = strategy->buffers[strategy->current];
    if (bufnum == InvalidBuffer)
    {
        strategy->current_was_in_ring = false;
        return NULL;
    }

    /*
     * If the buffer is pinned we cannot use it under any circumstances.
     *
     * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
     * since our own previous usage of the ring element would have left it
     * there, but it might've been decremented by clock sweep since then). A
     * higher usage_count indicates someone else has touched the buffer, so we
     * shouldn't re-use it.
     */
    buf = GetBufferDescriptor(bufnum - 1);
    local_buf_state = LockBufHdr(buf);
    if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
        && BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
    {
        strategy->current_was_in_ring = true;
        *buf_state = local_buf_state;
        return buf;
    }
    UnlockBufHdr(buf, local_buf_state);

    /*
     * Tell caller to allocate a new buffer with the normal allocation
     * strategy.  He'll then replace this ring element via AddBufferToRing.
     */
    strategy->current_was_in_ring = false;
    return NULL;
}

AddBufferToRing

AddBufferToRing was mentioned above; here is its implementation:

static void
AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
{
    strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}

The function is trivial: it stores the buffer's Buffer number (buf_id + 1, produced by BufferDescriptorGetBuffer) into the ring slot that current points at.
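To close the loop, here is a hedged sketch of how a caller uses a ring end to end. rel and nblocks are assumed to be in scope, and a real sequential scan passes the strategy through the heap-scan machinery rather than calling ReadBufferExtended directly:

/* Sketch: read a relation block by block through a 256 KB bulk-read ring. */
BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
BlockNumber blkno;

for (blkno = 0; blkno < nblocks; blkno++)
{
    Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                         RBM_NORMAL, strategy);

    /* ... examine the page ... */
    ReleaseBuffer(buf);
}

FreeAccessStrategy(strategy);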
