PostgreSQL Source Code (5): Buffer Management

Study notes based on: https://www.interdb.jp/pg/pgsql08.html

Preface

The features a simple, imaginary buffer manager should provide (a minimal interface sketch follows the list):

    • allocate a free in-memory PAGE for writing, mapped to a PAGE on disk
    • when full, automatically evict and flush a PAGE
    • after writing, choose to flush immediately or lazily
    • read a PAGE that is already cached
    • read a PAGE from disk
    • when full, automatically evict and flush a PAGE, then read in the one that is needed
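
As a starting point, here is a minimal sketch of what such an interface could look like. This is purely hypothetical; none of these names exist in PostgreSQL:

/* Hypothetical minimal buffer manager interface -- illustration only. */
typedef int BufId;

BufId buf_alloc(int disk_pageno);   /* free in-memory page mapped to a disk page */
BufId buf_read(int disk_pageno);    /* return cached page, or load it, evicting
                                     * and flushing a victim if the pool is full */
void  buf_flush(BufId id);          /* flush to disk immediately */
void  buf_mark_dirty(BufId id);     /* flush lazily, at eviction/checkpoint time */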

Let's see how PG implements this.

The PG Implementation

1 TAG

{(16821, 16384, 37721), 1, 3}

This means:

  • tablespace = 16821
  • db = 16384
  • table = 37721
  • fork 1: the freespace map
  • block 3 of that file
typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
} RelFileNode;

typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;        /* tables, freespace maps and visibility maps
                                 * are forks 0, 1 and 2 */
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
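
For instance, the example tag at the top of this section could be constructed with the INIT_BUFFERTAG macro (a sketch; fork 1 is FSM_FORKNUM, the freespace map):

BufferTag   tag;
RelFileNode rnode;

rnode.spcNode = 16821;      /* tablespace */
rnode.dbNode  = 16384;      /* database   */
rnode.relNode = 37721;      /* relation   */

INIT_BUFFERTAG(tag, rnode, FSM_FORKNUM, 3);     /* {(16821, 16384, 37721), 1, 3} */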

2 Structure

A three-layer structure (the second layer is logical):

  • the buffer table can find a buffer id directly from the hash
  • before using a buffer slot, the buffer id must be taken to its desc to check the slot's meta information
  • so logically there is a layer of desc array in the middle

2.1 Buffer Table

Pay attention here: this is the core entry-point data structure, the mapping from tag to id.

  • key

    • BufferTag

      • RelFileNode rnode
      • ForkNumber forkNum
      • BlockNumber blockNum
  • value
    • BufferLookupEnt

      • BufferTag key
      • int id
  • A partitioned hash table, so the locking granularity is finer.
void
InitBufTable(int size)
{
    HASHCTL     info;

    /* assume no locking is needed yet */

    /* BufferTag maps to Buffer */
    info.keysize = sizeof(BufferTag);
    info.entrysize = sizeof(BufferLookupEnt);
    info.num_partitions = NUM_BUFFER_PARTITIONS;

    SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
                                  size, size,
                                  &info,
                                  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
}
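
A lookup against this table (the same steps BufferAlloc takes, shown in full in section 5) goes roughly like this sketch:

BufferTag   tag;            /* filled in with INIT_BUFFERTAG as in section 1 */
uint32      hashcode;
LWLock     *partitionLock;
int         buf_id;

hashcode = BufTableHashCode(&tag);                  /* hash the tag         */
partitionLock = BufMappingPartitionLock(hashcode);  /* pick 1 of 128 locks  */

LWLockAcquire(partitionLock, LW_SHARED);
buf_id = BufTableLookup(&tag, hashcode);            /* >= 0 means cached    */
LWLockRelease(partitionLock);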

2.2 Buffer Descriptor

This too is an array, initialized here:

void
InitBufferPool(void)
{
    bool        foundBufs,
                foundDescs,
                foundIOLocks,
                foundBufCkpt;

    /* Align descriptors to a cacheline boundary. */
    BufferDescriptors = (BufferDescPadded *)
        ShmemInitStruct("Buffer Descriptors",
                        NBuffers * sizeof(BufferDescPadded),
                        &foundDescs);
...

What gets initialized is BufferDescPadded; one desc per buffer:

typedef union BufferDescPadded
{
    BufferDesc  bufferdesc;
    char        pad[BUFFERDESC_PAD_TO_SIZE];
} BufferDescPadded;

The actual contents:

/*
 *  BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
【the BM_LOCKED (buffer header) lock must be held to read or write tag, state and wait_backend_pid】
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
【assumption: while BM_LOCKED (the buffer header lock) is held, nobody else may update state】
【so the lock holder can make complex updates to state, then release the lock in the same write】
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either. To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;       /* link in freelist chain */

    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;
  • tag/buf_id: covered above

  • state

    • flags

      • dirty bit: indicates whether the stored page is dirty.
      • valid bit: the page can be read. (1) Valid: the slot holds data and the corresponding desc holds data; the page is readable. (2) Invalid: the desc holds no data, or a page replacement is in progress.
      • io_in_progress bit: whether the buffer manager is reading/writing the associated page from/to storage; in other words, whether a single process holds that descriptor's io_in_progress_lock.
    • refcount
      • counts the processes currently accessing the page, also called the pin count. Accessing a page requires pin count++; after use, pin count--.
      • pin count = 0 is called unpinned; nonzero is called pinned.
    • usagecount: counts how many times the page has been accessed since it was loaded; the clock algorithm uses it.
  • freeNext: the next free buffer; a free list threaded over the array

【The three page states a desc describes (a classification sketch follows the list)】

  • Empty

    • When the corresponding buffer pool slot stores no page (i.e. refcount and usage_count are 0), the descriptor's state is empty.
  • Pinned
    • When the corresponding buffer pool slot stores a page and some PostgreSQL process is accessing it (i.e. refcount and usage_count are >= 1), the descriptor's state is pinned.
  • Unpinned
    • When the corresponding buffer pool slot stores a page but no PostgreSQL process is accessing it (i.e. usage_count >= 1 but refcount is 0), the descriptor's state is unpinned.
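
A small sketch of how these three states fall out of the packed state word, using the BUF_STATE_GET_* accessors shown in section 4 (illustration only; real code reads state under the header lock or via atomics):

static const char *
buf_state_name(uint32 buf_state)
{
    uint32  refcount   = BUF_STATE_GET_REFCOUNT(buf_state);
    uint32  usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);

    if (refcount > 0)
        return "pinned";        /* somebody is accessing the page   */
    if (usagecount > 0)
        return "unpinned";      /* page cached, nobody accessing it */
    return "empty";             /* slot holds no page               */
}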

2.3 Buffer Descriptor: the Logical Layer

The BufferDescriptors array is initialized, and the freelist is set up with buf->freeNext = i + 1:

...
BufferDescPadded *BufferDescriptors;
...

void
InitBufferPool(void)
{
    bool        foundBufs,
                foundDescs,
                foundIOLocks,
                foundBufCkpt;

    /* Align descriptors to a cacheline boundary. */
    BufferDescriptors = (BufferDescPadded *)
        ShmemInitStruct("Buffer Descriptors",
                        NBuffers * sizeof(BufferDescPadded),
                        &foundDescs);
...
    /*
     * Initialize all the buffer headers.
     */
    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *buf = GetBufferDescriptor(i);

        CLEAR_BUFFERTAG(buf->tag);

        pg_atomic_init_u32(&buf->state, 0);
        buf->wait_backend_pid = 0;

        buf->buf_id = i;

        /*
         * Initially link all the buffers together as unused. Subsequent
         * management of this list is done by freelist.c.
         */
        buf->freeNext = i + 1;

        LWLockInitialize(BufferDescriptorGetContentLock(buf),
                         LWTRANCHE_BUFFER_CONTENT);

        LWLockInitialize(BufferDescriptorGetIOLock(buf),
                         LWTRANCHE_BUFFER_IO_IN_PROGRESS);
    }
...
...

The process of loading the first page (a sketch of these steps follows the list):

  1. Take a free desc from the freelist and pin it (refcount++, usage_count++)
  2. Insert a new entry recording tag : buffer_id into the buffer table
  3. Read the page contents from storage into memory
  4. Update the meta information in the desc
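
A simplified sketch of that sequence using the real function names (the full path is ReadBuffer_common -> BufferAlloc, walked through in section 5; locking and error handling are omitted here):

/* 1. pop a free desc from the freelist and pin it */
buf = StrategyGetBuffer(NULL, &buf_state);
PinBuffer_Locked(buf);

/* 2. record tag : buffer_id in the buffer table */
BufTableInsert(&newTag, newHash, buf->buf_id);

/* 3. read the page contents from storage into the buffer pool slot */
smgrread(smgr, forkNum, blockNum, (char *) BufHdrGetBlock(buf));

/* 4. update the meta information in the desc */
buf->tag = newTag;      /* plus state flags such as BM_VALID | BM_TAG_VALID */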

Once used, a desc is never returned to the freelist, unless:

  • the table or index is dropped
  • the database is dropped
  • the table or index is emptied by VACUUM FULL

2.4 Buffer Pool

A region of memory of size 8K * NBuffers:

BufferBlocks = (char *)
    ShmemInitStruct("Buffer Blocks",
                    NBuffers * (Size) BLCKSZ, &foundBufs);
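
The mapping from a buffer number back into this area is plain pointer arithmetic; bufmgr.h defines it roughly like this (simplified, ignoring local buffers; the - 1 is because Buffer numbers are 1-based, see PinBuffer below):

#define BufferGetBlock(buffer) \
    ((Block) (BufferBlocks + ((Size) ((buffer) - 1)) * BLCKSZ))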

3 Locks

All of these locks live in shared memory.

3.1 Buffer Table Locks

BufMappingLock

The hash table's partition locks, taken in shared or exclusive mode.

3.2 Desc Locks

content_lock

A lightweight lock for reading and writing the PAGE contents, taken in shared or exclusive mode.

  • exclusive mode is needed for:

    • inserting tuples into the page, or modifying a tuple's t_xmin/t_xmax fields
    • physically deleting tuples or compacting the remaining page free space (vacuum)
    • freezing tuples inside the page

io_in_progress_lock

Used to wait for the page's IO to complete: while a process loads or writes page data from/to storage, it holds that descriptor's exclusive io_in_progress lock for the duration of the storage access.

spinlock (nowadays the BM_LOCKED flag bit)

A spinlock is taken while modifying the desc's flags and other fields.

For example, pinning:

LockBufHdr(bufferdesc);    /* Acquire a spinlock */
bufferdesc->refcount++;
bufferdesc->usage_count++;
UnlockBufHdr(bufferdesc); /* Release the spinlock */

For example, setting the dirty bit to '1':

#define BM_DIRTY             (1 << 0)    /* data needs writing */
#define BM_VALID             (1 << 1)    /* data is valid */
#define BM_TAG_VALID         (1 << 2)    /* tag is assigned */
#define BM_IO_IN_PROGRESS    (1 << 3)    /* read or write in progress */
#define BM_JUST_DIRTIED      (1 << 5)    /* dirtied since write started */

LockBufHdr(bufferdesc);
bufferdesc->flags |= BM_DIRTY;
UnlockBufHdr(bufferdesc);

4 Eviction Strategies

Four strategies:

typedef enum BufferAccessStrategyType
{
    BAS_NORMAL,                 /* Normal random access */
    BAS_BULKREAD,               /* Large read-only scan (hint bit updates are
                                 * ok) */
    BAS_BULKWRITE,              /* Large multi-block write (e.g. COPY IN) */
    BAS_VACUUM                  /* VACUUM */
} BufferAccessStrategyType;
BufferAccessStrategyType    Use case                    Replacement algorithm
BAS_NORMAL                  normal random access        clock sweep
BAS_BULKREAD                bulk reads                  ring buffer of 256 * 1024 / BLCKSZ blocks
BAS_BULKWRITE               bulk writes (e.g. COPY IN)  ring buffer of 16 * 1024 * 1024 / BLCKSZ blocks
BAS_VACUUM                  the VACUUM process          ring buffer of 256 * 1024 / BLCKSZ blocks
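
A nondefault strategy is created with GetAccessStrategy() and passed to ReadBufferExtended(); for example (a usage sketch):

/* read rel's block 0 through the bulk-read ring instead of the main pool */
BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
Buffer  buf = ReadBufferExtended(rel, MAIN_FORKNUM, 0, RBM_NORMAL, strategy);

/* ... use the page ... */

ReleaseBuffer(buf);
FreeAccessStrategy(strategy);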

The clock sweep algorithm

/*
 * The shared freelist control information.
 */
typedef struct
{
    /* Spinlock: protects the values below */
    // spinlock guarding the members below
    slock_t     buffer_strategy_lock;

    /*
     * Clock sweep hand: index of next buffer to consider grabbing. Note that
     * this isn't a concrete buffer - we only ever increase the value. So, to
     * get an actual buffer, it needs to be used modulo NBuffers.
     */
    // where the next sweep resumes
    pg_atomic_uint32 nextVictimBuffer;

    // head of the free-buffer list
    int         firstFreeBuffer;    /* Head of list of unused buffers */
    // tail of the free-buffer list
    int         lastFreeBuffer;     /* Tail of list of unused buffers */

    /*
     * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
     * when the list is empty)
     */

    /*
     * Statistics.  These counters should be wide enough that they can't
     * overflow during a single bgwriter cycle.
     */
    // number of complete passes over the whole array
    uint32      completePasses; /* Complete cycles of the clock sweep */
    pg_atomic_uint32 numBufferAllocs;   /* Buffers allocated since last reset */

    /*
     * Bgworker process to be notified upon activity or -1 if none. See
     * StrategyNotifyBgWriter.
     */
    int         bgwprocno;
} BufferStrategyControl;

Each sweep resumes from where the previous one stopped, then checks each buffer's reference count (refcount) and access count (usagecount):

  1. If refcount and usagecount are both zero, return that buffer immediately.
  2. If refcount is zero but usagecount is nonzero, decrement its usagecount by 1 and move on to the next buffer.
  3. If refcount is nonzero, move on to the next buffer.

The clock sweep is an unbounded loop: it keeps going until it finds a buffer whose refcount and usagecount are both zero.
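
A self-contained toy model of the sweep (illustration only; the real StrategyGetBuffer below also handles the freelist, the buffer header locks, and the "no unpinned buffers available" error):

#include <stdint.h>

#define NBUFFERS 8

static uint32_t refcount[NBUFFERS];     /* pin counts                   */
static uint32_t usagecount[NBUFFERS];   /* capped at 5 in the real code */
static uint32_t next_victim = 0;        /* the clock hand               */

/* Return the index of an evictable buffer (refcount == 0, usagecount == 0). */
static int
toy_clock_sweep(void)
{
    for (;;)
    {
        int  b = (int) (next_victim++ % NBUFFERS);

        if (refcount[b] == 0)
        {
            if (usagecount[b] == 0)
                return b;       /* found a victim */
            usagecount[b]--;    /* give it one more chance next pass */
        }
        /* pinned buffers are skipped without touching their usagecount */
    }
}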

The freelist

  • To speed up finding a free buffer, PostgreSQL keeps these buffers on a linked list.

  • The head and tail of the list are given by the firstFreeBuffer and lastFreeBuffer members of BufferStrategyControl.

  • The list nodes are BufferDesc structs; the freeNext member points to the next node.

  • When a buffer is returned to the freelist (StrategyFreeBuffer), it is linked in at the head; when a free buffer is needed, it is likewise taken from the head (see the sketch after this list).
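
The corresponding push and pop, simplified from StrategyFreeBuffer() and the freelist branch of StrategyGetBuffer() (a sketch; the real code holds buffer_strategy_lock around both operations):

/* push a buffer back onto the list head */
static void
sketch_free_buffer(BufferDesc *buf)
{
    if (buf->freeNext == FREENEXT_NOT_IN_LIST)
    {
        buf->freeNext = StrategyControl->firstFreeBuffer;
        if (buf->freeNext < 0)
            StrategyControl->lastFreeBuffer = buf->buf_id;
        StrategyControl->firstFreeBuffer = buf->buf_id;
    }
}

/* pop the list head, or -1 if the list is empty */
static int
sketch_pop_free_buffer(void)
{
    int     buf_id = StrategyControl->firstFreeBuffer;

    if (buf_id >= 0)
    {
        BufferDesc *buf = GetBufferDescriptor(buf_id);

        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
    }
    return buf_id;
}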

Reference counting

The low 18 bits of state record the reference count in shared memory.

typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;       /* link in freelist chain */

    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;

/*
 * Buffer state is a single 32-bit variable where following data is combined.
 *
 * - 18 bits refcount
【each backend tracks its own pins in a private array+hash two-level cache; this shared field records the merged result】
 * - 4 bits usage count
 * - 10 bits of flags
 *
 * Combining these values allows to perform some operations without locking
 * the buffer header, by modifying them together with a CAS loop.
 *
 * The definition of buffer state components is below.
 */
#define BUF_REFCOUNT_ONE 1
#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK 0x003C0000U
#define BUF_USAGECOUNT_ONE (1U << 18)
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_FLAG_MASK 0xFFC00000U

/* Get refcount and usagecount from buffer state */
#define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)
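
For example, packing and then unpacking a made-up state word with these macros (a sketch):

uint32  state = 2 | (3 << BUF_USAGECOUNT_SHIFT);    /* refcount = 2, usagecount = 3 */

Assert(BUF_STATE_GET_REFCOUNT(state) == 2);     /* low 18 bits */
Assert(BUF_STATE_GET_USAGECOUNT(state) == 3);   /* bits 18-21  */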

The real reference counters live here, one per buffer:

typedef struct PrivateRefCountEntry
{
    Buffer      buffer;
    int32       refcount;
} PrivateRefCountEntry;

To find a given buffer's reference count quickly, the PrivateRefCountEntry array serves as the first-level cache, with a hash table as the second level.

Note: this cache is private to each process!

/*
 * Backend-Private refcount management:
 *
 * Each buffer also has a private refcount that keeps track of the number of
 * times the buffer is pinned in the current process.  This is so that the
 * shared refcount needs to be modified only once if a buffer is pinned more
 * than once by an individual backend.  It's also used to check that no buffers
 * are still pinned at the end of transactions and when exiting.
【the current process records in private memory how many times each buffer it uses is pinned】
【purpose 1: if the process pins a buffer several times, shared memory only needs a single pin】
【purpose 2: also used to check that no buffers remain pinned at transaction end or process exit】
 *
 * To avoid - as we used to - requiring an array with NBuffers entries to keep
 * track of local buffers, we use a small sequentially searched array
 * (PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
 * keep track of backend local pins.
 *
【to avoid an NBuffers-sized array for tracking local pins, an 8-entry array plus an overflow hash table is used】
 * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
 * refcounts are kept track of in the array; after that, new array entries
 * displace old ones into the hash table. That way a frequently used entry
 * can't get "stuck" in the hashtable while infrequent ones clog the array.
 *
 * Note that in most scenarios the number of pinned buffers will not exceed
 * REFCOUNT_ARRAY_ENTRIES.
【until more than 8 buffers are pinned at once, all refcounts are tracked in the array; a new pin then displaces an old entry into the hash table】
【so frequently used entries stay in the array while the infrequently used ones end up in the hash table】
 *
 * To enter a buffer into the refcount tracking mechanism first reserve a free
 * entry using ReservePrivateRefCountEntry() and then later, if necessary,
 * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
 * memory allocations in NewPrivateRefCountEntry() which can be important
 * because in some scenarios it's called with a spinlock held...
【to use this tracking mechanism, first reserve a free array slot with ReservePrivateRefCountEntry()】
【then fill that slot with NewPrivateRefCountEntry() when it is actually used】
【why the split? to avoid memory allocation in NewPrivateRefCountEntry(), which is sometimes called while holding a spinlock】
 */

// [level-1 cache]: 8 entries
static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
// clock index used to pick a victim slot when the array is full
static uint32 PrivateRefCountClock = 0;
// points at a reserved free slot in the array
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;

// [level-2 cache]: key = buffer_id, value = PrivateRefCountEntry
static HTAB *PrivateRefCountHash = NULL;
// number of entries in the hash table
static int32 PrivateRefCountOverflowed = 0;

...
void
InitBufferPoolAccess(void)
{
    HASHCTL     hash_ctl;

    memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));

    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
    hash_ctl.keysize = sizeof(int32);
    hash_ctl.entrysize = sizeof(PrivateRefCountEntry);

    PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
                                      HASH_ELEM | HASH_BLOBS);
}

ReservePrivateRefCountEntry

(1) Reserves an entry in the first-level array in its initial state (buffer = InvalidBuffer, i.e. 0, refcount = 0).

(2) If the array is full, moves the entry at position PrivateRefCountClock into the hash table, then clears that slot and uses it.

(3) Note this function returns nothing; it only maintains ReservedRefCountEntry, pointing it at a free entry for storing a local refcount.

static void
ReservePrivateRefCountEntry(void)
{
    /* Already reserved (or freed), nothing to do */
    if (ReservedRefCountEntry != NULL)
        return;

    /*
     * First search for a free entry the array, that'll be sufficient in the
     * majority of cases.
     */
    {
        int         i;

        for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        {
            PrivateRefCountEntry *res;

            res = &PrivateRefCountArray[i];

            if (res->buffer == InvalidBuffer)
            {
                ReservedRefCountEntry = res;
                return;
            }
        }
    }

    /*
     * No luck. All array entries are full. Move one array entry into the hash
     * table.
     */
    {
        /*
         * Move entry from the current clock position in the array into the
         * hashtable. Use that slot.
         */
        PrivateRefCountEntry *hashent;
        bool        found;

        /* select victim slot */
        ReservedRefCountEntry =
            &PrivateRefCountArray[PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES];

        /* Better be used, otherwise we shouldn't get here. */
        Assert(ReservedRefCountEntry->buffer != InvalidBuffer);

        /* enter victim array entry into hashtable */
        hashent = hash_search(PrivateRefCountHash,
                              (void *) &(ReservedRefCountEntry->buffer),
                              HASH_ENTER,
                              &found);
        Assert(!found);
        hashent->refcount = ReservedRefCountEntry->refcount;

        /* clear the now free array slot */
        ReservedRefCountEntry->buffer = InvalidBuffer;
        ReservedRefCountEntry->refcount = 0;

        PrivateRefCountOverflowed++;
    }
}

NewPrivateRefCountEntry

(1) Fills in the buffer id.

(2) Returns a PrivateRefCountEntry whose new refcount is 0.

static PrivateRefCountEntry *
NewPrivateRefCountEntry(Buffer buffer)
{
    PrivateRefCountEntry *res;

    /* only allowed to be called when a reservation has been made */
    Assert(ReservedRefCountEntry != NULL);

    /* use up the reserved entry */
    res = ReservedRefCountEntry;
    ReservedRefCountEntry = NULL;

    /* and fill it */
    res->buffer = buffer;
    res->refcount = 0;

    return res;
}

GetPrivateRefCountEntry

(1) Finds the PrivateRefCountEntry for the given buffer id.

(2) If the buffer id is already in the array, returns a pointer to the array element (buffer_id, ref_count) directly.

(3) If it is not in the array, searches the hash table; if absent there too, returns NULL.

(4) If found in the hash table: do_move == true ? clear out an array slot and move the hash-table record into the array : return the found (buffer_id, ref_count) as-is.

static PrivateRefCountEntry *
GetPrivateRefCountEntry(Buffer buffer, bool do_move)
{
    PrivateRefCountEntry *res;
    int         i;

    Assert(BufferIsValid(buffer));
    Assert(!BufferIsLocal(buffer));

    /*
     * First search for references in the array, that'll be sufficient in the
     * majority of cases.
     */
    for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
    {
        res = &PrivateRefCountArray[i];

        if (res->buffer == buffer)
            return res;
    }

    /*
     * By here we know that the buffer, if already pinned, isn't residing in
     * the array.
     *
     * Only look up the buffer in the hashtable if we've previously overflowed
     * into it.
     */
    if (PrivateRefCountOverflowed == 0)
        return NULL;

    res = hash_search(PrivateRefCountHash,
                      (void *) &buffer,
                      HASH_FIND,
                      NULL);

    if (res == NULL)
        return NULL;
    else if (!do_move)
    {
        /* caller doesn't want us to move the hash entry into the array */
        return res;
    }
    else
    {
        /* move buffer from hashtable into the free array slot */
        bool        found;
        PrivateRefCountEntry *free;

        /* Ensure there's a free array slot */
        ReservePrivateRefCountEntry();

        /* Use up the reserved slot */
        Assert(ReservedRefCountEntry != NULL);
        free = ReservedRefCountEntry;
        ReservedRefCountEntry = NULL;
        Assert(free->buffer == InvalidBuffer);

        /* and fill it */
        free->buffer = buffer;
        free->refcount = res->refcount;

        /* delete from hashtable */
        hash_search(PrivateRefCountHash,
                    (void *) &buffer,
                    HASH_REMOVE,
                    &found);
        Assert(found);
        Assert(PrivateRefCountOverflowed > 0);
        PrivateRefCountOverflowed--;

        return free;
    }
}

5 SRC

ReadBufferExtended

/*
 * ReadBufferExtended -- returns a buffer containing the requested
 *      block of the requested relation.  If the blknum
 *      requested is P_NEW, extend the relation file and
 *      allocate a new block.  (Caller is responsible for
 *      ensuring that only one backend tries to extend a
 *      relation at the same time!)
【returns the requested PAGE; if blknum == P_NEW, the relation file is extended and a new page is allocated and read into memory】
 *
 * Returns: the buffer number for the buffer containing
 *      the block read.  The returned buffer has been pinned.
 *      Does not return on error --- elog's instead.
 *
【returns the requested, usable page; note that it comes back already PINned】
 * Assume when this function is called, that reln has been opened already.
 *
 * In RBM_NORMAL mode, the page is read from disk, and the page header is
 * validated.  An error is thrown if the page header is not valid.  (But
 * note that an all-zero page is considered "valid"; see PageIsVerified().)
【RBM_NORMAL mode: the page is read from disk and its page header is validated】
 *
 * RBM_ZERO_ON_ERROR is like the normal mode, but if the page header is not
 * valid, the page is zeroed instead of throwing an error. This is intended
 * for non-critical data, where the caller is prepared to repair errors.
 *
【RBM_ZERO_ON_ERROR mode: if page header validation fails, the page is zeroed rather than raising an error; for non-critical data】
 * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
 * filled with zeros instead of reading it from disk.  Useful when the caller
 * is going to fill the page from scratch, since this saves I/O and avoids
 * unnecessary failure if the page-on-disk has corrupt page headers.
 * The page is returned locked to ensure that the caller has a chance to
 * initialize the page before it's made visible to others.
 * Caution: do not use this mode to read a page that is beyond the relation's
 * current physical EOF; that is likely to cause problems in md.c when
 * the page is modified and written out. P_NEW is OK, though.
【RBM_ZERO_AND_LOCK, the high-performance mode: if the page is not in the buffer it is zero-filled instead of read from disk; it is returned locked so nobody can read the zeros before initialization】
 *
 * RBM_ZERO_AND_CLEANUP_LOCK is the same as RBM_ZERO_AND_LOCK, but acquires
 * a cleanup-strength lock on the page.
 *
 * RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
 *
 * If strategy is not NULL, a nondefault buffer access strategy is used.
 * See buffer/README for details.
 */
Buffer
ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
                   ReadBufferMode mode, BufferAccessStrategy strategy)
{
    bool        hit;
    Buffer      buf;

    /* Open it at the smgr level if not already done */
    RelationOpenSmgr(reln);

This ends up using mdopen to open the physical file, recording the opened vfd in the relation's md_seg_fds:

1. Fetch the recorded open file from the relation via return &reln->md_seg_fds[forknum][0]; if it is not there, the file must be opened through the VFD layer.

2. Open the file with PathNameOpenFile and record the VFD in md_seg_fds.

    /*
     * Reject attempts to read non-local temporary relations; we would be
     * likely to get wrong data since we have no visibility into the owning
     * session's local buffers.
     */
    if (RELATION_IS_OTHER_TEMP(reln))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot access temporary tables of other sessions")));

    /*
     * Read the buffer, and update pgstat counters to reflect a cache hit or
     * miss.
     */
    pgstat_count_buffer_read(reln);
    buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
                            forkNum, blockNum, mode, strategy, &hit);
    if (hit)
        pgstat_count_buffer_hit(reln);
    return buf;
}

ReadBuffer_common

/*
 * ReadBuffer_common -- common logic for all ReadBuffer variants
 *
 * *hit is set to true if the request was satisfied from shared buffer cache.
 */
static Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
                  BlockNumber blockNum, ReadBufferMode mode,
                  BufferAccessStrategy strategy, bool *hit)
{
...
    bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
                         strategy, &found);
...
}

BufferAlloc

/*
 * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
 *      buffer.  If no buffer exists already, selects a replacement
 *      victim and evicts the old page, but does NOT read in new page.
 *
【find a buffer for the tag; if none exists, evict a victim】
 * "strategy" can be a buffer replacement strategy object, or NULL for
 * the default strategy.  The selected buffer's usage_count is advanced when
 * using the default strategy, but otherwise possibly not (see PinBuffer).
 *
 * The returned buffer is pinned and is already marked as holding the
 * desired page.  If it already did have the desired page, *foundPtr is
 * set TRUE.  Otherwise, *foundPtr is set FALSE and the buffer is marked
 * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
 *
【the returned buffer is pinned and already marked as holding the desired page】
【if the page was already in the buffer pool, it is returned directly with *foundPtr = true】
【otherwise *foundPtr = false and the buffer is marked IO_IN_PROGRESS; the caller still has to do the I/O to fill it】
 * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
 * we keep it for simplicity in ReadBuffer.
 *
 * No locks are held either at entry or exit.
 */
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
            BlockNumber blockNum,
            BufferAccessStrategy strategy,
            bool *foundPtr)
{
    BufferTag   newTag;         /* identity of requested block */
    uint32      newHash;        /* hash value for newTag */
    LWLock     *newPartitionLock;   /* buffer partition lock for it */
    BufferTag   oldTag;         /* previous identity of selected buffer */
    uint32      oldHash;        /* hash value for oldTag */
    LWLock     *oldPartitionLock;   /* buffer partition lock for it */
    uint32      oldFlags;
    int         buf_id;
    BufferDesc *buf;
    bool        valid;
    uint32      buf_state;

    /* create a tag so we can lookup the buffer */
    INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);

    /* determine its hash code and partition lock ID */
    newHash = BufTableHashCode(&newTag);

Take the hash value % NUM_BUFFER_PARTITIONS (128), then fetch the lock from the MainLWLockArray array:

#define BufTableHashPartition(hashcode) ((hashcode) % NUM_BUFFER_PARTITIONS)

#define BufMappingPartitionLock(hashcode) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + BufTableHashPartition(hashcode)].lock)

    newPartitionLock = BufMappingPartitionLock(newHash);

    /* see if the block is in the buffer pool already */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
【the hash table maps tag --> buf_id; a hit means the page is already in the buffer pool】
    if (buf_id >= 0)
    {
        /*
         * Found it.  Now, pin the buffer so no one can steal it from the
         * buffer pool, and check to see if the correct data has been loaded
         * into the buffer.
         */
【found it! pin the buffer directly】
        buf = GetBufferDescriptor(buf_id);

【PinBuffer is expanded below; both the desc in shared memory and the local ref_count cache are incremented】
        valid = PinBuffer(buf, strategy);

        /* Can release the mapping lock as soon as we've pinned it */
【once pinned, the mapping lock can be released】
        LWLockRelease(newPartitionLock);

        *foundPtr = TRUE;

【found in the hash table, but after pinning the page turns out to be invalid: it must be read in again】
        if (!valid)
        {
            /*
             * We can only get here if (a) someone else is still reading in
             * the page, or (b) a previous read attempt failed.  We have to
             * wait for any active read attempt to finish, and then set up our
             * own read attempt if the page is still not BM_VALID.
             * StartBufferIO does it all.
             */
            if (StartBufferIO(buf, true))
            {
                /*
                 * If we get here, previous attempts to read the buffer must
                 * have failed ... but we shall bravely try again.
                 */
                *foundPtr = FALSE;
            }
        }

        return buf;
    }

【reaching here means the tag was not found in the hash table: the page is not cached】
    /*
     * Didn't find it in the buffer pool.  We'll have to initialize a new
     * buffer.  Remember to unlock the mapping lock while doing the work.
     */
【a new buffer must be initialized; release the lock first, then initialize】
    LWLockRelease(newPartitionLock);

    /* Loop here in case we have to try another victim buffer */
    for (;;)
    {
        /*
         * Ensure, while the spinlock's not yet held, that there's a free
         * refcount entry.
         */
【reserve a (buffer_id, ref_count) slot in the private level-1 array cache】
【if there is no room, displace one entry into the level-2 hash table, then take the freed slot】
        ReservePrivateRefCountEntry();

        /*
         * Select a victim buffer.  The buffer is returned with its header
         * spinlock still held!
         */
【find a buffer slot, returning its ID and buf_state; try the freelist first, else run clock sweep】
【this function is expanded below】
        buf = StrategyGetBuffer(strategy, &buf_state);

        Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);

        /* Must copy buffer flags while we still hold the spinlock */
        oldFlags = buf_state & BUF_FLAG_MASK;

        /* Pin the buffer and then release the buffer spinlock */
【the ref count is incremented both in the shared buffer and in the local cache】
        PinBuffer_Locked(buf);

        /*
         * If the buffer was dirty, try to write it out.  There is a race
         * condition here, in that someone might dirty it after we released it
         * above, or even while we are writing it out (since our share-lock
         * won't prevent hint-bit updates).  We will recheck the dirty bit
         * after re-locking the buffer header.
         */
【the victim needs flushing】
        if (oldFlags & BM_DIRTY)
        {
            /*
             * We need a share-lock on the buffer contents to write it out
             * (else we might write invalid data, eg because someone else is
             * compacting the page contents while we write).  We must use a
             * conditional lock acquisition here to avoid deadlock.  Even
             * though the buffer was not pinned (and therefore surely not
             * locked) when StrategyGetBuffer returned it, someone else could
             * have pinned and exclusive-locked it by the time we get here. If
             * we try to get the lock unconditionally, we'd block waiting for
             * them; if they later block waiting for us, deadlock ensues.
             * (This has been observed to happen when two backends are both
             * trying to split btree index pages, and the second one just
             * happens to be trying to split the page the first one got from
             * StrategyGetBuffer.)
             */
            if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
                                         LW_SHARED))
            {
                /*
                 * If using a nondefault strategy, and writing the buffer
                 * would require a WAL flush, let the strategy decide whether
                 * to go ahead and write/reuse the buffer or to choose another
                 * victim.  We need lock to inspect the page LSN, so this
                 * can't be done inside StrategyGetBuffer.
                 */
                if (strategy != NULL)
                {
                    XLogRecPtr  lsn;

                    /* Read the LSN while holding buffer header lock */
                    buf_state = LockBufHdr(buf);
                    lsn = BufferGetLSN(buf);
                    UnlockBufHdr(buf, buf_state);

                    if (XLogNeedsFlush(lsn) &&
                        StrategyRejectBuffer(strategy, buf))
                    {
                        /* Drop lock/pin and loop around for another buffer */
                        LWLockRelease(BufferDescriptorGetContentLock(buf));
                        UnpinBuffer(buf, true);
                        continue;
                    }
                }

                /* OK, do the I/O */
                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
                                                          smgr->smgr_rnode.node.spcNode,
                                                          smgr->smgr_rnode.node.dbNode,
                                                          smgr->smgr_rnode.node.relNode);

                FlushBuffer(buf, NULL);
                LWLockRelease(BufferDescriptorGetContentLock(buf));

                ScheduleBufferTagForWriteback(&BackendWritebackContext,
                                              &buf->tag);

                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
                                                         smgr->smgr_rnode.node.spcNode,
                                                         smgr->smgr_rnode.node.dbNode,
                                                         smgr->smgr_rnode.node.relNode);
            }
            else
            {
                /*
                 * Someone else has locked the buffer, so give it up and loop
                 * back to get another one.
                 */
                UnpinBuffer(buf, true);
                continue;
            }
        }

【the victim needs no flushing any more, or was flushed above】
【then, if this block's TAG is BM_TAG_VALID, it has to be re-entered into the hash table under the new tag】
        /*
         * To change the association of a valid buffer, we'll need to have
         * exclusive lock on both the old and new mapping partitions.
         */
        if (oldFlags & BM_TAG_VALID)
        {
            /*
             * Need to compute the old tag's hashcode and partition lock ID.
             * XXX is it worth storing the hashcode in BufferDesc so we need
             * not recompute it here?  Probably not.
             */
            oldTag = buf->tag;
            oldHash = BufTableHashCode(&oldTag);
            oldPartitionLock = BufMappingPartitionLock(oldHash);

            /*
             * Must lock the lower-numbered partition first to avoid
             * deadlocks.
             */
            if (oldPartitionLock < newPartitionLock)
            {
                LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            }
            else if (oldPartitionLock > newPartitionLock)
            {
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
                LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
            }
            else
            {
                /* only one partition, only one lock */
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            }
        }
        else
【otherwise the old TAG is invalid; the past can be ignored, locking the new partition is enough】
        {
            /* if it wasn't valid, we need only the new partition */
            LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            /* remember we have no old-partition lock or tag */
            oldPartitionLock = NULL;
            /* this just keeps the compiler quiet about uninit variables */
            oldHash = 0;
        }

        /*
         * Try to make a hashtable entry for the buffer under its new tag.
         * This could fail because while we were writing someone else
         * allocated another buffer for the same block we want to read in.
         * Note that we have not yet removed the hashtable entry for the old
         * tag.
         */
【insert the new TAG into the hash table】
        buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

        if (buf_id >= 0)
        {
            /*
             * Got a collision. Someone has already done what we were about to
             * do. We'll just handle this as if it were found in the buffer
             * pool in the first place.  First, give up the buffer we were
             * planning to use.
             */
            UnpinBuffer(buf, true);

            /* Can give up that buffer's mapping partition lock now */
            if (oldPartitionLock != NULL &&
                oldPartitionLock != newPartitionLock)
                LWLockRelease(oldPartitionLock);

            /* remaining code should match code at top of routine */

            buf = GetBufferDescriptor(buf_id);

            valid = PinBuffer(buf, strategy);

            /* Can release the mapping lock as soon as we've pinned it */
            LWLockRelease(newPartitionLock);

            *foundPtr = TRUE;

            if (!valid)
            {
                /*
                 * We can only get here if (a) someone else is still reading
                 * in the page, or (b) a previous read attempt failed.  We
                 * have to wait for any active read attempt to finish, and
                 * then set up our own read attempt if the page is still not
                 * BM_VALID.  StartBufferIO does it all.
                 */
                if (StartBufferIO(buf, true))
                {
                    /*
                     * If we get here, previous attempts to read the buffer
                     * must have failed ... but we shall bravely try again.
                     */
                    *foundPtr = FALSE;
                }
            }

            return buf;
        }

        /*
         * Need to lock the buffer header too in order to change its tag.
         */
        buf_state = LockBufHdr(buf);

        /*
         * Somebody could have pinned or re-dirtied the buffer while we were
         * doing the I/O and making the new hashtable entry.  If so, we can't
         * recycle this buffer; we must undo everything we've done and start
         * over with a new victim buffer.
         */
        oldFlags = buf_state & BUF_FLAG_MASK;
        if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
            break;

        UnlockBufHdr(buf, buf_state);
        BufTableDelete(&newTag, newHash);
        if (oldPartitionLock != NULL &&
            oldPartitionLock != newPartitionLock)
            LWLockRelease(oldPartitionLock);
        LWLockRelease(newPartitionLock);
        UnpinBuffer(buf, true);
    }

    /*
     * Okay, it's finally safe to rename the buffer.
     *
     * Clearing BM_VALID here is necessary, clearing the dirtybits is just
     * paranoia.  We also reset the usage_count since any recency of use of
     * the old content is no longer relevant.  (The usage_count starts out at
     * 1 so that the buffer can survive one clock-sweep pass.)
     *
     * Make sure BM_PERMANENT is set for buffers that must be written at every
     * checkpoint.  Unlogged buffers only need to be written at shutdown
     * checkpoints, except for their "init" forks, which need to be treated
     * just like permanent relations.
     */
    buf->tag = newTag;
    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
                   BUF_USAGECOUNT_MASK);
    if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
    else
        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

    UnlockBufHdr(buf, buf_state);

【if the old TAG was valid, its entry must be deleted from the hash table】
    if (oldPartitionLock != NULL)
    {
        BufTableDelete(&oldTag, oldHash);
        if (oldPartitionLock != newPartitionLock)
            LWLockRelease(oldPartitionLock);
    }

    LWLockRelease(newPartitionLock);

    /*
     * Buffer contents are currently invalid.  Try to get the io_in_progress
     * lock.  If StartBufferIO returns false, then someone else managed to
     * read it before we did, so there's nothing left for BufferAlloc() to do.
     */
    if (StartBufferIO(buf, true))
        *foundPtr = FALSE;
    else
        *foundPtr = TRUE;

    return buf;
}

PinBuffer

Summary:

  1. Use the buf_id to check in the local cache whether the buffer is already pinned.

  2. If it is already pinned, just do local ref_count++.

  3. If it is not pinned locally, update state in the shared desc (refcount++, usagecount++ up to a max of 5), plus local ref_count++.

  4. Note: after pinning, the page data is not necessarily usable; the return value is (buf_state & BM_VALID) != 0.

static bool
PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
{
    Buffer      b = BufferDescriptorGetBuffer(buf);

Here b is the desc's buf_id plus 1.

The desc's buf_id counts from 0.

#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)

    bool        result;
    PrivateRefCountEntry *ref;

【look up (buffer_id, ref_count) in the two-level cache: the array, then the hash table】
    ref = GetPrivateRefCountEntry(b, true);

    if (ref == NULL)
    {
        uint32      buf_state;
        uint32      old_buf_state;

【not found: reserve an array slot; if all 8 are taken, displace one into the hash table】
        ReservePrivateRefCountEntry();
【fill b into the array slot】
        ref = NewPrivateRefCountEntry(b);

【update buf->state】
        old_buf_state = pg_atomic_read_u32(&buf->state);
        for (;;)
        {
【if the header is locked, spin until it is unlocked and fetch the fresh state】
            if (old_buf_state & BM_LOCKED)
                old_buf_state = WaitBufHdrUnlocked(buf);

            buf_state = old_buf_state;

            /* increase refcount */
            buf_state += BUF_REFCOUNT_ONE;

【strategy == NULL means the normal clock-sweep path; otherwise a ring buffer is used for bulk reads/writes】
            if (strategy == NULL)
            {
                /* Default case: increase usagecount unless already max. */
                if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
                    buf_state += BUF_USAGECOUNT_ONE;
            }
            else
            {
                /*
                 * Ring buffers shouldn't evict others from pool.  Thus we
                 * don't make usagecount more than 1.
                 */
                if (BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
                    buf_state += BUF_USAGECOUNT_ONE;
            }

            if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                               buf_state))
            {
                result = (buf_state & BM_VALID) != 0;
                break;
            }
        }
    }
    else
【already pinned locally: only the local refcount needs ++; to the shared desc, any number of pins by one process counts as one】
    {
        /* If we previously pinned the buffer, it must surely be valid */
        result = true;
    }

    ref->refcount++;
    Assert(ref->refcount > 0);

    ResourceOwnerRememberBuffer(CurrentResourceOwner, b);

    return result;
}

StrategyGetBuffer

/*
 * StrategyGetBuffer
 *
 *  Called by the bufmgr to get the next candidate buffer to use in
 *  BufferAlloc(). The only hard requirement BufferAlloc() has is that
 *  the selected buffer must not currently be pinned by anyone.
 *
 *  strategy is a BufferAccessStrategy object, or NULL for default strategy.
 *
 *  To ensure that no one else can pin the buffer before we do, we must
 *  return the buffer with the buffer header spinlock still held.
 */
BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
    BufferDesc *buf;
    int         bgwprocno;
    int         trycounter;
    uint32      local_buf_state;    /* to avoid repeated (de-)referencing */

    /*
     * If given a strategy object, see whether it can select a buffer. We
     * assume strategy objects don't need buffer_strategy_lock.
     */
【a non-NULL strategy means the ring-buffer path; the strategy struct carries the whole ring】
    if (strategy != NULL)
    {
        buf = GetBufferFromRing(strategy, buf_state);
        if (buf != NULL)
            return buf;
    }

    /*
     * If asked, we need to waken the bgwriter. Since we don't want to rely on
     * a spinlock for this we force a read from shared memory once, and then
     * set the latch based on that value. We need to go through that length
     * because otherwise bgprocno might be reset while/after we check because
     * the compiler might just reread from memory.
     *
     * This can possibly set the latch of the wrong process if the bgwriter
     * dies in the wrong moment. But since PGPROC->procLatch is never
     * deallocated the worst consequence of that is that we set the latch of
     * some arbitrary process.
     */
【StrategyControl is the core clock-sweep data structure introduced above】
    // (gdb) p *StrategyControl
    // $6 = {buffer_strategy_lock = 0 '\000', nextVictimBuffer = {value = 0},
    //       firstFreeBuffer = 324, lastFreeBuffer = 16383, completePasses = 0,
    //       numBufferAllocs = {value = 0}, bgwprocno = 113}
    bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
    if (bgwprocno != -1)
    {
        /* reset bgwprocno first, before setting the latch */
        StrategyControl->bgwprocno = -1;

        /*
         * Not acquiring ProcArrayLock here which is slightly icky. It's
         * actually fine because procLatch isn't ever freed, so we just can
         * potentially set the wrong process' (or no process') latch.
         */
        SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
    }

    /*
     * We count buffer allocation requests so that the bgwriter can estimate
     * the rate of buffer consumption.  Note that buffers recycled by a
     * strategy object are intentionally not counted here.
     */
    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);

    /*
     * First check, without acquiring the lock, whether there's buffers in the
     * freelist. Since we otherwise don't require the spinlock in every
     * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
     * uselessly in most cases. That obviously leaves a race where a buffer is
     * put on the freelist but we don't see the store yet - but that's pretty
     * harmless, it'll just get used during the next buffer acquisition.
     *
     * If there's buffers on the freelist, acquire the spinlock to pop one
     * buffer of the freelist. Then check whether that buffer is usable and
     * repeat if not.
     *
     * Note that the freeNext fields are considered to be protected by the
     * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
     * manipulate them without holding the spinlock.
     */
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        while (true)
        {
            /* Acquire the spinlock to remove element from the freelist */
            SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

【race window: firstFreeBuffer was >= 0 above, but the list may have been drained before we got the lock】
            if (StrategyControl->firstFreeBuffer < 0)
            {
                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
                break;
            }

            buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
            Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

【unlink the first desc from the freelist: 1. advance the head pointer  2. invalidate the popped desc's freeNext】
            /* Unconditionally remove buffer from freelist */
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;

            /*
             * Release the lock so someone else can access the freelist while
             * we check out this buffer.
             */
            SpinLockRelease(&StrategyControl->buffer_strategy_lock);

            /*
             * If the buffer is pinned or has a nonzero usage_count, we cannot
             * use it; discard it and retry.  (This can only happen if VACUUM
             * put a valid buffer in the freelist and then someone else used
             * it before we got to it.  It's probably impossible altogether as
             * of 8.3, but we'd better check anyway.)
             */
【fetch the state and take the header lock (BM_LOCKED)】
            local_buf_state = LockBufHdr(buf);
【this check should hardly ever fail; it amounts to a sanity check】
【the only failure mode: VACUUM put a buffer in, and someone grabbed and used it before us】
            if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
                && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
            {
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
            UnlockBufHdr(buf, local_buf_state);
        }
    }

【reaching here means the freelist had nothing usable: a buffer must be evicted】
    /* Nothing on the freelist, so run the "clock sweep" algorithm */
    trycounter = NBuffers;
    for (;;)
    {
【ClockSweepTick() advances StrategyControl->nextVictimBuffer】
【it is a big chunk of code only because the increment must stay atomic】
【the loop itself is simple: scan all buffers looking for one that is not pinned】
【return one with usage_count == 0; every visited buffer gets usage_count--, so if the first pass finds none, within 5 passes one must appear】
【usage_count is capped at 5, to keep the competition bounded】
        buf = GetBufferDescriptor(ClockSweepTick());

        /*
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; decrement the usage_count (unless pinned) and keep scanning.
         */
        local_buf_state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
        {
            if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
            {
                local_buf_state -= BUF_USAGECOUNT_ONE;
                trycounter = NBuffers;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
        }
        else if (--trycounter == 0)
        {
            /*
             * We've scanned all the buffers without making any state changes,
             * so all the buffers are pinned (or were when we looked at them).
             * We could hope that someone will free one eventually, but it's
             * probably better to fail than to risk getting stuck in an
             * infinite loop.
             */
            UnlockBufHdr(buf, local_buf_state);
            elog(ERROR, "no unpinned buffers available");
        }
        UnlockBufHdr(buf, local_buf_state);
    }
}
