Notes on the Redis RDB format

Background of the RDB format

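In Redis, RDB is a compact binary format for persisting data: the in-memory dataset held by the Redis server is serialized, with optional compression, and written to a file on disk. This article walks through the layout of that format and then looks at how Redis produces it.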

The RDB format

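The detailed specification of the RDB format can be found in the official documentation. Its most important part is the layout below: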

----------------------------# RDB is a binary format. There are no new lines or spaces in the file.
52 45 44 49 53              # Magic String "REDIS"
30 30 30 37                 # 4 digit ASCII RDB Version Number. In this case, version = "0007" = 7
----------------------------
FE 00                       # FE = code that indicates database selector. db number = 00
----------------------------# Key-Value pair starts
FD $unsigned int            # FD indicates "expiry time in seconds". After that, expiry time is read as a 4 byte unsigned int
$value-type                 # 1 byte flag indicating the type of value - set, map, sorted set etc.
$string-encoded-key         # The key, encoded as a redis string
$encoded-value              # The value. Encoding depends on $value-type
----------------------------
FC $unsigned long           # FC indicates "expiry time in ms". After that, expiry time is read as a 8 byte unsigned long
$value-type                 # 1 byte flag indicating the type of value - set, map, sorted set etc.
$string-encoded-key         # The key, encoded as a redis string
$encoded-value              # The value. Encoding depends on $value-type
----------------------------
$value-type                 # This key value pair doesn't have an expiry. $value_type guaranteed != to FD, FC, FE and FF
$string-encoded-key
$encoded-value
----------------------------
FE $length-encoding         # Previous db ends, next db starts. Database number read using length encoding.
----------------------------
...                         # Key value pairs for this database, additional databases
FF                          ## End of RDB file indicator
8 byte checksum             ## CRC 64 checksum of the entire file.

From this layout, the overall structure of an RDB file can be read off as follows:

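First comes the magic string "REDIS", and the next four bytes are the RDB version number.
If the next byte is FE, it is followed by the database number.
Next, parse every key-value pair in the database. A pair takes one of three forms. Without an expiry, it starts with the value-type byte, followed by the encoded key and then the encoded value. With an expiry in seconds, the FD opcode is followed by a 4-byte timestamp, then the value-type, the key, and the value. With an expiry in milliseconds, the FC opcode is followed by an 8-byte timestamp, then the value-type, the key, and the value.
If there are more databases, repeat from the second step.
Finally, FF marks the end of the RDB file, and the last eight bytes are the CRC64 checksum.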
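
The first of these steps, checking the magic string and reading the version, can be sketched in C roughly as follows. This is only an illustration written for this article: sketch_parse_header is a hypothetical name and not a Redis function, and it assumes the first nine bytes of the file are already in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Redis): validates the 9-byte RDB header
 * "REDIS" + 4 ASCII digits and returns the version number,
 * or -1 if the header is too short or malformed. */
int sketch_parse_header(const unsigned char *buf, size_t len) {
    assert(buf != NULL);
    if (len < 9) return -1;                        /* header is 5 + 4 bytes */
    if (memcmp(buf, "REDIS", 5) != 0) return -1;   /* magic string mismatch */

    int version = 0;
    for (size_t i = 5; i < 9; i++) {
        if (buf[i] < '0' || buf[i] > '9') return -1;   /* not an ASCII digit */
        version = version * 10 + (buf[i] - '0');
    }
    return version;
}
```

For the header bytes shown in the layout above (52 45 44 49 53 30 30 30 37) this returns 7.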

That is roughly what the documentation describes. Now let's dig into how Redis actually writes an RDB file.

How Redis writes an RDB file

Start with rdb.c in the Redis source tree:

int rdbSaveRio(rio *rdb, int *error, int flags, rdbSaveInfo *rsi) {
    dictIterator *di = NULL;
    dictEntry *de;
    char magic[10];
    int j;
    uint64_t cksum;
    size_t processed = 0;

    if (server.rdb_checksum)                                        // is rdb_checksum enabled? (checksums were added in RDB version 5)
        rdb->update_cksum = rioGenericUpdateChecksum;
    snprintf(magic,sizeof(magic),"REDIS%04d",RDB_VERSION);          // build the magic string: "REDIS" plus the version number
    if (rdbWriteRaw(rdb,magic,9) == -1) goto werr;                  // write it to the RDB file
    if (rdbSaveInfoAuxFields(rdb,flags,rsi) == -1) goto werr;       // write the aux fields; the format document above does not describe them

    for (j = 0; j < server.dbnum; j++) {                            // write the contents of every database
        redisDb *db = server.db+j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;                             // skip empty databases
        di = dictGetSafeIterator(d);                                // get an iterator

        /* Write the SELECT DB opcode */
        if (rdbSaveType(rdb,RDB_OPCODE_SELECTDB) == -1) goto werr;  // write the database selector opcode
        if (rdbSaveLen(rdb,j) == -1) goto werr;                     // followed by the database number

        /* Write the RESIZE DB opcode. We trim the size to UINT32_MAX, which
         * is currently the largest type we are able to represent in RDB sizes.
         * However this does not limit the actual size of the DB to load since
         * these sizes are just hints to resize the hash tables. */
        uint64_t db_size, expires_size;
        db_size = dictSize(db->dict);
        expires_size = dictSize(db->expires);
        if (rdbSaveType(rdb,RDB_OPCODE_RESIZEDB) == -1) goto werr;  // write the RESIZEDB opcode
        if (rdbSaveLen(rdb,db_size) == -1) goto werr;               // number of keys in the database
        if (rdbSaveLen(rdb,expires_size) == -1) goto werr;          // number of keys with an expiry

        /* Iterate this DB writing every entry */
        while((de = dictNext(di)) != NULL) {                        // iterate over every entry of this database
            sds keystr = dictGetKey(de);                            // the key as a string
            robj key, *o = dictGetVal(de);                          // the value
            long long expire;

            initStaticStringObject(key,keystr);
            expire = getExpire(db,&key);                            // expiry time; -1 means none, and then no expiry is written
            if (rdbSaveKeyValuePair(rdb,&key,o,expire) == -1) goto werr;    // write the key-value pair, including its expiry if any

            /* When this RDB is produced as part of an AOF rewrite, move
             * accumulated diff from parent to child while rewriting in
             * order to have a smaller final write. */
            if (flags & RDB_SAVE_AOF_PREAMBLE &&
                rdb->processed_bytes > processed+AOF_READ_DIFF_INTERVAL_BYTES)  // during an AOF rewrite, periodically pull the accumulated diff from the parent
            {
                processed = rdb->processed_bytes;
                aofReadDiffFromParent();
            }
        }
        dictReleaseIterator(di);                                    // release the iterator
        di = NULL; /* So that we don't release it again on error. */
    }

    /* If we are storing the replication information on disk, persist
     * the script cache as well: on successful PSYNC after a restart, we need
     * to be able to process any EVALSHA inside the replication backlog the
     * master will send us. */
    if (rsi && dictSize(server.lua_scripts)) {
        di = dictGetIterator(server.lua_scripts);
        while((de = dictNext(di)) != NULL) {
            robj *body = dictGetVal(de);
            if (rdbSaveAuxField(rdb,"lua",3,body->ptr,sdslen(body->ptr)) == -1)    // write each cached Lua script as an aux field
                goto werr;
        }
        dictReleaseIterator(di);
        di = NULL; /* So that we don't release it again on error. */
    }

    /* EOF opcode */
    if (rdbSaveType(rdb,RDB_OPCODE_EOF) == -1) goto werr;           // write the EOF opcode

    /* CRC64 checksum. It will be zero if checksum computation is disabled, the
     * loading code skips the check in this case. */
    cksum = rdb->cksum;                                             // fetch the running checksum
    memrev64ifbe(&cksum);
    if (rioWrite(rdb,&cksum,8) == 0) goto werr;                     // write the 8-byte checksum; it is verified when the file is loaded
    return C_OK;

werr:
    if (error) *error = errno;
    if (di) dictReleaseIterator(di);
    return C_ERR;
}

The flow of rdbSaveRio largely matches the documented layout. Let's first take a closer look at rdbSaveLen:

int rdbSaveLen(rio *rdb, uint64_t len) {                    // save a length
    unsigned char buf[2];
    size_t nwritten;

    if (len < (1<<6)) {                                     // len is below 64
        /* Save a 6 bit len */
        buf[0] = (len&0xFF)|(RDB_6BITLEN<<6);               // store the length in 6 bits
        if (rdbWriteRaw(rdb,buf,1) == -1) return -1;
        nwritten = 1;
    } else if (len < (1<<14)) {                             // len is at least 64 and below 16384
        /* Save a 14 bit len */
        buf[0] = ((len>>8)&0xFF)|(RDB_14BITLEN<<6);         // store the length in 14 bits
        buf[1] = len&0xFF;
        if (rdbWriteRaw(rdb,buf,2) == -1) return -1;
        nwritten = 2;
    } else if (len <= UINT32_MAX) {                         // len is at least 16384 and fits in 32 bits
        /* Save a 32 bit len */
        buf[0] = RDB_32BITLEN;
        if (rdbWriteRaw(rdb,buf,1) == -1) return -1;
        uint32_t len32 = htonl(len);
        if (rdbWriteRaw(rdb,&len32,4) == -1) return -1;
        nwritten = 1+4;
    } else {
        /* Save a 64 bit len */
        buf[0] = RDB_64BITLEN;                              // store the length in 64 bits
        if (rdbWriteRaw(rdb,buf,1) == -1) return -1;
        len = htonu64(len);
        if (rdbWriteRaw(rdb,&len,8) == -1) return -1;
        nwritten = 1+8;
    }
    return nwritten;
}
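
Reading a length back is the mirror image of rdbSaveLen. The sketch below is a hypothetical reader-side helper written for this article, not code taken from Redis. It assumes the values that, to my knowledge, rdb.h assigns to the constants used above (RDB_6BITLEN = 0, RDB_14BITLEN = 1, RDB_32BITLEN = 0x80, RDB_64BITLEN = 0x81), and that multi-byte lengths are big-endian, as the htonl and htonu64 calls imply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader-side helper, not a Redis function.
 * Decodes one length written by rdbSaveLen from buf (avail bytes readable).
 * Returns the number of bytes consumed, or -1 on truncated or unknown input.
 * When the top two bits are 11, the byte introduces a specially encoded
 * string instead of a length: *is_encoded is set to 1 and *out receives
 * the low 6 bits (the encoding type). */
int sketch_decode_len(const unsigned char *buf, size_t avail,
                      uint64_t *out, int *is_encoded) {
    assert(buf != NULL && out != NULL && is_encoded != NULL);
    if (avail < 1) return -1;
    *is_encoded = 0;

    switch (buf[0] >> 6) {
    case 0:                                 /* 00: 6-bit length */
        *out = buf[0] & 0x3F;
        return 1;
    case 1:                                 /* 01: 14-bit length */
        if (avail < 2) return -1;
        *out = ((uint64_t)(buf[0] & 0x3F) << 8) | buf[1];
        return 2;
    case 3:                                 /* 11: special string encoding */
        *is_encoded = 1;
        *out = buf[0] & 0x3F;
        return 1;
    default: {                              /* 10: 32-bit or 64-bit length */
        size_t nbytes;
        if (buf[0] == 0x80) nbytes = 4;         /* assumed RDB_32BITLEN */
        else if (buf[0] == 0x81) nbytes = 8;    /* assumed RDB_64BITLEN */
        else return -1;
        if (avail < 1 + nbytes) return -1;
        uint64_t v = 0;
        for (size_t i = 0; i < nbytes; i++)     /* big-endian payload */
            v = (v << 8) | buf[1 + i];
        *out = v;
        return (int)(1 + nbytes);
    }
    }
}
```

For example, the two bytes 41 2C decode to the length 300: the top bits 01 select the 14-bit form, and the remaining bits give (0x01 << 8) | 0x2C.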

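As rdbSaveLen shows, the number of bytes used to store a length depends on its value, which keeps the RDB file small. Next, let's look closely at rdbSaveRawString, which writes a string: the length first, followed by the string's content.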

ssize_t rdbSaveRawString(rio *rdb, unsigned char *s, size_t len) {
    int enclen;
    ssize_t n, nwritten = 0;

    /* Try integer encoding */
    if (len <= 11) {                                                    // at most 11 bytes: try to encode the string as an integer
        unsigned char buf[5];
        if ((enclen = rdbTryIntegerEncoding((char*)s,len,buf)) > 0) {   // integer encoding succeeded
            if (rdbWriteRaw(rdb,buf,enclen) == -1) return -1;           // write the encoded integer
            return enclen;
        }
    }

    /* Try LZF compression - under 20 bytes it's unable to compress even
     * aaaaaaaaaaaaaaaaaa so skip it */
    if (server.rdb_compression && len > 20) {                           // longer than 20 bytes and compression is enabled
        n = rdbSaveLzfStringObject(rdb,s,len);                          // compress with LZF
        if (n == -1) return -1;
        if (n > 0) return n;
        /* Return value of 0 means data can't be compressed, save the old way */
    }

    /* Store verbatim */
    if ((n = rdbSaveLen(rdb,len)) == -1) return -1;                     // otherwise write the length ...
    nwritten += n;
    if (len > 0) {
        if (rdbWriteRaw(rdb,s,len) == -1) return -1;                    // ... followed by the raw bytes
        nwritten += len;
    }
    return nwritten;
}

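Judging from this function, a string is saved in one of three ways. If its length is at most 11 bytes, Redis first tries to encode it as an integer. If its length is greater than 20 bytes and compression is enabled, Redis tries LZF compression. In every other case, including when an attempt fails, the length is written followed by the raw bytes.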
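
To make the integer path more concrete, here is a rough sketch of what such an encoder can look like. It is modelled on my reading of rdbTryIntegerEncoding and is not a copy of it; sketch_try_integer_encoding is a hypothetical name. The header bytes 0xC0, 0xC1 and 0xC2 (top two bits 11, then the integer width) and the little-endian payload are assumptions about the on-disk encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not the Redis implementation.
 * Tries to encode the string s (len bytes, not NUL-terminated) as an
 * 8, 16 or 32 bit integer. enc must have room for 5 bytes.
 * Returns the number of bytes written to enc, or 0 if s cannot be encoded. */
int sketch_try_integer_encoding(const char *s, size_t len, unsigned char *enc) {
    char buf[12], back[32];
    char *end;

    assert(s != NULL && enc != NULL);
    if (len == 0 || len > 11) return 0;
    memcpy(buf, s, len);
    buf[len] = '\0';

    long long value = strtoll(buf, &end, 10);
    if (end != buf + len) return 0;                 /* trailing garbage */

    /* Only accept strings that print back identically, so that inputs such
     * as "012" or " 12" keep their exact bytes and are stored verbatim. */
    int n = snprintf(back, sizeof back, "%lld", value);
    if (n != (int)len || memcmp(back, s, len) != 0) return 0;

    uint64_t u = (uint64_t)value;                   /* two's complement bits */
    if (value >= -128 && value <= 127) {
        enc[0] = 0xC0;                              /* assumed: 8-bit integer */
        enc[1] = u & 0xFF;
        return 2;
    } else if (value >= -32768 && value <= 32767) {
        enc[0] = 0xC1;                              /* assumed: 16-bit integer */
        enc[1] = u & 0xFF;
        enc[2] = (u >> 8) & 0xFF;
        return 3;
    } else if (value >= -2147483648LL && value <= 2147483647LL) {
        enc[0] = 0xC2;                              /* assumed: 32-bit integer */
        enc[1] = u & 0xFF;
        enc[2] = (u >> 8) & 0xFF;
        enc[3] = (u >> 16) & 0xFF;
        enc[4] = (u >> 24) & 0xFF;
        return 5;
    }
    return 0;                                       /* does not fit in 32 bits */
}
```

Under these assumptions the string "1000" becomes the three bytes C1 E8 03, while "hello" and "012" return 0 and are stored the ordinary way.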

During an RDB save, key-value pairs are written by rdbSaveKeyValuePair:

int rdbSaveKeyValuePair(rio *rdb, robj *key, robj *val, long long expiretime) {
    int savelru = server.maxmemory_policy & MAXMEMORY_FLAG_LRU;         // is the maxmemory policy LRU-based?
    int savelfu = server.maxmemory_policy & MAXMEMORY_FLAG_LFU;         // is the maxmemory policy LFU-based?

    /* Save the expire time */
    if (expiretime != -1) {                                             // the key has an expiry
        if (rdbSaveType(rdb,RDB_OPCODE_EXPIRETIME_MS) == -1) return -1; // write the expiry opcode (milliseconds)
        if (rdbSaveMillisecondTime(rdb,expiretime) == -1) return -1;    // write the expiry time
    }

    /* Save the LRU info. */
    if (savelru) {
        uint64_t idletime = estimateObjectIdleTime(val);
        idletime /= 1000; /* Using seconds is enough and requires less space.*/
        if (rdbSaveType(rdb,RDB_OPCODE_IDLE) == -1) return -1;
        if (rdbSaveLen(rdb,idletime) == -1) return -1;
    }

    /* Save the LFU info. */
    if (savelfu) {
        uint8_t buf[1];
        buf[0] = LFUDecrAndReturn(val);
        /* We can encode this in exactly two bytes: the opcode and an 8
         * bit counter, since the frequency is logarithmic with a 0-255 range.
         * Note that we do not store the halving time because to reset it
         * a single time when loading does not affect the frequency much. */
        if (rdbSaveType(rdb,RDB_OPCODE_FREQ) == -1) return -1;
        if (rdbWriteRaw(rdb,buf,1) == -1) return -1;
    }

    /* Save type, key, value */
    if (rdbSaveObjectType(rdb,val) == -1) return -1;                    // write the value type
    if (rdbSaveStringObject(rdb,key) == -1) return -1;                  // write the key
    if (rdbSaveObject(rdb,val) == -1) return -1;                        // write the value itself
    return 1;
}

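The most important call in rdbSaveKeyValuePair is rdbSaveObject, which serializes each of Redis's data structures. Interested readers can read through that function on their own.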
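
Going back to the expiry handling in rdbSaveKeyValuePair: the writer shown here only ever emits the millisecond opcode, but the documented layout also defines FD with a 4-byte expiry in seconds, so a reader should accept both. The sketch below is a hypothetical reader-side helper, not Redis code. It assumes both timestamps are stored little-endian, treats every byte from 0xF7 upwards as an opcode rather than a value type, and deliberately does not handle the IDLE and FREQ opcodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    int has_expiry;            /* 1 if an FD or FC prefix was present */
    uint64_t expire_ms;        /* expiry in milliseconds, valid if has_expiry */
    unsigned char value_type;  /* the $value-type byte */
} sketch_kv_prefix;

/* Reads n bytes as a little-endian unsigned integer (assumed byte order). */
static uint64_t sketch_read_le(const unsigned char *p, size_t n) {
    uint64_t v = 0;
    for (size_t i = n; i > 0; i--) v = (v << 8) | p[i - 1];
    return v;
}

/* Parses the optional expiry and the value-type byte at the start of a
 * key-value pair. Returns the number of bytes consumed, 0 if the byte in
 * value-type position is another opcode that this sketch does not handle
 * (for example FE, FF, IDLE or FREQ), or -1 on truncated input. */
int sketch_read_kv_prefix(const unsigned char *buf, size_t avail,
                          sketch_kv_prefix *out) {
    size_t pos = 0;

    assert(buf != NULL && out != NULL);
    out->has_expiry = 0;
    out->expire_ms = 0;
    out->value_type = 0;
    if (avail < 1) return -1;

    if (buf[0] == 0xFD) {                       /* expiry in seconds, 4 bytes */
        if (avail < 5) return -1;
        out->has_expiry = 1;
        out->expire_ms = sketch_read_le(buf + 1, 4) * 1000;
        pos = 5;
    } else if (buf[0] == 0xFC) {                /* expiry in ms, 8 bytes */
        if (avail < 9) return -1;
        out->has_expiry = 1;
        out->expire_ms = sketch_read_le(buf + 1, 8);
        pos = 9;
    }

    if (pos >= avail) return -1;                /* value-type byte missing */
    if (buf[pos] >= 0xF7) return 0;             /* assumed opcode range */
    out->value_type = buf[pos];
    return (int)(pos + 1);
}
```

After this prefix, a reader would continue with the string-encoded key and then the value, whose encoding depends on value_type.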

Summary

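Among Python tools for parsing RDB files, rdbtools is currently one of the most widely used. As the format above shows, parsing RDB is fairly tedious and not as easy to implement as the Redis protocol. The core parsing code of rdbtools lives in rdbtools/parser.py. Reading it side by side with the Redis code that writes RDB files gives a good overall picture of how the format is generated and parsed.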