RGW Bucket Shard Optimization
1. Background: the bucket index
The bucket index is a key data structure in RGW that stores the index of every object in a bucket. By default, the entire index of a bucket lives in a single shard object (shard count 0), stored as OMAP keys in the OSD's LevelDB. As the number of objects in the bucket grows, that single shard object keeps growing, and once it becomes too large it triggers a variety of problems.
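To make this concrete, you can look at a bucket's index objects directly: each index shard is a RADOS object named .dir.<bucket_id> (with a .N suffix per shard when sharding is enabled) in the index pool, and its entries are OMAP keys. A minimal sketch, assuming the default index pool name and a hypothetical bucket called mybucket:
# Find the bucket's index id
$ radosgw-admin bucket stats --bucket=mybucket | grep '"id"'
# List that bucket's index shard objects in the index pool
$ rados -p default.rgw.buckets.index ls | grep '<bucket_id>'
# Count the OMAP keys (one per stored object) in a single shard object
$ rados -p default.rgw.buckets.index listomapkeys .dir.<bucket_id> | wc -l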
2. Problems and failures
2.1 Symptoms
- Flapping OSDs when RGW buckets have millions of objects
- Possible causes:
  - The first issue here is that when RGW buckets have millions of objects, their bucket index shard RADOS objects become very large, with a high number of OMAP keys stored in LevelDB. Operations like deep-scrub and bucket index listing then take a long time to complete, and this triggers OSDs to flap. If sharding is not used the issue becomes worse, because a single RADOS index object then holds all the OMAP keys.
In other words: RGW index data is stored as OMAP in LevelDB on the node hosting the OSD. Once a single bucket holds objects on the order of millions, operations such as deep-scrub and bucket listing consume enormous disk resources and the corresponding OSDs start misbehaving. If the bucket index is not sharded (sharding spreads a single bucket's index across multiple RADOS objects, and therefore across multiple OSDs), trouble is almost inevitable once the data volume grows.
  - The second issue is that a good amount of DELETEs leaves loads of stale data in OMAP, which triggers LevelDB compaction all the time. Compaction is single-threaded and non-optimal for this kind of workload, so the OSD op threads hit their suicide timeout because the store is always compacting, and hence the OSDs start flapping.
In other words: when RGW handles a large volume of DELETE requests, the underlying LevelDB keeps running compactions (which are very expensive in terms of disk performance), and since compaction in LevelDB is single-threaded, it is easy to hit the osd_op_thread suicide timeout and have the OSD kill itself.
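To confirm that constant compaction is what is hurting you, LevelDB keeps its own activity log inside the OSD's omap directory (filestore layout); watching it is a cheap check. A sketch, assuming the default paths:
# Compaction activity shows up in LevelDB's own LOG file
$ tail -f /var/lib/ceph/osd/ceph-<osd-id>/current/omap/LOG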
Common problems include:
- When the index pool is scrubbed or deep-scrubbed, an oversized shard object severely taxes the underlying storage devices and I/O requests time out.
- When a deep-scrub runs for too long, requests get blocked, large numbers of HTTP requests time out with 50x errors, and the availability of the whole RGW service suffers.
- When a failed disk or OSD needs its data recovered, recovering a huge shard object can exhaust a storage node's resources, and OSD response timeouts can even snowball into a cluster-wide avalanche.
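To find out ahead of time which index shards are dangerously large, you can rank the index objects by OMAP key count. A minimal sketch, assuming the default index pool name (it reads every index object, so run it off-peak):
# Print "keycount object" pairs, largest first
$ for obj in $(rados -p default.rgw.buckets.index ls); do
      echo "$(rados -p default.rgw.buckets.index listomapkeys $obj | wc -l) $obj"
  done | sort -rn | head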
2.2 Root cause analysis
When the OMAP on an OSD hosting a bucket index grows too large, any failure that crashes that OSD process turns into live firefighting: the OSD has to be brought back as fast as possible.
First check the size of the OSD's OMAP. If it is too large, the OSD will burn a lot of time and resources loading the LevelDB data at startup, and may fail to start at all (suicide timeout).
OSDs in this state also need a very large amount of memory to start, so make sure enough is reserved (around 40 GB of physical RAM in our case; fall back to swap if that is not available).
![](https://upload-images.jianshu.io/upload_images/2099201-1d0af9180a9fc1e9.png)
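To check how big each OSD's OMAP actually is before restarting anything (filestore layout, default paths):
$ du -sh /var/lib/ceph/osd/ceph-*/current/omap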
3. Temporary mitigations
3.1 Disable scrub and deep-scrub to stabilize the cluster
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
3.2 Raise timeout settings to reduce the chance of OSD suicide
osd_op_thread_timeout = 90                    # default is 15
osd_op_thread_suicide_timeout = 2000          # default is 150
If the filestore op threads are hitting their timeouts:
filestore_op_thread_timeout = 180             # default is 60
filestore_op_thread_suicide_timeout = 2000    # default is 180
The same can be done for the recovery threads:
osd_recovery_thread_timeout = 120             # default is 30
osd_recovery_thread_suicide_timeout = 2000
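These settings go into ceph.conf and take effect on OSD restart. To trial them on a running cluster first, injectargs can apply most OSD options live (a sketch; some options only take full effect after a restart):
$ ceph tell osd.* injectargs '--osd_op_thread_suicide_timeout 2000'
$ ceph tell osd.* injectargs '--osd_op_thread_timeout 90'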
3.3 Compact the OMAP manually
If the OSD can be taken offline, you can run a compaction on it. Ceph 0.94.6 or later is recommended; versions below that have a bug: https://github.com/ceph/ceph/pull/7645/files
  - A third temporary step can be taken if OSDs have very large OMAP directories (verify with: du -sh /var/lib/ceph/osd/ceph-$id/current/omap): run a manual LevelDB compaction for those OSDs, using one of:
    - ceph tell osd.$id compact, or
    - ceph daemon osd.$id compact, or
    - adding leveldb_compact_on_mount = true to the [osd.$id] or [osd] section and restarting the OSD.
    This makes sure LevelDB is compacted before the OSD is brought back up/in, which really helps.
# Set the noout flag
$ ceph osd set noout
# Stop the OSD
$ systemctl stop ceph-osd@<osd-id>
# Add the following under the matching [osd.id] section in ceph.conf
leveldb_compact_on_mount = true
# Start the OSD
$ systemctl start ceph-osd@<osd-id>
# Watch the result with ceph -s, ideally while also tailing the OSD log (tail -f); wait until all PGs are active+clean before continuing
$ ceph -s
# Confirm the OMAP size once the compaction has finished
$ du -sh /var/lib/ceph/osd/ceph-$id/current/omap
# Remove the temporary leveldb_compact_on_mount setting from ceph.conf
# Unset noout (situational; on production clusters it is often better to keep noout set)
$ ceph osd unset noout
4. Permanent solutions
4.1 Plan bucket shards ahead of time
The index pool must be on SSDs. This is the premise of all the tuning in this article; without the hardware to back it, none of what follows will help.
Set a sensible shard count for each bucket:
- More shards are not always better. Too many shards cause operations like bucket listing to consume a lot of underlying storage I/O, so some requests take too long.
- The shard count must also respect your OSD failure domains and replica count. For example, if the index pool has size 2 and you have 2 racks with 24 OSDs in total, ideally the 2 replicas of every shard are spread across the 2 racks: with 8 shards there are 8*2=16 shard objects to place, and those 16 should be distributed evenly between the two racks. By the same logic, setting more than 24 shards is clearly unsuitable here.
- Control the average size of each bucket index shard. The current recommendation is roughly 100,000-150,000 object entries per shard; a bucket that outgrows this needs its own reshard (a high-risk operation, use with caution). For example, if a single bucket is expected to hold at most 1,000,000 objects, then 1,000,000/8 = 125,000 entries per shard, so 8 shards is a reasonable setting. Each OMAP key entry in a shard object takes roughly 200 bytes, so 150000*200/1024/1024 ≈ 28.61 MB; in other words, keep a single shard object below about 28 MB. (The sizing arithmetic is shown in the sketch after this list.)
- At the application level, cap the number of objects per bucket so that each shard object averages 100,000-150,000 entries.
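The sizing rule above is easy to check with shell arithmetic. A minimal sketch using the numbers from this example (1,000,000 expected objects, 8 shards, ~200 bytes per OMAP entry):
# Expected entries per shard: within the 100k-150k guideline, so 8 shards is fine
$ echo '1000000 / 8' | bc
125000
# Worst-case shard object size in MB at the 150k-entry ceiling
$ echo 'scale=2; 150000 * 200 / 1024 / 1024' | bc
28.61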
4.1.1 Configuring bucket index sharding
To enable and configure bucket index sharding on all new buckets, use (redhat-bucket_sharding):
- the rgw_override_bucket_index_max_shards setting for simple configurations,
- the bucket_index_max_shards setting for federated configurations.
Simple configurations:
# 1. Set the parameter in the configuration file. Note that the maximum number of shards is 7877.
[global]
rgw_override_bucket_index_max_shards = 10
# 2. Restart the RGW service for the change to take effect
$ systemctl restart ceph-radosgw.target
# 3. Check the number of bucket index shard objects
$ rados -p default.rgw.buckets.index ls | wc -l
1000
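On newer Ceph releases you can also ask RGW directly how full each bucket's index shards are (hedged: the availability and output format of this subcommand vary by version):
$ radosgw-admin bucket limit check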
Federated configurations
In federated configurations, each zone can have a different index_pool setting to manage failover. To configure a consistent shard count for zones in one region, set the bucket_index_max_shards setting in the configuration for that region. To do so:
# 1. Extract the region configuration to the region.json file:
$ radosgw-admin region get > region.json
# 2. In the region.json file, set the bucket_index_max_shards setting for each named zone.
# 3. Reset the region:
$ radosgw-admin region set < region.json
# 4. Update the region map:
$ radosgw-admin regionmap update --name <name>
# 5. Replace <name> with the name of the Ceph Object Gateway user, for example:
$ radosgw-admin regionmap update --name client.rgw.ceph-client
File upload demo (Python 2, boto):
# -*- coding: utf-8 -*-
#yum install python-boto
import boto
import boto.s3.connection
#pip install filechunkio
from filechunkio import FileChunkIO
import math
import threading
import os
import Queue
class Chunk(object):
    num = 0
    offset = 0
    length = 0

    def __init__(self, n, o, l):
        self.num = n
        self.offset = o
        self.length = l

class CONNECTION(object):
    def __init__(self, access_key, secret_key, ip, port, is_secure=False, chrunksize=8 << 20):
        # The chunk size must be at least 8M, otherwise multipart uploads fail
        self.conn = boto.connect_s3(
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
            host=ip,
            port=port,
            is_secure=is_secure,
            calling_format=boto.s3.connection.OrdinaryCallingFormat())
        self.chrunksize = chrunksize
        self.port = port

    # List every bucket and its contents
    def list_all(self):
        all_buckets = self.conn.get_all_buckets()
        for bucket in all_buckets:
            print u'Bucket name: %s' % (bucket.name)
            for key in bucket.list():
                print ' ' * 5, "%-20s%-20s%-20s%-40s%-20s" % (key.mode, key.owner.id, key.size, key.last_modified.split('.')[0], key.name)

    # List the contents of one bucket
    def list_single(self, bucket_name):
        try:
            single_bucket = self.conn.get_bucket(bucket_name)
        except Exception:
            print 'bucket %s is not exist' % bucket_name
            return
        print u'Bucket name: %s' % (single_bucket.name)
        for key in single_bucket.list():
            print ' ' * 5, "%-20s%-20s%-20s%-40s%-20s" % (key.mode, key.owner.id, key.size, key.last_modified.split('.')[0], key.name)

    # Simple download for small files (size <= 8M)
    def dowload_file(self, filepath, key_name, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        key = bucket.get_key(key_name)
        if not os.path.exists(os.path.dirname(filepath)):
            print 'Filepath %s does not exist, please create it and try again' % (filepath)
            return
        if os.path.exists(filepath):
            while True:
                d_tag = raw_input('File %s already exists, sure you want to cover (Y/N)?' % (key_name)).strip()
                if d_tag not in ['Y', 'N'] or len(d_tag) == 0:
                    continue
                elif d_tag == 'Y':
                    os.remove(filepath)
                    break
                elif d_tag == 'N':
                    return
        os.mknod(filepath)
        try:
            key.get_contents_to_filename(filepath)
        except Exception:
            pass

    # Simple upload for small files (size <= 8M)
    def upload_file(self, filepath, key_name, bucket_name):
        try:
            bucket = self.conn.get_bucket(bucket_name)
        except Exception:
            print 'bucket %s is not exist' % bucket_name
            tag = raw_input('Do you want to create the bucket %s: (Y/N)?' % bucket_name).strip()
            while tag not in ['Y', 'N']:
                tag = raw_input('Please input (Y/N)').strip()
            if tag == 'N':
                return
            elif tag == 'Y':
                self.conn.create_bucket(bucket_name)
                bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name in all_key_name_list:
            while True:
                f_tag = raw_input(u'File already exists, sure you want to cover (Y/N)?: ').strip()
                if f_tag not in ['Y', 'N'] or len(f_tag) == 0:
                    continue
                elif f_tag == 'Y':
                    break
                elif f_tag == 'N':
                    return
        key = bucket.new_key(key_name)
        if not os.path.exists(filepath):
            print 'File %s does not exist, please check the upload path and try again' % (key_name)
            return
        try:
            f = open(filepath, 'rb')
            data = f.read()
            key.set_contents_from_string(data)
        except Exception:
            pass

    def delete_file(self, key_name, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        key = bucket.get_key(key_name)
        try:
            bucket.delete_key(key.name)
        except Exception:
            pass

    def delete_bucket(self, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        bucket = self.conn.get_bucket(bucket_name)
        try:
            self.conn.delete_bucket(bucket.name)
        except Exception:
            pass

    # Build the queue of chunks to transfer
    def init_queue(self, filesize, chunksize):  # 8<<20 == 8*2**20
        chunkcnt = int(math.ceil(filesize * 1.0 / chunksize))
        q = Queue.Queue(maxsize=chunkcnt)
        for i in range(0, chunkcnt):
            offset = chunksize * i
            length = min(chunksize, filesize - offset)
            c = Chunk(i + 1, offset, length)
            q.put(c)
        return q

    # Worker: upload parts of a multipart upload, one chunk at a time
    def upload_trunk(self, filepath, mp, q, id):
        while not q.empty():
            chunk = q.get()
            fp = FileChunkIO(filepath, 'r', offset=chunk.offset, bytes=chunk.length)
            mp.upload_part_from_file(fp, part_num=chunk.num)
            fp.close()
            q.task_done()

    # Get the file size -> create the S3 multipart upload -> build the chunk queue -> upload the chunks in threads
    def upload_file_multipart(self, filepath, key_name, bucket_name, threadcnt=8):
        filesize = os.stat(filepath).st_size
        try:
            bucket = self.conn.get_bucket(bucket_name)
        except Exception:
            print 'bucket %s is not exist' % bucket_name
            tag = raw_input('Do you want to create the bucket %s: (Y/N)?' % bucket_name).strip()
            while tag not in ['Y', 'N']:
                tag = raw_input('Please input (Y/N)').strip()
            if tag == 'N':
                return
            elif tag == 'Y':
                self.conn.create_bucket(bucket_name)
                bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name in all_key_name_list:
            while True:
                f_tag = raw_input(u'File already exists, sure you want to cover (Y/N)?: ').strip()
                if f_tag not in ['Y', 'N'] or len(f_tag) == 0:
                    continue
                elif f_tag == 'Y':
                    break
                elif f_tag == 'N':
                    return
        mp = bucket.initiate_multipart_upload(key_name)
        q = self.init_queue(filesize, self.chrunksize)
        for i in range(0, threadcnt):
            t = threading.Thread(target=self.upload_trunk, args=(filepath, mp, q, i))
            t.setDaemon(True)
            t.start()
        q.join()
        mp.complete_upload()

    # Worker: download one chunk via an HTTP Range request and write it at its offset
    def download_chrunk(self, filepath, key_name, bucket_name, q, id):
        while not q.empty():
            chrunk = q.get()
            offset = chrunk.offset
            length = chrunk.length
            bucket = self.conn.get_bucket(bucket_name)
            # Range is inclusive, hence offset+length-1
            resp = bucket.connection.make_request('GET', bucket_name, key_name,
                                                  headers={'Range': "bytes=%d-%d" % (offset, offset + length - 1)})
            data = resp.read(length)
            fp = FileChunkIO(filepath, 'r+', offset=chrunk.offset, bytes=chrunk.length)
            fp.write(data)
            fp.close()
            q.task_done()

    def download_file_multipart(self, filepath, key_name, bucket_name, threadcnt=8):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        key = bucket.get_key(key_name)
        if not os.path.exists(os.path.dirname(filepath)):
            print 'Filepath %s does not exist, please create it and try again' % (filepath)
            return
        if os.path.exists(filepath):
            while True:
                d_tag = raw_input('File %s already exists, sure you want to cover (Y/N)?' % (key_name)).strip()
                if d_tag not in ['Y', 'N'] or len(d_tag) == 0:
                    continue
                elif d_tag == 'Y':
                    os.remove(filepath)
                    break
                elif d_tag == 'N':
                    return
        os.mknod(filepath)
        filesize = key.size
        q = self.init_queue(filesize, self.chrunksize)
        for i in range(0, threadcnt):
            t = threading.Thread(target=self.download_chrunk, args=(filepath, key_name, bucket_name, q, i))
            t.setDaemon(True)
            t.start()
        q.join()

    # Make the object public-read and print a download URL for it
    def generate_object_download_urls(self, key_name, bucket_name, valid_time=0):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        key = bucket.get_key(key_name)
        try:
            key.set_canned_acl('public-read')
            download_url = key.generate_url(valid_time, query_auth=False, force_http=True)
            if self.port != 80:
                # Splice the non-default port back into the URL
                x1 = download_url.split('/')[0:3]
                x2 = download_url.split('/')[3:]
                s1 = u'/'.join(x1)
                s2 = u'/'.join(x2)
                s3 = ':%s/' % (str(self.port))
                download_url = s1 + s3 + s2
            print download_url
        except Exception:
            pass

if __name__ == '__main__':
    # Conventions:
    # 1: filepath is the absolute local path (upload source or download destination)
    # 2: bucket_name is the bucket (directory/index) name in the object store
    # 3: key_name is the object (file) name in the object store
    access_key = "FYT71CYU3UQKVMC8YYVY"
    secret_key = "rVEASbWAytjVLv1G8Ta8060lY3yrcdPTsEL0rfwr"
    ip = '127.0.0.1'
    port = 7480
    conn = CONNECTION(access_key, secret_key, ip, port)

    # List all buckets and their contents
    # conn.list_all()

    # Simple upload, for files <= 8M
    # conn.upload_file('/etc/passwd', 'passwd', 'test_bucket01')
    conn.upload_file('/tmp/test.log', 'test1', 'test_bucket12')

    # List the contents of a single bucket
    conn.list_single('test_bucket12')

    # Simple download, for files <= 8M
    # conn.dowload_file('/lhf_test/test01', 'passwd', 'test_bucket01')
    # conn.list_single('test_bucket01')

    # Delete a file
    # conn.delete_file('passwd', 'test_bucket01')
    # conn.list_single('test_bucket01')

    # Delete a bucket
    # conn.delete_bucket('test_bucket01')
    # conn.list_all()

    # Multipart (multithreaded) upload, for files > 8M; the chunk size is adjustable but must not drop below 8M, or the upload fails with a "part too small" error
    # conn.upload_file_multipart('/etc/passwd', 'passwd_multi_upload', 'test_bucket01')
    # conn.list_single('test_bucket01')

    # Multipart (multithreaded) download, for files > 8M; same chunk size rule as above
    # conn.download_file_multipart('/lhf_test/passwd_multi_dowload', 'passwd_multi_upload', 'test_bucket01')

    # Generate a download URL
    # conn.generate_object_download_urls('passwd_multi_upload', 'test_bucket01')
    # conn.list_all()
4.2 Reshard an existing bucket
To reshard the index of an existing bucket (redhat-bucket_sharding):
# Before doing anything below, make sure all client operations against this bucket have stopped; then back up the bucket index:
$ radosgw-admin bi list --bucket=<bucket_name> > <bucket_name>.list.backup
# If needed, the index can be restored from that backup:
$ radosgw-admin bi put --bucket=<bucket_name> < <bucket_name>.list.backup
# Look up the bucket's index id
$ radosgw-admin bucket stats --bucket=bucket-maillist
{"bucket": "bucket-maillist","pool": "default.rgw.buckets.data","index_pool": "default.rgw.buckets.index","id": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1", #注意这个id"marker": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1","owner": "user","ver": "0#1,1#1","master_ver": "0#0,1#0","mtime": "2017-08-23 13:42:59.007081","max_marker": "0#,1#","usage": {},"bucket_quota": {"enabled": false,"max_size_kb": -1,"max_objects": -1}
}#Reshard对应bucket的index操作如下:
#使用命令将"bucket-maillist"的shard调整为4,注意命令会输出osd和new两个bucket的instance id$ radosgw-admin bucket reshard --bucket="bucket-maillist" --num-shards=4
*** NOTICE: operation will not remove old bucket index objects ***
*** these will need to be removed manually ***
old bucket instance id: 0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1
new bucket instance id: 0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1
total entries: 3
# Then remove the old bucket index with:
$ radosgw-admin bi purge --bucket="bucket-maillist" --bucket-id=0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1
# Check the final result
$ radosgw-admin bucket stats --bucket=bucket-maillist
{"bucket": "bucket-maillist","pool": "default.rgw.buckets.data","index_pool": "default.rgw.buckets.index","id": "0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1", #id已经变更"marker": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1","owner": "user","ver": "0#2,1#1,2#1,3#2","master_ver": "0#0,1#0,2#0,3#0","mtime": "2017-08-23 14:02:19.961205","max_marker": "0#,1#,2#,3#","usage": {"rgw.main": {"size_kb": 50,"size_kb_actual": 60,"num_objects": 3}},"bucket_quota": {"enabled": false,"max_size_kb": -1,"max_objects": -1}
}
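As a final check, you can confirm that the index shard objects now belong to the new instance id and that the old instance's objects are gone after the bi purge (a sketch using the ids from this example):
# Expect four shard objects (.dir.<new_id>.0 through .3)
$ rados -p default.rgw.buckets.index ls | grep 0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1
# Expect no output for the old instance id
$ rados -p default.rgw.buckets.index ls | grep 0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1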