Data recovery after a K8s node failure
Below is the current state of my cluster and the layout of the MySQL cluster:
mysql-0 is the only node that accepts writes for the MySQL cluster; mysql-1 and mysql-2 serve reads.
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Running 0 8m30s
mysql-1 2/2 Running 0 7d19h
mysql-2 2/2 Running 0 7d19h
Current pod placement across the cluster:
[root@master1 mysql]# kubectl get po -n ztw -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mysql-0 2/2 Running 0 29m 192.168.180.61 master2 <none> <none>
mysql-1 2/2 Running 0 7d19h 192.168.166.164 node1 <none> <none>
mysql-2 2/2 Running 0 7d19h 192.168.104.63 node2 <none> <none>
Node status:
[root@master1 mysql]# kubectl get no
NAME STATUS ROLES AGE VERSION
master1 Ready master 16d v1.18.6
master2 Ready master 16d v1.18.6
master3 Ready master 16d v1.18.6
node1 Ready worker 16d v1.18.6
node2 Ready worker 16d v1.18.6
Simulated test:
1. The kubelet on mysql-0's node goes down, and MySQL can no longer accept writes.
Stop the kubelet on the cluster's master2 node:
[root@master1 mysql]# kubectl get no
NAME STATUS ROLES AGE VERSION
master1 Ready master 17d v1.18.6
master2 NotReady master 17d v1.18.6
master3 Ready master 17d v1.18.6
node1 Ready worker 17d v1.18.6
node2 Ready worker 17d v1.18.6
Check the MySQL cluster (after the kubelet goes down it takes about 5 minutes — roughly the controller's default pod eviction timeout — before the real state of the MySQL cluster becomes visible):
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Terminating 0 59m
mysql-1 2/2 Running 0 7d20h
mysql-2 2/2 Running 0 7d20h
This simulated test covers the case where the kubelet cannot be recovered and the MySQL cluster has to be restored afterwards.
From the output above we know mysql-0 runs on master2. First inspect MySQL's PVs:
[root@master1 mysql]# kubectl get pv |grep ztw/data-mysql
pvc-1b2be124-db4a-4220-9a75-bcd9d7ef26fd 10Gi RWO Delete Bound ztw/data-mysql-2 ceph-rbd 7d20h
pvc-2f3248b3-a163-4f6d-964c-744e46cd899a 10Gi RWO Delete Bound ztw/data-mysql-1 ceph-rbd 7d20h
pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856 10Gi RWO Delete Bound ztw/data-mysql-0 ceph-rbd 7d20h
Get the CSI image backing ztw/data-mysql-0:
[root@master1 mysql]# kubectl get pv pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856 -o yaml | grep imageName
        f:imageName: {}
      imageName: csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
The CSI image in use is: csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
Check where this image is mapped:
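As an aside, grepping the full YAML also matches the managedFields entry (hence the `f:imageName: {}` noise above). A jsonpath query is cleaner, and when only saved YAML is at hand the field can be pulled out with awk — a minimal sketch, using an illustrative trimmed PV spec rather than the real object:

```shell
# With cluster access, jsonpath avoids the managedFields noise entirely
# (for a ceph-csi PV the field sits under spec.csi.volumeAttributes):
#   kubectl get pv pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856 \
#     -o jsonpath='{.spec.csi.volumeAttributes.imageName}'
#
# Offline, the same field can be pulled from saved YAML. The sample spec
# below is illustrative, trimmed to the relevant fields.
pv_yaml='spec:
  csi:
    driver: rook-ceph.rbd.csi.ceph.com
    volumeAttributes:
      imageName: csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
      pool: replicapool'
image=$(printf '%s\n' "$pv_yaml" | awk '$1 == "imageName:" { print $2 }')
echo "$image"   # csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
```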
[root@master1 mysql]# kubectl exec -it -n rook-ceph rook-ceph-tools-6b4889fdfd-86dp5 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
[root@rook-ceph-tools-6b4889fdfd-86dp5 /]# rbd showmapped
id pool namespace image snap device
0 replicapool csi-vol-b0547ce2-9061-11eb-9dfa-c2bce0e658d7 - /dev/rbd0
1 replicapool csi-vol-4abead3a-978d-11eb-b4aa-ee49455e3e6d - /dev/rbd1
[root@rook-ceph-tools-6b4889fdfd-86dp5 /]# rbd status replicapool/csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
Watchers:
	watcher=172.16.25.185:0/2783421160 client.860678 cookie=18446462598732840969
This shows the RBD image's watcher, i.e. the server it is currently mapped on.
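The watcher address can be pulled out mechanically — a small sketch, fed with the `rbd status` output copied above:

```shell
# The watcher line of `rbd status` encodes the client address as
# watcher=<ip>:<nonce>/<id>; the IP identifies the host that still has
# the image mapped. Sample copied from the session above.
status='Watchers:
	watcher=172.16.25.185:0/2783421160 client.860678 cookie=18446462598732840969'
ip=$(printf '%s\n' "$status" | sed -n 's/.*watcher=\([0-9.]*\):.*/\1/p')
echo "$ip"   # 172.16.25.185
```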
Having located the server holding the PV mapping (master2), on master2 run:
[root@master2 ~]# docker ps|grep k8s_csi-rbdplugin_csi
6ee7aa256fbd 3d66848c3c6f "/usr/local/bin/ceph…" 45 minutes ago Up 45 minutes k8s_csi-rbdplugin_csi-rbdplugin-provisioner-b4d4bc45d-z45dd_rook-ceph_a7328719-bfea-42b1-9715-2378be38a514_0
6f8307339942 3d66848c3c6f "/usr/local/bin/ceph…" 45 minutes ago Up 45 minutes k8s_csi-rbdplugin_csi-rbdplugin-mll4k_rook-ceph_8a34b870-a840-47eb-9324-3e062fc0bdb9_0
Pick the k8s_csi-rbdplugin_csi-rbdplugin-mll4k_rook-ceph container (the per-node plugin, not the provisioner) to work in:
[root@master2 ~]# docker exec -it 6f8307339942 bash
[root@master2 /]# rbd showmapped|grep csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
3 replicapool csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d - /dev/rbd3
[root@master2 /]# rbd unmap -o force /dev/rbd3
[root@master2 ~]# umount /dev/rbd3
The commands above show which rbd device the CSI volume is mapped to — /dev/rbd3 here. `rbd unmap -o force` force-detaches the mapping, and `umount /dev/rbd3` then clears the mount (normally one would umount first and then unmap, but the forced unmap worked here).
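The device lookup can be scripted too — a sketch, assuming the default `rbd showmapped` column layout (the namespace column is empty here, so the image name lands in awk's field 3):

```shell
# Given `rbd showmapped` output and an image name, print the device to
# detach. Sample output copied from the session above; note the empty
# namespace column, which puts the image name in field 3.
showmapped='id pool        namespace image                                        snap device
3  replicapool           csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d -    /dev/rbd3'
img="csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d"
dev=$(printf '%s\n' "$showmapped" | awk -v img="$img" '$3 == img { print $NF }')
echo "$dev"   # /dev/rbd3
# On the node one would then run (not executed here):
#   umount "$dev"
#   rbd unmap -o force "$dev"
```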
I assumed that with the PV unmapped the pod would drift to another node on its own, but time proved it would not:
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Terminating 0 174m
mysql-1 2/2 Running 0 7d22h
mysql-2 2/2 Running 0 7d22h
So we resorted to force and migrated the pod by deleting its stale record straight out of etcd:
[root@master1 mysql]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key \
--endpoints=172.10.25.184:2379,172.10.25.185:2379,172.10.25.186:2379 \
del /registry/pods/ztw/mysql-0
1
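The `1` printed by `etcdctl del` is the number of keys deleted. The key path follows Kubernetes' etcd storage convention, `/registry/<resource>/<namespace>/<name>`; a tiny helper makes that explicit. (In most cases `kubectl delete pod mysql-0 -n ztw --force --grace-period=0` is worth trying before touching etcd directly.)

```shell
# Build the etcd registry key for a pod, following the
# /registry/pods/<namespace>/<name> convention used by Kubernetes.
pod_key() {
  printf '/registry/pods/%s/%s\n' "$1" "$2"
}
pod_key ztw mysql-0   # /registry/pods/ztw/mysql-0
```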
Watching again afterwards, the pod is being recreated:
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 0/2 Init:0/2 0 74s
mysql-1 2/2 Running 0 7d22h
mysql-2 2/2 Running 0 7d22h
While it recreates, describe the pod:
[root@master1 mysql]# kubectl describe po -n ztw mysql-0
.....
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               <unknown>          default-scheduler        Successfully assigned ztw/mysql-0 to master3
  Warning  FailedAttachVolume      6m54s              attachdetach-controller  Multi-Attach error for volume "pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount             4m51s              kubelet, master3         Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[conf config-map default-token-d9qmq data]: timed out waiting for the condition
  Warning  FailedMount             2m34s              kubelet, master3         Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[default-token-d9qmq data conf config-map]: timed out waiting for the condition
  Normal   SuccessfulAttachVolume  54s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856"
  Normal   Created                 37s                kubelet, master3         Created container init-mysql
  Normal   Pulled                  37s                kubelet, master3         Container image "hub.youedata.com/rds/mysql:5.7" already present on machine
  Normal   Started                 37s                kubelet, master3         Started container init-mysql
  Normal   Pulled                  37s                kubelet, master3         Container image "hub.youedata.com/base/xtrabackup:1.0" already present on machine
  Normal   Started                 36s                kubelet, master3         Started container clone-mysql
  Normal   Created                 36s                kubelet, master3         Created container clone-mysql
  Normal   Pulled                  35s                kubelet, master3         Container image "hub.youedata.com/base/xtrabackup:1.0" already present on machine
  Normal   Created                 35s                kubelet, master3         Created container xtrabackup
  Normal   Started                 35s                kubelet, master3         Started container xtrabackup
  Normal   Pulled                  15s (x3 over 36s)  kubelet, master3         Container image "hub.youedata.com/rds/mysql:5.7" already present on machine
  Normal   Created                 15s (x3 over 35s)  kubelet, master3         Created container mysql
  Normal   Started                 15s (x3 over 35s)  kubelet, master3         Started container mysql
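The Multi-Attach warning above is expected: the PV is ReadWriteOnce, so it cannot attach to master3 until the stale attachment to master2 is released. Which node holds an attachment can be read off the cluster's VolumeAttachment objects — a sketch filtering `kubectl get volumeattachment` output for our PV (the sample line below is illustrative, with a made-up attachment name, not captured from this cluster):

```shell
# Filter a VolumeAttachment listing for a given PV and print the node it
# is attached to. Sample output is illustrative only.
pv="pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856"
va_list='NAME                   ATTACHER                     PV                                         NODE      ATTACHED   AGE
csi-0123456789abcdef   rook-ceph.rbd.csi.ceph.com   pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856   master2   true       7d20h'
node=$(printf '%s\n' "$va_list" | awk -v pv="$pv" '$3 == pv { print $4 }')
echo "$node"   # master2
```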
At this point the MySQL cluster has recovered:
[root@master1 ~]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Running 0 15m
mysql-1 2/2 Running 0 15m
mysql-2 2/2 Running 0 14m
This chapter covered recovering a StatefulSet MySQL cluster when the kubelet dies and cannot be brought back.
A follow-up will cover recovering StatefulSet services when docker itself dies.