Contents

  • Fault description
  • Troubleshooting
    • 1. Try restarting the Pod
    • 2. Check the Pod events
    • 3. Check the kubelet logs
    • 4. Check the PVC and PV objects
    • 5. Check the disk mount
  • Solution

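Fault description

An internal environment raised a Pod status alert:

[Alerting] Pod status alert
Pods in the cluster have been in an abnormal state for more than 1 minute
1. ti-inf/etcd-1 (Pending): 1.000
Details: http://xx.xx.xx.xx/grafana/d/default/alert-dashboard?tab=alert&viewPanel=19&orgId=1

Checking the abnormal Pods in the k8s cluster showed that they belong to the data components.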

Troubleshooting

1. Try restarting the Pod

~]# kubectl delete pod etcd-1 -nti-inf
The Pod was still in an abnormal state afterwards.

2. Check the Pod events

~]# kubectl describe pod redis-server-2 -nti-inf
Events:
  Type     Reason       Age                     From     Message
  ----     ------       ----                    ----     -------
  Normal   Scheduled    28m                     volcano  Successfully assigned ti-inf/redis-server-2 to x.x.x.x
  Warning  FailedMount  3m17s (x3599 over 28m)  kubelet  MountVolume.SetUp failed for volume "pvc-9d1c0e76-6d56-439d-8070-741d8846d569" : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error
The events show that kubelet fails at the MountVolume.SetUp step, and the error it surfaces is an input/output error on the PVC's volume (the stat of /csi-data-dir/ti-database/pv fails).
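
When a Pod has thousands of repeated events, it can help to pull only the warnings instead of reading the full describe output. The following is a minimal sketch; the function name pod_warnings is made up for illustration, and it assumes kubectl is already configured for the cluster.

```shell
# Print only the Warning events recorded for one Pod, oldest first.
# Usage: pod_warnings <namespace> <pod>
# pod_warnings is a hypothetical helper name, not part of kubectl.
pod_warnings() {
  local ns=$1 pod=$2
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod,type=Warning" \
    --sort-by=.lastTimestamp
}
```

For the case above this would be called as pod_warnings ti-inf redis-server-2.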

3. Check the kubelet logs

[root@VM-2-29-centos prometheus-db]# grep -i error /var/log/messages| tail -n 5
Jun 28 20:14:13 VM-2-29-centos kubelet: E0628 20:14:13.819828  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada podName: nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.319804053 +0800 CST m=+11760883.388055363 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\") pod \"etcd-1\" (UID: \"1c99773c-3845-4141-ac30-1c3d26f1f30a\") : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error"
Jun 28 20:14:13 VM-2-29-centos kubelet: E0628 20:14:13.901519  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada podName:4c5d9bdf-498a-4456-9c6c-e6f7b456e693 nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.401482582 +0800 CST m=+11760883.469733942 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"data\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\") pod \"4c5d9bdf-498a-4456-9c6c-e6f7b456e693\" (UID: \"4c5d9bdf-498a-4456-9c6c-e6f7b456e693\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/4c5d9bdf-498a-4456-9c6c-e6f7b456e693/volumes/kubernetes.io~csi/pvc-668750fa-cc0a-4105-96f3-7fa184db4ada/mount: input/output error"
Jun 28 20:14:14 VM-2-29-centos kubelet: E0628 20:14:14.018249  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569 podName: nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.518217097 +0800 CST m=+11760883.586468437 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"pvc-9d1c0e76-6d56-439d-8070-741d8846d569\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569\") pod \"redis-server-2\" (UID: \"5550e257-2245-4401-bd9a-cf275ff94675\") : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error"
Jun 28 20:14:14 VM-2-29-centos kubelet: E0628 20:14:14.102735  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569 podName:daea4ba4-b97c-46c6-866b-aa7cc29af0a8 nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.602692068 +0800 CST m=+11760883.670943428 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"data\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569\") pod \"daea4ba4-b97c-46c6-866b-aa7cc29af0a8\" (UID: \"daea4ba4-b97c-46c6-866b-aa7cc29af0a8\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/daea4ba4-b97c-46c6-866b-aa7cc29af0a8/volumes/kubernetes.io~csi/pvc-9d1c0e76-6d56-439d-8070-741d8846d569/mount: input/output error"

The logs show that both mounting (MountVolume.SetUp) and unmounting (UnmountVolume.TearDown) fail with input/output errors for the etcd-1 and redis-server-2 volumes. This points to a problem with the underlying disk and explains the flood of errors above.
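
Because kubelet retries every 500ms, the same few volumes account for almost all of these lines. The sketch below reduces such a log to one line per affected volume and Pod. The function name failed_mounts is hypothetical, and the pattern assumes the log lines look like the ones shown above (escaped quotes around the volume and Pod names).

```shell
# Reads kubelet log lines on stdin and prints "<volume> <pod>" once for every
# volume whose MountVolume.SetUp failed with an input/output error.
# failed_mounts is a hypothetical helper name used for illustration.
failed_mounts() {
  grep 'MountVolume.SetUp failed' \
    | grep 'input/output error' \
    | sed -E 's/.*failed for volume \\?"(pvc-[0-9a-f-]+)\\?".* pod \\?"([^"\\]+)\\?".*/\1 \2/' \
    | sort -u
}
```

On the node in question it could be fed with grep kubelet /var/log/messages | failed_mounts.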

4. Check the PVC and PV objects

[root@VM-2-29-centos ~]# kubectl get pvc -nti-inf |grep redis
data-redis-server-0                  Bound    pvc-59fde781-e03e-4b26-b07c-7de93f608395   10Gi       RWO            csi-localpv-tidb   136d
data-redis-server-1                  Bound    pvc-6bf28ec2-40e1-4b52-8d54-b4ab0aa9f67a   10Gi       RWO            csi-localpv-tidb   136d
data-redis-server-2                  Bound    pvc-9d1c0e76-6d56-439d-8070-741d8846d569   10Gi       RWO            csi-localpv-tidb   136d
[root@VM-2-29-centos ~]#
[root@VM-2-29-centos ~]# kubectl get pv |grep redis
pvc-59fde781-e03e-4b26-b07c-7de93f608395   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-0                                                    csi-localpv-tidb            136d
pvc-6bf28ec2-40e1-4b52-8d54-b4ab0aa9f67a   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-1                                                    csi-localpv-tidb            136d
pvc-9d1c0e76-6d56-439d-8070-741d8846d569   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-2                                                    csi-localpv-tidb            136d

Both the PVC and PV objects are Bound and look normal.
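
A Bound status only describes the binding between the API objects; it says nothing about the health of the storage behind them. As a rough sketch, the CSI driver and volume handle recorded on a PV can be printed so they can be matched against the kubelet log lines. The helper name pv_backing is hypothetical, and the fields used assume the PV is a CSI volume, as the kubelet logs above suggest.

```shell
# Print the CSI driver and volume handle recorded on a PV, tab-separated.
# Usage: pv_backing <pv-name>
# pv_backing is a hypothetical helper name; it assumes a CSI-backed PV.
pv_backing() {
  kubectl get pv "$1" \
    -o jsonpath='{.spec.csi.driver}{"\t"}{.spec.csi.volumeHandle}{"\n"}'
}
```

For example: pv_backing pvc-9d1c0e76-6d56-439d-8070-741d8846d569.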

5. Check the disk mount

dmesg (short for "display message") prints the kernel ring buffer; check it for kernel-level errors:

[Tue Jun 28 20:22:47 2022] buffer_io_error: 6 callbacks suppressed
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971392, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971393, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971394, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971395, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971396, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971397, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971398, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971399, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop4, logical block 20971392, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop4, logical block 20971393, async page read

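The disk backing these PVCs is /dev/vdc, on which an LVM logical volume was created. loop0 and loop4 are presumably the loop devices that the loopdevice CSI driver set up on top of it, so the logical volume itself is evidently faulty.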

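Listing the directory from a shell on the node confirms that it can no longer be accessed:
~]# ls /data/ti-database
ls: cannot access /data/ti-database: Input/output error

Explanation: this is a buffer I/O error; the asynchronous page read of logical block 20971393 failed.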
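
To see at a glance which devices are affected, the dmesg output can be summarized per device. The following is a small sketch; io_errors_by_device is a hypothetical name, and it assumes the "Buffer I/O error on dev <name>," message format shown above.

```shell
# Reads dmesg output on stdin and prints "<device> <count>" for every device
# that logged at least one "Buffer I/O error" line.
# io_errors_by_device is a hypothetical helper name used for illustration.
io_errors_by_device() {
  sed -n 's/.*Buffer I\/O error on dev \([^,]*\),.*/\1/p' \
    | sort \
    | uniq -c \
    | awk '{print $2, $1}'
}
```

It would typically be run as dmesg -T | io_errors_by_device. The loop devices it reports can then be mapped to their backing files with losetup -l.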

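Solution

The platform's data components (etcd/redis/es) each run 3 replicas and can tolerate a single-node failure. This logical volume was also planned from the start for the data components only, so no other services are affected. Rebuilding the LVM logical volume is therefore sufficient.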

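Detailed procedure:
1. Back up the mysql/etcd/es data
2. Unmount the logical volume
3. Remove the logical volume (LV) with lvremove
4. Remove the volume group (VG) with vgremove
5. Remove the physical volume with pvremove
   Once these steps are done, lvdisplay, vgdisplay, and pvdisplay no longer show any information for the removed LVM objects.
6. Delete the PVs and PVCs on this node
7. Recreate the LVM logical volume and mount it
8. Create the PV and PVC objects and bind them to the Pods
9. Verify the Pod status
10. Check the health of the redis and etcd clusters and verify data consistency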
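
The LVM part of the procedure (steps 2 to 5 and step 7) might look roughly like the sketch below. Only /dev/vdc and /data/ti-database come from the investigation above. The volume group name, logical volume name, and filesystem type are assumptions, as is the idea that /data/ti-database is the mount point of the logical volume. The script defaults to a dry run that only prints the commands, because every one of them is destructive.

```shell
# Sketch of steps 2-5 and 7. Kubernetes steps (6, 8-10) are not covered.
DEVICE=${DEVICE:-/dev/vdc}                   # from the article
MOUNTPOINT=${MOUNTPOINT:-/data/ti-database}  # assumed to be the LV mount point
VG=${VG:-vg_data}                            # hypothetical volume group name
LV=${LV:-lv_data}                            # hypothetical logical volume name
FSTYPE=${FSTYPE:-xfs}                        # assumption; use the original type
DRY_RUN=${DRY_RUN:-1}                        # 1 = print commands, do not run

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

rebuild_lvm() {
  run umount "$MOUNTPOINT"                  # step 2
  run lvremove -y "$VG/$LV"                 # step 3
  run vgremove -y "$VG"                     # step 4
  run pvremove -y "$DEVICE"                 # step 5
  run pvcreate "$DEVICE"                    # step 7: rebuild from the disk up
  run vgcreate "$VG" "$DEVICE"
  run lvcreate -l 100%FREE -n "$LV" "$VG"
  run mkfs -t "$FSTYPE" "/dev/$VG/$LV"
  run mount "/dev/$VG/$LV" "$MOUNTPOINT"
}
```

Calling rebuild_lvm with the defaults prints the nine commands for review. Setting DRY_RUN=0 executes them, which should only be done after the backups in step 1 have been verified.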
