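Shortly after 9 a.m. the business side reported that nobody could log in to the database, so I immediately checked its state. (The database is 12.1.0.2.180417, a 4-node RAC running in PDB mode.)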

Wait events in the CDB and in one of its PDBs, pdb1:

         1 library cache lock                       WAITING                    674
         3 library cache lock                       WAITING                    670
         2 library cache lock                       WAITING                    656
         4 library cache lock                       WAITING                    652
         1 log file switch (checkpoint incomplete)  WAITING                     70
         1 buffer busy waits                        WAITING                     37
         4 gc buffer busy acquire                   WAITING                     26
         4 gc current request                       WAITING                     21
         2 JS kgl get object wait                   WAITING                     12
         2 gc buffer busy acquire                   WAITING                     11
         3 gc current request                       WAITING                      9
         3 JS kgl get object wait                   WAITING                      9
         3 gc buffer busy acquire                   WAITING                      8
         1 enq: SQ - contention                     WAITING                      7
         1 gc buffer busy release                   WAITING                      5
         1 JS kgl get object wait                   WAITING                      3
         2 gc current request                       WAITING                      1
         2 log switch/archive                       WAITING                      1
         1 latch: redo writing                      WAITED SHORT TIME            1
         4 enq: TX - row lock contention            WAITING                      1
         4 buffer busy waits                        WAITING                      1
         2 gc current grant 2-way                   WAITED SHORT TIME            1
         2 SQL*Net message to client                WAITED SHORT TIME            1
         1 write complete waits                     WAITING                      1
         2 enq: TM - contention                     WAITING                      1
         4 gc current grant 2-way                   WAITED SHORT TIME            1

Wait events in pdb2:

         1 log file switch (checkpoint incomplete)  WAITING                     42
         4 gc current request                       WAITING                     16
         4 gc buffer busy acquire                   WAITING                     13
         1 buffer busy waits                        WAITING                     10
         3 gc current request                       WAITING                      3
         2 gc buffer busy acquire                   WAITING                      1
         1 latch: redo writing                      WAITED SHORT TIME            1
         4 enq: TX - row lock contention            WAITING                      1
         4 buffer busy waits                        WAITING                      1
         3 enq: TM - contention                     WAITING                      1
         2 enq: TM - contention                     WAITING                      1
         1 control file sequential read             WAITING                      1

Both the CDB and pdb1 show a very high number of library cache lock waits. pdb2 shows none; there, the dominant wait is log file switch (checkpoint incomplete).
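The post does not show the query behind these summaries. A minimal sketch of one way to get a comparable breakdown, assuming the columns are instance, event, wait state and session count; `con_id` and the alias `session_cnt` are additions of mine so that the CDB and each PDB can be told apart when the query is run from CDB$ROOT:

```sql
-- Sketch only: count non-idle sessions per instance, container and wait event.
-- Assumes a 12c RAC and a user allowed to query GV$SESSION.
SELECT inst_id,
       con_id,
       event,
       state,
       COUNT(*) AS session_cnt
FROM   gv$session
WHERE  wait_class <> 'Idle'
GROUP  BY inst_id, con_id, event, state
ORDER  BY session_cnt DESC;
```

Run inside a PDB, the same query only sees that PDB's sessions, which would be consistent with the separate pdb1 and pdb2 listings above.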

Next, look at the final blocking sessions in the CDB to trace the waits back to their source:

   INST_ID EVENT                                    FINAL_BLOCKING_INSTANCE FINAL_BLOCKING_SESSION   COUNT(*)
---------- ---------------------------------------- ----------------------- ---------------------- ----------
         2 enq: TM - contention                                           1                   2129          1
         4 enq: TX - row lock contention                                  1                   2129          1
         4 enq: TX - row lock contention                                  1                   1901          1
         1 write complete waits                                           1                   1901          1
         1 control file sequential read                                                                     1
         2 gc current grant 2-way                                                                           1
         1 buffer busy waits                                              1                   1977          1
         4 buffer busy waits                                              1                   2129          1
         1 enq: WF - contention                                           1                   2129          1
         2 log switch/archive                                                                               1
         4 enq: WF - contention                                           1                   2129          1
         4 gc current grant 2-way                                                                           1
         1 write complete waits                                           1                   1977          1
         1 latch: redo writing                                                                              1
         2 gc current request                                             1                   2129          2
         3 gc buffer busy acquire                                         1                   2129          9
         1 gc buffer busy release                                         1                   2129         10
         3 gc current request                                             1                   2129         10
         2 gc buffer busy acquire                                         1                   2129         12
         4 gc current request                                             1                   2129         21
         4 gc buffer busy acquire                                         1                   2129         26
         4 library cache lock                                                                              30
         3 library cache lock                                                                              31
         2 library cache lock                                                                              31
         1 buffer busy waits                                              1                   2129         38
         1 enq: SQ - contention                                           1                   2129         47
         1 library cache lock                                                                              53
         1 log file switch (checkpoint incomplete)                        1                   2129         70
         2 library cache lock                                             1                   2129        662
         4 library cache lock                                             1                   2129        668
         3 library cache lock                                             1                   2129        678
         1 library cache lock                                             1                   2129        709

On all four nodes, the final blocking session behind the library cache lock waits is session 2129 on node 1.
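The query that produced this listing is not shown. A sketch of how such a summary can be built from `GV$SESSION`, whose `FINAL_BLOCKING_INSTANCE` and `FINAL_BLOCKING_SESSION` columns identify the session at the head of each blocking chain; the alias `waiters` and the idle-wait filter are my assumptions, not taken from the post:

```sql
-- Sketch only: group waiting sessions by the session at the head of their blocking chain.
-- Rows with empty blocker columns are sessions for which no final blocker is reported.
SELECT inst_id,
       event,
       final_blocking_instance,
       final_blocking_session,
       COUNT(*) AS waiters
FROM   gv$session
WHERE  wait_class <> 'Idle'
GROUP  BY inst_id, event, final_blocking_instance, final_blocking_session
ORDER  BY waiters;
```

Sorting in ascending order puts the largest groups at the bottom of the SQL*Plus output, which is where the four library cache lock rows appear in the listing above.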

Check what session 2129 on node 1 is doing:

SQL> select INST_ID,USERNAME,MACHINE,EVENT,BLOCKING_SESSION,program from gv$session where  sid=2129;

   INST_ID USERNAME             MACHINE    EVENT                                    BLOCKING_SESSION PROGRAM
---------- -------------------- ---------- ---------------------------------------- ---------------- ------------------------------
         1                      ybnode01   control file sequential read                              oracle@ybnode01 (LGWR)
         4                      ybnode04   rdbms ipc message                                         oracle@ybnode04 (LGWR)
         3                      ybnode03   rdbms ipc message                                         oracle@ybnode03 (LGWR)
         2                      ybnode02   rdbms ipc message                                         oracle@ybnode02 (LGWR)

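Session 2129 on node 1 is LGWR.

Given the 70 or so checkpoint incomplete waits (on node 1), the next step is to analyze the redo switch frequency and the LGWR trace. (In 12c the redo logs are shared by the CDB and all of its PDBs.)

Alert log: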

Mon Jul 16 09:20:47 2018
Thread 1 advanced to log sequence 13188 (LGWR switch)
  Current log# 3 seq# 13188 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_3.263.954587031
  Current log# 3 seq# 13188 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_3.264.954587033
Mon Jul 16 09:20:48 2018
Archived Log entry 29643 added for thread 1 sequence 13187 ID 0x2ac86e19 dest 1:
Mon Jul 16 09:21:05 2018
Thread 1 cannot allocate new log, sequence 13189
Checkpoint not complete
  Current log# 3 seq# 13188 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_3.263.954587031
  Current log# 3 seq# 13188 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_3.264.954587033
Mon Jul 16 09:21:08 2018
Thread 1 advanced to log sequence 13189 (LGWR switch)
  Current log# 4 seq# 13189 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_4.265.954587033
  Current log# 4 seq# 13189 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_4.266.954587035
Mon Jul 16 09:21:10 2018
Archived Log entry 29647 added for thread 1 sequence 13188 ID 0x2ac86e19 dest 1:
Mon Jul 16 09:21:56 2018
Thread 1 cannot allocate new log, sequence 13190
Checkpoint not complete
  Current log# 4 seq# 13189 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_4.265.954587033
  Current log# 4 seq# 13189 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_4.266.954587035
Mon Jul 16 09:21:59 2018
Thread 1 advanced to log sequence 13190 (LGWR switch)
  Current log# 5 seq# 13190 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_5.267.954587035
  Current log# 5 seq# 13190 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_5.268.954587037
Mon Jul 16 09:22:00 2018
Archived Log entry 29648 added for thread 1 sequence 13189 ID 0x2ac86e19 dest 1:
Mon Jul 16 09:22:50 2018
Thread 1 cannot allocate new log, sequence 13191
Checkpoint not complete
  Current log# 5 seq# 13190 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_5.267.954587035
  Current log# 5 seq# 13190 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_5.268.954587037
Mon Jul 16 09:22:53 2018
Thread 1 advanced to log sequence 13191 (LGWR switch)
  Current log# 1 seq# 13191 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_1.259.954587027
  Current log# 1 seq# 13191 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_1.260.954587027
Mon Jul 16 09:22:54 2018
Archived Log entry 29649 added for thread 1 sequence 13190 ID 0x2ac86e19 dest 1:
Mon Jul 16 09:23:35 2018
Thread 1 cannot allocate new log, sequence 13192
Checkpoint not complete
  Current log# 1 seq# 13191 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_1.259.954587027
  Current log# 1 seq# 13191 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_1.260.954587027
Mon Jul 16 09:23:38 2018
Thread 1 advanced to log sequence 13192 (LGWR switch)
  Current log# 2 seq# 13192 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_2.261.954587029
  Current log# 2 seq# 13192 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_2.262.954587029
Mon Jul 16 09:23:40 2018
Archived Log entry 29653 added for thread 1 sequence 13191 ID 0x2ac86e19 dest 1:
Mon Jul 16 09:24:05 2018
Thread 1 cannot allocate new log, sequence 13193
Checkpoint not complete
  Current log# 2 seq# 13192 mem# 0: +DG_DATA/YNODEDB/ONLINELOG/group_2.261.954587029
  Current log# 2 seq# 13192 mem# 1: +DG_DATA/YNODEDB/ONLINELOG/group_2.262.954587029

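Between 09:20:47 and 09:24:05, in under four minutes, node 1 cycled through all five log groups. After that it never switched logs again until the instance was restarted (node 1 was restarted at 10:34, and the database returned to normal afterwards).

This points to two problems:

1. Between 09:20:47 and 09:24:05 the logs switched far too often, accompanied by repeated Checkpoint not complete messages. The official recommendation is roughly one switch every 15 minutes.

2. After that the logs stopped switching altogether, which suggests a problem with I/O or with the LGWR process. (When a user logs in, the database has to record login information, which generates redo. If redo cannot switch, logins fail as well, which matches the original symptom of users being unable to log in.)

1. On the first problem: Checkpoint not complete means that no redo group was in INACTIVE state at that moment. A group becomes INACTIVE only after the checkpoint covering it has completed, so when a log switch is needed and no INACTIVE group can be found, sessions wait on log file switch (checkpoint incomplete). This is usually caused by poorly sized redo or by poor I/O performance. The first step here is to increase the size or the number of the log groups.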

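The current configuration is 5 log groups of 1 GB each, with 2 members per group.

After the change: 7 log groups of 4 GB each, with 1 member per group.

Reference commands: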

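-- Reference commands only: adapt the group numbers and the file path to your environment.
-- Add a new, larger redo log group.
ALTER DATABASE ADD LOGFILE GROUP 7 '/data/oradata/bomc6db/redo07.log' SIZE 4G;
-- Drop an old group (it must be INACTIVE).
ALTER DATABASE DROP LOGFILE GROUP 13;
-- Force a log switch and a checkpoint to move old groups towards INACTIVE.
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;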

Only redo groups whose status has changed to INACTIVE can be dropped. See my earlier article: https://blog.csdn.net/qq_40687433/article/details/79491954
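Before dropping a group it helps to confirm its status. A minimal sketch against `V$LOG`; the `size_mb` alias is mine, and in a RAC database the view lists the groups of every redo thread:

```sql
-- Sketch only: list every redo group with its thread, size and status.
-- A group can be dropped only when STATUS is INACTIVE (or UNUSED).
SELECT thread#,
       group#,
       sequence#,
       bytes / 1024 / 1024 AS size_mb,
       members,
       archived,
       status
FROM   v$log
ORDER  BY thread#, group#;
```

If the group to be dropped still shows CURRENT or ACTIVE, the switch and checkpoint commands above are the usual way to move it on before retrying.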

Query the log switch frequency:

SET LINE 150 PAGES 9999
COL "00" FOR A3
COL "01" FOR A3
COL "02" FOR A3
COL "03" FOR A3
COL "04" FOR A3
COL "05" FOR A3
COL "06" FOR A3
COL "07" FOR A3
COL "08" FOR A3
COL "09" FOR A3
COL "10" FOR A3
COL "11" FOR A3
COL "12" FOR A3
COL "13" FOR A3
COL "14" FOR A3
COL "15" FOR A3
COL "16" FOR A3
COL "17" FOR A3
COL "18" FOR A3
COL "19" FOR A3
COL "20" FOR A3
COL "21" FOR A3
COL "22" FOR A3
COL "23" FOR A3
COL DAY FOR A15
SELECT DAY,
       TO_CHAR(SUM(DECODE(H, '00', T, 0))) AS "00",
       TO_CHAR(SUM(DECODE(H, '01', T, 0))) AS "01",
       TO_CHAR(SUM(DECODE(H, '02', T, 0))) AS "02",
       TO_CHAR(SUM(DECODE(H, '03', T, 0))) AS "03",
       TO_CHAR(SUM(DECODE(H, '04', T, 0))) AS "04",
       TO_CHAR(SUM(DECODE(H, '05', T, 0))) AS "05",
       TO_CHAR(SUM(DECODE(H, '06', T, 0))) AS "06",
       TO_CHAR(SUM(DECODE(H, '07', T, 0))) AS "07",
       TO_CHAR(SUM(DECODE(H, '08', T, 0))) AS "08",
       TO_CHAR(SUM(DECODE(H, '09', T, 0))) AS "09",
       TO_CHAR(SUM(DECODE(H, '10', T, 0))) AS "10",
       TO_CHAR(SUM(DECODE(H, '11', T, 0))) AS "11",
       TO_CHAR(SUM(DECODE(H, '12', T, 0))) AS "12",
       TO_CHAR(SUM(DECODE(H, '13', T, 0))) AS "13",
       TO_CHAR(SUM(DECODE(H, '14', T, 0))) AS "14",
       TO_CHAR(SUM(DECODE(H, '15', T, 0))) AS "15",
       TO_CHAR(SUM(DECODE(H, '16', T, 0))) AS "16",
       TO_CHAR(SUM(DECODE(H, '17', T, 0))) AS "17",
       TO_CHAR(SUM(DECODE(H, '18', T, 0))) AS "18",
       TO_CHAR(SUM(DECODE(H, '19', T, 0))) AS "19",
       TO_CHAR(SUM(DECODE(H, '20', T, 0))) AS "20",
       TO_CHAR(SUM(DECODE(H, '21', T, 0))) AS "21",
       TO_CHAR(SUM(DECODE(H, '22', T, 0))) AS "22",
       TO_CHAR(SUM(DECODE(H, '23', T, 0))) AS "23"
FROM (SELECT TO_CHAR(FIRST_TIME, 'YYYY-MM-DD') DAY,
             TO_CHAR(FIRST_TIME, 'HH24') H,
             COUNT(1) T
      FROM V$LOG_HISTORY
      WHERE FIRST_TIME > SYSDATE - 8
      GROUP BY TO_CHAR(FIRST_TIME, 'YYYY-MM-DD'), TO_CHAR(FIRST_TIME, 'HH24')
      ORDER BY 1)
GROUP BY DAY
ORDER BY DAY DESC;
DAY             00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23
--------------- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
2018-07-24      10  13  12  4   12  4   4   4   4   12  0   0   0   0   0   0   0   0   0   0   0   0   0   0
2018-07-23      10  10  5   4   12  4   4   4   4   31  17  30  12  5   10  21  14  11  5   4   4   4   4   4
2018-07-22      10  4   4   4   12  4   4   4   4   4   4   5   5   4   4   4   4   4   4   4   4   4   4   4
2018-07-21      10  6   7   4   4   4   12  4   4   4   4   8   8   4   4   4   4   4   4   4   4   4   4   4
2018-07-20      10  17  10  4   12  4   4   4   6   23  17  25  11  4   4   11  10  9   4   9   4   4   4   4
2018-07-19      37  49  37  16  12  4   4   4   30  59  76  112 39  17  44  83  61  28  18  5   4   4   4   4
2018-07-18      40  73  28  4   12  4   4   4   15  65  43  148 28  19  16  78  41  40  52  26  5   6   7   10
2018-07-17      36  54  41  4   12  4   4   4   26  69  108 116 82  22  44  62  77  42  30  10  5   7   4   10
2018-07-16      0   0   0   0   0   0   0   0   0   0   25  96  87  21  28  62  57  103 21  14  13  13  6   10

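After the adjustment on the 20th, the redo switch frequency dropped considerably, and log groups in INACTIVE state are now visible. The current redo configuration meets the needs of the workload.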
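The hourly matrix hides how close together individual switches are. As a complementary sketch (not part of the original analysis), the gap between consecutive switches can be computed per redo thread with `LAG`; the one-day window and the aliases `switch_time` and `minutes_since_prev` are arbitrary choices of mine:

```sql
-- Sketch only: minutes elapsed between consecutive log switches, per redo thread.
-- FIRST_TIME is a DATE, so subtracting two values yields days; * 24 * 60 converts to minutes.
SELECT thread#,
       sequence#,
       TO_CHAR(first_time, 'YYYY-MM-DD HH24:MI:SS') AS switch_time,
       ROUND((first_time
              - LAG(first_time) OVER (PARTITION BY thread# ORDER BY sequence#))
             * 24 * 60, 1) AS minutes_since_prev
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
ORDER  BY thread#, sequence#;
```

Gaps far below the 15-minute guideline mentioned earlier would suggest the redo groups are still too small for the peak write rate.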

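2. On the second problem, we need to check the LGWR trace and the I/O situation at the time.

This is where it gets strange. It is understandable that ASH data could not be written on node 1 while redo was not switching, but there is no LGWR trace at all. The DIAG trace and the other trace files are all there; only the LGWR trace is missing. In principle, a stalled log switch should not prevent a process from writing its trace file.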
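For reference, the expected location of the LGWR trace file can be read from `GV$PROCESS`. This is a sketch that assumes 11g or later, where the `PNAME` and `TRACEFILE` columns exist. The `LIKE 'LG%'` pattern is my choice and is meant to also catch the `LGnn` worker processes that 12c may start:

```sql
-- Sketch only: show the OS process id and trace file path of LGWR
-- (and any LGnn workers) on every instance.
SELECT inst_id,
       pname,
       spid,
       tracefile
FROM   gv$process
WHERE  pname LIKE 'LG%'
ORDER  BY inst_id, pname;
```

The view only reports where the file would be written; whether the file actually exists still has to be checked on the host.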

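Worse still, no host-level monitoring had been deployed, so there is no useful host-level data, and the messages log contains nothing useful either.

At this point the analysis is frustrating: we are about to pin down the source, yet there is no data about the source.

This also shows a difference between Linux and AIX. On AIX we would certainly be able to see whether I/O was abnormal, whereas on Linux, without monitoring deployed, the records are very limited.

Finding the root cause requires more data. For now the only option is to deploy nbu or osw (OSWatcher) to monitor the host layer.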
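Until host monitoring is in place, the database's own wait histograms give a rough view of redo write latency. The sketch below is a suggestion of mine, not something the original investigation used. The counters are cumulative since instance startup, so they cannot reconstruct the incident after the restart; they only help to watch for a recurrence:

```sql
-- Sketch only: latency distribution of LGWR redo writes and of control file reads, per instance.
-- WAIT_TIME_MILLI is the upper bound of each bucket in milliseconds.
SELECT inst_id,
       event,
       wait_time_milli,
       wait_count
FROM   gv$event_histogram
WHERE  event IN ('log file parallel write', 'control file sequential read')
ORDER  BY inst_id, event, wait_time_milli;
```

A growing number of waits in the high-millisecond buckets for log file parallel write would point towards the I/O path rather than the LGWR process itself.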

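Enough digression; here is the summary.

The visible symptom was that users could not log in to the database, with very high library cache lock waits. The source of the problem was that LGWR on node 1 stopped writing redo and the logs stopped switching. The emergency was resolved by restarting node 1. To keep the problem from recurring, for now the redo configuration has been adjusted to reduce the switch frequency, and osw is being deployed to monitor the hosts.

(I will describe how to deploy osw in a later post; it is quite simple.)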
