追根溯源-数据库deadlock重启

每周日必出故障系列，似乎是看不惯我周末休息，老美的圣诞节也没镇住老美的数据库。

数据库在圣诞节最后一刻2节点宕机了！！！！当然按照重启法则似乎能解决一切，但是这里并不适用，重启之后故障重现，给了我足够的取证和分析时间。

Oracle database version 19.5 no CDB

下面看一组alert

2021-12-25T23:28:05.264842+08:00
ARC0 (PID:12469): Archived Log entry 452730 added for T-2.S-25671 ID 0xce9c4479 LAD:1
2021-12-25T23:28:08.840050+08:00 <<<<<<<<<<<< start LMD deadlock 23:28:08
Global Enqueue Services Deadlock detected (DID = 12_1_1572). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:28:19.648620+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1573). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:28:50.740136+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1574). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:29:01.455790+08:00
Thread 2 advanced to log sequence 25673 (LGWR switch)
Current log# 8 seq# 25673 mem# 0: +DATAC1/xxxx/ONLINELOG/group_8.884.1090597299
2021-12-25T23:29:02.815247+08:00
ARC1 (PID:12477): Archived Log entry 452732 added for T-2.S-25672 ID 0xce9c4479 LAD:1
2021-12-25T23:29:04.722593+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1575). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.…省略…2021-12-25T23:46:50.130826+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1628). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:46:52.969550+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lms1_8986_8992.trc (incident=2633432): <<<<< shared pool 4031 heap(4,0),无法获取持久内存
ORA-04031: unable to allocate 168 bytes of shared memory (“shared pool”,“unknown object”,“sga heap(4,0)”,“gcs dynamic shadows lms”)
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633432/xxxx2_lms1_8986_8992_i2633432.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-25T23:46:53.052028+08:00
ORA-04031 heap dump being written to trace file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633432/xxxx2_lms1_8986_8992_i2633432.trc
2021-12-25T23:46:54.659917+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1629). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:47:04.677474+08:00
Global Enqueue Services Deadlock detected (DID = 12_1_1630). More information in file
/u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmd0_8982.trc.
2021-12-25T23:47:08.264118+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j006_276744.trc (incident=2634840):
ORA-04031: unable to allocate 56 bytes of shared memory (“shared pool”,“unknown object”,“KKSSP^11005”,“kglseshtSegs”)
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2634840/xxxx2_j006_276744_i2634840.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-25T23:47:08.980300+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j006_276744.trc:
ORA-04031: unable to allocate 56 bytes of shared memory (“shared pool”,“unknown object”,“KKSSP^11005”,“kglseshtSegs”)
2021-12-25T23:47:09.696941+08:00
Process J006 died, see its trace file
2021-12-25T23:47:09.697025+08:00
kkjcre1p: unable to spawn jobq slave process
2021-12-25T23:47:09.697096+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_cjq0_12905.trc:
2021-12-25T23:47:11.501606+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_mmon_9046.trc (incident=2633640):
ORA-04031: unable to allocate 40 bytes of shared memory (“shared pool”,“select size_for_estimate, …”,“KGLH0^51522e0b”,“kglHeapInitialize:temp”)
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633640/xxxx2_mmon_9046_i2633640.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-25T23:48:03.057455+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j009_279833.trc (incident=2635032):
ORA-04031: unable to allocate 56 bytes of shared memory (“shared pool”,“unknown object”,“KKSSP^9837”,“kglseshtSegs”)
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2635032/xxxx2_j009_279833_i2635032.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-25T23:48:03.470720+08:00
LMS1 (ospid: 8986_8992) has not called a wait for 50 secs.
2021-12-25T23:48:04.135748+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j009_279833.trc:
ORA-04031: unable to allocate 56 bytes of shared memory (“shared pool”,“unknown object”,“KKSSP^9837”,“kglseshtSegs”)
2021-12-25T23:48:04.481052+08:00
Process J009 died, see its trace file
2021-12-25T23:48:04.481136+08:00
kkjcre1p: unable to spawn jobq slave process
2021-12-25T23:48:04.481196+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_cjq0_12905.trc:
<<<<<< LMS hang
2021-12-25T23:48:06.155760+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_lmhb_9009.trc (incident=2633480):
ORA-29770: global enqueue process LMS1 (OSID 8986_8992) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633480/xxxx2_lmhb_9009_i2633480.trc
2021-12-25T23:48:08.689426+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_xxxx by memno 1
2021-12-25T23:48:09.325166+08:00
Process termination requested for pid 8986_8992 [source = rdbms], [info = 0] [request issued by pid: 9009, uid: 1001]
2021-12-25T23:48:09.325191+08:00
Process termination requested for pid 8984_8991 [source = rdbms], [info = 0] [request issued by pid: 9009, uid: 1001]
2021-12-25T23:48:09.325281+08:00
Process termination requested for pid 9020 [source = rdbms], [info = 0] [request issued by pid: 9009, uid: 1001]
LMHB (ospid: 9009): terminating the instance due to ORA error 29770
2021-12-25T23:48:09.350502+08:00
Cause - ‘ERROR: Some process(s) is not making progress.
LMHB (ospid: 9009) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s)’
2021-12-25T23:48:09.393157+08:00
Dumping diagnostic data in directory=[cdmp_20211225234809], requested by (instance=2, osid=9009 (LMHB)), summary=[abnormal instance termination].
2021-12-25T23:48:09.662023+08:00
ORA-1092 : opitsk aborting process
2021-12-25T23:48:09.682359+08:00
ORA-1092 : opitsk aborting process
2021-12-25T23:48:10.708035+08:00
License high water mark = 211
2021-12-25T23:48:11.562372+08:00
Instance terminated by LMHB, pid = 9009
2021-12-25T23:48:11.663362+08:00
Warning: 2 processes are still attacheded to shmid 2135162902:
(size: 2097152 bytes, creator pid: 7506, last attach/detach pid: 280327)
2021-12-25T23:48:12.708585+08:00
USER(prelim) (ospid: 280324): terminating the instance
2021-12-25T23:48:12.710943+08:00
Instance terminated by USER(prelim), pid = 280324
2021-12-25T23:48:15.646596+08:00
Starting ORACLE instance (normal) (OS id: 280415)
2021-12-25T23:48:15.665292+08:00

2021-12-25T23:28:08.840050开始数据库出现Deadlock ，从LMDtrace可以看到是业务相关SQL,从LMDtrace可以分析得出是1节点该SQL阻塞了2节点，在这一瞬间，1节点持有latch，2节点想要获取，产生了Deadlock，继续分析。

  Global Wait-For-Graph(WFG) for GES Deadlock ID=[12_1_1630]
------------------------------------------------------------------------Victim : (instance=2, lock=0x5a206ff10)Start (master) Instance : 2Number of Locks involved : 4Number of Sessions involved : 2User session identified by:
{User Name : fedcUser Machine : localhost.localdomaincnOS Terminal Name : UNKNOWNOS Process ID : 977078OS Program Name : sqlldr@localhost.localdomain (TNS V1-V3)Application Name : SQL Loader Conventional Path Load V1-V3)V3)Action Name : 5/load.ctl1640446941377_182108eanup SlaveCurrent SQL : INSERT INTO xxxxxxx   (ID,UDN,TIME) VALUES (:1,:2,CAST(TO_DATE(:TIME, 'yyyy-mm-dd hh24:mi:ss') AS DATE))Session Number : 2918Session Serial Number : 16119Server Process ORAPID : 252Server Process OSPID : 251878Instance : 2
}
waiting for Lock 0x5a206ff10 (Transaction):
{Lock Level : KJUSERPRResource Name : TX 0x2b8000e.0x7cee(ext 0x0,0x0)GES Transaction ID : FC000-0002-00000B1C
}
which is blocked by Lock 0xaf1b8a50 (Transaction):
{Lock Level : KJUSERPRResource Name : TX 0x2b8000e.0x7cee(ext 0x0,0x0)GES Transaction ID : 21B000-0001-0000001E
}
owned by the
User session identified by:
{User Name : fedcUser Machine : localhost.localdomaincnOS Terminal Name : UNKNOWNOS Process ID : 32302OS Program Name : sqlldr@localhost.localdomain (TNS V1-V3)Application Name : SQL Loader Conventional Path Load V1-V3)Action Name : 5/load.ctl1640446930728_722208Current SQL : INSERT INTO xxxxxxx   (ID,UDN,TIME) VALUES (:1,:2,CAST(TO_DATE(:TIME, 'yyyy-mm-dd hh24:mi:ss') AS DATE))Session Number : 11008Session Serial Number : 1635Server Process ORAPID : 539Server Process OSPID : 88755Instance : 1
}
waiting for Lock 0x2e5942bc8 (Transaction):
{Lock Level : KJUSERPRResource Name : TX 0xe590011.0xfcec(ext 0x0,0x0)GES Transaction ID : 21B000-0001-0000001E
}

2021-12-25T23:46:52.969550数据库出现内存不足报ORA-4031,无法获取heap(4,0)即第四个subpool的0号持久内存。从trace中可以看到lms进程在等待gc message
以下是LMS trace

*** 2021-12-25T23:46:50.433900+08:00
2021-12-25 23:46:50.433 : GSIPC:LOOP: cycle time(us) 544177 active 544155 rl/rq/wq/mq 0/0/0/0
2021-12-25 23:46:50.433 : GSIPC:LOOP: wait/busy(us) 22/0 sndf 0 pmsg 0 sync 0 frez 0 fbufs 0 rcvm 544177 othr 0
===[ LM History ]===KJM HIST LMS1 (most recent first): 7:0u 6:0u 10:0u 17:0u 16:0u 18:0u 12:544177u 7:0u 6:0u 10:0u 17:0u 16:0u 14:32:0u 1:0u 12:1:0u 7:0u 6:0u 10:0u 17:0u 16:0u 12:0u 7:0u 6:0u 10:0u 17:0u 16:0u 13:65521:0u 20:32:0u 20:32:0u 20:32:0u 1:0u 14:32:0u 1:0u 12:2:0u 7:0u 6:0u 10:0u 17:0u 16:0u 12:0u 7:0u 6:0u 10:0u 17:0u 16:0u 14:32:0u 1:0u 12:1:0u 7:0u 6:0u 10:0u 17:0u 16:0u 12:0u 7:0u 6:0u 10:0u 17:0u 16:0u 14:32:0u 1:0u 12:1:0u 7:0u 6:0u
===[ Process Comm State Object ]===GES MSG BUFFERS: st=emp chunk=(nil) hdr=(nil) lnk=(nil) flags=0x1000 inc=12outq=0 sndq=0 rmsq=0 opid=25prmb=(nil) : (id 0 st 0x0 where 0 curq (nil))smp[i]=(1 0x60dd02980) smp[b]=(1 0x60dd02e10) pmp=0x503aba6e0mbg[i]=(1 2482) mbg[b]=(4 0) mbg[r]=(0 0)fmq[i]=(30 72) fmq[b]=(20 33) fmq[r]=(0 0)mop[s]=1068966156 mop[q]=769906694 pendq=0 zmbq=0 tmoq=0 nonksxp_recvs=0SO dump_time_stamp=0SO SendQ=0  start_tm=0  end_tm=0SO FreeQ reg=72  start_tm=3deb6c  end_tm=1edf46SO FreeQ reg : 32:1 43:2 44:66 114:3 SO FreeQ bth=33  start_tm=3debe3  end_tm=3ce553SO FreeQ bth : rsv:(tot 33 ib 0 pb 33)SO FreeQ bth : embed: 32:20 34:1327 36:16 SO FreeQ rsv=0  start_tm=0  end_tm=0
===[ Session Wait History ]===Current Wait Stack:Not in wait; last wait ended 0.000393 sec ago Wait State:fixed_waits=0 flags=0x21 boundary=(nil)/-1Session Wait History:elapsed time of 0.000410 sec since last wait0: waited for 'gcs remote message'waittime=0x1e, poll=0x0, event=0x0wait_id=307140422 seq_num=43764 snap_id=1wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 secwait times: max=infinitewait counts: calls=1 os=1occurred after 0.000005 sec of elapsed time1: waited for 'gcs remote message'waittime=0x1e, poll=0x0, event=0x0wait_id=307140421 seq_num=43763 snap_id=1wait times: snap=0.000007 sec, exc=0.000007 sec, total=0.000007 secwait times: max=infinitewait counts: calls=1 os=1occurred after 0.000001 sec of elapsed time.......省略.........
*** 2021-12-25T23:46:52.970095+08:00
2021-12-25T23:46:52.970082+08:00
Incident 2633432 created, dump file: /u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_2633432/xxxx2_lms1_8986_8992_i2633432.trc
ORA-04031: unable to allocate 168 bytes of shared memory ("shared pool","unknown object","sga heap(4,0)","gcs dynamic shadows lms")TOC00000 - Table of contents
TOC00001 - Error Stack
TOC00002 - Dump for incident 2633432 (ORA 4031)
| TOC00003 - Beginning of Customized Incident Dump(s)
| | TOC00004 - Call Stack Trace
| | TOC00005 - Kernel Stack Trace
| | TOC00006 - Process Map Dump
| | TOC00007 - LIBRARY CACHE DUMP
End of TOC*** 2021-12-25T23:48:06.100278+08:00
Received ORADEBUG command (#11) 'dump KSTDUMPCURPROC 1' from process '9009'
-------------------------------------------------------------------------------
Trace Bucket Dump Begin: default bucket for process 25 (osid: 8986_8992, LMS1)
TIME(*=approx):SEQ:COMPONENT:FILE@LINE:FUNCTION:SECT/DUMP:SID:SERIAL#: [EVENT#:PID] DATA
-------------------------------------------------------------------------------
2021-12-25 23:45:54.768 :A649BD31:db_trace:ksl2.c@3492:kslwtectx():14451:2949: [10005:25] KSL WAIT END [latch: gcs resource hash] 25702449152/0x5fbfc4000 253/0xfd 0/0x0 wait_id=306599374 seq_num=26996 snap_id=1   <<<<<<<<<<<<
2021-12-25 23:45:54.768 :A649BD32:db_trace:ksl2.c@3496:kslwtectx():14451:2949: [10005:25] KSL WAIT END wait times (usecs) - snap=316, exc=316, tot=316
.....省略......
2021-12-25 23:46:51.261 :A79A1459:db_trace:ksl2.c@3496:kslwtectx():14451:2949: [10005:25] KSL WAIT END wait times (usecs) - snap=103, exc=103, tot=103
-------------------------------------------------------------------------------
Trace Bucket Dump End: default bucket for process 25 (osid: 8986_8992, LMS1)*** 2021-12-25T23:48:06.101690+08:00
Finished processing ORADEBUG command (#11) 'dump KSTDUMPCURPROC 1'

2021-12-25T23:48:06.155760 LMHD（LOCAL MANAGER HeartBeat）进程监控到LMS进程超时达到70s，且等待事件是sga相关的。之后数据库宕机。
以下是LMDB trace

*** 2021-12-25T23:48:03.470524+08:00
==============================
LMS1 (ospid: 8986_8992) has not moved for 71 sec (1640447283.1640447212): heartbeat check status 6 (no-heartbeat) (threshold 70 sec)
==================================================
=== LMS1 (ospid: 8986_8992) Heartbeat Report
==================================================
LMS1 (ospid: 8986_8992) has no heartbeats for 71 sec. (threshold 70)   <<<<: heartbeat state 0x2.ffff () pso-flag 0x100: Not in wait; last wait ended 50 secs ago.: last wait_id 307162779 at 'latch: shared pool'.
==============================
Dumping PROCESS LMS1 (ospid: 8986_8992) States
==============================
===[ System Load State ]===CPU Total 25 Raw 25 Core 48 Socket 2Load normal: Cur 2856 Highmark 122880 (11.15 480.00)
===[ Latch State ]===Not in Latch Get
===[ Session State Object ]===----------------------------------------SO: 0x623e95c00, type: session (4), map: 0x61e1d07d0state: LIVE (0x4532), flags: 0x5owner: 0x623fe0458, proc: 0x623fe0458link: 0x623e95c20[0x61bd43ab8, 0x54bed7c28]child list count: 1, link: 0x623e95c70[0x603fb84c0, 0x603fb84c0]pg: 1SOC: 0x61e1d07d0, type: session (4), map: 0x623e95c00state: LIVE (0x99fc), flags: INIT (0x1)(session) sid: 14451 ser: 2949 trans: (nil), creator: 0x62000f978flags: (0x80000011) USR/- flags2: (0x409) -/-/INCflags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-DID: 0000-0019-000000030000-0000-00000000, short-term DID: txn branch: (nil)edition#: 0            user#/name: 0/SYSoct: 0, prv: 0, sql: (nil), psql: (nil)stats: 0x5fede9c98, PX stats: 0x129fa544service name: SYS$BACKGROUNDCurrent Wait Stack:Not in wait; last wait ended 51.130431 sec ago Wait State:fixed_waits=0 flags=0x20 boundary=(nil)/-1Session Wait History:elapsed time of 51.130459 sec since last wait0: waited for 'latch: shared pool'            <<<<address=0x608c20c8, number=0x26b, why=0x60187f18wait_id=307162779 seq_num=611 snap_id=1wait times: snap=0.001530 sec, exc=0.001530 sec, total=0.001530 secwait times: max=infinitewait counts: calls=0 os=0occurred after 19.636973 sec of elapsed time1: waited for 'PGA memory operation'            <<<<=0x130000, =0x1, =0x0wait_id=307162778 seq_num=610 snap_id=1wait times: snap=0.000006 sec, exc=0.000006 sec, total=0.000006 secwait times: max=infinitewait counts: calls=0 os=0occurred after 0.078517 sec of elapsed time2: waited for 'SGA: allocation forcing component growth'           <<<<=0x0, =0x0, =0x0wait_id=307162752 seq_num=609 snap_id=26wait times: snap=0.000000 sec, exc=0.259441 sec, total=0.263625 secwait times: max=infinitewait counts: calls=25 os=25occurred after 0.000000 sec of elapsed time.......省略......

分析到这里思路略有清晰，开始怀疑是不是内存问题导致的，DeadLock 和sga到底谁是因谁是果，陷入沉思!!!既然要确定因果那么就看下dba_hist_active_sess_history吧.
分析完日志似乎并不能得到准确答案，接下来进入数据库查一些问题，从下图看到在23:27分开始两个节点执行次数是开始增长的，对比alert日志（23:28:08开始抛出的死锁信息）

select  to_char(sample_time,'YYYY-MM-DD hh24:mi:ss')  mtime,instance_number,count(*) from  awr_1225 where
sample_time > to_date('2021-12-25 23:10:00','YYYY-MM-DD hh24:mi:ss')
and
sample_time< to_date('2021-12-25 23:50:00','YYYY-MM-DD hh24:mi:ss')
group by to_char(sample_time,'YYYY-MM-DD hh24:mi:ss')   ,instance_number
order by 1,2,3

继续分析，发现执行次数突增的SQL并没有一些规律，似乎所有SQL都慢。

select * from (
select  to_char(sample_time,'YYYY-MM-DD hh24:mi:ss')  mtime,instance_number,sql_id,count(*)  counfrom awr_1225 where
sample_time > to_date('2021-12-25 23:10','YYYY-MM-DD hh24:mi:ss')
and  sample_time< to_date('2021-12-25 23:50','YYYY-MM-DD hh24:mi:ss')
group by to_char(sample_time,'YYYY-MM-DD hh24:mi:ss')   ,instance_number,sql_id
order by 1,2,4 ) where coun >8

针对以上圈出的三个SQL我们查一下等待事件

等待事件过多的是gc freelist，看到这里一点一点确认内存可能真的有些问题了，但分析问题还是要论证清楚。
DeadLock报出的SQL我们也要看下。

select  to_char(sample_time,'YYYY-MM-DD hh24:mi:ss')  mtime,instance_number,sql_id, blocking_inst_id,
blocking_session,blocking_session_serial#,time_waited,eventfrom awr_1225 where   sql_id=' c2n8va5bpatuu ' and
sample_time > to_date('2021-12-25 23:10','YYYY-MM-DD hh24:mi:ss')
and  sample_time< to_date('2021-12-25 23:50','YYYY-MM-DD hh24:mi:ss')
order by 1

而阻塞c2n8va5bpatuu的三个session又十分有意思，从sql_exec_id确认这是一次执行，执行了至少4分钟没有执行完，这个SQL也又必须要看下。

接下来看下这个SQL的历史执行情况，由于是sqlloader导入的，sql_plan_hash_value是0，一瞬间我看一个insert要需要数毫秒至数百毫秒（这是一小时内的平均值，上面那三个session执行数分钟），—这一定有问题。
可当我看到21号的执行情况时打消了我对这个SQL的怀疑，21号也有398毫秒的情况，这个跟一次导入的数据量有一定关系，但是这里看到的insert需要数毫秒至数百毫秒也不是根。

追根溯源，那么这个根在哪

2021-12-25T23:28:08.840050 数据库开始deadlock ，产生deadlock的SQL在这个时间段内增长并不典型，且执行效率慢并不能确定是根。
2021-12-25T23:46:52.969550 数据库开始出现内存不足，直接导致数据库宕机，产生的等待是sga相关内存等待。
在故障之前active session在300左右，主要的SQL等待事件23:26分开始是gc freelist，甚至在23:18分也有gc freelist等待。

问题开始浮出水面

24G的sga，shared pool居然占到了21G，buffer cache可用不足1G，难怪一直报free list。继续看占用sharedpool比较大的是 SQLA和KGLHO，SQLA存放的是SQL文本，KGLHO（Kernel Generic Library Handle）是用来管理library cache内存分配的。这两个占了9G，需要看下数据库硬解析了。

看到这里问题已经明确了，硬解析比率占到了4%，这已经很高了

select substr(a.SQL_TEXT,1,80),count(*) cont
from v$sqlarea a
group by substr(a.SQL_TEXT,1,80)
having count(*) > 10
order by  cont desc ;

追根溯源，到底是硬解析占用了内存导致圣诞节也不平安。

温故而知新。

数据库硬解析较高，占用了大量内存（9G,sga 24G）,致使ASMM管理下的sharedpool在极具膨胀，受罪的就是buffer（不足1G），一条SQL执行是需要buffer来缓存storage数据块的，那么buffer cache中的LRU list就会快速老化，不断频繁的从buffer中申请内存和等待申请内存，原本正常执行的SQL就增加了等待内存分配的时间，反过来 shared pool持有latch时间也随之边长，需要更多的shared pool，在不断地占地盘，直至4031,2节点无法分配内存来存储1节点返回来的gc message，lms等待71s，自动重启。

追根溯源-数据库deadlock重启相关推荐

地税某数据库异常重启和重启后数据库运行缓慢问题的解决过程
数据库版本:10.2.0.3 数据库架构:双节点RAC 主机平台:IBM小型机,AIX5.3 这次问题解决过程确认辽宁目前存在2个问题,记录如下: 第一,数据库实例异常重启.第二,数据库异常重启后造成 ...
Oracle数据库产重启服务、监听程序命令
Oracle数据库产重启服务.监听程序命令文章目录 Oracle数据库产重启服务.监听程序命令前言一.重启Oracle数据库总结前言提示:以下是本篇文章正文内容,下面案例可供参考一.重启 ...
服务器重启oracle数据库服务器,oracle数据库怎么重启_网站服务器运行维护,oracle,数据库,重启...
linux操作系统好学吗_网站服务器运行维护学习大多类似鹿丁解牛,对事物的认识一般都是由浅入深.由表及里的过程,循序才能渐进.学习Linux同样要有一定的顺序和方法,这样学起来就不会感觉到难了. o ...
通过服务重启oracle数据库,oracle数据库怎么重启_网站服务器运行维护
linux操作系统好学吗_网站服务器运行维护学习大多类似鹿丁解牛,对事物的认识一般都是由浅入深.由表及里的过程,循序才能渐进.学习Linux同样要有一定的顺序和方法,这样学起来就不会感觉到难了. o ...
mysql重启数据库_重启MySQL数据库的实际操作步骤概述
以下的文章主要介绍的是重启MySQL数据库的实际操作方案,当我们修改了my.cnf,就需要对MySQL数据库进行重启服务,以下的文章就是对重启MySQL数据库的实际操作方案的具体描述,望你会有所收获. ...
数据库服务器主机重启故障诊断分析
墨墨导读:某客户RAC数据库服务器主机轮流发生集群与主机重启,数据库连接不上问题,如下为故障诊断思路. 一.故障现象告警日志: Sun Feb 09 14:18:42 2020 Auto-tunin ...
mysql命令面板数据更改_宝塔面板数据库自动停止解决办法，宝塔面板MySQL数据库自动重启shell脚本...
宝塔的数据库经常性自动停止,是因为网站频繁的请求数据库,而服务器内存又不足,为了保证服务器不彻底卡死,保护性的自动停止数据库,特别是有些程序比如wordpress的数据库查询次数尤为突出,wordpr ...
查看oracle关闭重启,oracle 数据库关闭重启
Oracle 11G DataGuard重启详细过程场景,重启数据库,不重启linux系统,所以不用考虑监听程序, #linux输入lsnrctl start 1 数据库关闭1.1 关闭主库 SHU ...
mysql数据库服务器重启_重启mysql数据库服务器
Mysql错误代码大全 1016错误:文件无法打开,使用后台修复或者使用phpmyadmin进行修复. 1044错误:数据库用户权限不足,请联系空间商解决 1045错误:数据库服务器/数据库用户名/数 ...

追根溯源-数据库deadlock重启

追根溯源-数据库deadlock重启相关推荐

最新文章

热门文章