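  • An Unexpected Twist After an Unexpected Crash

    Environment:

    The mobile banking system runs on two P6 550 servers in a PowerHA cluster. One evening the operations team noticed an alert: one of the hosts had crashed unexpectedly. After getting the call I rushed to the site and found a glaring amber light on the P6 front panel. To preserve the scene I left that machine untouched and first checked the other host for clues about the crash.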

    1. Related errpt errors

    Description

    Possible malfunction on local adapter

    Probable Causes

    Local adapter mal-functioned

    Local adapter lost connection to network

    Local adapter mis-configured

    Failure Causes

    Local adapter mal-functioned

    Local adapter lost connection to network

    Local adapter mis-configured

    Recommended Actions

    Verify adapter configuration

    Verify network connectivity

    2. PowerHA error log

    May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT START: node_down SJbank2

    May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: node_down SJbank2 0

    May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT START: node_down_complete SJbank2

    May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete SJbank2 0

    May 14 23:38:34 SJbank1 daemon:notice topsvcs[181226]: (Recorded using libct_ffdc.a cv 2):::Error ID:6zV5DL.myHL9/i0x/6LF.4....................:::Reference ID: :::Template ID:173c787f:::Details File: :::Location:rsct,nim_control.C,1.39.1.18,4303 :::TS_LOC_DOWN_ST Possible malfunction on local adapter Adapter interface name tty0 Adapter offset 2 Adapter IP address 255.255.0.0

    May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT START: network_down minus 1 net_rs232_01

    May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: network_down minus 1 net_rs232_01 0

    May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT START: network_down_complete minus 1 net_rs232_01

    May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: network_down_complete minus 1 net_rs232_01 0
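When reading a PowerHA log like the one above, it can help to pull out just the event names and their order. The following is a minimal Python sketch, not part of the original write-up; the helper names (`parse_events`, `first_event`) are made up for illustration, and the sample lines are copied from the log with the spacing normalized.

```python
import re

# Sample lines copied from the log above (spacing normalized).
LOG = """\
May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT START: node_down SJbank2
May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: node_down SJbank2 0
May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT START: network_down minus 1 net_rs232_01
May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: network_down minus 1 net_rs232_01 0
"""

# Timestamp, then anything, then "EVENT START|COMPLETED: <event> <args>".
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}).*?"
    r"EVENT (?P<phase>START|COMPLETED): (?P<event>\S+)\s*(?P<args>.*)$"
)


def parse_events(text):
    """Return (timestamp, phase, event, args) tuples in log order."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            events.append((m["ts"], m["phase"], m["event"], m["args"]))
    return events


def first_event(text):
    """Return the name of the first event that started, or None."""
    for _ts, phase, event, _args in parse_events(text):
        if phase == "START":
            return event
    return None


if __name__ == "__main__":
    for entry in parse_events(LOG):
        print(entry)
```

In this excerpt the `node_down` event starts at 23:38:15 and the `network_down` event for `net_rs232_01` follows at 23:38:36.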

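    From the logs above, the culprit seemed to be pinned down:

    an abnormal PowerHA serial heartbeat at that time had brought one host down.

    With the cause found, the next step was to boot the host. Then came the surprise: the host would not start and finally hung at code 11002630. That pointed to a hardware problem, so we called in the vendor right away.

    The vendor said the fault was caused by the CPU regulator. A spare part was brought in and swapped, and the host booted normally.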

    Shared by community member “hp_hp”

  • Oracle Bug Causes an Instance Crash: ORA-07445

    2013-12-23 21:22:26
    Symptom:

    A customer's database (RAC environment, 11.1.0.6) suffered an abnormal instance crash, accompanied by ORA-07445 errors:
    Sun Jun 23 01:00:06 2013
    Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xF] [PC:0x755773D, kcbw_get_bh()+67]
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_mman_2015.trc (incident=298938):
    ORA-07445: exception encountered: core dump [kcbw_get_bh()+67] [SIGSEGV] [ADDR:0xF] [PC:0x755773D] [Address not mapped to object] []
    Incident details in: /oracle/app/11gR1/diag/rdbms/xij/xij1/incident/incdir_298938/xij1_mman_2015_i298938.trc
    Sun Jun 23 01:00:07 2013
    Trace dumping is performing id=[cdmp_20130623010007]
    Sun Jun 23 01:00:09 2013
    Sweep Incident[298938]: completed
    Sun Jun 23 01:00:09 2013
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_pmon_1981.trc:
    ORA-00822: MMAN process terminated with error
    PMON (ospid: 1981): terminating the instance due to error 822
    Sun Jun 23 01:00:09 2013
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_j000_22268.trc:
    ORA-00822: MMAN process terminated with error
    Sun Jun 23 01:00:09 2013
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_m000_22430.trc:
    ORA-00822: MMAN process terminated with error
    System state dump is made for local instance
    System State dumped to trace file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_diag_1987.trc
    Sun Jun 23 01:00:09 2013
    ORA-1092 : opiodr aborting process unknown ospid (11096_47524616916112)
    Sun Jun 23 01:00:09 2013
    ORA-1092 : opitsk aborting process
    Sun Jun 23 01:00:09 2013
    ORA-1092 : opiodr aborting process unknown ospid (6317_47353365785744)
    Sun Jun 23 01:00:09 2013
    ORA-1092 : opitsk aborting process
    Sun Jun 23 01:00:09 2013
    ORA-1092 : opiodr aborting process unknown ospid (28698_47056912551056)
    Sun Jun 23 01:00:09 2013
    ORA-1092 : opitsk aborting process
    Sun Jun 23 01:00:09 2013
    ORA-1092 : opiodr aborting process unknown ospid (18927_47567504653456)
    Sun Jun 23 01:00:10 2013
    ORA-1092 : opitsk aborting process
    Sun Jun 23 01:00:10 2013
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_q001_3487.trc:
    ORA-00822: MMAN process terminated with error
    ORA-1092 : opidrv aborting process Q001 ospid (3487_47252506410128)
    Sun Jun 23 01:00:11 2013
    ORA-1092 : opitsk aborting process
    Sun Jun 23 01:00:11 2013
    License high water mark = 510
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_m000_22430.trc:
    ORA-00822: MMAN process terminated with error
    ORA-00822: MMAN process terminated with error
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_j000_22268.trc:
    ORA-00449: background process 'LGWR' unexpectedly terminated with error 822
    ORA-00822: MMAN process terminated with error
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_j000_22268.trc:
    ORA-00449: background process 'LGWR' unexpectedly terminated with error 822
    ORA-00822: MMAN process terminated with error
    Errors in file /oracle/app/11gR1/diag/rdbms/xij/xij1/trace/xij1_j000_22268.trc:
    ORA-00604: error occurred at recursive SQL level 1
    ORA-00822: MMAN process terminated with error
    ORA-06512: at "WKSYS.WK_JOB", line 442
    ORA-00449: background process 'MMON' unexpectedly terminated with error 822
    ORA-00822: MMAN process terminated with error
    ORA-06512: at line 1
    ORA-1092 : opidrv aborting process J000 ospid (22268_47357930925200)
    Sun Jun 23 01:00:20 2013
    Instance terminated by PMON, pid = 1981
    Sun Jun 23 01:00:21 2013
    USER (ospid: 22527): terminating the instance
    Instance terminated by USER, pid = 22527
    Sun Jun 23 01:00:26 2013
    Starting ORACLE instance (normal)

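    Analysis:
    ORA-07445 is usually caused by a bug in Oracle itself.
    First, IPS was used to collect the error information from the alert log (for how to use IPS, see my other article, "A Simple Guide to Using IPS"):
    A search on Metalink showed that the customer's problem resembles the bugs described in the following three Notes: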
    ORA-7445 (kcbw_get_bh) [ID 1341402.1]
    Bug 9728912 [https://bug.oraclecorp.com/pls/bug/webbug_edit.edit_info_top?rptno=9728912] - PMON terminates instance due to ORA-7445 [kcbw_numperchunk] / ORA-7445 [kcbw_get_bh]] [ID 9728912.8]
    Instance Crashed On ORA-7445 kcbw_numperchunk [ID 1364264.1]
    According to the Notes, however, the related bugs are already fixed in 11.1.0.6.
    Here are the other serious errors recorded in the customer's database:
    Node1:
    adrci> show problem

    ADR Home = /oracle/app/11gR1/diag/rdbms/xij/xij1:
    *************************************************************************
    PROBLEM_ID PROBLEM_KEY LAST_INCIDENT LASTINC_TIME
    -------------------- ----------------------------------------------------------- -------------------- ----------------------------------------
    5 ORA 7445 [kcbw_get_bh()+67] 298938 2013-06-23 01:00:06.373716 +08:00
    11 ORA 600 276161 2013-06-04 18:12:12.709933 +08:00
    10 ORA 600 [729] 276160 2013-06-04 18:09:27.857128 +08:00
    7 ORA 7445 [kgghash()+367] 253234 2013-06-03 15:27:04.349337 +08:00
    9 ORA 7445 [kksMapCursor()+323] 256538 2013-05-27 09:54:58.684956 +08:00
    8 ORA 7445 [qkabxo()+22] 251194 2013-05-01 22:03:37.715416 +08:00
    2 ORA 600 [kghfrh:ds] 238818 2013-01-28 11:35:23.755034 +08:00
    6 ORA 7445 [eoa_pm_push()+31] 239218 2013-01-28 11:24:42.835685 +08:00
    3 ORA 7445 [ioei_get_method_counts()+39] 71129 2012-10-17 11:17:39.735719 +08:00
    4 ORA 7445 [jol_calculate_transitive_interface_set()+1165] 74233 2012-10-17 11:05:51.570021 +08:00
    1 ORA 600 [kghfru:ds] 6369 2012-09-07 17:35:55.001585 +08:00
    11 rows fetched
    Node2:
    [oracle@XIJ02 ~]$ adrci

    ADRCI: Release 11.1.0.6.0 - Beta on Mon Jun 24 14:59:37 2013

    Copyright (c) 1982, 2007, Oracle. All rights reserved.
    ADR base = "/oracle/app/11gR1"
    adrci>
    adrci>
    adrci> set homepath diag/rdbms/xij/xij2
    adrci>
    adrci> show problem
    ADR Home = /oracle/app/11gR1/diag/rdbms/xij/xij2:
    *************************************************************************
    PROBLEM_ID PROBLEM_KEY LAST_INCIDENT LASTINC_TIME
    -------------------- ----------------------------------------------------------- -------------------- ----------------------------------------
    1 ORA 7445 [kgghash()+367] 209965 2013-06-16 23:34:39.333982 +08:00
    2 ORA 7445 [kksMapCursor()+323] 190129 2013-05-27 09:54:56.121652 +08:00
    2 rows fetched
    adrci>
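    Solution:
    Across the customer's two nodes, 13 database problems suspected to be caused by bugs were found in total. Overall, Oracle 11.1.0.6 is not a particularly stable release and carries a variety of bugs.
    Oracle fixed most of the bugs found in 11.1.0.6 in 11.1.0.7, which is considerably more stable by comparison, so the customer was advised to upgrade the database to 11.1.0.7 or 11.2.0.3.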
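The total of 13 problems can be reproduced by tallying the rows of the two `show problem` listings. This is a rough Python sketch under the assumption that the rows look exactly like the ones pasted above; `parse_problems` and `tally` are illustrative helper names, not ADRCI features.

```python
import re
from collections import Counter

# Rows copied from the two "show problem" listings above.
NODE1 = """\
5 ORA 7445 [kcbw_get_bh()+67] 298938 2013-06-23 01:00:06.373716 +08:00
11 ORA 600 276161 2013-06-04 18:12:12.709933 +08:00
10 ORA 600 [729] 276160 2013-06-04 18:09:27.857128 +08:00
7 ORA 7445 [kgghash()+367] 253234 2013-06-03 15:27:04.349337 +08:00
9 ORA 7445 [kksMapCursor()+323] 256538 2013-05-27 09:54:58.684956 +08:00
8 ORA 7445 [qkabxo()+22] 251194 2013-05-01 22:03:37.715416 +08:00
2 ORA 600 [kghfrh:ds] 238818 2013-01-28 11:35:23.755034 +08:00
6 ORA 7445 [eoa_pm_push()+31] 239218 2013-01-28 11:24:42.835685 +08:00
3 ORA 7445 [ioei_get_method_counts()+39] 71129 2012-10-17 11:17:39.735719 +08:00
4 ORA 7445 [jol_calculate_transitive_interface_set()+1165] 74233 2012-10-17 11:05:51.570021 +08:00
1 ORA 600 [kghfru:ds] 6369 2012-09-07 17:35:55.001585 +08:00
"""
NODE2 = """\
1 ORA 7445 [kgghash()+367] 209965 2013-06-16 23:34:39.333982 +08:00
2 ORA 7445 [kksMapCursor()+323] 190129 2013-05-27 09:54:56.121652 +08:00
"""

# PROBLEM_ID, "ORA <code>", optional [argument], LAST_INCIDENT, date.
ROW_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+ORA\s+(?P<code>\d+)"
    r"(?:\s+\[(?P<arg>.+)\])?"
    r"\s+(?P<incident>\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})"
)


def parse_problems(text):
    """Return (error_code, argument, date) for every problem row."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((f"ORA-{int(m['code']):05d}", m["arg"], m["date"]))
    return rows


def tally(*listings):
    """Count problems per ORA error code across all listings."""
    counts = Counter()
    for text in listings:
        for code, _arg, _date in parse_problems(text):
            counts[code] += 1
    return counts


if __name__ == "__main__":
    counts = tally(NODE1, NODE2)
    print(dict(counts), "total:", sum(counts.values()))
```

Grouping by error code shows that most of the entries are ORA-07445 crashes in different functions rather than one repeated failure.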



    Appendix:
    (Triage Tool 3.01, routed by file analysis):
    Failing Function: kcbw_get_bh
    Route To: BUFFER CACHE:MANAGEABILITY
    Error Argument: [kcbw_get_bh]
    Type of Error: ORA-07445
    File Name: xij1_mman_2015_i298938.trc
    Comment: Routed by Error Argument, Conventional routing
    DB Version: 11.1.0.6.0
    Platform: Linux CPU: x86_64
    OS Version: 2.6.18-194.el5
    Stack Trace: kcbw_get_bh kcbw_get_first_buffer kcbw_next_free kmgs_extract_mem_from_granule kmgs_process_request_immediate kmgs_process_request kmgsdrv ksbabs ksbrdp opirip
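The failing function reported by the triage output is the same one named in the ORA-07445 line of the alert log. As a small illustration, the function name, offset, and signal can be extracted from that line with a regular expression. This Python sketch assumes the message layout shown in this article; `parse_ora7445` is a hypothetical helper, not an Oracle tool.

```python
import re

# The ORA-07445 line copied from the alert log above.
LINE = (
    "ORA-07445: exception encountered: core dump [kcbw_get_bh()+67] "
    "[SIGSEGV] [ADDR:0xF] [PC:0x755773D] [Address not mapped to object] []"
)

ORA7445_RE = re.compile(
    r"ORA-0?7445: exception encountered: core dump "
    r"\[(?P<func>\w+)\(\)\+(?P<off>\d+)\] \[(?P<sig>\w+)\]"
)


def parse_ora7445(line):
    """Return the failing function, offset, and signal, or None if no match."""
    m = ORA7445_RE.search(line)
    if not m:
        return None
    return {"function": m["func"], "offset": int(m["off"]), "signal": m["sig"]}


if __name__ == "__main__":
    print(parse_ora7445(LINE))
```

The function name is the usual search key when looking for matching Notes, as was done with `kcbw_get_bh` above.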
     




  • [Reposted from the official Oracle blog] An Instance Crash Triggered by a Server Time Adjustment

    https://blogs.oracle.com/database4cn/%E4%B8%80%E6%AC%A1%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%97%B6%E9%97%B4%E8%B0%83%E6%95%B4%E5%BC%95%E5%8F%91%E7%9A%84%E5%AE%9E%E4%BE%8B%E5%AE%95%E6%9C%BA%E3%80%82

    By:  Sam Zhao

    Problem description:

    1. The database instance crashed suddenly because ASMB had not responded for more than 200 seconds:

    Mon Sep 04 15:07:47 2017
    WARNING: ASMB has not responded for 200 seconds <<<<<<<<<<<< ASMB has not responded for 200 seconds.
    NOTE: ASM umbilicus running slower than expected, ASMB diagnostic requested after 200 seconds 
    NOTE: ASMB process state dumped to trace file /u01/app/oracle/diag/rdbms/iadw/iadw3/trace/iadw3_gen0_19179.trc
    Mon Sep 04 15:07:49 2017
    NOTE: ASMB terminating
    Mon Sep 04 15:07:49 2017
    Errors in file /u01/app/oracle/diag/rdbms/iadw/iadw3/trace/iadw3_asmb_19501.trc:
    ORA-15064: communication failure with ASM instance
    ORA-03113: end-of-file on communication channel
    Process ID:
    Session ID: 170 Serial number: 65161
    Mon Sep 04 15:07:49 2017
    Errors in file /u01/app/oracle/diag/rdbms/iadw/iadw3/trace/iadw3_asmb_19501.trc:
    ORA-15064: communication failure with ASM instance
    ORA-03113: end-of-file on communication channel
    Process ID:
    Session ID: 170 Serial number: 65161
    USER (ospid: 19501): terminating the instance due to error 15064

    2. Judging from the system state dump, ASMB itself looks fine:

    Current Wait Stack:
    Not in wait; last wait ended 3.321392 sec ago  <<<<<<<<<<<<<<< Not in wait.
    Wait State:
    fixed_waits=0 flags=0x21 boundary=(nil)/-1
    Session Wait History:
    elapsed time of 3.321404 sec since last wait
    0: waited for 'ASM background timer'
    =0x0, =0x0, =0x0
    wait_id=37936676 seq_num=57511 snap_id=1
    wait times: snap=2.682436 sec, exc=2.682436 sec, total=2.682436 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000022 sec of elapsed time
    1: waited for 'ASM file metadata operation'
    msgop=0xc, locn=0x3, =0x0
    wait_id=37936675 seq_num=57510 snap_id=1
    wait times: snap=0.000454 sec, exc=0.000454 sec, total=0.000454 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000017 sec of elapsed time

    3. OSW shows no obvious resource shortage, but there is a gap of more than three minutes in its data:

    zzz ***Mon Sep 4 15:04:13 CST 2017
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    3 0 0 529160192 19412 31514216 0 0 82 48 0 0 1 0 99 0 0
    0 0 0 529124032 19412 31514784 0 0 1545 23119 36620 37705 1 1 99 0 0
    2 0 0 529126784 19412 31514712 0 0 1601 9056 28083 30263 1 0 99 0 0
    zzz ***Mon Sep 4 15:04:23 CST 2017
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    5 0 0 529095360 19412 31514996 0 0 82 48 0 0 1 0 99 0 0
    3 0 0 529118368 19412 31515228 0 0 1517 4540 20402 27856 1 1 98 0 0
    52 0 0 529107936 19412 31515400 0 0 1206 3961 21105 31254 1 0 98 0 0
    zzz ***Mon Sep 4 15:07:51 CST 2017 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<  no records between 15:04:23 and 15:07:51
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    41 0 0 570421952 19412 31556616 0 0 82 48 0 0 1 0 99 0 0
    16 0 0 578182976 19412 31575888 0 0 2129 35 25702 15760 1 8 91 0 0
    5 0 0 582348800 19412 31607740 0 0 5209 40002 22122 19062 1 4 96 0 0
    zzz ***Mon Sep 4 15:08:02 CST 2017
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    4 0 0 617279552 19412 31615300 0 0 82 48 0 0 1 0 99 0 0
    2 0 0 624415168 19412 31617816 0 0 922 2 25322 20023 1 2 98 0 0
    2 0 0 631768448 19412 31615728 0 0 1497 3 25405 22582 1 1 98 0 0

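    At this point the usual thought is: OSW has a gap of more than three minutes, so was system performance so poor that OSW could not write its snapshots? In general, though, some warning sign shows up before such a gap, for example a sharp rise in the blocked queue. Nothing of the kind appears in this case. On to the OS log: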
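The gap can also be found mechanically by comparing consecutive OSW snapshot headers. Below is a minimal Python sketch, assuming the `zzz ***<date>` header format shown above; the 30-second threshold and the helper names are arbitrary choices for illustration. The time zone token (CST) is skipped because `strptime` does not reliably parse arbitrary zone abbreviations.

```python
import re
from datetime import datetime

# Snapshot headers copied from the OSW output above.
STAMPS = """\
zzz ***Mon Sep 4 15:04:13 CST 2017
zzz ***Mon Sep 4 15:04:23 CST 2017
zzz ***Mon Sep 4 15:07:51 CST 2017
zzz ***Mon Sep 4 15:08:02 CST 2017
"""

# Capture "Mon Sep 4 15:04:13" and the year; skip the zone abbreviation.
STAMP_RE = re.compile(
    r"^zzz \*\*\*(\w{3} \w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ (\d{4})"
)


def snapshot_times(text):
    """Return the snapshot timestamps as naive datetime objects."""
    times = []
    for line in text.splitlines():
        m = STAMP_RE.match(line.strip())
        if m:
            stamp = f"{m.group(1)} {m.group(2)}"
            times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times


def find_gaps(times, max_gap_seconds=30):
    """Return (previous, current, seconds) for every oversized interval."""
    gaps = []
    for prev, cur in zip(times, times[1:]):
        delta = (cur - prev).total_seconds()
        if delta > max_gap_seconds:
            gaps.append((prev, cur, delta))
    return gaps


if __name__ == "__main__":
    for prev, cur, delta in find_gaps(snapshot_times(STAMPS)):
        print(f"gap of {delta:.0f}s between {prev} and {cur}")
```

For the four headers above, the only interval flagged is the one from 15:04:23 to 15:07:51, which is 208 seconds.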

    4. The OS log contains one key line:

    Sep 4 15:04:01 hnpb05nc crond: /usr/sbin/postdrop: /lib64/libssl.so.10: no version information available (required by /usr/lib64/mysql/libmysqlclient.so.18)
    Sep 4 15:04:21 hnpb05nc init.tfa: Checking/Starting TFA..
    Sep 4 15:07:47 hnpb05nc systemd: Time has been changed <<<<<<<<<<<<<<<<<<< the system time was changed

    5. Next, the CTSSD trace:

    2017-09-04 15:04:25.799241 : CTSS:3933169408: ctssslave_swm19: The offset is [2311562070 usec] and sync interval set to [4]<<< offset is 2311 seconds
    2017-09-04 15:04:25.799251 : CTSS:3933169408: ctsselect_msm: Sync interval returned in [4]
    2017-09-04 15:04:25.799260 : CTSS:3937371904: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler

    2017-09-04 15:04:26.800845 : CTSS:3933169408: ctssslave_swm19: The offset is [2311562609 usec] and sync interval set to [4]<<< offset is 2311 seconds
    2017-09-04 15:04:26.800856 : CTSS:3933169408: ctsselect_msm: Sync interval returned in [4]
    2017-09-04 15:04:26.800864 : CTSS:3937371904: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler

    2017-09-04 15:04:27.802328 : CTSS:3933169408: ctssslave_swm19: The offset is [2311563057 usec] and sync interval set to [4]<<< offset is 2311 seconds
    2017-09-04 15:04:27.802337 : CTSS:3933169408: ctsselect_msm: Sync interval returned in [4]
    2017-09-04 15:04:27.802346 : CTSS:3937371904: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler

    2017-09-04 15:07:47.065051 : CTSS:3933169408: ctssslave_swm19: The offset is [2509824742 usec] and sync interval set to [4]<<< offset jumps to 2509 seconds
    2017-09-04 15:07:47.065068 : CTSS:3933169408: ctsselect_msm: Sync interval returned in [4]
    2017-09-04 15:07:47.065077 : CTSS:3937371904: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting

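    Clearly, the offset grew by about 200 seconds during the problem window, whereas before that it had been relatively stable. This indirectly confirms that the system time was adjusted.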
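The size of the jump can be computed directly from the `The offset is [... usec]` values in the trace. The following Python sketch assumes the trace line format shown above; `read_offsets` and `offset_jumps` are illustrative names, and the 1-second threshold is an arbitrary choice.

```python
import re

# The four offset lines copied from the CTSSD trace above.
TRACE = """\
2017-09-04 15:04:25.799241 : CTSS:3933169408: ctssslave_swm19: The offset is [2311562070 usec] and sync interval set to [4]
2017-09-04 15:04:26.800845 : CTSS:3933169408: ctssslave_swm19: The offset is [2311562609 usec] and sync interval set to [4]
2017-09-04 15:04:27.802328 : CTSS:3933169408: ctssslave_swm19: The offset is [2311563057 usec] and sync interval set to [4]
2017-09-04 15:07:47.065051 : CTSS:3933169408: ctssslave_swm19: The offset is [2509824742 usec] and sync interval set to [4]
"""

OFFSET_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*"
    r"The offset is \[(-?\d+) usec\]"
)


def read_offsets(text):
    """Return (timestamp, offset_in_microseconds) pairs in trace order."""
    pairs = []
    for line in text.splitlines():
        m = OFFSET_RE.match(line.strip())
        if m:
            pairs.append((m.group(1), int(m.group(2))))
    return pairs


def offset_jumps(text, threshold_seconds=1.0):
    """Return (timestamp, jump_in_seconds) where the offset changed sharply."""
    jumps = []
    pairs = read_offsets(text)
    for (_t0, o0), (t1, o1) in zip(pairs, pairs[1:]):
        delta = (o1 - o0) / 1_000_000
        if abs(delta) > threshold_seconds:
            jumps.append((t1, delta))
    return jumps


if __name__ == "__main__":
    for stamp, delta in offset_jumps(TRACE):
        print(f"{stamp}: offset changed by {delta:.3f} s")
```

Between the first three samples the offset moves by well under a millisecond; the last sample differs from the previous one by about 198.26 seconds.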

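    The story:

    Here is what happened. NTP was configured on the system, but for some reason the NTP service was not running. Because NTP was configured, ctssd found the NTP configuration file and therefore ran only in observer mode. As a result the system time kept drifting, until the system administrator noticed the problem and manually moved the system clock forward by 200 seconds. ASMB then judged from the system time that it had not responded for 200 seconds (which was of course not what really happened), and the instance went down.

    Recommendations:

    We should of course monitor the system as closely as possible and make sure NTP is running properly. If a large manual adjustment to the system time really is necessary, shut down the RAC database first and only then make the change.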
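One generic way to monitor for this kind of clock step is to compare the wall clock with a monotonic clock, since only the wall clock moves when someone sets the time. The Python sketch below is a general technique and an assumption on my part, not something the original article or Oracle tooling uses; `sample_clock` and `detect_steps` are hypothetical helpers, and the 1-second tolerance is arbitrary.

```python
import time


def sample_clock():
    """Take one (wall_clock_seconds, monotonic_seconds) sample."""
    return (time.time(), time.monotonic())


def detect_steps(samples, tolerance=1.0):
    """Return (wall_time, step_seconds) wherever the wall clock was stepped.

    Between two samples the wall clock and the monotonic clock should
    advance by roughly the same amount; a larger difference means the
    wall clock was set forward (positive) or backward (negative).
    """
    steps = []
    for (w0, m0), (w1, m1) in zip(samples, samples[1:]):
        drift = (w1 - w0) - (m1 - m0)
        if abs(drift) > tolerance:
            steps.append((w1, drift))
    return steps


if __name__ == "__main__":
    # Collect a few live samples one second apart and report any step.
    live = []
    for _ in range(3):
        live.append(sample_clock())
        time.sleep(1)
    print(detect_steps(live))
```

With synthetic samples such as `[(1000.0, 50.0), (1001.0, 51.0), (1202.0, 52.0)]`, the third sample is reported as a forward step of 200 seconds, which mirrors the manual adjustment described in the story.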


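  • The application became inaccessible and reported that it could not obtain a database connection; the application went down. The database logged errors in the same period: the maximum number of connections was exceeded and sessions were killed abnormally. Mon Nov 23 00:06:23 2020 ORA-00020: maximum number of processes (4000) exceeded ORA-20 errors...
  • Analysis of an Oracle RAC Crash

    2012-11-30 12:08:11
    Analysis of an Oracle RAC Crash (2012-11-01 23:54) Tags: Oracle Category: oracle The instance on node 2 of a customer's RAC database went down on its own; the analysis report follows. 1. Recap of the symptoms: when node 2 failed, the alert log showed the following: ...
  • Oracle RAC Crash Analysis Report

    2011-03-30 19:02:28
    The instance on node 2 of a customer's RAC database went down on its own; the analysis report follows. 1. Recap of the symptoms: when node 2 failed, the alert log showed the following: Thread 2 advanced to log sequence 77740 (LGWR switch) Current log# 24 seq# 77740 mem# ...
  • While Oracle was running a job (expdp), the database instance went straight down with the following errors: ORA-27300: OS system dependent operation:semctl failed with status: 22 ORA-27301: OS failure message: Invalid argument ORA-27302: failure occurred at:...
  • The day after the incident, the database alert log was analyzed. The log shows that after instance NODE2 went down, RAC did carry out the instance failover steps, but during the failover... 1. Environment: OS: AIX 6.1; Oracle: 11.2.0.3.0 RAC 2. The incident: the minicomputer hosting database NODE 2...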
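  • 1. Environment: OS: AIX 6.1; Oracle: 11.2.0.3.0 RAC 2. The incident: the minicomputer hosting database NODE 2 crashed. The workload should have failed over to NODE1 normally, but the failover failed, ...
  • A user's database suddenly went down. The logs show that large numbers of the following errors appeared before the crash: Errors in file /u01/oracle/admin/orcl/bdump/orcl2_smon_14347.trc: ORA-00604: Message 604 not found; No message file for product=...
  • After the SGA was adjusted, the Oracle instance could not be opened; it was repaired by editing the parameters in the pfile and importing it again
  • Oracle 11g: Changing the System Time Brings Down the Instance

    2011-07-20 11:55:56
    Sun Microsystems sun4u Sun Fire E6900, Oracle 11.1.0.7.0, ASM. Changing the system time brought down the database instance; both the ASM and ORACLE instances were unmounted. Tue Jul 14 20:11:00 1970 Errors in file
  • Oracle Crash Case Collection (1) Case 1: UNDO block corruption prevents Oracle from opening ...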
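  • 1. Fault: the application could not connect to the database, and a check showed that Oracle was down. startup then failed with ORA-03113: end-of-file on communication channel 2. Finding the cause: check the alert log: tail -500 /oracle/database/oracle/diag/rdbms/udb/udb1/...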
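  • Environment: 22.188.20.196. At about 8:30 in the morning, SIT testing switched the system time; the machine's date was changed from 2014-9-21 to 2014-9-30. AIX version: 6100-07-05-1228 ORACLE version: 11.2.0.3.0 ...
  • Introduction ORA-10458: standby database requires recovery ORA-01196: file 1, due to a media recovery session... A project used Data Guard for Oracle primary/standby synchronization. The standby server crashed, and after it was powered back on the standby database could not start, reporting "ORA-01196: ...
  • While recently maintaining a customer's Oracle database system, I found that the database crashed frequently: after each instance restart it ran for about 1 hour 30 minutes and then went down. The listener was still up, but the instance had been killed. The logs showed that the smon process was killed, and that a rollback occurred while updating the smon_scn_time system table...
  • Oracle Instance Recovery

    2010-07-27 10:39:00
    --=======================-- Oracle Instance Recovery --======================= 1. Oracle instance failure: an Oracle instance failure is mostly caused by the instance shutting down in an inconsistent state and is commonly called a crash. The result of an instance failure is equivalent to shutdown abort. Instance failure...
  • Event description and impact: at 04:43 on September 30, 2018, zabbix raised an alert that the odsdb2 database had apparently gone down. The on-duty staff in the machine room could not log in to the database server through the bastion host, and the machine could not be reached over ssh from other machines either. At the same time the odsdb1 database also hung, and logging in to the database from the command line was impossible...
  • While dbca was creating an instance, the cluster went down partway through; recorded here. First, check the dbca log: [root@rac1 ~]# tail -100f /oracle/app/cfgtoollogs/dbca/orcl/trace.log_2021-01-19_05-50-20PM [progressPage.flowWorker] [ 2021-01-19 17:...
  • Oracle Instance Recovery

    2013-09-25 17:01:10
    --=======================-- Oracle Instance Recovery --======================= 1. Oracle instance failure: an Oracle instance failure is mostly caused by the instance shutting down in an inconsistent state and is commonly called a crash. The result of an instance failure is equivalent to shutdown abort. The causes of instance failure...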
