Reliability, Availability, and Serviceability (RAS) Features

A complete RAS design rests on five system pillars:

  • Processor

  • Hardware

  • BIOS

  • Firmware

  • Software

Machine Check Architecture

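1. MCA Overview

  • Internal logic and registers detect, log, and correct errors in data or control paths.

  • MCA defines these basic facilities for logging and reporting processor and hardware errors to system software.

  • System software plays a strategic role in diagnosing and recovering from hardware errors.

Reference: AMD64 Architecture Programmer's Manual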

Machine check exception (MCE) support is indicated by:

Core::X86::Cpuid::FeatureIdEdx[MCE] or Core::X86::Cpuid::FeatureExtIdEdx[MCE]
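As a rough illustration of the feature check, the sketch below decodes the machine check bits from a raw EDX value returned by CPUID Fn0000_0001. The bit positions (MCE = bit 7, MCA = bit 14) are the standard x86 feature-flag positions; the function name and the sample value are hypothetical, and Python cannot issue CPUID itself, so the raw value is assumed to come from elsewhere.

```python
# Standard x86 feature-flag positions in CPUID Fn0000_0001 EDX.
MCE_BIT = 7   # machine check exception
MCA_BIT = 14  # machine check architecture


def decode_mc_features(edx: int) -> dict:
    """Return MCE/MCA support flags from a raw CPUID Fn0000_0001 EDX value."""
    return {
        "mce": bool((edx >> MCE_BIT) & 1),
        "mca": bool((edx >> MCA_BIT) & 1),
    }


# Hypothetical EDX value with both bits set, for illustration only.
sample_edx = (1 << MCE_BIT) | (1 << MCA_BIT)
print(decode_mc_features(sample_edx))
```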

2. MCA Extension

Support for the MCA Extension is indicated by Core::X86::Cpuid::RasCap[ScalableMca].

The Machine Check Architecture (MCA) Extensions provide:

• Increased MCA Bank Count

• MCA Extension Registers: expanded information logged

• MCA DOER/SEER Roles:

Error Management (Dynamic Operational Error Handling, or DOER) for managing running programs,

Fault Management (Symptom Elaboration of Errors, or SEER) for hardware diagnosability and reconfiguration.
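The RasCap capability bits can be decoded along these lines. This is a minimal sketch that assumes RasCap is the EBX value of CPUID Fn8000_0007 with McaOverflowRecov at bit 0, SUCCOR at bit 1, and ScalableMca at bit 3, as in AMD's public documentation; the layout should be confirmed against the Hygon manual. The function name and sample value are hypothetical.

```python
# Assumed bit positions in CPUID Fn8000_0007 EBX (RasCap).
RASCAP_BITS = {
    "McaOverflowRecov": 0,  # MCA Overflow Recovery
    "SUCCOR": 1,            # MCA Recovery
    "ScalableMca": 3,       # MCA Extension
}


def decode_rascap(ebx: int) -> dict:
    """Map each RAS capability name to True/False from a raw EBX value."""
    return {name: bool((ebx >> bit) & 1) for name, bit in RASCAP_BITS.items()}


# Hypothetical value with bits 0, 1, and 3 set.
sample_ebx = 0b1011
for name, supported in decode_rascap(sample_ebx).items():
    print(f"{name}: {'supported' if supported else 'not supported'}")
```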

3. Machine Check Global Registers

• Core::X86::Msr::MCG_CAP Reports how many machine check register banks are supported.

• Core::X86::Msr::MCG_STAT Provides basic information about processor state after the occurrence of a machine check error.

• Core::X86::Msr::MCG_CTL Used by software to enable or disable the logging and reporting of machine check errors in the error reporting banks.
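The global registers can be decoded as in the following sketch. It assumes the conventional x86 MSR addresses (MCG_CAP = 0x179, MCG_STAT = 0x17A, MCG_CTL = 0x17B), the bank count in MCG_CAP bits 7:0, and RIPV/EIPV/MCIP in MCG_STAT bits 0/1/2. The helper names and sample values are hypothetical, and reading the MSRs themselves is left to the caller.

```python
# Conventional x86 MSR addresses for the global machine check registers.
MSR_MCG_CAP = 0x179
MSR_MCG_STAT = 0x17A
MSR_MCG_CTL = 0x17B


def bank_count(mcg_cap: int) -> int:
    """MCG_CAP[Count] (bits 7:0): number of MCA banks visible to the thread."""
    return mcg_cap & 0xFF


def decode_mcg_stat(mcg_stat: int) -> dict:
    """Decode the three low status bits of MCG_STAT."""
    return {
        "RIPV": bool(mcg_stat & 0x1),  # restart IP valid
        "EIPV": bool(mcg_stat & 0x2),  # error IP valid
        "MCIP": bool(mcg_stat & 0x4),  # machine check in progress
    }


# Hypothetical raw values, for illustration only.
print(bank_count(0x117))
print(decode_mcg_stat(0x5))
```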

4. Machine Check Banks

5. Machine Check Bank Registers

5.1 The legacy MCA registers include:

• MCA_CTL Enables error reporting via machine check exception.

• MCA_STATUS Logs information associated with errors.

• MCA_ADDR Logs address information associated with errors. The use of Hygon Secure Memory Encryption may change the information logged in the address register. See 2.1.3 [Memory Encryption] for more details.

• MCA_MISC0 Logs miscellaneous information associated with errors.

5.2 The MCA Extension registers include:

• MCA_CONFIG Provides configuration capabilities for this MCA bank.

• MCA_IPID Provides information on the block associated with this MCA bank.

• MCA_SYND Logs physical location information associated with a logged error.

• MCA_DESTATUS Logs status information associated with a deferred error.

• MCA_DEADDR Logs address information associated with a deferred error.

• MCA_MISC[1:4] Provides additional threshold counters within an MCA bank.

• MCA_TRANSSYND Logs physical location information associated with a transparent error.

• MCA_TRANSADDR Logs address associated with a transparent error.

6. Legacy MCA MSRs

The legacy MCA MSR address space is MSR0000_04[7F:00]: 4 registers per bank across 32 banks (4 × 32 = 128 MSRs).

MSR0000_0000 is aliased to MCA_ADDR for MCA Bank 0, and MSR0000_0001 is aliased to MCA_STATUS for MCA Bank 0.

MCA Extension registers are not available in this legacy space; they must be accessed through the new (extended) MSR space.
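The MSR address arithmetic can be sketched as follows. The legacy part follows directly from the 4-registers-per-bank layout above. The extended part is an assumption taken from AMD's public documentation and the Linux kernel (base MSRC000_2000, 16 registers per bank, with the offsets listed below) and should be checked against the Hygon manual. All helper names are hypothetical.

```python
LEGACY_BASE = 0x0000_0400
LEGACY_BANKS = 32
LEGACY_OFFSETS = {"MCA_CTL": 0, "MCA_STATUS": 1, "MCA_ADDR": 2, "MCA_MISC0": 3}

# Assumed extended (MCAX) layout; verify against the processor manual.
MCAX_BASE = 0xC000_2000
MCAX_REGS_PER_BANK = 16
MCAX_OFFSETS = {
    "MCA_CTL": 0x0, "MCA_STATUS": 0x1, "MCA_ADDR": 0x2, "MCA_MISC0": 0x3,
    "MCA_CONFIG": 0x4, "MCA_IPID": 0x5, "MCA_SYND": 0x6,
    "MCA_DESTATUS": 0x8, "MCA_DEADDR": 0x9,
}


def legacy_msr(bank: int, reg: str) -> int:
    """Legacy MSR address: 4 consecutive registers per bank from 0x400."""
    if not 0 <= bank < LEGACY_BANKS:
        raise ValueError("legacy space covers banks 0..31 only")
    return LEGACY_BASE + 4 * bank + LEGACY_OFFSETS[reg]


def mcax_msr(bank: int, reg: str) -> int:
    """Extended MSR address: 16 register slots per bank (assumed layout)."""
    if bank < 0:
        raise ValueError("bank must be non-negative")
    return MCAX_BASE + MCAX_REGS_PER_BANK * bank + MCAX_OFFSETS[reg]


print(hex(legacy_msr(0, "MCA_STATUS")))   # first bank, status register
print(hex(legacy_msr(31, "MCA_MISC0")))   # last register of the legacy space
print(hex(mcax_msr(1, "MCA_IPID")))
```

Note that the legacy helper rejects extension registers such as MCA_IPID with a KeyError, which mirrors the fact that they have no legacy address.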

7. Determining Bank Type

  • MCA_CONFIG[McaX]=1 indicates that MCA_IPID is available.

  • MCA_IPID[HardwareID] identifies the hardware block. For example, LS/IF/L2/DE/EX/FP/L3 belong to the core block.

  • MCA_IPID[McaType] is an identifier for the type of MCA bank.

An MCA bank type can be identified by the value of {MCA_IPID[HardwareID], MCA_IPID[McaType]}.

An MCA_IPID[HardwareID] value of 0 indicates an unpopulated MCA bank that is ensured to be RAZ/WRIG.

MCA_IPID[InstanceId] provides a unique instance number to allow software to differentiate blocks with multiple identical instances within a processor.
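A bank-type lookup along these lines might look like the sketch below. The field positions are assumptions based on AMD's public documentation and the Linux kernel (MCA_CONFIG[McaX] at bit 0; MCA_IPID[InstanceId] in bits 31:0, HardwareID in bits 43:32, McaType in bits 63:48). The function name and sample values are hypothetical.

```python
from typing import Optional


def decode_bank_type(mca_config: int, mca_ipid: int) -> Optional[dict]:
    """Return the identifying fields of a bank, or None if they are unavailable.

    None is returned when MCA_CONFIG[McaX] is clear (MCA_IPID not available)
    or when HardwareID is 0 (unpopulated bank).
    """
    if not mca_config & 0x1:            # assumed: McaX is bit 0
        return None
    hardware_id = (mca_ipid >> 32) & 0xFFF
    if hardware_id == 0:
        return None
    return {
        "HardwareID": hardware_id,
        "McaType": (mca_ipid >> 48) & 0xFFFF,
        "InstanceId": mca_ipid & 0xFFFF_FFFF,
    }


# Hypothetical register values, for illustration only.
sample_ipid = (0x0001 << 48) | (0x0B0 << 32) | 0x2
info = decode_bank_type(0x1, sample_ipid)
print(info)
# The pair (HardwareID, McaType) identifies the bank type.
print((info["HardwareID"], info["McaType"]))
```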

8. Machine Check Errors

The classes of machine check errors are, in priority order from highest to lowest:

• Uncorrected

• Deferred

• Corrected

• Transparent *

* When enabled for logging in the MCA, transparent errors are treated identically to corrected errors.
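The priority ordering can be illustrated by classifying raw MCA_STATUS values and picking the most severe class. This is a sketch under assumed bit positions from AMD's public documentation (Val = bit 63, UC = bit 61, Deferred = bit 44); transparent errors are folded into "corrected", as noted above. The names `classify` and `most_severe` are hypothetical.

```python
from typing import Iterable, Optional

# Assumed MCA_STATUS bit positions.
VAL = 1 << 63
UC = 1 << 61
DEFERRED = 1 << 44

# Highest priority first.
PRIORITY = ["uncorrected", "deferred", "corrected"]


def classify(status: int) -> Optional[str]:
    """Return the error class of one MCA_STATUS value, or None if not valid."""
    if not status & VAL:
        return None
    if status & UC:
        return "uncorrected"
    if status & DEFERRED:
        return "deferred"
    return "corrected"


def most_severe(statuses: Iterable[int]) -> Optional[str]:
    """Return the highest-priority class among the valid entries."""
    classes = [c for c in map(classify, statuses) if c is not None]
    return min(classes, key=PRIORITY.index) if classes else None


# Hypothetical bank contents: one empty, one corrected, one deferred.
print(most_severe([0, VAL, VAL | DEFERRED]))
```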

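8.1 MCA Overflow Recovery

Support is indicated by Core::X86::Cpuid::RasCap[McaOverflowRecov]=1.

MCA Overflow Recovery is a feature that allows the system to recover when the overflow bit (MCA_STATUS[OF]) is set.

• If supported, an overflow by itself is not system-fatal; only MCA_STATUS[PCC]=1 indicates a system-fatal error.

• If not supported, MCA_STATUS[OF]=1 must also be treated as system-fatal.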

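8.2 MCA Recovery

Support is indicated by Core::X86::Cpuid::RasCap[SUCCOR]=1.

MCA Recovery is a feature that allows the system to recover when the hardware cannot correct an error.

If supported, an error logged with MCA_STATUS[UC]=1 and MCA_STATUS[PCC]=0 can be handled by killing the affected process instead of bringing down the whole system.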

8.3 Machine Check Error Handling

• Data collection:

  • Read Core::X86::Msr::MCG_CAP[Count] to determine the number of status registers visible to the thread.

  • Examine all status registers in all error reporting banks to identify the cause of the machine check exception.

  • Check the valid bit in each status register (MCA_STATUS[Val]). Examine the remainder of the status register only when its valid bit is set.

  • When identifying the error condition and determining how to handle the error, portable exception handlers should examine only DOER fields in machine check registers.

  • Error handlers should collect all available MCA information, but should only interrogate details to the level that affects their actions. Lower-level details may be useful for diagnosis and root cause analysis, but not for error handling.

  • Error handlers should save the values in MCA_ADDR, MCA_MISC0, and MCA_SYND even if MCA_STATUS[AddrV], MCA_STATUS[MiscV], and MCA_STATUS[SyndV] are zero.

• DOER Error Management:

  • Check MCA_STATUS[PCC].

    • If PCC is set, error recovery is not possible. The handler should log the error information and terminate the system. If PCC is clear, the handler may continue with the following recovery steps.

  • Check MCA_STATUS[UC].

    • If UC is set, the processor did not correct the error. Continue with the following recovery steps.

      • If MCA Overflow Recovery is not supported and MCA_STATUS[OF]=1, error recovery is not possible; follow the steps for PCC=1. See 3.1.10 [MCA Overflow Recovery].

      • If MCA Recovery is not supported, error recovery is not possible; follow the steps for PCC=1. See 3.1.11 [MCA Recovery].

      • If MCA Recovery is supported:

        • Check MCA_STATUS[TCC].

          • If TCC is set, the context of the process thread executing on the interrupted logical core may be corrupt and the thread cannot be recovered. The rest of the system is unaffected; it is possible to terminate only the affected process thread.

          • If TCC is clear, the context of the process thread executing on the interrupted logical core is not corrupt. Recovery of the process thread may be possible, but only if the uncorrected error condition is first corrected by software; otherwise, the interrupted process thread must be terminated.

        • Legacy exception handlers can check Core::X86::Msr::MCG_STAT[RIPV] and Core::X86::Msr::MCG_STAT[EIPV] in place of MCA_STATUS[TCC]. If RIPV=EIPV=1, the interrupted program can be restarted reliably. Otherwise, the program cannot be restarted reliably.

    • If UC is clear, the processor either corrected or deferred the error and no software action is needed. The handler can log the error information and continue process execution.

• Exit:

  • When an exception handler is able to successfully log an error condition, clear the MCA_STATUS registers prior to exiting the machine check handler.

  • Prior to exiting the machine check handler, clear Core::X86::Msr::MCG_STAT[MCIP]. MCIP indicates that a machine check exception is in progress. If this bit is set when another machine check exception occurs, the processor enters the shutdown state.
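The DOER decision steps in this section can be condensed into a small decision function. The sketch below is not an actual handler: it only maps one raw MCA_STATUS value plus the two RasCap capabilities to an action. The bit positions (Val = 63, OF = 62, UC = 61, PCC = 57, TCC = 55) are assumptions from AMD's public documentation, and the `Action` names are hypothetical.

```python
from enum import Enum
from typing import Optional

# Assumed MCA_STATUS bit positions.
VAL = 1 << 63
OF = 1 << 62
UC = 1 << 61
PCC = 1 << 57
TCC = 1 << 55


class Action(Enum):
    LOG_AND_CONTINUE = "log and continue"
    TERMINATE_THREAD = "terminate affected thread"
    RECOVER_OR_TERMINATE_THREAD = "software recovery, else terminate thread"
    SYSTEM_FATAL = "log and terminate system"


def decide(status: int, overflow_recov: bool, succor: bool) -> Optional[Action]:
    """Apply the DOER steps to one bank's MCA_STATUS value.

    overflow_recov and succor mirror RasCap[McaOverflowRecov] and
    RasCap[SUCCOR]. Returns None when the valid bit is clear.
    """
    if not status & VAL:
        return None
    if status & PCC:
        return Action.SYSTEM_FATAL
    if not status & UC:
        # Corrected or deferred: no software action needed.
        return Action.LOG_AND_CONTINUE
    if not overflow_recov and status & OF:
        return Action.SYSTEM_FATAL
    if not succor:
        return Action.SYSTEM_FATAL
    if status & TCC:
        return Action.TERMINATE_THREAD
    return Action.RECOVER_OR_TERMINATE_THREAD


# Hypothetical status values for three banks.
for raw in (VAL, VAL | UC | TCC, VAL | UC | PCC):
    print(hex(raw), decide(raw, overflow_recov=True, succor=True))
```

A real handler would also save MCA_ADDR, MCA_MISC0, and MCA_SYND for each valid bank, then clear MCA_STATUS and MCG_STAT[MCIP] before returning, as described in the steps above.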
