Reliability, Availability, and Serviceability (RAS) Features


  • Processor

  • Hardware

  • BIOS

  • Firmware

  • Software

Machine Check Architecture

1、MCA Overview

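  • Internal logic and registers detect, log, and correct errors in the data or control paths.

  • MCA defines the basic facilities by which processor and hardware errors are logged and reported to system software.

  • System software plays a strategic role in diagnosing and recovering from hardware errors.

Reference: AMD64 Architecture Programmer's Manual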

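MCE generation:

The ability of hardware to generate a machine check exception (MCE) on an error condition is indicated by Core::X86::Cpuid::FeatureIdEdx[MCE] or Core::X86::Cpuid::FeatureExtIdEdx[MCE].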
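
As a rough illustration of the feature check, the sketch below tests the MCE and MCA bits in a CPUID EDX value. The bit positions (7 for MCE, 14 for MCA in CPUID Fn0000_0001 EDX) follow the standard x86 layout and are not stated in these notes; the function names and the sample value are hypothetical, and obtaining the real EDX value from hardware is left out.

```python
# Assumed bit positions in CPUID Fn0000_0001 EDX (standard x86 layout, not from these notes).
MCE_BIT = 7    # machine check exception supported
MCA_BIT = 14   # machine check architecture supported


def has_mce(edx: int) -> bool:
    """Return True if the MCE feature bit is set in a CPUID EDX value."""
    return bool((edx >> MCE_BIT) & 1)


def has_mca(edx: int) -> bool:
    """Return True if the MCA feature bit is set in a CPUID EDX value."""
    return bool((edx >> MCA_BIT) & 1)


# Made-up EDX value with both bits set, for illustration only.
sample_edx = (1 << MCE_BIT) | (1 << MCA_BIT)
print(has_mce(sample_edx), has_mca(sample_edx))
```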

2、MCA Extension

Support for the MCA Extensions is indicated by:

Core::X86::Cpuid::RasCap[ScalableMca]
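
A minimal sketch of decoding the RAS capability word follows. It assumes RasCap is the EBX output of CPUID Fn8000_0007 with McaOverflowRecov in bit 0, SUCCOR (MCA Recovery) in bit 1, and ScalableMca in bit 3, as in the publicly documented AMD64 layout; these positions are not given in the notes and should be checked against the processor manual. `decode_ras_cap` is a hypothetical helper name.

```python
# Assumed bit positions in CPUID Fn8000_0007 EBX (RasCap); verify against the manual.
RAS_CAP_BITS = {
    "McaOverflowRecov": 0,  # MCA Overflow Recovery supported
    "SUCCOR": 1,            # MCA Recovery supported
    "ScalableMca": 3,       # MCA Extensions supported
}


def decode_ras_cap(ebx: int) -> dict:
    """Map each known RasCap field name to True/False."""
    return {name: bool((ebx >> bit) & 1) for name, bit in RAS_CAP_BITS.items()}


# Made-up value: overflow recovery and ScalableMca set, SUCCOR clear.
print(decode_ras_cap(0b1001))
```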

The Machine Check Architecture (MCA) Extensions provide:

• Increased MCA bank count

• MCA Extension registers

• Expanded information logged


• Error Management (Dynamic Operational Error Handling, or DOER): for managing running programs.

• Fault Management (Symptom Elaboration of Errors, or SEER): for hardware diagnosability and reconfiguration.

3、Machine Check Global Registers

• Core::X86::Msr::MCG_CAP: Reports how many machine check register banks are supported.

• Core::X86::Msr::MCG_STAT: Provides basic information about processor state after the occurrence of a machine check error.

• Core::X86::Msr::MCG_CTL: Used by software to enable or disable the logging and reporting of machine check errors in the error-reporting banks.
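
The global registers can be decoded as sketched below. The MSR addresses (0x179, 0x17A, 0x17B) and field positions (Count in MCG_CAP bits 7:0; RIPV, EIPV, MCIP in MCG_STAT bits 0, 1, 2) follow the standard x86 layout and are assumptions as far as these notes are concerned; the helper names are hypothetical.

```python
# Assumed MSR addresses (standard x86 layout, not stated in these notes).
MSR_MCG_CAP = 0x179
MSR_MCG_STAT = 0x17A
MSR_MCG_CTL = 0x17B


def bank_count(mcg_cap: int) -> int:
    """MCG_CAP[Count]: number of machine check banks (assumed bits 7:0)."""
    return mcg_cap & 0xFF


def decode_mcg_stat(mcg_stat: int) -> dict:
    """Decode the three MCG_STAT flags used by a machine check handler."""
    return {
        "RIPV": bool(mcg_stat & (1 << 0)),  # restart IP valid
        "EIPV": bool(mcg_stat & (1 << 1)),  # error IP valid
        "MCIP": bool(mcg_stat & (1 << 2)),  # machine check in progress
    }


# Made-up register values for illustration.
print(bank_count(0x117), decode_mcg_stat(0b101))
```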

4、Machine Check Banks

5、Machine Check Bank Registers

5.1 The legacy MCA registers include:

• MCA_CTL: Enables error reporting via machine check exception.

• MCA_STATUS: Logs information associated with errors.

• MCA_ADDR: Logs address information associated with errors. The use of Hygon Secure Memory Encryption may change the information logged in the address register. See 2.1.3 [Memory Encryption] for more details.

• MCA_MISC0: Logs miscellaneous information associated with errors.

5.2 The MCA Extension registers include:

• MCA_CONFIG: Provides configuration capabilities for this MCA bank.

• MCA_IPID: Provides information on the block associated with this MCA bank.

• MCA_SYND: Logs physical location information associated with a logged error.

• MCA_DESTATUS: Logs status information associated with a deferred error.

• MCA_DEADDR: Logs address information associated with a deferred error.

• MCA_MISC[1:4]: Provides additional threshold counters within an MCA bank.

• MCA_TRANSSYND: Logs physical location information associated with a transparent error.

• MCA_TRANSADDR: Logs address information associated with a transparent error.

6、Legacy MCA MSRs

The legacy MCA MSR address space is MSR0000_04[7F:00]: 32 banks of 4 registers each.

MSR0000_0000 is aliased to MCA_ADDR for MCA Bank 0, and MSR0000_0001 is aliased to MCA_STATUS for MCA Bank 0.

The MCA Extension registers are not available in this legacy space. How is the new MSR space used?
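
As a partial answer to the question above, the sketch below computes MSR addresses in both spaces. The legacy layout (base 0x400, 4 registers per bank, in the order CTL, STATUS, ADDR, MISC0) matches the range given above. The extension layout (base 0xC000_2000, 16 register slots per bank, and the per-register offsets) is taken from the publicly documented AMD scalable MCA layout and is an assumption for this processor; offsets for MCA_MISC[1:4], MCA_TRANSSYND, and MCA_TRANSADDR are deliberately left out. Function names are hypothetical.

```python
LEGACY_BASE = 0x0000_0400
LEGACY_REGS = {"CTL": 0, "STATUS": 1, "ADDR": 2, "MISC0": 3}

# Assumed extension space layout; verify against the processor manual.
MCAX_BASE = 0xC000_2000
MCAX_REGS = {
    "CTL": 0x0, "STATUS": 0x1, "ADDR": 0x2, "MISC0": 0x3,
    "CONFIG": 0x4, "IPID": 0x5, "SYND": 0x6,
    "DESTATUS": 0x8, "DEADDR": 0x9,
}


def legacy_msr(bank: int, reg: str) -> int:
    """Address of a legacy MCA register; only banks 0..31 fit in MSR0000_04[7F:00]."""
    if not 0 <= bank < 32:
        raise ValueError("legacy space holds banks 0..31 only")
    return LEGACY_BASE + 4 * bank + LEGACY_REGS[reg]


def mcax_msr(bank: int, reg: str) -> int:
    """Address of a register in the (assumed) MCA Extension space."""
    if bank < 0:
        raise ValueError("bank must be non-negative")
    return MCAX_BASE + 0x10 * bank + MCAX_REGS[reg]


print(hex(legacy_msr(31, "MISC0")), hex(mcax_msr(1, "IPID")))
```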

7、Determining Bank Type


  • MCA_IPID[HardwareID]: identifies the hardware block; for example, LS/IF/L2/DE/EX/FP/L3 all belong to the core block

  • MCA_IPID[McaType]: an identifier for the type of MCA bank

An MCA bank type can be identified by the value of {MCA_IPID[HardwareID], MCA_IPID[McaType]}.

An MCA_IPID[HardwareID] value of 0 indicates an unpopulated MCA bank that is ensured to be RAZ/WRIG.

MCA_IPID[InstanceId] provides a unique instance number to allow software to differentiate blocks with multiple identical instances within a processor.
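
A minimal sketch of bank type identification follows. The field positions (InstanceId in bits 31:0, HardwareID in bits 43:32, McaType in bits 63:48) are taken from the publicly documented AMD MCA_IPID layout and are an assumption here; `decode_ipid` and `bank_type` are hypothetical names, and the sample value is made up.

```python
from typing import Optional, Tuple


def decode_ipid(ipid: int) -> dict:
    """Split an MCA_IPID value into its fields (assumed bit positions)."""
    return {
        "InstanceId": ipid & 0xFFFF_FFFF,
        "HardwareID": (ipid >> 32) & 0xFFF,
        "McaType": (ipid >> 48) & 0xFFFF,
    }


def bank_type(ipid: int) -> Optional[Tuple[int, int]]:
    """Return (HardwareID, McaType), or None for an unpopulated bank."""
    fields = decode_ipid(ipid)
    if fields["HardwareID"] == 0:
        return None  # HardwareID of 0 marks an unpopulated bank
    return (fields["HardwareID"], fields["McaType"])


# Made-up IPID value: McaType 1, HardwareID 0xB0, InstanceId 5.
sample = (0x0001 << 48) | (0x0B0 << 32) | 0x5
print(decode_ipid(sample), bank_type(sample))
```

The tuple returned by `bank_type` can serve as a lookup key into a table of known bank types, while InstanceId distinguishes identical blocks.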

8、Machine Check Errors

The classes of machine check errors are, in priority order from highest to lowest:

• Uncorrected

• Deferred

• Corrected

• Transparent *

*When enabled for logging in the MCA, transparent errors are treated identically to corrected errors.
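
The priority order can be modeled with an ordered enumeration, as in the sketch below. The MCA_STATUS bit positions (Val 63, UC 61, Deferred 44) come from the publicly documented AMD layout and are assumptions here. Transparent errors have no class of their own in this sketch because, when logged, they are treated as corrected. All names are hypothetical.

```python
from enum import IntEnum
from typing import Iterable, Optional

# Assumed MCA_STATUS bit positions; verify against the processor manual.
VAL_BIT = 1 << 63
UC_BIT = 1 << 61
DEFERRED_BIT = 1 << 44


class ErrorClass(IntEnum):
    """Higher value means higher priority."""
    CORRECTED = 1
    DEFERRED = 2
    UNCORRECTED = 3


def classify(status: int) -> Optional[ErrorClass]:
    """Classify one MCA_STATUS value; None if no valid error is logged."""
    if not status & VAL_BIT:
        return None
    if status & UC_BIT:
        return ErrorClass.UNCORRECTED
    if status & DEFERRED_BIT:
        return ErrorClass.DEFERRED
    return ErrorClass.CORRECTED


def most_severe(statuses: Iterable[int]) -> Optional[ErrorClass]:
    """Highest-priority class among several status values, or None."""
    classes = [c for c in map(classify, statuses) if c is not None]
    return max(classes, default=None)
```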

8.1 MCA Overflow Recovery


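MCA Overflow Recovery is a feature allowing recovery of the system when the overflow bit is set.

If it is supported, software can rely on MCA_STATUS[PCC]=1 alone to indicate a system-fatal error.

If it is not supported, an uncorrected error logged with MCA_STATUS[OF]=1 must also be treated as system-fatal.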

8.2 MCA Recovery


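MCA Recovery is a feature allowing recovery of the system when the hardware cannot correct an error.

If it is supported, an error logged with MCA_STATUS[UC]=1 and MCA_STATUS[PCC]=0 can be handled by killing only the affected process instead of terminating the system.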

8.3 Machine Check Error Handling

• Data collection:

  • Read Core::X86::Msr::MCG_CAP[Count] to determine the number of status registers visible to the thread.

  • All status registers in all error reporting banks must be examined to identify the cause of the machine check exception.

  • Check the valid bit in each status register (MCA_STATUS[Val]). The remainder of the status register should be examined only when its valid bit is set.

  • When identifying the error condition and determining how to handle the error, portable exception handlers should examine only DOER fields in machine check registers.

  • Error handlers should collect all available MCA information, but should only interrogate details to the level which affects their actions. Lower level details may be useful for diagnosis and root cause analysis, but not for error handling.

  • Error handlers should save the values in MCA_ADDR, MCA_MISC0, and MCA_SYND even if MCA_STATUS[AddrV], MCA_STATUS[MiscV], and MCA_STATUS[SyndV] are zero.
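
The data collection steps can be sketched as a scan over all banks. In this sketch `read_msr` is a hypothetical callable supplied by the caller (the notes do not define an MSR access API), and the register addresses assume the publicly documented extension layout (base 0xC000_2000, 16 slots per bank, STATUS/ADDR/MISC0/SYND at offsets 1/2/3/6), which should be verified for this processor.

```python
from typing import Callable, Dict, List

MSR_MCG_CAP = 0x179          # assumed address of MCG_CAP
MCAX_BASE = 0xC000_2000      # assumed base of the extension register space
VAL_BIT = 1 << 63            # assumed position of MCA_STATUS[Val]


def collect_banks(read_msr: Callable[[int], int]) -> List[Dict[str, int]]:
    """Return one record per bank whose MCA_STATUS[Val] bit is set."""
    count = read_msr(MSR_MCG_CAP) & 0xFF   # MCG_CAP[Count]
    records = []
    for bank in range(count):
        base = MCAX_BASE + 0x10 * bank
        status = read_msr(base + 0x1)
        if not status & VAL_BIT:
            continue  # nothing logged; do not interpret other fields
        # Save ADDR, MISC0 and SYND regardless of AddrV/MiscV/SyndV.
        records.append({
            "bank": bank,
            "status": status,
            "addr": read_msr(base + 0x2),
            "misc0": read_msr(base + 0x3),
            "synd": read_msr(base + 0x6),
        })
    return records


# Simulated MSR contents: two banks, only bank 1 holds a valid error.
fake_msrs = {0x179: 2, 0xC0002011: VAL_BIT | (1 << 61),
             0xC0002012: 0x1000, 0xC0002016: 0xABC}
print(collect_banks(lambda msr: fake_msrs.get(msr, 0)))
```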

• DOER Error Management:

  • If PCC is set, error recovery is not possible. The handler should log the error information and terminate the system. If PCC is clear, the handler may continue with the following recovery steps.

  • If UC is set, the processor did not correct the error. Continue with the following recovery steps.

    • If MCA Overflow Recovery is not supported, and MCA_STATUS[OF]=1, error recovery is not possible; follow the steps for PCC=1. See 3.1.10 [MCA Overflow Recovery].

    • If MCA Recovery is not supported, error recovery is not possible; follow the steps for PCC=1. See 3.1.11 [MCA Recovery].

    • If MCA Recovery is supported:

      • If TCC is set, the context of the process thread executing on the interrupted logical core may be corrupt and the thread cannot be recovered. The rest of the system is unaffected; it is possible to terminate only the affected process thread.

      • If TCC is clear, the context of the process thread executing on the interrupted logical core is not corrupt. Recovery of the process thread may be possible, but only if the uncorrected error condition is first corrected by software; otherwise, the interrupted process thread must be terminated.

      • Legacy exception handlers can check Core::X86::Msr::MCG_STAT[RIPV] and Core::X86::Msr::MCG_STAT[EIPV] in place of MCA_STATUS[TCC]. If RIPV=EIPV=1, the interrupted program can be restarted reliably. Otherwise, the program cannot be restarted reliably.

  • If UC is clear, the processor either corrected or deferred the error and no software action is needed. The handler can log the error information and continue process execution.
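
The DOER decision steps above can be condensed into a single function, sketched here under stated assumptions: the MCA_STATUS bit positions (OF 62, UC 61, PCC 57, TCC 55) come from the publicly documented AMD layout, the status value is assumed to have Val set already, and the `Action` names and the `repaired_by_software` flag are hypothetical stand-ins for whatever the operating system actually does.

```python
from enum import Enum

# Assumed MCA_STATUS bit positions; verify against the processor manual.
OF_BIT = 1 << 62
UC_BIT = 1 << 61
PCC_BIT = 1 << 57
TCC_BIT = 1 << 55


class Action(Enum):
    LOG_AND_CONTINUE = "log and continue"
    RESUME_THREAD = "resume interrupted thread"
    KILL_THREAD = "terminate affected thread"
    TERMINATE_SYSTEM = "log and terminate system"


def doer_action(status: int, overflow_recovery: bool, mca_recovery: bool,
                repaired_by_software: bool = False) -> Action:
    """Pick a recovery action for one valid MCA_STATUS value."""
    if status & PCC_BIT:
        return Action.TERMINATE_SYSTEM
    if not status & UC_BIT:
        return Action.LOG_AND_CONTINUE        # corrected or deferred
    if not overflow_recovery and status & OF_BIT:
        return Action.TERMINATE_SYSTEM        # same handling as PCC=1
    if not mca_recovery:
        return Action.TERMINATE_SYSTEM        # same handling as PCC=1
    if status & TCC_BIT:
        return Action.KILL_THREAD             # thread context may be corrupt
    # TCC clear: resume only if software first corrected the error condition.
    return Action.RESUME_THREAD if repaired_by_software else Action.KILL_THREAD
```

For example, `doer_action(UC_BIT | TCC_BIT, True, True)` selects `Action.KILL_THREAD`, while the same status without MCA Recovery support selects `Action.TERMINATE_SYSTEM`.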


• When an exception handler is able to successfully log an error condition, clear the MCA_STATUS registers prior to exiting the machine check handler.

• Prior to exiting the machine check handler, clear Core::X86::Msr::MCG_STAT[MCIP]. MCIP indicates that a machine check exception is in progress. If this bit is set when another machine check exception occurs, the processor enters the shutdown state.
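
The two exit steps can be sketched as follows. `read_msr` and `write_msr` are hypothetical callables standing in for real MSR access, and the MCG_STAT address (0x17A) and MCIP position (bit 2) follow the standard x86 layout, which these notes do not state.

```python
from typing import Callable, Iterable

MSR_MCG_STAT = 0x17A   # assumed address of MCG_STAT
MCIP_BIT = 1 << 2      # assumed position of MCG_STAT[MCIP]


def finish_handler(read_msr: Callable[[int], int],
                   write_msr: Callable[[int, int], None],
                   logged_status_msrs: Iterable[int]) -> None:
    """Clear logged MCA_STATUS registers, then clear MCG_STAT[MCIP]."""
    for msr in logged_status_msrs:
        write_msr(msr, 0)  # error already logged, so release the bank
    # Clear MCIP last and leave the other MCG_STAT bits unchanged.
    write_msr(MSR_MCG_STAT, read_msr(MSR_MCG_STAT) & ~MCIP_BIT)


# Simulated MSRs: MCG_STAT has RIPV and MCIP set, one status register is valid.
msrs = {0x17A: 0b101, 0x401: 1 << 63}
finish_handler(msrs.get, msrs.__setitem__, [0x401])
print(msrs)
```

Clearing MCIP as the final step keeps the window in which a second machine check would cause shutdown as short as the handler allows.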


