WO2020239060A1 - Error recovery method and apparatus - Google Patents

Error recovery method and apparatus

Info

Publication number
WO2020239060A1
Authority
WO
WIPO (PCT)
Prior art keywords
cpu
error
software
context
visible
Prior art date
Application number
PCT/CN2020/093188
Other languages
English (en)
French (fr)
Inventor
耿东久
李硕
梁永祥
林强敏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to FIEP20785894.5T priority Critical patent/FI3770765T3/fi
Priority to DK20785894.5T priority patent/DK3770765T3/da
Priority to JP2021570888A priority patent/JP7351933B2/ja
Priority to CA3142308A priority patent/CA3142308A1/en
Priority to AU2020285262A priority patent/AU2020285262B2/en
Priority to KR1020217042599A priority patent/KR20220010040A/ko
Priority to EP20785894.5A priority patent/EP3770765B1/en
Priority to US17/038,428 priority patent/US11068360B2/en
Publication of WO2020239060A1 publication Critical patent/WO2020239060A1/zh
Priority to US17/376,442 priority patent/US11604711B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/1641 Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0772 Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1658 Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1658 Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
    • G06F11/1662 Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit the resynchronized component or unit being a persistent storage device
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1675 Temporal synchronisation or re-synchronisation of redundant processing components
    • G06F11/1687 Temporal synchronisation or re-synchronisation of redundant processing components at event level, e.g. by interrupt or result of polling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/82 Solving problems relating to consistency

Definitions

  • This application relates to the computer field, and more specifically, to methods and devices for error recovery in the computer field.
  • A lockstep system is a fault-tolerant computer system that uses a lockstep mechanism and achieves safety redundancy by running the same set of operations in parallel.
  • two independent central processing units (CPUs)
  • Each CPU has its own error checking function, such as error correction code (ECC) or parity checking, and the outputs of the two CPUs are compared by a comparator.
  • error correction code (ECC)
  • Lockstep is then disabled and released, so that the CPU with the check error exits and the CPU whose check passed continues to work normally.
  • If the comparison result differs in only one bit and only one CPU has a check error, it will be restored to the previous state.
  • the comparison result differs in only one bit and only one CPU has a check error
  • the two CPUs will be restored to the most recently saved running state and run again. If there are multiple errors and they cannot be repaired, lockstep mode is exited and the business is stopped. The error recovery capability of existing lockstep systems is therefore weak, which makes it difficult for system reliability to meet the requirements of safety services.
  • This application provides an error recovery method and device, which can improve the error recovery capability of a lockstep system and increase system reliability.
  • an error recovery method including: when an error occurs in a first CPU of at least two central processing units (CPUs) in lockstep mode, the at least two CPUs receive an interrupt; in response to the interrupt, the at least two CPUs exit the lockstep mode; the type of the error in the first CPU is determined; and, when the type of the error is a recoverable error, error recovery is performed on the first CPU according to the state, at the time of the interrupt, of a second CPU that is running correctly among the at least two CPUs. The solution of the embodiments of the present application is therefore based on determining the error type of the lockstep CPU.
  • the CPU in which the error occurred can be recovered according to the state of the CPU that is running correctly, so that the at least two CPUs resume running from the point at which the business program was interrupted. The embodiments of the present application can therefore improve the error recovery capability of the lockstep system and increase system reliability.
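  • The overall flow described above (receive the interrupt, exit lockstep mode, determine the error type, then either recover from the correctly running CPU or stop) can be sketched as a small software model. This is only an illustrative sketch: the names `Cpu`, `ErrorType`, and `handle_lockstep_interrupt` are hypothetical, and the model treats a CPU context as a plain dictionary rather than real registers.

```python
from dataclasses import dataclass, field
from enum import Enum


class ErrorType(Enum):
    NONE = "none"
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


@dataclass
class Cpu:
    name: str
    context: dict = field(default_factory=dict)  # software-visible registers (modelled as a dict)
    error: ErrorType = ErrorType.NONE
    running: bool = True


def handle_lockstep_interrupt(cpus):
    """Model the handling of one lockstep interrupt; returns 'lockstep' or 'stopped'."""
    # After the interrupt, the CPUs have left lockstep mode and are visible individually.
    faulty = [c for c in cpus if c.error is not ErrorType.NONE]
    healthy = [c for c in cpus if c.error is ErrorType.NONE]

    # Unrecoverable error, or no correctly running CPU to copy from: stop all CPUs.
    if not healthy or any(c.error is ErrorType.UNRECOVERABLE for c in faulty):
        for c in cpus:
            c.running = False
        return "stopped"

    # Recoverable error: copy the state of a correctly running CPU into each faulty CPU.
    reference = healthy[0]
    for c in faulty:
        c.context = dict(reference.context)
        c.error = ErrorType.NONE
    return "lockstep"
```

  • For example, with one correct CPU holding `{"pc": 4096, "x0": 7}` and one CPU marked `RECOVERABLE`, the function copies the correct context into the faulty CPU and reports that lockstep mode can be re-entered.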
  • the state of the second CPU at the time of the interrupt includes the software-visible CPU context of the second CPU at the time of the interrupt, where the CPU context includes the values of the system registers and the values of the general-purpose registers; performing error recovery on the first CPU according to the state, at the time of the interrupt, of the second CPU that is running correctly among the at least two CPUs includes: obtaining from memory the software-visible CPU context of the second CPU at the time of the interrupt, and updating the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU.
  • the software-visible CPU context of the second CPU at the time of the interrupt and the data in its cache are saved to the memory. The software-visible CPU context of the first CPU at the time of the interrupt and the data in its cache are also saved to the memory.
  • the number of CPUs visible to the software changes from one to multiple.
  • the data in the CPU L1/L2 cache is flushed to the external memory to ensure that the data will not be lost when the lockstep mode is re-entered.
  • the at least two CPUs each jump to the entry of the exception vector table to synchronize CPU errors, which ensures that any asynchronous error present in the system at that moment is reported immediately, in preparation for the subsequent query of the error type.
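  • The memory-based variant (each CPU saves its software-visible context and flushes its cache to memory on the interrupt, and the faulty CPU is later updated from the saved context of the correct CPU) might look roughly as follows. This is a sketch under assumptions: memory is modelled as a dictionary with hypothetical `"data"` and `"contexts"` regions, and `save_on_interrupt` and `recover_from_memory` are illustrative names, not functions from the application.

```python
def save_on_interrupt(memory, cpu_id, registers, cache):
    """Flush the CPU's cached data to memory, then store its software-visible context."""
    memory["data"].update(cache)  # write cached data back so it is not lost
    cache.clear()
    memory["contexts"][cpu_id] = dict(registers)  # snapshot of system and general registers


def recover_from_memory(memory, good_cpu_id, faulty_registers):
    """Overwrite the faulty CPU's registers with the saved context of the correct CPU."""
    faulty_registers.clear()
    faulty_registers.update(memory["contexts"][good_cpu_id])


# Usage: both CPUs save their state, then cpu1 (faulty) is repaired from cpu0 (correct).
memory = {"data": {}, "contexts": {}}
good_regs = {"x0": 1, "sp": 0x8000}
bad_regs = {"x0": 5, "sp": 0x8000}
save_on_interrupt(memory, "cpu0", good_regs, {0x100: 42})
save_on_interrupt(memory, "cpu1", bad_regs, {0x100: 42})
recover_from_memory(memory, "cpu0", bad_regs)
```

  • Saving both contexts before any repair means the snapshot of the correct CPU is taken at the time of the interrupt, which is the state the faulty CPU is updated from.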
  • performing error recovery on the first CPU according to the state, at the time of the interrupt, of the second CPU that is running correctly among the at least two CPUs includes: the first CPU obtains the software-visible CPU context of the second CPU at the time of the interrupt through a hardware channel between the first CPU and the second CPU, and updates the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU.
  • the hardware channel-based method can be used to repair all levels of registers.
  • after the software-visible CPU context of the first CPU is updated, the first CPU and the second CPU reset their respective non-software-visible microarchitecture state and retain their respective software-visible CPU contexts, so that the first CPU and the second CPU enter lockstep mode again.
  • the CPU in which the error occurred resets all hardware state that is not visible to software, clears the data in the CPU cache, and retains the software-visible state in the system registers and general-purpose registers. Before the reset, therefore, the software-visible states of the at least two CPUs are exactly the same. After the reset, the software-visible states of the at least two CPUs are still the same, and both CPUs fetch data and instructions from the external memory and receive the same input instruction stream.
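  • The reset step can be illustrated with a toy model in which each CPU has a software-visible part (system and general-purpose registers) and a non-software-visible part (cache contents and other microarchitecture state). The class `CpuState`, its field names, and the register names are hypothetical; the sketch only shows that clearing the hidden state while keeping the registers leaves both CPUs in an identical starting condition.

```python
from dataclasses import dataclass, field


@dataclass
class CpuState:
    system_regs: dict
    general_regs: dict
    cache: dict = field(default_factory=dict)      # not software-visible
    microarch: dict = field(default_factory=dict)  # e.g. predictor state (hypothetical)

    def software_visible(self):
        return (self.system_regs, self.general_regs)


def reset_non_software_visible(cpu):
    """Clear hidden hardware state but retain the software-visible registers."""
    cpu.cache.clear()
    cpu.microarch.clear()


def ready_for_lockstep(cpu_a, cpu_b):
    """True when both CPUs have equal registers and no leftover hidden state."""
    return (cpu_a.software_visible() == cpu_b.software_visible()
            and not cpu_a.cache and not cpu_b.cache
            and not cpu_a.microarch and not cpu_b.microarch)
```

  • In this model, two CPUs with equal registers but different cache or predictor contents are not yet ready for lockstep; after `reset_non_software_visible` is applied to both, they are.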
  • performing error recovery on the first CPU according to the state, at the time of the interrupt, of the second CPU that is running correctly among the at least two CPUs includes: the first CPU and the second CPU are each reset and then execute an initialization instruction to restore the software-visible CPU context, so that the first CPU and the second CPU enter the lockstep mode again, where the initialization instruction includes the software-visible CPU context of the second CPU at the time of the interrupt, and the initialization instruction is used to restore the software-visible CPU context to that of the second CPU at the time of the interrupt.
  • the CPU context includes the value of system registers and the value of general-purpose registers.
  • the first CPU and the second CPU may reset and execute the initialization instruction at the same time, so that the first CPU and the second CPU enter the lockstep mode again. Before the reset, therefore, the software-visible states of the at least two CPUs are exactly the same. After the reset, the software-visible states of the at least two CPUs are still the same, and both CPUs fetch data and instructions from the external memory and receive the same input instruction stream.
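  • The reset-and-initialize variant can be sketched by encoding the saved context of the correct CPU as a list of register-load operations that every CPU executes after a full reset. The tuple format and the `load_imm` opcode are assumptions made for illustration; the application does not specify how the initialization instruction is encoded.

```python
def build_init_instructions(good_context):
    """Encode the correct CPU's software-visible context as register loads."""
    return [("load_imm", reg, value) for reg, value in sorted(good_context.items())]


def reset_and_run(instructions):
    """Model a full reset followed by execution of the initialization instructions."""
    registers = {}  # a full reset leaves no previous register contents
    for opcode, reg, value in instructions:
        if opcode != "load_imm":
            raise ValueError(f"unexpected opcode: {opcode}")
        registers[reg] = value
    return registers


# Usage: both CPUs run the same instructions and end with identical contexts.
saved_context = {"x1": 2, "x0": 1, "sys0": 3}
program = build_init_instructions(saved_context)
first_cpu_regs = reset_and_run(program)
second_cpu_regs = reset_and_run(program)
```

  • Because both CPUs start from reset and execute the same sequence, their software-visible contexts match by construction, at the cost of resetting the correct CPU as well.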
  • determining the first CPU in which the error occurs among the at least two CPUs and the type of the error includes: the first CPU determines the type of the error according to the Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU, where the ACPI table is used to record errors found when polling the status registers of the reliability, availability, and serviceability (RAS) nodes of the CPU.
  • when a RAS error occurs in the CPU, the CPU generates an interrupt or system exception and enters UEFI or BIOS.
  • UEFI or BIOS traverses each RAS node status register and records the errors corresponding to the CPU in an in-memory table (i.e., the ACPI table).
  • the ACPI driver of the operating system can parse the table to know which node in the system has which type of error.
  • the first CPU polls the status register of the RAS node of the first CPU to determine the type of the error.
  • the RAS driver directly traverses the status register of each RAS node to determine the cause of the error, instead of querying the ACPI table.
  • the second CPU may also poll the status register of the RAS node of the second CPU to determine that the second CPU is operating correctly.
  • the second CPU may also determine that the second CPU is operating correctly according to the ACPI table corresponding to the second CPU.
  • each CPU can determine whether it has an error without querying the RAS node or the ACPI table. In other words, it can be determined directly at this point which CPUs have the error and which CPUs are running correctly.
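  • The two query paths described above (firmware records errors in a table that the operating system parses, or a driver polls the RAS node status registers directly) can be contrasted in a short sketch. The status encoding (`NO_ERROR`, `RECOVERABLE`, `UNRECOVERABLE`), the node names, and the table layout are all hypothetical and do not reflect the real ACPI or RAS register formats.

```python
NO_ERROR, RECOVERABLE, UNRECOVERABLE = 0, 1, 2  # hypothetical status encoding


def poll_ras_nodes(status_registers):
    """Direct path: traverse each RAS node status register and keep the nonzero ones."""
    return {node: status for node, status in status_registers.items() if status != NO_ERROR}


def build_error_table(per_cpu_status):
    """Firmware path: record the errors found for each CPU in an in-memory table."""
    table = []
    for cpu_id, regs in per_cpu_status.items():
        for node, status in poll_ras_nodes(regs).items():
            table.append({"cpu": cpu_id, "node": node, "type": status})
    return table


def error_type_from_table(table, cpu_id):
    """Driver path: parse the table; report the most severe error recorded for a CPU."""
    types = [entry["type"] for entry in table if entry["cpu"] == cpu_id]
    return max(types, default=NO_ERROR)


# Usage: cpu1 has a recoverable error on one node, cpu0 has none.
status = {"cpu0": {"l1": 0, "l2": 0}, "cpu1": {"l1": 1, "l2": 0}}
table = build_error_table(status)
```

  • Either path yields the same answer in this model; the table path adds a firmware step, while direct polling reads the registers without consulting the table.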
  • the at least two CPUs in lockstep mode receiving the interrupt includes: the at least two CPUs receive the interrupt sent by the interrupt controller, where the interrupt controller sends the interrupt to the at least two CPUs when the comparison circuit determines that the outputs of the at least two CPUs are inconsistent.
  • the comparison circuit can be implemented by a dedicated hardware circuit, and is not arranged on the critical path, for example, it can be arranged outside the CPU, so that the comparison circuit has no effect on the performance of the CPU.
  • the comparison circuit is a CPU clock cycle (cycle) level comparison circuit.
  • the comparison circuit corresponding to the lockstep CPU shares the clock source with the lockstep CPU, which ensures that the comparison circuit and the CPU run at the same frequency and allows cycle-by-cycle data comparison, so that errors can be detected promptly and error recovery or other further processing can be performed as soon as possible.
  • the outputs of the at least two CPUs include at least one of: the internal bus output of each of the at least two CPUs, the output of each CPU to the external bus, and the output of the level 3 (L3) cache control logic of each CPU.
  • the determining the first CPU where the error occurs among the at least two CPUs and the type of the error include:
  • when the comparator determines that the obtained CPU outputs are inconsistent, it can report a RAS error interrupt and at the same time record information about the inconsistent data in the register of the RAS node corresponding to the comparator, such as at least one of the address of the erroneous data, the module in error, and the error type.
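  • The comparator itself is a hardware circuit, but its cycle-by-cycle behaviour can be modelled in software for illustration. In the sketch below, each CPU's output is a list with one integer per clock cycle, and `first_mismatch` is a hypothetical helper that returns the information a comparator might record when it raises the interrupt.

```python
def first_mismatch(outputs_a, outputs_b):
    """Compare two per-cycle output traces; return details of the first difference."""
    for cycle, (a, b) in enumerate(zip(outputs_a, outputs_b)):
        if a != b:
            # XOR marks the differing bit positions; count them for the report.
            return {"cycle": cycle, "differing_bits": bin(a ^ b).count("1")}
    return None  # outputs agree on every compared cycle: no interrupt
```

  • For instance, the traces `[1, 2, 3]` and `[1, 6, 3]` first differ at cycle 1 in a single bit, so the model reports that cycle instead of waiting for later output.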
  • the method further includes: stopping the operation of the at least two CPUs based on the type of the error being an unrecoverable error.
  • an error recovery device including: a first central processing unit CPU and a second CPU;
  • the first CPU is configured to: receive an interrupt, where the interrupt is triggered by an error of the first CPU when the first CPU and the second CPU are in lockstep mode; in response to the interrupt, exit the lockstep mode and determine the type of the error; and, when the type of the error is a recoverable error, perform error recovery according to the state of the second CPU at the time of the interrupt; the second CPU is configured to receive the interrupt and exit the lockstep mode.
  • the first CPU is specifically configured to: obtain from memory the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of the system registers and the values of the general-purpose registers.
  • the second CPU is further configured to save, to the memory, the software-visible CPU context of the second CPU at the time of the interrupt and the data in the cache.
  • the first CPU is specifically configured to: obtain the software-visible CPU context of the second CPU at the time of the interrupt through a hardware channel between the first CPU and the second CPU, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of the system registers and the values of the general-purpose registers.
  • the first CPU is further configured to: after updating the software-visible CPU context, reset the non-software-visible microarchitecture state of the first CPU and retain the software-visible CPU context of the first CPU, so that the first CPU re-enters the lockstep mode; the second CPU is further configured to: after the first CPU updates the software-visible CPU context, reset the non-software-visible microarchitecture state of the second CPU and retain the software-visible CPU context of the second CPU, so that the second CPU re-enters the lockstep mode.
  • the first CPU is specifically configured to reset and, after the reset, execute an initialization instruction to restore the software-visible CPU context, so that the first CPU re-enters the lockstep mode, where the initialization instruction includes the software-visible CPU context of the second CPU at the time of the interrupt, the initialization instruction is used to restore the software-visible CPU context to that of the second CPU at the time of the interrupt, and the CPU context includes the values of the system registers and the values of the general-purpose registers.
  • the second CPU is specifically configured to reset, and execute the initialization instruction after the reset, so that the second CPU enters the lockstep mode again.
  • the first CPU and the second CPU may reset and execute the initialization instruction at the same time, so that the first CPU and the second CPU enter the lockstep mode again.
  • the first CPU is specifically configured to: determine the type of the error according to the Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU, where the ACPI table is used to record errors found when polling the status registers of the reliability, availability, and serviceability (RAS) nodes of the CPU; or poll the status register of the RAS node of the first CPU to determine the type of the error.
  • the first CPU is specifically configured to: receive the interrupt sent by an interrupt controller, where the interrupt controller sends the interrupt to the first CPU and the second CPU when it is determined that the outputs of the first CPU and the second CPU are inconsistent; the second CPU is specifically configured to: receive the interrupt sent by the interrupt controller.
  • the output of each CPU includes at least one of: the internal bus output of the CPU, the output to the external bus, and the output of the level 3 (L3) cache control logic.
  • the first CPU is further configured to query the status register of the RAS node corresponding to the comparison circuit to determine the first CPU in which the error occurred and the type of the error.
  • the first CPU is also used to stop operation, and the second CPU is also used to stop operation.
  • an interrupt controller and a comparison circuit are further included.
  • the comparison circuit is used to obtain the outputs of the first CPU and the second CPU and, when it determines that the outputs of the first CPU and the second CPU are inconsistent, send a first signal to the interrupt controller, where the first signal is used to instruct the interrupt controller to send an interrupt to the first CPU and the second CPU; the interrupt controller sends the interrupt to the first CPU and the second CPU according to the first signal.
  • an error recovery device, comprising a determining unit and a recovery unit, where, when an error occurs in a first CPU of at least two central processing units (CPUs) in lockstep mode and the at least two CPUs exit the lockstep mode, the determining unit is configured to determine the type of the error of the first CPU; and the recovery unit is configured to, when the type of the error is a recoverable error, perform error recovery on the first CPU according to the state, at the time of the interrupt, of a second CPU that is running correctly among the at least two CPUs.
  • the recovery unit is specifically configured to: obtain from the memory the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of the system registers and the values of the general-purpose registers.
  • it further includes a CPU context management unit, configured to save, to the memory, the software-visible CPU context of the second CPU at the time of the interrupt and the data in the cache.
  • an initialization unit, configured to execute an initialization instruction to restore the software-visible CPU context after the first CPU and the second CPU are reset, so that the first CPU and the second CPU re-enter the lockstep mode, where the initialization instruction includes the software-visible CPU context of the second CPU at the time of the interrupt, the initialization instruction is used to restore the software-visible CPU context to that of the second CPU at the time of the interrupt, and the CPU context includes the values of the system registers and the values of the general-purpose registers.
  • the determining unit is specifically configured to: determine the type of the error according to the Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU, where the ACPI table is used to record errors found when polling the status registers of the reliability, availability, and serviceability (RAS) nodes of the CPU; or poll the status register of the RAS node of the first CPU to determine the type of the error.
  • the determining unit is specifically configured to query the status register of the RAS node corresponding to the comparison circuit, and determine the first CPU, among the at least two CPUs, in which the error occurred.
  • the comparison circuit is used to send a first signal to the interrupt controller when it determines that the outputs of the at least two CPUs are inconsistent, where the first signal is used to instruct the interrupt controller to send an interrupt to the at least two CPUs to trigger the at least two CPUs to exit the lockstep mode.
  • the outputs of the at least two CPUs include at least one of: the internal bus output of each of the at least two CPUs, the output of each CPU to the external bus, and the output of the level 3 (L3) cache control logic of each CPU.
  • the determining unit is further configured to control the at least two CPUs to stop running based on the type of the error being an unrecoverable error.
  • a comparison circuit for querying errors is provided, where the comparison circuit is arranged outside at least two CPUs in lockstep mode and is used to: determine that the outputs of the at least two CPUs are inconsistent; and, based on the inconsistency of the outputs of the at least two CPUs, send a first signal to the interrupt controller, where the first signal is used to instruct the interrupt controller to send an interrupt to the at least two CPUs, and the interrupt is used to indicate that an error has occurred in at least one of the at least two CPUs.
  • the outputs of the at least two CPUs include at least one of: the internal bus output of each of the at least two CPUs, the output of each CPU to the external bus, and the output of the level 3 (L3) cache control logic of each CPU.
  • in a fifth aspect, an error recovery device is provided, which includes modules corresponding to the methods/operations/steps/actions described in the first aspect.
  • in a sixth aspect, an error recovery device is provided, which includes a processor, where the processor is configured to call program code stored in a memory to perform some or all of the operations described in the first aspect.
  • the memory for storing the program code can be located inside the error recovery device (the error recovery device can also include memory in addition to the processor), or it can be located outside the error recovery device (which can be other equipment).
  • the processor may be a lockstep CPU, and the lockstep CPU includes at least two physical CPUs.
  • the aforementioned memory is a non-volatile memory.
  • the processor and the memory may be coupled together.
  • the foregoing error recovery device may be a terminal, or a device (for example, a chip, or a device that can be used with the terminal) for performing error recovery in the terminal.
  • the terminal may specifically be a smart phone, a vehicle-mounted device, or a wearable device.
  • the aforementioned vehicle-mounted device may be a computer system independent of the automobile but applicable to the automobile, or may be a computer system integrated into the automobile (for example, an autonomous vehicle).
  • a computer-readable storage medium stores program code, where the program code includes instructions for executing part or all of the operations in the method described in the first aspect.
  • the foregoing computer-readable storage medium is located in a terminal, and the terminal may be a device capable of error recovery.
  • embodiments of the present application provide a computer program product, which when the computer program product runs on an error recovery device, causes the error recovery device to perform some or all of the operations in the method described in the first aspect.
  • in a ninth aspect, a chip is provided, which includes a processor configured to perform some or all of the operations in the method described in the first aspect.
  • Figure 1 shows an implementation form of the system of the embodiment of the present application.
  • Figure 2 shows a schematic diagram of a system architecture provided by an embodiment of the present application.
  • Figure 3 shows an example of the query mode.
  • Fig. 4 shows a schematic flowchart of an error recovery method provided by an embodiment of the present application.
  • Figure 5 shows a specific example of lockstep manager initialization.
  • Figure 6 shows an example of CPU context saving and restoration.
  • Fig. 7 shows an example of hardware channel-based error repair provided by an embodiment of the present application.
  • FIG. 8 shows a schematic flowchart of an error recovery method provided by an embodiment of the present application.
  • FIG. 9 shows a schematic flowchart of an error recovery apparatus provided by an embodiment of the present application.
  • Fig. 10 shows a schematic flowchart of an error recovery apparatus provided by an embodiment of the present application.
  • Lockstep CPU: a logical CPU that contains at least two physical CPUs (also called CPUs) or at least two physical cores.
  • the at least two CPUs may be arranged on one chip or distributed on different chips, which is not limited in the embodiment of the present application.
  • the Lockstep CPU may also be referred to as a lockstep logical CPU.
  • the description uses a logical CPU that includes at least two CPUs as an example.
  • the at least two CPUs in the lockstep CPU execute the same code or instruction and only output the calculation result of one CPU.
  • the software can see only one CPU, but it contains at least two (for example, multiple) CPUs inside.
  • Split CPU: when at least two CPUs in the lockstep CPU exit the lockstep mode and become normal, independent, separate CPUs, the at least two physical CPUs that have exited the lockstep mode are said to be in split mode. At this time, the software can see the at least two CPUs.
  • At least two CPUs in lockstep mode should produce the same output. Once the outputs of the at least two CPUs are inconsistent, at least one CPU must have run incorrectly (that is, an error has occurred). When a CPU error occurs, the lockstep CPU is abnormal, and the CPUs in the lockstep CPU need to exit lockstep mode and enter split mode.
  • CPU exception jump: when the CPU is running, if an error occurs or an interrupt needs to be handled, the CPU jumps to the entry of the exception vector table or interrupt vector table, where functions handle the error or interrupt. After the processing is completed, the CPU can return to the place where it was interrupted and continue execution. As an example, when the lockstep CPU is abnormal, the CPUs in the lockstep CPU take an exception jump, enter split mode, and perform error recovery.
  • Fig. 1 shows an implementation form of the system in the platform software and hardware of the embodiment of the present application.
  • the hardware part may include a central processing unit (CPU), a graphics processing unit (GPU), a memory, and the like.
  • the CPU includes lockstep CPU0, lockstep CPU1, normal CPU2, normal CPU3, etc., which are not specifically limited in the embodiment of the present application.
  • the lockstep CPU may also be referred to as a lockstep logical CPU, including at least two CPUs (also referred to as physical CPUs).
  • one of the CPUs may be referred to as a main CPU, and the other CPUs may be referred to as a secondary CPU or a redundant CPU.
  • the software part includes different business programs that run, and software modules that manage the hardware modules.
  • business programs such as automotive safety integrity level (ASIL)-D business program #1, ASIL-D business program #2, ASIL-B business program, ordinary program, etc.
  • software modules that manage hardware modules such as error manager #1 that manages the lockstep CPU0, and error manager #2 that manages the lockstep CPU1, etc.
  • ASIL-D business program #1 runs on lockstep CPU0
  • ASIL-D business program #2 runs on lockstep CPU1
  • ASIL-B business programs and ordinary programs can run on CPU2 or CPU3.
  • applications with different security levels are isolated by means of containers or virtual machines, to prevent a failure in one partition from affecting the operation of programs in another partition.
  • Fig. 2 shows a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the system architecture of the embodiment of the present application includes a hardware architecture and a software architecture.
  • the hardware architecture is used to provide a hardware platform for error detection and repair
  • the software architecture is used to provide an error repair solution based on the hardware platform.
  • the hardware architecture can also be called the hardware layer, or the underlying hardware layer.
  • the hardware layer may include at least one lockstep CPU and an interrupt controller.
  • the interrupt controller is used to perform interrupt control when an error occurs in some of the CPUs in the lockstep CPU.
  • the hardware layer includes lockstep CPU0 and lockstep CPU1
  • lockstep CPU0 further includes a main CPU0 and at least one sub CPU0
  • lockstep CPU1 further includes a main CPU1 and at least one sub CPU1.
  • FIG. 2 only exemplarily shows one secondary CPU, but this does not limit the embodiment of the present application.
  • each lockstep CPU is provided with at least one comparator (or comparison circuit) for acquiring and comparing the outputs of at least two CPUs included in the lockstep CPU.
  • the output of each CPU included in the lockstep CPU can be obtained and compared by a comparator provided outside the lockstep CPU.
  • the comparison circuit can be implemented by a dedicated hardware circuit and is not set on the critical path, for example, it can be set outside the CPU, so that the comparison circuit has no effect on the performance of the CPU.
  • the foregoing comparison circuit is a CPU clock cycle-level comparison circuit.
  • the comparison circuit corresponding to the lockstep CPU shares the clock source with the lockstep CPU, which ensures that the comparison circuit and the CPU run at the same frequency and allows cycle-by-cycle data comparison, so that errors can be detected promptly and error recovery or other further processing can be performed as soon as possible.
  • the at least one comparator and the lockstep CPU may be provided on a chip to share the clock source with the lockstep CPU, but the embodiment of the present application is not limited to this.
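  • the cycle-by-cycle comparison described above can be illustrated with a minimal software sketch. The real comparator is a hardware circuit; the function name `first_mismatch_cycle` and the sample bus values are hypothetical and serve only to show the comparison rule.

```python
from typing import Optional, Sequence


def first_mismatch_cycle(outputs: Sequence[Sequence[int]]) -> Optional[int]:
    """Return the first cycle at which the CPU outputs differ, or None.

    outputs[i][c] is the value that CPU i drives on the compared bus in cycle c.
    """
    # zip(*outputs) yields, for each cycle, the tuple of all CPU outputs.
    for cycle, values in enumerate(zip(*outputs)):
        if any(v != values[0] for v in values[1:]):
            return cycle  # a real comparator would signal the interrupt controller here
    return None


main_cpu = [0x10, 0x11, 0x12, 0x13]
secondary_cpu = [0x10, 0x11, 0x1A, 0x13]  # diverges in cycle 2
print(first_mismatch_cycle([main_cpu, secondary_cpu]))
```

Because the comparison is made in every cycle, the mismatch is reported in the cycle in which it first appears rather than when a later result is checked.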
  • the output of the CPU includes at least one of the internal bus output of each of the at least two CPUs, the external bus output of each CPU, and the output of the layer 3 cache (L3 cache) control logic corresponding to each CPU.
  • the internal bus output of the CPU is, for example, the L1 cache (L1 cache) of the CPU
  • the CPU's external bus output is, for example, the L2 cache (L2 cache) of the CPU.
  • the L3_CTRL corresponding to the secondary CPU is the redundant L3_CTRL.
  • the L3 cache control logic of lockstep CPU0 includes L3_CTRL0, L3_RAM, and L3_CTRL0'
  • the L3 control logic of lockstep CPU1 includes L3_CTRL1, L3_RAM, and L3_CTRL1', for example. This is not limited.
  • the CPU internal output comparator 0 can be used to compare the internal bus outputs of the main CPU0 and the at least one sub CPU0, the CPU external output comparator 0 can be used to compare the external bus outputs of the main CPU0 and the at least one sub CPU0, and the L3 cache control logic output comparator 0 can be used to compare the L3 control logic output of the main CPU0 (L3_CTRL0) with that of the at least one sub CPU0 (L3_CTRL0').
  • the internal output comparator of the CPU can be arranged outside the CPU, and the internal bus output of the CPU can be obtained through the data line, which is not limited in the embodiment of the present application.
  • one lockstep CPU can be configured with one or two of the internal output comparator of the CPU, the external output comparator of the CPU, and the L3 control logic output comparator.
  • different lockstep CPUs can adopt different comparator setting methods. For example, lockstep CPU0 only sets the internal output comparator 0 of the CPU, and lockstep CPU1 only sets the external output comparator 1 of the CPU, and so on.
  • the CPU external output comparator can be set as the first-level comparison circuit and the L3 control logic output comparator as the second-level comparison circuit, without setting a CPU internal output comparator, that is, without comparing the data output by the CPU internal bus. This reduces the number of comparison circuit levels.
  • the error inside the CPU can be found by the comparison circuit outside the CPU when it is transmitted to the outside of the CPU.
  • one lockstep CPU may include two physical CPUs, or include three physical CPUs.
  • when the comparator finds that the outputs of the at least two CPUs in lockstep mode are inconsistent, it can send a signal to the interrupt controller, and the signal instructs the interrupt controller to send interrupts to the at least two CPUs. After receiving the signal, the interrupt controller sends an interrupt to the lockstep CPU, and the interrupt is an exception for the at least two CPUs.
  • after the at least two CPUs in the lockstep CPU receive the interrupt, they exit the lockstep mode, that is, enter the split mode. In split mode, the comparator does not work.
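  • the transition from lockstep mode to split mode can be modelled as a small state machine. This is a sketch under assumptions: the class `LockstepCpu`, its attributes, and the method name are hypothetical, and the interrupt controller is reduced to a list of interrupted CPUs.

```python
from enum import Enum


class Mode(Enum):
    LOCKSTEP = "lockstep"
    SPLIT = "split"


class LockstepCpu:
    """Toy model of one lockstep logical CPU made of several physical CPUs."""

    def __init__(self, physical_cpus):
        self.physical_cpus = list(physical_cpus)
        self.mode = Mode.LOCKSTEP
        self.comparator_enabled = True
        self.interrupted = []  # physical CPUs that received the interrupt

    def on_comparator_mismatch(self):
        """Comparator signals the interrupt controller, which interrupts every CPU."""
        if self.mode is not Mode.LOCKSTEP:
            return  # the comparator does not work in split mode
        self.interrupted = list(self.physical_cpus)
        self.mode = Mode.SPLIT           # all CPUs leave lockstep together
        self.comparator_enabled = False  # comparison stops until lockstep is re-entered


cpu0 = LockstepCpu(["main CPU0", "sub CPU0"])
cpu0.on_comparator_mismatch()
print(cpu0.mode, cpu0.interrupted)
```

The interrupt is delivered to every physical CPU, not only the one that produced the wrong output, because at this point it is not yet known which CPU is at fault.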
  • a possible implementation is that in split mode, the L3_CTRL corresponding to the main CPU in the lockstep CPU works, and the redundant L3_CTRL corresponding to the secondary CPU is in the gated_off state. At this time, all the requests of the CPU (including the main CPU and the sub CPU) in the lockstep CPU are sent to the L3_CTRL in the working state, and then converted by the L3_CTRL, and output to the L3_RAM.
  • the request sent by the CPU is, for example, a read-write request, a query request, or a replacement request, which is not limited in the embodiment of the present application.
  • the software architecture can also be referred to as the software layer.
  • the software layer mainly includes a lockstep manager, a reliability, availability and serviceability (RAS) error manager, and a health monitoring module.
  • the lockstep manager is used to manage at least two CPUs in the lockstep CPU.
  • the RAS error manager is used to determine the CPU where the error occurred and the type of error when some of the CPUs in the lockstep CPU have an error.
  • the health monitoring module is responsible for decision-making and processing of error types.
  • the lockstep manager may include: lockstep configurator, split mode manager, CPU context manager, error query and repairer and reset-sync operator.
  • Lockstep configurator: sets at least two physical CPUs in the computer system as one lockstep logical CPU, and sets the number of lockstep logical CPUs in the system.
  • Split mode manager: manages the lockstep exception vector table and the interrupt handling function.
  • the interrupt controller sends an interrupt to the at least two CPUs, and the at least two CPUs enter the split mode from the lockstep mode.
  • the at least two CPUs in the split mode jump to the entry of the exception vector table respectively, and call the CPU context manager and interrupt handling function.
  • each CPU can determine whether it has made an error. In other words, at this time, it can be determined which CPUs are the CPUs that have the error and which are the CPUs that are running correctly.
  • CPU context manager: when the lockstep mode is exited, saves the software-visible CPU context of the at least two CPUs and the data in their L1/L2 caches to different stacks in the L3 cache or memory, in preparation for subsequent error repair.
  • the CPU context visible to the software includes the CPU state in the kernel mode and the user mode, that is, the data of the system register and the data of the general register corresponding to the CPU.
  • the interrupt processing function can call the error query and repairer.
  • the error query and repairer can query the RAS error manager corresponding to the CPU that has the error to determine the error type of the CPU that has the error.
  • if the CPU in which the error occurred has not been determined when the CPUs enter the split mode, the error query and repairer can query the RAS error manager corresponding to each CPU to determine both the CPU in which the error occurred and the type of error.
  • the types of errors include recoverable errors and unrecoverable errors.
  • when the error type of the CPU is determined to be an unrecoverable error, the health monitoring module is notified to decide how to handle the erroneous CPU, for example by taking it offline.
  • when the error type of the CPU is determined to be a recoverable error, the error query and repairer repairs the erroneous CPU.
  • Reset-sync operator: makes the at least two physical CPUs in split mode re-enter lockstep mode.
  • the Reset-sync operator can be implemented by hardware or software, which is not limited in the embodiment of the present application.
  • the RAS error manager may include an advanced configuration and power interface (ACPI) error parser and a non-ACPI error parser.
  • the RAS error manager includes one or more RAS nodes, and each RAS node corresponds to one or more status registers for storing various types of errors that occur in the CPU.
  • the ACPI error parser can perform error query according to the ACPI method. Specifically, the error parser can query the error status of the CPU through the ACPI table.
  • when a RAS error occurs in the CPU, the CPU generates an interrupt or a system exception and enters the unified extensible firmware interface (UEFI) or the basic input output system (BIOS). The UEFI or BIOS then traverses the status registers of each RAS node and records the errors corresponding to the CPU in a memory table (that is, the ACPI table).
  • the ACPI driver of the operating system parses the table to learn which node in the system has which type of error.
  • the non-ACPI error parser can perform error query in a non-ACPI way.
  • in Figure 3, the memory management unit (MMU), the L1 data (L1D) cache, the L1 instruction (L1I) cache, the L2 cache, and the L3 cache each have one RAS node.
  • the ACPI method may be used preferentially to query errors. If no error is found in this way, the non-ACPI method can be used. This is because, for a producer error in a RAS node, the RAS register records the error but the system does not report it; an exception is reported on the consumer side only when the CPU consumes the erroneous data. In that case the error may not be recorded in the ACPI table, and the non-ACPI method must be used to poll the status register of each RAS node to determine the error type.
  • a producer error is attributed to the component that generated the error. This type of error is not triggered immediately when it occurs; it is reported only when the erroneous data is consumed. For example, when the memory generates an error, it does not actively report it; the error is triggered only when another component reads the erroneous data.
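  • the ACPI-first query with a polling fallback can be sketched as follows. This is only an illustration: the ACPI table and the RAS node status registers are modelled as plain dictionaries, and the function name `query_error` and the node names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def query_error(
    acpi_table: Dict[str, str],
    ras_status: Dict[str, Optional[str]],
) -> Optional[Tuple[str, str, str]]:
    """Return (node, error_type, source), or None if no error is recorded."""
    # 1. Preferred path: errors that firmware already recorded in the ACPI table.
    if acpi_table:
        node, error_type = next(iter(acpi_table.items()))
        return node, error_type, "acpi"
    # 2. Fallback: poll every RAS node status register directly. This catches
    #    producer errors that were recorded by the node but never reported.
    for node, error_type in ras_status.items():
        if error_type is not None:
            return node, error_type, "polling"
    return None


# A producer error sits in the L2 cache node but is absent from the ACPI table.
registers = {"MMU": None, "L1D": None, "L1I": None, "L2": "non-UC", "L3": None}
print(query_error({}, registers))
```

Polling is kept as the second step because it touches every node, whereas the table lookup only reads what firmware has already collected.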
  • one or more RAS nodes may be set for the comparators corresponding to the lockstep CPU. For example, one RAS node may be set for each of the CPU internal output comparator 0, the CPU external output comparator 0, and the L3 control logic output comparator 0, which is not limited in this embodiment of the application.
  • when the comparator determines that the obtained CPU outputs are inconsistent, it can report a RAS interrupt error and at the same time provide information about the inconsistent data in the register of the RAS node corresponding to the comparator, such as at least one of the erroneous data address, the erroneous module, and the error type.
  • the error module includes, for example, L1 cache, L2 cache, L3 controller and so on.
  • the names of the functions or modules in the embodiments of the present application are only examples. In a specific implementation, the functions or modules in the system architecture shown in FIG. 2 may have other names, which is not specifically limited in the embodiments of the present application.
  • FIG. 4 shows a schematic flowchart of an error recovery method provided by an embodiment of the present application.
  • the method shown in FIG. 4 may be executed by the system in FIG. 1 or by the system in FIG. 2, but the embodiment of the present application is not limited to this.
  • FIG. 4 shows the steps or operations of the error recovery method, but these steps or operations are only examples, and the embodiment of the present application may also perform other operations or variations of each operation in FIG. 4.
  • the various steps in FIG. 4 may be performed in a different order from that presented in FIG. 4, and it is possible that not all operations in FIG. 4 are to be performed.
  • the lockstep manager is initialized.
  • initialization of the lockstep manager includes: initialization of resource configuration, initialization of an exception vector table, initialization of interrupt processing functions, etc., which are not limited in the embodiment of the present application.
  • the RAS error manager can also be initialized.
  • Figure 5 shows a specific example of lockstep manager initialization. As shown in Figure 5, in the pre-initialization phase of the lockstep manager, the configuration file can be read.
  • resource configuration initialization selects two or more adjacent physical CPUs as a group of lockstep logical CPUs according to business requirements. For example, when a lockstep CPU is required to run a task with a high safety level, physical CPU0 and physical CPU1 can be configured during resource configuration initialization as a set of lockstep logical CPUs to run the business program of the task.
  • the initialization of the exception vector table mainly covers initialization of the memory stacks for the CPU contexts used when the lockstep CPU exits to the split mode, error synchronization and data consistency management, and interrupt processing.
  • the software-visible CPU changes from one to multiple.
  • through the initialization of the memory stacks for the CPU contexts, it can be ensured that the contexts of the multiple CPUs are stored in different stacks, so that they do not overwrite each other.
  • the at least two CPUs each jump to the entry of the exception vector table to synchronize CPU errors, which ensures that asynchronous errors present in the system at that moment are reported immediately and prepares for the subsequent query of the error type.
  • the initialization of the interrupt processing function enables the processing of interrupts, such as the interrupt generated when an error occurs in some of the CPUs in the lockstep CPU.
  • the software layer calls the interrupt processing function through the entry of the exception vector table, and then calls the error query and repairer in the interrupt processing function to query the error, and perform corresponding repairs according to the error type.
  • the lockstep core management module is initialized.
  • the output of each of the at least two CPUs included in the lockstep CPU may be obtained through a comparison circuit provided outside the lockstep CPU, and then it is determined whether the outputs of the at least two CPUs are consistent.
  • for the comparison circuit, refer to the description of FIG. 2. For brevity, it is not repeated here.
  • the comparison circuit sends a signal to the interrupt controller, and the interrupt controller sends an interrupt to the CPU according to the signal.
  • the at least two CPUs enter the split mode from the lockstep mode.
  • the at least two CPUs in the split mode each jump to the entry of the exception vector table to synchronize CPU errors. After that, 403 and 404 are executed.
  • the at least two physical CPUs in the split mode save their corresponding CPU contexts. Because at least one of the CPU contexts of the at least two CPUs is erroneous, the at least two CPU contexts and the data in the caches need to be flushed to different stack addresses in the memory.
  • FIG. 6 shows an example of CPU context saving and restoration.
  • the CPU0 and CPU1 in the lockstep CPU0' jump to the interrupt request (interrupt request, IRQ) entry respectively.
  • the context of CPU0 is saved to stack 0 (stack0) in the memory
  • the context of CPU1 is saved to stack 1 (stack 1) in the memory.
  • through the error query, it can be determined which of CPU0 and CPU1 is the correct CPU and which is the erroneous CPU.
  • when the error is a recoverable error, it is repaired according to the result of the error query; for example, the state of the erroneous CPU can be set according to the context of the correct CPU stored in the memory. For example, when an error occurs in CPU0 and CPU1 is running normally, the context saved in stack1 is restored to CPU0 to repair CPU0. The two CPUs can then re-enter lockstep mode.
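  • the save-and-restore example of FIG. 6 can be sketched in software as follows. This is a simplified model, not the actual implementation: the class `CpuContext`, the helper names, and the register names (`SCTLR`, `x0`) are illustrative assumptions, and the memory stacks are modelled as dictionary entries.

```python
import copy
from dataclasses import dataclass


@dataclass
class CpuContext:
    """Software-visible state: system registers and general-purpose registers."""
    system_regs: dict
    general_regs: dict


def save_contexts(live: dict) -> dict:
    """Save each CPU's context to its own stack so the copies cannot overwrite each other."""
    return {f"stack{cpu_id}": copy.deepcopy(ctx) for cpu_id, ctx in live.items()}


def repair(live: dict, stacks: dict, erroneous: int, correct: int) -> None:
    """Restore the saved context of the correct CPU onto the erroneous CPU."""
    live[erroneous] = copy.deepcopy(stacks[f"stack{correct}"])


live = {
    0: CpuContext({"SCTLR": 0x1}, {"x0": 0xDEAD}),  # CPU0: x0 is corrupted
    1: CpuContext({"SCTLR": 0x1}, {"x0": 0x2A}),    # CPU1: running correctly
}
stacks = save_contexts(live)
repair(live, stacks, erroneous=0, correct=1)
print(live[0] == live[1])  # both CPUs now hold the same context
```

Both contexts are saved before the erroneous CPU is known, which is why each CPU needs its own stack.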
  • Error query and repairer can send query information to RAS error manager, and RAS error manager can perform error query.
  • RAS error manager performs error query according to ACPI and non-ACPI methods.
  • for the ACPI mode and the non-ACPI mode, refer to the above description; for brevity, details are not repeated here.
  • the RAS node corresponding to the comparator can be queried to determine the CPU where the error occurred and the type of error, without polling other RAS nodes.
  • the lockstep error will be regarded as an ordinary RAS error, and the error can be queried by directly reading the register of the RAS node corresponding to the comparator provided by the hardware.
  • ACPI or non-ACPI can be used. Since the register includes at least one of the error data address, the error module, and the error type, the error type can be determined by reading the register of the RAS node corresponding to the comparator.
  • a lockstep error may refer to an error in which the outputs of at least two CPUs are inconsistent when the lockstep CPU is in the lockstep mode.
  • recoverable errors include non-uncontainable (non-UC) type errors whose number of occurrences does not exceed a preset threshold, or system hangs, etc., which are not limited in the embodiment of the application.
  • unrecoverable errors may include at least one of UC-type errors, non-UC-type errors whose occurrence times exceed a preset threshold, and unknown error types, which are not limited in this embodiment of the application.
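  • the classification into recoverable and unrecoverable errors can be sketched as a small function. This is an illustration under assumptions: the string labels (`"UC"`, `"non-UC"`, `"hang"`) and the function name `classify` are hypothetical, and the rule for UC-type errors follows the list of unrecoverable errors above.

```python
def classify(error_type: str, occurrences: int, threshold: int) -> str:
    """Classify an error as 'recoverable' or 'unrecoverable'."""
    if error_type == "UC":
        return "unrecoverable"  # uncontainable errors are never repaired
    if error_type == "non-UC":
        # Repeated non-UC errors stop being recoverable once the count exceeds the threshold.
        return "recoverable" if occurrences <= threshold else "unrecoverable"
    if error_type == "hang":
        return "recoverable"  # system hang
    return "unrecoverable"  # unknown error types are treated conservatively


for sample in [("non-UC", 1, 3), ("non-UC", 4, 3), ("UC", 1, 3), ("mystery", 1, 3)]:
    print(sample, classify(*sample))
```

Counting occurrences lets a transient fault be repaired while a fault that keeps recurring is escalated to the health monitoring module.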
  • the health monitoring module may be notified to perform system health monitoring, that is, execute 405.
  • the error recovery is performed through software, as shown in 406.
  • in the case of a CPU system hang, if the error has not propagated, error recovery can be performed through the hardware channel, as shown in 407.
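  • one possible reading of the choice among 405, 406 and 407 is sketched below. The conditions are assumptions drawn from the surrounding description: unrecoverable errors go to health monitoring (405), a hang whose error has not propagated uses the hardware channel (407), and other recoverable errors are repaired through software (406). The function name `next_step` is hypothetical.

```python
def next_step(recoverable: bool, is_hang: bool, propagated: bool) -> int:
    """Return the step number of FIG. 4 to execute after the error query."""
    if not recoverable:
        return 405  # notify the health monitoring module
    if is_hang and not propagated:
        return 407  # error recovery through the hardware channel
    return 406      # error recovery through software


print(next_step(recoverable=True, is_hang=False, propagated=False))
```

Other embodiments may order these checks differently; the sketch only fixes one consistent ordering.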
  • the RAS node corresponding to the comparator can be used to determine which CPU has an error and what type of error occurred.
  • when the lockstep CPU includes three or more physical CPUs and the comparator determines that the data output by the three or more physical CPUs are not identical, the CPU in which the error occurred can be determined by the majority-decision principle.
  • the majority-decision principle means that when the output of one of the at least three CPUs is inconsistent with the outputs of the other CPUs, it can be determined that an error has occurred in that one CPU.
  • a possible way is to take the erroneous CPU offline, after which the at least two other CPUs can enter lockstep mode and continue running.
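  • the majority decision for three or more physical CPUs can be sketched as follows. The function name `find_erroneous_cpus` and the sample output values are hypothetical; a real comparator performs this vote in hardware.

```python
from collections import Counter
from typing import Dict, List, Optional


def find_erroneous_cpus(outputs: Dict[int, int]) -> Optional[List[int]]:
    """Return the CPUs whose output disagrees with the majority, or None if no majority exists."""
    value, votes = Counter(outputs.values()).most_common(1)[0]
    if votes <= len(outputs) / 2:
        return None  # no strict majority, so the vote cannot identify the erroneous CPU
    return [cpu for cpu, out in outputs.items() if out != value]


print(find_erroneous_cpus({0: 0xAB, 1: 0xAB, 2: 0xCD}))
```

With only two CPUs the vote returns None on a mismatch, which is why the two-CPU case relies on the error query instead.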
  • the RAS node corresponding to the comparator can determine which CPU has the error and which type of error has occurred, and then determine whether to recover the CPU that has the error according to the type of the error.
  • the health monitoring module performs system health monitoring.
  • the health monitoring module can take the erroneous CPU offline, or control all CPUs in the lockstep CPU to stop running.
  • the health monitoring module may notify the system to exit the automatic driving module, and let the microcontroller unit (MCU) take over and perform sudden braking.
  • since the context of the correct CPU has been flushed from the L1/L2 cache to the memory at the entry of the exception vector table, the context of the correct CPU can be restored to the erroneous CPU, thereby recovering the erroneous CPU.
  • the erroneous CPU can synchronize its state according to the state of the correct CPU.
  • the correct CPU can synchronize its software-visible CPU context to the wrong CPU through the hardware channel with the wrong CPU.
  • Fig. 7 shows an example of hardware channel-based error repair in an embodiment of the present application.
  • for the erroneous CPU, 701A to 704A are executed, and for the correct CPU, 701B to 704B are executed.
  • the wrong CPU enters recovery mode after single-core recovery, and informs the correct CPU to enter recovery mode at the same time.
  • the wrong CPU may notify the correct CPU to enter the recovery mode through interruption or other methods, which is not limited in the embodiment of the present application.
  • the wrong CPU can obtain the software visible state of the correct CPU through the hardware channel, and recover according to the software visible state of the correct CPU.
  • the hardware channel may be a data channel between the correct CPU and the wrong CPU.
  • 703A after the error CPU state recovery is completed, it enters the reset-sync state at the same time as the normal CPU. See the description of 408 for 703A.
  • the correct CPU when the wrong CPU is reset, the correct CPU is in a spin wait state.
  • in the spin wait state, the correct CPU waits for the notification from the erroneous CPU to enter the recovery mode.
  • the wrong CPU may notify the correct CPU to enter this mode through interruption or other methods, which is not limited in the embodiment of the present application.
  • 703B after the software visible state transfer is completed, it enters the reset-sync state at the same time as the error CPU. See the description of 408 for 703B.
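  • the interaction between the two CPUs in FIG. 7 can be simulated with two threads. This is a sketch under assumptions: a `queue.Queue` stands in for the hardware channel, a `threading.Event` stands in for the notification interrupt, and the function name `run_hardware_channel_repair` is hypothetical.

```python
import queue
import threading


def run_hardware_channel_repair(correct_ctx: dict) -> dict:
    """Simulate the correct CPU passing its software-visible state to the erroneous CPU."""
    channel = queue.Queue(maxsize=1)    # stands in for the CPU-to-CPU data channel
    enter_recovery = threading.Event()  # stands in for the notification interrupt
    result = {}

    def erroneous_cpu():
        enter_recovery.set()                 # notify the correct CPU to enter recovery mode
        result["erroneous"] = channel.get()  # obtain the software-visible state
        # Reset-sync would follow here, at the same time as on the correct CPU.

    def correct_cpu():
        enter_recovery.wait()                # spin-wait until notified
        channel.put(dict(correct_ctx))       # transfer the software-visible state
        result["correct"] = dict(correct_ctx)

    threads = [threading.Thread(target=f) for f in (erroneous_cpu, correct_cpu)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


states = run_hardware_channel_repair({"x0": 42, "pc": 0x8000})
print(states["erroneous"] == states["correct"])
```

The correct CPU does nothing until it is notified, so the state it transfers is the state it held at the time of the interrupt.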
  • reset-sync resets the internal microarchitecture.
  • the erroneous CPU resets all hardware states that are not visible to software, clears the data in the CPU cache, and retains the software-visible state in the system registers and general-purpose registers. Reset-sync therefore differs from a traditional CPU restart (reset): it is not a complete reset, so it takes less time, for example a few tens of CPU clock cycles.
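  • the difference between reset-sync and a complete reset can be shown with a toy CPU model. The class `Cpu`, its fields, and the sample values are hypothetical; the point is only which parts of the state are cleared and which are retained.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    system_regs: dict                               # software-visible
    general_regs: dict                              # software-visible
    cache: dict = field(default_factory=dict)       # cleared by reset-sync
    microarch: dict = field(default_factory=dict)   # not visible to software


def reset_sync(cpu: Cpu) -> None:
    """Reset the state that software cannot see; keep the software-visible context."""
    cpu.microarch.clear()
    cpu.cache.clear()
    # system_regs and general_regs are deliberately left untouched.


cpu = Cpu(
    system_regs={"SCTLR": 0x1},
    general_regs={"x0": 0x2A},
    cache={0x1000: 0xFF},
    microarch={"branch_predictor": [1, 0, 1]},
)
reset_sync(cpu)
print(cpu)
```

Because the register context survives, no reboot or context reload is needed afterwards, which is what keeps the operation short.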
  • an initialization instruction can be executed to restore the software-visible CPU context, so that the at least two CPUs re-enter the lockstep mode. The initialization instruction includes the software-visible CPU context of the second CPU at the time of the interrupt and is used to restore the software-visible CPU context to that context, where the CPU context includes the values of the system registers and the general-purpose registers.
  • the initialization instruction can be executed by the initialization unit.
  • a possible implementation is that the at least two CPUs participating in lockstep are reset to the location where the software has placed an initialization instruction in advance, where the initialization instruction includes the PC pointer and the values (or data) of the system registers of the correct CPU at the interrupt time mentioned above. After the restart, the at least two CPUs execute the initialization instructions simultaneously.
  • the software-visible states set for the at least two physical CPUs are exactly the same. After reset-sync, the software-visible states of the at least two physical CPUs are still the same, and both CPUs obtain data and instructions from external memory in the same way and receive the same input instruction stream.
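  • the initialization-instruction approach can be sketched as building a short program from the context of the correct CPU and running it on every CPU after reset. The tuple-based instruction format and the function names are hypothetical and only illustrate the idea.

```python
def build_init_instructions(correct_ctx: dict) -> list:
    """Encode the context of the correct CPU as a list of initialization instructions."""
    program = [("write_sysreg", n, v) for n, v in correct_ctx["system_regs"].items()]
    program += [("write_reg", n, v) for n, v in correct_ctx["general_regs"].items()]
    program.append(("jump", correct_ctx["pc"]))  # resume where the program was interrupted
    return program


def run_after_reset(program: list) -> dict:
    """Execute the initialization instructions on a freshly reset CPU."""
    cpu = {"system_regs": {}, "general_regs": {}, "pc": 0}
    for op, *args in program:
        if op == "write_sysreg":
            cpu["system_regs"][args[0]] = args[1]
        elif op == "write_reg":
            cpu["general_regs"][args[0]] = args[1]
        elif op == "jump":
            cpu["pc"] = args[0]
    return cpu


ctx = {"system_regs": {"SCTLR": 0x1}, "general_regs": {"x0": 0x2A}, "pc": 0x8000}
program = build_init_instructions(ctx)
first_cpu, second_cpu = run_after_reset(program), run_after_reset(program)
print(first_cpu == second_cpu == ctx)
```

Since every CPU starts from the same reset state and executes the same instructions, all of them end up with an identical software-visible context.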
  • lockstep CPU continues to run from where it left off.
  • the microarchitecture state of all CPUs participating in lockstep is the initial state after reset, and the state visible to the software is the state before service interruption.
  • all CPUs participating in lockstep execute initialization instructions at the same time, so the lockstep CPU can continue to run from where the business program was interrupted.
  • the comparator corresponding to the lockstep CPU continues to perform cycle-by-cycle comparison of at least two physical CPUs in the lockstep CPU.
  • the at least two CPUs in the lockstep mode in the embodiment of the present application can exit the lockstep mode when at least one of them has an error, and the CPU that has the error and the CPU that is running correctly can be determined. On this basis, when the error is recoverable, the CPU that has the error is restored according to the CPU that is running correctly, which helps the at least two CPUs resume running from where the business program was interrupted. Therefore, the embodiment of the present application can improve the error recovery capability of the lockstep system and increase system reliability.
  • FIG. 8 shows a schematic flowchart of an error recovery method provided by an embodiment of the present application.
  • the method may be executed by the system shown in FIG. 1 or FIG. 2.
  • the method includes 810 to 830.
  • At least two CPUs in the lockstep mode receive an interrupt, where the interrupt is used to indicate that at least one of the at least two CPUs has an error.
  • the at least two CPUs exit the lockstep mode.
  • the at least two CPUs in the lockstep mode in the embodiment of the present application can exit the lockstep mode when at least one of them has an error, and the CPU that has the error and the type of the error can be determined. On this basis, when the error is recoverable, the CPU that has the error is restored according to the CPU that is running correctly, which helps the at least two CPUs resume running from where the business program was interrupted. Therefore, the embodiment of the present application can improve the error recovery capability of the lockstep system and increase system reliability.
  • both the number of the first CPU and the second CPU may be one or more.
  • the state of the CPU may include the software-visible state and/or the non-software-visible hardware state of the CPU.
  • the state visible to the software can also be called the CPU context, including the value (or data) of general-purpose registers and the value (or data) of system registers.
  • the non-software-visible hardware state which can also be referred to as the non-software-visible micro-architecture state, can be executed on the processor.
  • the at least two CPUs stop running.
  • the performing error recovery on the first CPU according to the state of the second CPU that is running correctly among the at least two CPUs at the time of interruption includes:
  • the CPU context includes the value of system registers and the value of general-purpose registers.
  • the second CPU saves the CPU context visible to the software of the second CPU at the time of interruption and the data in the cache to the memory.
  • the first CPU may save the CPU context visible to the software of the first CPU at the time of interruption and the data in the cache in the memory, which is not limited in this embodiment of the application.
  • the performing error recovery on the first CPU according to the state of the second CPU that is running correctly among the at least two CPUs at the time of interruption includes:
  • the first CPU obtains the software-visible CPU context of the second CPU at the time of interruption through the hardware channel with the second CPU, and updates the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of system registers and the values of general-purpose registers.
  • the hardware channel-based method can be used to repair all levels of registers.
  • after the software-visible CPU context of the first CPU is updated, the first CPU and the second CPU reset their respective non-software-visible microarchitecture states and retain their respective software-visible CPU contexts, so that the first CPU and the second CPU re-enter the lockstep mode.
  • the error CPU resets all hardware states that are not visible to software, clears the data in the CPU cache, and retains the software-visible state in system registers and general registers.
  • the software-visible states set for the at least two CPUs are exactly the same.
  • the software-visible states of the at least two CPUs are still the same, and both CPUs obtain data and instructions from the external memory in the same way and receive the same input instruction stream.
  • the performing error recovery on the first CPU according to the state of the second CPU that is running correctly among the at least two CPUs at the time of interruption includes:
  • the first CPU and the second CPU are each reset, and an initialization instruction is executed to restore the software-visible CPU context, so that the first CPU and the second CPU enter the lockstep mode again, wherein the initialization instruction includes the software-visible CPU context of the second CPU at the time of interruption, and the initialization instruction is used to restore the software-visible CPU context to the software-visible CPU context of the second CPU at the time of interruption.
  • the CPU context includes the value of system registers and the value of general-purpose registers.
  • the software-visible states set for the at least two CPUs are exactly the same.
  • the software-visible states of the at least two CPUs are still the same, and both CPUs obtain data and instructions from the external memory in the same way and receive the same input instruction stream.
  • the determining the first CPU in which an error occurs among the at least two CPUs and the type of the error include:
  • the first CPU determines the type of the error according to the advanced configuration and power interface (ACPI) table corresponding to the first CPU, where the ACPI table is used to record the errors found by polling the status registers of the reliability, availability and serviceability (RAS) nodes of the CPU.
  • the CPU when a RAS error occurs in the CPU, the CPU generates an interrupt or system exception and enters UEFI or BIOS.
  • UEFI or BIOS traverses the status register of each RAS node and records the errors corresponding to the CPU in a memory table (that is, the ACPI table). The ACPI driver of the operating system can then parse the table to learn which node in the system has which type of error.
  • the first CPU polls the status register of the RAS node of the first CPU to determine the type of the error.
  • the RAS driver directly traverses the status register of each RAS node to determine the cause of the error, instead of querying the ACPI table.
  • the second CPU may also poll the status register of the RAS node of the second CPU to determine that the second CPU is running correctly.
  • the second CPU may also determine that the second CPU is operating correctly according to the ACPI table corresponding to the second CPU.
  • each CPU can determine whether it has made an error without querying the RAS node or ACPI table. In other words, at this time, you can directly determine which CPUs are the CPUs that have the error and which are the CPUs that are running correctly.
  • the at least two CPUs receiving interrupts include:
  • the at least two CPUs receive the interrupt sent by the interrupt controller, wherein the interrupt controller sends the interrupt to the at least two CPUs when the comparison circuit determines that the outputs of the at least two CPUs are inconsistent.
  • the outputs of the at least two CPUs include at least one of the internal bus output of each of the at least two CPUs, the external bus output of each CPU, and the layer 3 cache control logic output of each CPU.
  • the determining the first CPU in which an error occurs among the at least two CPUs and the type of the error include:
  • when the comparator determines that the obtained CPU outputs are inconsistent, it can report a RAS interrupt error and, at the same time, provide information about the mismatched data in the register of the RAS node corresponding to the comparator, such as at least one of the erroneous data address, the erroneous module, and the error type.
  • the error recovery method shown in FIG. 8 can implement each process of the error recovery method corresponding to the foregoing method embodiment.
  • the error recovery method of the embodiment of the present application is described in detail above with reference to FIGS. 1 to 8.
  • the error recovery apparatus of the embodiment of the present application is described in detail below with reference to FIG. 9. It should be understood that the error recovery apparatus of FIG. 9 can execute each step of the error recovery method of the embodiment of the present application. When the error recovery apparatus shown in FIG. 9 is described below, repeated descriptions are appropriately omitted.
  • FIG. 9 is a schematic block diagram of an error recovery apparatus 900 according to an embodiment of the present application.
  • the device 900 shown in FIG. 9 includes a lockstep CPU 910, and the lockstep CPU 910 includes a first CPU 9110 and a second CPU 9120.
  • the first CPU 9110 is configured to receive an interrupt, and the interrupt is triggered by an error of the first CPU 9110 when the first CPU 9110 and the second CPU 9120 are in lockstep mode;
  • in response to the interrupt, exit lockstep mode and determine the type of the error;
  • the second CPU 9120 is used to receive the interrupt and exit the lockstep mode.
  • the first CPU 9110 is specifically configured to:
  • the second CPU 9120 is also used to save the CPU context visible to the software of the second CPU 9120 at the time of interruption and the data in the cache to the memory.
  • the first CPU 9110 is specifically configured to:
  • the first CPU 9110 is also used to reset the non-software-visible microarchitecture state of the first CPU 9110 after updating the software-visible CPU context, and to retain the software-visible CPU context of the first CPU 9110, so that the first CPU 9110 re-enters the lockstep mode.
  • the second CPU 9120 is further configured to: after the first CPU 9110 updates the software-visible CPU context, reset the non-software-visible microarchitecture state of the second CPU 9120 and retain the software-visible CPU context of the second CPU 9120, so that the second CPU 9120 re-enters the lockstep mode.
  • the first CPU 9110 is specifically used to reset and, after the reset, execute an initialization instruction to restore the software-visible CPU context, so that the first CPU 9110 re-enters the lockstep mode, wherein the initialization instruction includes the software-visible CPU context of the second CPU 9120 at the time of the interrupt, and the initialization instruction is used to restore the software-visible CPU context to the software-visible CPU context of the second CPU 9120 at the time of the interrupt.
  • the CPU context includes the values of system registers and general-purpose registers.
  • the second CPU 9120 is specifically configured to reset, and execute the initialization instruction after the reset, so that the second CPU 9120 enters the lockstep mode again.
  • the first CPU and the second CPU may reset at the same time and execute the initialization instruction at the same time, so that the first CPU and the second CPU enter the lockstep mode again.
  • the first CPU 9110 is specifically configured to:
  • the first CPU 9110 is specifically configured to: receive the interrupt sent by an interrupt controller, wherein the interrupt controller sends the interrupt to the first CPU 9110 and the second CPU 9120 when a comparison circuit determines that the outputs of the first CPU 9110 and the second CPU 9120 are inconsistent;
  • the second CPU 9120 is specifically configured to: receive the interrupt sent by the interrupt controller.
  • the first CPU 9110 is also used to:
  • the first CPU 9110 is also used to stop operation
  • the second CPU 9120 is also used to stop operation.
  • the device 900 may further include the foregoing interrupt controller and the foregoing comparison circuit,
  • the comparison circuit is used to obtain the outputs of the first CPU 9110 and the second CPU 9120, and send a first signal to the interrupt controller when it is determined that the outputs of the first CPU 9110 and the second CPU 9120 are inconsistent,
  • the first signal is used to instruct the interrupt controller to send an interrupt to the first CPU 9110 and the second CPU 9120;
  • the interrupt controller sends the interrupt to the first CPU 9110 and the second CPU 9120 according to the first signal.
  • the system may further include a storage unit 920.
  • the storage unit 920 is used to store instructions.
  • the storage unit 920 may also be used to store data or information.
  • the storage unit 920 may be implemented by a memory.
  • the first CPU 9110 and the second CPU 9120 may be used to execute the instructions stored in the storage unit 920, so that the device 900 can implement the above-mentioned error recovery method.
  • the first CPU 9110, the second CPU 9120, and the storage unit 920 can communicate with each other through internal connection paths to transfer control and/or data signals.
  • the storage unit 920 is used to store a computer program, and the first CPU 9110 and the second CPU 9120 can be used to call and run the computer program from the storage unit 920 to complete the above error recovery method.
  • the storage unit 920 can be integrated in the lockstep CPU 910, or can be provided separately from the lockstep CPU 910.
  • the memory may be one or more of the following types: flash memory, hard disk type memory, micro multimedia card type memory, card type memory (such as SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk.
  • the aforementioned memory may store a computer program (the computer program is a program corresponding to the error recovery method of the embodiment of the present application), and when the processing unit executes the computer program, the processing
  • the memory also stores other data besides the computer program.
  • the memory can store data during the processing of the error recovery method of the present application.
  • the device 900 shown in FIG. 9 can implement each process of the error recovery method corresponding to the foregoing method embodiment. Specifically, the device 900 can refer to the above description, and to avoid repetition, details are not repeated here.
  • FIG. 10 shows a schematic block diagram of an error recovery apparatus 1000 according to an embodiment of the present application, including: a determination unit 1010 and a recovery unit 1020,
  • the determining unit 1010 is configured to determine the type of the error of the first CPU.
  • the recovery unit 1020 is configured to, based on the type of the error being a recoverable error, perform error recovery on the first CPU according to the state, at the time of the interrupt, of the second CPU that is running correctly among the at least two CPUs.
  • the restoration unit 1020 is specifically configured to:
  • a CPU context management unit is further included, which is configured to save the CPU context visible to the software when the second CPU is interrupted and the data in the cache to the memory.
  • an initialization unit is further included, which is configured to execute an initialization instruction to restore the software-visible CPU context after the first CPU and the second CPU are reset, so that the first CPU and the second CPU re-enter the lockstep mode, wherein the initialization instruction includes the software-visible CPU context of the second CPU at the time of the interrupt, and the initialization instruction is used to restore the software-visible CPU context to the software-visible CPU context of the second CPU at the time of the interrupt.
  • the determining unit 1010 is specifically configured to:
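  • the determining unit 1010 is specifically configured to: query the status register of the RAS node corresponding to the comparison circuit to determine the first CPU in which the error occurred among the at least two CPUs, and the type of the error, wherein the comparison circuit is used to send a first signal to the interrupt controller when it determines that the outputs of the at least two CPUs are inconsistent, and the first signal is used to instruct the interrupt controller to send an interrupt to the at least two CPUs to trigger the at least two CPUs to exit the lockstep mode.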
  • the output of the at least two CPUs includes at least one of the internal bus output of each of the at least two CPUs, the external bus output of each CPU, and the layer 3 (L3) cache control logic output of each CPU.
  • the determining unit 1010 is further configured to control the at least two CPUs to stop running based on the type of the error being an unrecoverable error.
  • the error recovery apparatus 1000 shown in FIG. 10 can implement the corresponding process of the error recovery method corresponding to the foregoing method embodiment. Specifically, for the error recovery apparatus 1000, refer to the above description; to avoid repetition, details are not repeated here.
  • the foregoing error recovery device may be a terminal, or a device (for example, a chip, or a device that can be used with the terminal) for performing error recovery in the terminal.
  • the terminal may specifically be a smart phone, a vehicle-mounted device, or a wearable device.
  • the aforementioned vehicle-mounted device may be a computer system independent of the automobile but applicable to the automobile, or may be a computer system integrated into the automobile (for example, an autonomous vehicle).
  • the embodiments of the present application also provide a computer-readable storage medium, and the computer-readable storage medium stores program code, where the program code includes instructions for executing part or all of the operations in the method described in any of the foregoing embodiments.
  • the foregoing computer-readable storage medium is located in a terminal, and the terminal may be a device capable of error recovery.
  • the embodiments of the present application also provide a computer program product, which when the computer program product runs on the error recovery device, causes the error recovery device to perform some or all of the operations in the method described in any of the foregoing embodiments.
  • An embodiment of the present application also provides a chip, the chip includes a processor, and the processor is configured to perform part or all of the operations in the method described in any of the foregoing embodiments.
  • the sequence numbers of the above-mentioned processes do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, a network device, or the like) execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)
  • Software Systems (AREA)
  • Studio Devices (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

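This application provides an error recovery method, apparatus, and system. In the embodiments of this application, at least two CPUs in lockstep mode can exit lockstep mode when an error occurs in at least one of the CPUs, and the CPU in which the error occurred and the type of the error are determined. On that basis, when the error is recoverable, the faulty CPU can be recovered from the correctly running CPU, which helps the at least two CPUs resume running from the point at which the service program was interrupted. The embodiments of this application can therefore improve the error recovery capability of a lockstep system and increase system reliability.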

Description

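Error recovery method and apparatus
This application claims priority to Chinese Patent Application No. 201910473113.6, filed with the China Patent Office on May 31, 2019 and entitled "Error recovery method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the computer field, and more specifically, to an error recovery method and apparatus in the computer field.
Background
Trends such as autonomous driving have made functional safety a key metric in the automotive industry, and more and more software and hardware systems must be safe. These safety systems must operate reliably to ensure personal safety, even in the event of a fault or accident. This requires safety redundancy to be considered at multiple levels, including the overall development process, hardware, software, and algorithms. When a partition fails, the error must be detected and recovered from promptly without affecting the functions of other partitions.
To meet these safety requirements, lockstep systems have emerged. A lockstep system is a fault-tolerant computer system that uses a lockstep mechanism and achieves safety redundancy by running the same set of operations in parallel at the same time. In a lockstep system, two independent central processing units (CPUs) execute the same instructions in the same clock cycles. Each CPU has its own error checking function, for example error correction code (ECC) or parity checking, and a comparator compares the outputs of the two CPUs. When the comparison result differs in two or more bits, and the check fails on one CPU while it passes on the other, lockstep is disabled and released, so that the CPU whose check failed exits and the CPU whose check passed continues to work normally. When the comparison result differs in only one bit and the check fails on only one CPU, the system is restored to the previous state. When the checks fail on both CPUs, or the checks pass on both CPUs but their outputs are inconsistent, the two CPUs lose synchronization and the system stops. It can be seen that, in an existing lockstep system, when the comparison result differs in only one bit and the check fails on only one CPU, both CPUs are restored to the last saved state preceding their current running state and rerun from there; if a multi-bit error occurs, the error cannot be repaired, lockstep mode is exited, and the service is stopped. The error recovery capability of existing lockstep systems is therefore weak, and system reliability can hardly meet the requirements of safety services.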
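The prior-art behaviour described in the background can be summarised as a small decision function. The sketch below is only an illustration of those rules: the function name, the action labels, and the "continue" case for identical outputs are assumptions made for this example and do not come from the application.

```python
def legacy_lockstep_action(mismatched_bits: int, check_ok_a: bool, check_ok_b: bool) -> str:
    """Action taken by a conventional two-CPU lockstep system (illustrative only)."""
    if mismatched_bits == 0:
        # Assumption: identical comparator inputs require no action.
        return "continue"
    failed_checks = [check_ok_a, check_ok_b].count(False)
    if failed_checks == 1:
        if mismatched_bits == 1:
            # One-bit mismatch, one CPU fails its check: restore the previous state.
            return "rollback_to_previous_state"
        # Two or more mismatched bits, one CPU fails its check:
        # lockstep is released and the failed CPU exits.
        return "drop_failed_cpu"
    # Both checks fail, or both pass while the outputs still differ: system stops.
    return "system_stop"
```

Only the single-bit case keeps both CPUs in service; every multi-bit case removes at least one CPU, which is the weakness the application sets out to address.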
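Summary
This application provides an error recovery method and apparatus that can improve the error recovery capability of a lockstep system and increase system reliability.
According to a first aspect, an error recovery method is provided, including: receiving an interrupt when an error occurs in a first CPU among at least two central processing units (CPUs) that are in lockstep mode; exiting, by the at least two CPUs, lockstep mode in response to the interrupt; determining the error type of the first CPU in which the error occurred; and, based on the type of the error being a recoverable error, performing error recovery on the first CPU according to the state, at the time of the interrupt, of a correctly running second CPU among the at least two CPUs. Thus, in the solution of the embodiments of this application, based on a determination of the error type of the lockstep CPU, when the error type is recoverable, the faulty CPU can be recovered according to the state of the correctly running CPU, so that the at least two CPUs resume running from the point at which the service program was interrupted. The embodiments of this application can therefore improve the error recovery capability of a lockstep system and increase system reliability.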
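The flow of the first aspect can be sketched in a few lines. This is a minimal software model under stated assumptions, not the claimed implementation: the `Cpu` class, the `ErrorType` enum, and the `handle_lockstep_interrupt` function are hypothetical names, and the error type is passed in rather than read from hardware.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ErrorType(Enum):
    RECOVERABLE = auto()
    UNRECOVERABLE = auto()


@dataclass
class Cpu:
    name: str
    context: dict = field(default_factory=dict)  # software-visible registers
    lockstep: bool = True
    running: bool = True


def handle_lockstep_interrupt(faulty: Cpu, healthy: Cpu, error_type: ErrorType) -> str:
    """Hypothetical handler for the interrupt raised when the first CPU fails."""
    # In response to the interrupt, both CPUs exit lockstep mode.
    faulty.lockstep = healthy.lockstep = False

    if error_type is ErrorType.UNRECOVERABLE:
        # Unrecoverable error: the CPUs stop running.
        faulty.running = healthy.running = False
        return "stopped"

    # Recoverable error: rebuild the faulty CPU from the healthy CPU's
    # state at the time of the interrupt.
    faulty.context = dict(healthy.context)
    # Both CPUs re-enter lockstep and resume from the interrupted point.
    faulty.lockstep = healthy.lockstep = True
    return "recovered"
```

The model copies the context directly; the implementations described in this application obtain it either from memory or over a hardware channel between the CPUs.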
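With reference to the first aspect, in some implementations of the first aspect, the state of the second CPU at the time of the interrupt includes the software-visible CPU context of the second CPU at the time of the interrupt, and the CPU context includes the values of system registers and the values of general-purpose registers. Performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU among the at least two CPUs includes: obtaining, from memory, the software-visible CPU context of the second CPU at the time of the interrupt, and updating the software-visible CPU context in the first CPU according to the software-visible CPU context of the second CPU.
With reference to the first aspect, in some implementations of the first aspect, the software-visible CPU context of the second CPU at the time of the interrupt, and the data in its cache, are saved to memory. The software-visible CPU context of the first CPU at the time of the interrupt, and the data in its cache, are also saved to memory.
With reference to the first aspect, in some implementations of the first aspect, when the at least two CPUs in a lockstep CPU exit lockstep mode and enter split mode, the number of software-visible CPUs changes from one to several. At this point, on the one hand, initializing the memory stacks for the CPU contexts ensures that the contexts of the several CPUs are saved to different stacks, which avoids data being overwritten. At the same time, the data in the CPU L1/L2 caches is flushed to external memory, which ensures that no data is lost when lockstep mode is re-entered. On the other hand, each of the at least two CPUs jumps to the entry of the exception vector table and synchronizes the CPU errors, which ensures that any asynchronous error in the system at that moment is reported immediately, in preparation for the subsequent query of the error type.
With reference to the first aspect, in some implementations of the first aspect, performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU among the at least two CPUs includes: obtaining, by the first CPU through a hardware channel between the first CPU and the second CPU, the software-visible CPU context of the second CPU at the time of the interrupt, and updating the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of system registers and the values of general-purpose registers.
It should be noted that, in some special cases, such as a system hang, it is not known at which register level the error occurred. In such cases, the hardware-channel-based approach can be used to repair the registers at all levels.
With reference to the first aspect, in some implementations of the first aspect, after the software-visible CPU context of the first CPU is updated, the first CPU and the second CPU each reset their own non-software-visible microarchitecture state and retain their own software-visible CPU context, so that the first CPU and the second CPU re-enter lockstep mode. In other words, the faulty CPU resets all non-software-visible hardware state and clears the data in the CPU cache, while retaining the software-visible state in the system registers and general-purpose registers. Therefore, before the reset, the software-visible states set in the at least two CPUs are exactly the same; after the reset, the software-visible states of the at least two CPUs are still the same, and the CPUs all fetch data and instructions from external memory and receive the same input instruction stream.
With reference to the first aspect, in some implementations of the first aspect, performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU among the at least two CPUs includes: resetting the first CPU and the second CPU respectively, and executing an initialization instruction to restore the software-visible CPU context, so that the first CPU and the second CPU re-enter lockstep mode, where the initialization instruction includes the software-visible CPU context of the second CPU at the time of the interrupt, the initialization instruction is used to restore the software-visible CPU context to the software-visible CPU context of the second CPU at the time of the interrupt, and the CPU context includes the values of system registers and the values of general-purpose registers.
In some implementations, the first CPU and the second CPU may be reset at the same time and execute the initialization instruction at the same time, so that the first CPU and the second CPU re-enter lockstep mode. Therefore, before the reset, the software-visible states set in the at least two CPUs are exactly the same; after the reset, the software-visible states of the at least two CPUs are still the same, and the CPUs all fetch data and instructions from external memory and receive the same input instruction stream.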
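With reference to the first aspect, in some implementations of the first aspect, determining the first CPU in which an error occurred among the at least two CPUs, and the type of the error, includes: determining, by the first CPU, the type of the error according to the advanced configuration and power management interface (ACPI) table corresponding to the first CPU, where the ACPI table records the errors found when polling the status registers of the reliability, availability, and serviceability (RAS) nodes of the CPU. In this way, when a RAS error occurs in a CPU, the CPU generates an interrupt or a system exception and enters UEFI or BIOS; UEFI or BIOS traverses the status registers of the RAS nodes and records the error corresponding to the CPU in a table in memory (that is, the ACPI table), so the ACPI driver of the operating system can parse the table to learn which node in the system has which type of error. Alternatively, the first CPU polls the status registers of the RAS nodes of the first CPU to determine the type of the error. In this way, when a RAS error occurs in a CPU, the CPU is interrupted or a system exception occurs, and the RAS driver directly traverses the status registers of the RAS nodes in turn to determine the cause of the error, instead of obtaining it by querying the ACPI table.
In a possible implementation, the second CPU may also poll the status registers of the RAS nodes of the second CPU to determine that the second CPU is running correctly.
In a possible implementation, the second CPU may also determine, according to the ACPI table corresponding to the second CPU, that the second CPU is running correctly.
In a possible implementation, when the at least two CPUs enter split mode, each CPU can determine whether it has an error without querying the RAS nodes or the ACPI table. In other words, it can be determined directly at that point which CPUs have an error and which CPUs are running correctly.
With reference to the first aspect, in some implementations of the first aspect, receiving the interrupt by the at least two central processing units (CPUs) in lockstep mode includes: receiving, by the at least two CPUs, the interrupt sent by an interrupt controller, where the interrupt controller sends the interrupt to the at least two CPUs when a comparison circuit determines that the outputs of the at least two CPUs are inconsistent.
In a possible implementation, the comparison circuit may be implemented by a dedicated hardware circuit that is not placed on the critical path; for example, it may be placed outside the CPU, so that the comparison circuit does not affect CPU performance.
In a possible implementation, the comparison circuit is a comparison circuit at the granularity of the CPU clock cycle. Specifically, the comparison circuit corresponding to a lockstep CPU shares a clock source with that lockstep CPU, which keeps the comparison circuit and the CPU at the same frequency and enables cycle-by-cycle data comparison, so that errors can be detected promptly and error recovery or other further processing can start as early as possible.
With reference to the first aspect, in some implementations of the first aspect, the outputs of the at least two CPUs include at least one of the internal bus output of each of the at least two CPUs, the output of each CPU to the external bus, and the layer 3 (L3) cache control logic output of each CPU.
With reference to the first aspect, in some implementations of the first aspect, determining the first CPU in which an error occurred among the at least two CPUs, and the type of the error, includes:
querying the status register of the RAS node corresponding to the comparison circuit to determine the first CPU in which the error occurred among the at least two CPUs, and the type of the error.
In this case, when the comparator determines that the obtained CPU outputs are inconsistent, it can report a RAS interrupt error and, at the same time, provide information about the mismatched data in the register of the RAS node corresponding to the comparator, such as at least one of the erroneous data address, the erroneous module, and the error type.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: based on the type of the error being an unrecoverable error, stopping the at least two CPUs.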
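According to a second aspect, an error recovery apparatus is provided, including a first central processing unit (CPU) and a second CPU.
The first CPU is configured to: receive an interrupt, where the interrupt is triggered by an error occurring in the first CPU while the first CPU and the second CPU are in lockstep mode; in response to the interrupt, exit lockstep mode and determine the type of the error; and, based on the type of the error being a recoverable error, perform error recovery according to the state of the second CPU at the time of the interrupt. The second CPU is configured to receive the interrupt and exit lockstep mode.
With reference to the second aspect, in some implementations of the second aspect, the first CPU is specifically configured to: obtain, from memory, the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of system registers and the values of general-purpose registers.
With reference to the second aspect, in some implementations of the second aspect, the second CPU is further configured to save the software-visible CPU context of the second CPU at the time of the interrupt, and the data in its cache, to memory.
With reference to the second aspect, in some implementations of the second aspect, the first CPU is specifically configured to: obtain, through a hardware channel between the first CPU and the second CPU, the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of system registers and the values of general-purpose registers.
With reference to the second aspect, in some implementations of the second aspect, the first CPU is further configured to: after updating the software-visible CPU context, reset the non-software-visible microarchitecture state of the first CPU and retain the software-visible CPU context of the first CPU, so that the first CPU re-enters lockstep mode. The second CPU is further configured to: after the first CPU updates the software-visible CPU context, reset the non-software-visible microarchitecture state of the second CPU and retain the software-visible CPU context of the second CPU, so that the second CPU re-enters lockstep mode.
With reference to the second aspect, in some implementations of the second aspect, the first CPU is specifically configured to reset and, after the reset, execute an initialization instruction to restore the software-visible CPU context, so that the first CPU re-enters lockstep mode, where the initialization instruction includes the software-visible CPU context of the second CPU at the time of the interrupt, the initialization instruction is used to restore the software-visible CPU context to the software-visible CPU context of the second CPU at the time of the interrupt, and the CPU context includes the values of system registers and the values of general-purpose registers.
The second CPU is specifically configured to reset and, after the reset, execute the initialization instruction, so that the second CPU re-enters lockstep mode.
In some implementations, the first CPU and the second CPU may be reset at the same time and execute the initialization instruction at the same time, so that the first CPU and the second CPU re-enter lockstep mode.
With reference to the second aspect, in some implementations of the second aspect, the first CPU is specifically configured to: determine the type of the error according to the advanced configuration and power management interface (ACPI) table corresponding to the first CPU, where the ACPI table records the errors found when polling the status registers of the reliability, availability, and serviceability (RAS) nodes of the CPU; or poll the status registers of the RAS nodes of the first CPU to determine the type of the error.
With reference to the second aspect, in some implementations of the second aspect, the first CPU is specifically configured to: receive the interrupt sent by an interrupt controller, where the interrupt controller sends the interrupt to the first CPU and the second CPU when a comparison circuit determines that the outputs of the first CPU and the second CPU are inconsistent. The second CPU is specifically configured to: receive the interrupt sent by the interrupt controller.
With reference to the second aspect, in some implementations of the second aspect, the output of a CPU includes at least one of the internal bus output of the CPU, the output of the CPU to the external bus, and the layer 3 (L3) cache control logic output.
With reference to the second aspect, in some implementations of the second aspect, the first CPU is further configured to: query the status register of the RAS node corresponding to the comparison circuit to determine the first CPU in which the error occurred, and the type of the error.
With reference to the second aspect, in some implementations of the second aspect, based on the type of the error being an unrecoverable error, the first CPU is further configured to stop running, and the second CPU is further configured to stop running.
With reference to the second aspect, in some implementations of the second aspect, the apparatus further includes an interrupt controller and a comparison circuit. The comparison circuit is configured to obtain the outputs of the first CPU and the second CPU, and to send a first signal to the interrupt controller when it determines that the outputs of the first CPU and the second CPU are inconsistent, where the first signal is used to instruct the interrupt controller to send an interrupt to the first CPU and the second CPU. The interrupt controller sends the interrupt to the first CPU and the second CPU according to the first signal.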
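According to a third aspect, an error recovery apparatus is provided, characterized by including a determining unit and a recovery unit. In a case where an error occurs in a first CPU among at least two central processing units (CPUs) in lockstep mode and the at least two CPUs exit lockstep mode, the determining unit is configured to determine the type of the error of the first CPU, and the recovery unit is configured to, based on the type of the error being a recoverable error, perform error recovery on the first CPU according to the state, at the time of the interrupt, of a correctly running second CPU among the at least two CPUs.
With reference to the third aspect, in some implementations of the third aspect, the recovery unit is specifically configured to: obtain, from memory, the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of system registers and the values of general-purpose registers.
With reference to the third aspect, in some implementations of the third aspect, the apparatus further includes a CPU context management unit, configured to save the software-visible CPU context of the second CPU at the time of the interrupt, and the data in its cache, to memory.
With reference to the third aspect, in some implementations of the third aspect, the apparatus further includes an initialization unit, configured to execute an initialization instruction after the first CPU and the second CPU are reset, to restore the software-visible CPU context, so that the first CPU and the second CPU re-enter lockstep mode, where the initialization instruction includes the software-visible CPU context of the second CPU at the time of the interrupt, the initialization instruction is used to restore the software-visible CPU context to the software-visible CPU context of the second CPU at the time of the interrupt, and the CPU context includes the values of system registers and the values of general-purpose registers.
With reference to the third aspect, in some implementations of the third aspect, the determining unit is specifically configured to: determine the type of the error according to the advanced configuration and power management interface (ACPI) table corresponding to the first CPU, where the ACPI table records the errors found when polling the status registers of the reliability, availability, and serviceability (RAS) nodes of the CPU; or poll the status registers of the RAS nodes of the first CPU to determine the type of the error.
With reference to the third aspect, in some implementations of the third aspect, the determining unit is specifically configured to: query the status register of the RAS node corresponding to a comparison circuit to determine the first CPU in which the error occurred among the at least two CPUs, and the type of the error, where the comparison circuit is configured to send a first signal to an interrupt controller when it determines that the outputs of the at least two CPUs are inconsistent, and the first signal is used to instruct the interrupt controller to send an interrupt to the at least two CPUs to trigger the at least two CPUs to exit lockstep mode.
With reference to the third aspect, in some implementations of the third aspect, the outputs of the at least two CPUs include at least one of the internal bus output of each of the at least two CPUs, the output of each CPU to the external bus, and the layer 3 (L3) cache control logic output of each CPU.
With reference to the third aspect, in some implementations of the third aspect, the determining unit is further configured to, based on the type of the error being an unrecoverable error, control the at least two CPUs to stop running.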
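According to a fourth aspect, a comparison circuit for querying errors is provided. The comparison circuit is placed outside at least two CPUs that are in lockstep mode, and is configured to: determine that the outputs of the at least two CPUs are inconsistent; and, based on the outputs of the at least two CPUs being inconsistent, send a first signal to an interrupt controller, where the first signal is used to instruct the interrupt controller to send an interrupt to the at least two CPUs, and the interrupt is used to indicate that an error has occurred in at least one of the at least two CPUs.
With reference to the fourth aspect, in some implementations of the fourth aspect, the outputs of the at least two CPUs include at least one of the internal bus output of each of the at least two CPUs, the output of each CPU to the external bus, and the layer 3 (L3) cache control logic output of each CPU.
According to a fifth aspect, an error recovery apparatus is provided. The apparatus includes modules corresponding to the methods/operations/steps/actions described in the first aspect.
According to a sixth aspect, an error recovery apparatus is provided. The apparatus includes a processor, and the processor is configured to call program code stored in a memory to perform some or all of the operations in any implementation of the first aspect.
In the sixth aspect, the memory that stores the program code may be located inside the error recovery apparatus (in addition to the processor, the error recovery apparatus may also include a memory) or outside the error recovery apparatus (it may be the memory of another device). As an example, the processor may be a lockstep CPU, and the lockstep CPU includes at least two physical CPUs.
Optionally, the memory is a non-volatile memory.
When the error recovery apparatus includes a processor and a memory, the processor and the memory may be coupled together.
As an example, the error recovery apparatus may be a terminal, or may be an apparatus in a terminal that is used to perform error recovery (for example, a chip, or an apparatus that can be used together with the terminal). The terminal may specifically be a smartphone, a vehicle-mounted apparatus, a wearable device, or the like. Optionally, the vehicle-mounted apparatus may be a computer system that is independent of the vehicle but can be applied to the vehicle, or may be a computer system integrated into the vehicle (for example, an autonomous vehicle).
According to a seventh aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores program code, where the program code includes instructions for performing some or all of the operations in the method described in the first aspect.
Optionally, the computer-readable storage medium is located in a terminal, and the terminal may be an apparatus capable of performing error recovery.
According to an eighth aspect, an embodiment of this application provides a computer program product. When the computer program product runs on an error recovery apparatus, the error recovery apparatus is caused to perform some or all of the operations in the method described in the first aspect.
According to a ninth aspect, a chip is provided. The chip includes a processor, and the processor is configured to perform some or all of the operations in the method described in the first aspect.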
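Brief Description of Drawings
FIG. 1 shows one implementation form of the system of the embodiments of this application.
FIG. 2 is a schematic diagram of the system architecture provided by the embodiments of this application.
FIG. 3 shows an example of a query manner.
FIG. 4 is a schematic flowchart of an error recovery method provided by the embodiments of this application.
FIG. 5 shows a specific example of lockstep manager initialization.
FIG. 6 shows an example of saving and restoring CPU contexts.
FIG. 7 shows an example of hardware-channel-based error repair provided by the embodiments of this application.
FIG. 8 is a schematic flowchart of an error recovery method provided by the embodiments of this application.
FIG. 9 is a schematic block diagram of an error recovery apparatus provided by the embodiments of this application.
FIG. 10 is a schematic block diagram of an error recovery apparatus provided by the embodiments of this application.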
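Detailed Description
First, the terms used in the embodiments of this application are described.
Lockstep CPU: a logical CPU that contains at least two physical CPUs (which may also simply be called CPUs), or at least two physical cores. As an example, the at least two CPUs may be placed on one chip or distributed across different chips, which is not limited in the embodiments of this application. In some descriptions, a lockstep CPU may also be called a lockstep logical CPU. For convenience, the following description uses the example of one logical CPU that includes at least two CPUs.
When the at least two CPUs in a lockstep CPU are in lockstep mode, they execute the same code or instructions, and the computation result of only one CPU is output. In this case, only one CPU is visible to software, although it internally contains at least two (for example, several) CPUs.
Split CPU: when the at least two CPUs in a lockstep CPU exit lockstep mode and become normal CPUs that run independently and separately, the at least two physical CPUs that have exited lockstep mode are said to be in split mode. In this case, the at least two CPUs are visible to software.
It can be understood that at least two CPUs in lockstep mode should have the same output. Once the outputs of the at least two CPUs are inconsistent, at least one CPU must be running incorrectly (that is, an error has occurred). When one CPU has an error, the lockstep CPU is abnormal, and the CPUs in the lockstep CPU need to exit lockstep mode and enter split mode.
CPU exception jump: while a CPU is running, if an error occurs or an interrupt needs to be serviced, the CPU jumps to the entry of the exception vector table or interrupt vector table, and a function then handles the error or interrupt. After handling is complete, the CPU can return to the point at which it was interrupted and continue execution. As an example, when a lockstep CPU is abnormal, the CPUs in the lockstep CPU perform an exception jump, enter split mode, and perform error recovery.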
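The technical solutions in this application are described below with reference to the accompanying drawings.
FIG. 1 shows one implementation form of the system of the embodiments of this application in platform software and hardware. As shown in FIG. 1, the hardware part may include a central processing unit (CPU), a graphics processing unit (GPU), memory, and so on. The CPUs include lockstep CPU0, lockstep CPU1, normal CPU2, normal CPU3, and so on, which is not specifically limited in the embodiments of this application. A lockstep CPU, which may also be called a lockstep logical CPU, includes at least two CPUs (also called physical CPUs); as an example, one of them may be called the primary CPU, and the others may be called secondary CPUs or redundant CPUs. The software part includes the different service programs that run, and the software modules that manage the hardware modules. As an example, the service programs include automotive safety integrity level (ASIL)-D service program #1, ASIL-D service program #2, an ASIL-B service program, ordinary programs, and so on. As an example, the software modules that manage the hardware modules include error manager #1, which manages lockstep CPU0, and error manager #2, which manages lockstep CPU1.
It can be understood that, because a lockstep CPU can meet safety requirements, service programs with higher safety-level requirements can run on lockstep CPUs, and service programs with lower safety-level requirements can run on normal CPUs. For example, ASIL-D service program #1 runs on lockstep CPU0, ASIL-D service program #2 runs on lockstep CPU1, and the ASIL-B service program and ordinary programs can run on CPU2 or CPU3. Applications of different safety levels are isolated by means of containers or virtual machines, so that the failure of one partition does not affect the running of programs in another partition.
FIG. 2 is a schematic diagram of a system architecture provided by the embodiments of this application. The system architecture of the embodiments of this application includes a hardware architecture and a software architecture. The hardware architecture provides the hardware platform for error detection and repair, and the software architecture provides the error repair solution based on that hardware platform.
The hardware architecture may also be called the hardware layer or the underlying hardware layer. The hardware layer may include at least one lockstep CPU and an interrupt controller. The interrupt controller performs interrupt control when an error occurs in some of the CPUs in a lockstep CPU.
As shown in FIG. 2, the hardware layer includes lockstep CPU0 and lockstep CPU1. Lockstep CPU0 further includes primary CPU0 and at least one secondary CPU0, and lockstep CPU1 further includes primary CPU1 and at least one secondary CPU1. FIG. 2 shows only one secondary CPU as an example, but this does not limit the embodiments of this application.
Optionally, in the embodiments of this application, each lockstep CPU is provided with at least one comparator (also called a comparison circuit), which obtains and compares the outputs of the at least two CPUs included in the lockstep CPU. As an example, a comparator placed outside the lockstep CPU can obtain and compare the output of each CPU included in the lockstep CPU.
Specifically, the comparison circuit may be implemented by a dedicated hardware circuit that is not placed on the critical path; for example, it may be placed outside the CPU, so that the comparison circuit does not affect CPU performance.
Optionally, the comparison circuit is a comparison circuit at the granularity of the CPU clock cycle. Specifically, the comparison circuit corresponding to a lockstep CPU shares a clock source with that lockstep CPU, which keeps the comparison circuit and the CPU at the same frequency and enables cycle-by-cycle data comparison, so that errors can be detected promptly and error recovery or other further processing can start as early as possible. As an example, the at least one comparator may be placed on the same chip as the lockstep CPU so as to share the clock source with the lockstep CPU, but the embodiments of this application are not limited to this.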
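Optionally, in the embodiments of this application, the outputs of the CPUs include at least one of the internal bus output of each of the at least two CPUs, the output of each CPU to the external bus, and the layer 3 cache (L3 cache) control logic output (L3_CTRL) corresponding to each CPU. As an example, the internal bus output of a CPU is, for example, the L1 cache of the CPU, and the output of the CPU to the external bus is, for example, the L2 cache of the CPU.
In the embodiments of this application, an L3_CTRL corresponding to the secondary CPU, that is, a redundant L3_CTRL, can be added. As an example, as shown in FIG. 2, the L3 cache control logic of lockstep CPU0 includes, for example, L3_CTRL0, L3_RAM, and L3_CTRL0', and the L3 control logic of lockstep CPU1 includes, for example, L3_CTRL1, L3_RAM, and L3_CTRL1', which is not limited in the embodiments of this application.
As an example, as shown in FIG. 2 and taking lockstep CPU0 as an example, CPU internal output comparator 0 can be used to compare the internal bus outputs of primary CPU0 and the at least one secondary CPU0, CPU external output comparator 0 can be used to compare the external bus outputs of primary CPU0 and the at least one secondary CPU0, and L3 cache control logic output comparator 0 can be used to compare the L3 control logic output of primary CPU0 (L3_CTRL0) with that of the at least one secondary CPU0 (L3_CTRL0').
It should be noted that the CPU internal output comparator may be placed outside the CPU and obtain the CPU internal bus output through data lines, which is not limited in the embodiments of this application.
It should be noted that the hardware layer in FIG. 2 is only an example and does not limit this application in any way.
For example, in the embodiments of this application, a lockstep CPU may be provided with one or two of the CPU internal output comparator, the CPU external output comparator, and the L3 control logic output comparator. As another example, different lockstep CPUs may use different comparator arrangements; for instance, lockstep CPU0 may be provided only with CPU internal output comparator 0, while lockstep CPU1 is provided only with CPU external output comparator 1, and so on.
As a specific example, the CPU external output comparator can be provided as the first-level comparison circuit and the L3 control logic output comparator as the second-level comparison circuit, without providing a CPU internal output comparator, that is, without comparing the data output on the CPU internal bus. This saves one level of comparison circuit. In this case, an error inside the CPU can be detected by the comparison circuit outside the CPU when the error propagates out of the CPU.
As another example, in the embodiments of this application, a lockstep CPU may include two physical CPUs, or three physical CPUs.
In a possible implementation, when a comparator (for example, any of the comparators above) finds that the outputs of the at least two CPUs in lockstep mode are inconsistent, it can send a signal to the interrupt controller, where the signal is used to instruct the interrupt controller to send an interrupt to the at least two CPUs. After receiving the signal, the interrupt controller sends an interrupt to the lockstep CPU, and the interrupt indicates that the at least two CPUs are abnormal. When the at least two CPUs in the lockstep CPU receive the interrupt, they exit lockstep mode, that is, they enter split mode. In split mode, the comparator does not operate.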
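The signalling path from the comparator through the interrupt controller to the CPUs can be modelled in software as follows. This is a toy model under stated assumptions: the class and function names are hypothetical, the real comparator is a cycle-level hardware circuit, and a CPU's mode is represented by a plain string.

```python
class InterruptController:
    """Toy interrupt controller that fans one interrupt out to every CPU."""

    def __init__(self, cpu_modes: dict):
        # Maps a CPU name to its current mode, "lockstep" or "split".
        self.cpu_modes = cpu_modes

    def raise_mismatch_interrupt(self) -> None:
        # Every CPU in the lockstep group receives the interrupt and
        # leaves lockstep mode, that is, it enters split mode.
        for cpu in self.cpu_modes:
            self.cpu_modes[cpu] = "split"


def compare_outputs(outputs: dict, controller: InterruptController) -> bool:
    """Compare one cycle's outputs; signal the controller on a mismatch."""
    consistent = len(set(outputs.values())) <= 1
    if not consistent:
        controller.raise_mismatch_interrupt()
    return consistent
```

In this model a matching cycle leaves the modes untouched, and the first mismatching cycle moves every CPU of the group to split mode.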
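In a possible implementation, in split mode, the L3_CTRL corresponding to the primary CPU in the lockstep CPU operates, and the redundant L3_CTRL corresponding to the secondary CPU is in the gated_off state. In this case, the requests of all CPUs in the lockstep CPU (including the primary CPU and the secondary CPU) are sent to the L3_CTRL that is in the operating state, and are then converted by L3_CTRL and output to L3_RAM. As an example, the requests issued by a CPU are, for example, read/write requests, query requests, or replacement requests, which is not limited in the embodiments of this application.
The software architecture may also be called the software layer. As shown in FIG. 2, the software layer mainly includes a lockstep manager, a reliability, availability and serviceability (RAS) error manager, and a health monitoring module. The lockstep manager manages the at least two CPUs in a lockstep CPU. The RAS error manager determines the CPU in which an error occurred, and the type of the error, when an error occurs in some of the CPUs in a lockstep CPU. The health monitoring module is responsible for making decisions according to the error type.
As an example, the lockstep manager may include a lockstep configurator, a split mode manager, a CPU context manager, an error query and repair unit, and a reset-sync operator.
Lockstep configurator: sets at least two physical CPUs in the computer system as one lockstep logical CPU, and sets the number of lockstep logical CPUs in the system.
Split mode manager: manages the lockstep exception vector table and the interrupt handling function. When the comparator finds that the data output by the at least two CPUs in a lockstep CPU is inconsistent, the interrupt controller sends an interrupt to the at least two CPUs, and the at least two CPUs move from lockstep mode to split mode. At this point, each of the at least two CPUs in split mode jumps to the entry of the exception vector table and calls the CPU context manager and the interrupt handling function.
In a possible implementation, when the at least two CPUs enter split mode, each CPU can determine whether it has an error. In other words, it can be determined at that point which CPUs have an error and which CPUs are running correctly.
CPU context manager: saves the software-visible CPU context of each of the at least two CPUs at the time of exiting lockstep mode, together with the data in the L1/L2 caches, to different stacks in the L3 cache or in memory, in preparation for the subsequent error repair. Here, the software-visible CPU context includes the CPU state in kernel mode and user mode, that is, the data of the system registers and the data of the general-purpose registers corresponding to the CPU.
Error query and repair unit: the interrupt handling function can call the error query and repair unit. As one example, when the faulty CPU can be identified as the CPUs enter split mode, the error query and repair unit can query the RAS error manager corresponding to the faulty CPU to determine the error type of the faulty CPU. As another example, when the faulty CPU has not been identified as the CPUs enter split mode, the error query and repair unit can query the RAS error manager corresponding to each CPU to determine the faulty CPU and the error type.
In the embodiments of this application, the types of errors include recoverable errors and unrecoverable errors. When the error type of a CPU is determined to be unrecoverable, the health monitoring module is notified to make a decision about the faulty CPU, for example, to take the faulty CPU offline. When the error type of a CPU is determined to be recoverable, the error query and repair unit repairs the faulty CPU.
Reset-sync operator: makes the at least two physical CPUs in split mode re-enter lockstep mode. The reset-sync operator may be implemented in hardware or in software, which is not limited in the embodiments of this application.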
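The RAS error manager may include an error parser that works in the advanced configuration and power management interface (ACPI) manner, and an error querier that works in a non-ACPI manner.
As an example, the RAS error manager includes one or more RAS nodes, and each RAS node corresponds to one or more status registers, which store the various types of errors that occur in the CPU.
The ACPI-manner error parser can query errors in the ACPI manner. Specifically, the error parser can query the error state of a CPU through the ACPI table. When a RAS error occurs in a CPU, the CPU generates an interrupt or a system exception and enters the unified extensible firmware interface (UEFI) or the basic input output system (BIOS). UEFI or BIOS then traverses the status registers of the RAS nodes and records the error corresponding to the CPU in a table in memory (that is, the ACPI table); the ACPI driver of the operating system parses the table to learn which node in the system has which type of error.
The non-ACPI-manner error querier can query errors in a non-ACPI manner. As an example, in FIG. 3, the memory management unit (MMU), the L1 data (L1D) cache, the L1 instruction (L1I) cache, the L3 cache, and the L2 cache each have one RAS node. When a RAS error occurs in a CPU, the CPU is interrupted or a system exception occurs, and the RAS driver directly traverses the status registers of the RAS nodes in turn to determine the cause of the error, instead of obtaining it by querying the ACPI table.
It should be noted that, in the embodiments of this application, the ACPI manner may be used first to query errors. If no error is found in this manner, the non-ACPI manner may then be used. This is because, for a producer error in a RAS node, the RAS register records the error but the system does not report it. An exception is reported on the consumer side only when a CPU consumes the erroneous data. In this case, the ACPI table may contain no record of the error, and the non-ACPI manner is needed to poll the status registers of the RAS nodes to determine the error type.
It should be noted that a producer error is an error attributed to whichever component produced it. After such an error is produced, it is not triggered immediately; it is reported only when it is consumed. For example, when memory produces an error, the memory does not actively report it; the error is triggered only when another component reads the erroneous data.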
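The ACPI-first query order with a fallback to polling can be sketched as below. The data layout is an assumption made for this example only: the ACPI table is modelled as a dictionary from CPU ID to a `(node, error_type)` record, and the RAS status registers as a per-CPU dictionary from node name to an error type or `None`. Neither is a real ACPI or RAS register format.

```python
from typing import Optional, Tuple


def query_error(cpu_id: int, acpi_table: dict, ras_status: dict) -> Optional[Tuple[str, str]]:
    """Return (node, error_type) for a CPU, preferring the ACPI record."""
    # First choice: the record that firmware (UEFI/BIOS) wrote to the ACPI table.
    record = acpi_table.get(cpu_id)
    if record is not None:
        return record
    # Fallback: poll each RAS node's status register directly. This catches
    # producer errors that were latched but never reported to the ACPI table.
    for node, error_type in ras_status.get(cpu_id, {}).items():
        if error_type is not None:
            return (node, error_type)
    return None
```

A producer error that was latched in a RAS node but never consumed is missing from the table, so only the polling path finds it.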
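Optionally, in the embodiments of this application, one or more RAS nodes may also be set up for the comparators corresponding to a lockstep CPU; for example, one RAS node may be set up for each of CPU internal output comparator 0, CPU external output comparator 0, and L3 control logic output comparator 0, which is not limited in the embodiments of this application. In this case, when a comparator determines that the obtained CPU outputs are inconsistent, it can report a RAS interrupt error and, at the same time, provide information about the mismatched data in the register of the RAS node corresponding to the comparator, such as at least one of the erroneous data address, the erroneous module, and the error type. The erroneous module includes, for example, the L1 cache, the L2 cache, and the L3 controller.
In addition, the names of the functions or modules in the embodiments of this application are only examples. In a specific implementation, the functions or modules in the system architecture shown in FIG. 2 may have other names, which is not specifically limited in the embodiments of this application.
FIG. 4 is a schematic flowchart of an error recovery method provided by the embodiments of this application. The method shown in FIG. 4 may be performed by the system in FIG. 1 or by the system in FIG. 2, but the embodiments of this application are not limited to this. It should be understood that FIG. 4 shows steps or operations of the method, but these steps or operations are only examples; the embodiments of this application may also perform other operations or variations of the operations in FIG. 4. In addition, the steps in FIG. 4 may be performed in an order different from that presented in FIG. 4, and it is possible that not all operations in FIG. 4 need to be performed.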
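401: Initialize the lockstep manager.
As an example, lockstep manager initialization includes initialization of the resource configuration, initialization of the exception vector table, initialization of the interrupt handling function, and so on, which is not limited in the embodiments of this application. Optionally, the RAS error manager may also be initialized.
FIG. 5 shows a specific example of lockstep manager initialization. As shown in FIG. 5, in the pre-initialization phase of the lockstep manager, a configuration file can be read.
Then, resource configuration initialization, exception vector table initialization, and interrupt handling function initialization are performed.
Resource configuration initialization selects two or more adjacent physical CPUs as one group of lockstep logical CPUs according to service requirements. For example, when a lockstep CPU is needed to run a task with a high safety-level requirement, physical CPU0 and physical CPU1 can be configured during resource configuration initialization as one group of lockstep logical CPUs to run the service program of that task.
Exception vector table initialization mainly handles initialization of the memory stacks for the CPU contexts when the lockstep CPUs exit to split mode, error synchronization and data consistency management, and interrupt handling. When the at least two CPUs in a lockstep CPU exit lockstep mode and enter split mode, the number of software-visible CPUs changes from one to several. At this point, on the one hand, initializing the memory stacks for the CPU contexts ensures that the contexts of the several CPUs are saved to different stacks, which avoids data being overwritten. On the other hand, each of the at least two CPUs jumps to the entry of the exception vector table and synchronizes the CPU errors, which ensures that any asynchronous error in the system at that moment is reported immediately, in preparation for the subsequent query of the error type. At the same time, the data in the CPU L1/L2 caches is flushed to external memory, which ensures that no data is lost when lockstep mode is re-entered.
Interrupt handling function initialization makes it possible to handle interrupts, for example the interrupt generated when an error is found in some of the CPUs in a lockstep CPU. As an example, the software layer calls the interrupt handling function through the entry of the exception vector table, and the interrupt handling function then calls the error query and repair unit to query the error and perform the corresponding repair according to the error type.
After resource configuration initialization, initialization of the interrupt exception vector table, and interrupt handling function initialization are complete, the post-initialization phase of the lockstep core management module is entered.
Then, lockstep manager initialization ends.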
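402: Determine that the outputs of the at least two CPUs in lockstep mode are inconsistent.
In one implementation, a comparison circuit placed outside the lockstep CPU can obtain the output of each of the at least two CPUs included in the lockstep CPU and then determine whether the outputs of the at least two CPUs are consistent. For details of the comparison circuit, refer to the description of FIG. 2; for brevity, they are not repeated here.
When it is determined that the outputs of the at least two CPUs in lockstep mode are inconsistent, the comparison circuit sends a signal to the interrupt controller, and the interrupt controller sends an interrupt to the CPUs according to the signal. The at least two CPUs then move from lockstep mode to split mode. Each of the at least two CPUs in split mode jumps to the entry of the interrupt exception vector table and synchronizes the CPU errors. After that, 403 and 404 are performed.
403: CPU context save management.
As an example, the at least two physical CPUs in split mode release their corresponding CPU contexts. Because at least one of the CPU contexts of the at least two CPUs is erroneous, the at least two CPU contexts, and the data in the caches, need to be flushed to different stack addresses in memory.
As an example, FIG. 6 shows an example of saving and restoring CPU contexts. As shown in FIG. 6, after lockstep CPU0' enters split mode, CPU0 and CPU1 in lockstep CPU0' each jump to the interrupt request (IRQ) entry. Then, the context of CPU0 is saved to stack 0 (stack0) in memory, and the context of CPU1 is saved to stack 1 (stack1) in memory. After the error query is performed, it can be determined which of CPU0 and CPU1 is the correct CPU and which is the faulty CPU. Then, when the error is recoverable, error repair is performed according to the result of the error query; for example, the state of the faulty CPU can be set according to the context of the correct CPU saved in memory. For example, when an error occurs in CPU0 and CPU1 is running normally, the context saved in stack1 is restored into CPU0 to repair CPU0. The two CPUs can then re-enter lockstep mode.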
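The save-to-separate-stacks and restore-from-the-healthy-stack sequence of FIG. 6 can be modelled as below. This is a sketch under stated assumptions: contexts are plain dictionaries of register values, and the `stackN` keys stand in for distinct stack addresses in memory. The function names are hypothetical.

```python
import copy


def save_contexts(cpu_contexts: dict) -> dict:
    """Save each CPU's software-visible context to its own stack slot."""
    # One slot per CPU, so that no context overwrites another.
    return {f"stack{cpu_id}": copy.deepcopy(ctx) for cpu_id, ctx in cpu_contexts.items()}


def repair_from_stack(cpu_contexts: dict, stacks: dict, faulty_id: int, healthy_id: int) -> None:
    """Overwrite the faulty CPU's context with the healthy CPU's saved context."""
    cpu_contexts[faulty_id] = copy.deepcopy(stacks[f"stack{healthy_id}"])
```

With CPU0 faulty and CPU1 healthy, `repair_from_stack(contexts, stacks, 0, 1)` restores the content of stack1 into CPU0, matching the example in the text.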
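404: Error query.
Specifically, 404 may be performed by the error query and repair unit. The error query and repair unit can send query information to the RAS error manager, and the RAS error manager can perform the error query. As an example, the RAS error manager performs the error query in the ACPI manner and the non-ACPI manner. For details of the ACPI manner and the non-ACPI manner, refer to the description above; for brevity, they are not repeated here.
Optionally, in the embodiments of this application, the RAS node corresponding to the comparator can be queried to determine the faulty CPU and the type of the error, without polling the other RAS nodes. In this case, the lockstep error is treated as an ordinary RAS error, and the error can be queried by directly reading the register, provided by hardware, of the RAS node corresponding to the comparator. When polling the RAS error node of the comparator, either the ACPI manner or the non-ACPI manner can be used. Because this register includes at least one of the erroneous data address, the erroneous module, and the error type, the error type can be determined by reading the register of the RAS node corresponding to the comparator. As an example, a lockstep error may refer to the error in which the outputs of the at least two CPUs of a lockstep CPU are inconsistent in lockstep mode.
As an example, recoverable errors include errors that are not of the propagated (uncontainable error, UC) type, or non-UC-type errors whose number of occurrences has not exceeded a preset threshold, or a system hang, and so on, which is not limited in the embodiments of this application. As an example, unrecoverable errors may include at least one of UC-type errors, non-UC-type errors whose number of occurrences exceeds the preset threshold, and unknown error types, which is not limited in the embodiments of this application.
In some possible implementations, for the propagated error type or an unknown error type, the health monitoring module can be notified to perform system health monitoring, that is, 405 is performed. When the number of occurrences of a non-UC-type error exceeds the preset threshold, the health monitoring module can be notified to perform system health monitoring, that is, 405 is performed. For a non-UC-type error whose number of occurrences has not exceeded the preset threshold, error recovery is performed by software, as shown in 406. For a CPU system hang, if the error has not propagated, error recovery can be performed through the hardware channel, as shown in 407.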
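The dispatch among steps 405, 406, and 407 can be written as one function. This is a sketch under stated assumptions: the error-kind strings, the function name, and the returned labels are made up for this example, and routing a hang whose error has propagated to the health monitor is an assumption, since the text only covers the non-propagated case.

```python
def choose_recovery_action(error_kind: str, occurrences: int, threshold: int,
                           propagated: bool = False) -> str:
    """Pick the next step (405, 406 or 407) for a queried error."""
    if error_kind in ("UC", "unknown"):
        # Propagated (UC) or unknown errors go to the health monitor (405).
        return "405_health_monitor"
    if error_kind == "hang":
        # A hang whose error has not propagated is repaired over the
        # hardware channel (407). Assumption: otherwise escalate to 405.
        return "405_health_monitor" if propagated else "407_hardware_channel"
    # Non-UC errors: software recovery (406) until the count exceeds the threshold.
    if occurrences > threshold:
        return "405_health_monitor"
    return "406_software_recovery"
```

The threshold is a parameter because the application refers only to a preset threshold without giving a value.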
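In some optional embodiments, for the case where a lockstep CPU includes two CPUs, when the comparator determines that the data output by the two physical CPUs differs, the RAS node corresponding to the comparator can be used to determine which CPU has an error and which type of error has occurred.
In some optional embodiments, for the case where a lockstep CPU includes three or more physical CPUs, when the comparator determines that the data output by the three or more physical CPUs differs, the faulty CPU can be determined according to the majority-voting principle. Here, majority voting means that when the output of one of at least three CPUs is inconsistent with the outputs of the other CPUs, it can be determined that this one CPU has an error. In this case, in one possible manner, the faulty CPU can be taken offline, and the other at least two CPUs can enter lockstep mode and continue running. Alternatively, in another possible manner, the RAS node corresponding to the comparator can be used to determine which CPU has an error and which type of error has occurred, and whether to recover the faulty CPU is then determined according to the type of the error.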
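Majority voting over three or more CPU outputs can be sketched as follows. The function name and the dictionary representation of the outputs are assumptions for this example; the sketch also returns `None` when no strict majority exists, a case the text does not discuss.

```python
from collections import Counter
from typing import List, Optional


def find_faulty_by_majority(outputs: dict) -> Optional[List[str]]:
    """Return the CPUs whose output differs from the majority value.

    Returns None when no value is shared by a strict majority of CPUs,
    in which case voting alone cannot identify the faulty CPU.
    """
    counts = Counter(outputs.values())
    majority_value, majority_count = counts.most_common(1)[0]
    if majority_count <= len(outputs) / 2:
        return None
    return [cpu for cpu, out in outputs.items() if out != majority_value]
```

With only two CPUs a disagreement never yields a strict majority, which is why the two-CPU case above relies on the comparator's RAS node instead of voting.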
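405: The health monitoring module performs system health monitoring.
Specifically, the health monitoring module may take the faulty CPU offline, or stop all CPUs in the lockstep CPU. As an example, in an autonomous driving scenario, the health monitoring module may notify the system to exit autonomous driving and let the micro controller unit (MCU) take over to perform emergency braking.
406: Recovery by software.
Specifically, because the correct CPU's context has already been flushed from the L1/L2 cache to memory at the entry of the exception vector table, the correct CPU's context can now be restored into the faulty CPU to recover the faulty CPU.
Note that software repair is typically used for registers at the commonly used privilege levels, for example EL0-level and EL1-level registers in the ARM64 architecture, or RING0-level and RING3-level registers in the x86 architecture. Usually, the error query in step 404 is enough to determine the privilege level at which the faulty CPU's error occurred.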
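407: Recover the faulty CPU through the hardware channel.
Specifically, the faulty CPU may be synchronized according to the state of the correct CPU. Here, the correct CPU may synchronize its software-visible CPU context to the faulty CPU over the hardware channel between itself and the faulty CPU. FIG. 7 shows an example of hardware-channel-based error repair according to an embodiment of this application.
The faulty CPU performs 701A to 704A, and the correct CPU performs 701B to 704B.
701A: Reset the faulty CPU, that is, reset the CPU's microarchitectural state, to perform single-core recovery of the faulty CPU. Single-core recovery here means that the faulty CPU is recovered while the correct CPU is not.
702A: After single-core recovery, the faulty CPU enters recovery mode and notifies the correct CPU to enter recovery mode at the same time. As an example, the faulty CPU may notify the correct CPU to enter recovery mode by an interrupt or by other means; this is not limited in the embodiments of this application.
In recovery mode, the faulty CPU may obtain the software-visible state of the correct CPU over the hardware channel and recover according to that state. As an example, the hardware channel may be a data channel between the correct CPU and the faulty CPU.
703A: After the state of the faulty CPU has been recovered, it enters the reset-sync state at the same time as the correct CPU. For 703A, see the description of 408.
704A: After reset-sync completes, all CPUs participating in lockstep re-enter lockstep mode. For 704A, see the description of 409.
701B: While the faulty CPU is being reset, the correct CPU is in a spin-wait state, in which it waits for the faulty CPU's notification to enter recovery mode. As an example, the faulty CPU may notify the correct CPU to enter this mode by an interrupt or by other means; this is not limited in the embodiments of this application.
702B: After entering recovery mode, the correct CPU sends the software-visible state in its registers to the faulty CPU over the hardware channel so that the faulty CPU can recover.
703B: After the software-visible state has been transferred, the correct CPU enters the reset-sync state at the same time as the faulty CPU. For 703B, see the description of 408.
704B: After reset-sync completes, all CPUs participating in lockstep re-enter lockstep mode. For 704B, see the description of 409.
Note that in some special cases, such as a system hang, it is not known at which privilege level the registers affected by the error reside. In that case, the hardware-channel approach can be used to repair the registers of all levels. Because more registers then need to be restored, recovery is slower than software recovery.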
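The 701A/701B to 704A/704B sequence can be modeled sequentially to show the ordering of the steps. This is a simplified sketch: the two CPUs really run concurrently, the hardware channel is modeled as a queue, and the dictionary keys `context`, `microarch`, and `mode`, as well as the function name, are hypothetical.

```python
from collections import deque


def recover_over_channel(faulty, correct):
    """Sequential model of hardware-channel recovery between two CPUs."""
    channel = deque()  # stands in for the CPU-to-CPU data channel

    # 701A: reset the faulty CPU's microarchitectural state (701B: the
    # correct CPU spin-waits, which needs no action in this model).
    faulty["microarch"].clear()

    # 702A: the faulty CPU enters recovery mode and notifies the correct CPU.
    faulty["mode"] = correct["mode"] = "recovery"

    # 702B: the correct CPU sends its software-visible registers.
    for name, value in correct["context"].items():
        channel.append((name, value))

    # 702A (continued): the faulty CPU applies what it receives.
    while channel:
        name, value = channel.popleft()
        faulty["context"][name] = value

    # 703A/703B: reset-sync clears non-software-visible state on both CPUs.
    # 704A/704B: both CPUs re-enter lockstep mode.
    for cpu in (faulty, correct):
        cpu["microarch"].clear()
        cpu["mode"] = "lockstep"
```

The model copies every register, which mirrors the note above that this path restores the registers of all levels and is therefore slower than software recovery.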
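408: Enter reset-sync.
After the software-visible state inside the faulty CPU core has been recovered, the faulty CPU performs reset-sync, that is, a reset of its internal microarchitecture. In one possible implementation, the faulty CPU resets all hardware state that is not software-visible and clears the data in the CPU cache, while keeping the software-visible state in the system registers and general-purpose registers. Reset-sync therefore differs from a conventional CPU reset: it is not a complete reset, so it takes less time, for example a few tens of CPU clock cycles.
Optionally, after the at least two CPUs are reset, initialization instructions may be executed to restore the software-visible CPU context so that the at least two CPUs re-enter lockstep mode. The initialization instructions include the software-visible CPU context of the second CPU at the time of the interrupt and are used to restore the software-visible CPU context to that context; the CPU context includes the values of the system registers and the values of the general-purpose registers. In one implementation, the initialization instructions may be executed by an initialization unit.
In one possible implementation, the at least two CPUs participating in lockstep are reset to the location where software has placed the initialization instructions in advance. The initialization instructions include the PC pointer and the system registers (that is, the values or data of the system registers) of the correct CPU at the time of the interrupt, as described above. After the reset, the at least two CPUs execute the initialization instructions at the same time.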
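The initialization-instruction variant can be illustrated with a toy model. It is a sketch under assumptions: the `("set", register, value)` tuples are a hypothetical encoding of the initialization instructions, and a dictionary stands in for the register file.

```python
def build_init_instructions(correct_context):
    """Encode the correct CPU's context as a list of 'set register' steps."""
    return [("set", reg, value) for reg, value in sorted(correct_context.items())]


def reset_and_run(init_instructions):
    """Model a full reset followed by execution of the initialization steps."""
    context = {}  # a full reset discards the previous register contents
    for op, reg, value in init_instructions:
        if op != "set":
            raise ValueError(f"unknown operation: {op}")
        context[reg] = value
    return context
```

Because every CPU starts from the same reset state and runs the same instruction list, all of them end with identical software-visible state, which is the precondition for re-entering lockstep mode.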
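Before reset-sync, the software-visible states set in the at least two physical CPUs are exactly the same. After reset-sync, the software-visible states of the at least two physical CPUs are still the same, and the CPUs all fetch data and instructions from external memory and receive the same input instruction stream.
409: The lockstep CPU continues running from where it previously exited.
After reset-sync completes, in one case, the microarchitectural state of all CPUs participating in lockstep is the initial post-reset state, and the software-visible state is the state before the service was interrupted. In another case, all CPUs participating in lockstep execute the initialization instructions at the same time, so the lockstep CPU can continue running from the point at which the service program was interrupted.
Meanwhile, the comparator corresponding to the lockstep CPU continues to compare the at least two physical CPUs in the lockstep CPU cycle by cycle.
Therefore, in the embodiments of this application, at least two CPUs in lockstep mode can exit lockstep mode when an error occurs in at least one CPU, and the faulty CPU and the correctly running CPU can be identified. On that basis, when the error is recoverable, the faulty CPU can be recovered according to the correctly running CPU, which helps the at least two CPUs resume running from the point at which the service program was interrupted. The embodiments of this application can thus improve the error recovery capability of a lockstep system and increase system reliability.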
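FIG. 8 is a schematic flowchart of an error recovery method according to an embodiment of this application. As an example, the method may be performed by the system shown in FIG. 1 or FIG. 2. The method includes 810 to 840.
810: At least two CPUs in lockstep mode receive an interrupt, where the interrupt indicates that an error has occurred in at least one of the at least two CPUs.
820: In response to the interrupt, the at least two CPUs exit lockstep mode.
830: Determine a first CPU, among the at least two CPUs, in which the error occurred, and the type of the error.
840: Based on the type of the error being a recoverable error, perform error recovery on the first CPU according to the state, at the time of the interrupt, of a correctly running second CPU among the at least two CPUs.
Therefore, in the embodiments of this application, at least two CPUs in lockstep mode can exit lockstep mode when an error occurs in at least one CPU, and the faulty CPU and the type of the error can be determined. On that basis, when the error is recoverable, the faulty CPU can be recovered according to the correctly running CPU, which helps the at least two CPUs resume running from the point at which the service program was interrupted. The embodiments of this application can thus improve the error recovery capability of a lockstep system and increase system reliability.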
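The four steps of FIG. 8 can be tied together in one illustrative function. This is a sketch under assumptions: `recover_from_lockstep_error`, the `find_faulty` and `is_recoverable` callbacks, and the dictionary layout are hypothetical, and the recovery step simply copies the correct CPU's context rather than modeling a specific recovery path.

```python
def recover_from_lockstep_error(cpus, find_faulty, is_recoverable):
    """Run steps 810 to 840 on a dict of CPU models; return True on recovery."""
    # 810/820: the interrupt has arrived, so every CPU leaves lockstep mode.
    for cpu in cpus.values():
        cpu["mode"] = "split"

    # 830: determine the faulty (first) CPU and the error type.
    faulty_id, error_type = find_faulty(cpus)

    if not is_recoverable(error_type):
        # Unrecoverable error: all CPUs stop running.
        for cpu in cpus.values():
            cpu["mode"] = "stopped"
        return False

    # 840: restore the faulty CPU from a correctly running (second) CPU.
    correct_id = next(cpu_id for cpu_id in cpus if cpu_id != faulty_id)
    cpus[faulty_id]["context"] = dict(cpus[correct_id]["context"])
    for cpu in cpus.values():
        cpu["mode"] = "lockstep"
    return True
```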
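Note that there may be one or more first CPUs and one or more second CPUs.
As an example, the state of a CPU may include the software-visible state of the CPU and/or the hardware state that is not software-visible. The software-visible state, also called the CPU context, includes the values (or data) of the general-purpose registers and the values (or data) of the system registers. The hardware state that is not software-visible may also be called the non-software-visible microarchitectural state and is state internal to the processor.
In one possible design, when the type of the error is an unrecoverable error, the at least two CPUs stop running.
In some implementations, performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU among the at least two CPUs includes:
obtaining from memory the software-visible CPU context of the second CPU at the time of the interrupt, and updating the software-visible CPU context in the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of the system registers and the values of the general-purpose registers.
In some implementations, the second CPU saves its software-visible CPU context at the time of the interrupt, together with the data in the cache, to memory. Optionally, the first CPU may also save its software-visible CPU context at the time of the interrupt, together with the data in the cache, to memory; this is not limited in the embodiments of this application.
In some implementations, performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU among the at least two CPUs includes:
the first CPU obtaining, over the hardware channel between itself and the second CPU, the software-visible CPU context of the second CPU at the time of the interrupt, and updating the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of the system registers and the values of the general-purpose registers.
Note that in some special cases, such as a system hang, it is not known at which privilege level the registers affected by the error reside. In that case, the hardware-channel approach can be used to repair the registers of all levels.
In some implementations, after the software-visible CPU context of the first CPU is updated, the first CPU and the second CPU each reset their own non-software-visible microarchitectural state and keep their own software-visible CPU context, so that the first CPU and the second CPU re-enter lockstep mode. In other words, the faulty CPU resets all hardware state that is not software-visible and clears the data in the CPU cache, while keeping the software-visible state in the system registers and general-purpose registers.
Therefore, before the reset, the software-visible states set in the at least two CPUs are exactly the same. After the reset, the software-visible states of the at least two CPUs are still the same, and the CPUs all fetch data and instructions from external memory and receive the same input instruction stream.
In some implementations, performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU among the at least two CPUs includes:
the first CPU and the second CPU each being reset and executing initialization instructions to restore the software-visible CPU context, so that the first CPU and the second CPU re-enter lockstep mode, where the initialization instructions include the software-visible CPU context of the second CPU at the time of the interrupt and are used to restore the software-visible CPU context to that context, and the CPU context includes the values of the system registers and the values of the general-purpose registers.
Therefore, before the reset, the software-visible states set in the at least two CPUs are exactly the same. After the reset, the software-visible states of the at least two CPUs are still the same, and the CPUs all fetch data and instructions from external memory and receive the same input instruction stream.
In some implementations, determining the first CPU, among the at least two CPUs, in which the error occurred, and the type of the error includes:
the first CPU determining the type of the error according to the Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU, where the ACPI table records the errors found when polling the status registers of the CPU's reliability, availability, and serviceability (RAS) nodes. In this way, when a RAS error occurs in a CPU, the CPU raises an interrupt or a system exception and enters UEFI or the BIOS; UEFI or the BIOS traverses the status registers of the RAS nodes and records the error corresponding to that CPU in a table in memory (the ACPI table). The operating system's ACPI driver can then parse the table to learn which node in the system had an error and what type of error it was.
Alternatively, the first CPU polls the status registers of the RAS nodes of the first CPU to determine the type of the error. In this way, when a RAS error occurs in a CPU, the CPU raises an interrupt or a system exception, and the RAS driver directly traverses the status registers of the RAS nodes one by one to determine the cause of the error, instead of obtaining it by querying the ACPI table.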
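The difference between the two query paths can be sketched with a toy model. All names here are hypothetical: the `valid` and `type` fields stand in for whatever the real RAS status registers contain, and the table layout is not the actual ACPI format.

```python
def poll_ras_nodes(nodes):
    """Non-ACPI path: the RAS driver reads each node's status register."""
    return [(name, status["type"]) for name, status in nodes.items() if status["valid"]]


def build_error_table(nodes):
    """ACPI path, firmware side: record the polled errors in a memory table."""
    return [{"node": name, "type": err_type} for name, err_type in poll_ras_nodes(nodes)]


def query_error_table(table):
    """ACPI path, OS side: the driver parses the table instead of the registers."""
    return [(entry["node"], entry["type"]) for entry in table]
```

Both paths report the same errors in this model; what differs is who reads the status registers, firmware (UEFI or BIOS) in the ACPI manner and the RAS driver itself in the non-ACPI manner.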
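Optionally, the second CPU may also poll the status registers of the RAS nodes of the second CPU to determine that the second CPU is running correctly.
Optionally, the second CPU may also determine that it is running correctly according to the ACPI table corresponding to the second CPU.
Optionally, when the at least two CPUs enter split mode, each CPU may determine whether it has an error itself, without querying the RAS nodes or the ACPI table. In other words, it can then be determined directly which CPUs are faulty and which CPUs are running correctly.
In some implementations, the at least two CPUs receiving the interrupt includes:
the at least two CPUs receiving the interrupt sent by the interrupt controller, where the interrupt controller sends the interrupt to the at least two CPUs when the comparison circuit determines that the outputs of the at least two CPUs are inconsistent.
In some implementations, the outputs of the at least two CPUs include at least one of the following for each of the at least two CPUs: the CPU's internal bus output, the CPU's output to the external bus, and the CPU's level-3 (L3) cache control logic output.
In some implementations, determining the first CPU, among the at least two CPUs, in which the error occurred, and the type of the error includes:
querying the status register of the RAS node corresponding to the comparison circuit to determine the first CPU, among the at least two CPUs, in which the error occurred, and the type of the error.
In this case, when the comparator determines that the CPU outputs it has obtained are inconsistent, it may report a RAS error interrupt and at the same time provide, in the registers of the RAS node corresponding to the comparator, information about the mismatching data, such as at least one of the erroneous data address, the erroneous module, and the error type.
The error recovery method shown in FIG. 8 can implement each process of the error recovery method in the foregoing method embodiments. For details, see the description above; to avoid repetition, they are not repeated here.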
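The error recovery method of the embodiments of this application has been described in detail above with reference to FIG. 1 to FIG. 8. The error recovery apparatus of the embodiments of this application is described in detail below with reference to FIG. 9. It should be understood that the error recovery apparatus in FIG. 9 can perform each step of the error recovery method of the embodiments of this application; repeated descriptions are omitted where appropriate in the following description of the apparatus shown in FIG. 9.
FIG. 9 is a schematic block diagram of an error recovery apparatus 900 according to an embodiment of this application.
The apparatus 900 shown in FIG. 9 includes a lockstep CPU 910, and the lockstep CPU 910 includes a first CPU 9110 and a second CPU 9120.
The first CPU 9110 is configured to: receive an interrupt, where the interrupt is triggered by an error occurring in the first CPU 9110 while the first CPU 9110 and the second CPU 9120 are in lockstep mode;
in response to the interrupt, exit lockstep mode and determine the type of the error; and
based on the type of the error being a recoverable error, perform error recovery according to the state of the second CPU 9120 at the time of the interrupt.
The second CPU 9120 is configured to receive the interrupt and exit lockstep mode.
In some implementations, the first CPU 9110 is specifically configured to:
obtain from memory the software-visible CPU context of the second CPU 9120 at the time of the interrupt, and update the software-visible CPU context of the first CPU 9110 according to the software-visible CPU context of the second CPU 9120, where the CPU context includes the values of the system registers and the values of the general-purpose registers.
In some implementations, the second CPU 9120 is further configured to save the software-visible CPU context of the second CPU 9120 at the time of the interrupt, together with the data in the cache, to memory.
In some implementations, the first CPU 9110 is specifically configured to:
obtain, over the hardware channel between itself and the second CPU 9120, the software-visible CPU context of the second CPU 9120 at the time of the interrupt, and update the software-visible CPU context of the first CPU 9110 according to the software-visible CPU context of the second CPU 9120, where the CPU context includes the values of the system registers and the values of the general-purpose registers.
In some implementations, the first CPU 9110 is further configured to: after updating the software-visible CPU context, reset the non-software-visible microarchitectural state of the first CPU 9110 and keep the software-visible CPU context of the first CPU 9110, so that the first CPU 9110 re-enters lockstep mode.
The second CPU 9120 is further configured to: after the first CPU 9110 updates the software-visible CPU context, reset the non-software-visible microarchitectural state of the second CPU 9120 and keep the software-visible CPU context of the second CPU 9120, so that the second CPU 9120 re-enters lockstep mode.
In some implementations, the first CPU 9110 is specifically configured to be reset and, after the reset, to execute initialization instructions to restore the software-visible CPU context, so that the first CPU 9110 re-enters lockstep mode, where the initialization instructions include the software-visible CPU context of the second CPU 9120 at the time of the interrupt and are used to restore the software-visible CPU context to that context, and the CPU context includes the values of the system registers and the values of the general-purpose registers.
The second CPU 9120 is specifically configured to be reset and, after the reset, to execute the initialization instructions, so that the second CPU 9120 re-enters lockstep mode.
In some implementations, the first CPU and the second CPU may be reset at the same time and execute the initialization instructions at the same time, so that the first CPU and the second CPU re-enter lockstep mode.
In some implementations, the first CPU 9110 is specifically configured to:
determine the type of the error according to the Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU 9110, where the ACPI table records the errors found when polling the status registers of the CPU's reliability, availability, and serviceability (RAS) nodes; or
poll the status registers of the RAS nodes of the first CPU 9110 to determine the type of the error.
In some implementations, the first CPU 9110 is specifically configured to receive the interrupt sent by the interrupt controller, where the interrupt controller sends the interrupt to the first CPU 9110 and the second CPU 9120 when the comparison circuit determines that the outputs of the first CPU 9110 and the second CPU 9120 are inconsistent.
The second CPU 9120 is specifically configured to receive the interrupt sent by the interrupt controller.
In some implementations, the first CPU 9110 is further configured to:
query the status register of the RAS node corresponding to the comparison circuit to determine the first CPU 9110 in which the error occurred, and the type of the error.
In some implementations, based on the type of the error being an unrecoverable error, the first CPU 9110 is further configured to stop running, and the second CPU 9120 is further configured to stop running.
In some implementations, the apparatus 900 may further include the interrupt controller and the comparison circuit described above.
The comparison circuit is configured to obtain the outputs of the first CPU 9110 and the second CPU 9120 and, when it determines that the outputs of the first CPU 9110 and the second CPU 9120 are inconsistent, send a first signal to the interrupt controller, where the first signal instructs the interrupt controller to send an interrupt to the first CPU 9110 and the second CPU 9120.
The interrupt controller sends the interrupt to the first CPU 9110 and the second CPU 9120 according to the first signal.
Optionally, the apparatus may further include a storage unit 920. In one possible approach, the storage unit 920 is configured to store instructions. Optionally, the storage unit 920 may also be configured to store data or information. The storage unit 920 may be implemented by a memory.
In one possible design, the first CPU 9110 and the second CPU 9120 may be configured to execute the instructions stored in the storage unit 920, so that the apparatus 900 implements the error recovery method described above.
Further, the first CPU 9110, the second CPU 9120, and the storage unit 920 may communicate with one another over internal connection paths to transfer control and/or data signals. For example, the storage unit 920 is configured to store a computer program, and the first CPU 9110 and the second CPU 9120 may be configured to call and run the computer program from the storage unit 920 to complete the error recovery method described above. The storage unit 920 may be integrated in the lockstep CPU 910 or may be provided separately from the lockstep CPU 910.
The memory may be one or more of the following types: flash memory, hard-disk-type memory, micro multimedia card memory, card-type memory (for example SD or XD memory), random access memory (RAM), static random access memory (static RAM, SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (programmable ROM, PROM), magnetic memory, magnetic disk, or optical disc. For example, the memory may store a computer program (the program corresponding to the error recovery method of the embodiments of this application); when a processing unit executes the computer program, the processing unit can perform the error recovery method of the embodiments of this application.
The memory also stores data other than the computer program; for example, the memory may store data produced while the error recovery method of this application is being processed.
The apparatus 900 shown in FIG. 9 can implement each process of the error recovery method in the foregoing method embodiments. For details of the apparatus 900, see the description above; to avoid repetition, they are not repeated here.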
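FIG. 10 is a schematic block diagram of another error recovery apparatus 1000 according to an embodiment of this application. The apparatus includes a determining unit 1010 and a recovery unit 1020.
When an error occurs in a first CPU among at least two central processing units (CPUs) in lockstep mode and the at least two CPUs exit lockstep mode, the determining unit 1010 is configured to determine the type of the error of the first CPU.
The recovery unit 1020 is configured to, based on the type of the error being a recoverable error, perform error recovery on the first CPU according to the state, at the time of the interrupt, of a correctly running second CPU among the at least two CPUs.
In some implementations, the recovery unit 1020 is specifically configured to:
obtain from memory the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, where the CPU context includes the values of the system registers and the values of the general-purpose registers.
In some implementations, the apparatus further includes a CPU context management unit, configured to save the software-visible CPU context of the second CPU at the time of the interrupt, together with the data in the cache, to memory.
In some implementations, the apparatus further includes an initialization unit, configured to execute initialization instructions after the first CPU and the second CPU are reset, to restore the software-visible CPU context so that the first CPU and the second CPU re-enter lockstep mode, where the initialization instructions include the software-visible CPU context of the second CPU at the time of the interrupt and are used to restore the software-visible CPU context to that context, and the CPU context includes the values of the system registers and the values of the general-purpose registers.
In some implementations, the determining unit 1010 is specifically configured to:
determine the type of the error according to the Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU, where the ACPI table records the errors found when polling the status registers of the CPU's reliability, availability, and serviceability (RAS) nodes; or
poll the status registers of the RAS nodes of the first CPU to determine the type of the error.
In some implementations, the determining unit 1010 is specifically configured to:
query the status register of the RAS node corresponding to the comparison circuit to determine the first CPU in which the error occurred, and the type of the error, where the comparison circuit is configured to send a first signal to the interrupt controller when it determines that the outputs of the at least two CPUs are inconsistent, and the first signal instructs the interrupt controller to send an interrupt to the at least two CPUs to trigger the at least two CPUs to exit lockstep mode.
In some implementations, the outputs of the at least two CPUs include at least one of the following for each of the at least two CPUs: the CPU's internal bus output, the CPU's output to the external bus, and the CPU's level-3 (L3) cache control logic output.
In some implementations, the determining unit 1010 is further configured to, based on the type of the error being an unrecoverable error, control the at least two CPUs to stop running.
The error recovery apparatus 1000 shown in FIG. 10 can implement the corresponding processes of the error recovery method in the foregoing method embodiments. For details of the error recovery apparatus 1000, see the description above; to avoid repetition, they are not repeated here.
As an example, the error recovery apparatus may be a terminal, or an apparatus in a terminal that performs error recovery (for example, a chip, or an apparatus that can be used together with the terminal). The terminal may specifically be a smartphone, an in-vehicle apparatus, a wearable device, or the like. Optionally, the in-vehicle apparatus may be a computer system that is independent of a car but can be applied to a car, or a computer system integrated into a car (for example, an autonomous vehicle).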
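An embodiment of this application further provides a computer-readable storage medium that stores program code, where the program code includes instructions for performing some or all of the operations in the method described in any of the foregoing embodiments.
Optionally, the computer-readable storage medium is located in a terminal, and the terminal may be an apparatus capable of error recovery.
An embodiment of this application further provides a computer program product that, when run on an error recovery apparatus, causes the error recovery apparatus to perform some or all of the operations in the method described in any of the foregoing embodiments.
An embodiment of this application further provides a chip. The chip includes a processor, and the processor is configured to perform some or all of the operations in the method described in any of the foregoing embodiments.
The embodiments in this application may be used independently or in combination; this is not limited here.
It should be understood that descriptions such as "first" and "second" in the embodiments of this application serve only to illustrate and distinguish the described objects. They imply no order, do not specifically limit the number of devices in the embodiments of this application, and do not constitute any limitation on the embodiments of this application.
It should also be understood that, in the various embodiments of this application, the sequence numbers of the foregoing processes do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of this application.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.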

Claims (31)

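  1. An error recovery method, comprising:
    receiving an interrupt, wherein the interrupt is triggered by an error occurring in a first central processing unit (CPU) while the first CPU and a second CPU are in lockstep mode;
    in response to the interrupt, exiting lockstep mode by the first CPU;
    determining the type of the error; and
    based on the type of the error being a recoverable error, performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU.
  2. The method according to claim 1, wherein performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU comprises:
    obtaining from memory the software-visible CPU context of the second CPU at the time of the interrupt, and updating the software-visible CPU context in the first CPU according to the software-visible CPU context of the second CPU, wherein the CPU context comprises values of system registers and values of general-purpose registers.
  3. The method according to claim 2, further comprising:
    saving the software-visible CPU context of the second CPU at the time of the interrupt, together with the data in the cache, to memory.
  4. The method according to claim 1, wherein performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU comprises:
    obtaining, over a hardware channel between the first CPU and the second CPU, the software-visible CPU context of the second CPU at the time of the interrupt, and updating the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, wherein the CPU context comprises values of system registers and values of general-purpose registers.
  5. The method according to any one of claims 2 to 4, further comprising:
    after updating the software-visible CPU context of the first CPU, separately resetting the non-software-visible microarchitectural state of each of the first CPU and the second CPU, and keeping the software-visible CPU context of each of the first CPU and the second CPU, so that the first CPU and the second CPU re-enter lockstep mode.
  6. The method according to claim 1, wherein performing error recovery on the first CPU according to the state, at the time of the interrupt, of the correctly running second CPU comprises:
    resetting the first CPU and the second CPU separately and executing initialization instructions, so that the first CPU and the second CPU re-enter lockstep mode, wherein the initialization instructions comprise the software-visible CPU context of the second CPU at the time of the interrupt and are used to restore the software-visible CPU context to the software-visible CPU context of the second CPU at the time of the interrupt, and the CPU context comprises values of system registers and values of general-purpose registers.
  7. The method according to any one of claims 1 to 6, wherein determining the type of the error comprises:
    determining the type of the error according to an Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU, wherein the ACPI table records errors found when polling status registers of reliability, availability, and serviceability (RAS) nodes of the CPU; or
    polling status registers of RAS nodes of the first CPU to determine the type of the error.
  8. The method according to any one of claims 1 to 7, wherein the interrupt is sent by an interrupt controller, and the interrupt controller sends the interrupt to the first CPU and the second CPU when a comparison circuit determines that outputs of the first CPU and the second CPU are inconsistent.
  9. The method according to claim 8, wherein the outputs of the first CPU and the second CPU comprise at least one of a CPU internal bus output, an external bus output, and a level-3 cache control logic output.
  10. The method according to claim 8 or 9, wherein determining the type of the error comprises:
    querying a status register of a RAS node corresponding to the comparison circuit to determine the type of the error.
  11. The method according to any one of claims 1 to 10, further comprising:
    based on the type of the error being an unrecoverable error, stopping the first CPU and the second CPU.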
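  12. An error recovery apparatus, comprising a first central processing unit (CPU) and a second CPU, wherein:
    the first CPU is configured to: receive an interrupt, wherein the interrupt is triggered by an error occurring in the first CPU while the first CPU and the second CPU are in lockstep mode; in response to the interrupt, exit lockstep mode and determine the type of the error; and based on the type of the error being a recoverable error, perform error recovery according to the state of the second CPU at the time of the interrupt; and
    the second CPU is configured to receive the interrupt and exit lockstep mode.
  13. The apparatus according to claim 12, wherein the first CPU is specifically configured to:
    obtain from memory the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, wherein the CPU context comprises values of system registers and values of general-purpose registers.
  14. The apparatus according to claim 13, wherein the second CPU is further configured to save the software-visible CPU context of the second CPU at the time of the interrupt, together with the data in the cache, to memory.
  15. The apparatus according to claim 12, wherein the first CPU is specifically configured to:
    obtain, over a hardware channel between itself and the second CPU, the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, wherein the CPU context comprises values of system registers and values of general-purpose registers.
  16. The apparatus according to any one of claims 13 to 15, wherein:
    the first CPU is further configured to: after updating the software-visible CPU context, reset the non-software-visible microarchitectural state of the first CPU and keep the software-visible CPU context of the first CPU, so that the first CPU re-enters lockstep mode; and
    the second CPU is further configured to: after the first CPU updates the software-visible CPU context, reset the non-software-visible microarchitectural state of the second CPU and keep the software-visible CPU context of the second CPU, so that the second CPU re-enters lockstep mode.
  17. The apparatus according to claim 12, wherein:
    the first CPU is specifically configured to be reset and to execute initialization instructions, so that the first CPU re-enters lockstep mode, wherein the initialization instructions comprise the software-visible CPU context of the second CPU at the time of the interrupt and are used to restore the software-visible CPU context to the software-visible CPU context of the second CPU at the time of the interrupt, and the CPU context comprises values of system registers and values of general-purpose registers; and
    the second CPU is specifically configured to be reset and to execute the initialization instructions, so that the second CPU re-enters lockstep mode.
  18. The apparatus according to any one of claims 12 to 17, wherein the first CPU is specifically configured to:
    determine the type of the error according to an Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU, wherein the ACPI table records errors found when polling status registers of reliability, availability, and serviceability (RAS) nodes of the CPU; or
    poll status registers of RAS nodes of the first CPU to determine the type of the error.
  19. The apparatus according to any one of claims 12 to 18, wherein:
    the interrupt is sent by an interrupt controller, and the interrupt controller sends the interrupt to the first CPU and the second CPU when a comparison circuit determines that outputs of the first CPU and the second CPU are inconsistent.
  20. The apparatus according to claim 19, wherein the outputs of the first CPU and the second CPU comprise at least one of a CPU internal bus output, an external bus output, and a level-3 cache control logic output.
  21. The apparatus according to claim 19 or 20, wherein the first CPU is further configured to:
    query a status register of a RAS node corresponding to the comparison circuit to determine the first CPU in which the error occurred, and the type of the error.
  22. The apparatus according to any one of claims 12 to 21, wherein the first CPU and the second CPU are further configured to stop running based on the type of the error being an unrecoverable error.
  23. The apparatus according to any one of claims 12 to 18, further comprising an interrupt controller and a comparison circuit, wherein:
    the comparison circuit is configured to obtain outputs of the first CPU and the second CPU and, when determining that the outputs of the first CPU and the second CPU are inconsistent, send a first signal to the interrupt controller, wherein the first signal instructs the interrupt controller to send an interrupt to the first CPU and the second CPU; and
    the interrupt controller sends the interrupt to the first CPU and the second CPU according to the first signal.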
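  24. An error recovery apparatus, comprising a determining unit and a recovery unit, wherein:
    when an error occurs in a first CPU among at least two central processing units (CPUs) in lockstep mode and the at least two CPUs exit lockstep mode, the determining unit is configured to determine the type of the error of the first CPU; and
    the recovery unit is configured to, based on the type of the error being a recoverable error, perform error recovery on the first CPU according to the state, at the time of the interrupt, of a correctly running second CPU among the at least two CPUs.
  25. The apparatus according to claim 24, wherein the recovery unit is specifically configured to:
    obtain from memory the software-visible CPU context of the second CPU at the time of the interrupt, and update the software-visible CPU context of the first CPU according to the software-visible CPU context of the second CPU, wherein the CPU context comprises values of system registers and values of general-purpose registers.
  26. The apparatus according to claim 25, further comprising a CPU context management unit, configured to save the software-visible CPU context of the second CPU at the time of the interrupt, together with the data in the cache, to memory.
  27. The apparatus according to claim 24, further comprising an initialization unit, configured to execute initialization instructions after the first CPU and the second CPU are reset, to restore the software-visible CPU context so that the first CPU and the second CPU re-enter lockstep mode, wherein the initialization instructions comprise the software-visible CPU context of the second CPU at the time of the interrupt and are used to restore the software-visible CPU context to the software-visible CPU context of the second CPU at the time of the interrupt, and the CPU context comprises values of system registers and values of general-purpose registers.
  28. The apparatus according to any one of claims 24 to 27, wherein the determining unit is specifically configured to:
    determine the type of the error according to an Advanced Configuration and Power Interface (ACPI) table corresponding to the first CPU, wherein the ACPI table records errors found when polling status registers of reliability, availability, and serviceability (RAS) nodes of the CPU; or
    poll status registers of RAS nodes of the first CPU to determine the type of the error.
  29. The apparatus according to any one of claims 24 to 27, wherein the determining unit is specifically configured to:
    query a status register of a RAS node corresponding to a comparison circuit to determine the first CPU in which the error occurred, and the type of the error, wherein the comparison circuit is configured to send a first signal to an interrupt controller when determining that outputs of the at least two CPUs are inconsistent, and the first signal instructs the interrupt controller to send an interrupt to the at least two CPUs to trigger the at least two CPUs to exit lockstep mode.
  30. The apparatus according to claim 29, wherein the outputs of the at least two CPUs comprise at least one of the following for each of the at least two CPUs: the CPU's internal bus output, the CPU's output to the external bus, and the CPU's level-3 cache control logic output.
  31. The apparatus according to any one of claims 24 to 30, wherein the determining unit is further configured to, based on the type of the error being an unrecoverable error, control the at least two CPUs to stop running.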
PCT/CN2020/093188 2019-05-31 2020-05-29 Error recovery method and apparatus WO2020239060A1 (zh)

Priority Applications (9)

Application Number Priority Date Filing Date Title
FIEP20785894.5T FI3770765T3 (fi) 2019-05-31 2020-05-29 Virheestä palautumismenetelmä ja -laite
DK20785894.5T DK3770765T3 (da) 2019-05-31 2020-05-29 Fremgangsmåde og apparat til fejlgenoprettelse
JP2021570888A JP7351933B2 (ja) 2019-05-31 2020-05-29 エラーリカバリ方法及び装置
CA3142308A CA3142308A1 (en) 2019-05-31 2020-05-29 Error recovery method and apparatus
AU2020285262A AU2020285262B2 (en) 2019-05-31 2020-05-29 Error recovery method and apparatus
KR1020217042599A KR20220010040A (ko) 2019-05-31 2020-05-29 에러 복구 방법 및 장치
EP20785894.5A EP3770765B1 (en) 2019-05-31 2020-05-29 Error recovery method and apparatus
US17/038,428 US11068360B2 (en) 2019-05-31 2020-09-30 Error recovery method and apparatus based on a lockup mechanism
US17/376,442 US11604711B2 (en) 2019-05-31 2021-07-15 Error recovery method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910473113.6 2019-05-31
CN201910473113.6A CN112015599B (zh) 2019-05-31 2019-05-31 Error recovery method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/038,428 Continuation US11068360B2 (en) 2019-05-31 2020-09-30 Error recovery method and apparatus based on a lockup mechanism

Publications (1)

Publication Number Publication Date
WO2020239060A1 true WO2020239060A1 (zh) 2020-12-03

Family

ID=73506531

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093188 WO2020239060A1 (zh) 2019-05-31 2020-05-29 Error recovery method and apparatus

Country Status (10)

Country Link
US (2) US11068360B2 (zh)
EP (1) EP3770765B1 (zh)
JP (1) JP7351933B2 (zh)
KR (1) KR20220010040A (zh)
CN (1) CN112015599B (zh)
AU (1) AU2020285262B2 (zh)
CA (1) CA3142308A1 (zh)
DK (1) DK3770765T3 (zh)
FI (1) FI3770765T3 (zh)
WO (1) WO2020239060A1 (zh)

Also Published As

Publication number Publication date
CN112015599A (zh) 2020-12-01
US20210019240A1 (en) 2021-01-21
JP7351933B2 (ja) 2023-09-27
AU2020285262A1 (en) 2022-01-20
CN112015599B (zh) 2022-05-13
US20210342234A1 (en) 2021-11-04
US11604711B2 (en) 2023-03-14
JP2022534418A (ja) 2022-07-29
EP3770765A4 (en) 2021-07-07
EP3770765A1 (en) 2021-01-27
DK3770765T3 (da) 2023-04-11
FI3770765T3 (fi) 2023-03-22
EP3770765B1 (en) 2023-01-18
AU2020285262B2 (en) 2023-10-12
KR20220010040A (ko) 2022-01-25
US11068360B2 (en) 2021-07-20
CA3142308A1 (en) 2020-12-03

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020785894

Country of ref document: EP

Effective date: 20201014

ENP Entry into the national phase

Ref document number: 2021570888

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 3142308

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20217042599

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2020285262

Country of ref document: AU

Date of ref document: 20200529

Kind code of ref document: A