
US7308609B2 - Method, data processing system, and computer program product for collecting first failure data capture information - Google Patents

Method, data processing system, and computer program product for collecting first failure data capture information Download PDF

Info

Publication number
US7308609B2
Authority
US
United States
Prior art keywords
data
processing system
data processing
dump
instructions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires 2025-10-15
Application number
US10/821,045
Other versions
US20050240826A1 (en)
Inventor
Marc Alan Dickenson
Brent William Jacobs
Michael Youhour Lim
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US10/821,045
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION (assignment of assignors interest; see document for details). Assignors: LIM, MICHAEL YOUHOUR; DICKENSON, MARC ALAN; JACOBS, BRENT WILLIAM
Publication of US20050240826A1
Application granted
Publication of US7308609B2
Legal status: Active; expiration adjusted

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F 11/2268 Logging of test results

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A method, computer program product, and a data processing system for generating a data dump in a data processing system are provided. A system boot of the data processing system is initialized. A firmware that includes fault collection logic is executed. A data dump is created in a persistent storage of the data processing system. An attempt is made to complete the system boot of the data processing system.

Description

BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates generally to an improved data processing system and in particular to a data processing system and method for generating a data dump. Still more particularly, the present invention provides a mechanism for gathering first failure data capture information in a data processing system that has encountered a failure condition.
2. Description of Related Art
Data processing system failures cause many problems to the users of the system, especially when data is lost or corrupted. Therefore, when a data processing system fails, it is important to gather information that can aid in isolating and determining the problem associated with the failure.
The collection of first failure data capture (FFDC) information is an important part of common field service strategies utilizing embedded subsystems such as a service processor (SP) subsystem. The SP collects and stores as much FFDC data as possible into a limited non-volatile memory resource. The FFDC data is later retrieved and saved to more permanent storage media, where it may be analyzed by, for example, field service personnel to diagnose the failures.
Current solutions do not allow for the dynamic reprioritization that is often necessary to capture all of the correct information in the limited storage space available for FFDC dumps. Moreover, current solutions do not provide reliability features for enabling data collection processes that are tolerant of failures that occur during the data collection phase.
Thus, it would be advantageous to provide a method and data processing system for enabling the dynamic reprioritization of data items captured by a first failure data capture system. Moreover, it would be advantageous to provide a data capture system that increases the reliability of a dump collection process.
SUMMARY OF THE INVENTION
The present invention provides a method, computer program product, and a data processing system for generating a data dump in a data processing system. A system boot of the data processing system is initialized. A firmware that includes first failure data capture logic is executed. A data dump is created in a persistent storage of the data processing system. An attempt is made to complete the system boot of the data processing system.
BRIEF DESCRIPTION OF THE DRAWINGS
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
FIG. 1 depicts a pictorial representation of a network of data processing systems in which the present invention may be implemented;
FIG. 2 is a block diagram of a data processing system that may be implemented as a server in accordance with a preferred embodiment of the present invention;
FIG. 3 is a diagrammatic illustration of a first failure data capture system implemented according to a preferred embodiment of the present invention;
FIG. 4A is a flowchart of processing performed by the first failure data capture interface shown in FIG. 3 in accordance with a preferred embodiment of the present invention; and
FIG. 4B is a flowchart of first failure data capture information collection performed during reboot of a service processor subsystem of the data processing system shown in FIG. 2 in accordance with a preferred embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
With reference now to the figures, FIG. 1 depicts a pictorial representation of a network of data processing systems in which the present invention may be implemented. Network data processing system 100 is a network of computers in which the present invention may be implemented. Network data processing system 100 contains a network 102, which is the medium used to provide communications links between various devices and computers connected together within network data processing system 100. Network 102 may include connections, such as wire, wireless communication links, or fiber optic cables.
In the depicted example, server 104 is connected to network 102 along with storage unit 106. In addition, clients 108, 110, and 112 are connected to network 102. These clients 108, 110, and 112 may be, for example, personal computers or network computers. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to clients 108-112. Clients 108, 110, and 112 are clients to server 104. Network data processing system 100 may include additional servers, clients, and other devices not shown. In the depicted example, network data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, network data processing system 100 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN). FIG. 1 is intended as an example, and not as an architectural limitation for the present invention.
Referring to FIG. 2, a block diagram of a data processing system that may be implemented as a server, such as server 104 in FIG. 1, is depicted in accordance with a preferred embodiment of the present invention. Data processing system 200 is an example system in which code or instructions implementing the processes of the present invention may be located. Data processing system 200 may be a symmetric multiprocessor (SMP) system including a plurality of processors 202 and 204 connected to system bus 206. Alternatively, a single processor system may be employed. Also connected to system bus 206 is memory controller/cache 208, which provides an interface to local memory 209. I/O bus bridge 210 is connected to system bus 206 and provides an interface to I/O bus 212. Memory controller/cache 208 and I/O bus bridge 210 may be integrated as depicted.
Peripheral component interconnect (PCI) bus bridge 214 connected to I/O bus 212 provides an interface to PCI local bus 216. A number of modems may be connected to PCI local bus 216. Typical PCI bus implementations will support four PCI expansion slots or add-in connectors. Communications links to clients 108-112 in FIG. 1 may be provided through modem 218 and network adapter 220 connected to PCI local bus 216 through add-in connectors.
Additional PCI bus bridges 222 and 224 provide interfaces for additional PCI local buses 226 and 228, from which additional modems or network adapters may be supported. In this manner, data processing system 200 allows connections to multiple network computers. A memory-mapped graphics adapter 230 and hard disk 232 may also be connected to I/O bus 212 as depicted, either directly or indirectly.
In the depicted example, service processor (SP) 244 is connected to I/O bus 212 by direct component connection. SP processor 244 is connected to SP flash memory 245, SP dynamic random access memory (DRAM) 241, and non-volatile random access memory (NVRAM) 242. All of these components form an SP unit or module. SP flash memory 245 is an example of the flash memory in which firmware used for an initial program load (IPL) may be stored. SP DRAM 241 is a memory in which firmware binaries from SP flash memory 245 are loaded for execution by SP processor 244. NVRAM 242 may be used to hold data that is to be retained when the system is powered down. In this example, flash memory 245 provides storage for an initial program load firmware, which is used to initialize the hardware in data processing system 200. Additionally, flash memory 245 provides a persistent storage for storing a data dump comprising first failure data capture information collected in response to detection of a system or application error or fault condition.
Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 2 may vary. For example, other peripheral devices, such as optical disk drives and the like, also may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.
The data processing system depicted in FIG. 2 may be, for example, an IBM eServer pSeries system, a product of International Business Machines Corporation in Armonk, N.Y., running the Advanced Interactive Executive (AIX) operating system or LINUX operating system.
The first failure data capture system instructions are preferably executed by SP processor 244 of data processing system 200 shown in FIG. 2. In accordance with the present invention, hardware device drivers and software components that detect failures make persistent records of the failures using a software facility provided for this purpose, herein called the first failure data capture (FFDC) system. The FFDC system logic is implemented as part of the SP subsystem collectively designated as SP subsystem 240 in FIG. 2. SP subsystem 240 may be implemented as a distinct data processing system or alternatively may be implemented as a subsystem integrated in a host data processing system. In the examples provided herein, SP subsystem 240 is shown and described as comprising a subsystem integrated with data processing system 200, and such an implementation is exemplary only. SP subsystem 240 creates a failure report if an error or fault condition is detected in data processing system 200. FFDC error data may be associated with an application error or an operating system error. FFDC information is stored as a data dump and associated header information written to, for example, SP flash memory 245 shown in FIG. 2.
FIG. 3 is a diagrammatic illustration of a first failure data capture system implemented according to an embodiment of the present invention. One or more application programs 302 communicate with FFDC interface 304 implemented in accordance with the principles of the present invention. FFDC interface 304 stores and retrieves failure reports through, in one example, an operating system (O/S) error logging subsystem 306 to an O/S error log persistent storage 310, or alternatively, to an error stack FFDC persistent storage 308. FFDC persistent storage 308 and O/S error log persistent storage 310 may be recorded in SP flash memory 245. In an alternate embodiment, FFDC persistent storage 308 and error log persistent storage 310 may comprise the same storage within data processing system 200. FFDC persistent storage 308 could store information that would not normally go into O/S error log persistent storage 310. FFDC persistent storage 308 and error logging persistent storage 310 are components available in SP subsystem 240 offered by International Business Machines Corporation.
Preferably, when making a failure record, the FFDC system component provides enough information so that: 1) the failure is adequately described, allowing later analysis efforts to determine the nature and scope of the failure condition; and 2) specific details of importance to the data processing system environment are captured, so that the manufacturer of data processing system 200 can determine how the condition came to exist and any flaws in the data processing system environment can be identified and repaired.
FFDC interface 304 writes a dump to SP flash memory 245 upon detection of a failure condition. The dump may have encapsulated data such as the time at which the failure report was recorded. Other data describing the failure condition may be recorded as well.
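For illustration only, such an encapsulated dump and its header might be laid out in service processor firmware as a small fixed-size C structure. The field names, widths, and values below are assumptions made for this sketch; they are not taken from the patent or from any IBM firmware.

    #include <stdint.h>

    /* Hypothetical layout of an FFDC dump header written to SP flash memory.
     * All field names, sizes, and values are illustrative assumptions. */
    struct ffdc_dump_header {
        uint32_t magic;        /* marks the start of an FFDC dump region        */
        uint32_t flags;        /* bit 0: valid-dump indicator                   */
        uint32_t timestamp;    /* time at which the failure report was recorded */
        uint32_t fault_type;   /* e.g. application failure, kernel panic, reset */
        uint32_t item_count;   /* number of data items appended to the dump     */
        uint32_t total_bytes;  /* flash space consumed by the dump so far       */
    };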
Data processing system 200 may then be serviced for collection and analysis of the FFDC data. Alternatively, FFDC data may be communicated to a central repository, such as a central server functioning as a repository for the storage of FFDC data.
FIG. 4A is a flowchart of processing performed by FFDC interface 304 in accordance with a preferred embodiment of the present invention. SP subsystem 240 is initialized (step 402) and begins running in a stable state (step 404). SP subsystem 240 monitors for system errors or faults (step 406) until a fault condition is encountered. When a fault condition is encountered, SP subsystem 240 determines if the error is recoverable (step 408), that is, if the system kernel is still running and SP subsystem 240 remains stable. If the error is recoverable, SP subsystem 240 invokes an FFDC routine and a dump including error or fault data is captured by FFDC interface 304 of FIG. 3 (step 410). The FFDC routine invoked at step 410 is implemented as computer executable logic that runs on the system kernel. The dump is encapsulated and placed in a persistent storage, such as SP flash memory 245 of SP subsystem 240. Once the FFDC information is collected and stored, data processing system 200 is returned to a stable state (step 412), and SP subsystem 240 returns to monitoring for fault conditions.
If, however, the error or fault condition is determined to be unrecoverable, FFDC information may be collected in one of two manners in accordance with a preferred embodiment of the present invention. If the SP state is evaluated as suitable for collection of the FFDC information at step 414, a fault type evaluation is made (step 415). For example, the fault may be determined to be an unexpected application failure such as a critical application failure, a threshold exceeded failure, or the like. An FFDC routine then runs (step 416), after which SP subsystem 240 reboots (step 418). The system kernel is then restarted (step 420), and SP subsystem 240 returns to a stable state for monitoring system fault conditions (step 432).
In other situations, the SP may not be in a state suitable for collection of FFDC information at step 414. For example, SP subsystem 240 may hang, panic, or enter a non-responsive state. Alternatively, SP subsystem 240 may be unexpectedly brought down by a host initiated reset, e.g., a hard boot. An SP reboot is then performed and FFDC information collection is invoked during boot in accordance with a preferred embodiment of the present invention (step 426). FFDC collection performed in accordance with step 426 is implemented as a dump collection routine in the system boot code and is executed during a reboot in accordance with a preferred embodiment of the present invention as described more fully below. An attempt is then made to restart the system kernel (step 430), and SP subsystem 240 is returned to a stable state (step 432) for monitoring fault events upon a successful kernel restart.
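The branching just described, steps 408 through 432 of FIG. 4A, can be condensed into a short C sketch. The patent does not disclose source code, so every function below is a hypothetical stub with an invented name and signature standing in for service processor firmware behavior.

    #include <stdbool.h>

    /* Hypothetical stubs standing in for SP firmware services. */
    static bool fault_is_recoverable(int fault)     { (void)fault; return false; }
    static bool sp_state_allows_collection(void)    { return false; }
    static void run_ffdc_routine(int fault)         { (void)fault; }
    static void reboot_sp(bool collect_during_boot) { (void)collect_during_boot; }
    static void restart_kernel(void)                { }
    static void return_to_stable_state(void)        { }

    /* Outline of the FIG. 4A decision flow (steps 408 through 432). */
    void handle_fault(int fault)
    {
        if (fault_is_recoverable(fault)) {           /* step 408 */
            run_ffdc_routine(fault);                 /* step 410: dump via FFDC interface */
            return_to_stable_state();                /* step 412 */
        } else if (sp_state_allows_collection()) {   /* step 414 */
            run_ffdc_routine(fault);                 /* steps 415-416 */
            reboot_sp(false);                        /* step 418 */
            restart_kernel();                        /* step 420 */
            return_to_stable_state();                /* step 432 */
        } else {
            /* SP hung, panicked, or was hard-reset by the host: collection is
             * deferred to the boot dump collection routine (step 426). */
            reboot_sp(true);
            restart_kernel();                        /* step 430 */
            return_to_stable_state();                /* step 432 */
        }
    }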
FIG. 4B is a flowchart of FFDC collection performed during reboot of SP subsystem 240 in accordance with a preferred embodiment of the present invention. The processing steps of FIG. 4B correspond to processing step 426 of FIG. 4A. The collection of FFDC information is performed by a dump collection routine executed during system boot when data processing system 200 is not in a state to collect the FFDC information upon the fault condition. In a preferred embodiment, the FFDC collection routine is implemented as a firmware plugin that extends the operating system boot code. As such, the FFDC collection begins at SP subsystem 240 reboot (step 440). SP subsystem 240 hardware, such as SP processor 244, SP flash memory 245, and the like shown in FIG. 2, begins reinitialization (step 442). The dump collection firmware then begins execution (step 444). The dump collection firmware logic first evaluates the reset type for a boot dump collection reset type (step 446). In a preferred embodiment, a boot dump collection reset type may be identified as a unit check reset, a kernel panic reset, or a host-initiated reset. If the reset type is not evaluated as a boot dump collection reset type at step 446, the boot dump collection routine exits and SP subsystem 240 continues the boot process (step 470).
If the reset type is evaluated as a boot dump collection reset type at step 446, SP flash memory 245 is initialized for data storage (step 448). SP flash memory 245 is then evaluated to determine if a valid dump exists in SP flash memory 245 (step 450). For example, when a dump is written by SP subsystem 240, a valid dump indicator bit in the dump header may be asserted to indicate the dump is valid. Accordingly, an address of SP flash memory 245 may be read at step 450 for evaluation of a dump indicator bit and thus the presence or absence of a valid dump in SP flash memory 245. If a valid dump is identified in SP flash memory 245, the boot dump collection routine preserves the FFDC data dump (step 451), then exits, and SP subsystem 240 continues the boot process (step 470).
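A check of that kind might read the dump header from a fixed location in flash and test the indicator bit, roughly as sketched below. The magic value, bit position, and structure layout are assumptions for illustration, not the actual SP flash format.

    #include <stdint.h>
    #include <string.h>

    #define FFDC_DUMP_MAGIC 0x46464443u  /* assumed marker value             */
    #define FFDC_VALID_BIT  0x1u         /* assumed valid-dump indicator bit */

    /* Abbreviated form of the header sketched earlier. */
    struct ffdc_flash_header {
        uint32_t magic;
        uint32_t flags;
    };

    /* Hypothetical version of the step 450 check: does SP flash already hold
     * a valid dump that the boot dump collection routine must preserve? */
    int flash_holds_valid_dump(const uint8_t *dump_region)
    {
        struct ffdc_flash_header hdr;
        memcpy(&hdr, dump_region, sizeof hdr);  /* read header bytes from flash */
        return hdr.magic == FFDC_DUMP_MAGIC && (hdr.flags & FFDC_VALID_BIT) != 0;
    }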
A new dump is created and stored in SP flash memory 245 (step 452) if a valid dump is not identified at step 450. The boot dump collection routine then evaluates SP flash memory 245 for additional storage capacity (step 454).
Preferably, the boot dump collection routine collects or calculates dump data on a priority basis. Generation of a valid dump header, for example, may be assigned a higher priority than calculation of error detection data as a valid dump header is often more critical in a dump analysis than error detection values calculated on the dump data. Table A is an exemplary priority list that may be evaluated by the boot dump collection routine for determining additional data to add to a dump being generated and corresponding data item locations. Data item locations designated as “Calculated” are calculated by the boot dump collection routine logic.
TABLE A
Priority  Data Item      Location
1         Headers        Calculated
2         DRAM Buffers   DRAM
3         NVRAM Buffers  NVRAM
If the boot dump collection routine determines SP flash memory 245 has remaining capacity for storage of additional dump information at step 454, an evaluation is made to determine if any priority item remains for dump collection (step 456). The highest remaining priority item is read or calculated (step 458) if the boot dump collection routine determines any priority items remain to be added to the dump at step 456. The data item is then compressed (step 460) and an error detection code, such as a cyclic redundancy check (CRC) value, is calculated on the data item (step 462). The data item is then added to the dump in SP flash memory 245 (step 464), and the boot dump collection routine updates the dump header to indicate inclusion of the added item to the dump (step 466). The boot dump collection routine then returns to evaluate SP flash memory 245 for additional capacity for dump storage.
When either an evaluation is made that the capacity of SP flash memory 245 for dump storage has been consumed at step 454, or that no priority items remain to be added to the dump at step 456, the boot dump collection routine proceeds to finalize the dump (step 468). For example, the boot dump collection routine may complete the dump headers, calculate error detection values, and close the dump file. The boot dump collection routine then exits and SP subsystem 240 continues the boot process (step 470). Upon completion of SP subsystem 240 boot, system processing returns to step 430 of FIG. 4A.
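Taken together, steps 454 through 468 form a greedy loop over a priority table: while flash capacity remains, take the highest-priority remaining item, attach an error detection code, and append it to the dump. The sketch below shows one possible shape of that loop in C; the data structures, the omission of compression (step 460), and the use of CRC-32 are assumptions made for illustration rather than the patent's actual implementation.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One entry of a (hypothetical) Table A priority list. */
    struct dump_item {
        const char    *name;
        const uint8_t *src;   /* NULL means the item is calculated by the routine */
        size_t         len;
    };

    /* Minimal CRC-32 (reflected, polynomial 0xEDB88320), used here as the
     * error detection code computed at step 462. */
    static uint32_t crc32(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++) {
            crc ^= p[i];
            for (int b = 0; b < 8; b++)
                crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    /* Greedy collection loop (steps 454 through 468): append the highest-priority
     * remaining item while flash capacity allows, then return the space used so
     * the dump can be finalized. flash/flash_cap stand in for the SP flash dump
     * region; compression (step 460) is omitted from this sketch. */
    size_t collect_dump(uint8_t *flash, size_t flash_cap,
                        const struct dump_item *items, size_t n_items)
    {
        size_t used = 0;
        for (size_t i = 0; i < n_items; i++) {
            size_t need = items[i].len + sizeof(uint32_t);   /* payload + CRC     */
            if (used + need > flash_cap)                     /* step 454: capacity */
                break;
            if (items[i].src)                                /* step 458: read item */
                memcpy(flash + used, items[i].src, items[i].len);
            else
                memset(flash + used, 0, items[i].len);       /* calculated item    */
            uint32_t crc = crc32(flash + used, items[i].len);  /* step 462        */
            memcpy(flash + used + items[i].len, &crc, sizeof crc);
            used += need;                                    /* steps 464-466      */
        }
        return used;                                         /* finalize (step 468) */
    }

Attaching an error detection code per item, rather than one checksum over the whole dump, would keep earlier items verifiable even if a later write is interrupted, which fits the background's concern with tolerating failures that occur during the data collection phase; whether the patented routine actually does this per item is not stated.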
Thus, the boot dump collection routine provides a mechanism for collection of FFDC data when a system fault condition results in a system state where the service processor is unable to collect FFDC data without a reboot. By implementing FFDC collection during a service processor boot, FFDC information may be collected even if the system fault results in impairment of the data processing system to the extent that the system is inoperable, i.e. the system kernel is unable to be brought up after the system fault. For example, the FFDC information may be collected by execution of the boot dump collection routine at step 426 of FIG. 4A. In the event that the kernel is unable to be brought up at step 430 of FIG. 4A, the FFDC information collected by the boot dump collection routine may still be retrieved by manually removing SP subsystem 240 from data processing system 200. Thus, the FFDC information may be analyzed even in the event that the system fault renders the data processing system inoperable.
In accordance with another embodiment of the present invention, the first failure data capture system of the present invention facilitates dynamic reprioritization of data that is collected in a data dump. Often, the most significant data for properly evaluating a system fault cause is dependent on the fault type. For example, data retrieved from DRAM buffers may be the most critical data for properly evaluating a particular type of system failure, while data retrieved from an NVRAM buffer may be the most critical data for properly evaluating another type of system failure. In accordance with a preferred embodiment of the present invention, the items in the priority list described above may be dynamically prioritized dependent on an evaluated system fault type. For example, each of the priority items of the priority list shown in Table A may have separate index values associated with a reset type evaluated at step 446 of FIG. 4B. Alternatively, each of a plurality of priority lists may be associated with a particular type of system fault, such as a reset type evaluated by the first failure data capture system at step 446 of FIG. 4B. The first failure data capture system then collects items prioritized in accordance with the evaluated fault condition.
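One way to realize that reprioritization is a small lookup table that maps each reset type to its own ordering of the Table A items, as sketched below. The enum values and the particular orderings are invented for illustration and are not specified by the patent.

    /* Hypothetical reset types evaluated at step 446 of FIG. 4B. */
    enum reset_type {
        RESET_UNIT_CHECK,
        RESET_KERNEL_PANIC,
        RESET_HOST_INITIATED,
        RESET_TYPE_COUNT
    };

    /* Hypothetical identifiers for the Table A data items. */
    enum dump_item_id {
        ITEM_HEADERS,
        ITEM_DRAM_BUFFERS,
        ITEM_NVRAM_BUFFERS,
        ITEM_COUNT
    };

    /* One priority ordering per reset type; index 0 is collected first.
     * The orderings themselves are illustrative assumptions only. */
    static const enum dump_item_id priority_by_reset[RESET_TYPE_COUNT][ITEM_COUNT] = {
        [RESET_UNIT_CHECK]     = { ITEM_HEADERS, ITEM_NVRAM_BUFFERS, ITEM_DRAM_BUFFERS },
        [RESET_KERNEL_PANIC]   = { ITEM_HEADERS, ITEM_DRAM_BUFFERS,  ITEM_NVRAM_BUFFERS },
        [RESET_HOST_INITIATED] = { ITEM_HEADERS, ITEM_DRAM_BUFFERS,  ITEM_NVRAM_BUFFERS },
    };

    /* The boot dump collection routine would then walk priority_by_reset[reset]
     * instead of a single fixed ordering. */

The patent's other variant, a single Table A list whose items carry separate index values per reset type, would presumably achieve the same effect with one table and a sort of the items at collection time.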
As described, a first failure data capture system provides mechanisms for data dump collection of first failure data capture information for recoverable application failures and non-recoverable system failures where the service processor remains in a state suitable for data dump generation. Additionally, the first failure data capture system provides a mechanism implemented as a boot dump collection routine for the collection of FFDC information when a system fault condition results in a system state requiring execution of a system reboot. Firmware executed during boot of the service processor collects FFDC information prior to an attempt to restart the system kernel. Moreover, dynamic reprioritization of data items collected by the first failure data capture system is provided.
It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communications links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (8)

1. A method of generating a data dump in a data processing system, the method comprising the computer implemented steps of:
initializing a system boot of the data processing system;
executing a firmware that includes first failure data capture logic;
creating a data dump in a persistent storage of the data processing system;
evaluating a fault type of the data processing system; and
writing a plurality of data items to the data dump, wherein the writing of the data items is dynamically reprioritized dependent on the fault type.
2. A computer program product encoded in a computer recordable medium and operable for generating a data dump in a data processing system when executed by the data processing system, the computer program product comprising:
first instructions for evaluating a reset type of the data processing system;
second instructions for determining whether a valid data dump is maintained by the data processing system; and
third instructions, responsive to determining that a valid data dump is not maintained by the data processing system, for executing first failure data capture logic during a boot of the data processing system, including sub-instructions that obtain a priority item from a plurality of priority items to write to a data dump in the persistent storage.
3. The computer program product of claim 2, wherein the third instructions evaluate a capacity of a persistent storage.
4. The computer program product of claim 3, wherein the sub-instructions that obtain a priority item from a plurality of priority items to write to a data dump in the persistent storage are responsive to determining that additional capacity remains in the persistent storage.
5. The computer program product of claim 4, wherein the plurality of priority items are sequenced according to a reset type.
6. A computer program product encoded in a computer recordable medium and operable for generating a data dump in a data processing system when executed by the data processing system, the computer program product comprising:
first instructions for evaluating a reset type of the data processing system;
second instructions for determining whether a valid data dump is maintained by the data processing system;
third instructions, responsive to determining that a valid data dump is not maintained by the data processing system, for executing first failure data capture logic during a boot of the data processing system;
fourth instructions for evaluating a plurality of priority items in a priority list; and
fifth instructions, responsive to the fourth instructions evaluating each of the plurality of priority items as having been written to a data dump, that finalize the data dump for storage.
7. The computer program product of claim 6, wherein the fifth instructions, responsive to the data dump being finalized for storage, that terminate execution of the first failure data capture logic.
8. The computer program product of claim 7, further comprising:
sixth instructions, responsive to the first failure data capture logic being terminated, that attempt to restart a system kernel of the data processing system.
US10/821,045 2004-04-08 2004-04-08 Method, data processing system, and computer program product for collecting first failure data capture information Active 2025-10-15 US7308609B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/821,045 US7308609B2 (en) 2004-04-08 2004-04-08 Method, data processing system, and computer program product for collecting first failure data capture information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/821,045 US7308609B2 (en) 2004-04-08 2004-04-08 Method, data processing system, and computer program product for collecting first failure data capture information

Publications (2)

Publication Number Publication Date
US20050240826A1 US20050240826A1 (en) 2005-10-27
US7308609B2 (en) 2007-12-11

Family

ID=35137873

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/821,045 Active 2025-10-15 US7308609B2 (en) 2004-04-08 2004-04-08 Method, data processing system, and computer program product for collecting first failure data capture information

Country Status (1)

Country Link
US (1) US7308609B2 (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060184827A1 (en) * 2005-02-15 2006-08-17 Robinson Timothy A Method for responding to a control module failure
US20070180330A1 (en) * 2006-02-02 2007-08-02 Dell Products L.P. Systems and methods for management and capturing of optical drive failure errors
US20090222700A1 (en) * 2008-02-29 2009-09-03 Wade Carter Providing System Reset Information To Service Provider
US20090271602A1 (en) * 2008-04-29 2009-10-29 Ibm Corporation Method for Recovering Data Processing System Failures
US20090327679A1 (en) * 2008-04-23 2009-12-31 Huang David H Os-mediated launch of os-independent application
US20100082932A1 (en) * 2008-09-30 2010-04-01 Rothman Michael A Hardware and file system agnostic mechanism for achieving capsule support
US7788537B1 (en) * 2006-01-31 2010-08-31 Emc Corporation Techniques for collecting critical information from a memory dump
US20110179314A1 (en) * 2010-01-21 2011-07-21 Patel Nehal K Method and system of error logging
US8127099B2 (en) * 2006-12-26 2012-02-28 International Business Machines Corporation Resource recovery using borrowed blocks of memory
US20120072778A1 (en) * 2010-08-12 2012-03-22 Harman Becker Automotive Systems Gmbh Diagnosis system for removable media drive
US8381014B2 (en) 2010-05-06 2013-02-19 International Business Machines Corporation Node controller first failure error management for a distributed system
US20130111264A1 (en) * 2010-07-06 2013-05-02 Mitsubishi Electric Corporation Processor device and program
US20140173357A1 (en) * 2012-12-18 2014-06-19 HGST Netherlands B.V. Salvaging event trace information in power loss interruption scenarios
US8812916B2 (en) 2011-06-02 2014-08-19 International Business Machines Corporation Failure data management for a distributed computer system
US8984336B1 (en) * 2012-02-20 2015-03-17 Symantec Corporation Systems and methods for performing first failure data captures
US20160070486A1 (en) * 2014-09-04 2016-03-10 HGST Netherlands B.V. Debug data saving in host memory on pcie solid state drive
US20160147605A1 (en) * 2014-11-26 2016-05-26 Inventec (Pudong) Technology Corporation System error resolving method
US9424120B1 (en) * 2016-01-29 2016-08-23 International Business Machines Corporation Prioritizing first failure data capture (FFDC) data for analysis
US20160301609A1 (en) * 2015-04-07 2016-10-13 Exinda Networks Pty, Ltd. Method and system for triggering augmented data collection on a network based on traffic patterns
US9916192B2 (en) 2012-01-12 2018-03-13 International Business Machines Corporation Thread based dynamic data collection
US10025650B2 (en) 2015-09-17 2018-07-17 International Business Machines Corporation Determining a trace of a system dump
US10367752B2 (en) 2016-11-18 2019-07-30 International Business Machines Corporation Data packet management in a memory constrained environment
US10467101B2 (en) * 2014-05-20 2019-11-05 Bull Sas Method of obtaining information stored in processing module registers of a computer just after the occurrence of a fatal error

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1859353A4 (en) * 2005-03-07 2012-02-22 Intel Corp Self-adaptive multicast file transfer protocol
US8108880B2 (en) * 2007-03-07 2012-01-31 International Business Machines Corporation Method and system for enabling state save and debug operations for co-routines in an event-driven environment
US7725770B2 (en) * 2007-04-01 2010-05-25 International Business Machines Corporation Enhanced failure data collection system apparatus and method
US8250402B2 (en) * 2008-03-24 2012-08-21 International Business Machines Corporation Method to precondition a storage controller for automated data collection based on host input
US8099630B2 (en) * 2008-07-29 2012-01-17 International Business Machines Corporation Hardware diagnostics determination during initial program loading
US8566798B2 (en) * 2008-10-15 2013-10-22 International Business Machines Corporation Capturing context information in a currently occurring event
US9235404B2 (en) 2012-06-27 2016-01-12 Microsoft Technology Licensing, Llc Firmware update system
US8972973B2 (en) 2012-06-27 2015-03-03 Microsoft Technology Licensing, Llc Firmware update discovery and distribution
US9110761B2 (en) 2012-06-27 2015-08-18 Microsoft Technology Licensing, Llc Resource data structures for firmware updates
US9032423B2 (en) 2013-06-21 2015-05-12 Microsoft Technology Licensing, Llc Dependency based configuration package activation
US20180336086A1 (en) * 2016-01-29 2018-11-22 Hewlett Packard Enterprise Development Lp System state information monitoring
CN115061759A (en) * 2022-05-24 2022-09-16 联想(北京)有限公司 Data acquisition method, related device and storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4932028A (en) 1988-06-21 1990-06-05 Unisys Corporation Error log system for self-testing in very large scale integrated circuit (VLSI) units
US5119377A (en) 1989-06-16 1992-06-02 International Business Machines Corporation System and method for software error early detection and data capture
US5128885A (en) 1990-02-23 1992-07-07 International Business Machines Corporation Method for automatic generation of document history log exception reports in a data processing system
US6182243B1 (en) 1992-09-11 2001-01-30 International Business Machines Corporation Selective data capture for software exception conditions
US5860115A (en) 1993-06-08 1999-01-12 International Business Machines Corporation Requesting a dump of information stored within a coupling facility, in which the dump includes serviceability information from an operating system that lost communication with the coupling facility
US6148415A (en) 1993-06-11 2000-11-14 Hitachi, Ltd. Backup switching control system and method
US5696897A (en) * 1994-01-31 1997-12-09 Sun Microsystems, Inc. Method and apparatus for a multi-layer system quiescent suspend and resume operation
US5463768A (en) 1994-03-17 1995-10-31 General Electric Company Method and system for analyzing error logs for diagnostics
US5884019A (en) 1995-08-07 1999-03-16 Fujitsu Limited System and method for collecting dump information in a multi-processor data processing system
US6502208B1 (en) * 1997-03-31 2002-12-31 International Business Machines Corporation Method and system for check stop error handling
US6279120B1 (en) * 1997-07-25 2001-08-21 Siemens Aktiengesellschaft Method for storing computer status data given a malfunction that requires a subsequent restarting of the computer
US6105150A (en) 1997-10-14 2000-08-15 Fujitsu Limited Error information collecting method and apparatus
US6775698B1 (en) * 1997-12-11 2004-08-10 Cisco Technology, Inc. Apparatus and method for downloading core file in a network device
JP2000137630A (en) 1998-11-04 2000-05-16 Nec Corp Memory dump system and method therefor
US6526524B1 (en) 1999-09-29 2003-02-25 International Business Machines Corporation Web browser program feedback system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
IBM Technical Disclosure Bulletin, "Error Log Analysis", vol. 23, No. 6, Nov. 1980, pp. 2493-2504.
IBM Technical Disclosure Bulletin, "Multiple Address Space First Failure Data Capture", vol. 27, No. 06A, Jun. 1994, pp. 625-626.
IBM Technical Disclosure Bulletin, "Systems Network Architecture Distribution Services Agent In-Progress Queue Methods and Recovery", vol. 38, No. 02, Feb. 1995, pp. 465-472.
IBM Technical Disclosure Bulletin, "Tracing, Formatting and Storage Referencing in an MVS Multitasking Online System", vol. 29, No. 3, Aug. 1986, pp. 1224-1227.

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7406624B2 (en) * 2005-02-15 2008-07-29 General Motors Corporation Method for responding to a control module failure
US20060184827A1 (en) * 2005-02-15 2006-08-17 Robinson Timothy A Method for responding to a control module failure
US7788537B1 (en) * 2006-01-31 2010-08-31 Emc Corporation Techniques for collecting critical information from a memory dump
US20070180330A1 (en) * 2006-02-02 2007-08-02 Dell Products L.P. Systems and methods for management and capturing of optical drive failure errors
US7971101B2 (en) 2006-02-02 2011-06-28 Dell Products L.P. Systems and methods for management and capturing of optical drive failure errors
US8127099B2 (en) * 2006-12-26 2012-02-28 International Business Machines Corporation Resource recovery using borrowed blocks of memory
US20090222700A1 (en) * 2008-02-29 2009-09-03 Wade Carter Providing System Reset Information To Service Provider
US8127179B2 (en) * 2008-02-29 2012-02-28 Arris Group, Inc. Providing system reset information to service provider
US20090327679A1 (en) * 2008-04-23 2009-12-31 Huang David H Os-mediated launch of os-independent application
US8539200B2 (en) * 2008-04-23 2013-09-17 Intel Corporation OS-mediated launch of OS-independent application
US7818622B2 (en) * 2008-04-29 2010-10-19 International Business Machines Corporation Method for recovering data processing system failures
US20090271602A1 (en) * 2008-04-29 2009-10-29 Ibm Corporation Method for Recovering Data Processing System Failures
US20100082932A1 (en) * 2008-09-30 2010-04-01 Rothman Michael A Hardware and file system agnostic mechanism for achieving capsule support
US8631186B2 (en) 2008-09-30 2014-01-14 Intel Corporation Hardware and file system agnostic mechanism for achieving capsule support
US8990486B2 (en) 2008-09-30 2015-03-24 Intel Corporation Hardware and file system agnostic mechanism for achieving capsule support
US20110179314A1 (en) * 2010-01-21 2011-07-21 Patel Nehal K Method and system of error logging
US8122291B2 (en) * 2010-01-21 2012-02-21 Hewlett-Packard Development Company, L.P. Method and system of error logging
US8381014B2 (en) 2010-05-06 2013-02-19 International Business Machines Corporation Node controller first failure error management for a distributed system
US20130111264A1 (en) * 2010-07-06 2013-05-02 Mitsubishi Electric Corporation Processor device and program
US8583960B2 (en) * 2010-07-06 2013-11-12 Mitsubishi Electric Corporation Processor device and program
US8661288B2 (en) * 2010-08-12 2014-02-25 Harman Becker Automotive Systems Gmbh Diagnosis system for removable media drive
US20120072778A1 (en) * 2010-08-12 2012-03-22 Harman Becker Automotive Systems Gmbh Diagnosis system for removable media drive
US8812916B2 (en) 2011-06-02 2014-08-19 International Business Machines Corporation Failure data management for a distributed computer system
US9916192B2 (en) 2012-01-12 2018-03-13 International Business Machines Corporation Thread based dynamic data collection
US10740166B2 (en) 2012-01-12 2020-08-11 International Business Machines Corporation Thread based dynamic data collection
US8984336B1 (en) * 2012-02-20 2015-03-17 Symantec Corporation Systems and methods for performing first failure data captures
US20140173357A1 (en) * 2012-12-18 2014-06-19 HGST Netherlands B.V. Salvaging event trace information in power loss interruption scenarios
US9690642B2 (en) * 2012-12-18 2017-06-27 Western Digital Technologies, Inc. Salvaging event trace information in power loss interruption scenarios
US10467101B2 (en) * 2014-05-20 2019-11-05 Bull Sas Method of obtaining information stored in processing module registers of a computer just after the occurrence of a fatal error
US20160070486A1 (en) * 2014-09-04 2016-03-10 HGST Netherlands B.V. Debug data saving in host memory on pcie solid state drive
US10474618B2 (en) * 2014-09-04 2019-11-12 Western Digital Technologies, Inc. Debug data saving in host memory on PCIE solid state drive
US20160147605A1 (en) * 2014-11-26 2016-05-26 Inventec (Pudong) Technology Corporation System error resolving method
US9565109B2 (en) * 2015-04-07 2017-02-07 Exinda Networks Pty, Ltd. Method and system for triggering augmented data collection on a network based on traffic patterns
US20160301609A1 (en) * 2015-04-07 2016-10-13 Exinda Networks Pty, Ltd. Method and system for triggering augmented data collection on a network based on traffic patterns
US10025650B2 (en) 2015-09-17 2018-07-17 International Business Machines Corporation Determining a trace of a system dump
US10169131B2 (en) 2015-09-17 2019-01-01 International Business Machines Corporation Determining a trace of a system dump
US9612895B1 (en) 2016-01-29 2017-04-04 International Business Machines Corporation Method for prioritizing first failure data capture (FFDC) data for analysis
US9424120B1 (en) * 2016-01-29 2016-08-23 International Business Machines Corporation Prioritizing first failure data capture (FFDC) data for analysis
US10367752B2 (en) 2016-11-18 2019-07-30 International Business Machines Corporation Data packet management in a memory constrained environment
US11012368B2 (en) 2016-11-18 2021-05-18 International Business Machines Corporation Data packet management in a memory constrained environment

Also Published As

Publication number Publication date
US20050240826A1 (en) 2005-10-27

Similar Documents

Publication Publication Date Title
US7308609B2 (en) Method, data processing system, and computer program product for collecting first failure data capture information
US6834363B2 (en) Method for prioritizing bus errors
US6748550B2 (en) Apparatus and method for building metadata using a heartbeat of a clustered system
US7533292B2 (en) Management method for spare disk drives in a raid system
US8359495B2 (en) System and method for using failure casting to manage failures in computer systems
US6976197B2 (en) Apparatus and method for error logging on a memory module
US9684554B2 (en) System and method for using failure casting to manage failures in a computed system
US7328376B2 (en) Error reporting to diagnostic engines based on their diagnostic capabilities
US6665813B1 (en) Method and apparatus for updateable flash memory design and recovery with minimal redundancy
US7711991B2 (en) Error monitoring of partitions in a computer system using partition status indicators
US8201019B2 (en) Data storage device in-situ self test, repair, and recovery
US7870441B2 (en) Determining an underlying cause for errors detected in a data processing system
US7574621B2 (en) Method and system for identifying and recovering a file damaged by a hard drive failure
US10275330B2 (en) Computer readable non-transitory recording medium storing pseudo failure generation program, generation method, and generation apparatus
US20050097141A1 (en) Autonomic filesystem recovery
EP2329384B1 (en) Memory management techniques selectively using mitigations to reduce errors
EP2312443A2 (en) Information processing apparatus, method of controlling information processing apparatus and control program
US6832342B2 (en) Method and apparatus for reducing hardware scan dump data
US8537662B2 (en) Global detection of resource leaks in a multi-node computer system
US8711684B1 (en) Method and apparatus for detecting an intermittent path to a storage system
US20060168165A1 (en) Provisional application management with automated acceptance tests and decision criteria
CN118377656B (en) System unrecoverable fault processing method and device, electronic equipment and storage medium
CN118656307B (en) Fault detection method, server, medium and product of baseboard management controller
Koerner et al. The z990 first error data capture concept
JPH0830473A (en) Process restoring method

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DICKENSON, MARC ALAN;JACOBS, BRENT WILLIAM;LIM, MICHAEL YOUHOUR;REEL/FRAME:014595/0912;SIGNING DATES FROM 20040325 TO 20040330

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 12