
CN118295864B - Linux operating system hardware error identification method and system - Google Patents

Linux operating system hardware error identification method and system

Info

Publication number
CN118295864B
CN118295864B CN202410718889.0A CN202410718889A CN118295864B CN 118295864 B CN118295864 B CN 118295864B CN 202410718889 A CN202410718889 A CN 202410718889A CN 118295864 B CN118295864 B CN 118295864B
Authority
CN
China
Prior art keywords
data set
hardware
information
hardware error
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410718889.0A
Other languages
Chinese (zh)
Other versions
CN118295864A (en
Inventor
车烈权
石光银
蔡卫卫
高传集
孙思清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202410718889.0A
Publication of CN118295864A
Application granted
Publication of CN118295864B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and a system for identifying hardware errors of a Linux operating system, and relates to the technical field of operating systems. In order to identify server hardware errors efficiently, the method comprises the following steps: acquiring hardware error log information and storing it in a data set; formatting and standardizing the hardware error log information of the data set to obtain a TXTLINE-format data set; manually or automatically marking key information of the hardware error logs in the data set, and defining a regular expression and training an NER model based on the marking information; and collecting hardware error log information in real time, automatically identifying the error format in the log information by using the regular expression, and automatically identifying the specific error type of that error format by using the NER model. The invention realizes rapid identification and marking of hardware errors, makes it convenient for operation and maintenance personnel to resolve hardware faults quickly, improves the stability and reliability of the server, and reduces the economic loss and maintenance cost caused by hardware faults.

Description

Linux operating system hardware error identification method and system
Technical Field
The invention relates to the technical field of operating systems, in particular to a method and a system for identifying hardware errors of a Linux operating system.
Background
In modern operating systems and cloud computing architectures, the mechanism for identifying and handling hardware errors is critical to system stability and reliability. If operation and maintenance personnel can identify hardware errors quickly, the faulty hardware can be dealt with online in a timely manner, which greatly improves the stability and reliability of cloud computing services.
Hardware error identification in an operating system currently relies mainly on manually checking panels, manually analyzing logs, and analyzing logs with automated script tools. Manual panel checks and log analysis depend on human monitoring: a technician must check regularly, so important error prompts are easily missed; the information that a panel or log can display is limited, typically only a brief error code or status indication, without detailed context or analysis of the error; and people may misinterpret error codes or indicator lights, particularly when facing complex or rare hardware problems. Automated scripts and tools relieve the burden of manual operation by identifying error information in logs or monitoring hardware status through preset rules, but they have their own limitations. Limited accuracy: the accuracy of an automated tool depends on how complete the preset rules are, and a new or complex hardware problem cannot be identified accurately if the rules are not updated in time. Configuration and maintenance cost: expertise is required to configure and maintain the rules, which must be updated continuously as the system environment changes. Limited response: known errors can be identified quickly, but automated tools have limited ability to handle unknown errors or to provide detailed diagnostic solutions.
In actual operation and maintenance, analysis of hardware errors mainly depends on built-in mechanisms such as EDAC (Error Detection and Correction), mcelog/rasdaemon, SMART (Self-Monitoring, Analysis, and Reporting Technology), PCIe AER (Advanced Error Reporting), kernel log analysis, and the NMI watchdog to monitor, identify and report various hardware problems. Each of these methods has advantages and disadvantages, but they share common problems, including low recognition accuracy, incomplete information, and inefficiency caused by excessive dependence on manual work.
In the current cloud computing and big data age, the stability of servers and data centers is of paramount importance. However, hardware errors become one of the major threats to this stability. Hardware errors, including but not limited to memory failures, CPU failures, hard disk failures, etc., often result in downtime or severe performance degradation of the system, bringing significant economic loss and service interruption to businesses and users. Although existing hardware error handling techniques are capable of recognizing and preventing these errors to some extent, their accuracy and efficiency remain to be improved due to the lack of intelligent error analysis and prediction capabilities.
The consequences of hardware errors are particularly serious. In a cloud computing environment, downtime of a single server may affect tens of thousands of end users, and in severe cases may even cause a temporary interruption of an entire service or application. In addition, hardware errors increase the operation and maintenance cost of enterprises, since a great deal of manpower and time is required to repair faults, which adversely affects the normal operation and financial condition of the enterprise. Therefore, how to identify server hardware errors more accurately and efficiently without affecting other services that are running normally has become a difficult problem that cloud vendors urgently need to solve.
Disclosure of Invention
In view of the needs and shortcomings of the prior art, the invention provides a method and a system for identifying hardware errors of a Linux operating system, which realize rapid identification and marking of hardware errors, make it convenient for operation and maintenance personnel to resolve hardware faults quickly, and improve the stability and reliability of the server.
In a first aspect, the present invention provides a method for identifying hardware errors of a Linux operating system, which solves the above technical problems by adopting the following technical scheme:
A Linux operating system hardware error identification method comprises the following steps:
S1, acquiring hardware error log information and storing the hardware error log information into a data set;
S2, formatting and standardizing the hardware error log information of the data set to obtain a TXTLINE-format data set, wherein each row represents an independent hardware error event;
S3, manually or automatically marking key information of a hardware error log in the data set, defining a regular expression and training an NER model based on the marking information;
S4, acquiring hardware error log information in real time, automatically identifying an error format in the log information by using a regular expression, and automatically identifying a specific error type of the error format in the log information by using a NER model.
Optionally, the hardware error log information is acquired in the following ways: simulating various hardware error scenarios by using a memory error injection module to generate hardware error log information;
extracting hardware error log information from an operating system kernel log and a BMC maintenance log by using a method for extracting readable character strings from an executable file;
collecting server hardware error log information of a Linux operating system by using third-party software;
and the hardware error log information acquired in these ways is stored in the data set.
Optionally, when step S3 is executed, the key information of the hardware error logs in the data set is marked manually through a visual interface, or is marked automatically by using a predefined automatic marking mapping template, in which case the accuracy of the automatic marking result is confirmed manually; the marked data set is then divided into a training data set and a test data set;
based on the marking information:
a) Defining a regular expression according to the marking result of the training data set, scanning and analyzing the testing data set by using log analysis software supporting the regular expression, automatically marking the contents as key information of a hardware error log by the log analysis software when the contents matched with the regular expression appear in the testing data set, manually comparing and confirming the accuracy of the identification result, and manually optimizing the regular expression or outputting the regular expression according to the comparison result to execute the step S4;
b) Training a pre-training model by using the marked training data set to obtain an NER model, testing the NER model which is completed training by using the unmarked test data set, manually comparing and confirming the accuracy of the identification result, and optimizing the NER model or outputting the NER model according to the comparison result to execute the step S4.
Preferably, the key information of the hardware error log includes the type of error, the hardware component, the processing actions taken by the Linux operating system, and the error code.
Optionally, after executing step S4, manually auditing the recognition results of the regular expression and the NER model, manually correcting when the recognition results have errors, and then storing the correction results and the corresponding hardware error logs into the data set to optimize the regular expression and the NER model.
In a second aspect, the present invention provides a Linux operating system hardware error identification system, which solves the technical problems as follows:
a Linux operating system hardware error identification system, comprising:
The information acquisition module is used for acquiring hardware error log information and storing the hardware error log information into a data set;
The information processing module is used for formatting and standardizing the hardware error log information of the data set to obtain the TXTLINE-format data set, wherein each row represents an independent hardware error event;
the information marking module is used for assisting personnel to manually or automatically mark key information of the hardware error log in the data set;
the definition module is used for defining a regular expression according to the marking information of the information marking module;
the training module is used for training the NER model according to the manually marked data set;
The error identification module is used for automatically identifying the error format in the log information by utilizing the regular expression aiming at the hardware error log information acquired in real time, and automatically identifying the specific error type of the error format in the log information by utilizing the NER model.
Optionally, the hardware error log information is acquired in the following ways: simulating various hardware error scenarios by using a memory error injection module to generate hardware error log information;
extracting hardware error log information from an operating system kernel log and a BMC maintenance log by using a method for extracting readable character strings from an executable file;
collecting server hardware error log information of a Linux operating system by using third-party software;
the information acquisition module acquires the hardware error log information and stores the hardware error log information in a data set.
Optionally, the information marking module provides a visual interface to assist personnel in manually marking the key information of the hardware error logs in the data set, or a predefined automatic marking mapping template is provided in the information marking module, and the key information of the hardware error logs in the data set is marked automatically through the automatic marking mapping template;
the hardware error recognition system further comprises a dividing module for dividing the data set into a training data set and a test data set;
The definition module defines a regular expression according to the marking result of the training data set, and invokes log analysis software supporting the regular expression to scan and analyze the testing data set, when the content matched with the regular expression appears in the testing data set, the log analysis software automatically marks the content as key information of a hardware error log, personnel compares and confirms the accuracy of the identification result through a visual interface provided by the information marking module, and manually optimizes the regular expression or outputs the regular expression for subsequent use according to the comparison result;
The training module trains the pre-training model by using the manually marked training data set to obtain the NER model, then inputs the data of the test data set into the NER model which is completed training, compares and confirms the accuracy of the NER model identification result through the visual interface provided by the information marking module, and optimizes the NER model or outputs the NER model for subsequent use according to the comparison result.
Preferably, the key information of the hardware error log includes the type of error, the hardware component, the processing actions taken by the Linux operating system, and the error code.
Optionally, for the automatic recognition result of the error recognition module, personnel check the accuracy of the recognition result through a visual interface provided by the information marking module, and perform manual correction when the recognition result is in error, and then store the correction result and the corresponding hardware error log information into a data set to perform optimization of the regular expression and the NER model.
Compared with the prior art, the method and the system for identifying the hardware errors of the Linux operating system have the beneficial effects that:
1. The invention can realize the rapid identification and marking of hardware errors, is convenient for operation and maintenance personnel to rapidly solve hardware faults, improves the stability and reliability of the server, and reduces economic loss and maintenance cost caused by hardware faults;
2. According to the method, firstly, a regular expression is defined and a NER model is trained based on key information of a large number of historical hardware error log information, then the accuracy of the regular expression and the NER model is tested, optimization of the regular expression and the NER model is carried out according to the accuracy, and rapid identification and marking of hardware errors are further improved.
Drawings
FIG. 1 is a flow chart of a method according to a first embodiment of the invention;
FIG. 2 is a block diagram of a module connection according to a second embodiment of the present invention.
Detailed Description
In order to make the technical scheme, the technical problems to be solved and the technical effects of the invention more clear, the technical scheme of the invention is clearly and completely described below by combining specific embodiments.
Embodiment one: with reference to fig. 1, this embodiment proposes a method for identifying hardware errors of a Linux operating system, which includes the following steps:
S1, acquiring hardware error log information and storing the hardware error log information into a data set.
The specific way of obtaining the hardware error log information in this step is as follows:
(1) Simulating various hardware error scenarios by using a memory error injection module to generate hardware error log information. It should be added that the memory error injection module, namely the EINJ module, is mainly used in development and testing environments. Through the EINJ module, developers and testers can manually inject errors into the system to simulate the occurrence of hardware errors. This enables developers to verify how the system responds to various hardware errors, including the validity of the error detection, reporting, and recovery mechanisms. EINJ relies on a specific ACPI table, called the EINJ table (Error Injection Table). The table defines the parameters and methods of hardware error injection, including the type of error to inject, the target, and the trigger mechanism. By parsing the EINJ table and providing a corresponding interface, the Linux kernel allows applications or tools in user space to trigger error injection. EINJ is an important component of the APEI (ACPI Platform Error Interface) specification and provides a mechanism for simulating hardware errors, thereby helping developers and testers verify and improve the logic with which the Linux system handles hardware errors. The memory error injection module can simulate not only common hardware fault types but also rare or complex fault conditions, so as to ensure the breadth and depth of the training data; the hardware error injection supports CPU, memory and PCI errors;
(2) Extracting hardware error log information from an operating system kernel log and a BMC maintenance log by using a method for extracting readable character strings from an executable file. It should be added that extracting readable strings from an executable file typically involves parsing the internal structure of the file, and commonly includes: ① Using a tool: for an executable file of the Windows platform (PE format), the pefile library of Python can be used to extract string resources; pefile is an open-source tool that helps read and parse the structure of a PE file and extract the string resources in it; ② Writing a script: a script can be written to open the executable file and read its contents in binary mode, and then a string extraction function or a regular expression is used to find and extract readable strings, which usually requires some knowledge of the file format in order to locate and extract the string data correctly; ③ Using file operations: when processing a text file, a file operation function (such as fgets()) can be used to read the file content line by line and store it as strings; this method is suitable for files with simple text content and does not involve complex format parsing; ④ Converting the data type: if a string that needs to be converted to a particular data type (e.g., a number) is encountered during extraction, this can be done with a conversion function provided by the programming language; in Python, for example, functions such as int() and float() convert a string to the corresponding numeric type; ⑤ Third-party tools: in addition to the programming methods, third-party tools may be used to extract strings from executable files; these tools typically have a graphical user interface and provide one-click extraction, which suits users without a programming background.
Note that file encoding must be taken into account when extracting strings, since a mismatched encoding may cause the extracted strings to be displayed incorrectly;
(3) Collecting server hardware error log information of the Linux operating system by using third-party software. Common methods are: ① Using the rsyslog service: rsyslog is a powerful logging service in Linux systems that can be configured to send system logs to a remote log system, such as Graylog, via the syslog protocol. The collection port and destination address can be specified by editing the rsyslog configuration file, for example configuring rsyslog to collect logs on UDP port 1515 and send them to a Graylog server; ② Using the logrotate tool: logrotate is a log management tool that manages the size and number of log files. By configuring logrotate, a rotation policy for log files can be set, for example rotating by time or size and specifying the number of log files to retain, which helps ensure that log files do not occupy too much disk space and that old log files are cleaned up periodically; ③ Collecting application logs: in addition to system logs, applications may generate their own log files that record program running states, error information and so on. The location of these log files depends on the configuration of the application, and the collection of application logs is configured as needed, for example by using a log collection tool such as Filebeat to monitor the logs and forward them to a centralized log processing system. In addition to the above methods, specialized log analysis tools such as the Elastic Stack (previously known as the ELK Stack, comprising Elasticsearch, Logstash and Kibana) or Splunk, which provide powerful log collection, storage, search and analysis functions, may also be used;
the hardware error log information obtained by the modes (1), (2) and (3) is stored in a data set.
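As an illustration of the script-based approach in mode (2), the following minimal Python sketch pulls runs of printable characters out of binary data, in the spirit of the Unix `strings` utility. The function name `extract_strings`, the minimum length of 4 and the sample bytes are assumptions made for this example and are not part of the disclosed method.

```python
import re

def extract_strings(data, min_len=4):
    """Return runs of printable ASCII characters at least min_len long."""
    # Printable ASCII spans 0x20 (space) through 0x7e (~).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Matched bytes are pure ASCII, so decoding cannot fail here; other
    # encodings (e.g. UTF-16) would need a different pattern.
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Hypothetical binary blob mixing log text with non-printable bytes.
sample = (b"\x00\x01EDAC MC0: 1 CE memory read error\x00\xff\x02ab\x00"
          b"mce: [Hardware Error]\x00")
strings_found = extract_strings(sample)
for s in strings_found:
    print(s)
```

The short run `ab` is discarded because it is below the minimum length, which is the usual way such tools suppress noise from random byte sequences.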
S2, formatting and standardizing the hardware error log information of the data set to obtain the TXTLINE-format data set, wherein each row represents an independent hardware error event.
The implementation of this step specifically requires the following operations: ① Log parsing: parsing the hardware error log information acquired in step S1 by using a log parsing tool (such as the grok plugin of Logstash) and extracting key information, for example by defining specific patterns to match key fields in the hardware error log such as the timestamp, error code and hardware component name; ② Formatting: formatting the parsed fields as needed, such as converting the timestamp format and unifying the representation of error codes, which can be implemented through the configuration file of the log processing tool to ensure that every log entry has a consistent format; ③ Standardized storage: storing the parsed and formatted log data in a suitable data structure, such as a database table or a data frame (DataFrame), during which the data may be further cleaned to remove invalid or incomplete records; ④ Output in TXTLINE format: finally, converting the processed data into the TXTLINE format and ensuring that each line is a separate hardware error event, which can be accomplished by writing a script or by using a data processing tool (such as the pandas library of Python); ⑤ Verification and testing: after the above steps are completed, verifying and testing the output data set to ensure its accuracy and integrity, including checking the consistency of the data, the integrity of each error event, and whether the predetermined format requirements are met; ⑥ Documentation: recording the whole process and the configuration files used, for future auditing and reproduction. In general, through the above operations, the hardware error log information obtained in step S1 is converted into a formatted and standardized TXTLINE-format data set, which facilitates subsequent data analysis and problem diagnosis.
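The parsing, formatting, cleaning and one-event-per-line output described for step S2 can be sketched in a few lines of Python. This is only an illustration: the raw log layout, the tab-separated output, and the names `RAW_LINE` and `normalize` are assumptions, since the text does not specify the exact field layout of the TXTLINE format.

```python
import re
from datetime import datetime

# Hypothetical raw layout: "<YYYY/MM/DD HH:MM:SS> <host> kernel: <message>"
RAW_LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: (?P<msg>.+)$"
)

def normalize(raw_lines):
    """Yield one normalized text line per parsable hardware error event."""
    for line in raw_lines:
        m = RAW_LINE.match(line.strip())
        if m is None:
            continue  # cleaning: drop records that do not fit the layout
        # Formatting: convert the timestamp to ISO 8601.
        ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S")
        # Formatting: collapse repeated whitespace inside the message.
        msg = " ".join(m["msg"].split())
        yield f"{ts.isoformat()}\t{m['host']}\t{msg}"

raw = [
    "2024/06/03 10:15:02 node01 kernel: EDAC MC0: 1 CE   memory read error on DIMM_A1",
    "garbage line",
    "2024/06/03 10:15:07 node01 kernel: mce: [Hardware Error]: Machine check events logged",
]
txtline = list(normalize(raw))
print("\n".join(txtline))
```

Each output line stands for exactly one event, so downstream marking and training steps can treat the data set as a plain list of lines.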
S3, manually or automatically marking key information of a hardware error log in the data set, defining a regular expression based on the marking information and training an NER model.
When this step is executed, the key information of the hardware error logs in the data set is either marked manually through a visual interface or marked automatically by using a predefined automatic marking mapping template, where the key information of a hardware error log includes the error type, the hardware component, the processing action taken by the Linux operating system, and the error code. The accuracy of the automatic marking result is confirmed manually, and the marked data set is then divided into a training data set and a test data set at a ratio of 7:3 or 8:2.
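A 7:3 (or 8:2) division of the marked data set can be sketched as follows. The function name `split_dataset`, the fixed seed and the placeholder records are assumptions for illustration; the text does not state whether the split is random.

```python
import random

def split_dataset(records, train_ratio=0.7, seed=42):
    """Shuffle a copy of records and split it into (train, test)."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Placeholder records standing in for marked hardware error events.
labeled = [f"event-{i}" for i in range(10)]
train, test = split_dataset(labeled, train_ratio=0.7)
print(len(train), len(test))
```

Passing `train_ratio=0.8` gives the 8:2 variant mentioned in the text.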
The predefined automatic annotation mapping template is as follows:
[
  {% for entity in input %}
  {
    "start_offset": {{ entity.start_offset }},
    "end_offset": {{ entity.end_offset }},
    "label": "{{ entity.label }}"
  }{% if not loop.last %},{% endif %}
  {% endfor %}
]
After the automatic marking service is implemented, the return value is returned according to the following format:
[{
'start_offset': "",
'end_offset': "",
'label': ""
}]
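To make the return format concrete, the sketch below shows a hypothetical automatic marking function that emits spans with the `start_offset`, `end_offset` and `label` keys shown above. The label names, the patterns in `LABEL_PATTERNS` and the sample log line are assumptions for illustration; the actual marking service is not described in that detail.

```python
import re

# Hypothetical label-to-pattern table; the labels mirror some of the key
# information named in the text (error type, hardware component, error code).
LABEL_PATTERNS = {
    "ERROR_TYPE": re.compile(r"\b(?:corrected|uncorrected) error\b", re.IGNORECASE),
    "COMPONENT": re.compile(r"\bDIMM_\w+\b"),
    "ERROR_CODE": re.compile(r"\b0x[0-9a-fA-F]+\b"),
}

def auto_label(text):
    """Return character-offset spans in the start_offset/end_offset/label format."""
    spans = []
    for label, pattern in LABEL_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append(
                {"start_offset": m.start(), "end_offset": m.end(), "label": label}
            )
    # Sort by position so the output follows the order of the log line.
    return sorted(spans, key=lambda s: s["start_offset"])

line = "EDAC MC0: corrected error on DIMM_A1 status 0x9c00"
spans = auto_label(line)
for s in spans:
    print(s["label"], repr(line[s["start_offset"]:s["end_offset"]]))
```

Offsets are end-exclusive here, so `line[start_offset:end_offset]` recovers the marked text directly.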
based on the marking information:
a) Defining a regular expression according to the marking result of the training data set, scanning and analyzing the testing data set by using log analysis software supporting the regular expression, automatically marking the contents as key information of a hardware error log by the log analysis software when the contents matched with the regular expression appear in the testing data set, manually comparing and confirming the accuracy of the identification result, and manually optimizing the regular expression or outputting the regular expression according to the comparison result to execute the step S4;
b) Training a pre-trained model by using the marked training data set to obtain an NER model, testing the trained NER model by using the unmarked test data set, comparing and confirming the accuracy of the recognition results of the NER model, and, according to the comparison result, either optimizing the NER model or outputting the NER model for use in step S4. It should be added that various techniques, including cross-validation, regularization and hyperparameter tuning, are adopted during model training. These techniques help the model perform stably on unseen data and prevent overfitting. By continuously comparing the predicted output with the actual marking result and adjusting the internal parameters of the model, the prediction error of the model is gradually reduced, which improves its ability to recognize and mark hardware error information. After training, the model can automatically recognize and mark the key information of hardware errors in log information, providing strong support for system administrators and maintenance personnel when diagnosing and repairing hardware faults.
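The accuracy comparison in a) and b) is described as manual, but the same comparison can be quantified. The sketch below computes exact-match precision, recall and F1 over marked spans; the metric choice, the function name `span_scores` and the sample spans are assumptions for illustration, not something the text prescribes.

```python
def span_scores(predicted, gold):
    """Compute exact-match precision, recall and F1 over labeled spans."""
    pred = {(s["start_offset"], s["end_offset"], s["label"]) for s in predicted}
    true = {(s["start_offset"], s["end_offset"], s["label"]) for s in gold}
    hits = len(pred & true)  # spans whose offsets and label both agree
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical human marking (gold) and recognizer output (predicted).
gold = [
    {"start_offset": 10, "end_offset": 25, "label": "ERROR_TYPE"},
    {"start_offset": 29, "end_offset": 36, "label": "COMPONENT"},
]
predicted = [
    {"start_offset": 10, "end_offset": 25, "label": "ERROR_TYPE"},
    {"start_offset": 44, "end_offset": 50, "label": "COMPONENT"},
]
precision, recall, f1 = span_scores(predicted, gold)
print(precision, recall, f1)
```

A threshold on such a score is one possible way to decide between further optimization and outputting the regular expression or model for step S4.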
S4, acquiring hardware error log information in real time, automatically identifying an error format in the log information by using a regular expression, and automatically identifying a specific error type of the error format in the log information by using a NER model.
After step S4 is executed, the recognition results of the regular expression and the NER model are audited manually; when a recognition result is wrong, it is corrected manually, and the correction result and the corresponding hardware error log are then stored in the data set so that the regular expression and the NER model can be optimized. This operation completes the automated hardware error recognition process and ensures that the model keeps adapting to updates in the log data and to changes in user requirements. Continuous manual correction and model optimization can markedly improve the efficiency of hardware error recognition and marking, thereby providing users with a more accurate and efficient service.
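The two-stage identification of step S4 and the correction feedback can be outlined as below. This is a sketch under stated assumptions: `ERROR_FORMAT` is a hypothetical pattern, `stub_ner` is a keyword lookup that merely stands in for a trained NER model, and the type names are invented for the example.

```python
import re

# Hypothetical format detector: flags lines that look like hardware error reports.
ERROR_FORMAT = re.compile(r"\[Hardware Error\]|EDAC \w+:")

def stub_ner(line):
    """Stand-in for a trained NER model; a keyword lookup used only for illustration."""
    lowered = line.lower()
    if "memory" in lowered:
        return "MEMORY_ERROR"
    if "pcie" in lowered:
        return "PCIE_ERROR"
    return "UNKNOWN"

def identify(lines, ner=stub_ner):
    """Stage 1: the regex keeps error-formatted lines. Stage 2: ner names the type."""
    return [(line, ner(line)) for line in lines if ERROR_FORMAT.search(line)]

def apply_corrections(results, corrections, dataset):
    """Append reviewer-corrected (line, label) pairs to dataset for later retraining."""
    for line, label in results:
        if line in corrections and corrections[line] != label:
            dataset.append((line, corrections[line]))
    return dataset

stream = [
    "EDAC MC0: 1 CE memory read error",
    "systemd[1]: Started Session 12",
    "mce: [Hardware Error]: CPU 3 bank 5 status 0xbe00",
]
results = identify(stream)
# A reviewer corrects the line the stand-in model could not classify.
dataset = apply_corrections(results, {stream[2]: "CPU_ERROR"}, [])
print(results)
print(dataset)
```

Only corrected results are stored, so the data set grows with exactly the cases the current regular expression and model get wrong.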
Embodiment two: referring to fig. 2, this embodiment proposes a Linux operating system hardware error identification system, which includes:
The information acquisition module is used for acquiring hardware error log information and storing the hardware error log information into a data set;
The information processing module is used for formatting and standardizing the hardware error log information of the data set to obtain the TXTLINE-format data set, wherein each row represents an independent hardware error event;
The information marking module is used for assisting personnel to manually or automatically mark key information of a hardware error log in the data set, wherein the key information of the hardware error log comprises an error type, a hardware component, a processing action taken by a Linux operating system and an error code;
the definition module is used for defining a regular expression according to the marking information of the information marking module;
the training module is used for training the NER model according to the manually marked data set;
The error identification module is used for automatically identifying the error format in the log information by utilizing the regular expression aiming at the hardware error log information acquired in real time, and automatically identifying the specific error type of the error format in the log information by utilizing the NER model.
In this embodiment, (1) a memory error injection module is used to simulate various hardware error scenarios and generate hardware error log information. It should be added that the memory error injection module, namely the EINJ module, is mainly used in development and testing environments. Through the EINJ module, developers and testers can manually inject errors into the system to simulate the occurrence of hardware errors. This enables developers to verify how the system responds to various hardware errors, including the validity of the error detection, reporting, and recovery mechanisms. EINJ relies on a specific ACPI table, called the EINJ table (Error Injection Table). The table defines the parameters and methods of hardware error injection, including the type of error to inject, the target, and the trigger mechanism. By parsing the EINJ table and providing a corresponding interface, the Linux kernel allows applications or tools in user space to trigger error injection. EINJ is an important component of the APEI (ACPI Platform Error Interface) specification and provides a mechanism for simulating hardware errors, thereby helping developers and testers verify and improve the logic with which the Linux system handles hardware errors. The memory error injection module can simulate not only common hardware fault types but also rare or complex fault conditions, so as to ensure the breadth and depth of the training data; the hardware error injection supports CPU, memory and PCI errors;
(2) Hardware error log information is extracted from the operating system kernel log and the BMC maintenance log by using a method for extracting readable character strings from an executable file. By way of supplement: extracting readable strings from an executable file generally involves parsing the internal structure of the file, and typically includes: ① Using a tool: for an executable file of the Windows platform (PE format), the pefile library for Python can be used to extract string resources; pefile is an open-source tool that helps to read and parse the structure of a PE file and extract the string resources in it; ② Writing a script: a script can be written to open the executable file and read its contents in binary mode, and then use a string extraction function or a regular expression to find and extract readable strings, which usually requires some knowledge of the file format in order to locate and extract the string data correctly; ③ Using file operations: when processing a text file, a file operation function (such as fgets()) can be used to read the file content line by line and store it as character strings; this method is suitable for files with simple text content and involves no complex format parsing; ④ Converting data types: if a string that needs to be converted to a particular data type (e.g., a number) is encountered during extraction, this can be done with a conversion function provided by the programming language; in Python, for example, a string can be converted to the corresponding numeric type with functions such as int() and float(); ⑤ Third-party tools: in addition to programming, third-party tools can be used to extract strings from executable files; such tools typically have a graphical user interface and provide one-click extraction, which suits users without a programming background.
Note that file encoding must be taken into account when extracting character strings, since different encodings may prevent the extracted strings from being displayed correctly;
(3) Server hardware error log information of the Linux operating system is collected by using third-party software. Common methods are: ① Using the rsyslog service: rsyslog is a powerful logging service in Linux systems that can be configured to send system logs to a remote log system, such as Graylog, via the syslog protocol. The collection port and destination address of the logs can be specified by editing the rsyslog configuration file, e.g., configuring rsyslog to use UDP port 1515 to collect logs and send them to the Graylog server; ② Using the logrotate tool: logrotate is a log management tool that manages the size and number of log files; by configuring logrotate, a rotation policy for log files can be set, e.g., rotating by time or by size and specifying the number of log files to retain, which helps to ensure that log files do not occupy too much disk space and that old log files are cleaned up periodically; ③ Collecting application logs: in addition to system logs, applications may generate their own log files recording program running states, error information and so on; the location of these log files depends on the configuration of the application, and the collection of application logs is configured as needed, for example by using a log collection tool such as Filebeat to monitor the logs and forward them to a centralized log processing system. In addition to the above methods, specialized log analysis tools such as the Elastic Stack (formerly known as the ELK Stack, comprising Elasticsearch, Logstash and Kibana) or Splunk, which provide powerful log collection, storage, search and analysis functions, may also be considered;
the information acquisition module acquires the hardware error log information and stores the hardware error log information in a data set.
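By way of illustration only, the script-based approach mentioned under item (2), reading content in binary mode and using a regular expression to find readable strings, might be sketched as follows. The function name `extract_strings` and the minimum length of four characters are assumptions of this sketch, which handles single-byte printable ASCII only.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII characters found in data."""
    # 0x20..0x7e is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def extract_strings_from_file(path: str, min_len: int = 4) -> List[str]:
    """Read a file in binary mode and extract its readable strings."""
    with open(path, "rb") as fh:
        return extract_strings(fh.read(), min_len)
```

As the encoding note above indicates, strings stored in other encodings (for example UTF-16) would not be found by this ASCII-only pattern and would need a separate pass.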
In this embodiment, the information processing module is used to obtain the TXTLINE-format data set, and the operations to be performed include: ① Log parsing: parsing the hardware error log information acquired in step S1 with a log parsing tool (such as the grok plugin of Logstash) and extracting key information, for example by defining specific patterns to match key fields in the hardware error log, such as the timestamp, error code and hardware component name; ② Formatting: formatting the parsed fields as needed, such as converting the timestamp format and unifying the representation of error codes, which can be implemented through the configuration files of the log processing tool to ensure that every log entry has a consistent format; ③ Standardized storage: storing the parsed and formatted log data in a suitable data structure, such as a database table or a data frame (DataFrame), during which the data may be further cleaned to remove invalid or incomplete records; ④ Output in TXTLINE format: finally, converting the processed data into the TXTLINE format, ensuring that each line is a separate hardware error event, which can be accomplished by writing a script or using a data processing tool (such as the pandas library for Python); ⑤ Verification and testing: after the above steps are completed, verifying and testing the output data set to ensure its accuracy and integrity, which includes checking the consistency of the data, the completeness of the error events, and whether the predetermined format requirements are met; ⑥ Documentation: recording the whole process and the configuration files used, for future auditing and reproduction. In general, through the above operations, the hardware error log information acquired in step S1 can be converted into a formatted and standardized TXTLINE-format data set, which facilitates subsequent data analysis and problem diagnosis.
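By way of illustration only, operations ① to ④ above might be sketched with a short script. The raw log layout, the tab-separated field order and the names `RAW`, `normalize` and `to_txtline` are assumptions of this sketch; the embodiment only requires that each output line be a separate hardware error event.

```python
import re
from datetime import datetime
from typing import List, Optional

# Hypothetical raw layout: "2024/06/05 10:15:00 memory error code=0x1A <message>"
RAW = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<component>\w+)\s+error\s+code[=:]\s*"
    r"(?P<code>(?:0x)?[0-9a-fA-F]+)\s*(?P<msg>.*)$"
)


def normalize(line: str) -> Optional[str]:
    """Parse one raw line and return one formatted line, or None if unparseable."""
    m = RAW.match(line.strip())
    if m is None:
        return None  # invalid or incomplete record, dropped during cleaning
    # Convert the timestamp to ISO 8601 and the error code to 4-digit lowercase hex.
    ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S").isoformat()
    code = "0x{:04x}".format(int(m["code"], 16))
    msg = " ".join(m["msg"].split())  # collapse runs of whitespace
    return "\t".join([ts, m["component"].upper(), code, msg])


def to_txtline(raw_lines: List[str]) -> List[str]:
    """Keep only parseable events, one event per output line."""
    return [out for out in map(normalize, raw_lines) if out is not None]
```

Dropping unparseable lines silently is a simplification; in practice they might be counted or written to a reject file so that the verification in operation ⑤ can report them.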
In this embodiment, the information marking module provides a visual interface that assists personnel in manually marking the key information of the hardware error logs in the data set; alternatively, a predefined automatic marking mapping template is built into the information marking module, and the key information of the hardware error logs in the data set is marked automatically by means of this template.
The predefined automatic marking mapping template is as follows:
[
  {% for entity in input %}
  {
    "start_offset": {{ entity.start_offset }},
    "end_offset": {{ entity.end_offset }},
    "label": "{{ entity.label }}"
  }{% if not loop.last %},{% endif %}
  {% endfor %}
]
Once the automatic marking service has been implemented, it returns its result in the following format:
[{
'start_offset': "",
'end_offset': "",
'label': ""
}]
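By way of illustration only, an automatic marking service that returns entities in the above format might be sketched as follows. The label names and patterns in `LABEL_PATTERNS` are assumptions of this sketch, and the offsets are emitted as integer character positions, which is also an assumption, since the format above shows only empty placeholders.

```python
import json
import re
from typing import Dict, List

# Hypothetical labels and patterns; the actual label set is not given in the text.
LABEL_PATTERNS = {
    "COMPONENT": re.compile(r"\b(?:CPU|DIMM_\w+|PCIe)\b"),
    "ERROR_CODE": re.compile(r"\b0x[0-9a-fA-F]+\b"),
}


def auto_label(text: str) -> List[Dict[str, object]]:
    """Return entities as start_offset / end_offset / label dictionaries."""
    entities = []
    for label, pattern in LABEL_PATTERNS.items():
        for m in pattern.finditer(text):
            entities.append(
                {"start_offset": m.start(), "end_offset": m.end(), "label": label}
            )
    # Sort by position so that the output order follows the text.
    return sorted(entities, key=lambda e: e["start_offset"])


def auto_label_json(text: str) -> str:
    """Serialize the entities as a JSON array, the shape the template renders."""
    return json.dumps(auto_label(text))
```

Here `end_offset` is exclusive, so `text[start_offset:end_offset]` yields the marked substring.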
The hardware error identification system further comprises a dividing module for dividing the data set into a training data set and a test data set; the division may be performed in a ratio of 7:3 or 8:2.
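By way of illustration only, the division into a training data set and a test data set might be sketched as follows. Shuffling before splitting and the fixed random seed are assumptions of this sketch; the embodiment specifies only the ratio.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(records: Sequence[str],
                  train_ratio: float = 0.8,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the records reproducibly and split them into (train, test)."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

A ratio of 7:3 corresponds to `train_ratio=0.7` and a ratio of 8:2 to `train_ratio=0.8`.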
The definition module defines a regular expression according to the marking result of the training data set and invokes log analysis software that supports regular expressions to scan and analyze the test data set; when content matching the regular expression appears in the test data set, the log analysis software automatically marks that content as key information of a hardware error log; personnel compare and confirm the accuracy of the identification result through the visual interface provided by the information marking module and, according to the comparison result, either manually optimize the regular expression or output it for subsequent use;
The training module trains a pre-trained model with the marked training data set to obtain the NER model and then inputs the unmarked test data set into the trained NER model; personnel compare and confirm the accuracy of the NER model's identification result through the visual interface provided by the information marking module and, according to the comparison result, either optimize the NER model or output it for subsequent use.
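The embodiment describes the comparison of identification results as a manual step carried out through the visual interface. By way of illustration only, a numerical aid for that comparison might be sketched as follows, scoring the spans produced by the regular expression or the NER model against the manually marked spans. Exact-match scoring and the names `to_spans` and `score` are assumptions of this sketch.

```python
from typing import Dict, Iterable, Set, Tuple

Span = Tuple[int, int, str]  # (start_offset, end_offset, label)


def to_spans(entities: Iterable[Dict[str, object]]) -> Set[Span]:
    """Convert marking dictionaries into a set of comparable spans."""
    return {
        (int(e["start_offset"]), int(e["end_offset"]), str(e["label"]))
        for e in entities
    }


def score(gold: Iterable[Dict[str, object]],
          predicted: Iterable[Dict[str, object]]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 of predicted spans against gold spans."""
    g, p = to_spans(gold), to_spans(predicted)
    true_positives = len(g & p)
    precision = true_positives / len(p) if p else 0.0
    recall = true_positives / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A low precision would suggest that the regular expression or model marks too much, and a low recall that it misses marked content, which may help decide whether to optimize it or output it for subsequent use.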
In this embodiment, for the automatic identification result of the error identification module, personnel check the accuracy of the identification result through the visual interface provided by the information marking module and correct it manually when it is wrong; the correction result and the corresponding hardware error log information are then stored in the data set in order to optimize the regular expression and the NER model. This operation completes the overall automatic hardware error identification process and ensures that the model continuously adapts to updates of the log data and to changes in user requirements; the continuous process of manual correction and model optimization can significantly improve the efficiency of hardware error identification and marking, thereby providing users with a more accurate and efficient service.
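By way of illustration only, storing a manual correction together with its hardware error log line might be sketched as follows. The JSON Lines file layout and the field names `text`, `label` and `source` are assumptions of this sketch; the embodiment does not specify how the data set is stored.

```python
import json
from pathlib import Path
from typing import Dict, List


def record_correction(dataset_path: Path,
                      log_line: str,
                      corrected: List[Dict[str, object]]) -> None:
    """Append one corrected example (log line plus corrected marks) to the data set."""
    record = {"text": log_line, "label": corrected, "source": "manual_correction"}
    with dataset_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_dataset(dataset_path: Path) -> List[Dict[str, object]]:
    """Read all examples back, e.g. before refining the regular expression or retraining."""
    with dataset_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending rather than rewriting keeps earlier examples intact, so each round of optimization sees both the original marks and the accumulated corrections.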
In summary, by adopting the method and the system for identifying the hardware errors of the Linux operating system, the hardware errors can be quickly identified and marked, so that operation and maintenance personnel can quickly solve the hardware faults, the stability and the reliability of the server are improved, and the economic loss and the maintenance cost caused by the hardware faults are reduced.
The foregoing has outlined rather broadly the principles and embodiments of the present invention in order that the detailed description of the invention may be better understood. Based on the above-mentioned embodiments of the present invention, any improvements and modifications made by those skilled in the art without departing from the principles of the present invention should fall within the scope of the present invention.

Claims (8)

1. The method for identifying the hardware errors of the Linux operating system is characterized by comprising the following steps:
S1, acquiring hardware error log information and storing the hardware error log information into a data set;
S2, formatting and standardizing the hardware error log information of the data set to obtain a TXTLINE-format data set, wherein each row represents an independent hardware error event;
S3, manually or automatically marking the key information of the hardware error logs in the data set, and defining a regular expression and training an NER model based on the marking information, wherein the process specifically comprises: manually marking the key information of the hardware error logs in the data set through a visual interface, or automatically marking the key information of the hardware error logs in the data set by using a predefined automatic marking mapping template and manually confirming the accuracy of the automatic marking result, and then dividing the marked data set into a training data set and a test data set; based on the marking information: a) defining a regular expression according to the marking result of the training data set, scanning and analyzing the test data set with log analysis software that supports regular expressions, the log analysis software automatically marking content as key information of a hardware error log when content matching the regular expression appears in the test data set, manually comparing and confirming the accuracy of the identification result, and, according to the comparison result, manually optimizing the regular expression or outputting the regular expression and executing step S4; b) training a pre-trained model with the marked training data set to obtain an NER model, testing the trained NER model with the unmarked test data set, manually comparing and confirming the accuracy of the identification result, and, according to the comparison result, optimizing the NER model or outputting the NER model and executing step S4;
S4, acquiring hardware error log information in real time, automatically identifying the error format in the log information by using the regular expression, and automatically identifying the specific error type of the error format in the log information by using the NER model.
2. The method for identifying hardware errors of a Linux operating system according to claim 1, wherein a memory error injection module is utilized to simulate various hardware error scenarios to generate hardware error log information;
Extracting hardware error log information from an operating system kernel log and a BMC maintenance log by using a method for extracting readable character strings from an executable file;
collecting server hardware error log information of a Linux operating system by using third-party software;
and acquiring the hardware error log information and storing the hardware error log information in a data set.
3. The method of claim 1, wherein the key information of the hardware error log includes an error type, a hardware component, a processing action taken by the Linux operating system, and an error code.
4. The method for identifying hardware errors of a Linux operating system according to claim 1, wherein after step S4 is executed, the identification results of the regular expression and the NER model are manually checked; when an identification result is wrong, it is manually corrected, and the correction result and the corresponding hardware error log are then stored in the data set to optimize the regular expression and the NER model.
5. A Linux operating system hardware error identification system, comprising:
The information acquisition module is used for acquiring hardware error log information and storing the hardware error log information into a data set;
The information processing module is used for formatting and standardizing the hardware error log information of the data set to obtain the TXTLINE-format data set, wherein each row represents an independent hardware error event;
The dividing module is used for dividing the data set subjected to formatting and standardization into a training data set and a test data set;
The information marking module is used for providing a visual interface that assists personnel in manually marking the key information of the hardware error logs in the data set, or has a predefined automatic marking mapping template built in, by means of which the key information of the hardware error logs in the data set is marked automatically;
the definition module is used for defining a regular expression according to the marking result of the training data set and calling log analysis software that supports regular expressions to scan and analyze the test data set, wherein, when content matching the regular expression appears in the test data set, the log analysis software automatically marks that content as key information of a hardware error log, personnel compare and confirm the accuracy of the identification result through the visual interface provided by the information marking module, and the regular expression is manually optimized or output for subsequent use according to the comparison result;
The training module is used for training a pre-trained model with the marked training data set to obtain an NER model and inputting the unmarked test data set into the trained NER model, wherein personnel compare and confirm the accuracy of the NER model's identification result through the visual interface provided by the information marking module, and the NER model is optimized or output for subsequent use according to the comparison result;
The error identification module is used for automatically identifying, for hardware error log information acquired in real time, the error format in the log information by means of the regular expression, and for automatically identifying the specific error type of that error format by means of the NER model.
6. The Linux operating system hardware error identification system according to claim 5, wherein a memory error injection module is used to simulate various hardware error scenarios to generate hardware error log information;
Extracting hardware error log information from an operating system kernel log and a BMC maintenance log by using a method for extracting readable character strings from an executable file;
collecting server hardware error log information of a Linux operating system by using third-party software;
The information acquisition module acquires the hardware error log information and stores the hardware error log information in a data set.
7. The Linux operating system hardware error identification system of claim 5, wherein the key information of the hardware error log includes an error type, a hardware component, a processing action taken by the Linux operating system, and an error code.
8. The Linux operating system hardware error identification system according to claim 5, wherein, for the automatic identification result of the error identification module, personnel check the accuracy of the identification result through the visual interface provided by the information marking module and manually correct it when it is wrong, and the correction result and the corresponding hardware error log information are then stored in the data set to optimize the regular expression and the NER model.
CN202410718889.0A 2024-06-05 2024-06-05 Linux operating system hardware error identification method and system Active CN118295864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410718889.0A CN118295864B (en) 2024-06-05 2024-06-05 Linux operating system hardware error identification method and system

Publications (2)

Publication Number Publication Date
CN118295864A CN118295864A (en) 2024-07-05
CN118295864B true CN118295864B (en) 2024-08-13

Family

ID=91688316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410718889.0A Active CN118295864B (en) 2024-06-05 2024-06-05 Linux operating system hardware error identification method and system

Country Status (1)

Country Link
CN (1) CN118295864B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113986864A (en) * 2021-11-11 2022-01-28 建信金融科技有限责任公司 Log data processing method and device, electronic equipment and storage medium
CN115169490A (en) * 2022-07-25 2022-10-11 济南浪潮数据技术有限公司 Log classification method, device and equipment and computer readable storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IN2013MU02794A (en) * 2013-08-27 2015-07-03 Tata Consultancy Services Ltd
US9928155B2 (en) * 2015-11-18 2018-03-27 Nec Corporation Automated anomaly detection service on heterogeneous log streams
CN106844145A (en) * 2016-12-29 2017-06-13 北京奇虎科技有限公司 A kind of server hardware fault early warning method and device
CN112068981B (en) * 2020-09-24 2022-06-21 中国人民解放军国防科技大学 Knowledge base-based fault scanning recovery method and system in Linux operating system
US12026046B2 (en) * 2022-03-07 2024-07-02 Adobe Inc. Error log anomaly detection
CN117707902A (en) * 2023-11-28 2024-03-15 杭州安恒信息技术股份有限公司 Automatic log analysis method, system, electronic device and storage medium based on machine learning
CN117669484A (en) * 2023-12-07 2024-03-08 南方电网大数据服务有限公司 Chip simulation log checking method, device and readable medium
CN117743092A (en) * 2023-12-19 2024-03-22 上海东普信息科技有限公司 Log data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN118295864A (en) 2024-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant