CN103970661A - Method for batched server memory fault detection through IPMI tool - Google Patents
- Publication number: CN103970661A
- Application number: CN201410211110.2A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a method for batch detection of server memory faults using the IPMI tool, in the field of fault detection. The ipmi tool scans the BMC logs of all servers on the network, and machines with memory problems are identified from the results. A script checks the servers on the network in batches, so that machines reporting memory ECC errors are confirmed quickly and the memory of many machines is checked in one pass. This reduces test time and improves work efficiency.
Description
Technical field
The invention relates to a method for detecting memory problems in batches on batch-deployed servers whose BMC can record memory faults, and specifically to a method for batch server memory fault detection using the IPMI tool.
Background art
In computers, the Machine Check Architecture (MCA) is a mechanism by which the CPU reports hardware errors to the operating system; it is a RAS feature of the CPU. For example, when an ECC error such as a memory error occurs, the model-specific registers (MSRs) in the CPU detect the error and trigger the MCA mechanism. A system interrupt is then generated, the MSRs record the state information at that moment, and this information is handed to the BMC chip to be logged. The BMC chip integrated on current motherboards can therefore record memory runtime errors, ECC errors in particular. The BMC has its own network configuration and can be given an independent IP address, and the BMC IP addresses of all machines can be placed on the same network segment for centralized management.
Many Internet companies now purchase servers in bulk. As remote management technology has matured, server management no longer depends on local access in the machine room where the servers sit; it is done remotely over the network. When a server develops a memory fault such as an ECC error, the problem cannot be found in time unless the BMC is checked, and this may affect the stability of the server later on. The BMC logs of all servers therefore need to be checked regularly. For machines deployed in batches, however, testing them one at a time takes too long and is too inefficient.
Summary of the invention
The invention checks and collects the ipmi interface data of every server in batches, gathers all the collected information in one place, and filters out the problematic machines so that faults can be serviced promptly.
A method for batch server memory fault detection using the IPMI tool: the ipmi tool scans the BMC logs of all servers on the network, machines with memory problems are identified from the results, and a script checks the servers on the network in batches so that machines reporting memory ECC errors are confirmed quickly. The steps are:
1) Find a machine running Windows, configure its IP address, connect it to the network, and make sure it can reach the customer's server management network.
2) Modify the default script to match the actual network environment.
3) Run the script on the Windows machine together with the tool files ipmitool.exe and libeay32.dll. The final result is written to result.txt in the current directory.
4) Carry out memory fault handling on the machines found to have problems.
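The filtering idea in this method (keep only the log lines that mention ECC, ignoring case) can be sketched in Python. This is an illustrative sketch and not part of the patented script: the function name `filter_ecc_lines` is hypothetical, and the second sample line is a made-up placeholder standing in for any unrelated log entry.

```python
from typing import List


def filter_ecc_lines(sel_output: str) -> List[str]:
    """Return the log lines that contain "ecc", compared case-insensitively."""
    return [
        line.strip()
        for line in sel_output.splitlines()
        if "ecc" in line.lower()
    ]


# Lines 1 and 3 follow the sample output shown in this document;
# line 2 is a hypothetical placeholder for a non-memory event.
sample = """\
1 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
2 | (placeholder line for an unrelated event)
3 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
"""

for line in filter_ecc_lines(sample):
    print(line)
```

A server whose log yields an empty list here would be treated as having no recorded ECC errors.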
The default implementation script sel.bat is as follows:
@echo off
rem Scan BMC addresses 10.7.12.82 through 10.7.12.90 and append ECC entries to result.txt
for /L %%i in (82,1,90) do (
    echo ##############################################################################################>> result.txt
    echo 10.7.12.%%i >>result.txt
    rem Keep only the SEL lines that contain "ecc" (case-insensitive)
    ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
    echo **********************************************************************************************>> result.txt
)
The beneficial effects of the invention are:
1. Automatic batch checking, which improves efficiency.
2. The script can be customized to suit different network configurations.
3. The implementation is simple and easy to operate.
Detailed description
Implementation process:
The default implementation script sel.bat is as follows:
@echo off
rem Scan BMC addresses 10.7.12.82 through 10.7.12.90 and append ECC entries to result.txt
for /L %%i in (82,1,90) do (
    echo ##############################################################################################>> result.txt
    echo 10.7.12.%%i >>result.txt
    rem Keep only the SEL lines that contain "ecc" (case-insensitive)
    ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
    echo **********************************************************************************************>> result.txt
)
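To make the role of the address range in this loop easier to see, the following Python sketch builds the same ipmitool command line for each host in an inclusive range. It is a hedged illustration only: the helper name `build_commands` and its parameters are hypothetical, and the only ipmitool options used are the ones that appear in sel.bat (`-H`, `-U`, `-P`, `sel list`).

```python
from typing import List


def build_commands(prefix: str, first: int, last: int,
                   user: str = "admin",
                   password: str = "admin") -> List[List[str]]:
    """Build one ipmitool argument list per address prefix.first .. prefix.last."""
    commands = []
    # The range is inclusive at both ends, like "for /L %%i in (first,1,last)".
    for host in range(first, last + 1):
        address = f"{prefix}.{host}"
        commands.append(
            ["ipmitool.exe", "-H", address, "-U", user, "-P", password,
             "sel", "list"]
        )
    return commands


# Same range as the default script: 10.7.12.82 through 10.7.12.90.
for cmd in build_commands("10.7.12", 82, 90):
    print(" ".join(cmd))
```

Each argument list could be handed to `subprocess.run` on a machine where ipmitool.exe is available; the sketch only prints the commands, so it makes no network calls.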
1. Find a machine running Windows, configure its IP address, connect it to the network, and make sure it can reach the customer's server management network.
2. Modify the default script to match the actual network environment.
For example, if the on-site address range is 192.168.1.1-192.168.1.200, modify sel.bat accordingly:
Change
for /L %%i in (82,1,90) do (
to
for /L %%i in (1,1,200) do (
Change
echo 10.7.12.%%i >>result.txt
to
echo 192.168.1.%%i >>result.txt
Change
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
to
ipmitool.exe -H 192.168.1.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
3. Run the script on the Windows machine together with the tool files ipmitool.exe and libeay32.dll. The final result is written to result.txt in the current directory in the format below. In this example the server at 10.7.12.82 has ECC errors; the servers whose blocks are empty have none:
##############################################################################################
10.7.12.82
1 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
3 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
**********************************************************************************************
##############################################################################################
10.7.12.83
**********************************************************************************************
##############################################################################################
10.7.12.84
**********************************************************************************************
4. Carry out memory fault handling on the machines found to have problems.
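Once result.txt exists, picking out the machines that need attention can itself be automated. The Python sketch below assumes the block layout shown in the example (a line of `#`, the address, zero or more ECC lines, a line of `*`); the function names `parse_result` and `faulty_hosts` are hypothetical and not part of the patent.

```python
from typing import Dict, List, Optional


def parse_result(text: str) -> Dict[str, List[str]]:
    """Map each address in result.txt to the ECC lines recorded under it."""
    hosts: Dict[str, List[str]] = {}
    current: Optional[str] = None
    expect_address = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"#"}:       # block header: the address comes next
            expect_address = True
        elif set(line) == {"*"}:     # block footer: close the current block
            current = None
            expect_address = False
        elif expect_address:
            current = line
            hosts[current] = []
            expect_address = False
        elif current is not None:
            hosts[current].append(line)
    return hosts


def faulty_hosts(text: str) -> List[str]:
    """Return the addresses that have at least one ECC line."""
    return [host for host, lines in parse_result(text).items() if lines]


sample = "\n".join([
    "#" * 20,
    "10.7.12.82",
    "1 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert",
    "3 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert",
    "*" * 20,
    "#" * 20,
    "10.7.12.83",
    "*" * 20,
])
print(faulty_hosts(sample))
```

The separator length does not matter to the parser, so the shortened separators in the sample behave the same as the full-width ones written by sel.bat.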
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410211110.2A CN103970661A (en) | 2014-05-19 | 2014-05-19 | Method for batched server memory fault detection through IPMI tool |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103970661A true CN103970661A (en) | 2014-08-06 |
Family
ID=51240190
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410211110.2A Pending CN103970661A (en) | 2014-05-19 | 2014-05-19 | Method for batched server memory fault detection through IPMI tool |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103970661A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020144177A1 (en) * | 1998-12-10 | 2002-10-03 | Kondo Thomas J. | System recovery from errors for processor and associated components |
CN102799506A (en) * | 2012-06-29 | 2012-11-28 | 浪潮电子信息产业股份有限公司 | Method for positioning fault memory |
CN103473141A (en) * | 2013-09-13 | 2013-12-25 | 浪潮电子信息产业股份有限公司 | Method for out-of-band check and modification of BIOS (basic input/output system) setting options |
CN103593211A (en) * | 2013-11-01 | 2014-02-19 | 浪潮电子信息产业股份有限公司 | Method for refreshing and writing firmware programs through out-of-band isolation |
Non-Patent Citations (1)
Title |
---|
乐晨: "ipmitool对linux服务器进行IPMI管理", 《HTTP://MY.OSCHINA.NET/DAVEHE/BLOG/88801》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140806 |