CN103970661A - Method for batched server memory fault detection through IPMI tool - Google Patents


Info

Publication number: CN103970661A
Application number: CN201410211110.2A
Authority: CN (China)
Prior art keywords: result, machine, txt, memory, echo
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Other languages: Chinese (zh)
Inventors: 李双星, 任华进, 陈彬
Current Assignee: IEIT Systems Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2014-05-19 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2014-05-19
Publication date: 2014-08-06
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201410211110.2A
Publication of CN103970661A

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method for batch detection of server memory faults using the IPMI tool, in the field of fault detection. The ipmi tool scans the BMC logs of all servers on the network, and machines with memory problems are identified from the results. A script checks the servers on the network in batch so that machines reporting memory ECC errors can be confirmed quickly, which gives a batch memory check across batch-deployed machines. This shortens test time and improves working efficiency.

Description

A method for batch server memory fault detection using the IPMI tool

Technical field

The invention relates to a method for detecting memory problems in batch on batch-deployed servers whose BMC can record memory faults. Specifically, it is a method for batch server memory fault detection using the IPMI tool.

Background

In computing, the Machine Check Architecture (MCA) is a mechanism by which the CPU reports hardware errors to the operating system; it is a RAS feature of the CPU. For example, when an ECC error such as a memory error occurs, the model-specific registers (MSRs) in the CPU detect the error and trigger the MCA mechanism. A system interrupt is then raised, the MSRs record the state information at that moment, and that information is handed to the BMC chip for logging. A BMC chip integrated on the motherboard can therefore record memory runtime errors, ECC errors in particular. The BMC has its own network configuration and can be given an independent IP address, and the BMC IP addresses of all machines can be placed on the same network segment for centralized management.

Many Internet customers now purchase servers in bulk, and as remote management technology has matured, servers are managed remotely over the network rather than locally in the machine room where they sit. When a server develops a memory fault such as an ECC error, the problem cannot be found promptly unless the BMC is checked, which may affect the later stability of the server. The BMC logs of all servers therefore need to be checked regularly, but for machines deployed in batch, testing them one at a time takes too long and is too inefficient.

Summary of the invention

The invention checks and collects the ipmi interface data of every server in batch, gathers all of the collected information in one place, filters out the problem machines, and allows fault maintenance to be carried out promptly.

A method for batch server memory fault detection using the IPMI tool: the ipmi tool scans the BMC logs of all servers on the network, machines with memory problems are identified from the results, a script checks the servers on the network in batch, and machines reporting memory ECC errors are confirmed quickly, which gives a batch memory check across batch-deployed machines.

1) Find a machine running Windows, configure its IP address, connect it to the network, and make sure it can reach the customer's server management network.

2) Modify the default script to match the actual network environment.

3) Run the script on the Windows machine together with the tool files ipmitool.exe and libeay32.dll. The final result is written to result.txt in the current directory.

4) Carry out memory fault handling on the machines found to have problems.
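The steps above can be sketched in Python as a way to see which commands a scan would issue. This is a minimal illustration, not the patent's implementation: the function name `build_sel_commands` and its parameters are hypothetical, and the only ipmitool options used are the ones that appear in sel.bat (`-H`, `-U`, `-P`, `sel list`). The function only builds the argument lists; it does not contact any BMC.

```python
def build_sel_commands(prefix, first, last, user, password, tool="ipmitool.exe"):
    """Return one ipmitool argument list per BMC address in the range.

    prefix   -- first three octets of the BMC network, e.g. "10.7.12"
    first    -- first value of the last octet (inclusive)
    last     -- last value of the last octet (inclusive)
    All names here are illustrative assumptions, not part of the patent.
    """
    commands = []
    for last_octet in range(first, last + 1):
        host = f"{prefix}.{last_octet}"
        # Same options as the sel.bat script: host, user, password, "sel list".
        commands.append([tool, "-H", host, "-U", user, "-P", password, "sel", "list"])
    return commands


if __name__ == "__main__":
    # Print the commands for the default range used in sel.bat.
    for argv in build_sel_commands("10.7.12", 82, 90, "admin", "admin"):
        print(" ".join(argv))
```

Each argument list could then be passed to `subprocess.run` on a machine that has ipmitool installed and can reach the management network.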

The default implementation script, sel.bat, is as follows:

@echo off
rem Scan the BMCs at 10.7.12.82 through 10.7.12.90 and append each one's ECC entries to result.txt.
for /L %%i in (82,1,90) do (
    echo ##############################################################################################>> result.txt
    echo 10.7.12.%%i >>result.txt
    rem Keep only the system event log lines that mention ECC, ignoring case.
    ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
    echo **********************************************************************************************>> result.txt
)
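The `find /i "ecc"` stage of the script keeps only the log lines that contain "ecc" in any letter case. A minimal Python sketch of the same filter is given below; the function name `filter_ecc_lines` is hypothetical, and the second sample line is invented purely to show a line being dropped.

```python
def filter_ecc_lines(sel_output):
    """Return the lines of an ipmitool 'sel list' dump that mention ECC.

    Mirrors the case-insensitive substring match done by `find /i "ecc"`.
    """
    return [line for line in sel_output.splitlines() if "ecc" in line.lower()]


if __name__ == "__main__":
    sample = (
        "1 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert\n"
        # Hypothetical non-memory entry, included only to show filtering.
        "2 | Pre-Init Time-stamp | Some other sensor | Some other event\n"
    )
    for line in filter_ecc_lines(sample):
        print(line)
```

Because the match is a plain substring test, it would also keep any unrelated log line that happens to contain the letters "ecc", which is the same behaviour as the batch script.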

The beneficial effects of the invention are:

1. Checks run automatically and in batch, which improves efficiency.

2. The script can be customized to suit different network configurations.

3. The implementation is simple and easy to operate.

Detailed description

Implementation process:

The default implementation script, sel.bat, is as follows:

@echo off
rem Scan the BMCs at 10.7.12.82 through 10.7.12.90 and append each one's ECC entries to result.txt.
for /L %%i in (82,1,90) do (
    echo ##############################################################################################>> result.txt
    echo 10.7.12.%%i >>result.txt
    rem Keep only the system event log lines that mention ECC, ignoring case.
    ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
    echo **********************************************************************************************>> result.txt
)

1. Find a machine running Windows, configure its IP address, connect it to the network, and make sure it can reach the customer's server management network.

2. Modify the default script to match the actual network environment.

For example, if the on-site address range is 192.168.1.1-192.168.1.200, make the following changes in sel.bat:

Change
for /L %%i in (82,1,90) do (
to
for /L %%i in (1,1,200) do (

Change
echo 10.7.12.%%i >>result.txt
to
echo 192.168.1.%%i >>result.txt

Change
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
to
ipmitool.exe -H 192.168.1.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt

3. Run the script on the Windows machine together with the tool files ipmitool.exe and libeay32.dll. The final result is written to result.txt in the current directory in the format below. In this example the server at 10.7.12.82 has ECC errors; the other entries are empty, which means no ECC errors were logged for those servers:

##############################################################################################
10.7.12.82
1 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
3 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
**********************************************************************************************
##############################################################################################
10.7.12.83
**********************************************************************************************
##############################################################################################
10.7.12.84
**********************************************************************************************

4. Carry out memory fault handling on the machines found to have problems.
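On a large network result.txt can run to hundreds of blocks, so picking out the affected servers by eye becomes tedious. The following Python sketch parses the file format shown in step 3 and lists the addresses that have at least one ECC entry. It is an illustration under stated assumptions, not part of the patent: the names `parse_result` and `hosts_with_ecc` are hypothetical, and the parser assumes each block is a line of `#`, then the address, then zero or more log lines, then a line of `*`.

```python
def parse_result(text):
    """Map each BMC address in a result.txt dump to its list of ECC log lines."""
    hosts = {}
    current = None
    expecting_host = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"#"}:
            # A row of '#' opens a block; the next non-empty line is the address.
            expecting_host = True
            current = None
        elif set(line) == {"*"}:
            # A row of '*' closes the current block.
            current = None
            expecting_host = False
        elif expecting_host:
            current = line
            hosts[current] = []
            expecting_host = False
        elif current is not None:
            hosts[current].append(line)
    return hosts


def hosts_with_ecc(parsed):
    """Return the addresses whose block contains at least one log line."""
    return [host for host, lines in parsed.items() if lines]


if __name__ == "__main__":
    # Assumes result.txt is in the current directory, as written by sel.bat.
    with open("result.txt", encoding="utf-8", errors="replace") as handle:
        for host in hosts_with_ecc(parse_result(handle.read())):
            print(host)
```

An empty block is treated as healthy here, matching the patent's reading of the sample output. A server whose BMC could not be reached would also produce an empty block, so an empty entry does not by itself prove that the scan succeeded.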

Claims (3)

1. A method for batch server memory fault detection using the IPMI tool, characterized in that the ipmi tool scans the BMC logs of all servers on the network, machines with memory problems are identified from the results, a script checks the servers on the network in batch, and machines reporting memory ECC errors are confirmed quickly, thereby realizing a batch memory check across batch-deployed machines.
2. The method according to claim 1, characterized by the following steps:
1) find a machine running Windows, configure its IP address, connect it to the network, and make sure it can reach the customer's server management network;
2) modify the default script to match the actual network environment;
3) run the script on the Windows machine together with the tool files ipmitool.exe and libeay32.dll, the final result being written to result.txt in the current directory;
4) carry out memory fault handling on the machines found to have problems.
3. The method according to claim 1, characterized in that the default implementation script sel.bat is as follows:
@echo off
for /L %%i in (82,1,90) do (
@echo ##############################################################################################>> result.txt
echo 10.7.12.%%i >>result.txt
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
@echo **********************************************************************************************>> result.txt
)
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201410211110.2A (CN103970661A) | 2014-05-19 | 2014-05-19 | Method for batched server memory fault detection through IPMI tool

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201410211110.2A (CN103970661A) | 2014-05-19 | 2014-05-19 | Method for batched server memory fault detection through IPMI tool

Publications (1)

Publication Number | Publication Date
CN103970661A | 2014-08-06

Family

ID=51240190

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201410211110.2A (CN103970661A, Pending) | Method for batched server memory fault detection through IPMI tool | 2014-05-19 | 2014-05-19

Country Status (1)

Country | Link
CN (1) | CN103970661A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268045A (en) * 2014-09-29 2015-01-07 浪潮电子信息产业股份有限公司 Testing method for startup and shutdown in remote control system
CN104333617A (en) * 2014-11-18 2015-02-04 浪潮电子信息产业股份有限公司 Method for automatically setting static state IP for rack cabinet in Linux system
CN104360922A (en) * 2014-10-20 2015-02-18 浪潮电子信息产业股份有限公司 Method for automatically monitoring BMC working state based on ipmitool
CN104714863A (en) * 2015-02-06 2015-06-17 浪潮电子信息产业股份有限公司 Method for completely storing Raid card logs on basis of Linux operation system after system crashes
CN105045689A (en) * 2015-06-25 2015-11-11 浪潮电子信息产业股份有限公司 Method for monitoring and alarming hard disks by using RAID card batch detection
CN106126368A (en) * 2016-08-22 2016-11-16 浪潮电子信息产业股份有限公司 Method for analyzing memory fault address under LINUX
CN106484639A (en) * 2016-10-10 2017-03-08 郑州云海信息技术有限公司 A kind of method that CPU register information is obtained by ipmi agreement
CN106603343A (en) * 2017-01-11 2017-04-26 郑州云海信息技术有限公司 A method for testing stability of servers in batch
CN106991026A (en) * 2017-04-28 2017-07-28 郑州云海信息技术有限公司 It is a kind of to pass through the method that network carries out server memory Rank margin test in batches
CN106997323A (en) * 2017-04-05 2017-08-01 广东浪潮大数据研究有限公司 A kind of recording method of server B MC problem repetition steps
CN107092549A (en) * 2017-04-26 2017-08-25 郑州云海信息技术有限公司 A kind of automatic monitoring and the instrument and method for parsing memory failure
CN107463455A (en) * 2017-08-01 2017-12-12 联想(北京)有限公司 A kind of method and device for detecting memory failure
CN108763005A (en) * 2018-05-30 2018-11-06 郑州云海信息技术有限公司 A kind of memory ECC failures error-reporting method and system
CN109032807A (en) * 2018-08-08 2018-12-18 郑州云海信息技术有限公司 A kind of batch monitors the method and system of internal storage state and limitation power consumption of internal memory
CN110032486A (en) * 2019-03-06 2019-07-19 平安科技(深圳)有限公司 Server test method, device, computer equipment and storage medium
CN114968065A (en) * 2021-02-19 2022-08-30 北京神州数码云科信息技术有限公司 An optimization application based on ipmitool tool in reading and writing FRU

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020144177A1 (en) * 1998-12-10 2002-10-03 Kondo Thomas J. System recovery from errors for processor and associated components
CN102799506A (en) * 2012-06-29 2012-11-28 浪潮电子信息产业股份有限公司 Method for positioning fault memory
CN103473141A (en) * 2013-09-13 2013-12-25 浪潮电子信息产业股份有限公司 Method for out-of-band check and modification of BIOS (basic input/output system) setting options
CN103593211A (en) * 2013-11-01 2014-02-19 浪潮电子信息产业股份有限公司 Method for refreshing and writing firmware programs through out-of-band isolation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
乐晨: "ipmitool对linux服务器进行IPMI管理", 《HTTP://MY.OSCHINA.NET/DAVEHE/BLOG/88801》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268045A (en) * 2014-09-29 2015-01-07 浪潮电子信息产业股份有限公司 Testing method for startup and shutdown in remote control system
CN104360922A (en) * 2014-10-20 2015-02-18 浪潮电子信息产业股份有限公司 Method for automatically monitoring BMC working state based on ipmitool
CN104333617A (en) * 2014-11-18 2015-02-04 浪潮电子信息产业股份有限公司 Method for automatically setting static state IP for rack cabinet in Linux system
CN104333617B (en) * 2014-11-18 2018-05-25 浪潮电子信息产业股份有限公司 A kind of method that rack cabinets set static IP automatically under linux system
CN104714863A (en) * 2015-02-06 2015-06-17 浪潮电子信息产业股份有限公司 Method for completely storing Raid card logs on basis of Linux operation system after system crashes
CN105045689A (en) * 2015-06-25 2015-11-11 浪潮电子信息产业股份有限公司 Method for monitoring and alarming hard disks by using RAID card batch detection
CN106126368A (en) * 2016-08-22 2016-11-16 浪潮电子信息产业股份有限公司 Method for analyzing memory fault address under LINUX
CN106484639A (en) * 2016-10-10 2017-03-08 郑州云海信息技术有限公司 A kind of method that CPU register information is obtained by ipmi agreement
CN106603343A (en) * 2017-01-11 2017-04-26 郑州云海信息技术有限公司 A method for testing stability of servers in batch
CN106997323A (en) * 2017-04-05 2017-08-01 广东浪潮大数据研究有限公司 A kind of recording method of server B MC problem repetition steps
CN107092549A (en) * 2017-04-26 2017-08-25 郑州云海信息技术有限公司 A kind of automatic monitoring and the instrument and method for parsing memory failure
CN106991026A (en) * 2017-04-28 2017-07-28 郑州云海信息技术有限公司 It is a kind of to pass through the method that network carries out server memory Rank margin test in batches
CN107463455A (en) * 2017-08-01 2017-12-12 联想(北京)有限公司 A kind of method and device for detecting memory failure
CN107463455B (en) * 2017-08-01 2020-10-30 联想(北京)有限公司 Method and device for detecting memory fault
CN108763005A (en) * 2018-05-30 2018-11-06 郑州云海信息技术有限公司 A kind of memory ECC failures error-reporting method and system
CN108763005B (en) * 2018-05-30 2021-07-27 郑州云海信息技术有限公司 A kind of memory ECC fault reporting method and system
CN109032807A (en) * 2018-08-08 2018-12-18 郑州云海信息技术有限公司 A kind of batch monitors the method and system of internal storage state and limitation power consumption of internal memory
CN110032486A (en) * 2019-03-06 2019-07-19 平安科技(深圳)有限公司 Server test method, device, computer equipment and storage medium
CN110032486B (en) * 2019-03-06 2022-08-09 平安科技(深圳)有限公司 Server testing method and device, computer equipment and storage medium
CN114968065A (en) * 2021-02-19 2022-08-30 北京神州数码云科信息技术有限公司 An optimization application based on ipmitool tool in reading and writing FRU

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140806
