CN114138579A - Prometheus-based GPU (graphics processing unit) interactive test method, device, equipment and readable medium
- Publication number
- CN114138579A (application number CN202111436978.9A)
- Authority
- CN
- China
- Prior art keywords
- gpu
- test
- prometheus
- data
- monitoring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2205—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
- G06F11/2236—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2273—Test methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2289—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing by configuration test
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Test And Diagnosis Of Digital Computers (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a Prometheus-based GPU interaction test method, device, computer equipment and readable medium. The method comprises: configuring a GPU stress test environment and installing the GPU driver and CUDA; detecting whether the GPU identification condition is consistent with the actual configuration; detecting whether the FW version of the GPU is consistent with the FW version required by the test; simulating the actual load environment of the GPU server by applying load to the CPU, memory, hard disk and network card; applying load to the GPU with the gpu-burn-master tool; collecting real-time data, including the power consumption, temperature, performance state, GPU utilization and video memory utilization of the GPU, through the Prometheus monitoring system, and monitoring the load data of the other components; and outputting visualized test data through Grafana to check the test result.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a Prometheus-based GPU interaction test method, device, equipment and readable medium.
Background
With the development of artificial intelligence technology, GPU servers are used in a growing number of application scenarios. For a GPU server, the stability of the GPU is crucial: it determines whether the whole machine can keep working continuously and stably under high power. In testing, GPU stability is usually measured with a GPU stress test.
There are many GPU stress test tools, for example gpu-burn-master, the Thermal Test in the NVQual tool, nbody, etc.
However, these GPU stress tests generally apply load to the GPU alone, ignoring the actual working environment of the GPU server and the influence of other components on GPU stability.
In addition, after applying load with a GPU stress tool, a tester typically only checks whether the logs generated by the tool and the system logs contain anomalies such as error reports, and therefore cannot properly analyze the instantaneous data and fluctuations of the other GPU indexes.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a Prometheus-based GPU interactive test method. The method improves on existing GPU stress test methods by applying load to the whole GPU server: the CPU, memory, hard disk and network card are loaded at the same time as the GPU. This realizes an interactive GPU stress test and solves the problem that ordinary GPU stress tests load only the GPU. During the interactive test, a Prometheus-based monitoring system is introduced to monitor the fluctuation of each GPU index and to collect the data required by the test in real time, and Grafana is used to visualize the data. This makes it easier for testers to analyze logs and locate specific problems, and addresses the problem of incomplete test items and inaccurate test results.
The embodiment of the invention also aims to provide a Prometheus-based GPU interaction testing device.
The embodiment of the invention also aims to provide the computer equipment.
An object of an embodiment of the present invention is also to provide a computer-readable storage medium.
Based on the above object, one aspect of the embodiments of the present invention provides a Prometheus-based GPU interactive test method. The method comprises: configuring a GPU stress test environment and installing the GPU driver and CUDA; detecting whether the GPU identification condition is consistent with the actual configuration, and if so performing the next step, otherwise checking the link connection and repeating this step; detecting whether the FW version of the GPU is consistent with the FW version required by the test, and if so performing the next step, otherwise refreshing the FW version of the GPU and repeating this step; simulating the actual load environment of the GPU server by applying load to the CPU, memory, hard disk and network card; applying load to the GPU with the gpu-burn-master tool; collecting real-time data, including the power consumption, temperature, performance state, GPU utilization and video memory utilization of the GPU, through the Prometheus monitoring system, and monitoring the load data of the other components; and outputting visualized test data through Grafana, the test passing if the GPU test data is normal and the system generates no error log, and the problem being analyzed and located according to the test data if the GPU test data is abnormal.
In some embodiments, configuring the GPU stress test environment, installing the GPU driver, and the CUDA includes: unloading a GPU driver nouveau carried by the system, and installing a driver matched with the existing GPU; and installing the CUDA and configuring the environment variable for the CUDA.
In some embodiments, detecting whether the GPU identification condition is consistent with the actual configuration, if so, performing the next step, and if not, detecting the connection condition of the link and continuing the step includes: storing actual configuration information; monitoring the recognition condition of the GPU through a nvidia-smi command of newly installing a GPU driver; and comparing whether the two are consistent, if so, carrying out the next step, and if not, detecting the actual link connection condition by using an lspci command and continuing the step.
In some embodiments, detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, performing the FW version refresh of the GPU and continuing the step includes: storing an FW version file of the test requirements; detecting an FW version of the GPU through an nvflash tool; and comparing whether the two are consistent, if so, carrying out the next step, and if not, refreshing through the nvflash tool and the corresponding FW version file and continuing the step.
In some embodiments, simulating the actual pressure environment of the GPU server, and pressurizing the CPU, the memory, the hard disk, and the network card comprises: pressurizing the CPU through the stress tool; pressurizing the memory by a memtester tool; pressurizing the hard disk through a fio tool; and pressurizing the network card by an iperf tool.
In some embodiments, the real-time data acquisition by the Prometheus monitoring system and monitoring the pressurization data of other components comprises: installing a DCGM tool, and managing and monitoring a GPU; deploying the monitoring index by using gpu-monitoring-tools; and installing Prometheus to monitor the test index data in the test process.
In some embodiments, outputting visualized test data through Grafana, passing the test if the GPU test data is normal and the system generates no error log, and analyzing and locating the problem according to the test data if the GPU test data is abnormal includes: installing the Grafana tool to visualize the data of the Prometheus monitoring system; confirming that the test passes if, during the stress test, every GPU test data index is normal, the whole machine shows no hang, blue screen, crash or black screen, the system logs and BMC logs contain no errors such as fail or error, the hard disk SMART status is normal, and the network card bandwidth performance is normal; and, if a GPU test data index is abnormal, extracting the stress test data of the other components at the same moment and for a period before and after it for specific analysis.
In another aspect, the embodiments of the present invention also provide a Prometheus-based GPU interaction test device. The device comprises: a test environment configuration unit configured to configure and check the GPU stress test environment; a stress environment simulation unit configured to simulate the load environment of the GPU; a GPU stress test unit configured to perform the GPU stress test; a Prometheus monitoring unit configured to monitor test index data during the test; and a test result output unit configured to output and analyze the test result.
In some embodiments, the test environment configuration unit is configured to configure a GPU stress test environment, install a GPU driver and a CUDA, detect whether GPU identification information is consistent with an actual configuration, detect a connection condition of a link if not, detect whether an FW version of the GPU is consistent with an FW version required by the test, and perform FW version refresh of the GPU if not.
In some embodiments, the pressure environment simulation unit is configured to simulate an actual pressure environment of the GPU server, and pressurize the CPU, the memory, the hard disk, and the network card.
In some embodiments, the GPU stress test unit is configured to pressurize the GPU by a GPU-burn-master.
In some embodiments, the Prometheus monitoring unit is configured to perform real-time data acquisition by the Prometheus monitoring system, including indexes such as power consumption, temperature, performance status, GPU usage rate, and video memory usage rate of the GPU, and monitor the pressurization data of other components.
In some embodiments, the test result output unit is configured to output the Grafana visual test data, if the GPU test data is normal, the system generates no error log, the test is passed, and if the GPU test data is abnormal, the problem is analyzed and positioned according to the test data.
In another aspect of the embodiments of the present invention, a computer device is also provided, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the steps of a method comprising: configuring a GPU stress test environment and installing the GPU driver and CUDA; detecting whether the GPU identification condition is consistent with the actual configuration, and if so performing the next step, otherwise checking the link connection and repeating this step; detecting whether the FW version of the GPU is consistent with the FW version required by the test, and if so performing the next step, otherwise refreshing the FW version of the GPU and repeating this step; simulating the actual load environment of the GPU server by applying load to the CPU, memory, hard disk and network card; applying load to the GPU with the gpu-burn-master tool; collecting real-time data, including the power consumption, temperature, performance state, GPU utilization and video memory utilization of the GPU, through the Prometheus monitoring system, and monitoring the load data of the other components; and outputting visualized test data through Grafana, the test passing if the GPU test data is normal and the system generates no error log, and the problem being analyzed and located according to the test data if the GPU test data is abnormal.
In some embodiments, configuring the GPU stress test environment, installing the GPU driver, and the CUDA includes: unloading a GPU driver nouveau carried by the system, and installing a driver matched with the existing GPU; and installing the CUDA and configuring the environment variable for the CUDA.
In some embodiments, detecting whether the GPU identification condition is consistent with the actual configuration, if so, performing the next step, and if not, detecting the connection condition of the link and continuing the step includes: storing actual configuration information; monitoring the recognition condition of the GPU through a nvidia-smi command of newly installing a GPU driver; and comparing whether the two are consistent, if so, carrying out the next step, and if not, detecting the actual link connection condition by using an lspci command and continuing the step.
In some embodiments, detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, performing the FW version refresh of the GPU and continuing the step includes: storing an FW version file of the test requirements; detecting an FW version of the GPU through an nvflash tool; and comparing whether the two are consistent, if so, carrying out the next step, and if not, refreshing through the nvflash tool and the corresponding FW version file and continuing the step.
In some embodiments, simulating the actual pressure environment of the GPU server, pressurizing the CPU, the memory, the hard disk, and the network card comprises: pressurizing the CPU through the stress tool; pressurizing the memory by a memtester tool; pressurizing the hard disk through a fio tool; and pressurizing the network card by an iperf tool.
In some embodiments, the real-time data acquisition by the Prometheus monitoring system and monitoring the pressurization data of other components includes: installing a DCGM tool, and managing and monitoring a GPU; deploying the monitoring index by using gpu-monitoring-tools; and installing Prometheus to monitor the test index data in the test process.
In some embodiments, outputting visualized test data through Grafana, passing the test if the GPU test data is normal and the system generates no error log, and analyzing and locating the problem according to the test data if the GPU test data is abnormal includes: installing the Grafana tool to visualize the data of the Prometheus monitoring system; confirming that the test passes if, during the stress test, every GPU test data index is normal, the whole machine shows no hang, blue screen, crash or black screen, the system logs and BMC logs contain no errors such as fail or error, the hard disk SMART status is normal, and the network card bandwidth performance is normal; and, if a GPU test data index is abnormal, extracting the stress test data of the other components at the same moment and for a period before and after it for specific analysis.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that implements the above method steps when executed by a processor.
The invention has at least the following beneficial technical effects:
The Prometheus-based GPU interactive test method, carried out with the Prometheus-based GPU interactive test device, improves on existing GPU stress test methods by applying load to the whole GPU server: the CPU, memory, hard disk and network card are loaded at the same time as the GPU. This realizes an interactive GPU stress test and solves the problem that ordinary GPU stress tests load only the GPU. During the interactive test, a Prometheus-based monitoring system is introduced to monitor the fluctuation of each GPU index and to collect the data required by the test in real time, and Grafana is used to visualize the data. This makes it easier for testers to analyze logs and locate specific problems, and addresses the problem of incomplete test items and inaccurate test results.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
Fig. 1 is a schematic diagram of an embodiment of a Prometheus-based GPU interactive test method provided by the present invention;
FIG. 2 is a schematic diagram of an embodiment of a Prometheus-based GPU interaction testing apparatus according to the present invention;
FIG. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention;
FIG. 4 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two non-identical entities or parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.
In view of the foregoing, in a first aspect of the embodiments of the present invention, an embodiment of a method for GPU interactive testing based on Prometheus is provided. Fig. 1 is a schematic diagram illustrating an embodiment of a Prometheus-based GPU interaction testing method according to the present invention. As shown in fig. 1, the method for testing GPU interaction based on Prometheus according to the embodiment of the present invention includes the following steps:
001. configuring a GPU stress test environment, and installing the GPU driver and CUDA;
002. detecting whether the GPU identification condition is consistent with the actual configuration; if so, performing the next step, otherwise checking the link connection and repeating this step;
003. detecting whether the FW version of the GPU is consistent with the FW version required by the test; if so, performing the next step, otherwise refreshing the FW version of the GPU and repeating this step;
004. simulating the actual load environment of the GPU server by applying load to the CPU, memory, hard disk and network card;
005. applying load to the GPU with the gpu-burn-master tool;
006. collecting real-time data, including the power consumption, temperature, performance state, GPU utilization and video memory utilization of the GPU, through the Prometheus monitoring system, and monitoring the load data of the other components; and
007. outputting visualized test data through Grafana; if the GPU test data is normal and the system generates no error log, the test passes, and if the GPU test data is abnormal, the problem is analyzed and located according to the test data.
In this embodiment, the Prometheus-based interactive test method for GPU servers brings the GPU stress test environment closer to the real working environment of the GPU server. The real-time test data collected by the Prometheus system covers rich and accurate indexes and, together with the Grafana visual interface, lets testers observe the test data conveniently and examine problems found during testing in more detail, which greatly improves the accuracy of GPU stability testing.
Using the Prometheus monitoring system to monitor test data during the GPU interactive test makes the data more accurate and reliable; the accompanying Grafana visual interface makes the results easy to observe and gives a starting point for analysis and localization when the test reveals a problem.
In some embodiments of the present invention, configuring the GPU stress test environment, installing the GPU driver, and the CUDA comprises: unloading a GPU driver nouveau carried by the system, and installing a driver matched with the existing GPU; and installing the CUDA and configuring the environment variable for the CUDA.
The nouveau GPU driver that ships with the system is unloaded as follows:
vim /boot/efi/EFI/redhat/grub.cfg
After LANG=en_US.UTF-8, enter modprobe.blacklist=nouveau vga=791, then save and exit
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
yum -y remove xorg-x11-drv-nouveau
Reboot, then check whether the unloading succeeded with lsmod | grep nouveau;
To install the GPU driver, download the driver that corresponds to the actual GPU model;
Install the GPU driver and CUDA by running their .run installers, taking care not to install the driver bundled with CUDA;
The CUDA environment variables are configured as follows:
Add the following to ~/.bashrc
export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda-11.1/bin:$PATH
Save and exit, then run source ~/.bashrc
nvcc -V checks whether CUDA was installed successfully.
In some embodiments of the present invention, detecting whether the GPU identification condition is consistent with the actual configuration, if so, performing the next step, and if not, detecting the link connection condition and continuing the step includes: storing actual configuration information; monitoring the recognition condition of the GPU through a nvidia-smi command of newly installing a GPU driver; and comparing whether the two are consistent, if so, carrying out the next step, and if not, detecting the actual link connection condition by using an lspci command and continuing the step.
In this embodiment, the actual configuration information is stored, and the nvidia-smi command, available once the GPU driver is installed, is used to check which GPUs are recognized. The two are compared; if they are consistent, the next step is performed, and if not, the actual link connection is checked with the lspci | grep -i nvidia command.
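The comparison in this step can be scripted. The following is a minimal sketch rather than the procedure of the invention itself: it assumes the tester supplies the expected GPU count, and that `nvidia-smi -L` prints one line per recognized GPU beginning with `GPU <index>:`. The function names `count_gpus` and `check_gpu_count` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the number of GPUs the driver recognizes with the
# expected configuration. Function names are illustrative only.

# Count GPU entries in `nvidia-smi -L` style output read from stdin.
# Prints 0 when no GPU line is present.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# check_gpu_count <expected> <detected>
# Prints a verdict; returns 0 on match and 1 on mismatch.
check_gpu_count() {
    local expected="$1" detected="$2"
    if [ "$expected" -eq "$detected" ]; then
        echo "OK: ${detected} GPU(s) recognized"
    else
        echo "MISMATCH: expected ${expected}, recognized ${detected}; check the link with: lspci | grep -i nvidia"
        return 1
    fi
}

# Example usage on a GPU server (the expected count of 8 is an assumption):
#   check_gpu_count 8 "$(nvidia-smi -L | count_gpus)"
```

Comparing only the count is the simplest check; a stricter variant could also compare model names against the stored configuration.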
In some embodiments of the present invention, detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, performing the FW version refresh of the GPU and continuing the step includes: storing an FW version file of the test requirements; detecting an FW version of the GPU through an nvflash tool; and comparing whether the two are consistent, if so, carrying out the next step, and if not, refreshing through the nvflash tool and the corresponding FW version file and continuing the step.
In this embodiment, the FW version file required by the test is stored, and the nvflash tool is used to detect the GPU FW version. The two are compared; if they are consistent, the next step is performed, and if not, the FW is refreshed using the nvflash tool and the corresponding FW version file.
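The version comparison can be sketched as below. This is an illustration under stated assumptions, not the method of the invention: the source detects the version with nvflash but does not show its output format, so the usage line swaps in `nvidia-smi --query-gpu=vbios_version`, which prints one VBIOS version per GPU. The function name `all_fw_match` and the placeholder version string are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that every detected firmware (VBIOS) version equals the
# version required by the test. Names are illustrative only.

# all_fw_match <required_version>
# Reads one detected version per line from stdin. Returns 0 only if at
# least one version was read and every version equals the required one.
all_fw_match() {
    local required="$1" line seen=0
    while IFS= read -r line; do
        # Drop any whitespace, including trailing carriage returns.
        line="$(printf '%s' "$line" | tr -d '[:space:]')"
        if [ -z "$line" ]; then
            continue
        fi
        seen=1
        if [ "$line" != "$required" ]; then
            echo "MISMATCH: found ${line}, required ${required}"
            return 1
        fi
    done
    [ "$seen" -eq 1 ]
}

# Example usage (the required version string is a placeholder):
#   nvidia-smi --query-gpu=vbios_version --format=csv,noheader | all_fw_match "<required version>"
```

An empty input is treated as a failure so that a missing driver or an unrecognized GPU is not mistaken for a match.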
In some embodiments of the present invention, simulating an actual pressure environment of the GPU server, and pressurizing the CPU, the memory, the hard disk, and the network card includes: pressurizing the CPU through the stress tool; pressurizing the memory by a memtester tool; pressurizing the hard disk through a fio tool; and pressurizing the network card by an iperf tool.
In this embodiment, the stress tool is installed and used to apply load to the CPU, with the following command:
nohup stress -c <number of processes> -t 172800 &
The memtester tool is installed and used to apply load to the memory, with the following command:
memtester <amount of memory to test> <number of test iterations>
The fio tool is installed and used to apply load to the hard disk; the parameters required by the fio test are written into fio_parameter.txt, and the following command is run:
nohup fio fio_parameter.txt &
The test machine and the auxiliary machine are connected with a network cable, the iperf tool is installed on both, and iperf is used to apply load to the network card, with the following commands:
Auxiliary end: iperf -s
Test end: iperf -c <auxiliary end IP> -w 512k -i 1 -t 172800 -P <number of processes>
The GPU is loaded with gpu-burn-master, using the following commands:
unzip gpu-burn-master.zip
cd gpu-burn-master
make
./gpu_burn -d $((60*60*48)) | tee -a gpu-burn-result.log
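The source does not list the contents of fio_parameter.txt. The sketch below shows one way such a job file might be generated; every parameter value (I/O engine, block size, queue depth, job count, read/write mix) is an assumption chosen for illustration, and `write_fio_job` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: generate a fio job file such as fio_parameter.txt.
# All parameter values below are assumptions for illustration; the
# source does not specify them.

# write_fio_job <output_file> <target_device_or_file> <runtime_seconds>
write_fio_job() {
    local out="$1" target="$2" runtime="$3"
    cat > "$out" <<EOF
[global]
ioengine=libaio
direct=1
time_based
runtime=${runtime}
group_reporting

[disk-stress]
filename=${target}
rw=randrw
rwmixread=50
bs=4k
iodepth=32
numjobs=4
EOF
}

# Example usage (the device path is a placeholder; writing to a raw
# device destroys the data on it):
#   write_fio_job fio_parameter.txt /dev/sdX 172800
#   nohup fio fio_parameter.txt &
```

A runtime of 172800 seconds matches the 48-hour duration used by the stress, iperf and GPU burn commands, so all loads run for the same period.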
In some embodiments of the present invention, collecting real-time data through the Prometheus monitoring system and monitoring the load data of the other components comprises: installing the DCGM tool to manage and monitor the GPU; deploying the monitoring metrics with gpu-monitoring-tools; and installing Prometheus to monitor the test index data during the test.
In this embodiment, the DCGM tool is installed with the following command:
dpkg -i datacenter-gpu-manager_1.7.2_amd64.deb
The monitoring metrics are deployed with gpu-monitoring-tools, using the following commands:
git clone https://gitee.com/JackTpy/gpu-monitoring-tools.git
go env -w GOPROXY=https://goproxy.cn
cd gpu-monitoring-tools/
make binary
make install
dcgm-exporter
vim /etc/systemd/system/dcgm-exporter.service
Enter the following:
[Unit]
Description=dcgm-exporter service
[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter
TimeoutStopSec=10
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
save and exit
systemctl daemon-reload
systemctl enable dcgm-exporter
systemctl start dcgm-exporter
systemctl status dcgm-exporter
Monitoring the CPU by using node _ CPU _ seconds _ total;
finding a subset monitoring memory by using the node _ memory as a prefix;
monitoring the hard disk by using node _ disk _ reads _ completed _ total and node _ disk _ writes _ completed _ total;
monitoring the network card by using node _ network _ receive _ bytes _ total;
prometous is installed, and the specific instructions are as follows:
tar -C /usr/local/ -xvf prometheus-2.20.1.linux-amd64.tar.gz
ln -sv /usr/local/prometheus-2.20.1.linux-amd64/ /usr/local/Prometheus
/usr/local/Prometheus/prometheus --config.file=/usr/local/Prometheus/prometheus.yml &
The Prometheus monitoring page is available at <server IP>:9090.
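The text does not show the contents of prometheus.yml. As a hedged sketch, the configuration could be generated as follows, assuming the GPU exporter listens on port 9400 and that the node_ metrics listed earlier come from a node_exporter on port 9100. Both ports, the job names and the function name make_prometheus_config are assumptions rather than details taken from the original.

```shell
#!/bin/sh
# Emit a minimal prometheus.yml that scrapes a GPU exporter and a node exporter.
# Assumptions: dcgm-exporter on port 9400, node_exporter on port 9100.
make_prometheus_config() {
    # $1: host name or IP of the server under test (defaults to localhost)
    host="${1:-localhost}"
    cat <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'dcgm'
    static_configs:
      - targets: ['${host}:9400']
  - job_name: 'node'
    static_configs:
      - targets: ['${host}:9100']
EOF
}
```

The output can be redirected to the file passed to --config.file, for example `make_prometheus_config localhost > /usr/local/Prometheus/prometheus.yml`, before Prometheus is started.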
In some embodiments of the present invention, outputting visual test data through Grafana, passing the test if the GPU test data is normal and no error log is generated in the system, and analyzing and locating the problem according to the test data if the GPU test data is abnormal comprises: installing the Grafana tool to visually display the data of the Prometheus monitoring system; confirming that the test is passed if, during the pressure test, every GPU test data index is normal, the whole machine shows no hang, blue screen, crash or black screen, the system logs and BMC logs contain no errors such as fail or error, the hard disk SMART data is normal, and the network card bandwidth performance is normal; and, when GPU test data is abnormal, observing the abnormal indexes and extracting the pressure test data of the other components at the same moment and for a period before and after it for specific analysis.
In this embodiment, the visual test data is output through Grafana; the specific instructions are as follows:
rpm -ivh grafana-5.4.2-1.x86_64.rpm --force --nodeps
systemctl daemon-reload
systemctl start grafana-server.service
systemctl enable grafana-server.service
The Grafana page is available at <server IP>:3000; the default user name and password are both admin.
If the GPU test data is normal and no error log is generated in the system, the test is passed, which comprises the following:
during the pressure test, the whole machine shows no hang, blue screen, crash or black screen;
every GPU test data index on the Grafana page is within the normal range, and the monitoring of the other components is normal;
the system logs and BMC logs are collected with the following specific instructions:
ipmitool sel elist > /root/GPU_stress_log/sel.log
cat /var/log/messages > /root/GPU_stress_log/messages
cat /var/log/dmesg > /root/GPU_stress_log/dmesg
cat /var/log/mcelog > /root/GPU_stress_log/mcelog
If no error information such as fail or error appears in the logs, the test is passed.
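The keyword check on the collected logs can be sketched as a small script. This is one possible reading of the pass criterion, not the original implementation: it treats any case-insensitive occurrence of "fail" or "error" in any file under the log directory as a failure. The function name check_logs is hypothetical.

```shell
#!/bin/sh
# Scan the collected logs for failure keywords and report PASS or FAIL.
check_logs() {
    # $1: directory holding sel.log, messages, dmesg and mcelog
    if grep -r -i -E -q 'fail|error' "$1"; then
        echo "FAIL"
        return 1
    fi
    echo "PASS"
}
```

A call such as `check_logs /root/GPU_stress_log` prints the verdict. Plain substring matching also flags harmless lines that merely contain one of the keywords, so a FAIL result is better treated as a prompt to read the matching lines (for example with `grep -r -i -n -E 'fail|error'`) than as a final verdict.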
If the GPU test data is abnormal, the problem is analyzed and located according to the test data, which comprises the following:
the abnormal indexes of the GPU test data are observed, and the pressure test data of the other components at the same moment and for a period before and after it are extracted for specific analysis. This facilitates longitudinal comparison to determine which components affected the stability of the GPU at that moment or during that period, helps locate and analyze the problem, and provides ideas for solving practical problems.
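Extracting the data of other components around an abnormal moment can be illustrated with the Prometheus HTTP API, which offers a range query endpoint at /api/v1/query_range. The sketch below is an assumption about how such an extraction might be scripted; the function names, the 15-second step and the choice of a symmetric window are illustrative and do not come from the original.

```shell
#!/bin/sh
# Compute the [start, end] window around an anomaly timestamp (Unix seconds).
anomaly_window() {
    # $1: anomaly time in Unix seconds, $2: half-width of the window in seconds
    echo "$(( $1 - $2 )) $(( $1 + $2 ))"
}

# Fetch one metric over that window from the Prometheus HTTP API.
fetch_window() {
    # $1: Prometheus base URL, $2: PromQL expression,
    # $3: anomaly time in Unix seconds, $4: half-width in seconds
    window=$(anomaly_window "$3" "$4")
    start=${window% *}   # text before the space
    end=${window#* }     # text after the space
    curl -s -G "$1/api/v1/query_range" \
        --data-urlencode "query=$2" \
        --data-urlencode "start=$start" \
        --data-urlencode "end=$end" \
        --data-urlencode "step=15"
}
```

For instance, `fetch_window http://localhost:9090 node_network_receive_bytes_total 1638144000 300` would return the network counter for five minutes on either side of a hypothetical anomaly time. Running the same call for the CPU, memory and disk metrics gives the side-by-side view that the comparison relies on.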
In view of the above object, a second aspect of the embodiments of the present invention provides a Prometheus-based GPU interaction testing apparatus. Fig. 2 is a schematic diagram illustrating an embodiment of a Prometheus-based GPU interaction testing apparatus according to the present invention. As shown in Fig. 2, the Prometheus-based GPU interaction testing apparatus according to the embodiment of the present invention includes the following components: a test environment configuration unit 011 configured to configure and detect a GPU stress test environment; a pressure environment simulation unit 012 configured to simulate a pressure environment of the GPU; a GPU stress test unit 013 configured for GPU stress testing; a Prometheus monitoring unit 014 configured to monitor test index data in a test process; and a test result output unit 015 configured to output a test result and analyze the test result.
In some embodiments of the invention, the test environment configuration unit 011 is further configured to: configure a GPU pressure test environment and install a GPU driver and a CUDA; detect whether the GPU identification information is consistent with the actual configuration and, if it is not, detect the link connection condition; and detect whether the FW version of the GPU is consistent with the FW version required by the test and, if it is not, refresh the FW version of the GPU.
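The identification check can be sketched as a comparison between a stored GPU count and the devices listed by the driver. The sketch assumes that the expected count is known in advance and that the listing comes from `nvidia-smi -L`, which prints one line per device beginning with "GPU <index>:". The function name gpu_count_matches is hypothetical, and comparing only the count is a simplification of comparing the full configuration.

```shell
#!/bin/sh
# Compare the number of GPUs the driver reports with the expected count.
# Reads `nvidia-smi -L` style output on stdin (one "GPU N: ..." line per device).
gpu_count_matches() {
    # $1: expected GPU count taken from the stored configuration
    found=$(grep -c -E '^GPU [0-9]+:' || true)
    if [ "$found" -eq "$1" ]; then
        echo "match ($found GPUs)"
    else
        echo "mismatch (expected $1, found $found)"
        return 1
    fi
}
```

A possible use is `nvidia-smi -L | gpu_count_matches 8 || lspci | grep -i nvidia`, which falls back to listing the PCI devices when the count differs so that the link connection can be inspected.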
In some embodiments of the present invention, the pressure environment simulation unit 012 is further configured to: simulate the actual pressure environment of the GPU server and pressurize the CPU, the memory, the hard disk and the network card.
In some embodiments of the invention, the GPU stress test unit 013 is further configured to: pressurize the GPU with gpu-burn-master.
In some embodiments of the invention, the Prometheus monitoring unit 014 is further configured to: acquire real-time data through the Prometheus monitoring system, including indexes such as the power consumption, temperature, performance state, GPU utilization rate and video memory utilization rate of the GPU, and simultaneously monitor the pressurization data of the other components.
In some embodiments of the present invention, the test result output unit 015 is further configured to: output visual test data through Grafana; if the GPU test data is normal and no error log is generated in the system, the test is passed, and if the GPU test data is abnormal, the problem is analyzed and located according to the test data.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device. Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in Fig. 3, the computer device of the embodiment of the present invention includes the following components: at least one processor 021; and a memory 022, the memory 022 storing computer instructions 023 executable on the processor, the instructions, when executed by the processor, implementing the steps of the following method: configuring a GPU pressure test environment, and installing a GPU driver and a CUDA; detecting whether the GPU identification condition is consistent with the actual configuration, if so, performing the next step, otherwise, detecting the link connection condition and continuing the step; detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, refreshing the FW version of the GPU and continuing the step; simulating the actual pressure environment of the GPU server, and pressurizing a CPU, a memory, a hard disk and a network card; pressurizing the GPU by a GPU-burn-master tool; acquiring real-time data including power consumption, temperature, performance state, GPU utilization rate and video memory utilization rate of the GPU through a Prometheus monitoring system, and monitoring pressurization data of other components; and outputting visual test data through Grafana, wherein if the GPU test data is normal and no error log is generated in the system, the test is passed, and if the GPU test data is abnormal, the problem is analyzed and located according to the test data.
The invention also provides a computer readable storage medium. FIG. 4 is a schematic diagram illustrating an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 4, the computer readable storage medium 031 stores a computer program 032 which, when executed by a processor, performs the method as described above.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program of the method for centralized server testing can be stored in a computer readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. Which when executed by a processor performs the above-described functions defined in the methods disclosed in embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are all included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.
Claims (10)
1. A Prometheus-based GPU interaction testing method is characterized by comprising the following steps:
configuring a GPU pressure test environment, and installing a GPU driver and a CUDA;
detecting whether the GPU identification condition is consistent with the actual configuration, if so, performing the next step, otherwise, detecting the link connection condition and continuing the step;
detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, refreshing the FW version of the GPU and continuing the step;
simulating the actual pressure environment of the GPU server, and pressurizing a CPU, a memory, a hard disk and a network card;
pressurizing the GPU by a GPU-burn-master tool;
acquiring real-time data including power consumption, temperature, performance state, GPU utilization rate and video memory utilization rate of the GPU through a Prometheus monitoring system, and monitoring pressurization data of other components; and
outputting visual test data through Grafana, wherein if the GPU test data is normal and no error log is generated in the system, the test is passed, and if the GPU test data is abnormal, the problem is analyzed and located according to the test data.
2. The Prometheus-based GPU interactive testing method of claim 1, wherein configuring a GPU stress testing environment, installing a GPU driver and CUDA comprises:
uninstalling the nouveau GPU driver that ships with the system, and installing a driver matched with the existing GPU; and
installing the CUDA and configuring environment variables for the CUDA.
3. The Prometheus-based GPU interactive testing method according to claim 1, wherein detecting whether a GPU identification condition is consistent with an actual configuration, if so, performing the next step, and if not, detecting a link connection condition and continuing the step includes:
storing actual configuration information;
checking the identification condition of the GPU through the nvidia-smi command provided by the newly installed GPU driver; and
comparing whether the two are consistent, if so, performing the next step, and if not, detecting the actual link connection condition by using an lspci command and continuing the step.
4. The Prometheus-based GPU interactive testing method as claimed in claim 1, wherein detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, performing the FW version refresh of the GPU and continuing the step, comprising:
storing an FW version file of the test requirements;
detecting an FW version of the GPU through an nvflash tool;
comparing whether the two are consistent, if so, performing the next step, and if not, refreshing through the nvflash tool and the corresponding FW version file and continuing the step.
5. The Prometheus-based GPU interactive testing method of claim 1, wherein simulating an actual pressure environment of a GPU server to pressurize a CPU, a memory, a hard disk, and a network card comprises:
pressurizing the CPU through the stress tool;
pressurizing the memory by a memtester tool;
pressurizing the hard disk through a fio tool; and
pressurizing the network card through an iperf tool.
6. The Prometheus-based GPU interactive testing method of claim 1, wherein performing real-time data acquisition by a Prometheus monitoring system and monitoring pressurization data of other components comprises:
installing a DCGM tool, and managing and monitoring a GPU;
deploying the monitoring index by using gpu-monitoring-tools; and
installing Prometheus to monitor the test index data in the test process.
7. The Prometheus-based GPU interaction testing method according to claim 1, wherein outputting visual test data through Grafana, passing the test if the GPU test data is normal and no error log is generated in the system, and analyzing and locating the problem according to the test data if the GPU test data is abnormal comprises:
installing a Grafana tool, and visually displaying the data of the Prometheus monitoring system;
confirming that the test is passed if, during the pressure test, each test data index of the GPU is normal, the whole machine has no hang, blue screen, crash or black screen, the system logs and BMC logs contain no fail or error reports, the hard disk smartlog is normal, and the network card bandwidth performance is normal; and
observing the abnormal indexes of the GPU test data, and extracting the pressure test data of the other components at the same moment and for a period before and after it for specific analysis.
8. A Prometheus-based GPU interaction testing device is characterized by comprising:
a test environment configuration unit configured for configuration and detection of a GPU pressure test environment;
a pressure environment simulation unit configured for simulating the pressure environment of the GPU;
a GPU stress test unit configured for GPU stress testing;
a Prometheus monitoring unit configured for monitoring test index data in the test process; and
a test result output unit configured for outputting the test result and analyzing the test result.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111436978.9A CN114138579A (en) | 2021-11-29 | 2021-11-29 | Prometous-based GPU (graphics processing Unit) interactive test method, device, equipment and readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114138579A true CN114138579A (en) | 2022-03-04 |
Family
ID=80389282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111436978.9A Withdrawn CN114138579A (en) | 2021-11-29 | 2021-11-29 | Prometous-based GPU (graphics processing Unit) interactive test method, device, equipment and readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114138579A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115391124A (en) * | 2022-10-27 | 2022-11-25 | 瀚博半导体(上海)有限公司 | Method and device for testing power consumption of graphic chip |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104407951A (en) * | 2014-11-05 | 2015-03-11 | 浪潮电子信息产业股份有限公司 | Method for automatically testing complete server |
CN107423183A (en) * | 2017-04-25 | 2017-12-01 | 郑州云海信息技术有限公司 | A kind of GTX series video card calculates the applied voltage test method of performance |
CN110413462A (en) * | 2019-06-29 | 2019-11-05 | 苏州浪潮智能科技有限公司 | A kind of server stress test method and device |
CN113392005A (en) * | 2021-06-16 | 2021-09-14 | 中国工商银行股份有限公司 | Large file processing test method and system |
-
2021
- 2021-11-29 CN CN202111436978.9A patent/CN114138579A/en not_active Withdrawn
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20220304 |