CN112540900B

CN112540900B - Real-time monitoring and analyzing method for large-scale parallel program

Info

Publication number: CN112540900B
Application number: CN201910892876.4A
Authority: CN
Inventors: 冯赟龙; 刘勇; 何王全; 陈华蓉; 宋佳伟; 王敬宇; 彭达佳; 孙川; 罗威; 张威; 梁艳
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2019-09-20
Filing date: 2019-09-20
Publication date: 2022-11-25
Anticipated expiration: 2039-09-20
Also published as: CN112540900A

Abstract

The invention discloses a real-time monitoring and analysis method for large-scale parallel programs, comprising the following steps. S1: select m performance indexes that reflect the running state of the program. S2: collect data for the selected indexes. S3: combine the index data collected for the same process in S2 at n adjacent sampling times into a longitudinal vector, and calculate the cosine similarity of the same index between different processes. S4: calculate the other indexes of each problem process as in step S3; if the values calculated for all remaining indexes also identify the process as a problem process, judge it to be an abnormal process, and if the result for one or more indexes does not exceed the threshold value, judge it to be a suspicious process. S5: output the normal, suspicious and abnormal processes obtained in S3 and S4 to a display screen. The invention reduces the overhead and interference imposed on the application program while monitoring and analyzing the parallel application.

Description

Real-time monitoring and analyzing method for large-scale parallel program

Technical Field

The invention belongs to the technical field of computer parallel program optimization, and particularly relates to a real-time monitoring and analyzing method for a large-scale parallel program.

Background

Heterogeneous many-core processors have more complex hardware architectures, which makes debugging and tuning harder during application development, and many applications carry latent errors and performance risks. Most high-performance computing applications run at large scale and for a long time, so errors such as deadlocks and hangs, or serious performance problems, are likely to occur during execution. These problems seriously affect the correctness and efficiency of the application software, and in turn the application output of the high-performance computing system and the progress of related scientific research projects.

Traditional parallel program monitoring software and systems generally collect performance data by instrumentation or sampling and analyze it after the program has finished running, as gprof does. Tools that monitor the running state of a program usually capture the program's current execution stack to help debug problems such as deadlocks and exceptions; STAT is a typical example. When these common instrumentation-based and sampling-based collection methods are applied to a heterogeneous many-core system, the instrumentation code interferes with the application's own code and adds run-time overhead, while the sampling program competes with the application for a large amount of hardware resources. Methods based on hardware performance counters hand part of the collection work to the hardware and so reduce overhead, but when the counters are accessed through system calls, the system calls themselves still introduce considerable overhead and interference.

Therefore, there is a need in the art to develop a monitoring and analyzing method for heterogeneous many-core processors to solve the problems of overhead and interference in the detection of massively parallel applications.

Disclosure of Invention

The invention aims to provide a real-time monitoring and analyzing method for a large-scale parallel program, which solves the problems of high cost and high interference in the monitoring and analyzing process of the large-scale parallel application program.

In order to achieve the purpose, the invention adopts the technical scheme that: a real-time monitoring and analyzing method for a large-scale parallel program is based on a heterogeneous many-core processor and comprises the following steps:

S1: selecting m performance indexes capable of reflecting the program running state, according to what the hardware performance counters support;

S2: collecting the selected performance index data, the collecting method comprising the steps of:

A1: acquiring, through a job management system, the list of computing nodes used by each process of the parallel application program;

A2: dividing tasks according to the list of computing nodes that the hardware maintenance interface can process at the same time, and establishing task groups and a task group queue;

A3: each subprocess of the parallel application program obtains a task group from the task group queue of A2 and distributes it to a plurality of threads for execution;

A4: each thread calls the hardware maintenance interface and reads the data in the hardware performance counters on the physical computing nodes;

S31: before analyzing the parallel application program, selecting a run of the same parallel application program with normal performance indexes, forming the index data acquired for the same process at N adjacent times into a longitudinal vector, calculating the cosine similarity of the same index among different processes, and storing the cosine similarity calculated for each index as a threshold value;

S32: forming the index data acquired for the same process in S2 at n adjacent times into a longitudinal vector, and calculating the cosine similarity of the same index between different processes; according to the threshold value set in S31, if the cosine similarity of two different processes exceeds the threshold value, the two processes are a problem process pair; if a process forms problem process pairs with more than 2/3 of the processes of the parallel application program, the process is judged to be a problem process, and otherwise it is judged to be an undetermined process;

S4: calculating the other indexes of each problem process according to step S32; if the values calculated for all the remaining indexes also identify the process as a problem process, judging it to be an abnormal process, and if the calculation result of one or more indexes does not exceed the threshold value, judging it to be a suspicious process;

calculating the other indexes of each undetermined process according to step S32; if the values calculated for all the remaining indexes also identify the process as an undetermined process, judging it to be a normal process, and if the calculation result of one or more indexes does not exceed the threshold value, judging it to be a suspicious process;

S5: outputting the normal processes, the suspicious processes and the abnormal processes obtained in S32 and S4 to a display screen.
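The cross-index decision of step S4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `classify_processes` and the data layout (one dictionary of per-process flags for each index, as step S32 might produce) are assumptions made for illustration. The sketch reads the S4 wording as follows: a process flagged by every index is abnormal, a process flagged by no index is normal, and any mixed result is suspicious.

```python
from typing import Dict, List


def classify_processes(flags_by_index: List[Dict[int, bool]]) -> Dict[int, str]:
    """Combine per-index verdicts into normal / suspicious / abnormal.

    flags_by_index[k][pid] is True when index k marks process pid as a
    problem process (hypothetical layout; element 0 is the first index).
    """
    first, rest = flags_by_index[0], flags_by_index[1:]
    verdicts: Dict[int, str] = {}
    for pid, is_problem in first.items():
        others = [flags[pid] for flags in rest]
        if is_problem and all(others):
            # Every remaining index confirms the problem verdict.
            verdicts[pid] = "abnormal"
        elif not is_problem and not any(others):
            # Every remaining index confirms the undetermined verdict.
            verdicts[pid] = "normal"
        else:
            # The indexes disagree, so the user should take a look.
            verdicts[pid] = "suspicious"
    return verdicts


if __name__ == "__main__":
    flags = [
        {0: True, 1: False, 2: True},   # verdicts from index 1
        {0: True, 1: False, 2: False},  # verdicts from index 2
    ]
    print(classify_processes(flags))
```

Keeping the per-index flags separate from the final verdict makes it straightforward to retain and display the index data of suspicious processes.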

Further improvements to the above technical scheme are as follows:

1. In the above scheme, in step A4, the hardware maintenance interface reads data of the hardware performance counters on the plurality of physical computing nodes at the same time.

2. In the above scheme, in step S4, the index data of the suspicious process is retained and fed back to the display screen.

Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:

In the real-time monitoring and analysis method for large-scale parallel programs of the invention, data are acquired by combining the hardware maintenance interface with the hardware performance counters, which avoids the overhead of system calls and improves the efficiency of data acquisition. In addition, because the processes of a parallel application program behave similarly, cosine similarity analysis reduces the number of indexes that need to be collected, so the analysis method further reduces overhead and interference.

Drawings

FIG. 1 is a schematic flow diagram of a real-time monitoring and analysis method for massively parallel programs;

FIG. 2 is a schematic diagram of longitudinal vector selection in the present invention;

FIG. 3 is a schematic diagram of the selection of transverse vectors in the present invention.

Detailed Description

The invention is further described below with reference to the following examples:

Example: a real-time monitoring and analysis method for a large-scale parallel program, based on a heterogeneous many-core processor, comprises the following steps:

S2: collecting the selected performance index data, such as the number of instructions executed per processor cycle, the number of memory access instructions executed per processor cycle, the ratio of idle processor cycles, and the instruction cache miss rate, wherein the collecting method comprises the following steps:

Here, the computing nodes that the hardware maintenance interface can process simultaneously follow a certain regular pattern in their physical layout;

The hardware maintenance interface reads the data of the hardware performance counters on a plurality of physical computing nodes at the same time;

S4: calculating the other indexes of each problem process according to step S3; if the values calculated for all the remaining indexes also identify the process as a problem process, judging it to be an abnormal process, and if the calculation result of one or more indexes does not exceed the threshold value, judging it to be a suspicious process;
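The example indexes named in step S2 are ratios that could be derived from raw counter readings taken over one sampling interval. The following Python sketch shows one possible derivation. The field names of `CounterDelta` and the function `derive_indexes` are assumptions made for illustration; the source does not list the actual hardware counters of the target processor.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CounterDelta:
    """Counter increments over one sampling interval (hypothetical names)."""
    cycles: int
    instructions: int
    memory_instructions: int
    idle_cycles: int
    icache_accesses: int
    icache_misses: int


def _ratio(numerator: int, denominator: int) -> float:
    # Assumption: report 0.0 when the denominator did not advance.
    return numerator / denominator if denominator else 0.0


def derive_indexes(delta: CounterDelta) -> Dict[str, float]:
    """Turn raw counter increments into the four example indexes of step S2."""
    return {
        "instructions_per_cycle": _ratio(delta.instructions, delta.cycles),
        "memory_instructions_per_cycle": _ratio(delta.memory_instructions, delta.cycles),
        "idle_cycle_ratio": _ratio(delta.idle_cycles, delta.cycles),
        "icache_miss_rate": _ratio(delta.icache_misses, delta.icache_accesses),
    }


if __name__ == "__main__":
    sample = CounterDelta(1000, 500, 100, 250, 400, 20)
    print(derive_indexes(sample))
```

Working with increments rather than absolute counter values keeps each sample independent of how long the program has been running.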

The core parts are a hardware-software cooperative data acquisition method based on the maintenance interface, and an abnormal process detection method based on cosine similarity analysis of run-time longitudinal state vectors.

1. A hardware-software cooperative data acquisition method based on the maintenance interface (see FIG. 1):

The hardware maintenance interface can collect data from hundreds of computing nodes in a batch, but it also has to handle a number of other tasks, so data collection must not place too heavy a load on it. Only a few key indexes are therefore selected for collection, and the concurrency of data collection and processing is increased through task division, multiple processes and multiple threads;

Task division: according to the design of the maintenance interface, the processes whose data can be collected in a single call are grouped into one task;

Multi-process: a task queue and task groups are established, and each process takes one task group from the task queue for execution;

Multithreading: each process spawns multiple threads to execute the tasks in its task group.
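The task division, task group queue and multithreaded execution described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: all function names are hypothetical, `fake_read_counters` stands in for the hardware maintenance interface (whose real API the source does not describe), and a single worker process drains the queue, whereas the described scheme has several processes sharing it.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Task = List[str]  # node names that one maintenance-interface call can cover


def divide_tasks(nodes: List[str], nodes_per_call: int) -> List[Task]:
    """Task division: nodes readable in one call form one task."""
    return [nodes[i:i + nodes_per_call] for i in range(0, len(nodes), nodes_per_call)]


def build_group_queue(tasks: List[Task], tasks_per_group: int) -> queue.Queue:
    """Bundle tasks into task groups and put the groups on a queue."""
    groups: queue.Queue = queue.Queue()
    for i in range(0, len(tasks), tasks_per_group):
        groups.put(tasks[i:i + tasks_per_group])
    return groups


def run_worker(groups: queue.Queue,
               read_counters: Callable[[Task], Dict[str, List[int]]],
               threads: int = 4) -> Dict[str, List[int]]:
    """One worker: take task groups off the queue and run each task on a thread."""
    results: Dict[str, List[int]] = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            try:
                group = groups.get_nowait()
            except queue.Empty:
                break
            # pool.map runs the tasks of this group concurrently.
            for batch in pool.map(read_counters, group):
                results.update(batch)
    return results


def fake_read_counters(task: Task) -> Dict[str, List[int]]:
    """Placeholder for the maintenance-interface call; returns dummy values."""
    return {node: [len(node), 0] for node in task}


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(10)]
    tasks = divide_tasks(nodes, nodes_per_call=4)
    data = run_worker(build_group_queue(tasks, tasks_per_group=2), fake_read_counters)
    print(sorted(data))
```

With real multi-process collection, the in-memory queue would need to be replaced by an inter-process queue; the division into tasks and groups stays the same.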

2. An abnormal process detection method based on cosine similarity analysis of run-time longitudinal state vectors (see FIGS. 2 and 3):

The subprocesses of a parallel program are created to accelerate a given task and often have highly similar code behavior, so abnormal processes can be detected by analyzing how similar the run-time behaviors of the subprocesses are.

The run-time behavior of a process is represented by a run-time state vector. For a given process, the values of one index collected at different times form a longitudinal vector, while the values of several indexes collected at the same time form a transverse vector. Constructing a transverse vector requires collecting enough performance indexes, which brings relatively large overhead and interference, so the longitudinal-vector method is adopted:

(1) m important program running state indexes are selected for monitoring;

(2) A vector is formed from the data collected at n adjacent times and a similarity threshold is set; the cosine similarity of index 1 is calculated between different processes and the process pairs exceeding the threshold are marked; if a process is dissimilar to most of the processes (for example, more than 2/3 of them), the process may have a problem;

(3) The other indexes are calculated in the same way as in step (2) and the results of all indexes are cross-checked; if all indexes yield the same abnormal processes, those processes are reported as abnormal with certainty, and if the results are inconsistent, the user takes part in the analysis.
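Step (2) can be illustrated with a short Python sketch of the pairwise cosine similarity and the 2/3 rule. It rests on two labeled assumptions. First, the text speaks of pairs "exceeding the threshold" while also describing problem processes as dissimilar to most others; the sketch treats a pair as a problem pair when its similarity falls below the threshold. Second, the 2/3 fraction is taken over the total number of processes. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Set


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def problem_processes(vectors: Dict[int, Sequence[float]],
                      threshold: float,
                      fraction: float = 2 / 3) -> Set[int]:
    """Return processes that form problem pairs with more than `fraction` of all processes.

    vectors maps a process id to its longitudinal vector for one index.
    """
    pids = sorted(vectors)
    pair_counts = {pid: 0 for pid in pids}
    for i, p in enumerate(pids):
        for q in pids[i + 1:]:
            # Assumption: a pair is a problem pair when it is too dissimilar.
            if cosine_similarity(vectors[p], vectors[q]) < threshold:
                pair_counts[p] += 1
                pair_counts[q] += 1
    return {pid for pid in pids if pair_counts[pid] > fraction * len(pids)}


if __name__ == "__main__":
    vectors = {
        0: [1.0, 2.0, 3.0, 4.0],
        1: [2.0, 4.0, 6.0, 8.0],
        2: [1.0, 2.0, 3.0, 4.1],
        3: [4.0, 3.0, 2.0, 1.0],  # behaves differently from the rest
    }
    print(problem_processes(vectors, threshold=0.9))
```

Cosine similarity ignores vector magnitude, so two processes whose index values differ only by a constant factor still count as similar, as processes 0 and 1 do here.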

In the figures, a, b, c, d and e within a process each denote one index of that process, together with the data values collected for that index along the timeline.
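The difference between the longitudinal and transverse vectors can be made concrete with a small Python sketch. It assumes, purely for illustration, that the samples of one process are stored as a list of rows, one row per sampling time and one column per index; the function names are hypothetical.

```python
from typing import List, Sequence


def longitudinal_vector(samples: Sequence[Sequence[float]],
                        index: int, start: int, n: int) -> List[float]:
    """One index over n adjacent sampling times (a column slice)."""
    window = samples[start:start + n]
    if len(window) < n:
        raise ValueError("not enough samples for the requested window")
    return [row[index] for row in window]


def transverse_vector(samples: Sequence[Sequence[float]], t: int) -> List[float]:
    """All indexes at one sampling time (a single row)."""
    return list(samples[t])


if __name__ == "__main__":
    # Rows are sampling times; columns are indexes a and b of one process.
    samples = [
        [1.0, 10.0],
        [2.0, 20.0],
        [3.0, 30.0],
    ]
    print(longitudinal_vector(samples, index=1, start=0, n=3))
    print(transverse_vector(samples, t=1))
```

A longitudinal vector grows with the number of sampling times rather than with the number of indexes, which is why it can be built from only a few collected indexes.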

When the real-time monitoring and analysis method for large-scale parallel programs is adopted, data are acquired by combining the hardware maintenance interface with the hardware performance counters, which avoids the overhead of system calls and improves the efficiency of data acquisition. In addition, because the processes of a parallel application program behave similarly, cosine similarity analysis reduces the number of indexes that need to be collected, so the analysis method further reduces overhead and interference.

To facilitate a better understanding of the invention, the terms used herein will be briefly explained as follows:

Heterogeneous many-core processor: a processor that adopts a master-slave heterogeneous structure and consists of a control core and an operation core.

The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims

1. A real-time monitoring and analyzing method for a large-scale parallel program is characterized by comprising the following steps based on a heterogeneous many-core processor:

S2: collecting the selected performance index data, wherein the collecting method comprises the following steps:

S31: before analyzing the parallel application program, selecting a same parallel application program with normal performance indexes, forming index data acquired by N adjacent times of the same process into a longitudinal vector, calculating cosine similarity of the same index among different processes, and storing the cosine similarity calculated by different indexes as a threshold value;

S32: forming index data acquired by the same process for n times in the S2 into a longitudinal vector, calculating the cosine similarity of the same index between different processes, wherein according to a threshold value set in the S31, if the cosine similarity of two different processes exceeds the threshold value, the two processes are a problem process pair, if one process and processes more than 2/3 of parallel application programs form the problem process pair, the process is judged to be a problem process, otherwise, the process is judged to be an undetermined process;

S5: outputting the normal process, the suspicious process and the abnormal process obtained in the S32 and the S4 to a display screen;

in step A4, the hardware maintenance interface reads data of hardware performance counters on a plurality of physical compute nodes simultaneously.

2. The real-time monitoring and analyzing method for massively parallel programs according to claim 1, wherein: in step S4, the index data of the suspicious process is retained and fed back to the display screen.