CN115729783A

CN115729783A - Failure risk monitoring method, device, storage medium and program product

Info

Publication number: CN115729783A
Application number: CN202211520954.6A
Authority: CN
Inventors: 袁野; 胡大奎; 张翼; 高进; 李晨阳
Original assignee: Peoples Insurance Company of China
Current assignee: Peoples Insurance Company of China
Priority date: 2022-11-30
Filing date: 2022-11-30
Publication date: 2023-03-03
Anticipated expiration: 2042-11-30
Also published as: CN115729783B

Abstract

Embodiments of the present application provide a failure risk monitoring method, device, storage medium, and program product. The method includes: obtaining the current operating state of a system, the current operating state including the current values of multiple operating indicators; matching the current operating state against each of multiple historical operating states to obtain a matching degree for each historical operating state, where different historical operating states correspond to different sampling times and each historical operating state includes the historical values of the operating indicators at its sampling time; and obtaining the fault event recorded under the historical operating state with the largest matching degree, and generating a fault risk prompt based on that fault event. The method provided in this embodiment can improve the comprehensiveness and accuracy of monitoring.

Description

Failure risk monitoring method, device, storage medium and program product

Technical field

The embodiments of the present application relate to the technical field of software testing, and in particular to a failure risk monitoring method, device, storage medium, and program product.

Background

To keep software running normally, production operation and maintenance personnel usually monitor the servers, databases, networks, and clients on which an application system is deployed, and use the monitored indicator data to trace and locate problems.

In the related art, a warning is usually issued when one or more indicators exceed a threshold.

However, in the course of implementing the present application, the inventors found at least the following problems in the prior art: existing approaches can only issue warnings for the indicators that are monitored, and a fault warning is often not issued even though a risk of failure exists, so the comprehensiveness and accuracy of monitoring are low.

Summary of the invention

Embodiments of the present application provide a failure risk monitoring method, device, storage medium, and program product, so as to improve the comprehensiveness and accuracy of monitoring.

In a first aspect, an embodiment of the present application provides a failure risk monitoring method, including:

obtaining the current operating state of a system, where the current operating state includes the current values of multiple operating indicators;

matching the current operating state against each of multiple historical operating states to obtain a matching degree for each historical operating state, where different historical operating states correspond to different sampling times, and each historical operating state includes the historical values of the operating indicators at its sampling time; and

obtaining the fault event recorded under the historical operating state that corresponds to the maximum of the matching degrees, and generating a fault risk prompt according to the fault event.

In a possible design, matching the current operating state against each of the multiple historical operating states includes:

if none of the current values falls outside its corresponding preset threshold range, matching the current operating state against each of the multiple historical operating states.

In a possible design, after obtaining the current operating state of the system, the method further includes:

if at least one of the current values falls outside its corresponding preset threshold range, generating a corresponding fault warning; and

filtering the multiple historical operating states to obtain multiple operating states to be matched, where the fault events associated with the operating states to be matched include the fault event corresponding to the fault warning.

In this design, matching the current operating state against each of the multiple historical operating states to obtain a matching degree for each historical operating state includes:

matching the current operating state against each of the operating states to be matched to obtain a matching degree for each operating state to be matched.

In a possible design, matching the current operating state against each of the multiple historical operating states to obtain a matching degree for each historical operating state includes:

obtaining a first vector of the current operating state and a second vector of each of the multiple historical operating states; and

for the second vector of each historical operating state, calculating the Mahalanobis distance between the first vector and the second vector, and determining the matching degree of that historical operating state according to the Mahalanobis distance.

In a possible design, before matching the current operating state against each of the multiple historical operating states, the method further includes:

dividing a preset collection period into multiple time intervals; and

for each time interval, collecting historical operating states at the sampling frequency corresponding to that time interval, and storing each collected historical operating state in association with the fault event that occurred at the corresponding collection time.

In a possible design, before collecting historical operating states at the sampling frequency corresponding to the time interval, the method further includes:

determining the required number of fault events according to a confidence requirement and a sampling error requirement;

collecting multiple fault events according to the required number;

assigning the multiple fault events to multiple different time intervals;

for each time interval, obtaining the number of fault events that occurred within that time interval; and

determining the sampling frequency corresponding to each time interval according to the number of fault events corresponding to each time interval.

In a possible design, determining the sampling frequency corresponding to each time interval according to the number of fault events corresponding to each time interval includes:

determining the number of acquisitions corresponding to the preset collection period; and

for each time interval, calculating the ratio of the number of fault events in that time interval to the total number of the multiple fault events, and determining the sampling frequency corresponding to that time interval according to the ratio and the number of acquisitions.
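The ratio-based design above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `samples_per_interval`, the interval labels, and the choice to multiply the ratio by the total number of acquisitions and round the result are all assumptions made for the example.

```python
def samples_per_interval(fault_counts, total_acquisitions):
    """Allocate acquisitions to time intervals in proportion to their fault counts.

    fault_counts: mapping of interval label -> number of fault events observed.
    total_acquisitions: number of acquisitions planned for one collection period.
    """
    total_faults = sum(fault_counts.values())
    if total_faults == 0:
        raise ValueError("at least one fault event is required")
    return {
        interval: round(total_acquisitions * count / total_faults)
        for interval, count in fault_counts.items()
    }


# Hypothetical fault counts for four 6-hour intervals of a one-day period.
fault_counts = {"00-06": 2, "06-12": 8, "12-18": 6, "18-24": 4}
allocation = samples_per_interval(fault_counts, total_acquisitions=100)
print(allocation)
```

Intervals that historically saw more fault events are sampled more densely. Because of rounding, the allocated counts may not always add up exactly to the planned total; a real implementation would need a rule for distributing the remainder.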

In a possible design, before collecting, for each time interval, historical operating states at the sampling frequency corresponding to that time interval, the method further includes:

determining a target duration according to the software iteration frequency.

In this design, collecting, for each time interval, historical operating states at the sampling frequency corresponding to that time interval includes:

for each time interval in the multiple preset collection periods covered by the target duration, collecting historical operating states at the sampling frequency corresponding to that time interval, and adding the collected historical operating states to a sample set.

Matching the current operating state against each of the multiple historical operating states then includes:

matching the current operating state against each of the multiple historical operating states in the sample set.

In a second aspect, an embodiment of the present application provides a failure risk monitoring device, including:

an acquisition module, configured to obtain the current operating state of a system, where the current operating state includes the current values of multiple operating indicators;

a matching module, configured to match the current operating state against each of multiple historical operating states to obtain a matching degree for each historical operating state, where different historical operating states correspond to different sampling times, and each historical operating state includes the historical values of the operating indicators at its sampling time; and

a generating module, configured to obtain the fault event recorded under the historical operating state that corresponds to the maximum of the matching degrees, and to generate a fault risk prompt according to the fault event.

In a third aspect, an embodiment of the present application provides a failure risk monitoring device, including at least one processor and a memory;

the memory stores computer-executable instructions; and

the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the method described in the first aspect and the various possible designs of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in the first aspect and the various possible designs of the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer program product, including a computer program which, when executed by a processor, implements the method described in the first aspect and the various possible designs of the first aspect.

In the failure risk monitoring method, device, storage medium, and program product provided in this embodiment, the method includes: obtaining the current operating state of a system, the current operating state including the current values of multiple operating indicators; matching the current operating state against each of multiple historical operating states to obtain a matching degree for each historical operating state, where different historical operating states correspond to different sampling times and each historical operating state includes the historical values of the operating indicators at its sampling time; and obtaining the fault event recorded under the historical operating state that corresponds to the maximum of the matching degrees, and generating a fault risk prompt according to the fault event. By obtaining multiple current operating indicators, matching them against historical indicators collected in advance, and generating a fault risk prompt from the fault event that occurred under the best-matching historical operating state, the method can improve the comprehensiveness and accuracy of monitoring.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an application scenario of the failure risk monitoring method provided by an embodiment of the present application;

FIG. 2 is a first schematic flowchart of the failure risk monitoring method provided by an embodiment of the present application;

FIG. 3 is a second schematic flowchart of the failure risk monitoring method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of sampling for an application server cluster provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of the failure risk monitoring device provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of the hardware structure of the failure risk monitoring device provided by an embodiment of the present application.

Detailed description

To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present application.

To keep software running normally, production operation and maintenance personnel usually monitor the servers, databases, networks, and clients on which an application system is deployed, set performance monitoring indicators and indicator thresholds to provide early warning of failure risks, and use the monitored indicator data to trace and locate problems.

Such performance risk early-warning systems usually issue a warning when one or more indicators exceed a threshold, but they have two shortcomings in practice. First, an existing early-warning system can only warn about the indicators it monitors, yet some performance indicators are too difficult to monitor comprehensively; a traditional threshold-based system cannot cover the potential risks arising from these monitoring gaps, so monitoring is incomplete. Second, when every monitored indicator is below its threshold, a threshold-based system considers the application to be running normally and issues no warning, although a failure may still occur, so monitoring accuracy is low.

To solve the above technical problems, the inventors found that the operating indicator values and failure data of the application system over a past period can be collected as samples. The matching degree between the current operating indicator values and each sample is calculated, and the sample with the largest matching degree is identified as the one closest to the current operating state of the application system. If that sample contains a fault event, the system can be judged to be at risk of failure, and the solution and details of the sample's fault event can be provided to the relevant personnel for reference. On this basis, an embodiment of the present application provides a failure risk monitoring method that can improve the comprehensiveness and accuracy of monitoring.

FIG. 1 is a schematic diagram of an application scenario of the failure risk monitoring method provided by an embodiment of the present application. As shown in FIG. 1, a server 101 is communicatively connected to a monitoring device 102.

In a specific implementation, the server 101 collects its current operating state and sends it to the monitoring device 102. The monitoring device 102 obtains the current operating state, which includes the current values of multiple operating indicators; matches the current operating state against each of multiple historical operating states to obtain a matching degree for each historical operating state, where different historical operating states correspond to different sampling times and each historical operating state includes the historical values of the operating indicators at its sampling time; and obtains the fault event recorded under the historical operating state that corresponds to the maximum of the matching degrees, and generates a fault risk prompt according to the fault event. By obtaining multiple current operating indicators, matching them against historical indicators collected in advance, and generating a fault risk prompt from the fault event that occurred under the best-matching historical operating state, the method can improve the comprehensiveness and accuracy of monitoring.

The monitoring device 102 may be a terminal device or a server.

The server 101 may be a single type of server or multiple types of servers. When there are multiple types, whether to monitor them separately can be decided according to how far their operating indicators coincide or differ.

For example, the server 101 may be an application server, whose current operating state may include multiple operating indicators, such as: service resources: CPU usage, disk space usage, disk inode usage, memory usage, and number of zombie processes; network: bandwidth utilization and TCP connection status; application programming interface (API): API correctness rate and API response time; business volume: number of users currently online; and so on. The server 101 may also be a database server, whose current operating state may include multiple operating indicators, such as: service resources: CPU usage, disk space usage, disk inode usage, memory usage, and number of zombie processes; network: bandwidth utilization and TCP connection status; database: time consumed by slow SQL statements, tablespace usage, ratio of active connections to the maximum number of connections, and number of deadlock errors (ORA-60) in the alert log; and so on.

If the server 101 comprises both an application server and a database server, the two require different indicators to be collected, so they can be treated separately during sampling and risk assessment, and each fault event can be clearly attributed to one of them. Note that, to make problems easier to capture, fault events can be classified by where the error is reported, that is, by the symptom of the problem rather than its root cause: problems found on the application server side and problems found on the database server side.

In addition, fault events arise in two ways. In the first, an operating indicator of the system exceeds its set threshold, that is, the indicator behaves abnormally, and the early-warning system automatically issues a fault warning; each such event corresponds to a specific abnormal indicator, so the event can be attributed directly to the application server side or the database server side according to which side the indicator belongs to. In the second, all indicators behave normally and the fault event is reported by operation and maintenance personnel or by users; such events must be classified by manual judgment.

It should be noted that the scenario shown in FIG. 1 is only an example. The failure risk monitoring method and scenario described in the embodiments of the present application are intended to explain the technical solutions more clearly and do not limit them. A person of ordinary skill in the art will recognize that, as systems evolve and new business scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

The technical solutions of the present application are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments.

FIG. 2 is a first schematic flowchart of the failure risk monitoring method provided by an embodiment of the present application. As shown in FIG. 2, the method includes:

201. Obtain the current operating state of the system; the current operating state includes the current values of multiple operating indicators.

This embodiment may be executed by a terminal device or a server, such as the monitoring device 102 shown in FIG. 1.

The system in this embodiment may be a system installed and running on the server 101 shown in FIG. 1.

202. Match the current operating state against each of multiple historical operating states to obtain a matching degree for each historical operating state; different historical operating states correspond to different sampling times; each historical operating state includes the historical values of the operating indicators at its sampling time.

Specifically, the current operating indicators of the application system can be traversed. When an operating indicator exceeds its threshold, a warning is triggered automatically, the warning event is saved to an event list, and the front-end page displays the indicator and the indicator value that triggered the warning. Suppose m indicators exceeding their thresholds are detected; subsequent processing then depends on the value of m.
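The threshold traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions: the indicator names, the `(low, high)` threshold ranges, and the structure of the event list are hypothetical and are not specified in the source.

```python
def find_exceeded(current, thresholds):
    """Return the indicators whose current value lies outside its (low, high) range."""
    exceeded = {}
    for name, value in current.items():
        low, high = thresholds[name]
        if not (low <= value <= high):
            exceeded[name] = value
    return exceeded


# Hypothetical current values and preset threshold ranges.
current = {"cpu_usage": 0.93, "disk_usage": 0.41, "zombie_processes": 0}
thresholds = {
    "cpu_usage": (0.0, 0.85),
    "disk_usage": (0.0, 0.90),
    "zombie_processes": (0, 5),
}

event_list = []  # stands in for the stored list of warning events
exceeded = find_exceeded(current, thresholds)
for name, value in exceeded.items():
    # Each over-threshold indicator produces one warning event.
    event_list.append({"indicator": name, "value": value})

m = len(exceeded)  # number of indicators that exceeded their thresholds
print(m, event_list)
```

The value of `m` then selects the matching strategy: `m == 0` leads to matching against the full sample set, while `m > 0` restricts matching to samples that carry a fault event.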

In this embodiment, the current operating state can be matched against the historical operating states in several ways. In some embodiments, matching the current operating state against each of the multiple historical operating states to obtain a matching degree for each historical operating state may include: obtaining a first vector of the current operating state and a second vector of each of the multiple historical operating states; and, for the second vector of each historical operating state, calculating the Mahalanobis distance between the first vector and the second vector, and determining the matching degree of that historical operating state according to the Mahalanobis distance.

Specifically, the Mahalanobis distance can be calculated as shown in formula (1):

$$D_M(x, y) = \sqrt{(x - y)^{T}\,\Sigma^{-1}\,(x - y)} \tag{1}$$

where x and y are the vector of the current operating indicator values of the system and the vector of operating indicator values of a given sample, respectively, and $D_M(x, y)$ is the Mahalanobis distance between them. $\Sigma$ is the covariance matrix of the sample set S, and $\Sigma^{-1}$ is the inverse of the covariance matrix.

The covariance matrix can be calculated with reference to formula (2):

$$\Sigma = \begin{bmatrix} \mathrm{cov}(c_1, c_1) & \cdots & \mathrm{cov}(c_1, c_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(c_n, c_1) & \cdots & \mathrm{cov}(c_n, c_n) \end{bmatrix} \tag{2}$$

where m is the sample size, n is the number of operating indicators, $c_i$ denotes the i-th indicator, $\mathrm{cov}(c_i, c_j) = E([c_i - E(c_i)][c_j - E(c_j)])$, $E(c_i)$ denotes the mean of $c_i$ over the m samples, and $\mathrm{cov}(c_i, c_j)$ denotes the covariance between the i-th indicator and the j-th indicator.
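The two formulas can be sketched in plain Python as follows. This is an illustrative sketch rather than the applicant's code: the function names and sample values are hypothetical, the covariance divides by m (matching the expectation E in the definition above, not the m - 1 sample estimator), and the linear system is solved by Gaussian elimination instead of forming the inverse matrix explicitly.

```python
import math


def covariance_matrix(samples):
    """n x n matrix of cov(c_i, c_j) over m samples (one row per sample)."""
    m, n = len(samples), len(samples[0])
    means = [sum(row[i] for row in samples) / m for i in range(n)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / m
         for j in range(n)]
        for i in range(n)
    ]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def mahalanobis(x, y, cov):
    """Formula (1): sqrt((x - y)^T cov^{-1} (x - y))."""
    d = [xi - yi for xi, yi in zip(x, y)]
    z = solve(cov, d)  # z = cov^{-1} d
    return math.sqrt(max(0.0, sum(di * zi for di, zi in zip(d, z))))


# Hypothetical sample set S: four samples of two indicators each.
history = [[0.2, 0.3], [0.4, 0.5], [0.6, 0.4], [0.8, 0.8]]
cov = covariance_matrix(history)
dist = mahalanobis([0.5, 0.5], history[0], cov)
print(dist)
```

With an identity covariance matrix the result reduces to the Euclidean distance, which gives a quick sanity check. A singular covariance matrix (for example, when two indicators are perfectly correlated) has no inverse, so the sketch raises an error in that case.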

In some embodiments, when no threshold warning has been triggered, all operating indicators can be matched in full so that monitoring is comprehensive. Specifically, matching the current operating state against each of the multiple historical operating states may include: if none of the current values falls outside its corresponding preset threshold range, that is, m = 0, matching the current operating state against each of the multiple historical operating states. For example, the Mahalanobis distance between the current operating indicator values of the application system and every sample in the sample set S can be calculated, and the ID of the sample with the smallest distance to the current operating indicators (the largest matching degree) is recorded.
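The minimum-distance search over the sample set can be sketched as follows. The names `best_match` and `euclidean` and the sample IDs are hypothetical. To keep the example short, a plain Euclidean distance is passed in as a stand-in; in the method described here the Mahalanobis distance would be supplied instead.

```python
import math


def best_match(current, samples, distance):
    """Return (sample_id, distance) of the stored sample closest to `current`.

    samples: mapping of sample ID -> indicator value vector.
    distance: callable taking two vectors and returning a non-negative number.
    """
    best_id, best_dist = None, math.inf
    for sample_id, vector in samples.items():
        dist = distance(current, vector)
        if dist < best_dist:  # smaller distance means larger matching degree
            best_id, best_dist = sample_id, dist
    return best_id, best_dist


def euclidean(x, y):
    # Stand-in metric for the demo only.
    return math.dist(x, y)


# Hypothetical sample set S and current operating state.
samples = {"S001": [0.2, 0.3], "S002": [0.9, 0.8], "S003": [0.5, 0.5]}
current = [0.85, 0.75]
best_id, best_dist = best_match(current, samples, euclidean)
print(best_id, best_dist)
```

Passing the distance as a parameter keeps the search independent of the metric, so the same loop serves both the full-match case and the filtered case.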

In some embodiments, when a threshold warning has been triggered, matching may be performed only with respect to the operating indicators whose current values fall outside their threshold ranges, in order to save computing resources. Specifically, after obtaining the current operating state of the system, the method may further include: if at least one of the current values falls outside its corresponding preset threshold range, generating a corresponding fault warning; and filtering the multiple historical operating states to obtain multiple operating states to be matched, where the fault events associated with the operating states to be matched include the fault event corresponding to the fault warning. Matching the current operating state against each of the multiple historical operating states to obtain a matching degree for each historical operating state may then include: matching the current operating state against each of the operating states to be matched to obtain a matching degree for each operating state to be matched.

For example, when a threshold warning has been triggered, an event has already occurred. In this case it is only necessary to calculate the Mahalanobis distance between the current operating indicator values of the system and every sample in the sample set S whose "event flag" is "Y", and to record the ID of the sample with the smallest distance to the current operating indicators.
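The candidate filtering for the two cases can be sketched as follows. The `Sample` record and its field names are assumptions made for illustration; the source only states that each sample carries an "event flag" of "Y" or "N".

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    values: list
    event_flag: str  # "Y" if a fault event was recorded at sampling time, else "N"


def select_candidates(samples, m):
    """Choose which historical samples to match against.

    m: number of indicators currently outside their threshold ranges.
    """
    if m == 0:
        # No warning triggered: match against the full sample set.
        return list(samples)
    # Warning triggered: only samples that carry a fault event are relevant.
    return [s for s in samples if s.event_flag == "Y"]


# Hypothetical sample set S.
sample_set = [
    Sample("S001", [0.2, 0.3], "N"),
    Sample("S002", [0.9, 0.8], "Y"),
    Sample("S003", [0.95, 0.7], "Y"),
]
print([s.sample_id for s in select_candidates(sample_set, m=2)])
```

Filtering before computing distances reduces the number of Mahalanobis distance evaluations, which is the computing-resource saving mentioned above.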

203、获取多个所述匹配度中最大值对应的历史运行状态下的故障事件，并根据所述故障事件生成故障风险提示。203. Obtain a plurality of fault events in the historical operating state corresponding to the maximum value among the matching degrees, and generate a fault risk prompt according to the fault events.

For example, after the ID of the minimum-distance sample, which corresponds to the maximum matching degree, is obtained, the sample's "event flag" is used to determine whether the application system currently faces a failure risk. An "event flag" of "Y" indicates a risk, in which case the front-end page displays a risk warning together with the details of the matched sample and the corresponding event resolution and analysis details. An "event flag" of "N" indicates no risk, in which case the front-end page displays no risk prompt.
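The decision described above could be sketched as follows. The function name `build_risk_prompt`, the dictionary keys, and the shape of the returned prompt are illustrative assumptions; the embodiments do not specify how the front-end page receives its data.

```python
def build_risk_prompt(matched_sample, event_list):
    """Return the data for a risk prompt, or None when no risk is indicated.

    matched_sample: the minimum-distance sample, with "id" and "event_flag"
    event_list:     event records, each linked to a sample by "sample_id"
    """
    if matched_sample["event_flag"] != "Y":
        return None  # flag "N": the front-end page shows no prompt
    events = [e for e in event_list if e["sample_id"] == matched_sample["id"]]
    return {
        "level": "risk_warning",
        "sample": matched_sample,  # details of the matched sample
        "events": events,          # resolution and analysis details
    }
```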

The failure risk monitoring method provided in this embodiment acquires multiple current operating indicators, matches them against pre-collected historical indicators, and generates a fault risk prompt based on the fault event that occurred in the historical operating state with the highest matching degree, thereby improving the comprehensiveness and accuracy of monitoring.

FIG. 3 is a second schematic flowchart of the failure risk monitoring method provided by an embodiment of the present application. As shown in FIG. 3, on the basis of the above embodiments, for example the embodiment shown in FIG. 2, this embodiment illustrates how the historical operating states, that is, the risk assessment sample set, are generated and maintained. The method includes:

301. Divide a preset collection period into multiple time intervals.

302. For each time interval, collect historical operating states at the sampling frequency corresponding to that time interval, and store each collected historical operating state in association with the fault events occurring at the corresponding collection time.

Specifically, the operating indicator values of an application system change continuously over time. Experience shows that faults are more likely during the daily business peak and while batch tasks are running than at other times. Therefore, to make the collected samples more representative, the production events that recently occurred in the system should first be pre-surveyed, and a reasonable sampling frequency should be set according to the time distribution of those events.

In some embodiments, to save computing resources, different time intervals use different sampling frequencies. The sampling frequencies may be determined as follows: determine the required number of fault events according to a confidence requirement and a sampling error requirement; collect multiple fault events according to the required number; assign the fault events to multiple different time intervals; for each time interval, obtain the number of fault events that occurred in it; and determine the sampling frequency of each time interval according to the numbers of fault events respectively corresponding to the time intervals.

In some embodiments, to set the sampling frequency more reasonably, it may be determined based on parameters such as the fluctuation period of the software functions. Specifically, determining the sampling frequencies respectively corresponding to the time intervals according to the numbers of fault events respectively corresponding to the time intervals may include: determining the number of collections corresponding to the preset collection period; and, for each time interval, calculating the ratio of the number of fault events corresponding to that interval to the total number of fault events, and determining the sampling frequency of the interval according to the ratio and the number of collections.

In some embodiments, the length of time covered by the sample set may be determined with reference to the software iteration frequency, to make it more reasonable. Specifically, a target duration may be determined according to the software iteration frequency. Collecting historical operating states for each time interval at the corresponding sampling frequency may then include: for each time interval in the multiple preset collection periods corresponding to the target duration, collecting historical operating states at the sampling frequency corresponding to that interval and adding them to a sample set. Matching the current operating state with the plurality of historical operating states respectively may include: matching the current operating state with the historical operating states in the sample set respectively.

For example, first, the number of pre-survey samples can be estimated with the population estimation formula for categorical variables, shown in formula (3):

$$n = \frac{z^{2}\,p(1-p)}{\delta^{2}} \tag{3}$$

where n is the sample size, z is the z-value looked up in a z-table for the chosen confidence level, p is the expected proportion in the target population, and δ is the sampling error margin.

For example, with a confidence level of 95% (z = 1.96), a sampling error margin of 4%, and p(1-p) at its maximum of 0.25, the resulting sample size n is 600.25. That is, at a 95% confidence level and a 4% sampling error margin, data on the 601 most recent production events of the software must be collected for investigation and analysis.
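The arithmetic in this example can be checked with a few lines of Python. The sketch assumes that formula (3) is the standard estimate n = z²·p(1−p)/δ², which reproduces the value 600.25 stated above; the function name `required_sample_size` is hypothetical.

```python
import math


def required_sample_size(z, error_margin, p=0.5):
    """Return (n, n rounded up) for n = z^2 * p * (1 - p) / error_margin^2.

    p defaults to 0.5, the value at which p * (1 - p) peaks at 0.25.
    """
    n = z ** 2 * p * (1 - p) / error_margin ** 2
    return n, math.ceil(n)


n, events_needed = required_sample_size(z=1.96, error_margin=0.04)
print(n, events_needed)
```

Rounding up rather than to the nearest integer ensures that the collected sample never falls below the size the formula calls for.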

Second, the pre-survey samples can be collected according to the sample size estimated above and distributed over the 24 hours of a day in multiple time intervals, for example 24 intervals: [0:00, 1:00), [1:00, 2:00), [2:00, 3:00), ..., [23:00, 24:00). The number of events occurring in each interval is counted, and the proportion of the total sample accounted for by each interval is calculated: b1, b2, b3, b4, ..., b24.

Third, the sampling frequency of each time interval can be set according to the proportions b1, b2, b3, b4, ..., b24 obtained above. The higher the proportion of events, the more fault-prone the system is in that interval, and so the higher the sampling frequency should be. It is reasonable to assume that the performance indicators of the software are sampled once every n minutes on average, for example n = 5. (Within n minutes the performance indicators generally do not fluctuate greatly; even if they do, the change can be captured in the next n-minute cycle, and short-lived spikes in indicator values can be treated as noise.) With n = 5, a total of 288 samples are taken per day, so the sampling frequencies of the intervals are b1×288, b2×288, ..., b24×288, denoted f1, f2, f3, ..., f24.
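The allocation of the 288 daily samples (24 × 60 / 5) across the hourly intervals might be sketched as follows. The function name `sampling_frequencies` and its input format are assumptions of this sketch. The embodiment does not say how fractional frequencies are rounded, or whether intervals with no recorded events still receive a minimum number of samples, so the sketch leaves both open.

```python
from collections import Counter


def sampling_frequencies(event_hours, samples_per_day=288):
    """Return f1..f24: the daily sample budget split by event share.

    event_hours: hour of day (0..23) of each pre-survey fault event
    """
    counts = Counter(event_hours)
    total = len(event_hours)
    # b_i = share of events in interval i; f_i = b_i * samples_per_day
    return [counts.get(hour, 0) / total * samples_per_day
            for hour in range(24)]


# Hypothetical pre-survey: two events at 09:xx, one at 10:xx, one at 14:xx.
frequencies = sampling_frequencies([9, 9, 10, 14])
```

Because the proportions sum to 1, the per-interval frequencies always sum to the daily budget.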

Fourth, except for automatic warnings triggered by exceeding a threshold, fault events require their event information to be maintained manually or obtained by interfacing with an existing operation and maintenance management system. The risk estimation system can match each non-automatic-warning event to the sample collected closest to the time the event occurred, and add the event information to the event list associated with that sample.
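Matching an event to the sample collected nearest in time could be done with a binary search over the sorted sampling times, as in the sketch below. The function names and dictionary keys are hypothetical, and setting the flag to "Y" when an event is attached is an assumption inferred from the description of the event flag.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_sample_index(sample_times: list, event_time: datetime) -> int:
    """Index of the sampling time closest to event_time.

    sample_times must be sorted in ascending order and non-empty.
    """
    pos = bisect_left(sample_times, event_time)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sample_times)]
    return min(candidates, key=lambda i: abs(sample_times[i] - event_time))


def attach_event(samples, event):
    """Append event to the nearest sample's event list; return its ID."""
    index = nearest_sample_index([s["time"] for s in samples], event["time"])
    samples[index]["events"].append(event)
    samples[index]["event_flag"] = "Y"  # assumption: flag follows the list
    return samples[index]["id"]
```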

Finally, the time period covered by the samples can be determined by jointly considering the software iteration frequency and the number of faults, for example by collecting the samples from the last 15 days of system operation together with the samples of fault events from the last 180 days. The reasoning is that the system runs normally most of the time: with multiple servers each sampled every 5 minutes, the samples collected over 15 days are enough to represent the system's normal operating state (one server yields 4320 samples in 15 days). Fault events, by comparison, are occasional, so the sampling period must be extended to obtain enough of them. Each time a new day's samples have been collected, the risk assessment system automatically clears the expired samples, completing the automatic update of the sample set S.
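The two retention windows and the daily clean-up might be sketched as follows. The 15-day and 180-day values come from the example above; the function name `prune_samples` and the dictionary keys are assumptions of this sketch.

```python
from datetime import datetime, timedelta

NORMAL_RETENTION = timedelta(days=15)   # samples of normal operation
FAULT_RETENTION = timedelta(days=180)   # samples associated with fault events

# One server sampled every 5 minutes for 15 days: 15 * 24 * 60 / 5 samples.
SAMPLES_PER_SERVER_15_DAYS = 15 * 24 * 60 // 5


def prune_samples(samples, now):
    """Return the samples that are still inside their retention window."""
    kept = []
    for sample in samples:
        flagged = sample["event_flag"] == "Y"
        limit = FAULT_RETENTION if flagged else NORMAL_RETENTION
        if now - sample["time"] <= limit:
            kept.append(sample)
    return kept
```

Running `prune_samples` once after each day's collection would keep the sample set S within the two windows.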

303. Acquire the current operating state of the system; the current operating state includes the current values of multiple operating indicators.

304. Match the current operating state with a plurality of historical operating states respectively to obtain the matching degrees respectively corresponding to the historical operating states; different historical operating states correspond to different sampling times; each historical operating state includes the historical values of the multiple operating indicators at the corresponding sampling time.

305. Obtain the fault event in the historical operating state corresponding to the maximum of the plurality of matching degrees, and generate a fault risk prompt according to the fault event.

Steps 303 to 305 in this embodiment are similar to steps 201 to 203 in the above embodiment and are not repeated here.

The failure risk monitoring method provided in this embodiment uses different sampling frequencies for different time intervals, which makes the collected samples more representative. This reduces the amount of data collected, makes sample collection more targeted, and improves computational efficiency and saves computing resources in subsequent calculations.

To illustrate the composition of the sample set S more clearly, the following describes, as an example, the process of collecting samples from an application server cluster.

FIG. 4 is a schematic diagram of sampling for an application server cluster provided by an embodiment of the present application. As shown in FIG. 4, an application is usually deployed on multiple server nodes at the same time, and each server node can be sampled separately. The sample information includes the sampling time, the values of the operating indicators, and the corresponding fault events; in the figure, the indicator list and the event list are associated through the sample ID. The "event flag" in the indicator list marks whether the sample has a corresponding event: "Y" if it does, "N" if it does not. The information stored in the event list can be extended as needed, for example by entering the details of event analysis and resolution manually or obtaining them by interfacing with an existing operation and maintenance system, to support decisions on investigating and resolving the identified failure risks. Database server clusters are sampled in a similar way.
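One possible in-memory representation of the indicator list and the event list, linked by sample ID, is sketched below. The class and field names are illustrative assumptions; FIG. 4 only requires that the two lists be associated through the sample ID and that the indicator list carry the event flag.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IndicatorRow:
    sample_id: str
    node: str              # server node the sample was taken from
    sampled_at: datetime
    values: dict           # operating indicator name -> value
    event_flag: str = "N"  # "Y" once an event is linked to this sample


@dataclass
class EventRow:
    sample_id: str         # link back to the indicator list
    description: str
    resolution: str = ""   # analysis and resolution details, extensible


class SampleSet:
    """Indicator list and event list associated through the sample ID."""

    def __init__(self):
        self.indicators = {}
        self.events = []

    def add_sample(self, row):
        self.indicators[row.sample_id] = row

    def add_event(self, event):
        self.events.append(event)
        self.indicators[event.sample_id].event_flag = "Y"

    def events_for(self, sample_id):
        return [e for e in self.events if e.sample_id == sample_id]
```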

FIG. 5 is a schematic structural diagram of a failure risk monitoring device provided by an embodiment of the present application. As shown in FIG. 5, the failure risk monitoring device 50 includes an acquisition module 501, a matching module 502, and a generation module 503.

The acquisition module 501 is configured to acquire the current operating state of the system; the current operating state includes the current values of multiple operating indicators.

The matching module 502 is configured to match the current operating state with a plurality of historical operating states respectively, to obtain the matching degrees respectively corresponding to the historical operating states; different historical operating states correspond to different sampling times; each historical operating state includes the historical values of the multiple operating indicators at the corresponding sampling time.

The generation module 503 is configured to obtain the fault event in the historical operating state corresponding to the maximum of the plurality of matching degrees, and to generate a fault risk prompt according to the fault event.

The failure risk monitoring device provided by this embodiment of the present application acquires multiple current operating indicators, matches them against pre-collected historical indicators, and generates a fault risk prompt based on the fault event that occurred in the historical operating state with the highest matching degree, thereby improving the comprehensiveness and accuracy of monitoring.

The failure risk monitoring device provided by this embodiment of the present application can be used to carry out the method embodiments described above. Its implementation principle and technical effects are similar and are not repeated here.

FIG. 6 is a schematic diagram of the hardware structure of a failure risk monitoring device provided by an embodiment of the present application. The device may be a terminal device or a server.

The device 60 may include one or more of the following components: a processing component 601, a memory 602, a power supply component 603, a multimedia component 604, an audio component 605, an input/output (I/O) interface 606, a sensor component 607, and a communication component 608.

The processing component 601 generally controls the overall operation of the device 60, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 601 may include one or more processors 609 that execute instructions to complete all or some of the steps of the above method. In addition, the processing component 601 may include one or more modules that facilitate interaction between the processing component 601 and other components. For example, the processing component 601 may include a multimedia module to facilitate interaction between the multimedia component 604 and the processing component 601.

The memory 602 is configured to store various types of data to support the operation of the device 60. Examples of such data include instructions for any application or method operating on the device 60, contact data, phonebook data, messages, pictures, videos, and the like. The memory 602 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The power supply component 603 provides power to the various components of the device 60. The power supply component 603 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 60.

The multimedia component 604 includes a screen that provides an output interface between the device 60 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 604 includes a front camera and/or a rear camera. When the device 60 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 605 is configured to output and/or input audio signals. For example, the audio component 605 includes a microphone (MIC) that is configured to receive external audio signals when the device 60 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 602 or sent via the communication component 608. In some embodiments, the audio component 605 also includes a speaker for outputting audio signals.

The I/O interface 606 provides an interface between the processing component 601 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.

The sensor component 607 includes one or more sensors that provide status assessments of various aspects of the device 60. For example, the sensor component 607 can detect the open/closed state of the device 60 and the relative positioning of components, such as the display and keypad of the device 60. The sensor component 607 can also detect a change in position of the device 60 or of one of its components, the presence or absence of user contact with the device 60, the orientation or acceleration/deceleration of the device 60, and temperature changes of the device 60. The sensor component 607 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 607 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 607 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 608 is configured to facilitate wired or wireless communication between the device 60 and other devices. The device 60 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 608 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 608 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the device 60 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 602 including instructions, which can be executed by the processor 609 of the device 60 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

The above computer-readable storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. The readable storage medium can be any available medium that a general-purpose or special-purpose computer can access.

An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. The readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC). The processor and the readable storage medium may also exist in the device as discrete components.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

An embodiment of the present application further provides a computer program product including a computer program that, when executed by a processor, implements the failure risk monitoring method performed by the failure risk monitoring device described above.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, and some or all of their technical features can be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A fault risk monitoring method, comprising:

acquiring a current operating state of a system, the current operating state comprising current values of a plurality of operating indicators;

matching the current operating state with a plurality of historical operating states respectively to obtain matching degrees respectively corresponding to the plurality of historical operating states, wherein different historical operating states correspond to different sampling times, and each historical operating state comprises historical values of the plurality of operating indicators at the corresponding sampling time; and

acquiring a fault event in the historical operating state corresponding to the maximum of the matching degrees, and generating a fault risk prompt according to the fault event.

2. The method of claim 1, wherein matching the current operating state with the plurality of historical operating states respectively comprises:

if none of the current values exceeds its corresponding preset threshold range, matching the current operating state with the plurality of historical operating states respectively.

3. The method of claim 1, further comprising, after acquiring the current operating state of the system:

if at least one of the current values exceeds its corresponding preset threshold range, generating a corresponding fault warning; and

screening the plurality of historical operating states to obtain a plurality of operating states to be matched, wherein the fault events corresponding to the operating states to be matched comprise the fault event corresponding to the fault warning;

wherein matching the current operating state with the plurality of historical operating states respectively to obtain the matching degrees respectively corresponding to the plurality of historical operating states comprises:

matching the current operating state with the plurality of operating states to be matched respectively to obtain matching degrees respectively corresponding to the operating states to be matched.

4. The method according to any one of claims 1 to 3, wherein matching the current operating state with the plurality of historical operating states respectively to obtain the matching degrees respectively corresponding to the plurality of historical operating states comprises:

acquiring a first vector of the current operating state and a plurality of second vectors of the historical operating states; and

for the second vector of each of the plurality of historical operating states, calculating the Mahalanobis distance between the first vector and the second vector, and determining the matching degree corresponding to the historical operating state according to the Mahalanobis distance.

5. The method according to any one of claims 1 to 3, further comprising, before matching the current operating state with the plurality of historical operating states respectively:

dividing a preset acquisition period into a plurality of time intervals; and

for each time interval, collecting historical operating states according to the sampling frequency corresponding to the time interval, and storing the collected historical operating states in association with the fault events occurring at the corresponding collection times.

6. The method according to claim 5, further comprising, before collecting the historical operating states according to the sampling frequency corresponding to the time interval:

determining a required number of fault events according to a confidence requirement and a sampling error requirement;

collecting a plurality of fault events according to the required number;

assigning the plurality of fault events to a plurality of different time intervals;

for each time interval, acquiring the number of fault events occurring in the time interval; and

determining the sampling frequencies respectively corresponding to the time intervals according to the numbers of fault events respectively corresponding to the time intervals.

7. The method according to claim 6, wherein determining the sampling frequencies respectively corresponding to the time intervals according to the numbers of fault events respectively corresponding to the time intervals comprises:

determining the number of acquisitions corresponding to the preset acquisition period; and

for each time interval, calculating the ratio of the number of fault events corresponding to the time interval to the total number of the plurality of fault events, and determining the sampling frequency corresponding to the time interval according to the ratio and the number of acquisitions.

8. The method according to claim 5, wherein before the collecting, for each time interval, of historical operating states according to the sampling frequency corresponding to the time interval, the method further comprises:

determining a target duration according to the software iteration frequency;

the collecting of the historical operating state according to the sampling frequency corresponding to each time interval comprises the following steps:

for each time interval in a plurality of preset acquisition periods corresponding to the target duration, collecting historical operating states according to the sampling frequency corresponding to the time interval, and adding the collected historical operating states to a sample set;

the matching of the current operating state with the plurality of historical operating states respectively comprises:

matching the current operating state with each of a plurality of historical operating states in the sample set.
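
The collection schedule implied by this claim can be sketched as below. The sketch assumes that the preset acquisition period is one day, that the target duration is expressed as a whole number of days (for example, the time between two software releases), and that samples are spaced evenly within each time interval. None of these specifics come from the claim, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

def sampling_times(start, target_days, interval_hours, frequencies):
    """Return the moments at which operating states would be collected.

    frequencies[i] is the number of samples to take in the i-th time
    interval of every (assumed daily) acquisition period.
    """
    times = []
    for day in range(target_days):  # one acquisition period per day
        day_start = start + timedelta(days=day)
        for idx, n in enumerate(frequencies):
            if n == 0:
                continue  # nothing is collected in this interval
            interval_start = day_start + timedelta(hours=idx * interval_hours)
            step = timedelta(hours=interval_hours) / n  # even spacing
            times.extend(interval_start + k * step for k in range(n))
    return times

# Two daily periods, two 12-hour intervals, sampled twice and once respectively.
schedule = sampling_times(datetime(2022, 11, 1), 2, 12, [2, 1])
print(len(schedule))  # 6
```

The operating state collected at each of these moments would then be added to the sample set against which the current operating state is matched.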

9. A fault risk monitoring device, comprising:

an acquisition module, configured to acquire a current operating state of a system, wherein the current operating state comprises current values of a plurality of operating indicators;

a matching module, configured to match the current operating state with each of a plurality of historical operating states to obtain matching degrees respectively corresponding to the plurality of historical operating states, wherein different historical operating states correspond to different sampling times, and each historical operating state comprises historical values of the plurality of operating indicators at the corresponding sampling time;

and a generating module, configured to acquire a fault event in the historical operating state corresponding to the maximum value among the matching degrees, and to generate a fault risk prompt according to the fault event.
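
The three modules recited above can be pictured as the stages of a small pipeline. The sketch below is illustrative only: the class and field names are hypothetical, the indicator reader and the matching function are injected so that any matching degree can be plugged in, and the inverse-Euclidean matcher is a stand-in rather than the matching claimed elsewhere in this document.

```python
from dataclasses import dataclass
from math import dist
from typing import Callable, List, Optional, Sequence

@dataclass
class HistoricalState:
    sampled_at: str                # sampling time of this state
    values: Sequence[float]        # historical indicator values
    fault_event: Optional[str]     # None when no fault was recorded

class FaultRiskMonitor:
    def __init__(self, read_indicators: Callable[[], Sequence[float]],
                 history: List[HistoricalState],
                 match: Callable[[Sequence[float], Sequence[float]], float]):
        self.read_indicators = read_indicators
        self.history = history
        self.match = match

    def acquire(self) -> List[float]:
        """Acquisition module: current values of the operating indicators."""
        return list(self.read_indicators())

    def match_all(self, current: Sequence[float]) -> List[float]:
        """Matching module: one matching degree per historical state."""
        return [self.match(current, h.values) for h in self.history]

    def generate(self, degrees: Sequence[float]) -> Optional[str]:
        """Generating module: prompt based on the best-matching state's fault."""
        if not degrees:
            return None
        best = max(range(len(degrees)), key=degrees.__getitem__)
        state = self.history[best]
        if state.fault_event is None:
            return None
        return (f"Risk of '{state.fault_event}': current state resembles the "
                f"state sampled at {state.sampled_at} "
                f"(matching degree {degrees[best]:.2f})")

    def run(self) -> Optional[str]:
        return self.generate(self.match_all(self.acquire()))

def inverse_euclidean(u, v):
    """Stand-in matching degree (an assumption): 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + dist(u, v))

history = [
    HistoricalState("2022-11-01 03:00", [20.0, 30.0], None),
    HistoricalState("2022-11-02 10:00", [88.0, 79.0], "database connection pool exhausted"),
]
monitor = FaultRiskMonitor(lambda: [90.0, 80.0], history, inverse_euclidean)
print(monitor.run())
```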

10. A fault risk monitoring device, comprising: at least one processor and memory;

the memory stores computer-executable instructions;

the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the fault risk monitoring method of any one of claims 1 to 8.

11. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the fault risk monitoring method of any one of claims 1 to 8.

12. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the fault risk monitoring method according to any one of claims 1 to 8.