CN117632666A - Alarm method, device and storage medium

Info

Publication number: CN117632666A
Application number: CN202410112376.5A
Authority: CN
Inventors: 洪元东
Original assignee: Hangzhou AliCloud Feitian Information Technology Co Ltd
Current assignee: Hangzhou AliCloud Feitian Information Technology Co Ltd
Priority date: 2024-01-25
Filing date: 2024-01-25
Publication date: 2024-03-01
Anticipated expiration: 2044-01-25
Also published as: CN117632666B

Abstract

Embodiments of this application provide an alarm method, device and storage medium. When an alarm triggering event occurs, a request path is determined for each abnormal IO request, and the abnormal IO requests are clustered based on their request paths to generate at least one abnormal IO request group, where abnormal IO requests whose request paths pass through the same physical node fall into the same group. Alarm information is then output per abnormal IO request group. From the perspective of a single faulty physical node, the request paths of all abnormal IO requests caused by that node pass through it, so those requests are clustered into the same group. Abnormal IO requests that share a cause are therefore reported in a single alarm message, which avoids repeated alarms for the same cause, reduces the number of alarm messages, allows alarms to be handled promptly, and improves alarm processing efficiency.

Description

Alarm method, device and storage medium

Technical Field

This application relates to the field of cloud storage technology, and in particular to an alarm method, device and storage medium.

Background

Cloud storage can be understood as a model of online storage over a network. As cloud storage technology has developed, a storage-compute separation architecture has been proposed on top of it. Cloud storage technology can be used to implement the storage layer of this architecture. The architecture also includes a computing layer. The computing layer and the storage layer are decoupled and connected over a network, and each can be implemented as an independent distributed system. Computing nodes in the computing layer access the storage layer through IO requests in order to read data from and write data to the storage nodes in the storage layer.

Currently, a full-link anomaly monitoring system is usually deployed in a storage-compute separation architecture to discover and automatically diagnose abnormal IO requests, and alarms are issued per abnormal IO request.

However, as the volume of IO requests keeps growing by orders of magnitude, the number of alarms grows with it. The massive amount of alarm information causes alarms to pile up, puts heavy pressure on alarm handling, and prevents alarms from being handled in a timely manner.

Summary of the Invention

Various aspects of this application provide an alarm method, device and storage medium to improve alarm processing efficiency.

An embodiment of this application provides an alarm method, including:

when an alarm triggering event occurs, determining the request path corresponding to each abnormal IO request, where a single request path represents the connectivity between the physical nodes that the corresponding abnormal IO request passes through;

clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, where abnormal IO requests whose request paths pass through the same physical node are located in the same abnormal IO request group;

outputting alarm information per abnormal IO request group.

Further, determining the request path corresponding to each abnormal IO request includes:

obtaining the anomaly monitoring information corresponding to each abnormal IO request;

parsing, from the anomaly monitoring information, the identification information of the physical nodes that the corresponding abnormal IO request passes through and the order in which it passes through them, so as to determine the request path corresponding to each abnormal IO request.

Further, obtaining the anomaly monitoring information corresponding to each abnormal IO request includes:

sending an anomaly monitoring information acquisition request to an anomaly monitoring system, the acquisition request carrying a target anomaly type, where the anomaly monitoring system has already diagnosed the anomaly type of each abnormal IO request;

receiving, from the anomaly monitoring system, the anomaly monitoring information corresponding to the abnormal IO requests diagnosed as the target anomaly type.

Further, the anomaly monitoring information takes the form of trace information. Trace information corresponds one-to-one with IO requests. Each piece of trace information contains multiple span items that have an order relationship. The span items correspond one-to-one with the physical nodes that the IO request passes through, and each span item contains the identification information of its physical node. The order relationship between the span items in the trace information represents the order in which the IO request passes through the physical nodes.
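The mapping from trace information to a request path can be illustrated with a minimal sketch. The `Span` class and its `node_id` and `seq` fields are hypothetical names chosen for illustration; the source only states that each span item carries the identifier of its physical node and that the span items are ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One span item of a trace; the field names are hypothetical."""
    node_id: str  # identifier of the physical node this span corresponds to
    seq: int      # position of the span within the trace (assumed ordering key)


def request_path(spans: list[Span]) -> list[str]:
    """Return node identifiers in the order the IO request passed through them."""
    ordered = sorted(spans, key=lambda s: s.seq)
    return [s.node_id for s in ordered]


# Spans may arrive out of order; the ordering key restores the transit order.
trace = [Span("storage-1", 1), Span("compute-A", 0), Span("storage-2", 2)]
path = request_path(trace)
print(path)
```

Representing a path as an ordered list of node identifiers keeps both pieces of information the method relies on: which nodes were passed, and which nodes are adjacent on the path.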

Further, clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group includes:

clustering the abnormal IO requests corresponding to request paths that can be connected through physical nodes into an abnormal IO request group; or,

querying for physical nodes that connect multiple request paths and using them as clustering nodes, then clustering the abnormal IO requests corresponding to the request paths connected by a single clustering node into an abnormal IO request group.
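The second alternative above, grouping by clustering node, might be sketched as follows. This is only an illustration under assumptions: the function name, the request IDs, and the choice to key each group by its clustering node are not taken from the source, and requests whose paths share no node with any other path are simply left out of the result here.

```python
from collections import defaultdict


def cluster_by_shared_node(paths: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each clustering node to the IDs of the requests whose paths it connects."""
    requests_by_node: dict[str, set[str]] = defaultdict(set)
    for request_id, path in paths.items():
        for node in path:
            requests_by_node[node].add(request_id)
    # A clustering node is a physical node that appears on more than one request path.
    return {node: reqs for node, reqs in requests_by_node.items() if len(reqs) > 1}


paths = {
    "req-1": ["A", "B"],
    "req-2": ["A", "C"],
    "req-3": ["D", "E"],
}
groups = cluster_by_shared_node(paths)
print(groups)
```

Unlike the first alternative, this per-node grouping does not merge groups transitively, so one request can appear under several clustering nodes.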

Further, clustering the abnormal IO requests corresponding to request paths that can be connected through physical nodes into an abnormal IO request group includes:

building an undirected graph with the physical nodes on the request paths as vertices and the connectivity between physical nodes on the request paths as edges;

searching the undirected graph for connected components;

clustering the abnormal IO requests corresponding to the request paths contained in a single connected component into an abnormal IO request group.

Further, searching the undirected graph for connected components includes:

when the traversal reaches a target vertex in the undirected graph, searching for the connected component that contains the target vertex;

deleting that connected component from the undirected graph;

choosing the next target vertex from the remaining vertices in the undirected graph, then searching for and deleting its connected component, and repeating until no vertices remain in the undirected graph;

outputting the connected components that were found.

Further, outputting alarm information per abnormal IO request group includes:

deduplicating the physical nodes passed through by the abnormal IO requests in a target abnormal IO request group, and determining the remaining physical nodes as target nodes;

outputting alarm information for the target abnormal IO request group based on the identification information of the target nodes, the identification information of the clusters to which the target nodes belong, and/or the anomaly types involved in the target abnormal IO request group;

where the target abnormal IO request group is any abnormal IO request group.
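The assembly of one alarm message for a group can be sketched as below. The dictionary layout, the `cluster_of` lookup table and the anomaly type strings are assumptions made for illustration; the source only names the three kinds of information an alarm may be based on.

```python
def build_alarm(group, cluster_of):
    """Build one alarm for a group given as a list of (path, anomaly_type) pairs."""
    # Deduplicate the nodes passed through by every request, keeping first-seen order.
    target_nodes = list(dict.fromkeys(node for path, _ in group for node in path))
    return {
        "nodes": target_nodes,
        "clusters": sorted({cluster_of[node] for node in target_nodes}),
        "anomaly_types": sorted({anomaly for _, anomaly in group}),
    }


# Two abnormal IO requests that share storage node S1 (hypothetical identifiers).
group = [
    (["A", "S1", "S2"], "io_unavailable"),
    (["B", "S1", "S3"], "io_impaired"),
]
cluster_of = {"A": "compute-1", "B": "compute-1",
              "S1": "storage-1", "S2": "storage-1", "S3": "storage-1"}
alarm = build_alarm(group, cluster_of)
print(alarm)
```

However many requests the group contains, the result is a single alarm record, which is the point of alarming per group rather than per request.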

Further, deduplicating the physical nodes passed through by the abnormal IO requests in the target abnormal IO request group and determining the remaining physical nodes as target nodes includes:

if an undirected graph was built from the request paths and connected components were searched in it to cluster out the target abnormal IO request group, using the physical nodes represented by the vertices of the connected component corresponding to the target abnormal IO request group as the target nodes.

Further, the physical nodes include at least computing nodes and storage nodes, and after the alarm information is output, the method further includes:

in response to an alarm processing instruction, analyzing the connectivity structures formed between computing nodes and storage nodes within the target abnormal IO request group corresponding to the target alarm information;

inferring, within the target abnormal IO request group, the abnormal node that caused the anomaly, based on the correspondence between connectivity structures and the abnormal nodes they point to.

Further, inferring, within the target abnormal IO request group, the abnormal node that caused the anomaly based on the correspondence between connectivity structures and the abnormal nodes they point to includes:

if a first type of connectivity structure exists within the target abnormal IO request group, inferring that the computing node in that structure is the abnormal node, where the first type of connectivity structure is one computing node connected to multiple storage nodes; or,

if a second type of connectivity structure exists within the target abnormal IO request group, inferring that the storage node in that structure is the abnormal node, where the second type of connectivity structure is one storage node connected to multiple computing nodes; or,

if a third type of connectivity structure exists within the target abnormal IO request group, inferring that the intermediate node in that structure is the abnormal node, where the third type of connectivity structure is multiple computing nodes and multiple storage nodes connected through an intermediate node.
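The three connectivity structures can be approximated with a simple neighbor count, as in the sketch below. Detecting a structure by counting direct neighbors with a threshold of two is a simplifying assumption, as are the role labels and the function name; the source does not specify how the structures are detected.

```python
from collections import defaultdict


def suspect_nodes(edges, role):
    """Return nodes that match one of the three connectivity structures."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    suspects = []
    for node in sorted(neighbors):
        computes = sum(1 for n in neighbors[node] if role[n] == "compute")
        storages = sum(1 for n in neighbors[node] if role[n] == "storage")
        if role[node] == "compute" and storages >= 2:
            suspects.append(node)  # type 1: one computing node, several storage nodes
        elif role[node] == "storage" and computes >= 2:
            suspects.append(node)  # type 2: one storage node, several computing nodes
        elif role[node] == "intermediate" and computes >= 2 and storages >= 2:
            suspects.append(node)  # type 3: an intermediate node joins both sides
    return suspects


role = {"C1": "compute", "C2": "compute", "N1": "intermediate",
        "S1": "storage", "S2": "storage"}
edges = [("C1", "N1"), ("C2", "N1"), ("N1", "S1"), ("N1", "S2")]
suspects = suspect_nodes(edges, role)
print(suspects)
```

The intuition behind all three rules is the same: the node that every failing path in the structure has in common is the most likely cause.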

Further, the target anomaly type includes an IO-unavailable type or an IO-impaired type. The physical nodes include computing nodes in a computing system, storage nodes in a storage system, and/or intermediate nodes used for network connection. An abnormal IO request is an IO request that was initiated by a computing node in the computing system toward a storage node in the storage system and in which an anomaly has occurred.

An embodiment of this application further provides an electronic device including a memory and a processor;

the memory is configured to store one or more computer instructions;

the processor is coupled to the memory and configured to execute the one or more computer instructions so as to perform the foregoing alarm method.

An embodiment of this application further provides a computer-readable storage medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the foregoing data processing method.

An embodiment of this application further provides a computer program product including a computer program or instructions which, when executed by a processor, cause the processor to implement the foregoing alarm method.

In the embodiments of this application, when an alarm triggering event occurs, a request path is determined for each abnormal IO request, and the abnormal IO requests are clustered based on the request paths to generate at least one abnormal IO request group, where abnormal IO requests whose request paths pass through the same physical node are located in the same group. Alarm information is then output per abnormal IO request group. From the perspective of a single faulty physical node, the request paths of the abnormal IO requests caused by that node all pass through it, so the corresponding abnormal IO requests are clustered into the same abnormal IO request group. This ensures that abnormal IO requests with the same cause are reported in a single alarm message, which avoids repeated alarms for the same cause, reduces the number of alarm messages, allows alarm information to be handled promptly, and improves alarm processing efficiency.

Brief Description of the Drawings

The drawings described here are provided for further understanding of this application and form a part of it. The illustrative embodiments of this application and their descriptions are used to explain this application and do not unduly limit it. In the drawings:

Figure 1 is a schematic flowchart of an alarm method provided by an exemplary embodiment of this application;

Figure 2 is a schematic diagram of an exemplary request path of an abnormal IO request provided by an exemplary embodiment of this application;

Figure 3 is a logical schematic diagram of an exemplary clustering scheme provided by an exemplary embodiment of this application;

Figure 4 is a schematic flowchart of another alarm method provided by an exemplary embodiment of this application;

Figure 5 is a schematic flowchart of yet another alarm method provided by an exemplary embodiment of this application;

Figure 6 is a schematic diagram of several exemplary connectivity structures provided by an exemplary embodiment of this application;

Figure 7 is a schematic structural diagram of an electronic device provided by another exemplary embodiment of this application.

Detailed Description

To make the purpose, technical solutions and advantages of this application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of this application.

As mentioned in the Background, current alarm schemes for IO requests usually output alarm information per IO request, so as the number of IO requests keeps growing, the volume of alarm information grows massively as well. The accumulating alarm information puts great pressure on alarm handling, alarms cannot be handled promptly, and alarm processing efficiency suffers.

To this end, the embodiments of this application propose a new alarm method. The basic idea is to cluster abnormal IO requests into abnormal IO request groups and to output alarm information per abnormal IO request group. This effectively reduces the number of alarm messages and ensures that alarm information is delivered promptly, thereby improving alarm processing efficiency.

The technical solutions provided by the embodiments of this application are described in detail below with reference to the drawings.

Figure 1 is a schematic flowchart of an alarm method provided by an exemplary embodiment of this application. The method can be performed by an alarm apparatus, which can be implemented as software, hardware, or a combination of the two, and which can be integrated into an electronic device. Referring to Figure 1, the method may include:

Step 100: when an alarm triggering event occurs, determine the request path corresponding to each abnormal IO request, where a single request path represents the connectivity between the physical nodes that the corresponding abnormal IO request passes through;

Step 101: cluster the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, where abnormal IO requests whose request paths pass through the same physical node are located in the same abnormal IO request group;

Step 102: output alarm information per abnormal IO request group.

The alarm method provided by this embodiment is applicable to any scenario that requires anomaly alarms for IO requests. The initiator and receiver of an IO request may differ between scenarios. For example, in the cloud storage scenario mentioned in the Background, the initiator of an IO request is usually a computing node in the computing system, and the receiver is usually a storage node in the storage system. Initiators and receivers in other scenarios are not enumerated here. It should be understood that this embodiment does not limit the application scenario: wherever IO requests pass through multiple physical nodes, the alarm method provided by this embodiment can be used to improve alarm processing efficiency.

Referring to Figure 1, in step 100 the alarm triggering event may be the receipt of an alarm triggering instruction, the arrival of a preset alarm period, or another type of event. This embodiment does not limit the type of the alarm triggering event and, accordingly, does not limit when the alarm method is started.

The abnormal IO requests in step 100 are IO requests that have already been diagnosed as abnormal. This embodiment does not limit how that diagnosis is performed. Any existing or future means of monitoring IO requests for anomalies may be used so that abnormal IO requests are detected promptly. For example, in cloud storage scenarios, anomaly monitoring systems based on full-link tracing are already deployed. Such a system generates trace information per IO request and automatically diagnoses the anomaly type of each IO request. This is only an example; other anomaly monitoring means may also be used to discover abnormal IO requests in advance.

Because only abnormal IO requests need alarms, step 100 can select the abnormal IO requests and use them as the objects processed in this embodiment.

On this basis, step 100 determines the request path corresponding to each abnormal IO request. A request path represents the connectivity between the physical nodes that the corresponding abnormal IO request passes through. A request path contains at least the physical node that initiated the abnormal IO request and the physical node that responded to it, and may also contain physical nodes that forwarded the request along the way.

Figure 2 is a schematic diagram of an exemplary request path of an abnormal IO request provided by an exemplary embodiment of this application. Referring to Figure 2, a storage-compute separation architecture can include a computing system containing computing nodes and a storage system containing storage nodes. Different cloud storage products may organize the nodes of their storage systems differently. For example, as shown in Figure 2, the storage system of a block storage product can deploy at least two types of storage nodes: the first type runs a block server (blockserver) that schedules storage resources, and the second type runs a chunk server (chunkserver) that stores data. In other words, the data associated with IO requests is stored on the second type of storage node, while the first type of storage node is mainly responsible for scheduling and managing the storage resources on the second type and does not store data itself. The architecture shown in Figure 2 is only an example; the storage system may also omit the first type of storage node, and no limitation is imposed here.

Continuing with Figure 2, the abnormal IO request shown there is issued by computing node A in the computing system, forwarded by storage node 1 in the storage system, and finally reaches storage node 3 in the storage system, which completes the response. Based on this, the request path corresponding to the abnormal IO request can be determined as: computing node A - storage node 1 - storage node 2. The request path represents not only the physical nodes that the abnormal IO request passes through but also the connectivity between those nodes. Referring to Figure 2, the request path shows that computing node A is connected to storage node 1, and storage node 1 is connected to storage node 2.

During their research, the inventors found that the request paths of different abnormal IO requests may pass through the same physical node, and that request paths can be connected through such shared physical nodes.

On this basis, step 101 of this embodiment clusters the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, where abnormal IO requests whose request paths pass through the same physical node are located in the same abnormal IO request group. In other words, step 101 clusters the abnormal IO requests into at least one abnormal IO request group, and the basis for the clustering is the connectivity between request paths.

As mentioned above, request paths can be connected through the physical nodes they share. Request paths connected through the same physical node can therefore be clustered together, which ensures that abnormal IO requests passing through the same physical node end up in the same abnormal IO request group. In this way, based on the connectivity between request paths, each request path determined in step 100 is assigned to one of the resulting abnormal IO request groups.

The inventors also found that a first request path may be connected to multiple other request paths, and that the physical nodes through which those other request paths connect to the first request path may differ. In a preferred implementation, step 101 clusters the abnormal IO requests corresponding to all request paths that can be connected through physical nodes into one abnormal IO request group. In this implementation, multiple other request paths that connect to the first request path through different physical nodes are clustered into the same abnormal IO request group. The first request path can be any request path determined in step 100.

For example, suppose the first request path is A-B. Through physical node A, the other request paths connected to it are A-C and A-D-F; through physical node B, the other request path connected to it is B-G. Therefore, although request paths B-G and A-C do not pass through any common physical node, both are connected to the first request path and can be clustered into the same abnormal IO request group.
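The transitive grouping in this example can be checked with a short sketch. It uses a union-find (disjoint-set) structure, which is a substitute technique chosen here for brevity rather than the graph search described in the source. The extra path X-Y is a hypothetical unrelated path added to show that it stays in a separate group.

```python
def find(parent, x):
    """Return the representative of x's set, shortening the chain along the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets that contain a and b."""
    parent[find(parent, a)] = find(parent, b)


# Request paths from the example, plus one hypothetical unrelated path X-Y.
paths = [["A", "B"], ["A", "C"], ["A", "D", "F"], ["B", "G"], ["X", "Y"]]
parent = {}
for path in paths:
    find(parent, path[0])  # register the first node so single-node paths are kept
    for a, b in zip(path, path[1:]):
        union(parent, a, b)

# B-G and A-C share no node, yet both connect to A-B, so they share a group.
same_group = find(parent, "G") == find(parent, "C")
print(same_group)
```

Two request paths belong to the same abnormal IO request group exactly when any node of one has the same representative as any node of the other.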

In this preferred implementation, a variety of clustering schemes can be used to cluster the abnormal IO requests. Figure 3 is a logical schematic diagram of an exemplary clustering scheme provided by an exemplary embodiment of this application. Referring to Figure 3, in this exemplary clustering scheme, an undirected graph is built with the physical nodes on the request paths as vertices and the connectivity between physical nodes on the request paths as edges; the undirected graph is searched for connected components; and the abnormal IO requests corresponding to the request paths contained in a single connected component are clustered into an abnormal IO request group.

In this exemplary clustering scheme, an undirected graph is used as the data structure that carries the request paths. A graph is a non-linear structure made up of vertices and edges, where edges represent the connections between vertices. A graph whose edges have no direction is called an undirected graph. In this scheme, the undirected graph represents the connectivity between the physical nodes within each request path, and also shows whether request paths can be connected to one another. Accordingly, the scheme searches the constructed undirected graph for connected components. A connected component is a maximal connected subgraph of an undirected graph; that is, a reachable path exists between any two vertices within a connected component. Figure 3 shows the undirected graph constructed in this exemplary clustering scheme. In this undirected graph, a reachable path does not exist between every pair of vertices; for example, there is no reachable path between vertex 6 and vertex 7. Figure 3 also shows the three connected components found in the undirected graph.

In this exemplary clustering scheme, the logic for searching for connected components may be as follows:

when a target vertex in the undirected graph is reached during traversal, search for the connected component in which the target vertex is located;

delete the connected component in which the target vertex is located from the undirected graph;

from the remaining vertices in the undirected graph, determine the next target vertex, and search for and delete the corresponding connected component, until no vertices remain in the undirected graph;

output the connected components that were found.

In practical applications, the IP address of a physical node may be used as its identification information, so that each vertex in the undirected graph is labeled with an IP address. On this basis, the IP addresses in the undirected graph can be traversed, and when a target IP address is reached, the connected component containing the target vertex that corresponds to the target IP address can be searched for. The search algorithm used to find the connected component of the target vertex is not limited here; for example, a depth-first search (DFS) algorithm may be used, and the search principle is not described in detail. The connected component found is then deleted from the undirected graph, and the next target IP address is determined from the remaining IP addresses. Repeating this cycle finds all connected components in the undirected graph.
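The search logic above can be illustrated with a minimal Python sketch. It is not the implementation of the application; it assumes that each request path is available as an ordered list of node IP addresses keyed by a request ID, and the function names `build_graph`, `connected_components` and `cluster_requests` are hypothetical. An iterative DFS is used here as one possible search algorithm, and "deleting" a component is modeled by removing its vertices from the set of remaining vertices.

```python
def build_graph(paths):
    """Build an undirected adjacency map: vertices are node IPs,
    edges join consecutive hops of each request path."""
    graph = {}
    for path in paths:
        for node in path:
            graph.setdefault(node, set())
        for a, b in zip(path, path[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Pick a remaining vertex, find its component by DFS, remove the
    component from the remaining vertices, and repeat until none remain."""
    remaining = set(graph)
    components = []
    while remaining:
        start = next(iter(remaining))
        stack, component = [start], set()
        while stack:
            vertex = stack.pop()
            if vertex in component:
                continue
            component.add(vertex)
            stack.extend(graph[vertex] - component)
        remaining -= component
        components.append(component)
    return components


def cluster_requests(request_paths):
    """Group request IDs by the connected component their path lies in.
    Assumes every path is non-empty."""
    graph = build_graph(request_paths.values())
    groups = []
    for component in connected_components(graph):
        ids = sorted(rid for rid, path in request_paths.items()
                     if path[0] in component)
        groups.append(ids)
    return sorted(groups)
```

Because every vertex of a path belongs to the same component, checking the first hop of a path is enough to assign the request to a group.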

In this way, the connectivity between the request paths can be accurately represented by the undirected graph, and the problem of clustering request paths is converted into the problem of searching for connected components in the undirected graph, which effectively improves the efficiency of clustering request paths. Since clustering request paths is in essence clustering abnormal IO requests, introducing the undirected graph effectively improves the efficiency of clustering abnormal IO requests and ensures that abnormal IO requests passing through the same physical node are clustered into the same abnormal IO request group.

It is worth noting that the clustering scheme of Figure 3 is only exemplary; other clustering schemes may also be used in this embodiment to cluster the abnormal IO requests in step 101. For example, the request paths may be traversed one by one. When a target request path is reached, the other request paths that share at least one physical node with the target request path are searched for, and the connectivity between the target request path and those other request paths is recorded; traversal then continues with the next request path. Repeating this cycle yields, for each request path, the request paths it is connected to. On this basis, one request path is selected as a starting path, and the starting path and the request paths connected to it are added to a path group. Each request path connected to the starting path is then taken in turn as a new starting path, and the request paths connected to it are added to the same path group; request paths already in the path group are not added again. Repeating this cycle clusters all mutually connected request paths into the same path group. This clustering scheme achieves the same clustering result as the scheme of Figure 3. No further examples of clustering schemes are given here, and this embodiment is not limited thereto.
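A minimal Python sketch of this path-based alternative follows, under the same assumption that each request path is an ordered list of node identifiers keyed by request ID. The function name `cluster_by_shared_nodes` is hypothetical, and the pairwise comparison is chosen for readability rather than efficiency.

```python
def cluster_by_shared_nodes(request_paths):
    """Cluster request paths that are directly or indirectly connected
    through shared physical nodes, without building a node graph."""
    ids = list(request_paths)
    node_sets = {rid: set(request_paths[rid]) for rid in ids}

    # Pass 1: record, for each path, the other paths sharing a node with it.
    linked = {rid: set() for rid in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if node_sets[a] & node_sets[b]:
                linked[a].add(b)
                linked[b].add(a)

    # Pass 2: start from one path and keep adding connected paths;
    # paths already in the group are not added again.
    groups, seen = [], set()
    for rid in ids:
        if rid in seen:
            continue
        group, frontier = {rid}, [rid]
        while frontier:
            current = frontier.pop()
            for other in linked[current]:
                if other not in group:
                    group.add(other)
                    frontier.append(other)
        seen |= group
        groups.append(sorted(group))
    return groups
```

In the test data used with this sketch, `io1` and `io3` share no node directly but are both connected to `io2`, so all three end up in one group, which matches the result a connected-component search would give.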

In the above preferred implementation, the abnormal IO requests corresponding to request paths that are directly or indirectly connected are clustered into the same abnormal IO request group. This not only ensures that abnormal IO requests passing through the same physical node are clustered into the same abnormal IO request group, but also ensures that each abnormal IO request appears in only one abnormal IO request group. Repeated analysis of the same abnormal IO request is therefore avoided, and the number of abnormal IO request groups produced by clustering is further reduced.

In addition to the preferred implementation described above, other implementations may also be used in step 101 of this embodiment to cluster the abnormal IO requests into abnormal IO request groups.

For example, in another optional implementation: physical nodes that connect multiple request paths are identified as cluster nodes, and the abnormal IO requests corresponding to the request paths connected by a single cluster node are clustered into one abnormal IO request group. This implementation also ensures that abnormal IO requests passing through the same physical node are clustered into the same abnormal IO request group, so that each piece of alarm information output later points to a single cause of abnormality, which makes alarm processing easier. However, the number of abnormal IO request groups produced will be larger than in the preferred implementation. In this optional implementation, the undirected graph described above may also be used to cluster the abnormal IO request groups. For example: when a target vertex in the undirected graph is reached during traversal, if multiple request paths pass through the target vertex, the abnormal IO requests corresponding to those request paths are clustered into one abnormal IO request group; the request paths passing through the target vertex are deleted from the undirected graph; and from the remaining vertices in the undirected graph, the next target vertex is determined and the request paths passing through it are searched for and deleted, until no vertices remain in the undirected graph, thereby obtaining at least one abnormal IO request group.
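The first variant of this optional implementation, which identifies cluster nodes, can be sketched as follows. This is only an illustration under the assumption that request paths are lists of node identifiers keyed by request ID; the function name `cluster_by_cluster_node` is hypothetical, and the sketch does not delete request paths after grouping as the graph-based variant does.

```python
from collections import defaultdict


def cluster_by_cluster_node(request_paths):
    """Return one group per physical node that is shared by more than one
    request path (a 'cluster node'), keyed by that node."""
    requests_by_node = defaultdict(set)
    for rid, path in request_paths.items():
        for node in path:
            requests_by_node[node].add(rid)
    return {node: sorted(rids)
            for node, rids in requests_by_node.items()
            if len(rids) > 1}
```

With three chained paths `A-B`, `B-C` and `C-D`, this sketch yields two groups (one for node `B`, one for node `C`), whereas a connected-component search would yield a single group. This illustrates why the number of groups here can exceed that of the preferred implementation.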

It can be understood that, whichever implementation is used, at least one abnormal IO request group is produced by clustering in step 101 of this embodiment, and, from the perspective of a single physical node, the abnormal IO requests passing through that physical node are clustered into the same abnormal IO request group. In the course of research, the inventor found that when a physical node becomes abnormal, the IO requests that need to pass through that node are very likely to become abnormal as well. Since step 101 of this embodiment clusters the abnormal IO requests passing through such a node into the same abnormal IO request group, this can in essence be understood as clustering the abnormal IO requests caused by the same cause of abnormality into the same abnormal IO request group.

On this basis, referring to Figure 1, step 102 of this embodiment proposes outputting alarm information in units of abnormal IO request groups; that is, one piece of alarm information is output for each abnormal IO request group. This embodiment does not limit the alarm content contained in the alarm information, provided that it supplies the content needed for alarm processing under the abnormal IO request group. This embodiment supports configuring the content fields of the alarm information on demand. In step 102, the alarm content can be generated according to the content fields required in the alarm information and encapsulated in the corresponding content fields, so as to generate and output the alarm information. Schemes for constructing the alarm information are illustrated later and are not described in detail here.

As mentioned above, the abnormal IO requests caused by the same cause of abnormality are clustered into the same abnormal IO request group. Accordingly, in step 102, the abnormal IO requests caused by the same cause of abnormality can be reported in a single piece of alarm information, which avoids repeated alarms for the same cause. In the course of research, the inventor found that the number of abnormal IO request groups produced by step 101 of this embodiment is far smaller than the number of abnormal IO requests. The number of pieces of alarm information output in step 102 is therefore far smaller than the number output by the conventional approach, which outputs alarm information in units of IO requests.

In summary, this embodiment proposes that, when an alarm triggering event occurs, a request path is determined for each abnormal IO request and the abnormal IO requests are clustered based on the request paths to produce at least one abnormal IO request group, where the abnormal IO requests corresponding to request paths passing through the same physical node are located in the same abnormal IO request group; it further proposes outputting alarm information in units of abnormal IO request groups. In this way, from the perspective of a single abnormal physical node, the request paths of the abnormal IO requests caused by that node all pass through it, so the abnormal IO requests corresponding to those request paths are clustered into the same abnormal IO request group. This ensures that abnormal IO requests caused by the same cause of abnormality are reported in a single piece of alarm information, which avoids repeated alarms for the same cause, reduces the number of pieces of alarm information, allows the alarm information to be processed in time, and improves the efficiency of alarm processing.

Figure 4 is a schematic flowchart of another alarm method provided by an exemplary embodiment of the present application. Referring to Figure 4, the method may include:

Step 400: when an alarm triggering event occurs, obtain the anomaly monitoring information corresponding to each abnormal IO request;

Step 401: from the anomaly monitoring information, parse the identification information of the physical nodes that the corresponding abnormal IO request passes through and the order in which it passes through them, so as to determine the request path corresponding to each abnormal IO request;

Step 402: cluster the abnormal IO requests based on the request paths to produce at least one abnormal IO request group, where the abnormal IO requests corresponding to request paths passing through the same physical node are located in the same abnormal IO request group;

Step 403: output alarm information in units of abnormal IO request groups.

For steps 402 and 403, reference may be made to the relevant description in the foregoing embodiments, which is not repeated here. In this embodiment, steps 400 and 401 provide an optional implementation for determining the request path corresponding to an abnormal IO request. This optional implementation may be combined with the implementations provided for other steps in the above or following embodiments to produce new technical solutions.

Referring to Figure 4, in this optional implementation, the anomaly monitoring information corresponding to each abnormal IO request is obtained. As mentioned above, this embodiment supports various means of monitoring IO requests for anomalies, all of which can produce anomaly monitoring information. Preferably, this embodiment uses a monitoring means that produces anomaly monitoring information in units of IO requests. This is only a preference; where anomaly monitoring data is produced in other units, this embodiment supports reorganizing that data into anomaly monitoring information in units of IO requests.

In an exemplary scheme, trace information is used as the anomaly monitoring information. Trace information corresponds one-to-one with IO requests. Each piece of trace information contains multiple span items that have a sequential relationship. The span items correspond one-to-one with the physical nodes that the IO request passes through, and each span item contains the identification information of its corresponding physical node. The sequential relationship between the span items in the trace information represents the order in which the IO request passes through the physical nodes. In this exemplary scheme, the trace information corresponding to each abnormal IO request may be obtained in step 400 from an anomaly monitoring system based on full-link tracing. As mentioned above, such an anomaly monitoring system can generate trace information in units of IO requests and automatically diagnose the exception type of each IO request. Moreover, such an anomaly monitoring system constructs a span item for each physical node that an IO request passes through, which describes the IO processing on that node; the sequential relationship between the span items in the trace information serves as the basis for determining the order in which the IO request passes through the physical nodes. It should be understood that this is only exemplary; other types of anomaly monitoring information may also be used in this embodiment, which is not limited thereto.

On this basis, referring to Figure 4, step 401 proposes parsing, from the anomaly monitoring information, the identification information of the physical nodes that the corresponding abnormal IO request passes through and the order in which it passes through them, so as to determine the request path corresponding to each abnormal IO request. In the course of research, the inventor found that anomaly monitoring information usually records IO processing description information for each physical node that an abnormal IO request passes through. The IO processing description information usually includes, among other things, the identification information of the physical node, the identification information of the cluster to which the physical node belongs, the time spent on IO processing at the physical node, the identification information of the physical node's previous-hop node, the identification information of the physical node's next-hop node, and the type of IO processing operation performed on the physical node. For example, the span items mentioned above record this IO processing description information for the physical nodes.

Based on this, in step 401, the identification information of the physical nodes that the corresponding abnormal IO request passes through, and the order in which it passes through them, can be parsed from the anomaly monitoring information, and the request path corresponding to the abnormal IO request can then be determined.
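As an illustration of step 401, the following Python sketch derives a request path from one piece of trace information. The record layout is an assumption made for this example only: the keys `request_id`, `spans`, `seq` and `node_ip` are hypothetical and do not come from any specific tracing system. The text states only that span items carry node identification information and have a sequential relationship.

```python
def request_path_from_trace(trace):
    """Return the ordered list of node identifiers an IO request passed
    through, using the (assumed) 'seq' field to order the span items."""
    ordered_spans = sorted(trace["spans"], key=lambda span: span["seq"])
    return [span["node_ip"] for span in ordered_spans]


def request_paths_from_traces(traces):
    """Map each request ID to its request path."""
    return {trace["request_id"]: request_path_from_trace(trace)
            for trace in traces}
```

If a monitoring system records previous-hop and next-hop identifiers instead of a sequence number, the order could be recovered by following those links; that variant is not shown here.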

Furthermore, the inventor found in the course of research that different abnormal IO requests may have different exception types. In practical applications, the exception types of IO requests may include, but are not limited to, an IO-unavailable type and an IO-impaired type. The IO-unavailable type usually means that the IO request did not receive a completed response, while the IO-impaired type usually means that the IO request received a completed response but the response was too slow. It should be understood that these exception types are only exemplary, and this embodiment is not limited thereto. The inventor also found that the exception type is an important reference in the subsequent alarm processing stage.

To this end, in this optional implementation, an exemplary acquisition scheme is proposed for obtaining the anomaly monitoring information corresponding to each abnormal IO request: an anomaly monitoring information acquisition request carrying a target exception type is sent to the anomaly monitoring system deployed for the storage system, where the anomaly monitoring system has already diagnosed the exception type of each abnormal IO request; and the anomaly monitoring information corresponding to the abnormal IO requests diagnosed as the target exception type, as returned by the anomaly monitoring system, is received.

In this exemplary acquisition scheme, the target exception type is carried in the anomaly monitoring information acquisition request sent to the anomaly monitoring system. Because the anomaly monitoring system has already diagnosed the exception type of each abnormal IO request, it can filter out the abnormal IO requests diagnosed as the target exception type and return the anomaly monitoring information corresponding to them.

In this way, based on this exemplary acquisition scheme, the alarm scheme of this embodiment first performs one layer of clustering on the abnormal IO requests along the dimension of exception type, grouping together the abnormal IO requests diagnosed as the same exception type. Second, following the idea of clustering based on request paths provided in step 402, the abnormal IO requests under each exception type are clustered again separately. Abnormal IO request groups are thus produced separately under each exception type; that is, all abnormal IO requests in the same abnormal IO request group have the same exception type, and each piece of alarm information output in step 403 involves only one exception type. This gives the subsequent alarm processing stage a reference regarding the exception type.
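The two-layer clustering described here can be sketched in Python as follows. This is a hedged illustration, not the implementation of the application: it assumes each abnormal IO request is given as a pair of a diagnosed exception type and a non-empty request path, the function names are hypothetical, and the path-based layer uses a union-find structure as a compact stand-in for a connected-component search.

```python
from collections import defaultdict


def group_by_path(request_paths):
    """Group request IDs whose paths are directly or indirectly connected,
    using union-find over node identifiers."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for path in request_paths.values():
        root = find(path[0])
        for node in path[1:]:
            parent[find(node)] = root  # merge every hop into the first hop's set

    groups = defaultdict(list)
    for rid, path in request_paths.items():
        groups[find(path[0])].append(rid)
    return sorted(sorted(ids) for ids in groups.values())


def cluster_by_type_then_path(requests):
    """Layer 1: split requests by exception type.
    Layer 2: cluster by request path within each exception type."""
    by_type = defaultdict(dict)
    for rid, (exception_type, path) in requests.items():
        by_type[exception_type][rid] = path
    return {exception_type: group_by_path(paths)
            for exception_type, paths in by_type.items()}
```

In the test data used with this sketch, request `io3` shares node `B` with `io1` and `io2` but has a different exception type, so it is placed in a separate group and each resulting group involves a single exception type.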

Of course, in this optional implementation, other exemplary schemes may also be used to support showing the exception type in the alarm information. For example, the anomaly monitoring data corresponding to all abnormal IO requests may be obtained from the anomaly monitoring system and processed uniformly according to steps 401 to 403. In step 403, however, the abnormal IO requests within a single abnormal IO request group may be clustered again by exception type, and the abnormal IO requests involved under each exception type may be recorded in the alarm information. This likewise gives the subsequent alarm processing stage a reference regarding the exception type. No further examples of schemes that support showing the exception type in the alarm information are given here, and this embodiment is not limited thereto.

In summary, in this embodiment, the anomaly monitoring information corresponding to each abnormal IO request is obtained, and the request path corresponding to each abnormal IO request is accurately determined from it, which provides an accurate basis for clustering the abnormal IO requests. In addition, it is proposed that the abnormal IO requests may be clustered along the dimension of exception type, so that exception types can be presented sensibly in the alarm information and serve as a reference for the subsequent alarm processing stage, which further improves the efficiency of alarm processing.

In the above or following embodiments, various implementations may be used to construct the alarm information. Because the logic for constructing alarm information is the same for every abnormal IO request group, the construction scheme is described below, for ease of description, using a target abnormal IO request group as an example. It should be understood that the target abnormal IO request group may be any of the abnormal IO request groups produced by clustering.

In an optional implementation: the physical nodes passed through by the abnormal IO requests in the target abnormal IO request group are deduplicated, and the remaining physical nodes are determined as target nodes; alarm information is then output for the target abnormal IO request group based on the identification information of the target nodes, the identification information of the clusters to which the target nodes belong, and/or the exception types involved in the target abnormal IO request group.

As mentioned above, different request paths may pass through the same physical node; for this reason, this optional implementation proposes deduplicating the physical nodes. Deduplication determines the set of physical nodes involved in the target abnormal IO request group. In an exemplary deduplication scheme: if the target abnormal IO request group was produced by constructing an undirected graph from the request paths and searching for connected components in it, the physical nodes represented by the vertices of the connected component corresponding to the target abnormal IO request group are taken as the target nodes.

It can be understood that, in this optional implementation, the alarm information may contain at least a content field for recording the identification information of the target nodes, a content field for recording the identification information of the clusters to which the target nodes belong, and/or a content field for recording the exception types involved in the target abnormal IO request group. Of course, these content fields are only exemplary, and this embodiment is not limited thereto.

The identification information of the target nodes and of the clusters to which they belong can be obtained from the anomaly monitoring information corresponding to the abnormal IO requests in the target abnormal IO request group, for example from the span items mentioned above. As for the exception types involved in the target abnormal IO request group, reference may be made to the exemplary schemes provided above. If the abnormal IO requests are first clustered by exception type before being clustered by request path, each abnormal IO request group produced by path-based clustering under the target exception type can be marked with the target exception type, and the alarm information then carries the exception type marked for the target abnormal IO request group. If, instead, the abnormal IO requests are first clustered by request path and then clustered again by exception type within the target abnormal IO request group, the target abnormal IO request group can be marked with the exception types found and the abnormal IO requests under each exception type can be recorded separately; the alarm information then carries the exception types marked for the target abnormal IO request group together with the abnormal IO requests under each exception type.
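The construction of one piece of alarm information for a target abnormal IO request group can be sketched as follows. The record layout and field names (`request_id`, `exception_type`, `spans`, `node_ip`, `cluster_id`, and the keys of the returned dictionary) are hypothetical and chosen only for illustration; the text leaves the content fields configurable.

```python
def build_alarm(group):
    """Build one alarm record for a group of abnormal IO requests.

    Each request is assumed to be a dict with a request ID, a diagnosed
    exception type, and span items carrying a node ID and a cluster ID.
    """
    nodes, clusters, by_type = set(), set(), {}
    for request in group:
        for span in request["spans"]:
            nodes.add(span["node_ip"])        # sets deduplicate shared nodes
            clusters.add(span["cluster_id"])
        by_type.setdefault(request["exception_type"], []).append(
            request["request_id"])
    return {
        "target_nodes": sorted(nodes),
        "clusters": sorted(clusters),
        "exception_types": {t: sorted(ids) for t, ids in by_type.items()},
    }
```

This sketch follows the second scheme described above, in which exception types are collected within an already path-clustered group; under the first scheme the `exception_types` field would simply contain a single entry.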

It should be understood that other implementations may also be used in this embodiment to construct the alarm information, and the content fields contained in the alarm information are not limited to the exemplary content fields described above. With the alarm information construction scheme provided in this embodiment, the alarm information can indicate which clusters and/or which nodes have experienced which type of IO anomaly. It can be seen that the alarm information in this embodiment no longer reports anomalies in units of IO requests, but instead reports them along dimensions such as exception type, cluster and node. This makes it easier to locate the anomaly during subsequent alarm processing and thus further improves the efficiency of alarm processing.

Figure 5 is a schematic flowchart of yet another alarm method provided by an exemplary embodiment of the present application. Referring to Figure 5, the method may include:

Step 500: when an alarm triggering event occurs, determine the request path corresponding to each abnormal IO request, where a single request path represents the connectivity relationship between the physical nodes that the corresponding abnormal IO request passes through;

Step 501: cluster the abnormal IO requests based on the request paths to produce at least one abnormal IO request group, where the abnormal IO requests corresponding to request paths passing through the same physical node are located in the same abnormal IO request group;

Step 502: output alarm information in units of abnormal IO request groups;

Step 503: in response to an alarm processing instruction, analyze the connectivity structures formed between computing nodes and storage nodes under the target abnormal IO request group corresponding to the target alarm information;

Step 504: based on the indicative relationship between connectivity structures and abnormal nodes, infer the abnormal node that caused the anomaly under the target abnormal IO request group.

For steps 500 to 502, reference may be made to the relevant description in the foregoing embodiments, which is not repeated here. In this embodiment, steps 503 and 504 provide an optional implementation for the processing performed after the alarm information is output. This optional implementation may be combined with the implementations provided for other steps in the above or following embodiments to produce new technical solutions.

Referring to Figure 5, after the alarm information is output, the alarm processing logic for each piece of output alarm information may be started in response to an alarm processing instruction. Because the same alarm processing logic is applied to every piece of alarm information, this embodiment describes the alarm processing logic, for ease of description, using target alarm information as an example. It should be understood that the target alarm information may be any piece of alarm information output in step 502 of this embodiment.

Referring to Figure 5, an IO request is in essence a data read/write request, so there is at least one physical node used for data storage; this embodiment refers to such a node as a storage node. Data is usually read or written for the purpose of computation, so this embodiment refers to the physical node that initiates an IO request as a computing node. Thus, in this embodiment, the physical nodes on a request path may include at least a computing node and a storage node. As mentioned earlier, a request path may also include intermediate nodes that forward the IO request, among others. In a storage-computing separation architecture, computing nodes may be located in the computing system and storage nodes in the storage system. In other scenarios, computing nodes and storage nodes may be deployed elsewhere; their deployment locations are not limited here.

Based on this, step 503 proposes analyzing the connectivity structures formed between computing nodes and storage nodes under the target abnormal IO request group corresponding to the target alarm information. A connectivity structure in this embodiment can be understood as the structure formed by the request paths that a single physical node connects; that physical node may be a computing node, a storage node, or an intermediate node that forwards IO requests. During research, the inventor found that multiple connectivity structures may be identified under a single abnormal IO request group.

Figure 6 is a schematic diagram of several exemplary connectivity structures provided by an exemplary embodiment of the present application. Referring to Figure 6, in this embodiment there may be at least three types of connectivity structure:

The first type of connectivity structure is one computing node connected to multiple storage nodes;

The second type of connectivity structure is one storage node connected to multiple computing nodes;

The third type of connectivity structure is multiple computing nodes and multiple storage nodes connected through an intermediate node. The intermediate node may be, for example, a network device used for network transit.

To this end, step 504 proposes that the pointing relationship between connectivity structures and abnormal nodes can be configured in advance. This pointing relationship guides the location of the abnormal node under each type of connectivity structure. In step 504, the abnormal node that caused the abnormality can then be inferred under the target abnormal IO request group based on this pointing relationship. If multiple connectivity structures are identified under the target abnormal IO request group, an abnormal node can be inferred separately for each connectivity structure according to the pointing relationship.

In an exemplary inference scheme:

If a first-type connectivity structure exists under the target abnormal IO request group, the computing node in the first-type connectivity structure is inferred to be the abnormal node; or,

If a second-type connectivity structure exists under the target abnormal IO request group, the storage node in the second-type connectivity structure is inferred to be the abnormal node; or,

If a third-type connectivity structure exists under the target abnormal IO request group, the intermediate node in the third-type connectivity structure is inferred to be the abnormal node.

Referring to Figure 6, in the first type of connectivity structure, multiple storage nodes are connected to the same computing node, so the cause of the abnormality can be provisionally located at that computing node: it is usually an abnormality of the computing node that causes the IO requests to all of the storage nodes connected to it to become abnormal. Similarly, for the second type of connectivity structure, the cause can be provisionally located at the storage node in the structure. For the third type of connectivity structure, the IO requests between multiple computing nodes and multiple storage nodes are all abnormal, so the cause can be provisionally located at the intermediate node that connects those computing nodes and storage nodes: it is usually an abnormality of the intermediate node that causes the multiple IO requests passing through it to become abnormal.
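The inference rules above can be illustrated with a minimal Python sketch. It is not the patented implementation; it assumes that each request path is an ordered list of node identifiers whose first element is the computing node, whose last element is the storage node, and whose remaining elements are intermediate nodes. The function name `infer_abnormal_nodes` and the labels `type-1`, `type-2` and `type-3` are hypothetical.

```python
from collections import defaultdict


def infer_abnormal_nodes(paths):
    """Infer suspect nodes for one abnormal IO request group.

    Assumption of this sketch: path[0] is the computing node, path[-1] is
    the storage node, and everything in between is an intermediate node.
    """
    storages_by_compute = defaultdict(set)
    computes_by_storage = defaultdict(set)
    # For each intermediate node: (computing nodes, storage nodes) it connects.
    ends_by_middle = defaultdict(lambda: (set(), set()))

    for path in paths:
        compute, storage = path[0], path[-1]
        storages_by_compute[compute].add(storage)
        computes_by_storage[storage].add(compute)
        for middle in path[1:-1]:
            computes, storages = ends_by_middle[middle]
            computes.add(compute)
            storages.add(storage)

    suspects = []
    # First type: one computing node connected to multiple storage nodes.
    for compute, storages in storages_by_compute.items():
        if len(storages) > 1:
            suspects.append(("type-1", compute))
    # Second type: one storage node connected to multiple computing nodes.
    for storage, computes in computes_by_storage.items():
        if len(computes) > 1:
            suspects.append(("type-2", storage))
    # Third type: multiple computing nodes and multiple storage nodes
    # connected through the same intermediate node.
    for middle, (computes, storages) in ends_by_middle.items():
        if len(computes) > 1 and len(storages) > 1:
            suspects.append(("type-3", middle))
    return suspects
```

When a group contains several connectivity structures, the sketch returns one suspect per structure, matching the statement that abnormal nodes are inferred separately for each structure. The result is only a provisional reference for operations staff, as the embodiment itself notes.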

It is worth noting that the alarm processing logic provided in this embodiment is only exemplary. Abnormal nodes can be inferred with this exemplary logic, and the inferred abnormal nodes can serve as a reference for operations and maintenance. In practice, additional inference dimensions can be introduced to further correct the inference results provided by this embodiment, and processing steps such as manual analysis can also be added to ensure that the cause of the abnormality is located accurately. The other inference dimensions and the manual analysis logic are not limited here.

In summary, this embodiment also proposes alarm processing logic that runs after the alarm information is output. This logic can make full use of the data produced during the alarm process in this embodiment, such as the request paths and the abnormal IO request groups, as its basis for analysis. On this basis, the connectivity structures under each piece of alarm information can be analyzed, and abnormal nodes can then be inferred from the connectivity structures to provide a reference for operations and maintenance, so that alarms are handled faster and more accurately. This further improves alarm processing efficiency.

It should be noted that some of the flows described in the above embodiments and drawings contain multiple operations that appear in a particular order, but it should be clearly understood that these operations may be executed out of the order in which they appear here, or in parallel. Operation numbers such as 101 and 102 serve only to distinguish the operations and do not themselves imply any execution order. In addition, these flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. It should also be noted that terms such as "first" and "second" in this document are used to distinguish different connectivity structures and the like; they do not indicate an order, nor do they require that the "first" and "second" items be of different types.

Figure 7 is a schematic structural diagram of an electronic device provided by another exemplary embodiment of the present application. As shown in Figure 7, the electronic device may include a memory 70 and a processor 71.

The processor 71 is coupled to the memory 70 and executes the computer program in the memory 70 in order to:

when an alarm triggering event occurs, determine the request path corresponding to each abnormal IO request, where a single request path characterizes the connectivity relationship between the physical nodes through which the corresponding abnormal IO request passes;

In an optional embodiment, when determining the request path corresponding to each abnormal IO request, the processor 71 may be specifically configured to:

In an optional embodiment, when obtaining the abnormality monitoring information corresponding to each abnormal IO request, the processor 71 may be specifically configured to:

send an abnormality monitoring information acquisition request to an abnormality monitoring system that performs IO abnormality monitoring, where the acquisition request carries a target abnormality type and the abnormality monitoring system has already diagnosed the abnormality type of each abnormal IO request;

In an optional embodiment, the abnormality monitoring information is trace information. Trace information corresponds one-to-one with IO requests. Each piece of trace information contains multiple ordered span items; the span items correspond one-to-one with the physical nodes through which the IO request passes, and each span item contains the identification information of its physical node. The order of the span items within the trace information represents the order in which the IO request passes through the physical nodes.
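As a rough illustration of how a request path could be recovered from such trace information, the following Python sketch orders the span items and reads off the node identifiers. The `Span` and `Trace` classes and their fields (`node_id`, `order`, `request_id`) are hypothetical stand-ins; the application does not specify a concrete trace schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    node_id: str  # identification information of the physical node (hypothetical field)
    order: int    # position of this span in the traversal order (hypothetical field)


@dataclass
class Trace:
    request_id: str    # the IO request this trace belongs to (one trace per request)
    spans: List[Span]  # one span per physical node the request passed through


def request_path(trace: Trace) -> List[str]:
    """Return the node identifiers in the order the IO request passed through them."""
    return [span.node_id for span in sorted(trace.spans, key=lambda s: s.order)]
```

For example, a trace whose spans arrive unordered as storage node `s1` (order 2), computing node `c1` (order 0) and switch `sw` (order 1) yields the path `c1`, `sw`, `s1`.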

In an optional embodiment, when clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, the processor 71 may be specifically configured to:

In an optional embodiment, when clustering the abnormal IO requests corresponding to the request paths that can be connected through physical nodes into an abnormal IO request group, the processor 71 may be specifically configured to:

In an optional embodiment, when searching for connected components in the undirected graph, the processor 71 may be specifically configured to:

output the connected components found.
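The graph-based clustering can be sketched in Python as follows, using the procedure set out in the claims: physical nodes become vertices, adjacency on a request path becomes an edge, and each connected component is found, removed from the graph, and the search repeated until no vertices remain. The function names, the depth-first traversal, and the shape of the input (`paths` maps a hypothetical request identifier to a non-empty ordered list of node identifiers) are assumptions of this sketch rather than details from the application.

```python
from collections import defaultdict


def build_graph(paths):
    """Build an undirected graph: vertices are physical nodes, edges join
    nodes that are adjacent on some request path."""
    graph = defaultdict(set)
    for path in paths.values():
        for node in path:
            graph[node]  # make sure isolated nodes still appear as vertices
        for a, b in zip(path, path[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Pick a remaining vertex, collect its connected component, delete the
    component from the graph, and repeat until no vertices remain."""
    remaining = {vertex: set(neighbours) for vertex, neighbours in graph.items()}
    components = []
    while remaining:
        start = next(iter(remaining))
        component, stack = set(), [start]
        while stack:  # iterative depth-first search
            vertex = stack.pop()
            if vertex in component:
                continue
            component.add(vertex)
            stack.extend(remaining[vertex] - component)
        for vertex in component:
            del remaining[vertex]
        components.append(component)
    return components


def cluster_requests(paths):
    """Group request ids so that requests whose paths fall in the same
    connected component share one abnormal IO request group."""
    components = connected_components(build_graph(paths))
    groups = [[] for _ in components]
    for request_id, path in paths.items():
        for index, component in enumerate(components):
            if path[0] in component:  # every node of a path is in one component
                groups[index].append(request_id)
                break
    return groups, components
```

Two requests that share only a switch, for instance `c1 -> sw -> s1` and `c2 -> sw -> s2`, end up in the same group, while an unrelated request `c3 -> s3` forms its own group and would therefore produce a separate alarm.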

In an optional embodiment, when outputting alarm information in units of abnormal IO request groups, the processor 71 may be specifically configured to:

In an optional embodiment, when de-duplicating the physical nodes passed through by the abnormal IO requests in the target abnormal IO request group and determining the remaining physical nodes as target nodes, the processor 71 may be specifically configured to:
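A small Python sketch of this de-duplication step, under stated assumptions: the alarm is represented as a plain dictionary, `cluster_of` is a hypothetical lookup from node identifier to the identifier of the cluster the node belongs to, and `anomaly_type` is a free-form label. None of these names come from the application, which lists node identifiers, cluster identifiers and abnormality types as possible alarm contents in its claims.

```python
def build_alarm(group_paths, cluster_of, anomaly_type):
    """Assemble one piece of alarm information for one abnormal IO request group."""
    # dict.fromkeys drops duplicate nodes while keeping first-seen order.
    target_nodes = list(dict.fromkeys(node for path in group_paths for node in path))
    # Clusters of the target nodes; nodes without a known cluster are skipped.
    clusters = sorted({cluster_of[node] for node in target_nodes if node in cluster_of})
    return {
        "target_nodes": target_nodes,
        "clusters": clusters,
        "anomaly_type": anomaly_type,
    }
```

When the group was produced by a connected-component search, the vertices of the component could be used directly as the target nodes instead of re-scanning the paths.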

In an optional embodiment, the physical nodes include at least a computing node and a storage node, and after outputting the alarm information, the processor 71 may further be configured to:

In an optional embodiment, when inferring the abnormal node that caused the abnormality under the target abnormal IO request group based on the pointing relationship between connectivity structures and abnormal nodes, the processor 71 may be specifically configured to:

In an optional embodiment, the target abnormality type includes an IO unavailable class or an IO damaged class; the physical nodes include computing nodes in a computing system, storage nodes in a storage system, and/or intermediate nodes used for network connection; and an abnormal IO request is an IO request that a computing node in the computing system initiated to a storage node in the storage system and in which an abnormality has occurred.

Further, as shown in Figure 7, the electronic device also includes other components such as a communication component 72 and a power supply component 73. Figure 7 shows only some components schematically, which does not mean that the electronic device includes only the components shown in Figure 7.

It is worth noting that, for technical details of the electronic device embodiments above, reference can be made to the relevant descriptions in the foregoing method embodiments. They are not repeated here to save space, but this should not reduce the scope of protection of the present application.

Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed, implements the steps performed in the above method embodiments.

Correspondingly, an embodiment of the present application also provides a computer program product which, when the computer program in the computer program product is executed, implements the steps performed in the above method embodiments.

The memory in Figure 7 stores the computer program and can be configured to store various other data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and so on. The memory can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The communication component in Figure 7 is configured to facilitate wired or wireless communication between the device in which it resides and other devices. That device can access a wireless network based on a communication standard, such as WiFi, a 2G, 3G, 4G/LTE or 5G mobile communication network, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply component in Figure 7 provides power to the various components of the device in which it resides. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for that device.

Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in the present application are information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and a corresponding operation entry is provided for users to grant or refuse authorization.

The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims

1. An alarm method, comprising:

when an alarm triggering event occurs, determining the request path corresponding to each abnormal IO request, wherein a single request path characterizes the connectivity relationship between the physical nodes through which the corresponding abnormal IO request passes;

clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group, wherein abnormal IO requests whose request paths pass through the same physical node are located in the same abnormal IO request group; and

outputting alarm information in units of abnormal IO request groups.

2. The method of claim 1, wherein determining the request path corresponding to each abnormal IO request comprises:

acquiring the abnormality monitoring information corresponding to each abnormal IO request; and

parsing, from the abnormality monitoring information, the identification information and the traversal order of the physical nodes through which the corresponding abnormal IO request passes, to determine the request path corresponding to each abnormal IO request.

3. The method of claim 2, wherein acquiring the abnormality monitoring information corresponding to each abnormal IO request comprises:

sending an abnormality monitoring information acquisition request to an abnormality monitoring system that performs IO abnormality monitoring, wherein the acquisition request carries a target abnormality type, and the abnormality monitoring system has diagnosed the abnormality type corresponding to each abnormal IO request; and

receiving the abnormality monitoring information that is returned by the abnormality monitoring system and that corresponds to the abnormal IO requests diagnosed as the target abnormality type.

4. The method of claim 2 or 3, wherein the abnormality monitoring information is trace information, the trace information corresponds one-to-one with the IO requests, the trace information comprises a plurality of ordered span items, the span items correspond one-to-one with the physical nodes through which the IO request passes, each span item comprises identification information of the corresponding physical node, and the order of the span items contained in the trace information represents the traversal order of the physical nodes through which the IO request passes.

5. The method of claim 1, wherein clustering the abnormal IO requests based on the request paths to generate at least one abnormal IO request group comprises:

clustering the abnormal IO requests corresponding to the request paths that can be connected through physical nodes into one abnormal IO request group; or,

querying for a physical node capable of connecting a plurality of request paths and using it as a clustering node; and clustering the abnormal IO requests corresponding to the request paths that a single clustering node connects into one abnormal IO request group.

6. The method of claim 5, wherein clustering the abnormal IO requests corresponding to the request paths that can be connected through physical nodes into one abnormal IO request group comprises:

constructing an undirected graph by taking the physical nodes on each request path as vertices and the connectivity relationships between the physical nodes on each request path as edges;

searching for connected components in the undirected graph; and

clustering the abnormal IO requests corresponding to the request paths contained in a single connected component into one abnormal IO request group.

7. The method of claim 6, wherein searching for connected components in the undirected graph comprises:

upon traversing to a target vertex in the undirected graph, searching for the connected component of the target vertex;

deleting the connected component of the target vertex from the undirected graph;

continuing to determine a next target vertex from the remaining vertices in the undirected graph, and searching for and deleting the corresponding connected component, until no vertices remain in the undirected graph; and

outputting the connected components found.

8. The method of claim 1, wherein outputting alarm information in units of abnormal IO request groups comprises:

de-duplicating the physical nodes passed through by the abnormal IO requests in a target abnormal IO request group, and determining the remaining physical nodes as target nodes; and

outputting alarm information for the target abnormal IO request group based on the identification information of the target nodes, the identification information of the clusters to which the target nodes belong, and/or the abnormality type associated with the target abnormal IO request group;

wherein the target abnormal IO request group is any one of the abnormal IO request groups.

9. The method of claim 8, wherein de-duplicating the physical nodes passed through by the abnormal IO requests in the target abnormal IO request group and determining the remaining physical nodes as target nodes comprises:

if the target abnormal IO request group was clustered by constructing an undirected graph from the request paths and searching for connected components in the undirected graph, taking the physical nodes represented by all vertices contained in the connected component corresponding to the target abnormal IO request group as the target nodes.

10. The method of claim 1, wherein the physical nodes comprise at least a computing node and a storage node, and wherein after outputting the alarm information, the method further comprises:

in response to an alarm processing instruction, analyzing the connectivity structure formed between computing nodes and storage nodes under the target abnormal IO request group corresponding to target alarm information; and

inferring, based on the pointing relationship between connectivity structures and abnormal nodes, the abnormal node that caused the abnormality under the target abnormal IO request group.

11. The method of claim 10, wherein inferring, based on the pointing relationship between connectivity structures and abnormal nodes, the abnormal node that caused the abnormality under the target abnormal IO request group comprises:

if a first-type connectivity structure exists under the target abnormal IO request group, inferring the computing node in the first-type connectivity structure to be the abnormal node, wherein the first-type connectivity structure is one computing node connected to a plurality of storage nodes; or,

if a second-type connectivity structure exists under the target abnormal IO request group, inferring the storage node in the second-type connectivity structure to be the abnormal node, wherein the second-type connectivity structure is one storage node connected to a plurality of computing nodes; or,

if a third-type connectivity structure exists under the target abnormal IO request group, inferring the intermediate node in the third-type connectivity structure to be the abnormal node, wherein the third-type connectivity structure is a plurality of computing nodes and a plurality of storage nodes connected through an intermediate node.

12. The method of claim 3, wherein the target abnormality type comprises an IO unavailable class and/or an IO damaged class, the physical nodes comprise computing nodes in a computing system, storage nodes in a storage system, and/or intermediate nodes used for network connection, and an abnormal IO request is an IO request that a computing node in the computing system initiated to a storage node in the storage system and in which an abnormality has occurred.

13. An electronic device comprising a memory and a processor;

the memory is used for storing one or more computer instructions;

the processor is coupled to the memory and executes the one or more computer instructions to perform the alarm method of any one of claims 1-12.

14. A computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the alarm method of any one of claims 1-12.

15. A computer program product comprising computer programs/instructions which, when executed by a processor, cause the processor to implement the alarm method of any one of claims 1-12.