CN117349128B - Fault monitoring method, device and equipment of server cluster and storage medium - Google Patents
- Publication number
- CN117349128B (CN202311654834.XA)
- Authority
- CN
- China
- Prior art keywords
- server
- server cluster
- connection
- cluster
- ids
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention relates to the technical field of server monitoring, and in particular to a fault monitoring method, device and equipment of a server cluster and a storage medium. The method comprises: acquiring relationship information of a server cluster; processing the relationship information of the server cluster to generate a connection association diagram of the server cluster; acquiring current attribute information of the server cluster; and generating a fault state diagram of the server cluster according to the connection association diagram and the current attribute information, so that faults of the server cluster are monitored according to the fault state diagram. The fault state of each server in the cluster can be queried intuitively, and the probability of an overall fault can be determined from the fault states in combination with the connection relationships, thereby realizing fault monitoring of the server cluster.
Description
Technical Field
The present invention relates to the field of server monitoring technologies, and in particular, to a method, an apparatus, a device, and a storage medium for monitoring a failure of a server cluster.
Background
Database all-in-one machines are connected to one another through network communication and together form a logical service cluster, namely a database all-in-one machine cluster. Within such a cluster, a server may crash or enter another abnormal state for various reasons, which can cause data anomalies; faults of the servers therefore need to be monitored.
Disclosure of Invention
In view of the above technical problem, the invention provides a fault monitoring method of a server cluster, which comprises the following steps:
Acquiring relationship information of the server cluster.
Processing the relationship information of the server cluster to generate a connection association diagram of the server cluster.
Acquiring current attribute information of the server cluster.
Generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
The invention also protects a fault monitoring device of the server cluster, which comprises:
A first acquisition module, used for acquiring the relationship information of the server cluster.
A first generation module, used for processing the relationship information of the server cluster and generating the connection association diagram of the server cluster.
A second acquisition module, used for acquiring the current attribute information of the server cluster.
A second generation module, used for generating the fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
The invention also protects a computer device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the above fault monitoring method of the server cluster when executing the computer program.
The invention also protects a computer readable storage medium storing a computer program which, when executed by a processor, implements the above fault monitoring method of the server cluster.
Compared with the prior art, the invention has clear advantages and beneficial effects. With the above technical solution, the fault monitoring method, device and equipment of a server cluster and the storage medium provided by the invention achieve considerable technical progress and practicality, have broad industrial value, and offer at least the following advantages:
The invention discloses a fault monitoring method, device and equipment of a server cluster and a storage medium. The method comprises: acquiring relationship information of a server cluster; processing the relationship information of the server cluster to generate a connection association diagram of the server cluster; acquiring current attribute information of the server cluster; and generating a fault state diagram of the server cluster according to the connection association diagram and the current attribute information, so that faults of the server cluster are monitored according to the fault state diagram. The fault state of each server in the cluster can be queried intuitively, and the probability of an overall fault can be determined from the fault states in combination with the connection relationships, thereby realizing fault monitoring of the server cluster.
The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a flowchart of a fault monitoring method for a server cluster according to a first embodiment of the present invention;
Fig. 2 is a flowchart of step S2 according to the first embodiment of the present invention;
Fig. 3 is a flowchart of step S4 according to the first embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a fault monitoring device for a server cluster according to a second embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the first generating module 2 according to the second embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the second generating module 4 according to the second embodiment of the present invention.
Detailed Description
In order to further describe the technical means adopted by the present invention to achieve its intended purpose and the effects obtained, the specific implementation and effects of the fault monitoring method of a server cluster according to the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and in the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Example 1
As shown in Fig. 1, the first embodiment provides a fault monitoring method for a server cluster, where the method includes:
S1, acquiring relationship information of a server cluster, wherein the server cluster comprises a plurality of servers; for example, the servers are database all-in-one machines.
Specifically, the relationship information of the server cluster includes relationship information of a plurality of servers, where the relationship information of each server refers to a communication connection relationship between any server and other servers except the server.
S2, processing the relation information of the server clusters to generate a connection association diagram of the server clusters.
Specifically, as shown in Fig. 2, processing the relationship information of the server cluster to generate the connection association diagram of the server cluster further includes the following steps:
S21, determining a connection association server ID set of the server cluster according to the relationship information of the server cluster;
S22, generating the connection association diagram of the server cluster according to the connection association server ID set of the server cluster.
In a specific embodiment, the connection association server ID set of the server cluster is determined by the following steps:
S211, obtaining a server ID list A = {A_1, A_2, ..., A_i, ..., A_m} corresponding to the server cluster, where A_i refers to the i-th server ID, i = 1, 2, ..., m, and m is the number of server IDs corresponding to the server cluster.
Specifically, the server ID is a unique identity of the server.
S212, acquiring, according to the relationship information of the server cluster corresponding to A, the connection association server ID set B = {B_1, B_2, ..., B_i, ..., B_m} corresponding to A, where B_i = {B_{i1}, B_{i2}, ..., B_{ij}, ..., B_{in(i)}}, B_{ij} refers to the j-th connection association server ID corresponding to A_i, j = 1, 2, ..., n(i), and n(i) refers to the number of connection association server IDs corresponding to A_i; that is, the connection association server ID set of the server cluster is B.
Further, a connection association server ID corresponding to A_i refers to the unique identity of a server that has relationship information with the server corresponding to A_i.
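Steps S211 and S212 can be sketched as follows. This is a minimal illustration, assuming the relationship information is available as a list of undirected server-ID pairs; the function name `build_association_sets`, the `links` format and the sample IDs are hypothetical and do not come from the patent.

```python
from collections import defaultdict

def build_association_sets(server_ids, links):
    """Return B as a dict mapping each server ID A_i to the sorted list B_i
    of server IDs that have a communication connection with it."""
    neighbours = defaultdict(set)
    for left, right in links:
        if left == right:
            continue  # a server is not treated as associated with itself
        neighbours[left].add(right)
        neighbours[right].add(left)
    return {sid: sorted(neighbours[sid]) for sid in server_ids}

# Server ID list A and relationship information (assumed pair format).
A = ["s1", "s2", "s3", "s4"]
links = [("s1", "s2"), ("s1", "s3"), ("s3", "s4")]

B = build_association_sets(A, links)
# n(i): number of connection association server IDs of A_i.
n = [len(B[sid]) for sid in A]
```

Here `B["s1"]` holds the IDs associated with `s1`, and `n` is the count list later used to pick the root node.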
Specifically, the connection association diagram of the server cluster is a tree-structured association diagram comprising a connection-associated root node and s layers of connection-associated leaf nodes, where the number of leaf nodes differs from layer to layer; the diagram is generated by the following steps:
S221, acquiring the list of connection association server ID numbers n = {n(1), n(2), ..., n(i), ..., n(m)} corresponding to A.
S222, determining the root node associated with the connection according to n.
In a specific embodiment, when n(i) is the only minimum number of associated server IDs in n, the root node associated with the connection is A_i.
In another specific embodiment, in step S222, the root node associated with the connection is further determined by:
S2221, acquiring, according to n, a first intermediate server ID set C = {C_1, C_2, ..., C_x, ..., C_p}, where C_x is the x-th first intermediate server ID, x = 1, 2, ..., p, and p is the number of first intermediate server IDs.
Further, the first intermediate server ID refers to a server ID corresponding to the minimum value in n.
S2222, obtaining from B the associated server ID number list z = {z(1), z(2), ..., z(x), ..., z(p)} corresponding to C, where z(x) is the number of associated server IDs corresponding to C_x.
S2223, when z(x) is the minimum number of associated server IDs in z, determining C_x as the root node associated with the connection.
The root node and the leaf nodes are thus found through the association relationships, so that a reasonable tree-structured connection association diagram can be constructed. Faults can subsequently be associated with this diagram, which allows the probability of an overall fault to be determined and the faults of the server cluster to be monitored effectively.
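The root selection in steps S221 to S2223 can be sketched as below. The text does not spell out how z(x) is counted, so this sketch assumes z(x) is the sum of the association counts of the servers associated with C_x, and it breaks any remaining tie by list order; both choices, and the name `select_root`, are assumptions for illustration only.

```python
def select_root(server_ids, B):
    """Pick the connection-associated root node from the association sets B."""
    # S221: n(i) is the number of associated server IDs of A_i.
    n = {sid: len(B[sid]) for sid in server_ids}
    smallest = min(n.values())
    # S2221: first intermediate server IDs are those reaching the minimum of n.
    C = [sid for sid in server_ids if n[sid] == smallest]
    if len(C) == 1:
        return C[0]  # unique minimum: that server is the root
    # S2222 (assumed reading): z(x) sums the association counts of the
    # servers associated with C_x.
    z = {cid: sum(n[nb] for nb in B[cid]) for cid in C}
    # S2223: the candidate with the smallest z(x) becomes the root;
    # min() keeps the earliest candidate if z still ties (assumption).
    return min(C, key=lambda cid: z[cid])

A = ["s1", "s2", "s3", "s4", "s5"]
B = {
    "s1": ["s2", "s3", "s5"],
    "s2": ["s1"],
    "s3": ["s1", "s4"],
    "s4": ["s3"],
    "s5": ["s1"],
}
root = select_root(A, B)
```

In this sample, `s2`, `s4` and `s5` all have one associated server, and `s4` wins because its only neighbour `s3` has fewer associations than `s1`.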
S223, determining all leaf nodes D = {D_1, D_2, ..., D_r, ..., D_s} according to the root node associated with the connection, where D_r = {D_{r1}, D_{r2}, ..., D_{ry}, ..., D_{rq(r)}}, D_{ry} is the y-th leaf node in the r-th layer, r = 1, 2, ..., s, y = 1, 2, ..., q(r), and q(r) is the number of leaf nodes in the r-th layer. It can be understood that D_{ry} is any server ID in A, other than the server IDs in the server ID list corresponding to D_{r-1} and the server ID corresponding to the root node associated with the connection, whose number of associated server IDs is not greater than a preset first server ID number threshold.
Further, in step S223, D_{1y} is also determined by the following steps:
S2231, obtaining a second intermediate server ID list U = {U_1, U_2, ..., U_g, ..., U_v} corresponding to the root node associated with the connection, where U_g is the g-th second intermediate server ID corresponding to the root node associated with the connection, g = 1, 2, ..., v, and v is the number of second intermediate server IDs corresponding to the root node associated with the connection.
Further, the second intermediate server ID is an associated server ID corresponding to the root node associated with the connection in B.
S2232, obtaining the number of associated server IDs corresponding to each U_g, and taking each U_g whose number of associated server IDs is not greater than the preset first server ID number threshold as D_{1y}.
Preferably, the first server ID number threshold may be determined by a person skilled in the art according to the level of the leaf node, which will not be described in detail herein.
Because the leaf nodes are determined from the number of associated server IDs, reasonable leaf nodes can be identified accurately, which makes the estimated probability of an overall fault more reasonable and allows faults of the server cluster to be monitored effectively.
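The layer-by-layer leaf determination of step S223 can be sketched as follows. This sketch assumes a breadth-first reading in which layer r is drawn from the not-yet-placed servers associated with layer r-1, and it takes the threshold as a caller-supplied function of the layer index; `build_layers` and `threshold_for_layer` are hypothetical names, and the patent leaves the threshold choice to the practitioner.

```python
def build_layers(root, B, threshold_for_layer):
    """Return the leaf layers D_1..D_s as a list of lists of server IDs."""
    placed = {root}      # server IDs already used as root or leaf
    layers = []
    frontier = [root]    # nodes whose associated servers are examined next
    r = 1
    while frontier:
        limit = threshold_for_layer(r)
        layer = []
        for parent in frontier:
            for sid in B[parent]:
                # Keep a server only if it is new and its association count
                # does not exceed the threshold for this layer.
                if sid not in placed and len(B[sid]) <= limit:
                    placed.add(sid)
                    layer.append(sid)
        if not layer:
            break
        layers.append(layer)
        frontier = layer
        r += 1
    return layers

B = {
    "s1": ["s2", "s3", "s5"],
    "s2": ["s1"],
    "s3": ["s1", "s4"],
    "s4": ["s3"],
    "s5": ["s1"],
}
D = build_layers("s4", B, lambda r: 3)
```

With a threshold of 3 at every layer the whole sample cluster is placed in three layers; lowering it to 2 stops the tree at `s3`, because `s1` has three associated servers.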
S3, obtaining the current attribute information of the server cluster.
Specifically, the current attribute information of the server cluster includes current attribute information of each server, where the current attribute information of each server includes: current hardware state information of the server, current network state information of the server, and current software state information of the server.
And S4, generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
In a specific embodiment, as shown in Fig. 3, generating the fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram, further includes the following steps:
S41, determining the current fault label vector corresponding to each server according to the current attribute information corresponding to each server in the server cluster.
S42, generating a fault state diagram of the server cluster according to the current fault label vector of each server and the connection association diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
Specifically, the step S41 further includes the following steps:
S411, acquiring the current attribute information A^0 = {A^0_1, A^0_2, ..., A^0_i, ..., A^0_m} corresponding to A, where A^0_i = (A^0_{i1}, A^0_{i2}, A^0_{i3}), A^0_{i1} refers to the current hardware state information of A_i, A^0_{i2} refers to the current network state information of A_i, and A^0_{i3} refers to the current software state information of A_i.
S412, obtaining, according to A^0, the current fault label vector set B^0 = {B^0_1, B^0_2, ..., B^0_i, ..., B^0_m} corresponding to A^0, where B^0_i = (B^0_{i1}, B^0_{i2}, B^0_{i3}), B^0_{i1} is the fault probability value corresponding to A^0_{i1}, B^0_{i2} is the fault probability value corresponding to A^0_{i2}, and B^0_{i3} is the fault probability value corresponding to A^0_{i3}. It can be understood that A^0_{i1}, A^0_{i2} and A^0_{i3} are each input into a corresponding trained neural network model to obtain their respective fault probability values. Methods of obtaining a fault occurrence probability value with a neural network model are known to those skilled in the art and are not described here.
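The mapping from A^0_i to B^0_i in steps S411 and S412 can be sketched as below. The patent does not describe the trained neural network models, so the three scoring functions here are simple logistic stand-ins; the feature names (`cpu_temp_c`, `packet_loss`, `error_rate`) and all coefficients are hypothetical and only show the data flow.

```python
import math

def logistic(x):
    """Map any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical stand-ins for the three trained models; a real system would
# call its own hardware, network and software fault predictors here.
def hardware_model(info):
    return logistic(info["cpu_temp_c"] / 10.0 - 8.0)

def network_model(info):
    return logistic(20.0 * info["packet_loss"] - 3.0)

def software_model(info):
    return logistic(10.0 * info["error_rate"] - 4.0)

MODELS = (hardware_model, network_model, software_model)

def fault_label_vector(attrs):
    """Turn A0_i = (hardware, network, software) into B0_i,
    one fault probability value per component."""
    return tuple(model(info) for model, info in zip(MODELS, attrs))

# Current attribute information A0, keyed by server ID (sample values).
A0 = {
    "s1": ({"cpu_temp_c": 80.0}, {"packet_loss": 0.0}, {"error_rate": 0.0}),
    "s2": ({"cpu_temp_c": 55.0}, {"packet_loss": 0.4}, {"error_rate": 0.9}),
}
B0 = {sid: fault_label_vector(attrs) for sid, attrs in A0.items()}
```

Each entry of `B0` is a three-component vector whose order matches the hardware, network and software entries of the corresponding attribute tuple.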
Specifically, the step S42 further includes the following steps:
S421, obtaining the connection association diagram of the server cluster;
S422, recording each B^0_i on the node corresponding to the respective server ID in the connection association diagram of the server cluster, thereby generating the fault state diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
The fault state diagram is thus constructed by combining the fault probabilities with the connection association diagram, so that a reasonable probability of an overall fault can be determined and faults of the server cluster can be monitored effectively.
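Steps S421 and S422 can be sketched as follows, assuming the connection association diagram is given as a root ID plus a list of leaf layers and the fault label vectors as a dict keyed by server ID. The node layout, the helper `servers_at_risk` and its 0.5 limit are hypothetical; the patent does not state how the overall fault probability is derived from the diagram, so that step is not shown.

```python
def build_fault_state_graph(root, layers, B0):
    """Record each fault label vector B0_i on the node of its server ID."""
    graph = {root: {"layer": 0, "fault": B0[root]}}
    for r, layer in enumerate(layers, start=1):
        for sid in layer:
            graph[sid] = {"layer": r, "fault": B0[sid]}
    return graph

def servers_at_risk(graph, limit=0.5):
    """List server IDs whose largest component probability reaches the limit
    (an illustrative query, not part of the described method)."""
    return sorted(sid for sid, node in graph.items() if max(node["fault"]) >= limit)

# Sample connection association diagram and fault label vectors.
root = "s4"
layers = [["s3"], ["s1"], ["s2", "s5"]]
B0 = {
    "s4": (0.1, 0.2, 0.05),
    "s3": (0.7, 0.1, 0.1),
    "s1": (0.2, 0.2, 0.2),
    "s2": (0.1, 0.9, 0.3),
    "s5": (0.0, 0.0, 0.0),
}
graph = build_fault_state_graph(root, layers, B0)
flagged = servers_at_risk(graph)
```

Keeping the layer index next to each vector lets a monitor see both which servers look faulty and where they sit in the tree.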
The fault monitoring method of the server cluster in this embodiment includes: acquiring relationship information of a server cluster; processing the relationship information of the server cluster to generate a connection association diagram of the server cluster; acquiring current attribute information of the server cluster; and generating a fault state diagram of the server cluster according to the connection association diagram and the current attribute information, so that faults of the server cluster are monitored according to the fault state diagram. The fault state of each server in the cluster can be queried intuitively, and the probability of an overall fault can be determined from the fault states in combination with the connection relationships, thereby realizing fault monitoring of the server cluster.
Example two
As shown in Fig. 4, the second embodiment provides a fault monitoring device for a server cluster, where the device includes:
The first obtaining module 1 is configured to obtain relationship information of a server cluster, where the server cluster includes a plurality of servers; for example, the servers are database all-in-one machines.
Specifically, the relationship information of the server cluster includes relationship information of a plurality of servers, where the relationship information of each server refers to a communication connection relationship between any server and other servers except the server.
The first generating module 2 is configured to process the relationship information of the server cluster, and generate a connection association diagram of the server cluster.
Specifically, as shown in fig. 5, the first generating module 2 further includes:
a first determining module 21, configured to determine a connection association server ID set of the server cluster according to relationship information of the server cluster;
the first graph generating module 22 is configured to generate a connection association graph of the server cluster according to the connection association server ID set of the server cluster.
In a specific embodiment, the first determining module 21 includes:
a server ID list obtaining module 211, configured to obtain a server ID list A = {A_1, A_2, ..., A_i, ..., A_m} corresponding to the server cluster, where A_i refers to the i-th server ID, i = 1, 2, ..., m, and m is the number of server IDs corresponding to the server cluster.
Specifically, the server ID is a unique identity of the server.
A connection association server ID set acquisition module 212, configured to acquire, according to the relationship information of the server cluster corresponding to A, the connection association server ID set B = {B_1, B_2, ..., B_i, ..., B_m} corresponding to A, where B_i = {B_{i1}, B_{i2}, ..., B_{ij}, ..., B_{in(i)}}, B_{ij} refers to the j-th connection association server ID corresponding to A_i, j = 1, 2, ..., n(i), and n(i) refers to the number of connection association server IDs corresponding to A_i; that is, the connection association server ID set of the server cluster is B.
Further, a connection association server ID corresponding to A_i refers to the unique identity of a server that has relationship information with the server corresponding to A_i.
Specifically, the connection association diagram of the server cluster is a tree-structured association diagram comprising a connection-associated root node and s layers of connection-associated leaf nodes, where the number of leaf nodes differs from layer to layer; the first graph generating module 22 includes:
The connection association server ID number list obtaining module 221 is configured to obtain the list of connection association server ID numbers n = {n(1), n(2), ..., n(i), ..., n(m)} corresponding to A.
The root node determining module 222 is configured to determine, according to n, a root node associated with the connection.
In a specific embodiment, when n(i) is the only minimum number of associated server IDs in n, the root node associated with the connection is A_i.
In another specific embodiment, the root node determination module 222 includes:
a first intermediate server ID set acquisition module 2221, configured to acquire, according to n, a first intermediate server ID set C = {C_1, C_2, ..., C_x, ..., C_p}, where C_x is the x-th first intermediate server ID, x = 1, 2, ..., p, and p is the number of first intermediate server IDs.
Further, the first intermediate server ID refers to a server ID corresponding to the minimum value in n.
A first execution module 2222, configured to obtain from B the associated server ID number list z = {z(1), z(2), ..., z(x), ..., z(p)} corresponding to C, where z(x) is the number of associated server IDs corresponding to C_x.
A second execution module 2223, configured to determine C_x as the root node associated with the connection when z(x) is the minimum number of associated server IDs in z.
A leaf node determining module 223, configured to determine all leaf nodes D = {D_1, D_2, ..., D_r, ..., D_s} according to the root node associated with the connection, where D_r = {D_{r1}, D_{r2}, ..., D_{ry}, ..., D_{rq(r)}}, D_{ry} is the y-th leaf node in the r-th layer, r = 1, 2, ..., s, y = 1, 2, ..., q(r), and q(r) is the number of leaf nodes in the r-th layer. It can be understood that D_{ry} is any server ID in A, other than the server IDs in the server ID list corresponding to D_{r-1} and the server ID corresponding to the root node associated with the connection, whose number of associated server IDs is not greater than a preset first server ID number threshold.
Further, the leaf node determining module 223 includes:
a third execution module 2231, configured to obtain a second intermediate server ID list U = {U_1, U_2, ..., U_g, ..., U_v} corresponding to the root node associated with the connection, where U_g is the g-th second intermediate server ID corresponding to the root node associated with the connection, g = 1, 2, ..., v, and v is the number of second intermediate server IDs corresponding to the root node associated with the connection.
Further, the second intermediate server ID is an associated server ID corresponding to the root node associated with the connection in B.
A fourth execution module 2232, configured to obtain the number of associated server IDs corresponding to each U_g, and to take each U_g whose number of associated server IDs is not greater than the preset first server ID number threshold as D_{1y}.
Preferably, the first server ID number threshold may be determined by a person skilled in the art according to the level of the leaf node, which will not be described in detail herein.
And the second acquisition module 3 is used for acquiring the current attribute information of the server cluster.
Specifically, the current attribute information of the server cluster includes current attribute information of each server, where the current attribute information of each server includes: current hardware state information of the server, current network state information of the server, and current software state information of the server.
And the second generating module 4 is configured to generate a failure state diagram of the server cluster according to the connection association diagram of the server cluster and current attribute information of the server cluster, so that failure of the server cluster is monitored according to the failure state diagram of the server cluster.
In a specific embodiment, as shown in fig. 6, the second generating module 4 further includes:
The second determining module 41 is configured to determine the current fault label vector corresponding to each server according to the current attribute information corresponding to each server in the server cluster.
And the second graph generating module 42 is configured to generate a failure state graph of the server cluster according to the current failure label vector of each server and the connection association graph of the server cluster, so that failure of the server cluster is monitored according to the failure state graph of the server cluster.
Specifically, the second determining module 41 includes:
a current attribute information obtaining module 411, configured to obtain the current attribute information A^0 = {A^0_1, A^0_2, ..., A^0_i, ..., A^0_m} corresponding to A, where A^0_i = (A^0_{i1}, A^0_{i2}, A^0_{i3}), A^0_{i1} refers to the current hardware state information of A_i, A^0_{i2} refers to the current network state information of A_i, and A^0_{i3} refers to the current software state information of A_i.
The current fault label vector set obtaining module 412 is configured to obtain, according to A^0, the current fault label vector set B^0 = {B^0_1, B^0_2, ..., B^0_i, ..., B^0_m} corresponding to A^0, where B^0_i = (B^0_{i1}, B^0_{i2}, B^0_{i3}), B^0_{i1} is the fault probability value corresponding to A^0_{i1}, B^0_{i2} is the fault probability value corresponding to A^0_{i2}, and B^0_{i3} is the fault probability value corresponding to A^0_{i3}. It can be understood that A^0_{i1}, A^0_{i2} and A^0_{i3} are each input into a corresponding trained neural network model to obtain their respective fault probability values. Methods of obtaining a fault occurrence probability value with a neural network model are known to those skilled in the art and are not described here.
Specifically, the second graph generation module 42 includes:
a fifth execution module 421, configured to obtain a connection association diagram of the server cluster;
a sixth execution module 422, configured to record each B^0_i on the node corresponding to the respective server ID in the connection association diagram of the server cluster, thereby generating the fault state diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring relation information of a server cluster;
processing the relation information of the server clusters to generate a connection association diagram of the server clusters;
acquiring current attribute information of a server cluster;
and generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring relation information of a server cluster;
processing the relation information of the server clusters to generate a connection association diagram of the server clusters;
acquiring current attribute information of a server cluster;
and generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may perform the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division into functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The present invention is not limited to the above-mentioned embodiments. Any person skilled in the art may make changes or modifications to arrive at equivalent embodiments without departing from the scope of the present invention, and all simple modifications, equivalent changes, and modifications made according to the technical substance of the present invention fall within the scope of the technical solution of the present invention.
Claims (10)
1. A method for monitoring a failure of a server cluster, the method comprising:
acquiring relation information of a server cluster, wherein the relation information of the server cluster comprises relation information of a plurality of servers, and the relation information of each server refers to communication connection relation between any server and other servers except the server;
processing the relation information of the server cluster to generate a connection association diagram of the server cluster, wherein the processing of the relation information of the server cluster to generate the connection association diagram of the server cluster further comprises the following steps:
determining a connection association server ID set of the server cluster according to the relation information of the server cluster, which further comprises the following steps:
S211, obtaining a server ID list $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the server cluster, where $A_i$ is the $i$-th server ID, $i = 1, 2, \ldots, m$, and $m$ is the number of server IDs corresponding to the server cluster;
S212, acquiring, according to the relation information of the server cluster corresponding to $A$, the connection association server ID set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i = \{B_{i1}, B_{i2}, \ldots, B_{ij}, \ldots, B_{in(i)}\}$, $B_{ij}$ is the $j$-th connection association server ID corresponding to $A_i$, $j = 1, 2, \ldots, n(i)$, and $n(i)$ is the number of connection association server IDs corresponding to $A_i$; that is, $B$ is the connection association server ID set of the server cluster;
generating the connection association diagram of the server cluster according to the connection association server ID set of the server cluster, wherein the connection association diagram of the server cluster is a tree-structured association diagram comprising a connection association root node and $s$ layers of connection association leaf nodes, the number of leaf nodes differing from layer to layer, and wherein the generating further comprises the following steps:
S221, obtaining the connection association server ID number list $n = \{n(1), n(2), \ldots, n(i), \ldots, n(m)\}$ corresponding to $A$;
S222, determining the connection association root node according to $n$, wherein the connection association root node is $A_i$ when $n(i)$ alone is the minimum number of associated server IDs in $n$, and wherein step S222 further determines the connection association root node by the following steps:
S2221, acquiring, according to $n$, a first intermediate server ID set $C = \{C_1, C_2, \ldots, C_x, \ldots, C_p\}$, where $C_x$ is the $x$-th first intermediate server ID, $x = 1, 2, \ldots, p$, $p$ is the number of first intermediate server IDs, and a first intermediate server ID is a server ID corresponding to the minimum value in $n$;
S2222, obtaining from $B$ the associated server ID number list $z = \{z(1), z(2), \ldots, z(x), \ldots, z(p)\}$ corresponding to $C$, where $z(x)$ is the number of associated server IDs corresponding to $C_x$;
S2223, when any $z(x)$ is the minimum number of associated server IDs in $z$, determining $C_x$ as the connection association root node;
S223, determining all leaf nodes $D = \{D_1, D_2, \ldots, D_r, \ldots, D_s\}$ according to the connection association root node, where $D_r = \{D_{r1}, D_{r2}, \ldots, D_{ry}, \ldots, D_{rq(r)}\}$, $D_{ry}$ is the $y$-th leaf node in the $r$-th layer, $r = 1, 2, \ldots, s$, $y = 1, 2, \ldots, q(r)$, and $q(r)$ is the number of leaf nodes in the $r$-th layer; it can be understood that $D_{ry}$ is any server ID in $A$, other than those in the server ID list corresponding to $D_{r-1}$ and the server ID corresponding to the connection association root node, whose number of associated server IDs is not greater than the preset first server ID number threshold; step S223 further determines $D_{1y}$ by the following steps:
S2231, obtaining a second intermediate server ID list $U = \{U_1, U_2, \ldots, U_g, \ldots, U_v\}$ corresponding to the connection association root node, where $U_g$ is the $g$-th second intermediate server ID corresponding to the connection association root node, $g = 1, 2, \ldots, v$, $v$ is the number of second intermediate server IDs corresponding to the connection association root node, and a second intermediate server ID is an associated server ID corresponding to the connection association root node, obtained from $B$;
S2232, obtaining the number of associated server IDs corresponding to each $U_g$, and taking each $U_g$ whose number of associated server IDs is not greater than the preset first server ID number threshold as $D_{1y}$;
acquiring current attribute information of the server cluster;
and generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
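Steps S211 to S2232 of claim 1 can be illustrated with a short sketch. This is a hedged reading, not the claimed implementation: the set $B$ is assumed to be a mapping from each server ID to the IDs it is connected to, the claim does not spell out how the count list $z$ differs from $n$, so both are read from $B$ here, and taking the first candidate in sorted order when several share the minimum is an assumption. The function names and the example cluster are hypothetical.

```python
from typing import Dict, List, Set

Connections = Dict[str, Set[str]]  # B: server ID -> IDs of connected servers


def pick_root(b: Connections) -> str:
    """S221 to S2223: choose the server with the fewest connected servers."""
    a = sorted(b)                                # S211: list A (sorted for determinism)
    n = {sid: len(b[sid]) for sid in a}          # S221: count list n
    fewest = min(n.values())
    c = [sid for sid in a if n[sid] == fewest]   # S2221: first intermediate IDs C
    if len(c) == 1:                              # S222: a unique minimum is the root
        return c[0]
    z = {sid: len(b[sid]) for sid in c}          # S2222: counts z, read from B
    smallest = min(z.values())
    # S2223: any C_x with minimal z(x) qualifies; picking the first is an assumption.
    return next(sid for sid in c if z[sid] == smallest)


def first_layer_leaves(b: Connections, root: str, threshold: int) -> List[str]:
    """S2231 and S2232: keep the root's neighbours whose count is <= threshold."""
    u = sorted(b[root])                          # S2231: second intermediate IDs U
    return [sid for sid in u if len(b[sid]) <= threshold]  # S2232: layer D_1


# Hypothetical cluster: s1-s2, s2-s3, s3-s4, s3-s5.
cluster: Connections = {
    "s1": {"s2"},
    "s2": {"s1", "s3"},
    "s3": {"s2", "s4", "s5"},
    "s4": {"s3"},
    "s5": {"s3"},
}
root = pick_root(cluster)
leaves = first_layer_leaves(cluster, root, threshold=2)
print(root, leaves)
```

In this example s1, s4, and s5 all have a single connection, so the tie-breaking path is exercised; the sketch stops at the first leaf layer because the claim only details how $D_{1y}$ is obtained.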
2. The method for fault monitoring of a server cluster according to claim 1, wherein the server cluster comprises a number of servers.
3. The method for fault monitoring of a server cluster as claimed in claim 2, wherein,
the current attribute information of the server cluster includes current attribute information of each server, wherein the current attribute information of each server includes: current hardware state information of the server, current network state information of the server, and current software state information of the server.
4. The method for fault monitoring of a server cluster according to claim 3, wherein the generating of a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster, further comprises the steps of:
determining a current fault label vector corresponding to each server according to the current attribute information corresponding to each server in the server cluster;
generating a fault state diagram of the server cluster according to the current fault label vector of each server and the connection association diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
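The two steps of claim 4 can be sketched as follows. The claims do not say how a fault label vector is derived from the attribute information, so this sketch assumes the simplest possible encoding: one 0/1 flag each for hardware, network, and software state. The tree is assumed to be given as a parent-to-children mapping, and `ServerAttributes`, `fault_label_vector`, `fault_state_diagram`, and `faulty_servers` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

LabelVector = Tuple[int, int, int]  # (hardware, network, software); 1 marks a fault


@dataclass
class ServerAttributes:
    """Current attribute information of one server, reduced to three flags."""
    hardware_ok: bool
    network_ok: bool
    software_ok: bool


def fault_label_vector(attrs: ServerAttributes) -> LabelVector:
    """Encode the three state categories as a 0/1 vector (assumed encoding)."""
    return (
        int(not attrs.hardware_ok),
        int(not attrs.network_ok),
        int(not attrs.software_ok),
    )


def fault_state_diagram(
    tree: Dict[str, List[str]], attrs: Dict[str, ServerAttributes]
) -> Dict[str, dict]:
    """Attach each server's label vector to its node in the connection tree."""
    return {
        sid: {"children": list(children), "fault": fault_label_vector(attrs[sid])}
        for sid, children in tree.items()
    }


def faulty_servers(diagram: Dict[str, dict]) -> List[str]:
    """Monitoring step: list every server with at least one fault flag set."""
    return sorted(sid for sid, node in diagram.items() if any(node["fault"]))


# Hypothetical tree s1 -> s2 -> s3 in which s2 has a network fault.
tree = {"s1": ["s2"], "s2": ["s3"], "s3": []}
attrs = {
    "s1": ServerAttributes(True, True, True),
    "s2": ServerAttributes(True, False, True),
    "s3": ServerAttributes(True, True, True),
}
diagram = fault_state_diagram(tree, attrs)
print(faulty_servers(diagram))
```

Keeping the label vector on the tree node means a faulty server can be read together with its children, which is one plausible use of combining the two inputs named in the claim.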
5. A fault monitoring device for a server cluster, the device comprising:
the first acquisition module is used for acquiring the relation information of the server cluster, wherein the relation information of the server cluster comprises relation information of a plurality of servers, and the relation information of each server refers to a communication connection relation between any server and other servers except the server;
the first generation module is configured to process relationship information of the server cluster, and generate a connection association graph of the server cluster, where the first generation module includes:
a first determining module, configured to determine a connection association server ID set of the server cluster according to the relation information of the server cluster, the first determining module comprising:
a server ID list obtaining module 211, configured to obtain a server ID list $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the server cluster, where $A_i$ is the $i$-th server ID, $i = 1, 2, \ldots, m$, and $m$ is the number of server IDs corresponding to the server cluster;
a connection association server ID set acquisition module 212, configured to acquire, according to the relation information of the server cluster corresponding to $A$, the connection association server ID set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i = \{B_{i1}, B_{i2}, \ldots, B_{ij}, \ldots, B_{in(i)}\}$, $B_{ij}$ is the $j$-th connection association server ID corresponding to $A_i$, $j = 1, 2, \ldots, n(i)$, and $n(i)$ is the number of connection association server IDs corresponding to $A_i$; that is, $B$ is the connection association server ID set of the server cluster;
a first graph generation module, configured to generate the connection association graph of the server cluster according to the connection association server ID set of the server cluster, wherein the connection association graph of the server cluster is a tree-structured association graph comprising a connection association root node and $s$ layers of connection association leaf nodes, the number of leaf nodes differing from layer to layer, the first graph generation module comprising:
a connection association server ID number list obtaining module 221, configured to obtain the connection association server ID number list $n = \{n(1), n(2), \ldots, n(i), \ldots, n(m)\}$ corresponding to $A$;
a root node determining module 222, configured to determine the connection association root node according to $n$, wherein the connection association root node is $A_i$ when $n(i)$ alone is the minimum number of associated server IDs in $n$, the root node determining module 222 comprising:
a first intermediate server ID set acquisition module 2221, configured to acquire, according to $n$, a first intermediate server ID set $C = \{C_1, C_2, \ldots, C_x, \ldots, C_p\}$, where $C_x$ is the $x$-th first intermediate server ID, $x = 1, 2, \ldots, p$, $p$ is the number of first intermediate server IDs, and a first intermediate server ID is a server ID corresponding to the minimum value in $n$;
a first execution module 2222, configured to obtain from $B$ the associated server ID number list $z = \{z(1), z(2), \ldots, z(x), \ldots, z(p)\}$ corresponding to $C$, where $z(x)$ is the number of associated server IDs corresponding to $C_x$;
a second execution module 2223, configured to determine $C_x$ as the connection association root node when any $z(x)$ is the minimum number of associated server IDs in $z$;
a leaf node determining module 223, configured to determine all leaf nodes $D = \{D_1, D_2, \ldots, D_r, \ldots, D_s\}$ according to the connection association root node, where $D_r = \{D_{r1}, D_{r2}, \ldots, D_{ry}, \ldots, D_{rq(r)}\}$, $D_{ry}$ is the $y$-th leaf node in the $r$-th layer, $r = 1, 2, \ldots, s$, $y = 1, 2, \ldots, q(r)$, and $q(r)$ is the number of leaf nodes in the $r$-th layer; it can be understood that $D_{ry}$ is any server ID in $A$, other than those in the server ID list corresponding to $D_{r-1}$ and the server ID corresponding to the connection association root node, whose number of associated server IDs is not greater than the preset first server ID number threshold; the leaf node determining module 223 comprising:
a third execution module 2231, configured to obtain a second intermediate server ID list $U = \{U_1, U_2, \ldots, U_g, \ldots, U_v\}$ corresponding to the connection association root node, where $U_g$ is the $g$-th second intermediate server ID corresponding to the connection association root node, $g = 1, 2, \ldots, v$, and $v$ is the number of second intermediate server IDs corresponding to the connection association root node;
a fourth execution module 2232, configured to obtain the number of associated server IDs corresponding to each $U_g$, and to take each $U_g$ whose number of associated server IDs is not greater than the preset first server ID number threshold as $D_{1y}$;
the second acquisition module is used for acquiring the current attribute information of the server cluster;
and the second generation module is used for generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that the fault of the server cluster is monitored according to the fault state diagram of the server cluster.
6. The fault monitoring device of a server cluster according to claim 5, wherein the server cluster comprises a number of servers.
7. The fault monitoring device of a server cluster according to claim 6, wherein the current attribute information of the server cluster includes current attribute information of each server, wherein the current attribute information of each server includes: current hardware state information of the server, current network state information of the server, and current software state information of the server.
8. The fault monitoring device of a server cluster according to claim 7, wherein the second generation module includes:
the second determining module is used for determining a current fault label vector corresponding to each server according to the current attribute information corresponding to each server in the server cluster;
and the second graph generating module is used for generating a fault state graph of the server cluster according to the current fault label vector of each server and the connection association graph of the server cluster, so that faults of the server cluster are monitored according to the fault state graph of the server cluster.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements a fault monitoring method of a server cluster according to any of claims 1-4 when the computer program is executed.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method for fault monitoring of a server cluster according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311654834.XA CN117349128B (en) | 2023-12-05 | 2023-12-05 | Fault monitoring method, device and equipment of server cluster and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117349128A CN117349128A (en) | 2024-01-05 |
CN117349128B CN117349128B (en) | 2024-03-22 |
Family
ID=89365340
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311654834.XA Active CN117349128B (en) | 2023-12-05 | 2023-12-05 | Fault monitoring method, device and equipment of server cluster and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117349128B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111752759A (en) * | 2020-06-30 | 2020-10-09 | 重庆紫光华山智安科技有限公司 | Kafka cluster fault recovery method, device, equipment and medium |
CN111984498A (en) * | 2020-07-24 | 2020-11-24 | 华东计算技术研究所(中国电子科技集团公司第三十二研究所) | Server cluster monitoring and management system |
CN112202617A (en) * | 2020-10-09 | 2021-01-08 | 腾讯科技(深圳)有限公司 | Resource management system monitoring method and device, computer equipment and storage medium |
CN113609139A (en) * | 2021-09-30 | 2021-11-05 | 苏州浪潮智能科技有限公司 | Monitoring data management method and device, electronic equipment and storage medium |
CN114064438A (en) * | 2021-11-24 | 2022-02-18 | 建信金融科技有限责任公司 | Database fault processing method and device |
CN115643158A (en) * | 2022-10-25 | 2023-01-24 | 平安银行股份有限公司 | Equipment cluster repairing method, device, equipment and storage medium |
CN115643163A (en) * | 2022-11-03 | 2023-01-24 | 平安科技(深圳)有限公司 | Fault equipment positioning method, device, equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8010833B2 (en) * | 2009-01-20 | 2011-08-30 | International Business Machines Corporation | Software application cluster layout pattern |
Non-Patent Citations (1)
Title |
---|
Time synchronization method for cloud computing server clusters; Zhu Li; Command Information System and Technology; 20180828(04); full text *
Also Published As
Publication number | Publication date |
---|---|
CN117349128A (en) | 2024-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109472372B (en) | Resource data distribution method and device based on leasing equipment and computer equipment | |
CN108874968A (en) | Risk management data processing method, device, computer equipment and storage medium | |
CN108509433A (en) | The method, apparatus and electronic equipment of formation sequence number based on distributed system | |
CN109325026B (en) | Data processing method, device, equipment and medium based on big data platform | |
CN104424287B (en) | Data query method and apparatus | |
CN106936622A (en) | A kind of distributed memory system upgrade method and device | |
CN113065912A (en) | Method, apparatus, device and medium for monitoring orders with unsynchronized order states | |
CN112351119A (en) | Probability-based block chain transaction originating IP address determination method and device | |
CN112737800A (en) | Service node fault positioning method, call chain generation method and server | |
CN117349128B (en) | Fault monitoring method, device and equipment of server cluster and storage medium | |
US20160019296A1 (en) | Fingerprint-based configuration typing and classification | |
CN117201515A (en) | Bank financial information synchronization method, system, equipment and storage medium | |
CN110599267A (en) | Electronic invoice billing method and device, computer readable storage medium and computer equipment | |
CN110188081B (en) | Log data storage method and device based on cassandra database and computer equipment | |
CN111966461A (en) | Virtual machine cluster node guarding method, device, equipment and storage medium | |
CN112465048A (en) | Deep learning model training method, device, equipment and storage medium | |
CN110716101B (en) | Power line fault positioning method and device, computer and storage medium | |
CN114584453B (en) | Fault analysis method and device for application system | |
CN108959486B (en) | Audit field information acquisition method and device, computer equipment and storage medium | |
CN114564349A (en) | Server monitoring method and device, electronic equipment and storage medium | |
CN114513498A (en) | File transmission checking method and device, computer equipment and storage medium | |
CN112231142A (en) | System backup recovery method and device, computer equipment and storage medium | |
CN114428704A (en) | Method and device for full-link distributed monitoring, computer equipment and storage medium | |
CN112668730A (en) | Self-service equipment module replacement method and device, computer equipment and storage medium | |
CN113010120B (en) | Method for realizing distributed storage of voice data in round robin mode |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||