CN106301853A - Method and device for detecting node faults in a cluster system - Google Patents
- Publication number
- CN106301853A, CN201510306800.0A
- Authority
- CN
- China
- Prior art keywords
- node
- nodal point
- neighbor nodes
- neighbor
- heartbeat message
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Cardiology (AREA)
- General Health & Medical Sciences (AREA)
- Hardware Redundancy (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
Embodiments of the present invention provide a method and device for detecting node faults in a cluster system. The method includes: a first node determines whether it has received, within a preset time, a first heartbeat message sent by a second node, where the first node is a neighbor node of the second node and the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes; if the first node has not received the heartbeat message sent by the second node, the first node sends a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node; the first node receives response messages, each carrying a reception status, from the other neighbor nodes; and if the first node determines from the reception statuses that none of the other neighbor nodes received the heartbeat message, the first node determines that the second node has failed. The method and device provided by the embodiments of the present invention can improve the efficiency of node fault detection.
Description
Technical field
Embodiments of the present invention relate to communication technologies, and in particular to a method and device for detecting node faults in a cluster system.
Background
A distributed cluster system usually includes one central node and multiple ordinary nodes. A failure of the central node or of an ordinary node severely affects the reliability of the distributed cluster system, so detecting node faults effectively is very important.
FIG. 1 is a schematic diagram of a node fault detection method in the prior art. As shown in FIG. 1, the ordinary nodes (B, C, D, E) send heartbeat messages to the central node (M) once per heartbeat period, and the central node (M) determines whether an ordinary node has failed according to the consecutive heartbeat messages it receives within a detection period, where one detection period may contain multiple heartbeat periods. The central node (M) may also periodically send heartbeat messages to the ordinary nodes (B, C, D, E) to inform them of the role it holds and whether it is operating normally. If the ordinary nodes (B, C, D, E) do not receive a heartbeat message from the central node (M) within the detection period, they conclude that the central node (M) has failed and initiate re-election of the central node. If the election succeeds, the ordinary nodes learn of the new central node and send their heartbeat messages to it, and the cluster resumes fault detection.
However, in the prior art, where a node fault is detected by checking whether heartbeat messages are received within the detection period, the heartbeat period for sending heartbeat messages cannot be changed once the cluster size is fixed, and therefore the length of the detection period cannot be changed either. A node fault can thus only be detected after multiple heartbeat periods, which makes the fault detection cycle long and node fault detection inefficient.
Summary of the invention
Embodiments of the present invention provide a method and device for detecting node faults in a cluster system, to solve the prior-art problem that a node fault can only be detected after multiple heartbeat periods, which makes the fault detection cycle long, and thereby to improve the efficiency of node fault detection.
In a first aspect, an embodiment of the present invention provides a method for detecting node faults in a cluster system, including:
A first node determines whether it has received, within a preset time, a first heartbeat message sent by a second node; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, and the second node has two or more neighbor nodes; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
If the first node has not received the first heartbeat message sent by the second node, the first node sends a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node; the request message asks the other neighbor nodes whether they have received the first heartbeat message;
The first node receives response messages carrying a reception status from the other neighbor nodes, where the reception status indicates whether the first heartbeat message was received;
If the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node has failed.
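The steps of the first aspect can be sketched in Python as follows. This is only an illustrative sketch, not the claimed implementation: the names `Verdict`, `check_second_node` and the `ask` callback (standing in for the request/response exchange) are assumptions introduced here, and the transport used to deliver messages is left out.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class Verdict(Enum):
    HEALTHY = "healthy"          # the heartbeat arrived within the preset time
    NODE_FAILED = "node_failed"  # no neighbor received the heartbeat
    UNDECIDED = "undecided"      # at least one other neighbor did receive it


def check_second_node(
    received_in_time: bool,
    other_neighbors: Iterable[str],
    ask: Callable[[str], bool],
) -> Verdict:
    """Run the first node's check for one heartbeat of the second node.

    received_in_time: whether the first heartbeat arrived within the preset time.
    other_neighbors: the second node's neighbors, excluding the first node.
    ask: hypothetical request/response round trip; returns the reception
         status reported by one neighbor (True means it got the heartbeat).
    """
    if received_in_time:
        return Verdict.HEALTHY
    neighbors = list(other_neighbors)
    if not neighbors:
        # The method assumes the second node has at least one other neighbor.
        raise ValueError("at least one other neighbor node is required")
    # Send a request to every other neighbor and collect the reception statuses.
    statuses: Dict[str, bool] = {name: ask(name) for name in neighbors}
    if not any(statuses.values()):
        return Verdict.NODE_FAILED
    return Verdict.UNDECIDED
```

For example, with other neighbors `["C", "D"]` that both report a missing heartbeat, the first node would conclude that the second node has failed; if `D` reports that it did receive the heartbeat, the node itself is not declared failed.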
With reference to the first aspect, in a first possible implementation of the first aspect, after the first node determines that the second node has failed, the method further includes:
The first node generates first voting information and receives second voting information sent by each of the other neighbor nodes; the first voting information includes the node identifier of the node that the first node votes for, and the second voting information includes the node identifier of the node voted for by the neighbor node that sent it;
The first node counts, from the node identifier in the first voting information and the node identifiers in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each candidate node, and takes the node with the most votes as a third node; the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; the neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
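The vote counting described above might look like the following Python sketch. The function name `elect_third_node` is hypothetical, and the tie-break rule (the smallest node identifier wins) is an assumption added only to make the example deterministic; the text does not say how ties are resolved.

```python
from collections import Counter
from typing import Iterable


def elect_third_node(first_vote: str, second_votes: Iterable[str]) -> str:
    """Return the node identifier that received the most votes.

    first_vote: node identifier in the first node's own voting information.
    second_votes: node identifiers from the other neighbors' voting information.
    """
    tally = Counter([first_vote, *second_votes])
    top = max(tally.values())
    # Assumption: on a tie, pick the smallest identifier (not stated in the source).
    return min(node for node, count in tally.items() if count == top)
```

With the first node voting for `"E"` and two neighbors voting for `"D"`, node `"D"` would be taken as the third node.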
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the method further includes:
If the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between the second node and each node that did not receive the first heartbeat message has failed; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
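The distinction between a node fault and a link fault can be illustrated with a short Python sketch. The function name `faulty_links` and the representation of a link as a `(second_node, neighbor)` pair are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def faulty_links(
    second_node: str, first_node: str, statuses: Dict[str, bool]
) -> List[Tuple[str, str]]:
    """List the links judged faulty when the first node missed the heartbeat.

    statuses maps each other neighbor to its reported reception status
    (True means that neighbor received the first heartbeat message).
    """
    if not any(statuses.values()):
        # Nobody received the heartbeat: this is the node-fault case, not a link fault.
        return []
    # The first node itself missed the heartbeat, plus every neighbor reporting False.
    missed = [first_node] + [name for name, got in statuses.items() if not got]
    return [(second_node, name) for name in missed]
```

If the first node `A` missed the heartbeat from `B`, neighbor `C` received it and neighbor `D` did not, the links `B`-`A` and `B`-`D` would be reported as faulty while node `B` is still considered alive.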
With reference to the first aspect or either of the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect, the method further includes:
The first node re-determines its own neighbor nodes according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.
In a second aspect, an embodiment of the present invention provides a method for detecting node faults in a cluster system, the method including:
A second node sends a first heartbeat message in parallel to a first node and to other neighbor nodes; the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there is at least one other neighbor node;
The first node determines whether it has received the first heartbeat message within a preset time; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
If the first node has not received the first heartbeat message, the first node sends a request message to each of the other neighbor nodes; the request message asks each of the other neighbor nodes whether it has received the first heartbeat message;
The first node receives, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received;
If the first node determines, from the reception statuses carried in the received response messages, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node has failed.
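The second node's parallel sending step could be sketched in Python with a thread pool. This is one possible reading of "in parallel", not the claimed mechanism: the function name `send_heartbeat_parallel` and the `send` callback (a stand-in for whatever transport delivers one heartbeat to one neighbor) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def send_heartbeat_parallel(
    neighbors: Sequence[str], send: Callable[[str], bool]
) -> Dict[str, bool]:
    """Send one heartbeat to every neighbor concurrently.

    send: hypothetical transport call; returns True if the send succeeded.
    Returns a mapping from neighbor name to the send result.
    """
    if not neighbors:
        return {}
    # One worker per neighbor, so no send waits behind another one.
    with ThreadPoolExecutor(max_workers=len(neighbors)) as pool:
        results = list(pool.map(send, neighbors))  # results keep input order
    return dict(zip(neighbors, results))
```

Sending to all neighbors at once, rather than one after another, keeps the arrival times of the same heartbeat close together, which is what lets a neighbor that missed it ask the others about the same message.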
With reference to the second aspect, in a first possible implementation of the second aspect, after the first node determines that the second node has failed, the method further includes:
The first node generates first voting information and receives second voting information sent by each of the other neighbor nodes; the first voting information includes the node identifier of the node that the first node votes for, and the second voting information includes the node identifier of the node voted for by the neighbor node that sent it;
The first node counts, from the node identifier in the first voting information and the node identifiers in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each candidate node, and takes the node with the most votes as a third node; the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; the neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the method further includes:
If the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between the second node and each node that did not receive the first heartbeat message has failed; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
With reference to the second aspect or either of the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect, the method further includes:
The first node re-determines its own neighbor nodes according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.
In a third aspect, an embodiment of the present invention provides a device for detecting node faults in a cluster system, including:
A judging module, configured to determine whether a first heartbeat message sent by a second node has been received within a preset time; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, and the second node has two or more neighbor nodes; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
A sending module, configured to send, if the judging module determines that a receiving module has not received the first heartbeat message sent by the second node, a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node; the request message asks the other neighbor nodes whether they have received the first heartbeat message;
The receiving module, further configured to receive response messages carrying a reception status from the other neighbor nodes, where the reception status indicates whether the first heartbeat message was received;
A determining module, configured to determine, from the reception status carried in the response message that the receiving module receives from each of the other neighbor nodes, whether none of the other neighbor nodes received the first heartbeat message;
The determining module is further configured to determine that the second node has failed if it determines that none of the other neighbor nodes received the first heartbeat message.
With reference to the third aspect, in a first possible implementation of the third aspect, for use after the determining module determines that the second node has failed, the device further includes:
A generating module, configured to generate first voting information, where the first voting information includes the node identifier of the node that the first node votes for;
The receiving module is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier of the node voted for by the neighbor node that sent it;
The determining module is further configured to count, from the node identifier in the first voting information and the node identifiers in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each candidate node, and to take the node with the most votes as a third node; the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; the neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
With reference to the third aspect or the first possible implementation of the third aspect, in a second possible implementation of the third aspect,
the determining module is further configured to determine, if it determines from the reception status carried in the response message that the receiving module receives from each of the other neighbor nodes that at least one of the other neighbor nodes received the first heartbeat message, that the link between the second node and each node that did not receive the first heartbeat message has failed; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
With reference to the third aspect or either of the first and second possible implementations of the third aspect, in a third possible implementation of the third aspect,
the determining module is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.
In a fourth aspect, an embodiment of the present invention provides a system for detecting node faults in a cluster system, including a first node, a second node, and other neighbor nodes, where the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there is at least one other neighbor node, where:
The second node is configured to send a first heartbeat message in parallel to the first node and the other neighbor nodes;
The first node is configured to determine whether it has received the first heartbeat message within a preset time; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
If the first node has not received the first heartbeat message, the first node is further configured to send a request message to each of the other neighbor nodes, where the request message asks each of the other neighbor nodes whether it has received the first heartbeat message; the first node is also configured to receive, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received;
If the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node is further configured to determine that the second node has failed.
With reference to the fourth aspect, in a first possible implementation of the fourth aspect, after determining that the second node has failed,
the first node is further configured to:
generate first voting information and receive second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node that the first node votes for and the second voting information includes the node identifier of the node voted for by the neighbor node that sent it;
and count, from the node identifier in the first voting information and the node identifiers in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each candidate node, and take the node with the most votes as a third node; the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; the neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
With reference to the fourth aspect or the first possible implementation of the fourth aspect, in a second possible implementation of the fourth aspect,
if the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,
the first node is further configured to determine that the link between the second node and each node that did not receive the first heartbeat message has failed; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
With reference to the fourth aspect or either of the first and second possible implementations of the fourth aspect, in a third possible implementation of the fourth aspect,
the first node is further configured to re-determine its own neighbor nodes according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.
In the method and device for detecting node faults in a cluster system provided by the embodiments of the present invention, a first node determines whether it has received, within a preset time, a first heartbeat message sent by a second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node has not itself received the first heartbeat message, it asks the other neighbor nodes of the second node whether they received it, and if it determines that none of them received it either, it determines that the second node has failed. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with this technical solution avoids the prior-art need to wait for multiple heartbeat periods before a node fault can be detected. This shortens the fault detection cycle and thereby improves the efficiency of node fault detection.
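The constraint on the preset time can be written as a one-line check. The sketch below is illustrative only; the function name `preset_time_is_valid` and the use of seconds as the unit are assumptions.

```python
def preset_time_is_valid(preset_time: float, heartbeat_period: float) -> bool:
    """Check that one heartbeat period <= preset time < two heartbeat periods."""
    if heartbeat_period <= 0:
        raise ValueError("heartbeat_period must be positive")
    return heartbeat_period <= preset_time < 2 * heartbeat_period
```

For a heartbeat period of 1.0 s, a preset time of 1.5 s satisfies the constraint, while 2.0 s does not. By comparison, a prior-art detection period that spans, say, three heartbeat periods would be 3.0 s long; the request/response exchange with the other neighbor nodes adds some further delay that the text does not quantify.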
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a node fault detection method in a cluster system in the prior art;
FIG. 2 is a schematic flowchart of Embodiment 1 of the node fault detection method in a cluster system provided by the present invention;
FIG. 3 is a first schematic diagram of the neighbor relationships between nodes in a cluster system;
FIG. 4 is a second schematic diagram of the neighbor relationships between nodes in a cluster system;
FIG. 5 is a schematic flowchart of Embodiment 2 of the node fault detection method in a cluster system provided by the present invention;
FIG. 6A is a schematic diagram of the neighbor relationships between nodes in a cluster system before a node fault is detected;
FIG. 6B is a schematic diagram of the neighbor relationships between nodes in a cluster system as re-determined after a node fault is detected;
FIG. 7 is a schematic flowchart of Embodiment 3 of the node fault detection method in a cluster system provided by the present invention;
FIG. 8 is a schematic flowchart of Embodiment 4 of the node fault detection method in a cluster system provided by the present invention;
FIG. 9 is a schematic structural diagram of Embodiment 1 of the node fault detection device in a cluster system according to the present invention;
FIG. 10 is a schematic structural diagram of Embodiment 1 of the node fault detection system in a cluster system according to the present invention;
FIG. 11 is a schematic structural diagram of Embodiment 1 of a node according to the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The embodiments of the present invention are applicable to a cluster system, and specifically to the scenario of detecting node faults in a distributed cluster system. The distributed cluster system includes at least two nodes, and a node may be, for example, a computer. Optionally, the nodes in the cluster system of this embodiment differ from those in an existing cluster system in that all nodes are given the same function, that is, all nodes have the same capability to receive and send heartbeat messages. Therefore, in the cluster system of this embodiment there is no distinction between a central node and ordinary nodes, and no central node is needed to manage ordinary nodes. Optionally, the technical solutions of the following embodiments are all described with a computer as the execution body.
FIG. 2 is a schematic flowchart of Embodiment 1 of a fault detection method for a node in a cluster system according to the present invention. The method in this embodiment of the present invention is applicable to a distributed cluster system. This embodiment is described with a computer as the execution body. As shown in FIG. 2, the method of this embodiment may include:
Step 201: A first node determines whether a first heartbeat message sent by a second node is received within a preset time. The first node is a neighbor node of the second node; the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes; the second node has at least two neighbor nodes; and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.
In this embodiment, the second node determines the first node according to the information of all nodes in the cluster system and a rule preset in the cluster system, where the first node is any one of the neighbor nodes of the second node, and a neighbor node of the second node is a node that has an association relationship with the second node. FIG. 3 is a first schematic diagram of the neighbor relationship between nodes in a cluster system. As shown in FIG. 3, node E can determine, according to the information of all nodes and the preset rule, that it has four neighbor nodes: nodes A, B, C and D. The first node may be any one of nodes A, B, C and D. The first node detects whether the second node is faulty by determining whether the first heartbeat message sent by the second node is received within the preset time. It should be noted that the second node sends heartbeat messages to all of its neighbor nodes in parallel; the first heartbeat message is therefore a heartbeat message that the second node sends to each of its neighbor nodes in parallel at the same moment. In addition, the second node may send the first heartbeat message to all of its neighbor nodes in parallel according to the heartbeat period, so the first node can determine whether the first heartbeat message is received within a time that is greater than or equal to one heartbeat period and less than two heartbeat periods. For example, assume that the heartbeat period is 5 s, that is, the second node sends a heartbeat message to all of its neighbor nodes in parallel every 5 s. For the first heartbeat message sent by the second node at the 5th second, the first node determines whether that message is received within a time greater than or equal to 5 s and less than 10 s. The heartbeat period may be set according to experience or actual conditions, and its specific value is not limited in this embodiment.
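The timing rule of step 201 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment; the function names `preset_time_is_valid` and `heartbeat_received`, and the use of plain floating-point timestamps in seconds, are assumptions made for the example.

```python
from typing import Iterable


def preset_time_is_valid(preset_time: float, heartbeat_period: float) -> bool:
    """Return True if preset_time lies in [T, 2T), where T is the heartbeat period."""
    return heartbeat_period <= preset_time < 2 * heartbeat_period


def heartbeat_received(arrival_times: Iterable[float],
                       window_start: float,
                       preset_time: float) -> bool:
    """Return True if any heartbeat arrived in [window_start, window_start + preset_time)."""
    window_end = window_start + preset_time
    return any(window_start <= t < window_end for t in arrival_times)
```

With a heartbeat period of 5 s, a preset time of 7 s satisfies the rule, whereas a preset time of 10 s does not, because it is not less than two heartbeat periods.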
In addition, the second node may periodically send the first heartbeat message to the first node over a single physical network. However, when fault detection relies on a single physical network and that network fails, for example when the management plane network fails while the service plane network is normal, it is often impossible to tell whether the second node has failed, the link between the second node and the first node has failed, or the second node and the first node have both failed, so the detection result is inaccurate. To solve this problem, the first heartbeat message is preferably sent over at least two networks in this embodiment. For example, it may be sent over two planes, such as a management plane and a service plane, or over three planes, such as a management plane, a service plane and a signaling plane. Sending the first heartbeat message over multiple physical networks to detect whether a node is faulty can improve detection accuracy. It should be noted that when there are at least two physical networks, they are isolated from each other. This prevents a failure of equipment shared between the networks from breaking communication between nodes, which helps improve detection accuracy.
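One possible way to combine observations from several isolated planes is sketched below. The description does not state how per-plane results are merged; the rule used here (the heartbeat counts as received if it arrives on at least one plane), the function names and the plane names are all assumptions for illustration only.

```python
from typing import Dict, List


def received_on_any_plane(per_plane: Dict[str, bool]) -> bool:
    """Assumed rule: the heartbeat counts as received if any plane delivered it."""
    return any(per_plane.values())


def failed_planes(per_plane: Dict[str, bool]) -> List[str]:
    """List the planes that did not deliver the heartbeat, sorted by name."""
    return sorted(name for name, ok in per_plane.items() if not ok)
```

Under this assumed rule, a heartbeat that is lost on the management plane but arrives on the service plane indicates a network fault on one plane rather than a fault of the sending node.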
Step 202: If the first node does not receive the first heartbeat message sent by the second node, the first node sends a request message to the neighbor nodes of the second node other than the first node, where the request message is used to ask the other neighbor nodes whether they have received the first heartbeat message.
In the prior art, when the period of the heartbeats sent from ordinary nodes to the central node is fixed, the performance of the central node limits how many ordinary nodes can be added, which affects the scalability of the cluster system. To address this problem, in this embodiment of the present invention, if the first node does not receive the first heartbeat message sent by the second node within the preset time, it can preliminarily determine that the second node may be faulty. Because the second node sends the first heartbeat message to all of its neighbor nodes in parallel, the first node sends a request message to the neighbor nodes of the second node other than itself, to ask whether they have received the first heartbeat message sent by the second node. It can be seen that when the first node does not receive the first heartbeat message, it can send request messages to the other neighbor nodes of the second node; moreover, nodes that are not neighbors of the second node do not send heartbeat messages to the second node. This reduces the number of heartbeat messages that the second node has to process, lightens the load on the second node, and gives the cluster system better scalability.
For example, FIG. 4 is a second schematic diagram of the neighbor relationship between nodes in a cluster system. As shown in FIG. 4, the neighbor nodes of node E are X, A, D, C and G, and node E sends a heartbeat message to all of its neighbor nodes X, A, D, C and G in each heartbeat period. Assume that node E is the second node and node A is the first node. If, in a certain heartbeat period, the first node A does not receive the first heartbeat message sent by the second node E, the first node A sends a request message to the other neighbor nodes X, D, C and G to ask whether they have received the first heartbeat message.
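The selection of request targets in the FIG. 4 example can be sketched as follows. The dictionary-of-sets representation of the neighbor relationship and the function name `query_targets` are assumptions introduced for illustration; only the neighbor set of node E is taken from the description.

```python
from typing import Dict, List, Set


def query_targets(neighbors: Dict[str, Set[str]], first: str, second: str) -> List[str]:
    """Return the neighbors of `second` other than `first`, sorted for determinism."""
    if first not in neighbors[second]:
        raise ValueError(f"{first} is not a neighbor of {second}")
    return sorted(neighbors[second] - {first})


# Neighbor set of node E as described for FIG. 4.
fig4 = {"E": {"X", "A", "D", "C", "G"}}
print(query_targets(fig4, first="A", second="E"))  # ['C', 'D', 'G', 'X']
```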
Step 203: The first node receives response messages that are sent by the other neighbor nodes and carry a reception status, where the reception status indicates whether the first heartbeat message was received.
In this embodiment, after receiving the request message sent by the first node, each other neighbor node carries, in a response message, the reception status indicating whether it has received the first heartbeat message, and sends the response message to the first node.
Step 204: If the first node determines, according to the reception status carried in the response message received from each other neighbor node, that none of the other neighbor nodes has received the first heartbeat message, the first node determines that the second node is faulty.
In this embodiment, after receiving the request message sent by the first node, each other neighbor node returns to the first node a response message carrying the reception status. According to the response messages received from the other neighbor nodes, the first node determines whether the other neighbor nodes have received the first heartbeat message. If it determines that none of the other neighbor nodes has received the first heartbeat message sent by the second node, it can determine that the second node is faulty.
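The decision of steps 203 and 204 can be sketched as follows. The `Response` record and the function name `second_node_failed` are hypothetical; the description specifies only that each response message carries a reception status.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Response:
    sender: str     # identifier of the neighbor node that answers the request
    received: bool  # reception status: whether it got the first heartbeat message


def second_node_failed(responses: Iterable[Response]) -> bool:
    """Step 204: judge the second node faulty only if no other neighbor received the heartbeat."""
    collected = list(responses)
    if not collected:
        # The method assumes at least one other neighbor node exists.
        raise ValueError("at least one other neighbor node must respond")
    return not any(r.received for r in collected)
```

A single positive reception status is enough to rule out a fault of the second node, because it shows that the second node did send the heartbeat.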
It should be noted that the neighbor relationship between nodes is bidirectional, that is, nodes in a neighbor relationship can send heartbeat messages to each other. Therefore, every neighbor node of the second node independently performs steps 201 to 204.
In the fault detection method for a node in a cluster system provided in this embodiment of the present invention, the first node determines whether the first heartbeat message sent by the second node is received within the preset time, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has at least two neighbor nodes, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself does not receive the first heartbeat message, it asks the other neighbor nodes of the second node whether they have received it, and if none of them has, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention avoids the need, as in the prior art, to wait for multiple heartbeat periods before a node fault can be detected. This shortens the fault detection cycle and improves the efficiency of node fault detection.
FIG. 5 is a schematic flowchart of Embodiment 2 of a fault detection method for a node in a cluster system according to the present invention. On the basis of the embodiment shown in FIG. 2, this embodiment describes in detail how each node re-determines its neighbor nodes after the first node determines that the second node is faulty. As shown in FIG. 5, the method of this embodiment may include:
Step 501: The first node generates first voting information and receives second voting information sent by each other neighbor node, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sends the second voting information.
In this embodiment, after the neighbor nodes of the second node determine that the second node is faulty, all of these neighbor nodes need to recalculate their respective neighbor nodes. For ease of description, any neighbor node of the second node may be taken as the first node. The first node needs to generate first voting information, which includes the node identifier of the node elected by the first node and the voting basis. In addition, the first node receives second voting information sent by each other neighbor node, which includes the node identifier of the node elected by the sending neighbor node and the voting basis. In practical applications, the voting basis depends on several factors, such as the load, the node number, the freshness of the node cache and the network bandwidth of the node. For example, the first node may determine which node carries the smallest load, carry the node identifier of that node in the first voting information, and send it to the other neighbor nodes. Similarly, the other neighbor nodes may send their second voting information to the first node in the same manner.
Step 502: According to the node identifier in the first voting information and the node identifier in the second voting information sent by each other neighbor node, the first node counts the number of votes obtained by each of the elected nodes and takes the node with the most votes as a third node. The third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, where all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
In this embodiment, after receiving the second voting information sent by each other neighbor node, the first node can determine the third node according to the node identifier in the first voting information that it generated and the node identifiers in the received second voting information. In a specific implementation, the number of votes obtained by each elected node may be counted by election, according to the node identifiers carried in the first voting information and the second voting information, and the node with the most votes is taken as the third node. The third node takes over the neighbor nodes of the faulty second node, that is, it takes over the association relationships between the second node and the other nodes. The third node therefore replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, where all neighbor nodes of the third node include the neighbor nodes of the second node in addition to the third node's own neighbor nodes.
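The vote counting of steps 501 and 502 can be sketched as follows. The description does not say how ties are resolved, so breaking ties by the smallest node identifier is an assumption, as are the function names `cast_vote` and `elect_third_node`; voting for the least loaded node follows one of the voting bases mentioned above.

```python
from collections import Counter
from typing import Dict


def cast_vote(loads: Dict[str, float]) -> str:
    """Vote for the candidate with the smallest load (ties: smallest identifier, assumed)."""
    return min(sorted(loads), key=loads.get)


def elect_third_node(first_vote: str, second_votes: Dict[str, str]) -> str:
    """Count the first node's vote and every neighbor's vote; the most-voted node wins.

    second_votes maps each other neighbor node to the node identifier it elected.
    """
    tally = Counter([first_vote, *second_votes.values()])
    most = max(tally.values())
    # Assumed tie-break: the smallest node identifier among the top-voted nodes.
    return min(node for node, count in tally.items() if count == most)


votes = {"X": "A", "D": "C", "C": "A", "G": "C"}
print(elect_third_node("A", votes))  # A
```

Because every neighbor node applies the same deterministic rule to the same set of votes, all of them arrive at the same third node without a coordinator.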
Step 503: The first node re-determines its own neighbor nodes according to the neighbor nodes of the third node and the nodes, among the other neighbor nodes, other than the third node.
In this embodiment, after all neighbor nodes of the second node determine the third node by election, if the first node is the third node, the first node takes over the neighbor relationships of the second node, and the other neighbor nodes may recalculate their respective neighbor nodes according to the neighbor relationships that result after the first node takes over the neighbor nodes of the second node. If the first node is not the third node, the first node waits until the third node has re-determined its neighbor relationships, and then re-determines its own neighbor nodes according to the neighbor nodes of the third node and the nodes, among the other neighbor nodes, other than the third node.
For example, FIG. 6A is a schematic diagram of the neighbor relationship between nodes in a cluster system before a node fault is detected, and FIG. 6B is a schematic diagram of the neighbor relationship as re-determined after a node fault is detected. As shown in FIG. 6A, assume that node E is the second node and node A is the first node. After the first node A determines that the second node E is faulty, the first node A generates first voting information and receives the second voting information sent by nodes X, D, C and G. The first node A determines the third node according to the node identifier in the first voting information and the node identifiers in the second voting information, so that the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. As shown in FIG. 6B, if the election determines that the first node A is the third node, the first node A replaces the second node and sends heartbeat messages in parallel to all of its neighbor nodes. In this case, the first node A needs to re-determine its own neighbor nodes on the basis of the other neighbor nodes X, D, C and G, and nodes X, D, C and G wait until the first node A has determined its neighbor nodes and then re-determine their respective neighbor nodes according to the neighbor nodes determined by the first node A.
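The takeover described for FIG. 6A and FIG. 6B can be sketched as follows. The exact topology of the figures is not reproduced in the text, so the star-shaped neighbor map below is an assumption, as is the simplification that each inherited node merely links to the successor instead of recalculating its neighbors by the preset rule; `take_over_neighbors` is a hypothetical name.

```python
from typing import Dict, Set


def take_over_neighbors(neighbors: Dict[str, Set[str]],
                        failed: str,
                        successor: str) -> Dict[str, Set[str]]:
    """Return a new neighbor map in which `successor` inherits the neighbors of `failed`."""
    # Copy the map without the failed node and drop every reference to it.
    updated = {node: set(adj) - {failed}
               for node, adj in neighbors.items() if node != failed}
    inherited = set(neighbors[failed]) - {successor}
    # The successor's neighbors are its own neighbors plus those of the failed node.
    updated.setdefault(successor, set()).update(inherited)
    # The neighbor relationship is bidirectional, so each inherited node gains the successor.
    for node in inherited:
        updated.setdefault(node, set()).add(successor)
    return updated


before = {"E": {"X", "A", "D", "C", "G"},
          "A": {"E"}, "X": {"E"}, "D": {"E"}, "C": {"E"}, "G": {"E"}}
after = take_over_neighbors(before, failed="E", successor="A")
print(sorted(after["A"]))  # ['C', 'D', 'G', 'X']
```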
In the fault detection method for a node in a cluster system provided in this embodiment of the present invention, the first node determines whether the first heartbeat message sent by the second node is received within the preset time, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has at least two neighbor nodes, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself does not receive the first heartbeat message, it asks the other neighbor nodes of the second node whether they have received it, and if none of them has, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention avoids the need, as in the prior art, to wait for multiple heartbeat periods before a node fault can be detected. This shortens the fault detection cycle and improves the efficiency of node fault detection. In addition, after the second node is determined to be faulty, the nodes re-determine their respective neighbor nodes and then continue fault detection, which improves the accuracy of fault detection.
Optionally, if the first node determines, according to the reception status carried in the response message received from each other neighbor node, that at least one other neighbor node has received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty.
Specifically, after the first node fails to receive the first heartbeat message sent by the second node and sends a request message to each other neighbor node to ask whether it has received the first heartbeat message, if the first node determines from the response messages that at least one other neighbor node has received the first heartbeat message, the first node can determine that the second node is normal. The fault may instead lie in the link between the second node and the first node, and in the link between the second node and any other node that did not receive the first heartbeat message. Here, the nodes that did not receive the first heartbeat message include the first node and those other neighbor nodes that did not receive it.
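The link-fault determination can be sketched as follows, from the point of view of a first node that has already missed the heartbeat itself. Representing a link as a `(node, second)` tuple and the function name `faulty_links` are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def faulty_links(first: str, second: str, statuses: Dict[str, bool]) -> List[Tuple[str, str]]:
    """Return the (node, second) links judged faulty by the first node.

    statuses maps each other neighbor node to its reception status. An empty
    result means no neighbor received the heartbeat, which is treated as a
    fault of the second node itself rather than of any link.
    """
    if not any(statuses.values()):
        return []
    missed = [first] + [node for node, got in statuses.items() if not got]
    return [(node, second) for node in sorted(missed)]


print(faulty_links("A", "E", {"X": True, "D": False, "C": True, "G": True}))
# [('A', 'E'), ('D', 'E')]
```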
In the fault detection method for a node in a cluster system provided in this embodiment of the present invention, when the first node determines that at least one other neighbor node has received the first heartbeat message, it determines that the link between each node that did not receive the first heartbeat message and the second node is faulty, which makes fault detection more comprehensive.
FIG. 7 is a schematic flowchart of Embodiment 3 of a fault detection method for a node in a cluster system according to the present invention. The method in this embodiment of the present invention is applicable to a distributed cluster system. This embodiment is again described with a computer as the execution body. As shown in FIG. 7, the method of this embodiment may include:
Step 701: The second node sends a first heartbeat message in parallel to the first node and the other neighbor nodes, where the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node other than the first node, and there is at least one other neighbor node.
In this embodiment, the second node may determine all of its neighbor nodes according to the information of the nodes contained in the cluster system and a rule preset in the cluster system, where the first node is any one of the neighbor nodes of the second node, and a neighbor node of the second node is a node that has an association relationship with the second node. After determining all of its neighbor nodes, the second node sends the first heartbeat message in parallel to the first node and the other neighbor nodes.
Step 702: The first node determines whether the first heartbeat message is received within a preset time, where the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.
In this embodiment, the second node may send the first heartbeat message to all of its neighbor nodes in parallel according to the heartbeat period, so the first node can determine whether the first heartbeat message sent by the second node is received within a time that is greater than or equal to one heartbeat period and less than two heartbeat periods. For example, assume that the heartbeat period is 5 s, that is, the second node sends a heartbeat message to its neighbor nodes in parallel every 5 s. For the first heartbeat message sent by the second node at the 5th second, the first node determines whether that message is received within a time greater than or equal to 5 s and less than 10 s. The heartbeat period may be set according to experience or actual conditions, and its specific value is not limited in this embodiment.
In addition, the second node may periodically send the first heartbeat message to the first node over a single physical network. However, when fault detection relies on a single physical network and that network fails, for example when the management plane network fails while the service plane network is normal, it is often impossible to tell whether the second node has failed, the link between the second node and the first node has failed, or the second node and the first node have both failed, so the detection result is inaccurate. To solve this problem, the first heartbeat message is preferably sent over at least two networks in this embodiment. For example, it may be sent over two planes, such as a management plane and a service plane, or over three planes, such as a management plane, a service plane and a signaling plane. Sending the first heartbeat message over multiple physical networks to detect whether a node is faulty can improve detection accuracy. It should be noted that when there are at least two physical networks, they are isolated from each other. This prevents a failure of equipment shared between the networks from breaking communication between nodes, which helps improve detection accuracy.
Step 703: If the first node does not receive the first heartbeat message, the first node sends a request message to each other neighbor node, where the request message is used to ask each other neighbor node whether it has received the first heartbeat message.
In this embodiment, if the first node does not receive the first heartbeat message sent by the second node within the preset time, it can preliminarily determine that the second node may be faulty. Because the second node sends the first heartbeat message to all of its neighbor nodes in parallel, the first node sends a request message to the neighbor nodes of the second node other than itself, to ask whether they have received the first heartbeat message sent by the second node.
Step 704: The first node receives a response message that is sent by each other neighbor node and carries a reception status, where the reception status indicates whether the first heartbeat message was received.
In this embodiment, after receiving the request message sent by the first node, each other neighbor node places its reception status, which indicates whether it has received the first heartbeat message, in a response message and sends that response message to the first node.
Step 705. If the first node determines, from the reception statuses carried in the received response messages, that none of the other neighbor nodes has received the first heartbeat message, the first node determines that the second node has failed.
In this embodiment, after receiving the request message sent by the first node, every other neighbor node returns to the first node a response message carrying its reception status. From the response messages received from each other neighbor node, the first node determines whether the other neighbor nodes have received the first heartbeat message. If it finds that none of them has received the first heartbeat message sent by the second node, it determines that the second node has failed.
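The decision in step 705 reduces to a check over the collected reception statuses. The sketch below assumes the responses are gathered into a dictionary keyed by a hypothetical node identifier; how an empty set of responses should be treated is not stated in the text, so returning False in that case is an assumption.

```python
from typing import Dict


def second_node_failed(responses: Dict[str, bool]) -> bool:
    """Return True when every other neighbor reports that it did not receive the heartbeat.

    responses maps a neighbor's identifier to its reception status
    (True = first heartbeat message received).
    """
    if not responses:
        # Assumption: with no responses at all, no conclusion is drawn.
        return False
    return not any(responses.values())
```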
In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, the second node sends the first heartbeat message to the first node and the other neighbor nodes in parallel, and the first node determines whether it has received the first heartbeat message sent by the second node within a preset time. The first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node has not received the first heartbeat message, it asks the other neighbor nodes of the second node whether they have received it, and if it determines that none of them has received it either, it determines that the second node has failed. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution provided by the present invention does not need multiple heartbeat periods to detect whether a node has failed, as the prior art does. This shortens the fault detection cycle and thus improves the efficiency of node fault detection.
FIG. 8 is a schematic flowchart of Embodiment 4 of the method for detecting a node fault in a cluster system provided by the present invention. Building on the embodiment shown in FIG. 7, this embodiment describes in detail how each node re-determines its neighbor nodes after the first node determines that the second node has failed. As shown in FIG. 8, the method of this embodiment may include:
Step 801. The first node generates first voting information and receives second voting information sent by each other neighbor node. The first voting information includes the node identifier of the node elected by the first node. The second voting information includes the node identifier of the node elected by the neighbor node that sent that second voting information.
In this embodiment, after the neighbor nodes of the second node determine that the second node has failed, all of those neighbor nodes need to recalculate their own neighbor nodes. For ease of description, any one of the second node's neighbor nodes can be taken as the first node. The first node generates first voting information, which contains the node identifier of the node elected by the first node and the basis for the vote. The first node also receives second voting information from each other neighbor node, which contains the node identifier of the node elected by the sending neighbor node and the basis for that vote. In practice, the basis for a vote depends on several factors, such as load, node number, how recent the node's cache is, and the node's network bandwidth. For example, the first node can determine which node carries the lowest load and send the node identifier of that node to the other neighbor nodes in the first voting information. The other neighbor nodes can send their second voting information to the first node in a similar way.
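One way a node might form its voting information from the lowest-load criterion mentioned above is sketched here. The `Vote` record, its field names, the load values, and the tie-breaking rule (smallest identifier wins) are illustrative assumptions, since the text does not specify a message layout.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Vote:
    voter: str      # node that casts the vote
    candidate: str  # node identifier of the elected node
    basis: str      # basis for the vote, e.g. "lowest_load"


def make_vote(voter: str, loads: Dict[str, float]) -> Vote:
    """Elect the candidate with the lowest load; ties go to the smallest identifier."""
    # Iterating over sorted identifiers makes min() return the smallest
    # identifier among candidates with equal load.
    candidate = min(sorted(loads), key=loads.__getitem__)
    return Vote(voter=voter, candidate=candidate, basis="lowest_load")
```

Other bases listed in the text, such as cache freshness or network bandwidth, would only change the key used to rank the candidates.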
Step 802. Based on the node identifier in the first voting information and the node identifiers in the second voting information sent by each other neighbor node, the first node counts the votes obtained by each elected node and takes the node with the most votes as the third node. The third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. The neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
In this embodiment, after receiving the second voting information sent by each other neighbor node, the first node can determine the third node from the node identifier in the first voting information it generated and the node identifiers in the received second voting information. In a specific implementation, the node identifiers carried in the first and second voting information are tallied as in an election: the votes obtained by each elected node are counted, and the node with the most votes becomes the third node. The third node takes over the neighbor nodes of the failed second node, that is, it takes over the second node's associations with other nodes. The third node therefore replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, where those neighbor nodes include the neighbor nodes of the second node in addition to the third node's own neighbor nodes.
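The vote count in step 802 can be sketched with a simple tally. The text does not say how a tie is resolved, so choosing the smallest node identifier among the tied candidates is an assumption of this example, as are the function and parameter names.

```python
from collections import Counter
from typing import Iterable


def elect_third_node(own_vote: str, received_votes: Iterable[str]) -> str:
    """Tally the elected node identifiers and return the one with the most votes.

    own_vote is the identifier in the first voting information; received_votes
    are the identifiers in the second voting information from the other neighbors.
    """
    counts = Counter([own_vote, *received_votes])
    best = max(counts.values())
    # Assumed tie-break: the smallest identifier among the top candidates.
    return min(node for node, n in counts.items() if n == best)
```

Because every neighbor applies the same deterministic rule to the same set of votes, each of them would arrive at the same third node under these assumptions.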
Step 803. The first node re-determines its own neighbor nodes based on the neighbor nodes of the third node and on the other neighbor nodes excluding the third node.
In this embodiment, after all neighbor nodes of the second node have determined the third node by voting, there are two cases. If the first node is the third node, the first node takes over the second node's adjacency relationships, and the other neighbor nodes can recalculate their own neighbor nodes from the adjacency relationships that result once the first node has taken over the second node's neighbor nodes. If the first node is not the third node, the first node waits until the third node has re-determined its adjacency relationships and then re-determines its own neighbor nodes based on the neighbor nodes of the third node and on the other neighbor nodes excluding the third node.
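The takeover of the second node's adjacency relationships can be illustrated on a small adjacency map. This is only one possible reading of step 803: the text does not define how neighbor relationships are computed, so the rule used here (the third node inherits every neighbor of the failed node, and each inherited neighbor links back to it) and all names are assumptions.

```python
from typing import Dict, Set


def neighbors_after_takeover(
    topology: Dict[str, Set[str]], failed: str, successor: str
) -> Dict[str, Set[str]]:
    """Return a new adjacency map in which successor replaces the failed node."""
    # Neighbors the successor inherits from the failed node (excluding itself).
    inherited = topology[failed] - {successor}
    # Drop the failed node everywhere.
    updated = {
        node: set(adj) - {failed}
        for node, adj in topology.items()
        if node != failed
    }
    # The successor keeps its own neighbors and gains the inherited ones.
    updated[successor] |= inherited
    for node in inherited:
        updated[node].add(successor)
    return updated
```

For example, if node B with neighbors A, C and D fails and C is elected, C ends up adjacent to A and D in addition to its previous neighbors, and A and D each gain C as a neighbor.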
In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, the second node sends the first heartbeat message to the first node and the other neighbor nodes in parallel, and the first node determines whether it has received the first heartbeat message sent by the second node within a preset time. The first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node has not received the first heartbeat message, it asks the other neighbor nodes of the second node whether they have received it, and if it determines that none of them has received it either, it determines that the second node has failed. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution provided by the present invention does not need multiple heartbeat periods to detect whether a node has failed, as the prior art does. This shortens the fault detection cycle and thus improves the efficiency of node fault detection. In addition, after the second node is determined to have failed, the nodes re-determine their respective neighbor nodes and then continue fault detection, which improves the accuracy of fault detection.
Optionally, if the first node determines, from the reception statuses carried in the response messages received from each other neighbor node, that at least one other neighbor node has received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node has failed.
Specifically, suppose the first node has not received the first heartbeat message sent by the second node and has sent a request message to each other neighbor node to ask whether it received the first heartbeat message. If the response messages from the other neighbor nodes show that at least one of them did receive the first heartbeat message, the first node can determine that the second node is working normally and that the fault probably lies in the links between the second node and the nodes that did not receive the first heartbeat message. The nodes that did not receive the first heartbeat message include the first node and any other neighbor nodes that did not receive it.
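The link-fault case described above can be sketched as a function that lists the suspect links. The representation of a link as a pair of hypothetical node identifiers, and the choice to return an empty list when no neighbor received the heartbeat (the node-failure case), are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def faulty_links(
    first_node: str, second_node: str, responses: Dict[str, bool]
) -> List[Tuple[str, str]]:
    """List links to the second node that are suspected to have failed.

    Called after the first node itself missed the heartbeat. responses maps
    each other neighbor to True if it received the first heartbeat message.
    """
    if not any(responses.values()):
        # No neighbor saw the heartbeat: this is the node-failure case,
        # so no link is singled out.
        return []
    missed = [first_node] + sorted(n for n, got in responses.items() if not got)
    return [(node, second_node) for node in missed]
```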
Optionally, the first node re-determines its own neighbor nodes based on the neighbor nodes of the third node and on the other neighbor nodes excluding the third node.
In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, when the first node determines that at least one other neighbor node has received the first heartbeat message, it determines that the link between each node that did not receive the first heartbeat message and the second node has failed, which makes fault detection more comprehensive.
FIG. 9 is a schematic structural diagram of Embodiment 1 of the apparatus for detecting a node fault in a cluster system according to the present invention. As shown in FIG. 9, the fault detection apparatus 10 provided by this embodiment includes a judging module 11, a sending module 12, a receiving module 13, a determining module 14 and a generating module 15.
The judging module 11 is configured to determine whether the receiving module 13 has received, within a preset time, the first heartbeat message sent by the second node. The first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the judging module 11 determines that the receiving module 13 has not received the first heartbeat message sent by the second node, the sending module 12 is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node. The request message asks the other neighbor nodes whether they have received the first heartbeat message. The receiving module 13 is further configured to receive the response messages, each carrying a reception status, sent by the other neighbor nodes, where the reception status indicates whether the first heartbeat message was received. The determining module 14 is configured to determine, from the reception status carried in the response message that the receiving module 13 receives from each other neighbor node, whether none of the other neighbor nodes has received the first heartbeat message. If the determining module 14 determines that none of the other neighbor nodes has received the first heartbeat message, the determining module 14 is further configured to determine that the second node has failed.
In the apparatus for detecting a node fault in a cluster system provided by this embodiment of the present invention, the judging module determines whether the receiving module has received, within a preset time, the first heartbeat message sent by the second node. The first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the receiving module has not received the first heartbeat message, the sending module sends a request message to the other neighbor nodes of the second node to ask whether they have received it, and if the determining module determines that none of the other neighbor nodes has received it either, it determines that the second node has failed. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution provided by the present invention does not need multiple heartbeat periods to detect whether a node has failed, as the prior art does. This shortens the fault detection cycle and thus improves the efficiency of node fault detection.
Optionally, the generating module 15 is further configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;
the receiving module 13 is further configured to receive second voting information sent by each other neighbor node, where the second voting information includes the node identifier of the node elected by the neighbor node that sent that second voting information;
the determining module 14 is further configured to count, based on the node identifier in the first voting information and the node identifiers in the second voting information sent by each other neighbor node, the votes obtained by each elected node, and to take the node with the most votes as the third node. The third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. The neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
Optionally, if the determining module 14 determines, from the reception status carried in the response message that the receiving module 13 receives from each other neighbor node, that at least one other neighbor node has received the first heartbeat message, the determining module 14 is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node has failed. The nodes that did not receive the first heartbeat message include the first node and any other neighbor nodes that did not receive it.
Optionally, the determining module 14 is further configured to re-determine the neighbor nodes of the first node based on the neighbor nodes of the third node and on the other neighbor nodes excluding the third node.
The apparatus for detecting a node fault in a cluster system of this embodiment can be used to carry out the technical solution of the fault detection method provided by any embodiment of the present invention. Its implementation principle and technical effect are similar and are not repeated here.
FIG. 10 is a schematic structural diagram of Embodiment 1 of the system for detecting a node fault in a cluster system according to the present invention. As shown in FIG. 10, the fault detection system 20 provided by this embodiment includes a first node 21, a second node 22 and other neighbor nodes 23. The first node 21 is a neighbor node of the second node 22, the other neighbor nodes 23 are all neighbor nodes of the second node 22 except the first node 21, and there are one or more other neighbor nodes 23.
The second node 22 is configured to send the first heartbeat message to the first node and the other neighbor nodes in parallel. The first node 21 is configured to determine whether it has received the first heartbeat message within a preset time, where the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node has not received the first heartbeat message, the first node 21 is further configured to send a request message to each other neighbor node, where the request message asks each other neighbor node whether it has received the first heartbeat message. The first node 21 is further configured to receive, from each other neighbor node, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received. If the first node determines, from the reception status carried in the response message received from each other neighbor node, that none of the other neighbor nodes has received the first heartbeat message, the first node 21 is further configured to determine that the second node has failed.
In the system for detecting a node fault in a cluster system provided by this embodiment of the present invention, the judging module determines whether the receiving module has received, within a preset time, the first heartbeat message sent by the second node. The first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the receiving module has not received the first heartbeat message, the sending module sends a request message to the other neighbor nodes of the second node to ask whether they have received it, and if the determining module determines that none of the other neighbor nodes has received it either, it determines that the second node has failed. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution provided by the present invention does not need multiple heartbeat periods to detect whether a node has failed, as the prior art does. This shortens the fault detection cycle and thus improves the efficiency of node fault detection.
In the above embodiment, after determining that the second node has failed, the first node 21 is further configured to:
generate first voting information and receive second voting information sent by each other neighbor node, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent that second voting information;
and count, based on the node identifier in the first voting information and the node identifiers in the second voting information sent by each other neighbor node, the votes obtained by each elected node, and take the node with the most votes as the third node. The third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. The neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
In the above embodiment, if the first node determines, from the reception status carried in the response message received from each other neighbor node, that at least one other neighbor node has received the first heartbeat message, the first node 21 is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node has failed. The nodes that did not receive the first heartbeat message include the first node and any other neighbor nodes that did not receive it.
In the above embodiment, the first node 21 is further configured to re-determine the neighbor nodes of the first node based on the neighbor nodes of the third node and on the other neighbor nodes excluding the third node.
The above system embodiment can correspondingly be used to carry out the technical solutions of the method embodiments. Its implementation principle and technical effect are similar and are not repeated here.
FIG. 11 is a schematic structural diagram of Embodiment 1 of the node according to the present invention. As shown in FIG. 11, the node 600 of this embodiment includes a processor 601, a user interface 603, a network interface 604, a memory 605, a transmitter 606 and a receiver 607. The memory 605 may include an operating system 6051, application programs 6052 and the like. The processor 601 may be a central processing unit (CPU). The memory 605 is configured to store executable instructions, and the processor 601 can execute the executable instructions stored in the memory 605. The receiver 607 is configured to receive the first heartbeat message sent by the second node. The processor 601 is configured to determine whether the receiver 607 has received, within a preset time, the first heartbeat message sent by the second node. The first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the processor 601 determines that the receiver 607 has not received the first heartbeat message sent by the second node, the transmitter 606 is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message asks the other neighbor nodes whether they have received the first heartbeat message and the first node is a neighbor node of the second node. The receiver 607 is further configured to receive the response messages, each carrying a reception status, sent by the other neighbor nodes, where the reception status indicates whether the first heartbeat message was received. The processor 601 is configured to determine, from the reception status carried in the response message that the receiver 607 receives from each other neighbor node, whether none of the other neighbor nodes has received the first heartbeat message. If the processor 601 determines that none of the other neighbor nodes has received the first heartbeat message, the processor 601 is further configured to determine that the second node has failed.
The node provided by this embodiment can be used to carry out the technical solution of the method for detecting a node fault in a cluster system provided by any embodiment of the present invention. Its implementation principle and technical effect are similar and are not repeated here.
Optionally, the processor 601 is further configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;
the receiver 607 is further configured to receive second voting information sent by each other neighbor node, where the second voting information includes the node identifier of the node elected by the neighbor node that sent that second voting information;
the processor 601 is further configured to count, based on the node identifier in the first voting information and the node identifiers in the second voting information sent by each other neighbor node, the votes obtained by each elected node, and to take the node with the most votes as the third node. The third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. The neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
Optionally, when the processor 601 determines, according to the reception status carried in the response message that the receiver 607 received from each of the other neighbor nodes, that at least one of the other neighbor nodes has received the first heartbeat message, the processor 601 is further configured to determine that the link between each node that has not received the first heartbeat message and the second node is faulty. The nodes that have not received the first heartbeat message include the first node and those of the other neighbor nodes that have not received the first heartbeat message.
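The link-fault case can be sketched in the same style. The sketch assumes it is called only after the first node has missed the first heartbeat message; the function name `faulty_links`, the tuple representation of a link, and the node identifiers are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Tuple


def faulty_links(first_node: str,
                 second_node: str,
                 other_neighbor_status: Dict[str, bool]) -> List[Tuple[str, str]]:
    """List the (node, second_node) links judged to be faulty.

    Assumes the first node itself did not receive the first heartbeat message.
    """
    if not any(other_neighbor_status.values()):
        # No other neighbor received the heartbeat: this is the node-failure
        # case, so no individual link is reported here.
        return []
    # At least one neighbor received the heartbeat, so the second node is
    # alive and every node that missed it has a faulty link to the second node.
    missed = [first_node] + [
        node for node, received in other_neighbor_status.items() if not received
    ]
    return [(node, second_node) for node in missed]
```

For instance, if `"n3"` received the heartbeat but `"n4"` did not, `faulty_links("n1", "n2", {"n3": True, "n4": False})` reports the links from `"n1"` and from `"n4"` to `"n2"`.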
Optionally, the processor 601 is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes, among the other neighbor nodes, other than the third node.
The node provided in this embodiment can be used to execute the technical solution of the method for detecting a node fault in a cluster system provided by any embodiment of the present invention. Its implementation principle and technical effect are similar and are not repeated here.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510306800.0A CN106301853B (en) | 2015-06-05 | 2015-06-05 | Fault detection method and device for nodes in cluster system |
PCT/CN2016/073606 WO2016192408A1 (en) | 2015-06-05 | 2016-02-05 | Fault detection method and apparatus for node in cluster system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106301853A (en) | 2017-01-04 |
CN106301853B (en) | 2019-06-18 |
Family
ID=57440098
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201510306800.0A (CN106301853B) | Active | 2015-06-05 | 2015-06-05 | Fault detection method and device for nodes in cluster system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106301853B (en) |
WO (1) | WO2016192408A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018214106A1 (en) * | 2017-05-25 | 2018-11-29 | 深圳市伊特利网络科技有限公司 | Update method and system for network connection list |
WO2019000954A1 (en) * | 2017-06-30 | 2019-01-03 | 中兴通讯股份有限公司 | Method, device and system for monitoring node survival state |
US10547499B2 (en) | 2017-09-04 | 2020-01-28 | International Business Machines Corporation | Software defined failure detection of many nodes |
CN109302445B (en) * | 2018-08-14 | 2021-10-12 | 新华三云计算技术有限公司 | Host node state determination method and device, host node and storage medium |
CN113923105B (en) * | 2021-12-13 | 2022-04-22 | 中机联科技(广东)有限公司 | Internet of things equipment fault monitoring method and system based on block chain |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070294596A1 (en) * | 2006-05-22 | 2007-12-20 | Gissel Thomas R | Inter-tier failure detection using central aggregation point |
CN101159536A (en) * | 2007-10-30 | 2008-04-09 | 中兴通讯股份有限公司 | Media gateway node condition synchronizing method in dual-home network |
CN102204169A (en) * | 2011-05-12 | 2011-09-28 | 华为技术有限公司 | Fault detection method, route node and system |
CN102612110A (en) * | 2012-03-02 | 2012-07-25 | 浙江大学 | Distributive self-organized routing method in electric carrier wave illumination control system |
CN102821011A (en) * | 2012-08-28 | 2012-12-12 | 北京星网锐捷网络技术有限公司 | Opposite terminal state detection method, device and equipment |
CN103297396A (en) * | 2012-02-28 | 2013-09-11 | 国际商业机器公司 | Management failure transferring device and method in cluster system |
CN103916275A (en) * | 2014-03-31 | 2014-07-09 | 杭州华三通信技术有限公司 | BFD detection device and method |
US20140301401A1 (en) * | 2013-04-07 | 2014-10-09 | Hangzhou H3C Technologies Co., Ltd. | Providing aggregation link groups in logical network device |
CN104283711A (en) * | 2014-09-29 | 2015-01-14 | 中国联合网络通信集团有限公司 | Fault detection method based on BFD, nodes and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102752143B (en) * | 2012-07-05 | 2015-08-19 | 杭州华三通信技术有限公司 | The BFD detection method of MPLS TE bidirectional tunnel and routing device |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108337274A (en) * | 2017-01-19 | 2018-07-27 | 贵州白山云科技有限公司 | A kind of message distributing method and system |
CN112468372B (en) * | 2017-04-10 | 2023-10-13 | 华为技术有限公司 | Method and device for detecting equipment state in power line communication network |
CN112468372A (en) * | 2017-04-10 | 2021-03-09 | 华为技术有限公司 | Equipment state detection method and device in power line communication network |
CN109428740A (en) * | 2017-08-21 | 2019-03-05 | 华为技术有限公司 | The method and apparatus that equipment fault restores |
CN109428740B (en) * | 2017-08-21 | 2020-09-08 | 华为技术有限公司 | Method and device for equipment failure recovery |
CN109525408A (en) * | 2017-09-18 | 2019-03-26 | 杭州海康威视系统技术有限公司 | A kind of unit exception processing method, device and cloud storage system |
CN109525408B (en) * | 2017-09-18 | 2021-12-21 | 杭州海康威视系统技术有限公司 | Equipment exception handling method and device and cloud storage system |
CN107566219A (en) * | 2017-09-27 | 2018-01-09 | 华为技术有限公司 | Method for diagnosing faults, node device and computer equipment applied to group system |
CN107566219B (en) * | 2017-09-27 | 2020-09-18 | 华为技术有限公司 | Fault diagnosis method applied to cluster system, node equipment and computer equipment |
CN110377570B (en) * | 2017-10-12 | 2021-06-11 | 腾讯科技(深圳)有限公司 | Node switching method and device, computer equipment and storage medium |
CN110377570A (en) * | 2017-10-12 | 2019-10-25 | 腾讯科技(深圳)有限公司 | Node switching method, device, computer equipment and storage medium |
CN109714183A (en) * | 2017-10-26 | 2019-05-03 | 阿里巴巴集团控股有限公司 | Data processing method and device in a kind of cluster |
CN107864486A (en) * | 2017-12-26 | 2018-03-30 | 杭州迪普科技股份有限公司 | A kind of offline AP detection methods and device |
CN108092857A (en) * | 2018-01-15 | 2018-05-29 | 郑州云海信息技术有限公司 | A kind of distributed system heartbeat detecting method and relevant apparatus |
CN110324166A (en) * | 2018-03-31 | 2019-10-11 | 华为技术有限公司 | A kind of method, apparatus and system of target information synchronous in multiple nodes |
CN110324166B (en) * | 2018-03-31 | 2020-12-15 | 华为技术有限公司 | Method, device and system for synchronizing target information in multiple nodes |
CN108683561B (en) * | 2018-05-16 | 2020-10-02 | 杭州迪普科技股份有限公司 | Site state detection method and device |
CN108683561A (en) * | 2018-05-16 | 2018-10-19 | 杭州迪普科技股份有限公司 | A kind of station state detection method and device |
CN109218141A (en) * | 2018-11-20 | 2019-01-15 | 郑州云海信息技术有限公司 | A kind of malfunctioning node detection method and relevant apparatus |
CN109873719A (en) * | 2019-02-03 | 2019-06-11 | 华为技术有限公司 | A kind of fault detection method and device |
WO2020220231A1 (en) * | 2019-04-29 | 2020-11-05 | 华为海洋网络有限公司 | Submarine cable failure determination method and apparatus |
US11265080B2 (en) | 2019-04-29 | 2022-03-01 | Hmn Technologies Co., Limited | Submarine cable fault determining method and apparatus |
CN110380934A (en) * | 2019-07-23 | 2019-10-25 | 南京航空航天大学 | A kind of distribution redundant system heartbeat detecting method |
CN111181763A (en) * | 2019-11-28 | 2020-05-19 | 泰康保险集团股份有限公司 | Network fault reporting method and device |
CN112911520B (en) * | 2019-12-04 | 2022-05-31 | 哈尔滨海能达科技有限公司 | Method, device and storage medium for determining master node in ad hoc network |
CN112911520A (en) * | 2019-12-04 | 2021-06-04 | 哈尔滨海能达科技有限公司 | Method, device and storage medium for determining master node in ad hoc network |
CN111586110B (en) * | 2020-04-22 | 2021-03-19 | 广州锦行网络科技有限公司 | Optimization processing method for raft in point-to-point fault |
CN111586110A (en) * | 2020-04-22 | 2020-08-25 | 广州锦行网络科技有限公司 | Optimization processing method for raft in point-to-point fault |
CN112398905A (en) * | 2020-09-28 | 2021-02-23 | 联想(北京)有限公司 | Node and information synchronization method |
CN112398905B (en) * | 2020-09-28 | 2022-05-31 | 联想(北京)有限公司 | Node and information synchronization method |
CN114328709A (en) * | 2020-09-29 | 2022-04-12 | 北京金山云网络技术有限公司 | A failover method, device, electronic device and storage medium |
CN112988463A (en) * | 2021-02-23 | 2021-06-18 | 新华三大数据技术有限公司 | Fault node isolation method and device |
CN112988463B (en) * | 2021-02-23 | 2022-08-30 | 新华三大数据技术有限公司 | Fault node isolation method and device |
CN113542052A (en) * | 2021-06-07 | 2021-10-22 | 新华三信息技术有限公司 | Node fault determination method and device and server |
CN113783735A (en) * | 2021-09-24 | 2021-12-10 | 小红书科技有限公司 | Method, device, equipment and medium for identifying fault node in Redis cluster |
CN115102886A (en) * | 2022-06-21 | 2022-09-23 | 上海驻云信息科技有限公司 | A task scheduling method and device for multiple acquisition clients |
CN116260705A (en) * | 2022-12-21 | 2023-06-13 | 广西壮族自治区自然资源信息中心 | Geographic information distributed cluster fault processing method, device, medium and equipment |
CN116260705B (en) * | 2022-12-21 | 2023-09-15 | 广西壮族自治区自然资源信息中心 | Geographic information distributed cluster fault processing method, device, medium and equipment |
Also Published As
Publication number | Publication date |
---|---|
WO2016192408A1 (en) | 2016-12-08 |
CN106301853B (en) | 2019-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||