CN106301853A

CN106301853A - Node fault detection method and device in a cluster system

Info

Publication number: CN106301853A
Application number: CN201510306800.0A
Authority: CN
Inventors: 胡琳; 伍湘平; 彭佩星
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2015-06-05
Filing date: 2015-06-05
Publication date: 2017-01-04
Anticipated expiration: 2035-06-05
Also published as: WO2016192408A1; CN106301853B

Abstract

Embodiments of the present invention provide a method and device for detecting node faults in a cluster system. The method includes: a first node determines whether a first heartbeat message sent by a second node is received within a preset time, where the first node is a neighbor node of the second node and the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes; if the first node does not receive the first heartbeat message, the first node sends a request message to the other neighbor nodes of the second node, that is, all neighbor nodes of the second node except the first node; the first node receives response messages, each carrying a reception status, from the other neighbor nodes; and if the first node determines from the reception statuses that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty. The method and device provided by the embodiments of the present invention can improve the efficiency of node fault detection.

Description

Node failure detection method and device in cluster system

Technical Field

Embodiments of the present invention relate to communication technologies, and in particular to a method and device for detecting node faults in a cluster system.

Background

A distributed cluster system usually includes one central node and multiple ordinary nodes. A fault in the central node or in an ordinary node greatly affects the reliability of the distributed cluster system, so effective node fault detection is very important.

FIG. 1 is a schematic diagram of a node fault detection method in the prior art. As shown in FIG. 1, the ordinary nodes (B, C, D, E) send heartbeat messages to the central node (M) once per heartbeat period, and the central node (M) detects whether an ordinary node is faulty based on the consecutive heartbeat messages received within a detection period, where one detection period may contain multiple heartbeat periods. The central node (M) may also periodically send heartbeat messages to the ordinary nodes (B, C, D, E) to inform them of its role and of whether it is operating normally. If the ordinary nodes (B, C, D, E) do not receive a heartbeat message from the central node (M) within the detection period, they determine that the central node (M) is faulty and initiate a re-election of the central node. If the election succeeds, the ordinary nodes learn of the new central node and send their heartbeat messages to it, and the cluster resumes fault detection.

However, in the prior art, a node fault is detected by checking whether heartbeat messages are received within the detection period. For a cluster of fixed size, the heartbeat period cannot be changed, so the detection period cannot be changed either. A node fault can therefore be detected only after multiple heartbeat periods, which lengthens detection and lowers the efficiency of node fault detection.

Summary of the Invention

Embodiments of the present invention provide a method and device for detecting node faults in a cluster system, to solve the prior-art problem that a node fault can be detected only after multiple heartbeat periods, which makes fault detection slow, and thereby to improve the efficiency of node fault detection.

In a first aspect, an embodiment of the present invention provides a method for detecting node faults in a cluster system, including:

a first node determines whether a first heartbeat message sent by a second node is received within a preset time; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, and the second node has at least two neighbor nodes; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;

if the first node does not receive the first heartbeat message sent by the second node, the first node sends a request message to the other neighbor nodes, namely all neighbor nodes of the second node except the first node, where the request message asks whether the other neighbor nodes have received the first heartbeat message;

the first node receives, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received;

if the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.
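The decision rule of the first aspect can be illustrated with a minimal Python sketch. The function name `second_node_failed` and the dictionary of reception statuses are hypothetical and chosen only for illustration; the source does not prescribe a message or data format, and the handling of an empty set of responses is an assumption.

```python
from typing import Dict


def second_node_failed(first_node_received: bool,
                       other_statuses: Dict[str, bool]) -> bool:
    """Return True only when the first node and every other neighbor missed the heartbeat.

    other_statuses maps each other neighbor's identifier (hypothetical) to the
    reception status carried in its response message.
    """
    if first_node_received:
        # The heartbeat arrived within the preset time, so no query is needed.
        return False
    if not other_statuses:
        # Assumption: without any response there is no basis for a decision.
        return False
    # The second node is judged faulty only if no other neighbor received it.
    return not any(other_statuses.values())
```

For example, `second_node_failed(False, {"B": False, "C": False, "D": False})` evaluates to `True`, whereas a single `True` status among the responses makes it `False`.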

With reference to the first aspect, in a first possible implementation of the first aspect, after the first node determines that the second node is faulty, the method further includes:

the first node generates first voting information and receives second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent it;

the first node counts, based on the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and takes the node with the most votes as a third node; the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
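The vote counting described above can be sketched in Python as follows. The function name `elect_third_node` and the use of plain strings as node identifiers are hypothetical. The source does not say how a tie is resolved, so the tie-break used here (smallest identifier wins) is purely an assumption.

```python
from collections import Counter
from typing import Iterable


def elect_third_node(first_vote: str, second_votes: Iterable[str]) -> str:
    """Tally the first node's own vote together with the votes received from
    the other neighbor nodes and return the identifier with the most votes."""
    counts = Counter([first_vote, *second_votes])
    most_votes = max(counts.values())
    # Assumption: ties are broken by picking the smallest node identifier.
    return min(node for node, votes in counts.items() if votes == most_votes)
```

With a first vote of `"B"` and second votes `["B", "C", "D"]`, node `"B"` holds two of the four votes and is returned as the third node.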

With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the method further includes:

if the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between the second node and each node that did not receive the first heartbeat message is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
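As a minimal sketch of this link-fault determination, the Python function below lists the nodes whose link to the second node is judged faulty. The name `nodes_with_faulty_link` and the status dictionary are hypothetical, and the function assumes it is called only after the first node itself has missed the first heartbeat message.

```python
from typing import Dict, List


def nodes_with_faulty_link(first_node: str,
                           other_statuses: Dict[str, bool]) -> List[str]:
    """Return the nodes whose link to the second node is judged faulty.

    Assumes the first node did not receive the heartbeat. other_statuses maps
    each other neighbor's identifier to its reported reception status.
    """
    if not any(other_statuses.values()):
        # Nobody received the heartbeat: this is the node-fault case,
        # so no individual link is blamed here.
        return []
    missed = [node for node, received in other_statuses.items() if not received]
    # The first node itself is always among the nodes that missed the heartbeat.
    return [first_node, *missed]
```

For instance, if node `"A"` is the first node and the responses are `{"B": True, "C": False, "D": True}`, the links from the second node to `"A"` and `"C"` are reported as faulty.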

With reference to the first aspect or either of the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect, the method further includes:

the first node re-determines its neighbor nodes based on the neighbor nodes of the third node and those of the other neighbor nodes other than the third node.

In a second aspect, an embodiment of the present invention provides a method for detecting node faults in a cluster system, the method including:

a second node sends a first heartbeat message in parallel to a first node and to other neighbor nodes; the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there is at least one other neighbor node;

the first node determines whether the first heartbeat message is received within a preset time; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;

if the first node does not receive the first heartbeat message, the first node sends a request message to each of the other neighbor nodes, where the request message asks each of the other neighbor nodes whether it has received the first heartbeat message;

the first node receives, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received;

if the first node determines, from the reception status carried in the received response messages, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.

With reference to the second aspect, in a first possible implementation of the second aspect, after the first node determines that the second node is faulty, the method further includes:

With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the method further includes:

if the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between the second node and each node that did not receive the first heartbeat message is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.

With reference to the second aspect or either of the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect, the method further includes:

In a third aspect, an embodiment of the present invention provides a device for detecting node faults in a cluster system, including:

a judging module, configured to determine whether a first heartbeat message sent by a second node is received within a preset time; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, and the second node has at least two neighbor nodes; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;

a sending module, configured to, when the judging module determines that a receiving module has not received the first heartbeat message sent by the second node, send a request message to the other neighbor nodes, namely all neighbor nodes of the second node except the first node, where the request message asks whether the other neighbor nodes have received the first heartbeat message;

the receiving module, further configured to receive, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received;

a determining module, configured to determine, from the reception status carried in the response message that the receiving module received from each of the other neighbor nodes, whether none of the other neighbor nodes received the first heartbeat message;

the determining module is further configured to determine that the second node is faulty when it determines that none of the other neighbor nodes received the first heartbeat message.

With reference to the third aspect, in a first possible implementation of the third aspect, the device further includes the following for use after the determining module determines that the second node is faulty:

a generating module, configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;

the receiving module is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier of the node elected by the neighbor node that sent it;

the determining module is further configured to count, based on the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and to take the node with the most votes as a third node; the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

With reference to the third aspect or the first possible implementation of the third aspect, in a second possible implementation of the third aspect,

when the determining module determines, from the reception status carried in the response message that the receiving module received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,

the determining module is further configured to determine that the link between the second node and each node that did not receive the first heartbeat message is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.

With reference to the third aspect or either of the first and second possible implementations of the third aspect, in a third possible implementation of the third aspect,

the determining module is further configured to re-determine the neighbor nodes of the first node based on the neighbor nodes of the third node and those of the other neighbor nodes other than the third node.

In a fourth aspect, an embodiment of the present invention provides a system for detecting node faults in a cluster system, including a first node, a second node, and other neighbor nodes, where the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there is at least one other neighbor node, wherein:

the second node is configured to send a first heartbeat message in parallel to the first node and the other neighbor nodes;

the first node is configured to determine whether the first heartbeat message is received within a preset time; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;

if the first node does not receive the first heartbeat message, the first node is further configured to send a request message to each of the other neighbor nodes, where the request message asks each of the other neighbor nodes whether it has received the first heartbeat message; and the first node is further configured to receive, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received;

if the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node is further configured to determine that the second node is faulty.

With reference to the fourth aspect, in a first possible implementation of the fourth aspect, after the first node determines that the second node is faulty,

the first node is further configured to:

generate first voting information and receive second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent it;

and count, based on the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and take the node with the most votes as a third node; the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

With reference to the fourth aspect or the first possible implementation of the fourth aspect, in a second possible implementation of the fourth aspect,

if the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,

the first node is further configured to determine that the link between the second node and each node that did not receive the first heartbeat message is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.

With reference to the fourth aspect or either of the first and second possible implementations of the fourth aspect, in a third possible implementation of the fourth aspect,

the first node is further configured to re-determine its neighbor nodes based on the neighbor nodes of the third node and those of the other neighbor nodes other than the third node.

In the method and device for detecting node faults in a cluster system provided by the embodiments of the present invention, a first node determines whether it receives, within a preset time, a first heartbeat message sent by a second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has at least two neighbor nodes; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node does not receive the first heartbeat message, it asks the other neighbor nodes of the second node whether they received it, and if none of them did, the first node determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with this technical solution does not need the multiple heartbeat periods that the prior art requires to decide whether a node is faulty. The fault detection cycle is therefore shorter, which improves the efficiency of node fault detection.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a node fault detection method in a cluster system in the prior art;

FIG. 2 is a schematic flowchart of Embodiment 1 of the node fault detection method in a cluster system provided by the present invention;

FIG. 3 is a first schematic diagram of the neighbor relationships between nodes in a cluster system;

FIG. 4 is a second schematic diagram of the neighbor relationships between nodes in a cluster system;

FIG. 5 is a schematic flowchart of Embodiment 2 of the node fault detection method in a cluster system provided by the present invention;

FIG. 6A is a schematic diagram of the neighbor relationships between nodes before a node fault is detected in a cluster system;

FIG. 6B is a schematic diagram of the re-determined neighbor relationships between nodes after a node fault is detected in a cluster system;

FIG. 7 is a schematic flowchart of Embodiment 3 of the node fault detection method in a cluster system provided by the present invention;

FIG. 8 is a schematic flowchart of Embodiment 4 of the node fault detection method in a cluster system provided by the present invention;

FIG. 9 is a schematic structural diagram of Embodiment 1 of the node fault detection device in a cluster system according to the present invention;

FIG. 10 is a schematic structural diagram of Embodiment 1 of the node fault detection system in a cluster system according to the present invention;

FIG. 11 is a schematic structural diagram of Embodiment 1 of a node according to the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The embodiments of the present invention apply to cluster systems, and specifically to node fault detection in a distributed cluster system. The distributed cluster system includes at least two nodes; a node may be, for example, a computer. Optionally, the cluster system in this embodiment differs from existing cluster systems in that all nodes are given the same function, that is, every node has the same capability to receive and send heartbeat messages. Consequently, the cluster system of this embodiment makes no distinction between a central node and ordinary nodes, and no central node is needed to manage ordinary nodes. Optionally, the technical solutions of the following embodiments are described with a computer as the executing entity.

FIG. 2 is a schematic flowchart of Embodiment 1 of the node fault detection method in a cluster system provided by the present invention. The method in this embodiment applies to a distributed cluster system and is described with a computer as the executing entity. As shown in FIG. 2, the method of this embodiment may include:

Step 201: A first node determines whether a first heartbeat message sent by a second node is received within a preset time; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, and the second node has at least two neighbor nodes; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.

In this embodiment, the second node determines the first node from the information about all nodes in the cluster system, according to a rule preset in the cluster system. The first node is any one of the second node's neighbor nodes, and a neighbor node of the second node is a node that has an association relationship with the second node. FIG. 3 is a first schematic diagram of the adjacency relationships between nodes in a cluster system. As shown in FIG. 3, node E can determine, from the information about all nodes and the preset rule, that it has four neighbor nodes: nodes A, B, C and D. The first node may be any one of nodes A, B, C and D. The first node detects whether the second node has failed by determining whether it receives the first heartbeat message from the second node within the preset time. Note that the second node sends heartbeat messages to all of its neighbor nodes in parallel, so the first heartbeat message is the heartbeat message that the second node sends to each of its neighbor nodes at the same moment. In addition, the second node may send the first heartbeat message to all of its neighbor nodes in parallel once per heartbeat period, so the first node can determine whether it receives the first heartbeat message within a time that is greater than or equal to one heartbeat period and less than two heartbeat periods. For example, assume the heartbeat period is 5 s, that is, every 5 s the second node sends a heartbeat message to all of its neighbor nodes in parallel. For the first heartbeat message that the second node sends at the 5th second, the first node determines whether it receives that message within a time that is at least 5 s and less than 10 s. The heartbeat period can be set based on experience or actual conditions; this embodiment does not limit its specific value.
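The timing rule in step 201 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of plain floating-point seconds, and the choice to measure the window from the last received heartbeat are hypothetical and are not specified by the embodiment.

```python
from typing import Optional


def preset_time_is_valid(preset_time: float, heartbeat_period: float) -> bool:
    """Check the constraint: one period <= preset time < two periods."""
    return heartbeat_period <= preset_time < 2 * heartbeat_period


def heartbeat_missed(last_heartbeat_at: Optional[float],
                     now: float,
                     preset_time: float) -> bool:
    """Return True if no heartbeat arrived within the preset time.

    Assumption: the window is measured from the last heartbeat the
    first node received; None means nothing has been received yet.
    """
    if last_heartbeat_at is None:
        return True
    return now - last_heartbeat_at > preset_time
```

With a 5 s heartbeat period, a preset time of 7.5 s satisfies the constraint, whereas 10 s does not.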

In addition, the second node may periodically send the first heartbeat message to the first node over a single physical network. However, when fault detection relies on a single physical network and that network fails (for example, the management-plane network fails while the service-plane network is normal), it is often impossible to tell whether the second node has failed, whether the link between the second node and the first node has failed, or whether both the second node and the first node have failed. The detection result is therefore inaccurate. To solve this problem, the first heartbeat message is preferably sent over at least two networks in this embodiment. For example, it may be sent over two planes, such as the management plane and the service plane, or over three planes, such as the management plane, the service plane and the signaling plane. Sending the first heartbeat message over multiple physical networks to detect node failures improves detection accuracy. Note that when there are at least two physical networks, they are isolated from one another. This prevents a failure of equipment shared by several networks from stopping normal communication between nodes, which further improves detection accuracy.
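One way to picture the multi-plane idea is to track reception per plane. The sketch below is a hypothetical illustration: the plane names, the dictionary representation, and the policy that a heartbeat received on any plane counts as received are assumptions, not details given in the embodiment.

```python
from typing import Dict, List


def received_on_any_plane(received_by_plane: Dict[str, bool]) -> bool:
    """Assumed policy: the heartbeat counts as received if any plane saw it."""
    return any(received_by_plane.values())


def silent_planes(received_by_plane: Dict[str, bool]) -> List[str]:
    """List the planes on which the heartbeat did not arrive."""
    return sorted(plane for plane, got in received_by_plane.items() if not got)
```

If the management plane is silent while the service plane delivers the heartbeat, the sender is evidently alive and the management-plane network is the more likely culprit.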

Step 202: If the first node does not receive the first heartbeat message sent by the second node, the first node sends a request message to every neighbor node of the second node other than the first node itself. The request message asks those other neighbor nodes whether they received the first heartbeat message.

In the prior art, when the period of the heartbeats that ordinary nodes send to the central node is fixed, the performance limits of the central node prevent the cluster system from adding ordinary nodes without bound, which restricts the scalability of the cluster system. To address this, in this embodiment of the present invention, if the first node does not receive the first heartbeat message from the second node within the preset time, it can preliminarily conclude that the second node may have failed. Because the second node sends the first heartbeat message to all of its neighbor nodes in parallel, the first node sends a request message to the second node's other neighbor nodes, that is, all of them except itself, to ask whether they received the first heartbeat message sent by the second node. Thus, when the first node does not receive the first heartbeat message, it queries only the second node's other neighbor nodes, and nodes that are not neighbors of the second node do not send heartbeat messages to the second node. This reduces the number of heartbeat messages the second node has to process, lightens its load, and gives the cluster system better scalability.

For example, FIG. 4 is a second schematic diagram of the adjacency relationships between nodes in a cluster system. As shown in FIG. 4, the neighbor nodes of node E are X, A, D, C and G, and in every heartbeat period node E sends a heartbeat message to all of its neighbor nodes X, A, D, C and G. Take node E as the second node and node A as the first node. If, in some heartbeat period, the first node A does not receive the first heartbeat message sent by the second node E, the first node A sends a request message to the other neighbor nodes X, D, C and G to ask whether they received the first heartbeat message.
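The choice of nodes to query in the FIG. 4 example amounts to removing the first node from the second node's neighbor list. A minimal sketch, in which the function name and the list representation are hypothetical:

```python
from typing import List


def nodes_to_query(neighbors_of_second_node: List[str], first_node: str) -> List[str]:
    """Return the second node's neighbors other than the first node itself."""
    return [n for n in neighbors_of_second_node if n != first_node]


# FIG. 4 example: E's neighbors are X, A, D, C and G; A missed the heartbeat.
targets = nodes_to_query(["X", "A", "D", "C", "G"], "A")
```

Here `targets` holds X, D, C and G, the nodes to which node A sends its request message.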

Step 203: The first node receives, from the other neighbor nodes, response messages that carry a reception status. The reception status indicates whether the responding node received the first heartbeat message.

In this embodiment, after receiving the request message from the first node, each other neighbor node places its own reception status, indicating whether it received the first heartbeat message, in a response message and sends it to the first node.

Step 204: If the first node determines, from the reception status carried in the response message received from each other neighbor node, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node has failed.

In this embodiment, after receiving the request message from the first node, every other neighbor node returns to the first node a response message that carries its reception status. From these response messages the first node determines whether the other neighbor nodes received the first heartbeat message. If it finds that none of them received the first heartbeat message sent by the second node, it determines that the second node has failed.
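The decision in step 204 reduces to checking that every reported reception status is negative. The sketch below assumes the responses have already been collected into a dictionary that maps each other neighbor node to its reception status. That representation, and the choice to return False when no responses are available, are assumptions made for illustration.

```python
from typing import Dict


def second_node_failed(reception_status: Dict[str, bool]) -> bool:
    """Return True only if every responding neighbor missed the heartbeat.

    Assumption: with no responses at all there is no evidence either way,
    so the function does not declare a failure.
    """
    if not reception_status:
        return False
    return not any(reception_status.values())
```

A single positive reception status is enough to rule out the node-failure verdict.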

Note that adjacency relationships between nodes are bidirectional, that is, nodes in a neighbor relationship can send heartbeat messages to each other. Therefore, every neighbor node of the second node performs steps 201 to 204 independently.

In the node fault detection method for a cluster system provided by this embodiment of the present invention, the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node. The first node is a neighbor node of the second node; the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes; the second node has two or more neighbor nodes; and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself has not received the first heartbeat message, it asks the second node's other neighbor nodes whether they received it, and if none of them did, it determines that the second node has failed. Because the preset time is at least one heartbeat period and less than two heartbeat periods, fault detection with this technical solution avoids the multiple heartbeat periods that the prior art needs to detect a node failure. This shortens the fault detection cycle and improves the efficiency of node fault detection.

FIG. 5 is a schematic flowchart of Embodiment 2 of the node fault detection method in a cluster system provided by the present invention. Building on the embodiment shown in FIG. 2, this embodiment describes in detail how each node re-determines its neighbor nodes after the first node has determined that the second node has failed. As shown in FIG. 5, the method of this embodiment may include:

Step 501: The first node generates first voting information and receives second voting information sent by each other neighbor node. The first voting information includes the node identifier of the node elected by the first node; the second voting information includes the node identifier of the node elected by the neighbor node that sent it.

In this embodiment, once the neighbor nodes of the second node have determined that the second node has failed, all of them need to recalculate their own neighbor nodes. For ease of description, any neighbor node of the second node can be taken as the first node. The first node generates first voting information, which contains the node identifier of the node it elects and the basis for the vote. The first node also receives second voting information from each other neighbor node; the second voting information contains the node identifier of the node elected by the sending neighbor node and the basis for that vote. In practice, the voting basis can depend on several factors, such as load, node number, how recent the node's cache is, and the node's network bandwidth. For example, the first node can work out which node carries the lowest load, place that node's identifier in the first voting information, and send it to the other neighbor nodes. The other neighbor nodes can send their second voting information to the first node in a similar way.
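As an illustration of the lowest-load voting basis mentioned above, the following sketch builds a piece of voting information. The `VoteInfo` structure, its field names, the numeric load values and the rule of breaking ties by the smaller node identifier are all hypothetical; the embodiment only says that a vote carries the elected node's identifier and the voting basis.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class VoteInfo:
    voter: str    # node that casts the vote
    elected: str  # identifier of the node it elects
    basis: str    # voting basis, e.g. "lowest load"


def make_vote(voter: str, load_by_node: Dict[str, float]) -> VoteInfo:
    """Elect the candidate with the lowest load.

    Assumption: ties are broken by the smaller node identifier, which
    works because min() keeps the first minimum of the sorted candidates.
    """
    elected = min(sorted(load_by_node), key=lambda node: load_by_node[node])
    return VoteInfo(voter=voter, elected=elected, basis="lowest load")
```

Calling `make_vote("A", {"X": 0.7, "A": 0.2, "D": 0.4})` elects node A under these assumptions.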

Step 502: From the node identifier in the first voting information and the node identifiers in the second voting information sent by each other neighbor node, the first node counts the votes obtained by each elected node and takes the node with the most votes as the third node. The third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. The neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

In this embodiment, after receiving the second voting information from each other neighbor node, the first node can determine the third node from the node identifier in the first voting information it generated and the node identifiers in the second voting information it received. In a specific implementation, the votes obtained by each elected node are counted from the node identifiers carried in the first and second voting information, and the node with the most votes becomes the third node. The third node takes over the neighbor nodes of the failed second node, that is, it takes over the association relationships between the second node and other nodes. The third node therefore replaces the second node and sends heartbeat messages in parallel to all of its neighbor nodes, which now include the neighbor nodes of the second node in addition to the third node's own neighbor nodes.
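The tally in step 502 can be sketched with `collections.Counter`. The function name and the tie-breaking rule (smallest node identifier wins) are assumptions; the embodiment only states that the node with the most votes becomes the third node.

```python
from collections import Counter
from typing import Iterable


def elect_third_node(elected_ids: Iterable[str]) -> str:
    """Return the node identifier that received the most votes.

    Assumption: a tie is resolved in favor of the smallest identifier.
    """
    counts = Counter(elected_ids)
    if not counts:
        raise ValueError("no votes were cast")
    most_votes = max(counts.values())
    return min(node for node, votes in counts.items() if votes == most_votes)
```

The input is the elected identifier taken from the first voting information plus one from each piece of second voting information.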

Step 503: The first node re-determines its own neighbor nodes according to the neighbor nodes of the third node and the other neighbor nodes except the third node.

In this embodiment, after all neighbor nodes of the second node have determined the third node by voting, there are two cases. If the first node is the third node, the first node takes over the adjacency relationships of the second node, and the other neighbor nodes recalculate their own neighbor nodes based on the adjacency relationships that result from this takeover. If the first node is not the third node, the first node waits until the third node has re-determined its adjacency relationships and then re-determines its own neighbor nodes according to the neighbor nodes of the third node and the other neighbor nodes except the third node.

For example, FIG. 6A is a schematic diagram of the adjacency relationships between nodes before a node failure is detected in the cluster system, and FIG. 6B is a schematic diagram of the adjacency relationships re-determined after the node failure is detected. As shown in FIG. 6A, assume that node E is the second node and node A is the first node. After the first node A determines that the second node E has failed, it generates first voting information and receives the second voting information sent by nodes X, D, C and G. From the node identifiers in the first and second voting information, the first node A determines the third node, which then replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. As shown in FIG. 6B, if the vote determines that the first node A is the third node, the first node A replaces the second node and sends heartbeat messages in parallel to all of its own neighbor nodes. In that case the first node A re-determines its neighbor nodes using the other neighbor nodes X, D, C and G, and nodes X, D, C and G wait until the first node A has determined its neighbor nodes and then re-determine their own neighbor nodes accordingly.
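The takeover described above can be sketched as a set union: the third node's new neighbor set is its own neighbors plus the failed node's neighbors, minus the third node itself and the failed node. The function name and the set representation are hypothetical, and node B in the usage line is an invented extra neighbor of A that is not taken from FIG. 6A.

```python
from typing import Set


def neighbors_after_takeover(third_node: str,
                             failed_node: str,
                             third_node_neighbors: Set[str],
                             failed_node_neighbors: Set[str]) -> Set[str]:
    """Merge both neighbor sets, dropping the third node and the failed node."""
    merged = third_node_neighbors | failed_node_neighbors
    return merged - {third_node, failed_node}


# A takes over from failed node E; B is a hypothetical pre-existing neighbor of A.
new_neighbors = neighbors_after_takeover(
    "A", "E", {"E", "B"}, {"X", "A", "D", "C", "G"})
```

Under these assumptions node A ends up with neighbors B, X, D, C and G.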

In the node fault detection method for a cluster system provided by this embodiment of the present invention, the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node. The first node is a neighbor node of the second node; the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes; the second node has two or more neighbor nodes; and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself has not received the first heartbeat message, it asks the second node's other neighbor nodes whether they received it, and if none of them did, it determines that the second node has failed. Because the preset time is at least one heartbeat period and less than two heartbeat periods, fault detection with this technical solution avoids the multiple heartbeat periods that the prior art needs to detect a node failure. This shortens the fault detection cycle and improves the efficiency of node fault detection. In addition, after the second node is determined to have failed, the nodes re-determine their respective neighbor nodes and continue fault detection, which improves the accuracy of fault detection.

Optionally, if the first node determines, from the reception status carried in the response message received from each other neighbor node, that at least one other neighbor node received the first heartbeat message, the first node determines that the links between the second node and the nodes that did not receive the first heartbeat message have failed.

Specifically, suppose the first node has not received the first heartbeat message from the second node and has sent a request message to each other neighbor node to ask whether it received the first heartbeat message. If the response messages show that at least one other neighbor node did receive the first heartbeat message, the first node can conclude that the second node is working normally, and that the fault probably lies in the links between the second node and the nodes that did not receive the first heartbeat message. Those nodes comprise the first node and any other neighbor nodes that did not receive the first heartbeat message.
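The link-fault case can be sketched by listing the nodes whose link to the second node is suspect. The function name and the dictionary of reception statuses are hypothetical. Returning an empty list when no neighbor received the heartbeat is a design choice of this sketch, since that case is treated as a node failure rather than a link fault.

```python
from typing import Dict, List


def nodes_with_suspect_links(first_node: str,
                             reception_status: Dict[str, bool]) -> List[str]:
    """Return nodes whose link to the second node is presumed faulty.

    If at least one other neighbor received the heartbeat, the second
    node is considered alive, and the first node plus every neighbor
    that reported a miss are returned. Otherwise the list is empty.
    """
    if not any(reception_status.values()):
        return []
    missed = [node for node, got in reception_status.items() if not got]
    return [first_node] + missed
```

For instance, if A missed the heartbeat and only D also reports a miss, the suspect links are E to A and E to D.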

In the node fault detection method for a cluster system provided by this embodiment of the present invention, when the first node determines that at least one other neighbor node received the first heartbeat message, it determines that the links between the second node and the nodes that did not receive the first heartbeat message have failed, which makes fault detection more comprehensive.

FIG. 7 is a schematic flowchart of Embodiment 3 of the node fault detection method in a cluster system provided by the present invention. The method in this embodiment of the present invention applies to distributed cluster systems. This embodiment again uses a computer as the executing entity by way of example. As shown in FIG. 7, the method of this embodiment may include:

Step 701: The second node sends the first heartbeat message in parallel to the first node and the other neighbor nodes. The first node is a neighbor node of the second node; the other neighbor nodes are all neighbor nodes of the second node except the first node, and there is at least one of them.

In this embodiment, the second node can determine all of its neighbor nodes from the information about the nodes in the cluster system, according to a rule preset in the cluster system. The first node is any one of the second node's neighbor nodes, and a neighbor node of the second node is a node that has an association relationship with the second node. After determining all of its neighbor nodes, the second node sends the first heartbeat message in parallel to the first node and the other neighbor nodes.
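On the sending side, "in parallel" could be realized with a thread pool, as in the sketch below. This is an assumption about one possible realization: the embodiment does not prescribe threads, and the `send` callable stands in for whatever transport the cluster actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def send_heartbeat_in_parallel(neighbors: Iterable[str],
                               send: Callable[[str], None]) -> None:
    """Invoke send(neighbor) for every neighbor using one worker per neighbor.

    `send` is a hypothetical transport hook supplied by the caller.
    """
    targets = list(neighbors)
    if not targets:
        return
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        # Consuming the iterator re-raises any exception from send().
        list(pool.map(send, targets))
```

A caller would invoke this once per heartbeat period with the current neighbor list.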

Step 702: The first node determines whether it receives the first heartbeat message within a preset time. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.

In this embodiment, the second node may send the first heartbeat message to all of its neighbor nodes in parallel once per heartbeat period, so the first node can determine whether it receives the first heartbeat message within a time that is greater than or equal to one heartbeat period and less than two heartbeat periods. For example, assume the heartbeat period is 5 s, that is, every 5 s the second node sends a heartbeat message to its neighbor nodes in parallel. For the first heartbeat message that the second node sends at the 5th second, the first node determines whether it receives that message within a time that is at least 5 s and less than 10 s. The heartbeat period can be set based on experience or actual conditions; this embodiment does not limit its specific value.

Step 703: If the first node does not receive the first heartbeat message, it sends a request message to each other neighbor node. The request message asks each other neighbor node whether it received the first heartbeat message.

In this embodiment, if the first node does not receive the first heartbeat message from the second node within the preset time, it can preliminarily conclude that the second node may have failed. Because the second node sends the first heartbeat message to all of its neighbor nodes in parallel, the first node sends a request message to the second node's other neighbor nodes, that is, all of them except itself, to ask whether they received the first heartbeat message sent by the second node.

Step 704: The first node receives, from each other neighbor node, a response message that carries a reception status. The reception status indicates whether the responding node received the first heartbeat message.

In this embodiment, after receiving the request message from the first node, each other neighbor node places its own reception status, indicating whether it received the first heartbeat message, in a response message and sends it to the first node.

Step 705: If the first node determines, from the reception status carried in the received response messages, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node has failed.

In this embodiment, after receiving the request message from the first node, every other neighbor node returns to the first node a response message that carries its reception status. From these response messages the first node determines whether the other neighbor nodes received the first heartbeat message. If it finds that none of them received the first heartbeat message sent by the second node, it determines that the second node has failed.

In the node fault detection method for a cluster system provided by this embodiment of the present invention, the second node sends the first heartbeat message in parallel to the first node and the other neighbor nodes, and the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node. The first node is a neighbor node of the second node; the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes; the second node has two or more neighbor nodes; and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself has not received the first heartbeat message, it asks the second node's other neighbor nodes whether they received it, and if none of them did, it determines that the second node has failed. Because the preset time is at least one heartbeat period and less than two heartbeat periods, fault detection with this technical solution avoids the multiple heartbeat periods that the prior art needs to detect a node failure. This shortens the fault detection cycle and improves the efficiency of node fault detection.

FIG. 8 is a schematic flowchart of Embodiment 4 of the node fault detection method in a cluster system provided by the present invention. Building on the embodiment shown in FIG. 7, this embodiment describes in detail how each node re-determines its neighbor nodes after the first node has determined that the second node has failed. As shown in FIG. 8, the method of this embodiment may include:

Step 801: The first node generates first voting information and receives second voting information sent by each other neighbor node. The first voting information includes the node identifier of the node elected by the first node; the second voting information includes the node identifier of the node elected by the neighbor node that sent it.

In this embodiment, once the neighbor nodes of the second node have determined that the second node has failed, all of them need to recalculate their own neighbor nodes. For ease of description, any neighbor node of the second node can be taken as the first node. The first node generates first voting information, which contains the node identifier of the node it elects and the basis for the vote. The first node also receives second voting information from each other neighbor node; the second voting information contains the node identifier of the node elected by the sending neighbor node and the basis for that vote. In practice, the voting basis can depend on several factors, such as load, node number, how recent the node's cache is, and the node's network bandwidth. For example, the first node can work out which node carries the lowest load, place that node's identifier in the first voting information, and send it to the other neighbor nodes. The other neighbor nodes can send their second voting information to the first node in a similar way.

Step 802: From the node identifier in the first voting information and the node identifiers in the second voting information sent by each other neighbor node, the first node counts the votes obtained by each elected node and takes the node with the most votes as the third node. The third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. The neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

In this embodiment, after receiving the second voting information from each other neighbor node, the first node can determine the third node from the node identifier in the first voting information it generated and the node identifiers in the second voting information it received. In a specific implementation, the votes obtained by each elected node are counted from the node identifiers carried in the first and second voting information, and the node with the most votes becomes the third node. The third node takes over the neighbor nodes of the failed second node, that is, it takes over the association relationships between the second node and other nodes. The third node therefore replaces the second node and sends heartbeat messages in parallel to all of its neighbor nodes, which now include the neighbor nodes of the second node in addition to the third node's own neighbor nodes.

Step 803: The first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.

In the node fault detection method in a cluster system provided by this embodiment of the present invention, the second node sends the first heartbeat message in parallel to the first node and the other neighbor nodes, and the first node judges whether it has received the first heartbeat message sent by the second node within a preset time. The first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node has not received the first heartbeat message, it asks the other neighbor nodes of the second node whether they have received it, and if it determines that none of them has, it determines that the second node has failed. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with this technical solution avoids the prior-art need to wait through multiple heartbeat periods before a node fault can be detected; this shortens the fault detection period and improves the efficiency of node fault detection. In addition, after the second node is determined to have failed, the nodes re-determine their respective neighbor nodes and then continue fault detection, which improves the accuracy of fault detection.
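The timing constraint on the preset time can be sketched as a small check at the first node. This is a minimal sketch, not the patented implementation: the class name `HeartbeatWatch`, the millisecond integer timestamps, and the choice to measure the window from the most recent heartbeat are all assumptions made for illustration.

```python
class HeartbeatWatch:
    """Tracks whether a neighbor's heartbeat is overdue at the first node."""

    def __init__(self, heartbeat_period, preset_time, start):
        # Constraint from the text: one period <= preset time < two periods.
        if not (heartbeat_period <= preset_time < 2 * heartbeat_period):
            raise ValueError("preset_time must lie in [period, 2 * period)")
        self.preset_time = preset_time
        # Assumption: the window is measured from the last heartbeat seen.
        self.last_seen = start

    def on_heartbeat(self, now):
        """Record the arrival time of a heartbeat message."""
        self.last_seen = now

    def is_overdue(self, now):
        """True if no heartbeat arrived within the preset time."""
        return now - self.last_seen > self.preset_time
```

When `is_overdue` returns true, the first node would go on to query the other neighbor nodes rather than wait for further heartbeat periods.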

Specifically, suppose the first node has not received the first heartbeat message sent by the second node and has sent a request message to each of the other neighbor nodes to ask whether it received the first heartbeat message. If the first node determines from the response messages that at least one of the other neighbor nodes did receive the first heartbeat message, it can determine that the second node is normal and that the fault instead lies in the links between the second node and the nodes that did not receive the first heartbeat message. The nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
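The distinction between a node fault and a link fault can be sketched as a single decision function. The name `classify` and the return format are hypothetical, and the sketch assumes that every other neighbor node has already replied; handling of missing replies is not described in the text and is left out.

```python
def classify(second, first, responses):
    """Decide between a node fault and link faults.

    Called only after the first node itself missed the first heartbeat message.
    responses: maps each other neighbor's identifier to True if that neighbor
        reported receiving the first heartbeat message.
    """
    if not any(responses.values()):
        # No neighbor received the heartbeat: the second node has failed.
        return ("node_fault", second)
    # At least one neighbor received it: the second node is normal, and the
    # links to the nodes that missed the heartbeat are considered faulty.
    missed = [first] + sorted(n for n, got in responses.items() if not got)
    return ("link_fault", [(n, second) for n in missed])
```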

Optionally, the first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.

FIG. 9 is a schematic structural diagram of Embodiment 1 of the node fault detection device in a cluster system according to the present invention. As shown in FIG. 9, the node fault detection device 10 in a cluster system provided by this embodiment includes a judging module 11, a sending module 12, a receiving module 13, a determining module 14, and a generating module 15.

The judging module 11 is configured to judge whether the receiving module 13 has received, within a preset time, the first heartbeat message sent by the second node. The first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. When the judging module 11 judges that the receiving module 13 has not received the first heartbeat message sent by the second node, the sending module 12 is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node other than the first node; the request message asks whether the other neighbor nodes have received the first heartbeat message. The receiving module 13 is further configured to receive the response messages sent by the other neighbor nodes, each carrying a reception status that indicates whether the first heartbeat message was received. The determining module 14 is configured to determine, from the reception status carried in the response message sent by each of the other neighbor nodes and received by the receiving module 13, whether none of the other neighbor nodes has received the first heartbeat message; when it determines that none has, the determining module 14 is further configured to determine that the second node has failed.
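One possible shape for the request message and the response message carrying the reception status is sketched below. The text does not define message fields, so every field here, in particular the `heartbeat_seq` sequence number used to identify the first heartbeat message, is a hypothetical choice, as is the helper `answer_query` that a neighbor node would run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatQuery:
    """Request message: did you receive the heartbeat from `suspect`?"""
    sender: str         # the first node
    suspect: str        # the second node
    heartbeat_seq: int  # hypothetical identifier of the heartbeat in question


@dataclass(frozen=True)
class HeartbeatQueryResponse:
    """Response message carrying the reception status."""
    sender: str
    suspect: str
    heartbeat_seq: int
    received: bool      # the reception status


def answer_query(self_id, seen, query):
    """Build the response on one of the other neighbor nodes.

    seen: set of (node_id, heartbeat_seq) pairs this neighbor has received.
    """
    return HeartbeatQueryResponse(
        sender=self_id,
        suspect=query.suspect,
        heartbeat_seq=query.heartbeat_seq,
        received=(query.suspect, query.heartbeat_seq) in seen,
    )
```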

In the node fault detection device in a cluster system provided by this embodiment of the present invention, the judging module judges whether the receiving module has received, within a preset time, the first heartbeat message sent by the second node. The first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the receiving module has not received the first heartbeat message, the sending module sends a request message to the other neighbor nodes of the second node to ask whether they have received it, and when the determining module determines that none of the other neighbor nodes has received the first heartbeat message, it determines that the second node has failed. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with this technical solution avoids the prior-art need to wait through multiple heartbeat periods before a node fault can be detected; this shortens the fault detection period and improves the efficiency of node fault detection.

Optionally, the generating module 15 is further configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;

the receiving module 13 is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information;

the determining module 14 is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and to take the node with the most votes as the third node. The third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

Optionally, when the determining module 14 determines, from the reception status carried in the response message sent by each of the other neighbor nodes and received by the receiving module 13, that at least one of the other neighbor nodes has received the first heartbeat message, the determining module 14 is further configured to determine that the links between the second node and the nodes that did not receive the first heartbeat message have failed. The nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.

Optionally, the determining module 14 is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.

The node fault detection device in a cluster system of this embodiment can be used to carry out the technical solution of the node fault detection method in a cluster system provided by any embodiment of the present invention. Its implementation principle and technical effects are similar and are not repeated here.

FIG. 10 is a schematic structural diagram of Embodiment 1 of the node fault detection system in a cluster system according to the present invention. As shown in FIG. 10, the node fault detection system 20 in a cluster system provided by this embodiment includes a first node 21, a second node 22, and other neighbor nodes 23. The first node 21 is a neighbor node of the second node 22, the other neighbor nodes 23 are all neighbor nodes of the second node 22 other than the first node 21, and there are one or more other neighbor nodes 23.

The second node 22 is configured to send the first heartbeat message in parallel to the first node and the other neighbor nodes. The first node 21 is configured to judge whether it has received the first heartbeat message within a preset time; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. When the first node has not received the first heartbeat message, the first node 21 is further configured to send a request message to each of the other neighbor nodes, asking whether that neighbor node has received the first heartbeat message. The first node 21 is further configured to receive the response message sent by each of the other neighbor nodes, each carrying a reception status that indicates whether the first heartbeat message was received. When the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that none of the other neighbor nodes has received the first heartbeat message, the first node 21 is further configured to determine that the second node has failed.
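The parallel transmission by the second node can be illustrated with a thread pool. This is only a sketch under stated assumptions: the text does not say how parallel sending is realized, the function name `send_heartbeat_parallel` is hypothetical, and `send` stands in for whatever transport the cluster actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def send_heartbeat_parallel(sender, neighbors, send):
    """Send one heartbeat message to every neighbor concurrently.

    send: hypothetical transport callable, send(sender, neighbor) -> bool.
    Returns a dict mapping each neighbor to the result of its send call.
    """
    neighbors = list(neighbors)
    if not neighbors:
        return {}
    # One worker per neighbor so that no send waits behind another.
    with ThreadPoolExecutor(max_workers=len(neighbors)) as pool:
        results = pool.map(lambda n: send(sender, n), neighbors)
        # pool.map yields results in the same order as the input neighbors.
        return dict(zip(neighbors, results))
```

Sending to all neighbors at once, rather than one after another, is what lets each neighbor apply the same preset time to the same heartbeat message.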

In the node fault detection system in a cluster system provided by this embodiment of the present invention, the judging module judges whether the receiving module has received, within a preset time, the first heartbeat message sent by the second node. The first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the receiving module has not received the first heartbeat message, the sending module sends a request message to the other neighbor nodes of the second node to ask whether they have received it, and when the determining module determines that none of the other neighbor nodes has received the first heartbeat message, it determines that the second node has failed. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with this technical solution avoids the prior-art need to wait through multiple heartbeat periods before a node fault can be detected; this shortens the fault detection period and improves the efficiency of node fault detection.

In the above embodiment, after determining that the second node has failed, the first node 21 is further configured to:

generate first voting information and receive second voting information sent by each of the other neighbor nodes; and, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, count the number of votes obtained by each elected node and take the node with the most votes as the third node. The third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

In the above embodiment, when the first node determines, from the reception status carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes has received the first heartbeat message, the first node 21 is further configured to determine that the links between the second node and the nodes that did not receive the first heartbeat message have failed. The nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.

In the above embodiment, the first node 21 is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.

The above system embodiment can correspondingly be used to carry out the technical solutions of the method embodiments. Its implementation principle and technical effects are similar and are not repeated here.

FIG. 11 is a schematic structural diagram of Embodiment 1 of the node according to the present invention. As shown in FIG. 11, the node 600 of this embodiment includes a processor 601, a user interface 603, a network interface 604, a memory 605, a transmitter 606, and a receiver 607. The memory 605 may include an operating system 6051, application programs 6052, and the like. The processor 601 may be a central processing unit (CPU). The memory 605 is configured to store executable instructions, and the processor 601 may execute the executable instructions stored in the memory 605. The receiver 607 is configured to receive the first heartbeat message sent by the second node. The processor 601 is configured to judge whether the receiver 607 has received, within a preset time, the first heartbeat message sent by the second node. The first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, and the second node has two or more neighbor nodes. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. When the processor 601 judges that the receiver 607 has not received the first heartbeat message sent by the second node, the transmitter 606 is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node other than the first node; the request message asks whether the other neighbor nodes have received the first heartbeat message, and the first node is a neighbor node of the second node. The receiver 607 is further configured to receive the response messages sent by the other neighbor nodes, each carrying a reception status that indicates whether the first heartbeat message was received. The processor 601 is configured to determine, from the reception status carried in the response message sent by each of the other neighbor nodes and received by the receiver 607, whether none of the other neighbor nodes has received the first heartbeat message; when it determines that none has, the processor 601 is further configured to determine that the second node has failed.

The node provided in this embodiment can be used to carry out the technical solution of the node fault detection method in a cluster system provided by any embodiment of the present invention. Its implementation principle and technical effects are similar and are not repeated here.

Optionally, the processor 601 is further configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;

the receiver 607 is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information;

the processor 601 is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and to take the node with the most votes as the third node. The third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

Optionally, when the processor 601 determines, from the reception status carried in the response message sent by each of the other neighbor nodes and received by the receiver 607, that at least one of the other neighbor nodes has received the first heartbeat message, the processor 601 is further configured to determine that the links between the second node and the nodes that did not receive the first heartbeat message have failed. The nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.

Optionally, the processor 601 is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes among the other neighbor nodes other than the third node.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A node fault detection method in a cluster system, characterized by comprising:

a first node judging whether it has received, within a preset time, a first heartbeat message sent by a second node; wherein the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, the number of all neighbor nodes of the second node is two or more, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;

when the first node has not received the first heartbeat message sent by the second node, the first node sending a request message to the other neighbor nodes, other than the first node, among all neighbor nodes of the second node, the request message being used to ask whether the other neighbor nodes have received the first heartbeat message;

the first node receiving response messages, each carrying a reception status, sent by the other neighbor nodes, the reception status being used to indicate whether the first heartbeat message was received; and

when the first node determines, according to the reception status carried in the received response message sent by each of the other neighbor nodes, that none of the other neighbor nodes has received the first heartbeat message, the first node determining that the second node has failed.

2. The method according to claim 1, characterized in that, after the first node determines that the second node has failed, the method further comprises:

the first node generating first voting information and receiving second voting information sent by each of the other neighbor nodes, wherein the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and

the first node counting, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and taking the node with the most votes as a third node; wherein the third node is a node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

3. The method according to claim 1 or 2, characterized by further comprising:

when the first node determines, according to the reception status carried in the received response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes has received the first heartbeat message, the first node determining that the links between the second node and the nodes that did not receive the first heartbeat message have failed; wherein the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.

4. A node fault detection method in a cluster system, characterized in that the method comprises:

a second node sending a first heartbeat message in parallel to a first node and other neighbor nodes; wherein the first node is a neighbor node of the second node, the other neighbor nodes are the nodes other than the first node among all neighbor nodes of the second node, and the number of the other neighbor nodes is one or more;

the first node judging whether it has received the first heartbeat message within a preset time, the preset time being greater than or equal to one heartbeat period and less than two heartbeat periods;

when the first node has not received the first heartbeat message, the first node sending a request message to each of the other neighbor nodes, the request message being used to ask whether each of the other neighbor nodes has received the first heartbeat message;

the first node receiving a response message, carrying a reception status, sent by each of the other neighbor nodes, the reception status being used to indicate whether the first heartbeat message was received; and

when the first node determines, according to the reception status carried in the received response messages, that none of the other neighbor nodes has received the first heartbeat message, the first node determining that the second node has failed.

5. The method according to claim 4, characterized in that, after the first node determines that the second node has failed, the method further comprises:

6. The method according to claim 4 or 5, characterized by further comprising:

when the first node determines, according to the reception status carried in the received response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes has received the first heartbeat message, the first node determining that the links between the second node and the nodes that did not receive the first heartbeat message have failed; wherein the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.

7. A device for detecting a fault of a node in a cluster system, characterized by comprising:

a judging module, configured to judge whether a receiving module receives, within a preset time, a first heartbeat message sent by a second node; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, and the second node has two or more neighbor nodes; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;

in the case that the judging module judges that the receiving module has not received the first heartbeat message sent by the second node,

a sending module, configured to send a request message to the other neighbor nodes, namely all neighbor nodes of the second node except the first node, the request message being used to ask whether the other neighbor nodes have received the first heartbeat message;

the receiving module, further configured to receive a response message that is sent by each of the other neighbor nodes and carries a reception status, the reception status being used to indicate whether the first heartbeat message has been received;

a determining module, configured to determine, according to the reception status carried in the response message that the receiving module receives from each of the other neighbor nodes, whether none of the other neighbor nodes has received the first heartbeat message;

in the case that the determining module determines that none of the other neighbor nodes has received the first heartbeat message, the determining module is further configured to determine that the second node has failed.
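
For illustration only, the decision made by the determining module in claim 7 can be sketched as follows. This is a minimal sketch, not the claimed implementation: the function name `second_node_failed`, the dictionary representation of reception statuses, and the node identifiers used below are all assumptions introduced for the example.

```python
from typing import Dict


def second_node_failed(first_node_received: bool,
                       other_neighbor_status: Dict[str, bool]) -> bool:
    """Return True if the second node is judged to have failed.

    first_node_received: whether the first node received the first
        heartbeat message within the preset time.
    other_neighbor_status: hypothetical mapping from the identifier of each
        other neighbor node to the reception status in its response message
        (True means that neighbor received the first heartbeat message).
    """
    if first_node_received:
        # The heartbeat arrived in time, so no request messages are sent.
        return False
    if not other_neighbor_status:
        # The claim requires at least one other neighbor to corroborate.
        raise ValueError("at least one other neighbor node is required")
    # The second node is judged faulty only if no other neighbor received it.
    return not any(other_neighbor_status.values())
```

For example, `second_node_failed(False, {"n3": False, "n4": False})` evaluates to `True`, whereas a single `True` status among the other neighbors leads to `False`.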

8. The device according to claim 7, characterized in that, after the determining module determines that the second node has failed, the device further comprises:

a generation module, configured to generate first vote information, the first vote information including the node identifier corresponding to the node elected by the first node;

the receiving module, further configured to receive second vote information sent by each of the other neighbor nodes, the second vote information including the node identifier corresponding to the node elected by the neighbor node that sent the second vote information;

the determining module, further configured to count, according to the node identifier in the first vote information and the node identifier in the second vote information sent by each of the other neighbor nodes, the number of votes obtained by each of the elected nodes, and to take the node with the most votes as a third node; the third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
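
The vote counting in claim 8 can be sketched as below. This is a hedged illustration: the function name `elect_third_node` is hypothetical, and the tie-breaking rule (smallest node identifier wins) is an assumption added for the example, since the claim does not state how ties are resolved.

```python
from collections import Counter
from typing import Iterable


def elect_third_node(first_vote: str, second_votes: Iterable[str]) -> str:
    """Count votes and return the identifier of the node with the most votes.

    first_vote: node identifier carried in the first vote information.
    second_votes: node identifiers carried in the second vote information
        received from each of the other neighbor nodes.
    """
    tally = Counter([first_vote, *second_votes])
    most_votes = max(tally.values())
    # Assumption: on a tie, pick the smallest identifier so that every
    # neighbor counting the same votes reaches the same result.
    return min(node for node, votes in tally.items() if votes == most_votes)
```

Because every neighbor counts the same set of votes, a deterministic tie-break of this kind lets each neighbor arrive at the same third node without a further round of messages.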

9. The device according to claim 7 or 8, characterized in that:

in the case that the determining module determines, according to the reception status carried in the response message that the receiving module receives from each of the other neighbor nodes, that at least one of the other neighbor nodes has received the first heartbeat message,

the determining module is further configured to determine that the link between the second node and each node that did not receive the first heartbeat message has failed; the nodes that did not receive the first heartbeat message include the first node and any of the other neighbor nodes that did not receive the first heartbeat message.
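
The link-fault case of claim 9 can be illustrated in the same style. This is a sketch under stated assumptions: the function name `failed_links` and the representation of a link as a `(node, second_node)` tuple are hypothetical, and the function assumes it is called only after the first node itself has missed the first heartbeat message.

```python
from typing import Dict, List, Tuple


def failed_links(first_node: str, second_node: str,
                 other_neighbor_status: Dict[str, bool]) -> List[Tuple[str, str]]:
    """List the links judged faulty when the second node is still alive.

    other_neighbor_status maps each other neighbor's identifier to whether
    it received the first heartbeat message.
    """
    if not any(other_neighbor_status.values()):
        # No neighbor received the heartbeat: this is the node-fault case,
        # not a link fault.
        return []
    # The first node missed the heartbeat, as did every neighbor whose
    # reception status is False.
    missed = [first_node] + [node for node, received
                             in other_neighbor_status.items() if not received]
    return [(node, second_node) for node in missed]
```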

10. A system for detecting a fault of a node in a cluster system, characterized by comprising a first node, a second node, and other neighbor nodes, wherein the first node is a neighbor node of the second node, the other neighbor nodes are the nodes among all neighbor nodes of the second node other than the first node, and the number of the other neighbor nodes is one or more, wherein:

the second node is configured to send a first heartbeat message in parallel to the first node and the other neighbor nodes;

the first node is configured to judge whether the first heartbeat message is received within a preset time; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;

in the case that the first node has not received the first heartbeat message, the first node is further configured to send a request message to each of the other neighbor nodes, the request message being used to ask whether that neighbor node has received the first heartbeat message; and the first node is further configured to receive a response message that is sent by each of the other neighbor nodes and carries a reception status, the reception status being used to indicate whether the first heartbeat message has been received;

in the case that the first node determines, according to the reception status carried in the response message received from each of the other neighbor nodes, that none of the other neighbor nodes has received the first heartbeat message, the first node is further configured to determine that the second node has failed.
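
The constraint on the preset time in claim 10 amounts to a half-open interval, which can be sketched as follows. The function names, the use of seconds as the unit, and the strict comparison in `heartbeat_missed` are assumptions made for illustration.

```python
def is_valid_preset_time(preset_time: float, heartbeat_period: float) -> bool:
    """Check that the preset time is at least one heartbeat period
    and less than two heartbeat periods."""
    return heartbeat_period <= preset_time < 2 * heartbeat_period


def heartbeat_missed(last_heartbeat_at: float, now: float,
                     preset_time: float) -> bool:
    """Return True if no heartbeat has arrived within the preset time.

    Assumption: timestamps are in seconds, and the heartbeat counts as
    missed once the elapsed time strictly exceeds the preset time.
    """
    return now - last_heartbeat_at > preset_time
```

With a heartbeat period of 1.0 s, a preset time of 1.5 s satisfies the constraint, while 2.0 s does not.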

11. The system according to claim 10, characterized in that, after the first node determines that the second node has failed:

the first node is further configured to:

generate first vote information, and receive second vote information sent by each of the other neighbor nodes, the first vote information including the node identifier corresponding to the node elected by the first node, and the second vote information including the node identifier corresponding to the node elected by the neighbor node that sent the second vote information;

and count, according to the node identifier in the first vote information and the node identifier in the second vote information sent by each of the other neighbor nodes, the number of votes obtained by each of the elected nodes, and take the node with the most votes as a third node, the third node being the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.

12. The system according to claim 10 or 11, characterized in that:

in the case that the first node determines, according to the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes has received the first heartbeat message,

the first node is further configured to determine that the link between the second node and each node that did not receive the first heartbeat message has failed; the nodes that did not receive the first heartbeat message include the first node and any of the other neighbor nodes that did not receive the first heartbeat message.