CN102355369A

CN102355369A - Virtual clustered system as well as processing method and processing device thereof

Info

Publication number: CN102355369A
Application number: CN2011103017960A
Authority: CN
Inventors: 江滢
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2011-09-27
Filing date: 2011-09-27
Publication date: 2012-02-15
Anticipated expiration: 2031-09-27
Also published as: CN102355369B; WO2013044828A1

Abstract

The invention discloses a virtualized cluster system and a processing method and device thereof. The system comprises at least two partitions, each comprising one master node and at least one standby node; each master node and each standby node hosts at least one virtual machine; a peer-to-peer architecture is used between the master nodes of different partitions; a star architecture is used between the master node and the standby nodes within each partition; the master nodes comprise one management master node and at least one common master node, wherein the management master node is used for reselecting a new common master node or standby node in the partition of the failed node when a common master node or standby node fails, or for restarting the virtual machine when a virtual machine on a common master node or standby node fails. According to the embodiments of the invention, the scalability and availability of the system can be improved.

Description

Virtualization cluster system and processing method and device thereof

Technical Field

The present invention relates to network communication technologies, and in particular, to a virtualized cluster system, and a processing method and device thereof.

Background

A cluster system offers strong overall computing, storage, and management performance, presents its services as a single system image, and provides availability guarantees and fault tolerance that are transparent to users; it has therefore become a mainstream infrastructure of the data center. Virtualization technology offers a promising direction for cluster development. It allows a platform to run multiple operating systems simultaneously, so that application programs run in mutually independent spaces without affecting one another, which significantly improves the working efficiency of a computer. Running a plurality of virtual machines makes full use of the computing potential of the physical server and gives the data center a quick response capability.

With the introduction of virtualization technology, scalability and high availability are the biggest challenges facing clustered systems.

Disclosure of Invention

The embodiments of the invention provide a virtualized cluster system and a processing method and device thereof, which improve the scalability and availability of the virtual machine cluster system.

The embodiment of the invention provides a processing method of a virtualization cluster system, which comprises the following steps:

the node judges whether at least one of the following exists: a failed common master node, a failed standby node, or a failed virtual machine;

after determining that a failed common master node exists, the node reselects a new common master node; after determining that a failed standby node exists, the node reselects a new standby node; or after determining that a failed virtual machine exists, the node restarts the virtual machine;

the common master nodes and the standby nodes are divided into at least two partitions, and each partition comprises a master node and at least one standby node; each master node and each standby node hosts at least one virtual machine; the master nodes in different partitions adopt a peer-to-peer architecture; a star architecture is adopted between the master node and the standby nodes in each partition; the master nodes include a management master node and at least one common master node.

An embodiment of the present invention provides a processing device for a virtualized cluster system, including:

a judging unit for judging whether at least one of the following exists: a failed common master node, a failed standby node, or a failed virtual machine;

a processing unit for reselecting a new common master node after determining that a failed common master node exists; reselecting a new standby node after determining that a failed standby node exists; or restarting the virtual machine after determining that a failed virtual machine exists.

The embodiment of the invention provides a virtualization cluster system, which comprises:

at least two partitions, wherein each partition comprises a master node and at least one standby node; each master node and each standby node hosts at least one virtual machine;

the master nodes in different partitions adopt a peer-to-peer architecture;

a star architecture is adopted between the master node and the standby nodes in each partition;

the master nodes comprise a management master node and at least one common master node, wherein the management master node is used for reselecting a new common master node or standby node in the partition where the failed common master node or standby node is located after a common master node or standby node fails, or for restarting the virtual machine when a virtual machine on a common master node or standby node fails.

According to the technical scheme, the virtualized cluster system provided by the embodiment of the invention can realize system expansion by partitioning and adding partitions; the main nodes of the partitions adopt a peer-to-peer structure, so that the bottleneck problem can be eliminated, and the reliability can be improved; reliability may be further improved by reselecting a new master node, standby node, or restarting a virtual machine.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without inventive labor.

FIG. 1 is a schematic system configuration diagram according to a first embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method according to a first embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an apparatus according to a first embodiment of the present invention;

FIG. 4 is a schematic flow chart of a method according to a second embodiment of the present invention;

FIG. 5 is a system diagram illustrating a second embodiment of the present invention;

FIG. 6 is a schematic flow chart of a method according to a third embodiment of the present invention;

FIG. 7 is a schematic system configuration diagram of a third embodiment of the present invention;

FIG. 8 is a schematic flow chart of a method according to a fourth embodiment of the present invention;

FIG. 9 is a schematic system structure diagram according to a fourth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic structural diagram of a system according to a first embodiment of the present invention. Referring to fig. 1, the system includes at least two partitions 1, each of which includes a master node (master) 11 and at least one standby node (slave) 12; each master node 11 and each standby node 12 hosts at least one Virtual Machine (VM) 13.

For example, referring to fig. 1, the master nodes include master node A, master node B, master node C, and so on. The standby nodes of the partition where master node A is located include standby node A1, standby node A2, and so on; the standby nodes of the partition where master node B is located include standby node B1, standby node B2, and so on; and the standby nodes of the partition where master node C is located include standby node C1, standby node C2, and so on.

The master nodes 11 in different partitions adopt a peer-to-peer architecture, that is, a master node may send resource status information to any other master node and may also receive resource status information sent by any other master node. A star architecture is adopted between the master node 11 and the standby nodes 12 in each partition, that is, each standby node sends resource status information to the master node, and the master node does not send resource status information to the standby nodes. The resource status information may indicate whether the corresponding node is normal or failed.

The master nodes comprise a management master node (master leader) and at least one common master node, wherein the management master node is used for reselecting a new common master node or standby node in the partition where the failed common master node or standby node is located after a common master node or standby node fails, or for restarting the virtual machine when a virtual machine on a common master node or standby node fails.

One of the master nodes may be preset as the management master node, and the other master nodes are common master nodes. The management master node stores information about each master node, each standby node, and the virtual machines on those nodes, manages all the nodes of the partitions in a unified manner, and handles faults in a unified manner after they occur. For example, referring to fig. 1, master node C may be set as the management master node, and master nodes A, B, and so on are common master nodes.
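The topology described above can be illustrated with a minimal sketch. The class names, field names, and the `report_targets` helper are hypothetical and chosen only for illustration; the patent does not prescribe any data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str
    vms: List[str] = field(default_factory=list)  # each node hosts at least one VM


@dataclass
class Partition:
    master: Node
    standbys: List[Node]  # at least one standby node per partition


@dataclass
class Cluster:
    partitions: Dict[str, Partition]
    management_master_id: str  # one master is preset as the management master

    def masters(self) -> List[Node]:
        return [p.master for p in self.partitions.values()]

    def report_targets(self, node_id: str) -> List[str]:
        """Return the IDs of the nodes that node_id sends resource status to."""
        for p in self.partitions.values():
            if node_id == p.master.node_id:
                # Peer-to-peer: a master reports to every other master.
                return [m.node_id for m in self.masters() if m.node_id != node_id]
            if any(s.node_id == node_id for s in p.standbys):
                # Star: a standby reports only to the master of its own partition.
                return [p.master.node_id]
        raise KeyError(node_id)


def build_example() -> Cluster:
    """Build the three-partition layout of fig. 1 with C as management master."""
    parts = {}
    for name in ("A", "B", "C"):
        parts[name] = Partition(
            master=Node(name, [f"vm-{name}"]),
            standbys=[Node(f"{name}1", [f"vm-{name}1"]),
                      Node(f"{name}2", [f"vm-{name}2"])],
        )
    return Cluster(parts, management_master_id="C")
```

In this sketch, adding a partition is just adding one more entry to `partitions`, which mirrors how the system is expanded by adding partitions.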

Corresponding to the above system, the flow between the devices may be as follows.

Fig. 2 is a schematic flow chart of a method according to a first embodiment of the present invention, which includes:

Step 21: The node judges whether at least one of the following exists: a failed common master node, a failed standby node, or a failed virtual machine.

Step 22: After determining that a failed common master node exists, the node reselects a new common master node; after determining that a failed standby node exists, the node reselects a new standby node; or after determining that a failed virtual machine exists, the node restarts the virtual machine.

The nodes may be a common master node, a management master node, and a standby node, and the process may have different specific implementations when the nodes are different and the scenes are different. For details, reference may be made to the following examples.
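As a rough illustration of steps 21 and 22, the sketch below maps each kind of detected fault to its recovery action. The `Fault` enumeration, the `ACTIONS` table, and the `handle` function are hypothetical names introduced here; the concrete behavior per node type is described in the later embodiments.

```python
from enum import Enum


class Fault(Enum):
    COMMON_MASTER = "failed common master node"
    STANDBY = "failed standby node"
    VIRTUAL_MACHINE = "failed virtual machine"


# Step 22: one recovery action per fault type.
ACTIONS = {
    Fault.COMMON_MASTER: "reselect a new common master node",
    Fault.STANDBY: "reselect a new standby node",
    Fault.VIRTUAL_MACHINE: "restart the virtual machine",
}


def handle(detected):
    """Step 21 yields a set of detected faults; step 22 maps each to an action."""
    return [ACTIONS[f] for f in Fault if f in detected]
```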

Correspondingly, the corresponding equipment of the method can be as follows.

Fig. 3 is a schematic structural diagram of an apparatus according to a first embodiment of the present invention, which includes a judging unit 31 and a processing unit 32. The judging unit 31 is configured to judge whether at least one of the following exists: a failed common master node, a failed standby node, or a failed virtual machine. The processing unit 32 is configured to reselect a new common master node after determining that a failed common master node exists; to reselect a new standby node after determining that a failed standby node exists; or to restart the virtual machine after determining that a failed virtual machine exists.

Of course, corresponding to the above method flow, the above devices may be a common master node, a management master node, and a standby node, and the specific functions of the above units are different in different nodes and scenes. See in particular the examples below.

The virtualized cluster system of the embodiment of the invention can realize system expansion by dividing partitions and adding partitions; the main nodes of the partitions adopt a peer-to-peer structure, so that the bottleneck problem can be eliminated, and the reliability can be improved; reliability may be further improved by reselecting a new master node, standby node, or restarting a virtual machine.

Fig. 4 is a schematic flowchart of a method according to a second embodiment of the present invention, and fig. 5 is a schematic diagram of a system structure according to the second embodiment of the present invention, where the present embodiment takes a failure of a general master node as an example.

Referring to fig. 4, the present embodiment includes:

Step 41: When the cluster works normally, the common master nodes of the partitions detect one another's heartbeats through their heartbeat detection modules (Heartbeat Sync).

For example, the heartbeat detection module of common master node A sends heartbeat information to the heartbeat detection module of common master node B.

Step 42: If the heartbeat detection module of common master node B detects that the heartbeat information of common master node A has stopped, it multicasts a fault message, where the fault message carries the identification information of common master node A to indicate that common master node A has failed.

When common master node B does not receive heartbeat information from common master node A within a certain time, it determines that the heartbeat of common master node A has stopped.

The identification information is used to distinguish nodes and may be, for example, an ID or an address of common master node A.

Wherein, the other ordinary master nodes and the management master node all receive the failure message.

Step 43: after receiving the failure message, the heartbeat detection module of the management master node reports a master node failure message to a High Availability (HA) module of the management master node, where the master node failure message carries identification information of a common master node a.

Step 44: The HA module of the management master node reselects, in the partition of common master node A, a standby node as the new common master node of the partition.

For example, standby node A1 in the partition where A is located is selected as the new common master node according to the ID priority of each standby node and the dynamic load condition of the standby nodes.

Step 45: The HA module of the management master node sends a migration virtual machine request to the resource management module (ResourceMgmt) of the management master node, where the migration virtual machine request carries the identification information of the new common master node A1 and the identification information of common master node A.

Step 46: The resource management module of the management master node migrates the virtual machine on common master node A to the new common master node A1.

For example, the configuration information of the virtual machine on common master node A is sent to the new common master node A1, and the new common master node A1 is instructed to re-run the configuration information to restart the corresponding virtual machine. The configuration information of the virtual machine is information with which the virtual machine can be started, for example, virtual machine software; the virtual machine can be started after the virtual machine software is executed.

Further, after the new common master node is added, the master nodes also update the membership:

Step 47: The new common master node multicasts a join request to the other common master nodes. After the heartbeat detection module of each other common master node detects the join request, it sends a membership update request to the corresponding membership management module (Membership Mgmt), where the membership update request carries the identification information of the new common master node and the identification information of the failed common master node.

For example, after common master node B receives the join request multicast by the new common master node A1, the heartbeat detection module of common master node B sends a membership update request to the membership management module of common master node B, where the request carries the identification information of A and the identification information of A1.

Step 48: The membership management module updates the membership list.

For example, the identification information of the new common master node A1 is added to the membership list, and the identification information of the failed common master node A is deleted.
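Steps 44 to 48 can be sketched as follows. The text names ID priority and dynamic load as selection criteria but does not say how they are combined; the sketch assumes that the least-loaded standby wins and that ID priority breaks ties. `Standby`, `select_new_master`, and `fail_over` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Standby:
    node_id: str
    id_priority: int  # assumption: lower value means higher priority
    load: float       # assumption: 0.0 (idle) to 1.0 (saturated)
    vms: List[str] = field(default_factory=list)


def select_new_master(standbys: List[Standby]) -> Standby:
    # Step 44: least load first, ID priority as tie-breaker (assumed ordering).
    return min(standbys, key=lambda s: (s.load, s.id_priority))


def fail_over(failed_id: str, failed_vms: List[str],
              standbys: List[Standby], members: Set[str]) -> Standby:
    new_master = select_new_master(standbys)
    # Steps 45-46: the VMs of the failed master are restarted on the new master.
    new_master.vms.extend(failed_vms)
    # Steps 47-48: each master replaces the failed ID with the new ID.
    members.discard(failed_id)
    members.add(new_master.node_id)
    return new_master
```

For instance, with standbys A1 (load 0.2) and A2 (load 0.7), the sketch promotes A1 and the membership list changes from {A, B, C} to {A1, B, C}.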

Referring to the above flow, the corresponding modules may be as follows:

Referring to fig. 5, the present embodiment relates to a common master node 51 and a management master node 52. For the common master node, the judging unit is specifically a first heartbeat detection module (Heartbeat Sync) 511, and the processing unit is specifically a first membership management module (Membership Mgmt) 512. For the management master node, the judging unit is specifically a second heartbeat detection module 521, and the processing unit specifically includes a first High Availability (HA) module 522 and a first resource management module (ResourceMgmt) 523.

The first heartbeat detection module 511 is configured to determine that a failed common master node exists after detecting that the heartbeat of any other common master node has stopped, and to determine that the common master node whose heartbeat has stopped is the failed common master node;

the first member relationship management module 512 is configured to receive a first member relationship request message, where the first member relationship request message carries identification information of a new common host node and identification information of a failed common host node, add the identification information of the new common host node to a first member relationship list, and delete the identification information of the failed common host node in the first member relationship list;

the new common master node is obtained by reselecting from a standby node in a partition where the failed common master node is located after a management master node receives a first fault message, the first fault message is sent by the common master node after determining that the failed common master node exists, and the first fault message carries identification information of the failed common master node.

The second heartbeat detecting module 521 is configured to determine that a failed general master node exists after receiving a first fault message, where the first fault message is sent by the general master node after determining that the failed general master node exists, and the first fault message carries identification information of the failed general master node;

the first high availability module 522 is configured to receive a master node fault message, where the master node fault message carries identification information of the failed general master node, reselect a new general master node from a standby node in a partition where the failed general master node is located, and send a first migration virtual machine request carrying the identification information of the new general master node and the identification information of the failed general master node, where the master node fault message is sent after receiving the first fault message;

the first resource management module 523 is configured to send the identification information of the virtual machine on the failed general host node to the new general host node and restart the virtual machine according to the first migration virtual machine request message.

The embodiment can realize the expansibility of the cluster system through partitioning. In the embodiment, by adopting the peer-to-peer architecture among the master nodes, after one master node fails, the master node failure can be known in time, and a new master node is reselected, so that the availability is improved.
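How a peer master might notice a stopped heartbeat "in time" can be sketched with a simple timeout check. The `HeartbeatMonitor` class, the `fault_message` helper, and the message layout are assumptions for illustration; the patent only states that the heartbeat is considered stopped when no heartbeat information arrives within a certain time and that the fault message carries the failed node's identification information.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each peer (hypothetical helper)."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def beat(self, node_id: str, now: float) -> None:
        # Record that heartbeat information arrived from node_id at time now.
        self.last_seen[node_id] = now

    def stopped(self, now: float) -> List[str]:
        # A heartbeat counts as stopped if nothing arrived within the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


def fault_message(failed_id: str) -> dict:
    # The fault message carries the identification of the failed node.
    return {"type": "fault", "failed_node": failed_id}
```

Timestamps are passed in explicitly so the check stays deterministic; a real deployment would presumably read a monotonic clock instead.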

Fig. 6 is a schematic flowchart of a method according to a third embodiment of the present invention, and fig. 7 is a schematic diagram of a system structure according to the third embodiment of the present invention, where the present embodiment takes a standby node failure as an example.

Referring to fig. 6, the present embodiment includes:

Step 601: When the cluster works normally, the standby nodes of each partition send heartbeat information, through their heartbeat detection modules, to the heartbeat detection module of the common master node of that partition.

For example, the heartbeat detection module of standby node A1 sends heartbeat information to the heartbeat detection module of common master node A in the same partition.

Step 602: If the heartbeat detection module of common master node A detects that the heartbeat of standby node A1 has stopped, it sends a heartbeat detection message to another standby node of its partition.

For example, if common master node A does not receive heartbeat information from standby node A1 within a set time, it determines that the heartbeat of standby node A1 has stopped and sends a heartbeat detection message to another standby node A2 of its partition, where the heartbeat detection message carries the identification information of standby node A1.

Step 603: Standby node A2 detects the heartbeat condition of standby node A1.

For example, standby node A2 sends a ping message to standby node A1; if no response message is returned by standby node A1, the heartbeat of standby node A1 is considered to have stopped.

Step 604: Standby node A2 sends a heartbeat detection result for standby node A1 to common master node A.

Step 605: If the heartbeat detection result also indicates that the heartbeat of standby node A1 has stopped, common master node A multicasts a fault message, where the fault message carries the identification information of standby node A1.

The other common master nodes and the management master node all receive the fault message.

Step 606: After receiving the fault message, the heartbeat detection module of the management master node sends a standby node fault message to the HA module of the management master node, where the standby node fault message carries the identification information of the failed standby node A1.

Step 607: The HA module of the management master node selects, in the partition where standby node A1 is located, another standby node as the target to which the virtual machine is migrated.

The other standby node can be selected according to priority, load condition, and the like.

Step 608: The HA module of the management master node sends a migration virtual machine request to the resource management module of the management master node, where the migration virtual machine request carries the identification information of the new standby node and the identification information of the failed standby node.

For example, if the reselected standby node is A2, the migration virtual machine request carries the identification information of A1 and the identification information of A2.

Step 609: The resource management module of the management master node migrates the virtual machine on standby node A1 to standby node A2.

For example, the configuration information of the virtual machine on standby node A1 is sent to standby node A2, and A2 is instructed to re-run the configuration information to restart the corresponding virtual machine. The configuration information of the virtual machine is information with which the virtual machine can be started, for example, virtual machine software; the virtual machine can be started after the virtual machine software is executed.

Further, the failed standby node may perform the following actions:

Step 610: Standby node A1 sends a ping message to its gateway after finding that its heartbeat information is lost.

Step 611: If the ping fails, that is, no response message corresponding to the ping message is received, standby node A1 powers itself off.
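The second-opinion check of steps 602 to 605 can be sketched from the point of view of the partition's master. The `confirm_standby_failure` function and the `probe` callback are hypothetical; `probe(witness, suspect)` stands in for the ping that standby node A2 sends to standby node A1 and returns True when a response arrives.

```python
from typing import Callable, Optional


def confirm_standby_failure(suspect: str, witness: str,
                            probe: Callable[[str, str], bool]) -> Optional[dict]:
    """Called by the master once the suspect's heartbeat has timed out.

    Returns the fault message to multicast, or None if the witness
    still reaches the suspect and the failure is therefore not confirmed.
    """
    if probe(witness, suspect):
        # Step 604: the witness got a reply, so the result is "alive".
        return None
    # Step 605: both the master and the witness saw no heartbeat.
    return {"type": "fault", "failed_node": suspect}
```

Asking a second node before multicasting reduces the chance of declaring a standby node failed when only the link between that node and its master is broken.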

Referring to the above flow, the corresponding modules may be as follows:

Referring to fig. 7, the present embodiment relates to a common master node 71, a management master node 72, and a standby node 73. For the common master node, the judging unit and the processing unit are the same module, specifically a third heartbeat detection module 711. For the management master node, the judging unit is specifically a fourth heartbeat detection module 721, and the processing unit specifically includes a second high availability module 722 and a second resource management module (ResourceMgmt) 723. For the standby node, the judging unit and the processing unit are the same module, specifically a fifth heartbeat detection module 731.

The third heartbeat detection module 711 is configured to determine that a failed standby node exists after detecting that the heartbeat of any standby node in the partition where the common master node is located stops, and determine that the standby node with the stopped heartbeat is the failed standby node;

the fourth heartbeat detection module 721 is configured to determine that a failed standby node exists after receiving a second fault message, where the second fault message is sent by the common master node after determining that a failed standby node exists, and the second fault message carries identification information of the failed standby node;

the second high availability module 722 is configured to receive a standby node fault message, where the standby node fault message carries the identification information of the failed standby node, reselect a new standby node in the partition where the failed standby node is located, and send a second migration virtual machine request carrying the identification information of the new standby node and the identification information of the failed standby node, where the standby node fault message is sent after the second fault message is received;

the second resource management module 723 is configured to send, according to the second migration virtual machine request message, the identification information of the virtual machine on the failed standby node to the new standby node and restart the virtual machine.

The fifth heartbeat detection module 731 is configured to send heartbeat information while the standby node has not failed and to stop sending it once the standby node has failed, so that the common master node of the partition where the standby node is located can determine from the heartbeat information whether the standby node has failed. It is further configured to perform power-down processing when the standby node itself is the failed standby node; or, when the standby node has not failed and receives a detection request, to detect whether the standby node named in the request has failed and to notify the common master node of the detection result, so that the common master node can reconfirm the failure. The detection request is sent by the common master node after it has received no heartbeat information from a standby node within a certain time, and it carries the identification information of the standby node whose heartbeat has stopped.
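The standby-side behavior can be sketched as a small decision function. `standby_action` and its return labels are hypothetical. The text specifies power-off only for the case where the gateway ping fails; what the node does when the gateway still answers is not stated, so the `"stay_powered"` branch is an assumption.

```python
from typing import Callable


def standby_action(heartbeat_lost: bool,
                   gateway_reachable: Callable[[], bool]) -> str:
    """Decide what a standby node does on one pass of its heartbeat module."""
    if not heartbeat_lost:
        # Normal operation: keep reporting to the partition's master.
        return "send_heartbeat"
    # Steps 610-611: on lost heartbeat, ping the gateway before deciding.
    if gateway_reachable():
        return "stay_powered"  # assumption: behavior not specified in the text
    # No ping response: the node is isolated and powers itself off.
    return "power_off"
```

Powering off an isolated node keeps its virtual machines from running twice once they have been restarted on another standby node.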

The embodiment can realize the expansibility of the cluster system through partitioning. In this embodiment, the standby node and the master node adopt a star architecture, so that the master node can migrate the virtual machine on the failed standby node in time after one standby node fails, thereby improving the availability.

Fig. 8 is a schematic flowchart of a method according to a fourth embodiment of the present invention, and fig. 9 is a schematic diagram of a system structure according to the fourth embodiment of the present invention, where a virtual machine failure is taken as an example in this embodiment.

Referring to fig. 8, the present embodiment includes:

step 81: when the cluster works normally, the virtual machine agent module on each node sends heartbeat information to the heartbeat detection module of the node where the virtual machine agent module is located.

For example, a virtual machine agent module of a certain standby node sends heartbeat information to a heartbeat detection module of the standby node.

Step 82: If the heartbeat detection module of the standby node detects that the heartbeat of the virtual machine has stopped, it sends a fault message to the common master node of the partition where the virtual machine is located.

For example, if the heartbeat detection module on the standby node does not receive the heartbeat information sent by the virtual machine agent module on that node within a certain time, it determines that the heartbeat of the corresponding virtual machine has stopped.

Step 83: After receiving the fault message, the common master node multicasts the fault message, where the fault message carries the identification information of the failed virtual machine.

The above takes a virtual machine failure on a standby node as an example. When a virtual machine on a master node fails, the heartbeat detection module on the master node determines that the virtual machine on the master node has failed after it has not received the heartbeat information sent by the virtual machine agent module within a certain time, and multicasts a fault message.

The above-described failure message may be received by the remaining general master nodes and the management master node.

Step 84: after receiving the fault message, the heartbeat detection module of the management main node sends a virtual machine fault message to the HA module of the management main node, where the virtual machine fault message carries identification information of the faulted virtual machine.

Step 85: the HA module of the management main node sends a virtual machine restarting request to the resource management module of the management main node, wherein the virtual machine restarting request carries identification information of a failed virtual machine.

Step 86: The resource management module of the management master node restarts the virtual machine.

For example, the configuration information of the failed virtual machine is sent again to the node where the virtual machine is located, and that node is instructed to re-run the configuration information to restart the virtual machine. Alternatively, the management master node reselects a node as the target node according to priority, load condition, and the like, sends the configuration information of the failed virtual machine to the target node, and instructs the target node to re-run the configuration information to restart the virtual machine. Specifically, the resource management module of the target node re-runs the configuration information.
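The two restart options of step 86 can be sketched as one planning function. `plan_restart`, the `candidates` mapping of node ID to `(id_priority, load)`, and the returned dictionary layout are hypothetical; as before, ranking by load first and priority second is an assumption, since the text only lists the criteria.

```python
from typing import Dict, Optional, Tuple


def plan_restart(vm_id: str, current_host: str, configs: Dict[str, dict],
                 candidates: Optional[Dict[str, Tuple[int, float]]] = None) -> dict:
    """Return which node should re-run the failed VM's configuration."""
    if candidates:
        # Option 2: reselect a target node (least load, then ID priority).
        target = min(candidates,
                     key=lambda n: (candidates[n][1], candidates[n][0]))
    else:
        # Option 1: restart the VM on the node where it was running.
        target = current_host
    # The management master sends the stored configuration to the target.
    return {"target": target, "vm": vm_id, "config": configs[vm_id]}
```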

Referring to the above flow, the corresponding modules may be as follows:

Referring to Fig. 9, the present embodiment involves a common main node 91, a management main node 92, and a standby node 93. For the common main node, the determining unit is specifically a sixth heartbeat detection module 911, and the processing unit is specifically a fourth resource management module 912. For the management main node, the determining unit is specifically a seventh heartbeat detection module 921, and the processing unit specifically includes a third high availability module 922 and a third resource management module 923. For the standby node, the determining unit specifically includes a virtual machine agent module 931 and an eighth heartbeat detection module 932, and the processing unit is specifically a fifth resource management module 933.

The sixth heartbeat detection module 911 is configured to determine that a failed virtual machine exists after receiving a virtual machine fault message sent by any standby node in the partition where the common main node is located, or after detecting that the heartbeat of a virtual machine on the common main node itself has stopped, and to determine that the virtual machine indicated by the virtual machine fault message, or the virtual machine whose heartbeat has stopped, is the failed virtual machine;

the fourth resource management module 912 is configured to, when a virtual machine on the common main node itself fails, receive configuration information of the failed virtual machine sent by the management main node, and restart the failed virtual machine by re-running the configuration information, where the configuration information of the failed virtual machine is sent by the management main node after receiving a third fault message, the third fault message is sent by the common main node after determining that a failed virtual machine exists, and the third fault message carries identification information of the failed virtual machine.

The seventh heartbeat detecting module 921 is configured to determine, after receiving a third fault message, a virtual machine with a fault, where the third fault message carries identification information of the faulty virtual machine;

the third high availability module 922 is configured to receive a virtual machine fault message and send a virtual machine restart request, where the virtual machine fault message is sent after receiving the third fault message, and the virtual machine fault message and the virtual machine restart request carry identification information of a faulty virtual machine;

the third resource management module 923 is configured to send configuration information of a virtual machine corresponding to the failed virtual machine to a node where the failed virtual machine is located, and instruct the node to rerun the configuration information to restart the failed virtual machine.

The virtual machine agent module 931 is configured to send heartbeat information when the corresponding virtual machine is normal, and not send heartbeat information when a fault occurs;

the eighth heartbeat detection module 932 is configured to determine, according to the sending condition of the heartbeat information, that a virtual machine on the standby node has a failure after detecting that the heartbeat of the virtual machine stops, and determine that the virtual machine with the stopped heartbeat is a failed virtual machine;

the fifth resource management module 933 is configured to receive configuration information of a failed virtual machine sent by a management master node, and rerun the configuration information to restart the failed virtual machine, where the configuration information of the failed virtual machine is sent by the management master node after receiving a third failure message, the third failure message is sent by the ordinary master node after receiving a virtual machine failure message, the third failure message carries identification information of the failed virtual machine, the virtual machine failure message is sent by the standby node after detecting that a heartbeat of a virtual machine on the standby node stops, and the virtual machine failure message carries identification information of the failed virtual machine.
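The interplay between the virtual machine agent module and the heartbeat detection module can be sketched as a timeout check. This is only an illustration under stated assumptions: the names `HeartbeatDetector`, `timeout_s`, and `build_fault_message`, as well as the dictionary layout of the fault message, are hypothetical; the text only requires that a missing heartbeat "within a certain time" marks the virtual machine as failed and that the fault message carries its identification information.

```python
import time
from typing import Callable, Dict, List


class HeartbeatDetector:
    """Tracks the last heartbeat per virtual machine and reports timeouts."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s      # the "certain time" from the text (assumed value)
        self.clock = clock              # injectable clock, useful for testing
        self.last_seen: Dict[str, float] = {}

    def register(self, vm_id: str) -> None:
        # Start monitoring a virtual machine from the current time.
        self.last_seen[vm_id] = self.clock()

    def on_heartbeat(self, vm_id: str) -> None:
        # Called when the virtual machine agent module sends heartbeat information.
        self.last_seen[vm_id] = self.clock()

    def failed_vms(self) -> List[str]:
        # A virtual machine counts as failed once no heartbeat has arrived
        # for longer than timeout_s.
        now = self.clock()
        return sorted(vm for vm, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


def build_fault_message(vm_id: str) -> dict:
    # Hypothetical message layout; only the identification information of the
    # failed virtual machine is taken from the text.
    return {"type": "vm_fault", "vm_id": vm_id}
```

A node would call `failed_vms()` periodically and send one fault message per returned identifier toward its main node.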

This embodiment achieves scalability of the cluster system through partitioning. Because a peer-to-peer architecture is used between the main nodes and a star architecture is used between the standby nodes and the main node of each partition, a virtual machine failure can be detected promptly and the virtual machine restarted, which improves availability.

In summary, by setting partitions, the embodiments of the present invention allow the cluster scale to be expanded by adding partitions; because a plurality of main nodes manage the cluster as peers, the HA bottleneck can be eliminated; because the main nodes synchronize resource state information but not resource utilization information, both the communication cost of fault monitoring and the cost of state synchronization are low; when the heartbeat of a standby node stops, the main node of the partition selects other standby nodes in the partition for arbitration, which reduces misjudgment and improves availability; a peer-to-peer architecture is used between the main nodes, which further enhances the reliability of the main nodes compared with a star architecture; and by making effective use of the standby nodes and migrating virtual machines, resource waste and management overhead can be reduced.
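The arbitration mentioned above can be sketched as follows. This is a hedged illustration rather than the procedure of the embodiment: the function name `arbitrate_standby_failure` and the `probe` callback are hypothetical, and the decision rule (the suspect is treated as failed only if no other standby node can reach it) is an assumption made for the example.

```python
from typing import Callable, Iterable


def arbitrate_standby_failure(
    suspect_id: str,
    other_standby_ids: Iterable[str],
    probe: Callable[[str, str], bool],
) -> bool:
    """Return True if the suspect standby node is confirmed as failed.

    probe(arbiter_id, suspect_id) stands in for sending a detection request
    to an arbiter standby node and receiving its detection result; it returns
    True when the arbiter can still reach the suspect.
    """
    for arbiter_id in other_standby_ids:
        if arbiter_id == suspect_id:
            continue
        if probe(arbiter_id, suspect_id):
            # At least one arbiter still reaches the suspect, so the stopped
            # heartbeat is treated as a misjudgment (assumed rule).
            return False
    # No arbiter could reach the suspect (or none was available).
    return True
```

Under this rule the main node would start reselecting a standby node only when the function returns True.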

It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.

Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for processing a virtualized cluster system, comprising:

2. The method of claim 1, wherein if the node is a normal master node,

judging whether a failed common main node exists comprises: after detecting that the heartbeat of any other common main node has stopped, determining that a failed common main node exists, and determining that the common main node whose heartbeat has stopped is the failed common main node;

after determining that the failed common master node exists, regenerating a new common master node, including:

receiving a first member relation request message, wherein the first member relation request message carries identification information of a new common main node and identification information of a failed common main node, adding the identification information of the new common main node into a first member relation list, and deleting the identification information of the failed common main node in the first member relation list;

3. The method according to claim 1 or 2, wherein, when the node is a normal master node,

judging whether a failed standby node exists comprises: after detecting that the heartbeat of any standby node in the partition where the common main node is located has stopped, determining that a failed standby node exists, and determining that the standby node whose heartbeat has stopped is the failed standby node;

after determining that the failed standby node exists, regenerating the new standby node, including:

receiving a second member relationship request message, wherein the second member relationship request message carries identification information of a new standby node and identification information of a failed standby node, adding the identification information of the new standby node to a second member relationship list, and deleting the identification information of the failed standby node in the second member relationship list;

the new standby node is obtained by reselecting from the standby node in the partition where the failed standby node is located after the management master node receives a second fault message, the second fault message is sent by the common master node after determining that the failed standby node exists, and the second fault message carries identification information of the failed standby node.

4. The method according to claim 1 or 2, wherein, when the node is a normal master node,

judging whether a failed virtual machine exists comprises: after receiving a virtual machine fault message sent by any standby node in the partition where the common main node is located, or after detecting that the heartbeat of a virtual machine on the common main node itself has stopped, determining that a failed virtual machine exists, and determining that the virtual machine indicated by the virtual machine fault message, or the virtual machine whose heartbeat has stopped, is the failed virtual machine;

after determining that there is a failed virtual machine, restarting the virtual machine, including:

when a virtual machine on the common main node itself fails, receiving configuration information of the failed virtual machine sent by the management main node, and re-running the configuration information to restart the failed virtual machine, wherein the configuration information of the failed virtual machine is sent by the management main node after receiving a third fault message, the third fault message is sent by the common main node after determining that a failed virtual machine exists, and the third fault message carries identification information of the failed virtual machine.

5. The method of claim 1, wherein when the node is a management master node,

judging whether a failed common main node exists comprises: after receiving a first fault message, determining that a failed common main node exists, wherein the first fault message is sent by a common main node after determining that a failed common main node exists, and the first fault message carries identification information of the failed common main node;

receiving a main node fault message, wherein the main node fault message carries identification information of the failed common main node, reselecting a new common main node from the standby nodes of the partition where the failed common main node is located, and sending a first migration virtual machine request carrying the identification information of the new common main node and the identification information of the failed common main node, wherein the main node fault message is sent after the first fault message is received;

and sending the identification information of the virtual machine on the failed common main node to the new common main node and restarting the virtual machine according to the first migration virtual machine request message.

6. The method of claim 1 or 5, wherein, when the node is a management master node,

judging whether a failed standby node exists comprises: after receiving a second fault message, determining that a failed standby node exists, wherein the second fault message is sent by a common main node after determining that a failed standby node exists, and the second fault message carries identification information of the failed standby node;

receiving a standby node fault message, wherein the standby node fault message carries identification information of the failed standby node, reselecting a new standby node in the partition where the failed standby node is located, and sending a second migration virtual machine request carrying the identification information of the new standby node and the identification information of the failed standby node, wherein the standby node fault message is sent after the second fault message is received;

and sending the identification information of the virtual machine on the failed standby node to the new standby node and restarting the virtual machine according to the second migration virtual machine request message.

7. The method of claim 1 or 5, wherein, when the node is a management master node,

judging whether a failed virtual machine exists comprises: after receiving a third fault message, determining that a failed virtual machine exists, wherein the third fault message carries identification information of the failed virtual machine;

receiving a virtual machine fault message and sending a virtual machine restarting request, wherein the virtual machine fault message is sent after receiving the third fault message, and the virtual machine fault message and the virtual machine restarting request carry identification information of a fault virtual machine;

and sending the configuration information of the virtual machine corresponding to the fault virtual machine to the node where the fault virtual machine is located, and instructing the node to operate the configuration information again to restart the fault virtual machine.

8. The method of claim 1, wherein when the node is a standby node,

judging whether a failed standby node exists, and regenerating a new standby node after determining that a failed standby node exists, comprises:

sending heartbeat information when the standby node has not failed, and not sending heartbeat information when the standby node has failed, so that the common main node of the partition where the standby node is located determines, according to the heartbeat information, whether the standby node has failed; and performing power-off processing when the standby node is the failed standby node; or, when the standby node is not the failed standby node, after receiving a detection request, detecting whether the corresponding standby node is a failed standby node and notifying the common main node of the detection result, so that the common main node performs standby node reactivation processing, wherein the detection request is sent by the common main node after it has not received heartbeat information of any standby node within a certain time, and the detection request carries identification information of the standby node whose heartbeat has stopped.

9. The method of claim 1 or 8, wherein when the node is a standby node,

judging whether a failed virtual machine exists comprises:

sending heartbeat information when the corresponding virtual machine is normal, and not sending heartbeat information when the corresponding virtual machine is in failure;

determining a virtual machine with a fault after detecting that the heartbeat of the virtual machine on the standby node stops according to the sending condition of the heartbeat information, and determining that the virtual machine with the stopped heartbeat is the fault virtual machine;

receiving configuration information of a fault virtual machine sent by a management main node, and re-operating the configuration information to restart the fault virtual machine, wherein the configuration information of the fault virtual machine is sent by the management main node after receiving a third fault message, the third fault message is sent by a common main node after receiving a virtual machine fault message, the third fault message carries identification information of the fault virtual machine, the virtual machine fault message is sent by a standby node after detecting that a heartbeat of a virtual machine on the standby node stops, and the virtual machine fault message carries the identification information of the fault virtual machine.

10. A processing device for a virtualized cluster system, comprising:

11. The apparatus of claim 10, wherein when the apparatus is a normal master node,

the judging unit includes:

the first heartbeat detection module is used for determining that a failed common main node exists after the heartbeat of any other common main node is detected to stop, and determining that the common main node with the stopped heartbeat is the failed common main node;

the processing unit includes:

a first member relationship management module, configured to receive a first member relationship request message, where the first member relationship request message carries identification information of a new common host node and identification information of a failed common host node, add the identification information of the new common host node to a first member relationship list, and delete the identification information of the failed common host node in the first member relationship list;

12. The apparatus of claim 10, wherein when the apparatus is a management master node,

the judging unit includes:

the second heartbeat detection module is used for determining that a failed common main node exists after receiving a first fault message, wherein the first fault message is sent by the common main node after determining that the failed common main node exists, and the first fault message carries identification information of the failed common main node;

the processing unit includes:

a first high-availability module, configured to receive a master node fault message, where the master node fault message carries identification information of the failed general master node, reselect a new general master node from a standby node in a partition where the failed general master node is located, and send a first migration virtual machine request carrying the identification information of the new general master node and the identification information of the failed general master node, where the master node fault message is sent after receiving the first fault message;

and the first resource management module is used for sending the configuration information of the virtual machine on the failed common main node to the new common main node and restarting the virtual machine according to the first migration virtual machine request message.

13. The apparatus according to claim 10 or 11, wherein, when the apparatus is a normal master node,

the judging unit and the processing unit are positioned in a third heartbeat detection module, and the third heartbeat detection module is used for determining that a failed standby node exists after the heartbeat of any standby node in the partition where the common main node is positioned is detected to stop, and determining that the standby node with the stopped heartbeat is the failed standby node;

14. The apparatus according to claim 10 or 12, wherein, when the apparatus is a management master node,

the judging unit includes:

the fourth heartbeat detection module is used for determining that a failed standby node exists after receiving a second fault message, wherein the second fault message is sent by a common main node after determining that a failed standby node exists, and the second fault message carries identification information of the failed standby node;

the processing unit includes:

a second high availability module, configured to receive a standby node fault message, where the standby node fault message carries identification information of the failed standby node, reselect a new standby node in the partition where the failed standby node is located, and send a second migration virtual machine request carrying the identification information of the new standby node and the identification information of the failed standby node, where the standby node fault message is sent after the second fault message is received;

and the second resource management module is used for sending the identification information of the virtual machine on the failed standby node to the new standby node and restarting the virtual machine according to the second migration virtual machine request message.

15. The apparatus of claim 10, wherein when the apparatus is a standby node,

the judging unit and the processing unit form a fifth heartbeat detection module, and the fifth heartbeat detection module is used for sending heartbeat information when the standby node has not failed, and not sending heartbeat information when the standby node has failed, so that the common main node of the partition where the standby node is located determines, according to the heartbeat information, whether the standby node has failed; and for performing power-off processing when the standby node is the failed standby node; or, when the standby node is not the failed standby node, after receiving a detection request, detecting whether the corresponding standby node is a failed standby node and notifying the common main node of the detection result, so that the common main node performs standby node reactivation processing, wherein the detection request is sent by the common main node after it has not received heartbeat information of any standby node within a certain time, and the detection request carries identification information of the standby node whose heartbeat has stopped.

16. The apparatus according to claim 10 or 11, wherein, when the apparatus is a normal master node,

the judging unit includes:

a sixth heartbeat detection module, configured to determine that a failed virtual machine exists after receiving a virtual machine fault message sent by any standby node in the partition where the common main node is located, or after detecting that the heartbeat of a virtual machine on the common main node itself has stopped, and to determine that the virtual machine indicated by the virtual machine fault message, or the virtual machine whose heartbeat has stopped, is the failed virtual machine;

the processing unit includes:

the fourth resource management module is configured to, when a virtual machine on the common main node itself fails, receive configuration information of the failed virtual machine sent by the management main node, and re-run the configuration information to restart the failed virtual machine, where the configuration information of the failed virtual machine is sent by the management main node after receiving a third fault message, the third fault message is sent by the common main node after determining that a failed virtual machine exists, and the third fault message carries identification information of the failed virtual machine.

17. The apparatus according to claim 10 or 12, wherein, when the apparatus is a management master node,

the judging unit includes:

a seventh heartbeat detection module, configured to determine, after receiving a third fault message, a virtual machine with a fault, where the third fault message carries identification information of the faulty virtual machine;

the processing unit includes:

a third high availability module, configured to receive a virtual machine fault message and send a virtual machine restart request, where the virtual machine fault message is sent after receiving the third fault message, and the virtual machine fault message and the virtual machine restart request carry identification information of a faulty virtual machine;

and the third resource management module is used for sending the configuration information of the virtual machine corresponding to the fault virtual machine to the node where the fault virtual machine is located, and instructing the node to operate the configuration information again to restart the fault virtual machine.

18. The apparatus according to claim 10 or 15, wherein, when the apparatus is a standby node,

the judging unit includes:

the virtual machine agent module is used for sending heartbeat information when the corresponding virtual machine is normal and not sending the heartbeat information when the corresponding virtual machine fails;

an eighth heartbeat detection module, configured to determine, according to the sending condition of the heartbeat information, that a failed virtual machine exists after detecting that the heartbeat of a virtual machine on the standby node has stopped, and to determine that the virtual machine whose heartbeat has stopped is the failed virtual machine;

the processing unit includes:

the fifth resource management module is configured to receive configuration information of a failed virtual machine sent by a management master node, and rerun the configuration information to restart the failed virtual machine, where the configuration information of the failed virtual machine is sent by the management master node after receiving a third failure message, the third failure message is sent by a common master node after receiving a virtual machine failure message, the third failure message carries identification information of the failed virtual machine, the virtual machine failure message is sent by the standby node after detecting that a heartbeat of a virtual machine on the standby node stops, and the virtual machine failure message carries identification information of the failed virtual machine.

19. A virtualized cluster system, comprising:

the main nodes in different partitions adopt a peer-to-peer architecture;

20. The system of claim 19,

the common main node is the apparatus of claim 11; the management main node is the apparatus of claim 12;

or,

the common main node is the apparatus of claim 13; the management main node is the apparatus of claim 14; and the standby node is the apparatus of claim 15;

or,

the common main node is the apparatus of claim 16; the management main node is the apparatus of claim 17; and the standby node is the apparatus of claim 18.