CN104199747A - High-availability system obtaining method and system based on health management - Google Patents
- Publication number: CN104199747A (application CN201410403247.8A)
- Authority
- CN
- China
- Prior art keywords
- node
- health
- control node
- standby
- computing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a method for implementing a high-availability system based on health management. The method includes the following steps. First, at least two control nodes and at least two computing nodes are set up; one control node is configured as the master control node, the other control nodes are configured as standby control nodes, a priority order is configured for the standby control nodes, and the priority of the master control node is set higher than that of any standby control node. Second, each control node repeatedly acquires and stores the health state data of all computing nodes. Third, it is judged whether the current control node is the master control node; if yes, the fifth step is executed, and if not, the fourth step is executed. Fourth, the standby control node with the highest priority detects in real time, through heartbeats, whether the master control node has failed; each lower-priority standby control node detects in real time whether the standby control node one priority level above it has failed; if a control node has failed, the priority order is modified. Fifth, the master control node analyzes the stored health state data and manages the computing nodes according to the analysis result.
Description
Technical field
The present invention relates to high-availability technology in the computer field, and in particular to a method and a system for implementing a high-availability system based on health management.
Background art
The reliability of a computer system is measured by its mean time between failures, that is, how long the system runs normally on average before a failure occurs. The availability of a computer system is defined as the percentage of time during which the system remains up. High availability (HA) describes a system that has been specially designed to reduce downtime and keep its services highly available. The methods currently used to improve the availability of server equipment or software mainly include:
1. Dual-machine cold standby: the primary host is running and the backup machine is waiting. Once the primary host fails, the backup machine immediately starts the services of the primary host.
2. Dual-machine hot standby: this is what is commonly called the active/standby working mode. The active and standby devices have identical hardware configurations and software systems. Once the active device fails, the standby device immediately activates the application services, ensuring that they fully return to normal use within a short time. High-availability (HA) software includes products such as RoseHa.
3. Virtual machine high-availability fault tolerance: servers are virtualized, redundancy and failover are configured on the virtual servers, and heartbeat detection together with live virtual machine migration provides transparent fault-tolerant computing for all applications of the computer system.
The above methods for improving the availability of server equipment or software have the following defects:
1. Dual-machine cold standby and hot standby are mostly designed for one specific service process, that is, they are purpose-built to improve the availability of a single service and do not support fault-tolerant computing for other applications. Application software developed by users must therefore be modified to call the dedicated API of the high-availability software before it can handle faults, which makes it difficult for user-developed software to achieve high availability through commercial high-availability software.
2. Virtual machine high-availability fault tolerance has a relatively large system resource overhead.
Summary of the invention
In view of this, it is necessary to provide a method and a system for implementing a high-availability system based on health management that have a small system resource overhead and support fault-tolerant computing for various applications.
The technical solution adopted by the present invention to solve the technical problem is a method for implementing a high-availability system based on health management, comprising the following steps:
S1: Set up at least two control nodes and at least two computing nodes. Configure one of the control nodes as the master control node, configure the other control nodes as standby control nodes, and configure a priority order for the standby control nodes. Set the priority of the master control node higher than that of any standby control node.
S2: Each control node repeatedly acquires and stores the health state data of all computing nodes.
S3: Judge whether the current control node is the master control node. If it is the master control node, go to step S5; if it is a standby control node, go to step S4.
S4: The standby control node with the highest priority detects in real time, by heartbeat, whether the master control node has failed. Each lower-priority standby control node detects in real time, by heartbeat, whether the standby control node one priority level above it has failed. If a control node has failed, modify the priority order.
S5: The master control node analyzes the health state data stored in its own memory and manages the computing nodes according to the analysis result.
The present invention also provides a system for implementing a high-availability system based on health management, characterized in that it comprises the following modules:
A node configuration module, for setting up at least two control nodes and at least two computing nodes; configuring one of the control nodes as the master control node, configuring the other control nodes as standby control nodes, and configuring a priority order for the standby control nodes; and setting the priority of the master control node higher than that of any standby control node.
A data acquisition module, for having each control node repeatedly acquire and store the health state data of all computing nodes.
A node judging module, for judging whether the current control node is the master control node; if it is the master control node, control passes to the data analysis module; if it is a standby control node, control passes to the fault judging module.
A fault judging module, for having the standby control node with the highest priority detect in real time, by heartbeat, whether the master control node has failed; for having each lower-priority standby control node detect in real time, by heartbeat, whether the standby control node one priority level above it has failed; and for modifying the priority order when a control node fails.
A data analysis module, for having the master control node analyze the health state data stored in its own memory and manage the computing nodes according to the analysis result.
In the method and system provided by the present invention, the number and priority of the control nodes can be configured flexibly. The master control node analyzes the health state data stored in its own memory and manages the computing nodes according to the analysis result, so that progressive degradation faults of computing nodes can be predicted in advance. The health state data is obtained out of band (by bypass) from the computing nodes and does not consume computing node resources such as CPU, memory, or network, which saves system resource overhead compared with dual-machine cold standby, dual-machine hot standby, and virtual machine high-availability fault tolerance.
Brief description of the drawings
Fig. 1 is a flowchart of the method for implementing a high-availability system based on health management according to a preferred embodiment of the present invention;
Fig. 2 is a sub-flowchart of step S2 in Fig. 1;
Fig. 3 is a sub-flowchart of step S4 in Fig. 1;
Fig. 4 is a sub-flowchart of step S5 in Fig. 1;
Fig. 5 is a sub-flowchart of step S58 in Fig. 4;
Fig. 6 is a logical relationship diagram of the high-availability system in an embodiment of the present invention;
Fig. 7 is a workflow diagram of the master control node and the standby control nodes in an embodiment of the present invention;
Fig. 8 is a schematic diagram of the control node takeover sequence in an embodiment of the present invention;
Fig. 9 is a flowchart of health state data collection, analysis, and processing in an embodiment of the present invention;
Fig. 10 is a schematic diagram of the health state data history matrix in an embodiment of the present invention;
Fig. 11 is a structural block diagram of the system for implementing a high-availability system based on health management according to a preferred embodiment of the present invention;
Fig. 12 is a block diagram of the sub-structure of the data acquisition module in Fig. 11;
Fig. 13 is a block diagram of the sub-structure of the fault judging module in Fig. 11;
Fig. 14 is a block diagram of the sub-structure of the data analysis module in Fig. 11;
Fig. 15 is a block diagram of the sub-structure of the computing node management unit in Fig. 14.
Detailed description of the embodiments
As shown in Fig. 1, an embodiment of the present invention provides a method for implementing a high-availability system based on health management, comprising the following steps:
S1: Set up at least two control nodes and at least two computing nodes. Configure one of the control nodes as the master control node, configure the other control nodes as standby control nodes, and configure a priority order for the standby control nodes. Set the priority of the master control node higher than that of any standby control node.
In implementing the invention, control nodes can be added or removed according to the importance of the tasks and the availability requirements of the high-availability system. Preferably, the number of control nodes does not exceed the number of computing nodes, and two or three control nodes are used. Computing nodes can be added or removed according to the task-processing scale of the system; keeping at least two computing nodes is recommended.
For any given computing node, all other computing nodes can be configured as its standby computing nodes, which greatly increases the redundancy of the computing nodes and improves their availability.
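The configuration in step S1 can be pictured as a priority-ordered list of control nodes. The following is only a minimal sketch under stated assumptions: the function name `configure_cluster`, the node names, and the convention that index 0 is the master control node are all hypothetical and do not come from the patent text.

```python
def configure_cluster(control_nodes, computing_nodes):
    """Validate the node counts required by step S1 and return the layout.

    Assumption (not from the patent): the position in `priority_order`
    encodes priority, with index 0 being the master control node.
    """
    if len(control_nodes) < 2:
        raise ValueError("at least two control nodes are required")
    if len(computing_nodes) < 2:
        raise ValueError("at least two computing nodes are required")
    if len(set(control_nodes)) != len(control_nodes):
        raise ValueError("control node names must be unique")
    return {
        "priority_order": list(control_nodes),
        "computing_nodes": list(computing_nodes),
    }


cluster = configure_cluster(["ctl-a", "ctl-b", "ctl-c"], ["cn-1", "cn-2"])
master = cluster["priority_order"][0]      # highest priority
standbys = cluster["priority_order"][1:]   # standby nodes, descending priority
```

Encoding priority as list position keeps the later priority adjustments simple, since removing a failed node shifts every lower-priority node up by one level.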
S2: Each control node repeatedly acquires and stores the health state data of all computing nodes.
The health state data can include types such as the temperature, fan, voltage, and current of a computing node.
The control nodes and computing nodes in the embodiment of the present invention can be servers or computing devices in a network system.
The health state data in the embodiment of the present invention can be obtained out of band (by bypass) through the front-end module (BMC module) of each computing node, without consuming computing node resources such as CPU, memory, or network. Compared with high-availability approaches such as dual-machine hot standby, dual-machine cold standby, and virtual machine fault tolerance, the embodiment therefore has less impact on the performance of the computing nodes themselves.
Optionally, as shown in Fig. 2, step S2 comprises the following sub-steps:
S21: Each control node collects the health state data of all computing nodes once and saves the collected data in its own memory.
S22: Judge whether the number of collections is less than a predetermined number p. If it is less than p, wait for a first preset time T1 and go to step S21. If it is greater than p, delete the surplus historical health state data from the memory of each control node, keep the data of the most recent p collections, and go to step S23. If it equals p, go to step S23.
S23: Each control node arranges the health state data of the n computing nodes from the p collections into a health state data history matrix with p rows and n columns.
For the health state data history matrix of step S23, see Fig. 10.
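Sub-steps S21 to S23 amount to keeping a sliding window of the most recent p samples. The sketch below is one possible way to do this, not the patent's implementation: the class name `HealthHistory` and its methods are hypothetical, and a bounded `collections.deque` stands in for the explicit deletion of surplus history in S22.

```python
from collections import deque


class HealthHistory:
    """Keep the p most recent samples for n computing nodes (hypothetical helper)."""

    def __init__(self, p, n):
        self.p = p
        self.n = n
        # A deque with maxlen=p discards the oldest row automatically,
        # which mirrors deleting surplus historical data in S22.
        self._rows = deque(maxlen=p)

    def record(self, sample):
        """Store one collection: one reading per computing node (S21)."""
        if len(sample) != self.n:
            raise ValueError("expected one reading per computing node")
        self._rows.append(list(sample))

    def is_full(self):
        """True once p collections are available, so analysis may start."""
        return len(self._rows) == self.p

    def matrix(self):
        """Return the p-row, n-column history matrix (S23)."""
        return [list(row) for row in self._rows]

    def column(self, j):
        """Return the history of computing node j, oldest reading first."""
        return [row[j] for row in self._rows]


history = HealthHistory(p=3, n=2)
for sample in ([40, 41], [42, 43], [44, 45], [46, 47]):
    history.record(sample)
```

After the fourth sample is recorded, the first one has been dropped, so the matrix holds only the three most recent rows.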
S3: Judge whether the current control node is the master control node. If it is the master control node, go to step S5; if it is a standby control node, go to step S4.
S4: The standby control node with the highest priority detects in real time, by heartbeat, whether the master control node has failed. Each lower-priority standby control node detects in real time, by heartbeat, whether the standby control node one priority level above it has failed. If a control node has failed, modify the priority order.
Optionally, as shown in Fig. 3, step S4 comprises the following sub-steps:
S41: The standby control node with the highest priority detects in real time, by heartbeat, whether the master control node has failed; each lower-priority standby control node detects in real time, by heartbeat, whether the standby control node one priority level above it has failed.
S42: If the master control node is detected to have failed, the standby control node with the highest priority is configured as the master control node, and the priority of every other standby control node is raised by one level.
S43: If a standby control node is detected to have failed, the priority of every standby control node whose priority is lower than that of the failed node is raised by one level.
This embodiment can change the priority of the control nodes flexibly, and when the master control node fails, the standby control node with the highest priority can be promptly configured as the master control node. This achieves a smooth transition when a control node fails.
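The priority adjustment in S41 to S43 can be sketched with a priority-ordered list. This is an illustration only: the function names `monitored_peer` and `handle_failure`, the node names, and the convention that index 0 is the master control node are assumptions, and the actual heartbeat transport is not modeled.

```python
def monitored_peer(order, node):
    """Return the control node that `node` watches by heartbeat (S41).

    Each standby watches the node one priority level above it; the
    master (index 0) watches nobody, so None is returned for it.
    """
    i = order.index(node)
    return order[i - 1] if i > 0 else None


def handle_failure(order, failed):
    """Return the new priority order after `failed` is detected as down.

    Removing the failed node raises every lower-priority node by one
    level. If the master failed, the highest-priority standby becomes
    the new master (S42); otherwise only the standbys below the failed
    node move up (S43).
    """
    if failed not in order:
        raise ValueError("unknown control node: %r" % (failed,))
    return [node for node in order if node != failed]


order = ["ctl-a", "ctl-b", "ctl-c"]
after_master_failure = handle_failure(order, "ctl-a")
after_standby_failure = handle_failure(order, "ctl-b")
```

In this representation S42 and S43 collapse into a single list operation, which is why the sketch needs no separate branch for the master.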
S5: The master control node analyzes the health state data stored in its own memory and manages the computing nodes according to the analysis result.
Optionally, as shown in Fig. 4, step S5 comprises the following sub-steps:
S51: The master control node analyzes the health state data history matrix column by column; each column of health state data corresponds to one computing node.
S52: Judge whether all computing nodes have been analyzed. If all have been analyzed, go to step S57. If any computing node has not been analyzed, go to step S53.
S53: For the current computing node, count separately the number of collections whose values are above the health reference range and the number of collections whose values are below it.
S54: If the number of collections above the health reference range is less than k1 and the number below it is less than k2, judge the computing node to be healthy and go to step S55. If the number above the range is greater than or equal to k1, or the number below the range is greater than or equal to k2, judge the computing node to be unhealthy and go to step S56.
k1/p represents the tolerance for values above the health reference range (the larger the value, the more conservative; the smaller, the more aggressive), and k2/p represents the tolerance for values below the health reference range (the larger the value, the more conservative; the smaller, the more aggressive). The tolerances can be configured flexibly according to the requirements of the high-availability system.
S55: Set the unhealthy flag value of this computing node to 0 and go to step S51.
S56: Take the maximum of the number of collections above the health reference range and the number of collections below it, set this maximum as the unhealthy flag value of this computing node, and go to step S51.
S57: Analyze the unhealthy flag values of all computing nodes and judge whether any computing node has a non-zero value. If none has, all computing nodes are determined to be healthy; if any has, it is determined that an unhealthy computing node exists.
S58: Manage the computing nodes according to the configured policies, wait for the first preset time T1, and go to step S21.
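The column-by-column analysis in S51 to S57 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the representation of the health reference range as a pair `low` and `high`, and the sample temperatures are hypothetical and not taken from the patent.

```python
def unhealthy_flag(column, low, high, k1, k2):
    """Compute the unhealthy flag value for one computing node (S53 to S56).

    `column` holds the p most recent readings of one node. The health
    reference range is assumed to be the closed interval [low, high].
    """
    above = sum(1 for value in column if value > high)
    below = sum(1 for value in column if value < low)
    if above < k1 and below < k2:
        return 0                  # healthy (S55)
    return max(above, below)      # unhealthy (S56)


def analyze(matrix, low, high, k1, k2):
    """Return one flag per computing node, i.e. per matrix column (S51, S52)."""
    return [unhealthy_flag(col, low, high, k1, k2) for col in zip(*matrix)]


# Four collections (rows) for two computing nodes (columns), e.g. temperatures.
matrix = [
    [45, 82],
    [47, 85],
    [46, 60],
    [44, 88],
]
flags = analyze(matrix, low=10, high=80, k1=3, k2=3)
unhealthy_nodes = [j for j, flag in enumerate(flags) if flag != 0]   # S57
```

In this example the second node exceeds the upper bound in three of four collections, which reaches k1 = 3, so it receives the flag value 3 while the first node keeps the flag value 0.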
Optionally, as shown in Fig. 5, step S58 comprises the following sub-steps:
S581: Configure a first policy, a second policy, and a third policy. The first policy offers three options: raise an alarm, automatically migrate tasks out, or raise an alarm and automatically migrate tasks out at the same time. The second policy offers three options: raise an alarm, shut down automatically, or raise an alarm and shut down automatically at the same time. The third policy offers three options: raise an alarm, automatically migrate tasks back, or raise an alarm and automatically migrate tasks back at the same time.
S582: Judge whether an unhealthy computing node has tasks being processed. If it has, look up the first policy and act according to it. If it has not, look up the second policy and act according to it.
S583: Judge whether a healthy computing node has tasks being processed. If it has, do nothing to this node. If it has not, determine that this node has recovered from the unhealthy state to the healthy state and record the current timestamp. If the current timestamp minus the previous timestamp is less than a second preset time T2, or there is no previous timestamp, do nothing; if the difference is greater than or equal to T2, act according to the third policy.
S584: Wait for the first preset time T1, then go to step S21.
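The policy selection in S582 and S583 reduces to a small decision function. The sketch below is hedged: the function name `select_policy` and the returned labels are hypothetical, and the text does not fully specify how the recorded timestamp is updated between cycles, so storing it is left to the caller here.

```python
def select_policy(healthy, has_tasks, now, last_timestamp, t2):
    """Return "first", "second", "third", or None (no operation).

    healthy        -- True if the node's unhealthy flag value is 0
    has_tasks      -- True if the node currently has tasks being processed
    now            -- current timestamp in seconds
    last_timestamp -- previously recorded timestamp, or None if absent
    t2             -- second preset time T2 in seconds
    """
    if not healthy:
        # S582: unhealthy node with tasks -> first policy (e.g. migrate out),
        # unhealthy node without tasks -> second policy (e.g. shut down).
        return "first" if has_tasks else "second"
    if has_tasks:
        return None               # S583: healthy and busy, leave it alone
    if last_timestamp is None or now - last_timestamp < t2:
        return None               # recovered too recently, or no history yet
    return "third"                # S583: stable recovery, e.g. migrate back


decision = select_policy(healthy=True, has_tasks=False,
                         now=100.0, last_timestamp=60.0, t2=30.0)
```

The waiting period T2 acts as a guard against moving tasks back to a node that has only just returned to the healthy range.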
The embodiment of the present invention triggers task migration between computing nodes by analyzing health state data. Compared with triggering migration automatically when a node crashes, or migrating tasks manually, detection based on health state data is more systematic and more flexible. Before a progressive degradation fault occurs in a computing node, the health state data shows an early response (sub-health), so the ability of the system to sense faults in advance is improved. Compared with other techniques, the embodiment also saves more of the high-availability system's own resource overhead.
The principle of the embodiment of the present invention is described further below with reference to Fig. 6 to Fig. 10.
Fig. 6 is a logical relationship diagram of the high-availability system in the embodiment of the present invention. As can be seen, each control node is associated with all computing nodes, and each lower-priority control node uses heartbeat detection to check whether the control node one priority level above it has failed.
The embodiment is described in detail below with reference to Fig. 7 and Fig. 9.
The workflow of the master control node and the standby control nodes is described with reference to Fig. 7. The detailed process is as follows.
Step 1.1: The control node collects the health state data of all computing nodes, then goes to step 1.2.
Step 1.2: Read the configuration file and judge whether the current control node is the master control node. If it is not (that is, it is a standby control node), go to step 1.3; if it is, go to step 1.5.
Step 1.3: Through the heartbeat detection network, detect whether the master control node or the standby control node with the next higher priority has failed. If a control node has failed, go to step 1.4; if no control node has failed, wait for time T1 and go to step 1.1.
Step 1.4: Change the priority of the control nodes; see Fig. 8, which shows the control node takeover sequence in the embodiment of the present invention. If the master control node has failed, the first standby control node, which has the highest priority, becomes the master control node. If a standby control node has failed, every standby control node with a lower priority than the failed node is automatically raised by one priority level. After the priorities have been modified, go to step 1.2.
Step 1.5: Analyze and process the health state data; see Fig. 9.
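Steps 1.1 to 1.5 can be sketched as a single cycle of a control node. This is an illustration under stated assumptions: `control_cycle` and its callback parameters `collect`, `is_alive`, and `analyze` are hypothetical names, the priority order is modeled as a list with the master control node at index 0, and the wait of T1 between cycles is left to the caller.

```python
def control_cycle(me, order, collect, is_alive, analyze):
    """Run one cycle of steps 1.1 to 1.5 for control node `me`.

    Returns the (possibly updated) priority order and the role that
    `me` ended the cycle in: "master" or "standby".
    """
    collect()                                   # step 1.1
    order = list(order)                         # do not mutate the caller's list
    while True:
        if order[0] == me:                      # step 1.2: am I the master?
            analyze()                           # step 1.5
            return order, "master"
        superior = order[order.index(me) - 1]   # step 1.3: node one level above
        if is_alive(superior):
            return order, "standby"             # caller waits T1, then repeats
        order.remove(superior)                  # step 1.4, then back to step 1.2


alive = {"ctl-a": False, "ctl-b": True, "ctl-c": True}
log = []
new_order, role = control_cycle(
    "ctl-b",
    ["ctl-a", "ctl-b", "ctl-c"],
    collect=lambda: log.append("collect"),
    is_alive=lambda node: alive[node],
    analyze=lambda: log.append("analyze"),
)
```

Here `ctl-b` finds that the master `ctl-a` no longer responds, takes over as master, and runs the analysis in the same cycle, which matches the jump from step 1.4 back to step 1.2.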
The workflow for analyzing and processing health state data is described with reference to Fig. 9. The detailed process is as follows. This workflow applies to health state data types such as temperature, fan, voltage, and current.
Step 2.1: Collect the health state data of all current computing nodes, then go to step 2.2.
Step 2.2: If the number of collections is less than p, the collected history is insufficient; wait for time T1 and go to step 2.1. If the number of collected time points is greater than or equal to p, go to step 2.3.
Step 2.3: Delete the surplus historical data and keep only the health state data of the most recent p collections. The data of the n computing nodes at the p time points now form a health state data history matrix with p rows and n columns; see Fig. 10. Each row of the matrix corresponds to a time point and each column corresponds to a computing node. Go to step 2.4.
Step 2.4: Analyze the health state data history matrix column by column; each column corresponds to one computing node. Go to step 2.5.
Step 2.5: If any computing node has not yet been analyzed, go to step 2.6. If all computing nodes have been analyzed, the analysis of the current collection is complete; go to step 2.10.
Step 2.6: Count the number of collections in the current column that are above the health reference range, and count the number of collections in the current column that are below it, then go to step 2.7.
Step 2.7: If the number of time points above the health reference range is less than k1 and the number of time points below it is less than k2, the computing node is still judged to be healthy; go to step 2.8. If the number of time points above the range is greater than or equal to k1, or the number below the range is greater than or equal to k2, the computing node has repeatedly been above or below the health reference range and the deviation is persistent; judge the computing node to be unhealthy and go to step 2.9.
Step 2.8: Set the unhealthy flag value of the current computing node to 0, then go to step 2.4.
Step 2.9: Take the maximum of the number of collections above the health reference range and the number of time points below it as the unhealthy flag value of this computing node, then go to step 2.4.
Step 2.10: Analyze the unhealthy flag values of all computing nodes. If no computing node has a non-zero value, all computing nodes are healthy; if any computing node has a non-zero value, an unhealthy computing node exists. Go to step 2.11.
Step 2.11: The first policy offers three options: raise an alarm, automatically migrate tasks out, or raise an alarm and automatically migrate tasks out at the same time. The second policy offers three options: raise an alarm, shut down automatically, or raise an alarm and shut down automatically at the same time. The third policy offers three options: raise an alarm, automatically migrate tasks back, or raise an alarm and automatically migrate tasks back at the same time. Operate on the computing nodes as the configured policies require:
Check whether an unhealthy computing node has tasks being processed. If it has, look up the first policy and act according to it; if it has not, look up the second policy and act according to it.
Check whether a healthy computing node has tasks being processed. If it has, do nothing. If it has not, the computing node has recovered from the unhealthy state to the healthy state; record the current timestamp. If the current timestamp minus the previous timestamp is less than the second preset time T2, or there is no previous timestamp, do nothing; if the difference is greater than or equal to T2, act according to the third policy.
Wait for time T1 and return to step 2.1.
The parameters involved in the analysis and processing of health state data are defined as follows:
n: the number of participating computing nodes;
p: the number of health state data collections;
k1: the tolerated number of time points above the health reference range; this value is not greater than p, and k1/p represents the tolerance for values above the health reference range (the larger the value, the more conservative; the smaller, the more aggressive);
k2: the tolerated number of time points below the health reference range; this value is not greater than p, and k2/p represents the tolerance for values below the health reference range (the larger the value, the more conservative; the smaller, the more aggressive);
T1: the period of each round of health state data analysis and processing; this value is not less than the period of health state data collection, otherwise duplicate data would be collected.
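The constraints stated in these definitions can be checked when the parameters are loaded. The sketch below is illustrative only: the class `AnalysisParams`, the field `collection_period`, and the lower bound of 1 for k1 and k2 are assumptions; the text itself only states that k1 and k2 are not greater than p and that T1 is not less than the collection period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisParams:
    n: int                    # number of participating computing nodes
    p: int                    # number of collections kept in the history
    k1: int                   # tolerated count above the health reference range
    k2: int                   # tolerated count below the health reference range
    t1: float                 # analysis period T1, in seconds
    collection_period: float  # hypothetical: period of data collection, in seconds

    def __post_init__(self):
        # The lower bound of 1 is an assumption; the text only gives k <= p.
        if not (1 <= self.k1 <= self.p and 1 <= self.k2 <= self.p):
            raise ValueError("k1 and k2 must lie between 1 and p")
        if self.t1 < self.collection_period:
            raise ValueError("T1 must not be less than the collection period")

    @property
    def tolerance_above(self):
        """k1/p: tolerance for readings above the reference range."""
        return self.k1 / self.p

    @property
    def tolerance_below(self):
        """k2/p: tolerance for readings below the reference range."""
        return self.k2 / self.p


params = AnalysisParams(n=4, p=10, k1=3, k2=5, t1=60.0, collection_period=30.0)
```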
As shown in Fig. 11, an embodiment of the present invention also provides a system for implementing a high-availability system based on health management, comprising the following modules:
Node configuration module 10, for setting up at least two control nodes and at least two computing nodes; for configuring one of the control nodes as the master control node, configuring the other control nodes as standby control nodes, and configuring a priority order for the standby control nodes; and for setting the priority of the master control node higher than that of any standby control node.
Data acquisition module 20, for having each control node repeatedly acquire and store the health state data of all computing nodes.
Preferably, as shown in Fig. 12, the data acquisition module 20 comprises the following units:
Data acquisition unit 21, for having each control node collect the health state data of all computing nodes once every fixed time T1 and save the collected data in its own memory.
Collection count judging unit 22, for judging whether the number of collections is less than the predetermined number p. When the number of collections is less than p, it triggers the data acquisition unit 21. When the number of collections is greater than p, it deletes the surplus historical health state data from the memory of each control node and keeps the data of the most recent p collections. When the number of collections equals p, it keeps the data of the most recent p collections as they are.
Matrix configuration unit 23, for arranging the health state data of the n computing nodes collected p times by each control node into a health state data history matrix with p rows and n columns.
Node judging module 30, configured to judge whether the current control node is the master control node; if the current control node is the master control node, control passes to data analysis module 50. If the current control node is a standby control node, control passes to fault judging module 40.
Fault judging module 40, configured so that the standby control node with the highest priority detects in real time, by heartbeat, whether the master control node has failed, and so that each lower-priority standby control node detects in real time, by heartbeat, whether the standby control node one priority level above it has failed. When a control node fails, the priority order is modified.
Preferably, as shown in Figure 13, the fault judging module 40 comprises the following units:
Fault judging unit 41, configured so that the standby control node with the highest priority detects in real time, by heartbeat, whether the master control node has failed, and so that each lower-priority standby control node detects in real time, by heartbeat, whether the standby control node one priority level above it has failed.
First adjustment unit 42, configured to, when a failure of the master control node is detected, configure the standby control node with the highest priority as the master control node and raise the priority of each of the other standby control nodes by one level.
Second adjustment unit 43, configured to, when a failure of a standby control node is detected, raise by one level the priority of every standby control node whose priority is lower than that of the failed standby control node.
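The two adjustment rules can be illustrated with a single ordered list in which position 0 is the master control node and position i is the standby control node of priority level i. The following Python sketch assumes this representation; the class `ControlChain` and its method names are hypothetical, and heartbeat transport is left out entirely.

```python
from typing import List, Optional


class ControlChain:
    """Hypothetical model: order[0] is the master, order[1] the highest-priority standby."""

    def __init__(self, master: str, standbys: List[str]) -> None:
        self.order = [master] + list(standbys)

    @property
    def master(self) -> Optional[str]:
        return self.order[0] if self.order else None

    def priority(self, node: str) -> int:
        """0 for the master, 1 for the highest-priority standby, and so on."""
        return self.order.index(node)

    def watched_by(self, node: str) -> Optional[str]:
        """The node whose heartbeat `node` monitors: the one directly above it."""
        i = self.order.index(node)
        return None if i == 0 else self.order[i - 1]

    def report_failure(self, failed: str) -> None:
        """Remove a failed control node; every node below it moves up one level.

        If the master fails, the highest-priority standby lands in position 0 and
        so becomes the new master (first adjustment rule). If a standby fails,
        only the standbys below it are promoted (second adjustment rule).
        """
        if failed in self.order:
            self.order.remove(failed)
```

Because a removal shifts every later element forward, both rules reduce to the same list operation, and each surviving node's monitoring target is again simply its new predecessor.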
Data analysis module 50, configured so that the master control node analyzes the health status data stored in its memory and manages the computing nodes according to the analysis result.
Preferably, as shown in Figure 14, the data analysis module 50 comprises the following units:
Analysis unit 51, configured so that the master control node analyzes the health status data history matrix column by column, each column of health status data corresponding to one computing node.
Analysis judging unit 52, configured to judge whether all computing nodes have been analyzed once; when all computing nodes have been analyzed, it starts the function of computing node judging unit 57. When a computing node has not yet been analyzed, it starts the function of statistics unit 53.
Statistics unit 53, configured to count, for the current computing node, the number of acquisitions above the health status reference range and the number of acquisitions below the health status reference range.
State judging unit 54, configured to judge that the computing node is in a healthy state when the number of acquisitions above the health status reference range is less than k1 and the number of acquisitions below the health status reference range is less than k2, and to start the function of first flag value setting unit 55. When the number of acquisitions above the health status reference range is greater than or equal to k1, or the number of acquisitions below the health status reference range is greater than or equal to k2, it judges that the computing node is in an unhealthy state and starts the function of second flag value setting unit 56.
First flag value setting unit 55, configured to set the unhealthy flag value of the computing node to 0 and to start the function of analysis unit 51.
Second flag value setting unit 56, configured to take the maximum of the number of acquisitions above the health status reference range and the number of acquisitions below the health status reference range, to set this maximum as the unhealthy flag value of the computing node, and to start the function of analysis unit 51.
Computing node judging unit 57, configured to examine the unhealthy flag values of all computing nodes and judge whether any computing node has a non-zero value. When no computing node has a non-zero value, it determines that all computing nodes are healthy; when a computing node has a non-zero value, it determines that an unhealthy computing node exists.
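The column-by-column analysis performed by units 51 to 57 can be sketched as below. The sketch makes several assumptions that the text does not state: each reading is a single number, one reference range `[low, high]` applies to every computing node, and k1 and k2 are at least 1. The function names are hypothetical.

```python
from typing import List, Sequence


def unhealthy_flag(samples: Sequence[float], low: float, high: float,
                   k1: int, k2: int) -> int:
    """Return 0 for a healthy node, otherwise max(count above range, count below range)."""
    above = sum(1 for v in samples if v > high)  # acquisitions above the reference range
    below = sum(1 for v in samples if v < low)   # acquisitions below the reference range
    if above < k1 and below < k2:
        return 0
    return max(above, below)


def analyse(matrix: Sequence[Sequence[float]], low: float, high: float,
            k1: int, k2: int) -> List[int]:
    """Analyze a p-by-n history matrix column by column; one flag value per computing node."""
    n = len(matrix[0]) if matrix else 0
    return [
        unhealthy_flag([row[j] for row in matrix], low, high, k1, k2)
        for j in range(n)
    ]


def any_unhealthy(flags: Sequence[int]) -> bool:
    """True when at least one computing node carries a non-zero flag value."""
    return any(f != 0 for f in flags)
```

With k1 and k2 at least 1, an unhealthy node always receives a non-zero flag value, so the final non-zero check is sufficient to separate the two states.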
Computing node management unit 58, configured to manage the computing nodes according to the configured strategies, and to start the function of data acquisition unit 21 after waiting for the first preset time T1.
Preferably, as shown in Figure 15, the computing node management unit 58 comprises the following subunits:
Strategy setting subunit 581, configured to set a first strategy, a second strategy and a third strategy. The first strategy comprises three options: raising an alarm, automatically migrating tasks out, and raising an alarm while automatically migrating tasks out. The second strategy comprises three options: raising an alarm, automatically shutting down, and raising an alarm while automatically shutting down. The third strategy comprises three options: raising an alarm, automatically migrating tasks back, and raising an alarm while automatically migrating tasks back.
First task judging subunit 582, configured to judge whether an unhealthy computing node has tasks being processed; when the unhealthy computing node has tasks being processed, it looks up the first strategy and handles the node according to the first strategy. When the unhealthy computing node has no tasks being processed, it looks up the second strategy and handles the node according to the second strategy.
Second task judging subunit 583, configured to judge whether a healthy computing node has tasks being processed; when the healthy computing node has tasks being processed, no operation is performed on it. When the healthy computing node has no tasks being processed, it determines that this computing node has recovered from the unhealthy state to the healthy state and starts timing for it; while the elapsed time is less than the second preset time T2 no operation is performed, and once the elapsed time reaches T2 the node is handled according to the third strategy.
Jump subunit 584, configured to start the function of data acquisition unit 21 after waiting for the first preset time T1.
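The decision made by subunits 582 and 583 amounts to a small lookup on three inputs: the unhealthy flag value, whether the node has tasks, and how long a task-free healthy node has been timed. A minimal Python sketch follows, assuming the elapsed time is supplied by the caller in the same unit as T2; the enum `Strategy` and the function `select_strategy` are illustrative names only, and which of the three options within a strategy is applied is left to configuration.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    FIRST = "alarm and/or migrate tasks out"
    SECOND = "alarm and/or shut down"
    THIRD = "alarm and/or migrate tasks back"


def select_strategy(flag: int, has_tasks: bool,
                    idle_elapsed: float, t2: float) -> Optional[Strategy]:
    """Pick the strategy for one computing node, or None when no action is needed.

    flag         -- unhealthy flag value (0 means healthy)
    has_tasks    -- whether the node currently has tasks being processed
    idle_elapsed -- time a healthy, task-free node has been timed so far
    t2           -- the second preset time T2
    """
    if flag != 0:
        # Unhealthy node: move work away if there is any, otherwise shut down.
        return Strategy.FIRST if has_tasks else Strategy.SECOND
    if has_tasks:
        # Healthy and busy: leave it alone.
        return None
    # Healthy and idle: treated as recovered; act only once T2 has elapsed.
    return Strategy.THIRD if idle_elapsed >= t2 else None
```

Waiting for T2 before migrating tasks back acts as a stabilization window, so a node that briefly looks healthy is not immediately given work again.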
In the health-management-based high-availability system implementation system provided by the invention, the number and priorities of the control nodes can be configured flexibly, and fault-tolerant computing for other applications is supported. The master control node analyzes the health status data stored in its memory and manages the computing nodes according to the analysis result, so that gradually developing (trend-type) failures of computing nodes can be predicted in advance. The health status data is obtained through a bypass of the computing nodes and does not occupy their CPU, memory, network or other resources; compared with dual-machine cold standby, dual-machine hot standby and virtual machine high-availability fault-tolerance techniques, this saves system resource overhead.
The above device embodiments correspond one-to-one to the method embodiments; where a device embodiment is described only briefly, refer to the corresponding method embodiment.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be understood by reference to one another.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such implementations should not be regarded as going beyond the scope of the invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory, main memory, read-only memory, electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The embodiments of the invention have been described above with reference to the accompanying drawings, but the invention is not limited to the specific embodiments described, which are merely illustrative and not restrictive. Guided by the invention, those of ordinary skill in the art can devise many further forms without departing from the purpose of the invention and the scope protected by the claims, and all of these fall within the protection of the invention.
Claims (10)
1. A health-management-based high-availability system implementation method, characterized in that the method comprises the following steps:
S1: setting up at least two control nodes and at least two computing nodes; configuring one of the control nodes as the master control node and the other control nodes as standby control nodes, and configuring a priority order for the standby control nodes; the priority of the master control node being set higher than that of any standby control node;
S2: each control node repeatedly acquiring and storing the health status data of all computing nodes;
S3: judging whether the current control node is the master control node; if the current control node is the master control node, jumping to step S5; if the current control node is a standby control node, jumping to step S4;
S4: the standby control node with the highest priority detecting in real time, by heartbeat, whether the master control node has failed; each lower-priority standby control node detecting in real time, by heartbeat, whether the standby control node one priority level above it has failed; if a control node has failed, modifying the priority order;
S5: the master control node analyzing the health status data stored in its memory and managing the computing nodes according to the analysis result.
2. The health-management-based high-availability system implementation method according to claim 1, characterized in that step S2 comprises the following sub-steps:
S21: each control node acquiring the health status data of all computing nodes once and saving the acquired health status data in its own memory;
S22: judging whether the number of acquisitions is less than a predetermined number p; if the number of acquisitions is less than p, jumping to step S21 after waiting for a first preset time T1; if the number of acquisitions is greater than p, deleting the surplus historical health status data from the memory of each control node, retaining the health status data of the most recent p acquisitions, and jumping to step S23; if the number of acquisitions equals p, jumping to step S23;
S23: arranging the health status data of the n computing nodes, as collected by each control node over the p acquisitions, into a health status data history matrix of p rows and n columns.
3. The health-management-based high-availability system implementation method according to claim 2, characterized in that step S4 comprises the following sub-steps:
S41: the standby control node with the highest priority detecting in real time, by heartbeat, whether the master control node has failed; each lower-priority standby control node detecting in real time, by heartbeat, whether the standby control node one priority level above it has failed;
S42: if a failure of the master control node is detected, configuring the standby control node with the highest priority as the master control node and raising the priority of each of the other standby control nodes by one level;
S43: if a failure of a standby control node is detected, raising by one level the priority of every standby control node whose priority is lower than that of the failed standby control node.
4. The health-management-based high-availability system implementation method according to claim 3, characterized in that step S5 comprises the following sub-steps:
S51: the master control node analyzing the health status data history matrix column by column, each column of health status data corresponding to one computing node;
S52: judging whether all computing nodes have been analyzed once; if all computing nodes have been analyzed, jumping to step S57; if a computing node has not yet been analyzed, jumping to step S53;
S53: counting, for the current computing node, the number of acquisitions above the health status reference range and the number of acquisitions below the health status reference range;
S54: if the number of acquisitions above the health status reference range is less than k1 and the number of acquisitions below the health status reference range is less than k2, judging that the computing node is in a healthy state and jumping to step S55; if the number of acquisitions above the health status reference range is greater than or equal to k1, or the number of acquisitions below the health status reference range is greater than or equal to k2, judging that the computing node is in an unhealthy state and jumping to step S56;
S55: setting the unhealthy flag value of the computing node to 0 and jumping to step S51;
S56: taking the maximum of the number of acquisitions above the health status reference range and the number of acquisitions below the health status reference range, setting this maximum as the unhealthy flag value of the computing node, and jumping to step S51;
S57: examining the unhealthy flag values of all computing nodes and judging whether any computing node has a non-zero value; if no computing node has a non-zero value, determining that all computing nodes are healthy; if a computing node has a non-zero value, determining that an unhealthy computing node exists;
S58: managing the computing nodes according to the configured strategies, and jumping to step S21 after waiting for the first preset time T1.
5. The health-management-based high-availability system implementation method according to claim 4, characterized in that step S58 comprises the following sub-steps:
S581: setting a first strategy, a second strategy and a third strategy; the first strategy comprising three options: raising an alarm, automatically migrating tasks out, and raising an alarm while automatically migrating tasks out; the second strategy comprising three options: raising an alarm, automatically shutting down, and raising an alarm while automatically shutting down; the third strategy comprising three options: raising an alarm, automatically migrating tasks back, and raising an alarm while automatically migrating tasks back;
S582: judging whether an unhealthy computing node has tasks being processed; if the unhealthy computing node has tasks being processed, looking up the first strategy and handling the node according to the first strategy; if the unhealthy computing node has no tasks being processed, looking up the second strategy and handling the node according to the second strategy;
S583: judging whether a healthy computing node has tasks being processed; if the healthy computing node has tasks being processed, performing no operation on it; if the healthy computing node has no tasks being processed, determining that this computing node has recovered from the unhealthy state to the healthy state and recording the current timestamp; if the difference between the current timestamp and the previous timestamp is less than a second preset time T2, or there is no previous timestamp, performing no operation; if the difference is greater than or equal to the second preset time T2, handling the node according to the third strategy;
S584: jumping to step S21 after waiting for the first preset time T1.
6. A health-management-based high-availability system implementation system, characterized in that the system comprises the following modules:
Node configuration module, configured to set up at least two control nodes and at least two computing nodes; to configure one of the control nodes as the master control node and the other control nodes as standby control nodes, and to configure a priority order for the standby control nodes; the priority of the master control node being set higher than that of any standby control node;
Data acquisition module, configured so that each control node repeatedly acquires and stores the health status data of all computing nodes;
Node judging module, configured to judge whether the current control node is the master control node; if the current control node is the master control node, control passes to the data analysis module; if the current control node is a standby control node, control passes to the fault judging module;
Fault judging module, configured so that the standby control node with the highest priority detects in real time, by heartbeat, whether the master control node has failed; so that each lower-priority standby control node detects in real time, by heartbeat, whether the standby control node one priority level above it has failed; and to modify the priority order when a control node fails;
Data analysis module, configured so that the master control node analyzes the health status data stored in its memory and manages the computing nodes according to the analysis result.
7. The health-management-based high-availability system implementation system according to claim 6, characterized in that the data acquisition module comprises the following units:
Data acquisition unit, configured so that each control node acquires the health status data of all computing nodes once every fixed interval T1 and saves the acquired health status data in its own memory;
Acquisition count judging unit, configured to judge whether the number of acquisitions is less than a predetermined number p; when the number of acquisitions is less than p, to start the function of the data acquisition unit; when the number of acquisitions is greater than p, to delete the surplus historical health status data from the memory of each control node and retain the health status data of the most recent p acquisitions; when the number of acquisitions equals p, to directly retain the health status data of the most recent p acquisitions;
Matrix configuration unit, configured to arrange the health status data of the n computing nodes, as collected by each control node over the p acquisitions, into a health status data history matrix of p rows and n columns.
8. The health-management-based high-availability system implementation system according to claim 7, characterized in that the fault judging module comprises the following units:
Fault judging unit, configured so that the standby control node with the highest priority detects in real time, by heartbeat, whether the master control node has failed, and so that each lower-priority standby control node detects in real time, by heartbeat, whether the standby control node one priority level above it has failed;
First adjustment unit, configured to, when a failure of the master control node is detected, configure the standby control node with the highest priority as the master control node and raise the priority of each of the other standby control nodes by one level;
Second adjustment unit, configured to, when a failure of a standby control node is detected, raise by one level the priority of every standby control node whose priority is lower than that of the failed standby control node.
9. The health-management-based high-availability system implementation system according to claim 8, characterized in that the data analysis module comprises the following units:
Analysis unit, configured so that the master control node analyzes the health status data history matrix column by column, each column of health status data corresponding to one computing node;
Analysis judging unit, configured to judge whether all computing nodes have been analyzed once; when all computing nodes have been analyzed, to start the function of the computing node judging unit; when a computing node has not yet been analyzed, to start the function of the statistics unit;
Statistics unit, configured to count, for the current computing node, the number of acquisitions above the health status reference range and the number of acquisitions below the health status reference range;
State judging unit, configured to judge that the computing node is in a healthy state when the number of acquisitions above the health status reference range is less than k1 and the number of acquisitions below the health status reference range is less than k2, and to start the function of the first flag value setting unit; and, when the number of acquisitions above the health status reference range is greater than or equal to k1, or the number of acquisitions below the health status reference range is greater than or equal to k2, to judge that the computing node is in an unhealthy state and to start the function of the second flag value setting unit;
First flag value setting unit, configured to set the unhealthy flag value of the computing node to 0 and to start the function of the analysis unit;
Second flag value setting unit, configured to take the maximum of the number of acquisitions above the health status reference range and the number of acquisitions below the health status reference range, to set this maximum as the unhealthy flag value of the computing node, and to start the function of the analysis unit;
Computing node judging unit, configured to examine the unhealthy flag values of all computing nodes and judge whether any computing node has a non-zero value; when no computing node has a non-zero value, to determine that all computing nodes are healthy; when a computing node has a non-zero value, to determine that an unhealthy computing node exists;
Computing node management unit, configured to manage the computing nodes according to the configured strategies, and to start the function of the data acquisition unit after waiting for the first preset time T1.
10. The health-management-based high-availability system implementation system according to claim 9, characterized in that the computing node management unit comprises the following subunits:
Strategy setting subunit, configured to set a first strategy, a second strategy and a third strategy; the first strategy comprising three options: raising an alarm, automatically migrating tasks out, and raising an alarm while automatically migrating tasks out; the second strategy comprising three options: raising an alarm, automatically shutting down, and raising an alarm while automatically shutting down; the third strategy comprising three options: raising an alarm, automatically migrating tasks back, and raising an alarm while automatically migrating tasks back;
First task judging subunit, configured to judge whether an unhealthy computing node has tasks being processed; when the unhealthy computing node has tasks being processed, to look up the first strategy and handle the node according to the first strategy; when the unhealthy computing node has no tasks being processed, to look up the second strategy and handle the node according to the second strategy;
Second task judging subunit, configured to judge whether a healthy computing node has tasks being processed; when the healthy computing node has tasks being processed, to perform no operation on it; when the healthy computing node has no tasks being processed, to determine that this computing node has recovered from the unhealthy state to the healthy state and to record the current timestamp; if the difference between the current timestamp and the previous timestamp is less than the second preset time T2, or there is no previous timestamp, to perform no operation; if the difference is greater than or equal to the second preset time T2, to handle the node according to the third strategy;
Jump subunit, configured to start the function of the data acquisition unit after waiting for the first preset time T1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410403247.8A CN104199747B (en) | 2014-08-15 | 2014-08-15 | High-availability system obtaining method and system based on health management |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104199747A (en) | 2014-12-10 |
CN104199747B (en) | 2017-05-03 |
Family
ID=52085044
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410403247.8A Active CN104199747B (en) | 2014-08-15 | 2014-08-15 | High-availability system obtaining method and system based on health management |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104199747B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1512375A (en) * | 2002-12-31 | 2004-07-14 | 联想(北京)有限公司 | Fault-tolerance approach using machine group node interacting buckup |
CN1588341A (en) * | 2004-08-11 | 2005-03-02 | 北京四方继保自动化股份有限公司 | Method for realizing multiple spare part of key application module in power automatic system |
US7287180B1 (en) * | 2003-03-20 | 2007-10-23 | Info Value Computing, Inc. | Hardware independent hierarchical cluster of heterogeneous media servers using a hierarchical command beat protocol to synchronize distributed parallel computing systems and employing a virtual dynamic network topology for distributed parallel computing system |
CN103152434A (en) * | 2013-03-27 | 2013-06-12 | 江苏辰云信息科技有限公司 | Leader node replacing method of distributed cloud system |
CN103838635A (en) * | 2012-11-23 | 2014-06-04 | 中国银联股份有限公司 | Host computer health degree detecting method |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590016A (en) * | 2017-09-14 | 2018-01-16 | 成都西加云杉科技有限公司 | Recognition method and device for restart after power failure |
CN107590016B (en) * | 2017-09-14 | 2023-08-15 | 成都西加云杉科技有限公司 | Power-down restarting identification method and device |
CN108062271A (en) * | 2018-01-04 | 2018-05-22 | 联想(北京)有限公司 | Collecting method and system |
CN111427959A (en) * | 2020-03-27 | 2020-07-17 | 北京贝斯平云科技有限公司 | Data storage method and device |
CN111427959B (en) * | 2020-03-27 | 2024-04-23 | 北京贝斯平云科技有限公司 | Data storage method and device |
CN112131088A (en) * | 2020-09-29 | 2020-12-25 | 北京计算机技术及应用研究所 | High availability method based on health examination and container |
CN112131088B (en) * | 2020-09-29 | 2024-04-09 | 北京计算机技术及应用研究所 | High availability method based on health examination and container |
Also Published As
Publication number | Publication date |
---|---|
CN104199747B (en) | 2017-05-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||