JP5285044B2

JP5285044B2 - Cluster system recovery method, server, and program

Info

Publication number: JP5285044B2
Application number: JP2010252890A
Authority: JP
Inventors: 崇幸田中; 充智今崎; 敬志斉藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-06-04
Filing date: 2010-11-11
Publication date: 2013-09-11
Anticipated expiration: 2030-11-11
Also published as: JP2012014673A

Description

The present invention relates to a cluster system recovery method, a server, and a program. More particularly, it relates to a cluster system recovery method, server, and program for detecting the failure of a failed server and recovering it in a system that operates a plurality of server systems in cooperation as a single system.

As services grow in importance, demand is rising for systems with little downtime. High-availability cluster software such as Heartbeat and Pacemaker has therefore been developed. It builds a redundant cluster system out of multiple servers and automatically switches servers when a failure occurs, so that the service can continue (see Non-Patent Document 1).

High-availability cluster software monitors the resources on a server, the network, the shared disk, and so on. When it detects a failure on the server that is running the service, it switches to a server that has been standing by and continues the service.

FIG. 1 shows a schematic diagram of a cluster system using high-availability cluster software. The cluster system has a plurality of servers (an active machine and a spare machine) connected to a network, and a shared disk used in common by these servers.

The active machine and the spare machine each have an operating system (OS), high-availability cluster software, and resources, which are the components needed to provide a service. The high-availability cluster software detects a failure on the active machine and automatically switches to the spare machine when one occurs. The service operating state, resource operating state, and failure state of a server are stored in the state management information storage unit on the internal disk. Detailed information such as the failure location is stored in the log storage unit on the internal disk, and the management information on the server's cluster state, including the failure state, is stored in the state management information storage unit.

The active machine and the spare machine are connected to a network called the service LAN, over which they provide resource-based services to clients. They are also connected to a network called the interconnect LAN, over which they exchange information such as the service operating state, resource operating state, and failure state of each server. In addition, they are connected to a network called the management LAN, over which they can accept commands from a maintenance terminal.

The high-availability cluster software on the active machine and the spare machine can also include a forced power-off function that forcibly cuts the power of the other server when a failure occurs. The forced power-off function cuts the other server's power by sending a power-off instruction to that server's hardware control board via the management LAN.

The shared disk is a storage device that holds the data used to provide the service, so that the service stays consistent. Because of the shared disk, the service can continue with the same data even after switching from the active machine to the spare machine.

Because the high-availability cluster software monitors resources for failures in this way, the service can be continued on the spare machine when a resource failure occurs. After system switching to the spare machine, the service continues on the spare machine (see Patent Document 1).

Patent Document 1: Japanese Patent No. 4353005, "System Switching Method for Cluster Configuration Computer System"

Non-Patent Document 1: 三井一能 et al., "Development of the OSS Middleware Heartbeat to Improve Service Availability," NTT Technical Journal, March 2009, pp. 46-49

In the conventional technology described above, the high-availability cluster software can detect a failure on the active machine and automatically switch to the spare machine when one occurs, but this presupposes that no failure has occurred on the spare machine. For example, as shown in FIG. 2, even if at least one of an internal disk failure, a network failure, or a shared disk failure has occurred on the spare machine, there is no problem as long as the active machine is running the service normally. However, if some failure then occurs on the active machine and system switching to the spare machine is attempted, the switch cannot be performed. The active machine ends up in the state "SBY [in transition]", in which it waits for the stop processing running on the active machine to finish normally.

Also, as shown in FIG. 3, if the forced power-off function has failed on either the active machine or the spare machine, the forced power-off function of the cluster software can no longer be executed normally. This does not affect the service running on the active machine, and even if system switching occurs, the system can switch to the spare machine. However, if stopping the service fails during system switching, the forced power-off function is not executed, and the active machine ends up in the "SBY [in transition]" state in the same way as above.

The present invention has been made in view of the above points. Its object is to provide a cluster system recovery method, a server, and a program that can keep a server operating as the active machine of a cluster system from falling into a state in which system switching has been triggered by a failure or the like but has not completed because of a resource stop failure or the like.

To achieve the above object, the cluster system recovery method of the present invention (Claim 1) is a recovery method for a cluster system composed of an active machine, a spare machine, and a shared disk shared by the active machine and the spare machine. The active machine and the spare machine each include failure monitoring means for monitoring a failure state; state management means for managing, based on the failure state, a cluster state indicating the service operating state of the active machine and the spare machine; and state management information storage means for storing failure state information and cluster states including the service-running state (ACT) and the state in which transition to ACT is possible or it is unknown whether transition to ACT is possible (SBY: online). The method applies when the active machine is incorporated in the cluster configuration and the spare machine is in the state in which it is unknown whether it can transition to ACT (SBY: online), and comprises:
a failure state acquisition step in which state confirmation means of the active machine acquires failure state information from the state management information storage means via the state management means of the active machine;
a forced power-off function error output step in which, when the failure state information indicates a failure of the forced power-off function, an error is output to the maintenance terminal because a failure of the forced power-off function on the spare machine side is suspected; and
a failure count clearing step in which, when a continuity check is instructed from the maintenance terminal on the spare machine that is in the state in which it is unknown whether it can transition to ACT (SBY: online), continuity confirmation means of the spare machine checks continuity to the hardware control means on the active machine side and, when continuity is confirmed, the fault is treated as a failure of the forced power-off function on the spare machine side, the state management means of the spare machine clears the failure count of the forced power-off function monitoring resource in the state management information storage means, and the spare machine transitions to the state that allows system switching from the active machine (SBY: online).
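The failure count clearing step on the spare machine can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `StateStore`, `confirm_and_clear`, the resource key `forced_power_off_monitor`, and the `check_continuity` callback are all hypothetical names introduced here, and the continuity check itself is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class StateStore:
    """Hypothetical stand-in for the state management information storage means."""
    cluster_state: str = "SBY[online]"
    # Failure count per monitored resource (key names are assumptions).
    fail_counts: Dict[str, int] = field(default_factory=dict)


def confirm_and_clear(store: StateStore,
                      check_continuity: Callable[[], bool],
                      resource: str = "forced_power_off_monitor") -> bool:
    """Run on the spare machine when the maintenance terminal requests a check.

    check_continuity() should return True when the active machine's hardware
    control means answers. Returns True when the failure count was cleared.
    """
    if not check_continuity():
        # No path to the active machine's hardware control means:
        # leave the failure record untouched.
        return False
    # Continuity confirmed: clear the count and allow system switching again.
    store.fail_counts[resource] = 0
    store.cluster_state = "SBY[online]"
    return True
```

Passing the continuity check as a callback keeps the sketch independent of how the hardware control means is actually reached.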

Further, the present invention (Claim 2) further comprises a network error output step in which, when the failure state information acquired in the failure state acquisition step indicates a network failure, continuity from the continuity confirmation means on the spare machine side to the router is checked and, when the continuity check fails, an error is output to the maintenance terminal.

Further, the present invention (Claim 3) further comprises a disk error output step in which, when the failure state information acquired in the failure state acquisition step indicates a failure of the shared disk or the internal disk, an error is output to the maintenance terminal.
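The three error output steps (forced power-off function, network, disk) amount to a dispatch on the kind of failure recorded. The sketch below illustrates that dispatch only. The failure-kind labels, the `diagnose` function, the message strings, and the `router_reachable` callback are assumptions made for illustration and do not appear in the source.

```python
from typing import Callable, Iterable, List

# Hypothetical labels for the failure kinds named in the text.
FORCED_POWER_OFF = "forced_power_off"
NETWORK = "network"
SHARED_DISK = "shared_disk"
INTERNAL_DISK = "internal_disk"


def diagnose(failure_kinds: Iterable[str],
             router_reachable: Callable[[], bool]) -> List[str]:
    """Return the error messages that would go to the maintenance terminal."""
    kinds = set(failure_kinds)
    errors: List[str] = []
    if FORCED_POWER_OFF in kinds:
        # A spare-side forced power-off failure is suspected.
        errors.append("forced power-off function failure suspected on spare machine")
    if NETWORK in kinds and not router_reachable():
        # Only report when the continuity check to the router fails.
        errors.append("network failure: router unreachable")
    if kinds & {SHARED_DISK, INTERNAL_DISK}:
        errors.append("disk failure: shared or internal disk")
    return errors
```

A network failure record alone does not produce an error here. As in the text, the error is output only when the continuity check to the router also fails.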

The present invention (Claim 4) is a cluster system recovery system composed of an active machine, a spare machine, and a shared disk shared by the active machine and the spare machine. The active machine and the spare machine each include failure monitoring means for monitoring a failure state; state management means for managing, based on the failure state, a cluster state indicating the service operating state of the active machine and the spare machine; and state management information storage means for storing failure state information and cluster states including the service-running state (ACT) and the state in which transition to ACT is possible or it is unknown whether transition to ACT is possible (SBY: online). The system applies when the active machine is incorporated in the cluster configuration and the spare machine is in the state in which it is unknown whether it can transition to ACT (SBY: online).
The active machine has:
failure state acquisition means for acquiring failure state information from the state management information storage means via the state management means; and
forced power-off function error output means for outputting an error to the maintenance terminal when the failure state information indicates a failure of the forced power-off function, because a failure of the forced power-off function on the spare machine side is suspected.
The spare machine has:
continuity confirmation means for checking continuity to the hardware control means on the active machine side when a continuity check is instructed from the maintenance terminal while the spare machine is in the state in which it is unknown whether it can transition to ACT (SBY: online); and
failure count clearing means for, when the continuity confirmation means confirms continuity, treating the fault as a failure of the forced power-off function on the spare machine side, clearing the failure count of the forced power-off function monitoring resource in the state management information storage means, and causing a transition to the state that allows system switching from the active machine (SBY: online).

Further, the active machine of the present invention (Claim 5) further has:
router continuity means for checking continuity to the router when the failure state information acquired by the failure state acquisition means indicates a network failure; and
network error output means for outputting an error to the maintenance terminal when the continuity check by the router continuity means fails.

Further, the active machine of the present invention (Claim 6) further has disk error output means for outputting an error to the maintenance terminal when the failure state information acquired by the failure state acquisition means indicates a failure of the shared disk or the internal disk.

The present invention (Claim 7) is a server that operates as the active machine in a system composed of an active machine, a spare machine, and a shared disk shared by the active machine and the spare machine. The active machine and the spare machine each include failure monitoring means for monitoring a failure state; state management means for managing, based on the failure state, a cluster state indicating the service operating state of the active machine and the spare machine; and state management information storage means for storing failure state information and cluster states including the service-running state (ACT) and the state in which transition to ACT is possible or it is unknown whether transition to ACT is possible (SBY: online). The server operates as the active machine when the active machine is incorporated in the cluster configuration and the spare machine is in the state in which it is unknown whether it can transition to ACT (SBY: online), and has:
failure state acquisition means for acquiring failure state information via the state management means;
forced power-off function error output means for outputting an error to the maintenance terminal when the failure state information indicates a failure of the forced power-off function, because a failure of the forced power-off function on the spare machine side is suspected;
router continuity means for checking continuity to the router when the failure state information acquired by the failure state acquisition means indicates a network failure;
network error output means for outputting an error to the maintenance terminal when the continuity check by the router continuity means fails; and
disk error output means for outputting an error to the maintenance terminal when the failure state information acquired by the failure state acquisition means indicates a failure of the shared disk or the internal disk.

The present invention (Claim 8) is a server that operates as the spare machine in a system composed of an active machine, a spare machine, and a shared disk shared by the active machine and the spare machine. The active machine and the spare machine each include failure monitoring means for monitoring a failure state; state management means for managing, based on the failure state, a cluster state indicating the service operating state of the active machine and the spare machine; and state management information storage means for storing failure state information and cluster states including the service-running state (ACT) and the state in which transition to ACT is possible or it is unknown whether transition to ACT is possible (SBY: online). The server operates as the spare machine when the active machine is incorporated in the cluster configuration and the spare machine is in the state in which it is unknown whether it can transition to ACT (SBY: online), and has:
continuity confirmation means for checking continuity to the hardware control means on the active machine side when a continuity check is instructed from the maintenance terminal; and
failure count clearing means for, when the continuity confirmation means confirms continuity, treating the fault as a failure of the forced power-off function on the spare machine side, clearing the failure count of the forced power-off function monitoring resource in the state management information storage means, and causing a transition to the state that allows system switching from the active machine (SBY: online).

The present invention (Claim 9) is a program for causing a computer to function as each of the means constituting the server according to Claim 7 or 8.

As described above, according to the present invention, when some failure occurs while the active machine of the cluster system is incorporated in the cluster configuration and system switching to the spare machine is to be performed, the state management information is used to determine whether the fault is a network failure, a forced power-off function failure, or a failure of the shared disk or internal disk. In the case of a network failure or a shared disk failure, the spare machine is judged to be in a cluster state in which switching is impossible. In the case of a forced power-off function failure on the spare system, it is judged to be in a cluster state in which system switching is possible. Restoring the forced power-off function therefore improves fault tolerance.
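The judgment described above, whether the spare machine can still take over given the kind of failure recorded, can be written as a small classification function. This is a sketch only. The string labels and the function name `spare_can_take_over` are hypothetical, and kinds that the paragraph does not classify are rejected rather than guessed.

```python
def spare_can_take_over(failure_kind: str) -> bool:
    """Classify a recorded failure kind as switchable or not.

    Labels are assumptions for illustration:
    "network", "shared_disk", "forced_power_off".
    """
    if failure_kind in ("network", "shared_disk"):
        # The spare machine cannot take over the service.
        return False
    if failure_kind == "forced_power_off":
        # Only the fencing path is broken; switching itself is still possible.
        return True
    raise ValueError(f"unclassified failure kind: {failure_kind!r}")
```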

FIG. 1 is a schematic diagram of a cluster system using high-availability cluster software.
FIG. 2 is a diagram showing failure states of the spare machine.
FIG. 3 is a diagram showing a failure of the forced power-off function of the active machine or the spare machine.
FIG. 4 is a functional block diagram of the cluster system in one embodiment of the present invention.
FIG. 5 is a state transition diagram of the cluster states managed by the state management unit.
FIG. 6 is an example of the information stored in the state management information storage unit in one embodiment of the present invention.
FIG. 7 is a diagram showing the states from normal operation to completion of the recovery procedure in one embodiment of the present invention.
FIG. 8 is a flowchart of the failure detection procedure in one embodiment of the present invention.
FIG. 9 is a diagram showing the operation of step 102 of FIG. 8 in one example of the present invention.
FIG. 10 is a diagram showing the operation of step 103 of FIG. 8 in one example of the present invention.
FIG. 11 is a diagram showing the operation of step 104 of FIG. 8 in one example of the present invention.
FIG. 12 is a diagram showing the operation of step 105 of FIG. 8 in one example of the present invention.
FIG. 13 is a diagram showing the operation of step 106 of FIG. 8 in one example of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

Before the cluster system and method according to the embodiment of the present invention are described in detail, the terms used in the embodiment are explained first.

・Cluster configuration: a technique in which multiple servers are interconnected and made to behave, toward the users or other servers they serve, as if they were a single server as a whole. With a cluster configuration, the system as a whole can continue the service even if one server fails, and failed parts can be repaired or replaced while the service continues.

・Active machine: in a cluster system, the server that is running the service when service provision has started and no failure has occurred.

・Spare machine: in a cluster system, a server that takes over the service when the active machine fails. A spare machine may take over the service of one active machine or of several. That is, the relationship between active machines and the spare machine may be one-to-one or N-to-one.

・High-availability cluster software: software that provides the cluster configuration. It monitors servers for failures and performs system switching when a failure occurs.

・Resource: a component needed to provide a service. In a cluster configuration, a resource is an application that the high-availability cluster software starts, stops, monitors, and otherwise controls. Resources include databases and the like.

・Cluster state: the operating state of the service on a server. The cluster states are "ACT", "SBY [online]", "SBY [standby]", "SBY [in transition]", "OUS", and "NONE".

・Resource state: the operating state of a resource on a server. The resource states are: the resource is running on another server, the resource is running on the local server, the resource is stopped, and the resource is not managed.

・"ACT": the state in which the service is running on the server. In a cluster configuration, the state of the server on which the resources that provide a service, such as a database, are running.

・"SBY [online]": the state from which a transition to ACT is possible. In a cluster configuration, the state of a server to which resources can be switched over from the ACT server when system switching occurs due to a failure or the like.

・"SBY [standby]": the state in which a transition to "ACT" is suppressed. In a cluster configuration, the state of a server that is kept from transitioning to "ACT" even when system switching occurs due to a failure or the like.

・"SBY [in transition]": the state during system switching. In a cluster configuration, the state of a server on which system switching has been triggered by a failure or the like but has not completed because stopping a resource failed.

・"OUS": the state in which a resource on the server has failed. In a cluster configuration, the state in which a resource failure has occurred.

・"NONE": the state in which the server is not incorporated in the cluster configuration, for example when the high-availability cluster software is stopped.
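The six cluster states defined above map naturally onto an enumeration. The sketch below is illustrative only. The identifier names and the helper `can_take_over` are assumptions, and the helper encodes just the definition of "SBY [online]" given in the list.

```python
from enum import Enum


class ClusterState(Enum):
    """Cluster states as named in the glossary (member names are assumptions)."""
    ACT = "ACT"                                # service running on this server
    SBY_ONLINE = "SBY[online]"                 # can transition to ACT
    SBY_STANDBY = "SBY[standby]"               # transition to ACT suppressed
    SBY_IN_TRANSITION = "SBY[in transition]"   # switching started, not finished
    OUS = "OUS"                                # resource failure on this server
    NONE = "NONE"                              # not part of the cluster


def can_take_over(state: ClusterState) -> bool:
    """True only for the state from which a transition to ACT is possible."""
    return state is ClusterState.SBY_ONLINE
```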

<Configuration of the cluster system>
FIG. 4 shows a functional block diagram of the cluster system in one embodiment of the present invention. The cluster system has a plurality of interconnected servers (active machine 10 and spare machine 20) and a shared disk 30 used in common by these servers. The active machine 10 and the spare machine 20 provide services to clients via a router 40. Note that the performance of the active machine 10 may be superior to that of the spare machine 20. The cluster system may also be composed of two or more active machines and one spare machine.

The active machine 10 has a resource 101, high-availability cluster software 110, an execution control unit 120, a state management information storage unit 119, an operating system (OS) 151, a power supply control unit 153, and a power supply 155.

The high-availability cluster software 110 is the state management means and is composed of a failure monitoring unit 111, a resource start/stop unit 113, a state management unit 115, a forced power-off function unit 116, and a forced power-off function monitoring unit 117. Specifically, the present invention is assumed to use high-availability cluster software.

The control execution unit 120 is composed of a continuity confirmation unit 123, a cluster configuration activation unit 125, a system switching instruction unit 127, a state confirmation unit 131, and a command execution unit 133.

The spare machine 20 has a resource 201, high-availability cluster software 210, an execution control unit 220, a state management information storage unit 219, an operating system (OS) 251, a power supply control unit 253, and a power supply 255.

The high-availability cluster software 210 is the state management means and is composed of a failure monitoring unit 211, a resource start/stop unit 213, a state management unit 215, a forced power-off function unit 216, and a forced power-off function monitoring unit 217.

The control execution unit 220 is composed of a continuity confirmation unit 223, a cluster configuration activation unit 225, a system switching instruction unit 227, a state confirmation unit 231, and a command execution unit 233.

The resources 101 and 201 are applications that provide services to clients. They run on the server whose cluster state is service-running "ACT".

The failure monitoring units 111 and 211 of the high-availability cluster software 110 and 210 monitor the failure state of the servers. For example, they monitor resources, the network, and the shared and internal disks. Resources are monitored only on the server whose state is service-running "ACT", whereas the network and the shared and internal disks are monitored on both the active machine 10 and the spare machine 20. When a failure is detected on the active machine 10, the failure state is stored in the state management information storage unit 119 via the state management unit 115. For example, the failure count and an error status indicating when the failure occurred (resource start failure, failure during resource monitoring, or resource stop failure) are stored in the state management information storage unit 119 as the failure state. As described below, the failure state of the active machine 10 is also passed from the state management unit 115 to the state management unit 215 of the spare machine and stored in the state management information storage unit 219. Likewise, when a failure is detected on the spare machine 20, its failure state is stored in the state management information storage unit 219 via the state management unit 215 and, further, in the state management information storage unit 119 via the state management unit 115 of the active machine 10.
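The way a failure state is written locally and then mirrored to the peer's storage unit can be sketched as follows. All names here (`FailureRecord`, `StateManager`, `report_failure`, and the timing labels) are hypothetical, and the interconnect is reduced to a direct method call between two in-memory objects.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class FailureRecord:
    count: int = 0
    # When the failure occurred, e.g. "start_failed", "monitor_failed",
    # "stop_failed" (labels are assumptions).
    error_status: str = ""


class StateManager:
    """Hypothetical state management unit with its own storage unit."""

    def __init__(self, server: str) -> None:
        self.server = server
        # Keyed by (server name, resource name) so both machines' records fit.
        self.store: Dict[Tuple[str, str], FailureRecord] = {}
        self.peer: Optional["StateManager"] = None

    def _write(self, server: str, resource: str, timing: str) -> None:
        rec = self.store.setdefault((server, resource), FailureRecord())
        rec.count += 1
        rec.error_status = timing

    def report_failure(self, resource: str, timing: str) -> None:
        """Store a locally detected failure, then mirror it to the peer."""
        self._write(self.server, resource, timing)
        if self.peer is not None:
            self.peer._write(self.server, resource, timing)
```

After `report_failure` returns, both storage units hold the same record for the reporting server, which is what lets either machine read the other's failure state locally.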

リソース起動・停止部１１３及び２１３は、クラスタ状態及び故障状態に基づいてリソースを起動及び停止させる。サーバのクラスタ状態がＡＣＴへ遷移できる状態"SBY[online]"のときに他サーバのリソースが停止した場合、リソース起動・停止部１１３及び２１３は、リソースを起動させる。サーバのクラスタ状態がサービス稼働中"ACT"のときに故障が発生した場合、リソース起動・停止部１１３及び２１３は、リソースを停止させる。 The resource start / stop units 113 and 213 start and stop resources based on the cluster state and the failure state. When the resources of other servers are stopped when the cluster state of the server is in the state “SBY [online]” where the cluster state can be changed to ACT, the resource start / stop units 113 and 213 start the resources. If a failure occurs when the cluster state of the server is “ACT” during service operation, the resource start / stop units 113 and 213 stop the resource.

状態管理部１１５及び２１５は、故障状態に基づいてクラスタ状態を管理する。現用機１０の状態管理部１１５と予備機２０の状態管理部２１５は、互いに状態管理情報記憶部に格納された故障状態（故障回数、エラーステータス）、クラスタ状態等の情報を交換し、各サーバの情報を状態管理情報記憶部１１９及び２１９に格納する。 The state management units 115 and 215 manage the cluster state based on the failure state. The status management unit 115 of the active machine 10 and the status management unit 215 of the standby machine 20 exchange information such as the failure status (number of failures, error status) and cluster status stored in the status management information storage unit, and each server Is stored in the state management information storage units 119 and 219.

The forced power-off function units 116 and 216 monitor the hardware control board of the opposite machine (the spare machine for the active machine, and the active machine for the spare machine) via the maintenance management LAN. If this monitoring times out, the unit sets the failure count of the opposite machine in the state management information storage unit to "1" and its error status to "2".

The forced power-off function monitoring units 117 and 217 monitor the forced power-off function units 116 and 216 within their own machine. If the process of the forced power-off function unit fails, the monitoring unit sets the failure count in the state management information storage unit to "1" and the error status to "2".

FIG. 5 shows the state transition diagram of the cluster state managed by the state management units 115 and 215. The cluster state is one of "ACT", "SBY[online]", "SBY[standby]", "SBY[transitioning]", "OUS", and "NONE". When a resource failure occurs on the "ACT" server, the cluster state changes from "ACT" to "OUS" (T1). When a failure other than a resource failure (a network failure, shared disk failure, etc.) occurs on the "ACT" server, the cluster state changes from "ACT" to "SBY[transitioning]" (T2). When the failure state of the "OUS" server is cleared, the cluster state changes from "OUS" to "SBY[standby]" (T3). When system switching occurs because of a failure or the like and the "SBY[online]" server takes over the service, its cluster state changes from "SBY[online]" to "SBY[transitioning]" (T4) and then to "ACT" (T5). When service operation is inhibited on the "ACT" server so that another server can take over the service, the cluster state changes from "ACT" to "SBY[standby]" (T6). When the inhibition of transition to "ACT" is released on an "SBY[standby]" server, the cluster state changes from "SBY[standby]" to "SBY[online]" (T7). When transition to ACT is inhibited on an "SBY[online]" server, the cluster state changes from "SBY[online]" to "SBY[standby]" (T8). When the high-availability cluster software stops because the power supply, the operating system, or the software itself is stopped, the cluster state becomes "NONE" (T9 to T13). When the high-availability cluster software starts, the cluster state changes from "NONE" to "SBY[online]" (T14). If the software starts while the cluster states of both the active machine and the spare machine are "NONE", the cluster state changes from "NONE" to "ACT" (T15).
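As an illustration, the transitions of FIG. 5 can be written as a lookup table. The following Python sketch is not part of the described system: the event names (`resource_failure`, `takeover_started`, and so on), the English state label "SBY[transitioning]", and the function `next_state` are hypothetical, and the T7 and T8 entries assume that releasing the inhibition leads to "SBY[online]" and that the inhibition is applied from "SBY[online]".

```python
from enum import Enum


class ClusterState(Enum):
    ACT = "ACT"
    SBY_ONLINE = "SBY[online]"
    SBY_STANDBY = "SBY[standby]"
    SBY_TRANSITIONING = "SBY[transitioning]"
    OUS = "OUS"
    NONE = "NONE"


S = ClusterState

# (current state, hypothetical event name) -> next state
TRANSITIONS = {
    (S.ACT, "resource_failure"): S.OUS,                       # T1
    (S.ACT, "other_failure"): S.SBY_TRANSITIONING,            # T2
    (S.OUS, "failure_cleared"): S.SBY_STANDBY,                # T3
    (S.SBY_ONLINE, "takeover_started"): S.SBY_TRANSITIONING,  # T4
    (S.SBY_TRANSITIONING, "takeover_done"): S.ACT,            # T5
    (S.ACT, "service_inhibited"): S.SBY_STANDBY,              # T6
    (S.SBY_STANDBY, "inhibit_released"): S.SBY_ONLINE,        # T7 (assumed target)
    (S.SBY_ONLINE, "inhibit_applied"): S.SBY_STANDBY,         # T8 (assumed source)
}


def next_state(state, event, peer_state=None):
    """Return the next cluster state; unknown pairs leave the state unchanged."""
    if event == "software_stopped":
        # T9 to T13: stopping the cluster software always ends in NONE
        return S.NONE
    if state is S.NONE and event == "software_started":
        # T15 if the peer is also NONE, otherwise T14
        return S.ACT if peer_state is S.NONE else S.SBY_ONLINE
    return TRANSITIONS.get((state, event), state)
```

Keeping the table separate from the start and stop rules makes the two special cases (any state to NONE, and the peer-dependent start) easy to see.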

The state management information storage units 119 and 219 store the cluster state and failure state of each server. Specifically, each of the storage units 119 and 219 holds the information of both the active machine 10 and the spare machine 20, and the exchange of information between the state management units 115 and 215 keeps the contents of the two storage units identical. An identifier that uniquely identifies a server, such as a server ID, is used when referring to the state management information storage units 119 and 219.

FIG. 6 shows an example of the information stored in the state management information storage units 119 and 219. For each server, they store the cluster state, failure count, error status, resource state, and interface attribute value. The cluster state is one of "ACT", "SBY[online]", "SBY[standby]", "SBY[transitioning]", "OUS", and "NONE". The failure count is the number of failures that have occurred (a value from 0 to N). The error status, which indicates when the failure occurred, is one of: no error, resource start failed, failure detected during resource monitoring, and resource stop failed. The resource state is one of: resource running on another server, resource running on the own server, resource stopped, and resource not managed. The interface attribute value is "0" when there is no error, "1" when there is a PING error and "Link is failure" is displayed, and "2" when there is a disk error and "Disk is failure" is displayed.
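The per-server record of FIG. 6 could be modeled as follows. This is a minimal sketch under stated assumptions: the class name `ServerRecord`, the field names, and the server IDs `active-10` and `spare-20` are hypothetical, and only the numeric codes that appear in this description (interface attribute 0/1/2, error status 0 and 2, resource state 0 and 1) are given names.

```python
from dataclasses import dataclass, asdict

# Numeric codes that appear in the description
IF_NO_ERROR, IF_LINK_FAILURE, IF_DISK_FAILURE = 0, 1, 2
ERR_NONE, ERR_MONITOR_FAILURE = 0, 2      # codes for start/stop failure are not stated
RES_ON_OTHER_SERVER, RES_STARTED = 0, 1   # codes for "stopped"/"not managed" are not stated


@dataclass
class ServerRecord:
    cluster_state: str = "NONE"
    failure_count: int = 0
    error_status: int = ERR_NONE
    resource_state: int = RES_ON_OTHER_SERVER
    interface_attr: int = IF_NO_ERROR


# Hypothetical server IDs used as lookup keys
store_119 = {
    "active-10": ServerRecord("ACT", resource_state=RES_STARTED),
    "spare-20": ServerRecord("SBY[online]", interface_attr=IF_LINK_FAILURE),
}
# Unit 219 holds the same content once the state management units have exchanged data.
store_219 = {sid: ServerRecord(**asdict(rec)) for sid, rec in store_119.items()}
```

Because both stores are keyed by server ID, either machine can look up its own record or that of the opposite machine in the same way.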

When the state confirmation unit 131 estimates that the failure is a network failure, the continuity confirmation unit 123 checks continuity to the router 40. If the check succeeds, the failure is considered temporary, caused by a momentary network interruption. When the state confirmation unit 131 estimates that the failure is in the power control unit (hardware control board) 253, the continuity confirmation unit 123 checks continuity to the power control unit (hardware control board) 253 of the other server (spare machine 20). PING may be used for these continuity checks.

When the state confirmation unit 231 estimates that the failure is a failure of the forced power-off function, the continuity confirmation unit 223 checks continuity to the power control unit 153 of the active machine 10. PING may be used for this check. If the check succeeds, the failure is considered temporary, caused by a momentary network interruption.
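A continuity check of this kind might look like the sketch below. It assumes a Linux `ping` that accepts `-c` (packet count) and `-W` (reply timeout in seconds); the function name `is_reachable` and the injectable `runner` parameter are hypothetical and exist only so the logic can be exercised without a network.

```python
import subprocess


def is_reachable(host, timeout_s=2, runner=subprocess.run):
    """Send one echo request and report whether a reply arrived."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = runner(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # ping exits with status 0 only when at least one reply was received
    return result.returncode == 0
```

The same function would serve for the router 40 and for the power control unit of the opposite machine; only the target address differs.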

The cluster configuration activation unit 125 incorporates the active machine 10 into the cluster configuration and moves the cluster state of the active machine stored in the state management information storage unit 119 to the state from which it can transition to service operation. Specifically, the cluster configuration activation unit 125 starts the high-availability cluster software 110 of the active machine 10. For example, the cluster configuration activation unit 125 may start the state management unit 115, which in turn starts the failure monitoring unit 111 and the resource start/stop unit 113. As a result, the cluster state of the active machine 10 stored in the state management information storage unit 219 also changes via the state management unit 215, and the cluster state of the active machine 10 becomes "SBY[online]". The same applies to the cluster configuration activation unit 225 of the spare machine 20.

The system switching instruction unit 127 instructs, via the state management unit 115, system switching from the spare machine 20 to the active machine 10. Specifically, it moves the cluster state of the spare machine 20 stored in the state management information storage unit 119 to the state in which transition to service operation is inhibited. This inhibition causes the cluster state of the active machine 10 to transition to service operation. The cluster states of the spare machine 20 and the active machine 10 stored in the state management information storage unit 219 change accordingly via the state management unit 215: the spare machine 20 becomes "SBY[standby]" and the active machine 10 becomes "ACT". The resource 201 of the spare machine 20 then stops and the resource 101 of the active machine 10 starts. The same applies to the system switching instruction unit 227 of the spare machine 20.

The state confirmation unit 131 checks, via the state management unit 115, the information stored in the state management information storage unit 119, for example the cluster state, failure count, error status, and resource state of both the active machine 10 and the spare machine 20. The same applies to the state confirmation unit 231 of the spare machine 20.

The operating systems 151 and 251 are the basic software on which the high-availability cluster software 110 and 210, applications, and other programs run on the servers.

In the present embodiment, the power control unit 153 receives an instruction to forcibly cut the power from the forced power-off function unit 216 of the other server (spare machine 20) and switches on and off the power supply 155 that feeds the server. The same applies to the power control unit 253 of the spare machine.

First, the failure states of the active machine 10 and the spare machine 20 are described.

FIG. 7 shows the states from normal operation to the end of the recovery procedure in one embodiment of the present invention.

(1) As shown in FIG. 7(a), the active machine 10 is providing the service because its cluster state is "ACT", and the spare machine 20 is in the "SBY[online]" state, from which it can transition to "ACT".

(2) From the above state, as shown in FIG. 2, a failure of the forced power-off function unit 216 or an interface failure leaves the spare machine 20 in a state in which it is unknown whether it can transition to "ACT" (FIG. 7(b)). Because the resource 101 is still running on the active machine 10, no action has been triggered. At this point the states of the active machine 10 and the spare machine 20 are as follows.
(Active machine)
Cluster state: "ACT"
Failure count: 0
Error status: 0
Resource state: 1 (Started)
I/F attribute value: 0
(Spare machine)
Cluster state: "SBY[online]"
Failure count: 0
Error status: 0
Resource state: 0
I/F attribute value: 1 or 2
On the spare machine 20, the interface (I/F) attribute value in the state management information storage unit 219 is "1" or "2", so the failure can be inferred to be either a network failure or a disk failure. If system switching from the active machine 10 to the spare machine 20 is triggered in this situation, it cannot be carried out because the I/F attribute value is not "0". The active machine 10 then ends up in the "SBY[transitioning]" cluster state, that is, the state of a server on which system switching was triggered by a failure or the like but has not completed because stopping the resource failed.

The case shown in FIG. 3 is another possible instance of the state in FIG. 7(b).

FIG. 3 illustrates the case in which a failure has occurred in the forced power-off function unit of either the active machine 10 or the spare machine 20.

In this case the states of the active machine 10 and the spare machine 20 are as follows.
(Active machine)
Cluster state: "ACT"
Failure count: 0
Error status: 0
Resource state: 1 (Started)
I/F attribute value: 0
(Spare machine)
Cluster state: "SBY[online]"
Failure count: 1
Error status: 2
Resource state: 0
I/F attribute value: 0
Here the error status of the spare machine 20 is "2" (error detected during resource monitoring). The forced power-off function unit 216 can no longer run normally, but service operation is unaffected, and switching to the spare machine 20 is still possible if system switching from the active machine 10 to the spare machine 20 is triggered. However, if stopping the service fails during system switching, the forced power-off function unit 216 is not executed and, as in the case of FIG. 2, the active machine 10 ends up in the "SBY[transitioning]" cluster state.
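The two cases (FIG. 2 and FIG. 3) can be condensed into one predicate over the spare machine's record. The sketch below is an interpretation, not the patented logic itself: the function name and the `resource_stop_ok` flag are hypothetical, and it assumes that a failed resource stop does not block switching as long as the forced power-off function is available.

```python
def switchover_completes(interface_attr, failure_count, error_status, resource_stop_ok):
    """Predict whether switching from the active to the spare machine completes."""
    if interface_attr != 0:
        # Link (1) or disk (2) failure on the spare machine: switching cannot run.
        return False
    forced_power_off_failed = failure_count == 1 and error_status == 2
    if not resource_stop_ok and forced_power_off_failed:
        # Nothing can force the active machine down, so it stays in SBY[transitioning].
        return False
    return True
```

For example, `switchover_completes(0, 1, 2, True)` corresponds to the FIG. 3 case in which the resource stops cleanly and switching still succeeds.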

In the state described above (FIG. 7(b)), the present invention performs failure recovery on the spare machine 20 only, so that, as shown in FIG. 7(c), the spare machine 20 is returned to "SBY[online]", the state from which it can transition to "ACT" when system switching from the active machine 10 occurs.

FIG. 8 is a flowchart of the failure detection procedure in one embodiment of the present invention.

Step 101) The maintenance terminal logs in to the active machine 10 and the spare machine 20. If the login to the spare machine 20 succeeds, the procedure moves to step 102; if it fails, the procedure moves to step 103. The states of the active machine 10 and the spare machine 20 at this point are as follows.

(Active machine)
Cluster state: "ACT"
Failure count: 0
Error status: 0
Resource state: 1 (Started)
I/F attribute value: 0
(Spare machine)
Cluster state: "SBY[online]"
Failure count: 0 or 1
Error status: 0 or 2 (no error, or error detected during resource monitoring)
Resource state: 0 (resource running on another server)
I/F attribute value: 1 or 2 (link error or disk error)

Step 102) The state confirmation unit 131 of the active machine 10 has the high-availability cluster software 110 execute a state confirmation command and thereby obtains the state management information from the state management information storage unit 119. If the obtained information shows a failure count of "1" and an error status of "2", either the monitoring by the forced power-off function unit 216 of the spare machine 20 has timed out or the process of that unit has failed, so a forced power-off function error is output to the maintenance terminal and the procedure moves to step 105. Otherwise (failure count "0" and error status "0"), the cause lies elsewhere and the procedure moves to step 103.

Step 103) The state confirmation unit 131 of the active machine 10 has the high-availability cluster software 110 execute a state confirmation command and thereby obtains the state management information from the state management information storage unit 119. The failure location is estimated from the I/F attribute value in the obtained information. If the I/F attribute value is "1" (Link is failure), communication between the spare machine 20 and the router 40 (the network) is presumed to be cut off, and the procedure moves to step 104. If the I/F attribute value is "2" (Disk is failure), a disk failure is presumed, and the procedure moves to step 107.

Step 104) To improve the accuracy of the failure estimate, the continuity confirmation unit 223 of the spare machine 20 checks continuity to the router 40 with the PING command. If there is no continuity, the procedure moves to step 107; if there is, it moves to step 106.

Step 105) If step 102 determined that the forced power-off function unit 216 of the spare machine 20 has failed, the error from the active machine 10 is displayed on the maintenance terminal. The maintenance operator therefore logs in to the spare machine 20 and instructs it to check continuity to the power control unit 153 of the active machine 10. To improve the accuracy of the failure estimate, the continuity confirmation unit 223 of the spare machine 20 then executes the PING command against the power control unit 153 of the active machine 10. If there is continuity, the cause is presumed to be a temporary failure of the process of the forced power-off function unit 216 on the spare machine side rather than a failure of the hardware control board 160, and the procedure moves to step 106. If there is no continuity, the procedure moves to step 107.

Step 106) If continuity to the router 40 was confirmed in step 104, or continuity to the power control unit 153 of the active machine 10 was confirmed in step 105, the state management unit 115 of the active machine 10 and the state management unit 215 of the spare machine 20 clear to 0 the failure count of the forced power-off function monitoring resource in the state management information storage unit 119, and the procedure moves to step 108. After this, the forced power-off function can be used even if stopping a resource fails during system switching caused by a failure or the like.

Step 107) An error is output to the maintenance terminal.

Step 108) The maintenance terminal logs out.
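The branching of steps 101 to 108 can be traced with a small function that returns the sequence of step numbers visited. This is a sketch for illustration only: the function name and parameters are hypothetical, the two reachability flags stand in for the PING results of steps 104 and 105, and it assumes that step 107 is followed by the logout of step 108 and that any I/F attribute value other than "1" reached in step 103 is reported as an error.

```python
def detection_path(spare_login_ok, failure_count, error_status, interface_attr,
                   router_reachable, power_controller_reachable):
    """Return the list of flowchart steps visited for the given observations."""
    path = [101]
    if spare_login_ok:
        path.append(102)
        if failure_count == 1 and error_status == 2:
            # Forced power-off function suspected: check the active machine's
            # power control unit from the spare machine.
            path.append(105)
            path.append(106 if power_controller_reachable else 107)
            path.append(108)
            return path
    path.append(103)
    if interface_attr == 1:
        # Link failure: check continuity to the router.
        path.append(104)
        path.append(106 if router_reachable else 107)
    else:
        # Disk failure (2) or an unclassified value: report an error.
        path.append(107)
    path.append(108)
    return path
```

A path that contains 106 means the failure count was cleared; a path that contains 107 means the maintenance terminal received an error instead.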

Through the above processing, as shown in FIG. 7(c), the state management unit 215 of the spare machine 20 sets the state in the state management information storage unit 219 of the spare machine 20 to the state from which transition to "ACT" is possible, so that the spare machine 20 is ready for system switching by the system switching instruction unit 127 of the active machine 10.

The following description follows the flowchart of FIG. 8. The failure monitoring unit 111, resource start/stop unit 113, and state management unit 115 of the active machine 10 shown in FIG. 4 are referred to collectively as the high-availability cluster software 110, and the failure monitoring unit 211, resource start/stop unit 213, and state management unit 215 of the spare machine 20 as the high-availability cluster software 210.

Step 101) The maintenance terminal 50 logs in to the active machine 10 and the spare machine 20. If both logins succeed, the procedure moves to step 102; if the login to the spare machine 20 fails, the procedure moves to step 103.

Step 102) As shown in FIG. 9, the state confirmation unit 131 of the control execution unit 120 has the high-availability cluster software 110 execute a state confirmation command to request the failure count and error status of the spare machine 20.

The high-availability cluster software 110 then refers to the state management information storage unit 119. If the spare machine 20 shows
・failure count "1";
・error status "2";
a failure of the forced power-off function is suspected, and the procedure moves to step 105. If instead the spare machine 20 shows
・failure count "0";
・error status "0";
the failure has a different cause, and the procedure moves to step 103.

Step 103) As shown in FIG. 10, the state confirmation unit 131 of the control execution unit 120 has the high-availability cluster software 110 execute a state confirmation command to request the I/F attribute value of the spare machine 20 from the state management information storage unit 119. If the I/F attribute value of the spare machine 20 is "1" (link error), a communication error can be inferred, and the procedure moves to step 104. If it is "2" (disk error), a disk failure can be inferred, so an error is output to the maintenance terminal 50 and the terminal logs out.

Step 104) As shown in FIG. 11, the continuity confirmation unit 123 of the control execution unit 120 executes PING, a function of the OS, to check continuity to the router 40. If there is no continuity, a cable or router failure is suspected, so an error is output to the maintenance terminal 50 and the terminal logs out.

Step 105) If step 102 inferred a failure of the forced power-off function, the maintenance terminal 50 logs in to the spare machine 20 and, as shown in FIG. 12, the continuity confirmation unit 223 of the control execution unit 220 of the spare machine 20 executes PING, a function of the OS, to check continuity to the power control unit 153 of the active machine 10. If the PING from the spare machine 20 succeeds, the power control unit 153 of the active machine 10 is working normally, so a failure of the forced power-off function unit 216 in the high-availability cluster software 210 of the spare machine 20 is inferred.

Step 106) As shown in FIG. 13, the command execution unit 233 of the control execution unit 220 of the spare machine 20 has the high-availability cluster software 210 execute a failure count clear command. Executing the clear command updates the failure count "1" and error status "2" in the state management information storage units 119 and 219, which are managed by the high-availability cluster software 110 and 210, to failure count "0" and error status "0".
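The effect of the clear command on the two storage units can be pictured as follows. The sketch uses plain dictionaries in place of the storage units; the function name `clear_failure_count` and the server ID `spare-20` are hypothetical and do not correspond to any actual command of the cluster software.

```python
def clear_failure_count(stores, server_id):
    """Reset the failure count and error status of one server in every store."""
    for store in stores:
        record = store[server_id]
        record["failure_count"] = 0
        record["error_status"] = 0


# Both storage units hold the spare machine's failure before the command runs.
store_119 = {"spare-20": {"failure_count": 1, "error_status": 2}}
store_219 = {"spare-20": {"failure_count": 1, "error_status": 2}}
clear_failure_count([store_119, store_219], "spare-20")
```

Updating both stores in the same call mirrors the requirement that units 119 and 219 hold identical content.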

Once both the failure count and the error status are "0" as described above, the forced power-off function of the high-availability cluster software can be used even if stopping a resource fails during system switching from the active machine 10 to the spare machine 20. This prevents the active machine 10 from ending up in "SBY[transitioning]" (the state in which system switching was triggered by a failure or the like but has not completed because stopping the resource failed).

For convenience of explanation, the system according to the embodiment of the present invention has been described using functional block diagrams, but the system may be realized in hardware, software, or a combination of the two. For example, each functional unit of the servers (active machine and spare machine) may be implemented in software and installed on the operating system. The functional units may also be combined as needed.

The embodiments and examples of the present invention have been described above, but the present invention is not limited to them, and various modifications and applications are possible within the scope of the claims.

10 Server (active machine)
20 Server (spare machine)
30 Shared disk
40 Router
50 Maintenance terminal
101, 201 Resource
110, 210 High-availability cluster software
111, 211 Failure monitoring unit
113, 213 Resource start/stop unit
115, 215 State management unit
116, 216 Forced power-off function unit
117, 217 Forced power-off monitoring unit
119, 219 State management information storage unit
120, 220 Control execution unit
123, 223 Continuity confirmation unit
125, 225 Cluster configuration activation unit
127, 227 System switching instruction unit
131, 231 State confirmation unit
133, 233 Command execution unit
151, 251 OS (operating system)
153, 253 Power control unit
155, 255 Power supply

Claims

A cluster system recovery method in a cluster system comprising an active machine and a spare machine, each having failure monitoring means for monitoring a failure state, state management means for managing, based on the failure state, a cluster state indicating the service operating status of the active machine and the spare machine, and state management information storage means for storing failure state information and a cluster state that includes a state in which the service is operating (ACT) and a state in which transition to ACT is possible or in which it is unknown whether transition to ACT is possible (SBY: online), and a shared disk shared by the active machine and the spare machine, wherein the active machine is incorporated in the cluster configuration and the spare machine is in the state in which it is unknown whether transition to ACT is possible (SBY: online), the method comprising:
a failure state acquisition step in which state confirmation means of the active machine acquires the failure state information from the state management information storage means via the state management means of the active machine;
a forced power-off function error output step of outputting an error to a maintenance terminal when the failure state information indicates a failure of the forced power-off function, because a failure of the forced power-off function on the spare machine side is suspected; and
a step in which, when continuity confirmation is instructed from the maintenance terminal to the spare machine in the state in which it is unknown whether transition to ACT is possible (SBY: online), continuity confirmation means of the spare machine checks continuity to hardware control means on the active machine side and, if continuity is confirmed, the state management means of the spare machine treats the failure as a failure of the forced power-off function on the spare machine side and clears the failure count of the forced power-off function monitoring resource in the state management information storage means, thereby moving the spare machine to a state (SBY: online) in which system switching from the active machine is possible.

The cluster system recovery method according to claim 1, further comprising a network error output step of checking, when the failure state information acquired in the failure state acquisition step indicates a network failure, continuity from the continuity confirmation means on the spare machine side to a router, and outputting an error to the maintenance terminal when the continuity check fails.

The cluster system recovery method according to claim 1, further comprising a disk error output step of outputting an error to the maintenance terminal when the failure state information acquired in the failure state acquisition step indicates a failure of the shared disk or a built-in disk.

A cluster system recovery system composed of an active machine and a spare machine, each having fault monitoring means for monitoring a failure state, state management means for managing, based on the failure state, a cluster state indicating the service operating state of the active machine and the spare machine, and state management information storage means for storing failure state information and a cluster state that includes a state in which the service is running (ACT), a state from which a transition to ACT is possible, or a state in which it is unknown whether a transition to ACT is possible (SBY: online), and of a shared disk shared by the active machine and the spare machine, wherein the active machine is incorporated into the cluster configuration and the spare machine is in the state in which it is unknown whether a transition to ACT is possible (SBY: online),
the active machine comprising:
failure state acquisition means for acquiring the failure state information from the state management information storage means via the state management means; and
forced power-off function error output means for outputting an error to a maintenance terminal when the failure state information indicates a failure of the forced power-off function, because a failure of the forced power-off function on the spare machine side is then suspected;
and the spare machine comprising:
continuity confirmation means for checking continuity to hardware control means on the active machine side when a continuity check is instructed from the maintenance terminal while the spare machine is in the state in which it is unknown whether a transition to ACT is possible (SBY: online); and
failure count clearing means for, when continuity is confirmed by the continuity confirmation means, treating the condition as a failure of the forced power-off function on the spare machine side, clearing the failure count of the forced power-off function monitoring resource in the state management information storage means, and transitioning to a state (SBY: online) in which system switching from the active machine is possible.

The cluster system recovery system according to claim 4, wherein the active machine further comprises:
router continuity means for checking continuity to a router when the failure state information acquired by the failure state acquisition means indicates a network failure; and
network error output means for outputting an error to the maintenance terminal when the continuity check by the router continuity means fails.

The cluster system recovery system according to claim 4, wherein the active machine further comprises:
disk error output means for outputting an error to the maintenance terminal when the failure state information acquired by the failure state acquisition means indicates a failure of the shared disk or a built-in disk.

A server that operates as the active machine in a cluster system composed of an active machine and a spare machine, each having fault monitoring means for monitoring a failure state, state management means for managing, based on the failure state, a cluster state indicating the service operating state of the active machine and the spare machine, and state management information storage means for storing failure state information and a cluster state that includes a state in which the service is running (ACT), a state from which a transition to ACT is possible, or a state in which it is unknown whether a transition to ACT is possible (SBY: online), and of a shared disk shared by the active machine and the spare machine, the server operating while the active machine is incorporated into the cluster configuration and the spare machine is in the state in which it is unknown whether a transition to ACT is possible (SBY: online), the server comprising:
failure state acquisition means for acquiring the failure state information via the state management means;
forced power-off function error output means for outputting an error to a maintenance terminal when the failure state information indicates a failure of the forced power-off function, because a failure of the forced power-off function on the spare machine side is then suspected;
router continuity means for checking continuity to a router when the failure state information acquired by the failure state acquisition means indicates a network failure;
network error output means for outputting an error to the maintenance terminal when the continuity check by the router continuity means fails; and
disk error output means for outputting an error to the maintenance terminal when the failure state information acquired by the failure state acquisition means indicates a failure of the shared disk or a built-in disk.
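
For illustration only, and not as part of the claims, the active-machine diagnosis recited above can be pictured as a dispatch on the failure category. The following Python sketch makes that flow concrete under stated assumptions: the names `FailureKind`, `diagnose`, and the `router_reachable` callback are hypothetical, and the error strings are placeholders rather than messages defined by the specification.

```python
from enum import Enum, auto
from typing import Callable, List


class FailureKind(Enum):
    """Hypothetical categories for the failure state information."""
    FORCED_POWER_OFF = auto()   # forced power-off function failure
    NETWORK = auto()            # network failure
    DISK = auto()               # shared or built-in disk failure


def diagnose(failures: List[FailureKind],
             router_reachable: Callable[[], bool]) -> List[str]:
    """Return the errors that would be reported to the maintenance terminal."""
    errors: List[str] = []
    for kind in failures:
        if kind is FailureKind.FORCED_POWER_OFF:
            # Reported unconditionally: the spare side's function is suspected.
            errors.append("forced power-off function failure suspected on spare machine")
        elif kind is FailureKind.NETWORK:
            # Reported only when the continuity check to the router fails.
            if not router_reachable():
                errors.append("network failure: no continuity to router")
        elif kind is FailureKind.DISK:
            errors.append("shared or built-in disk failure")
    return errors
```

Passing the router check as a callback keeps the sketch neutral about how continuity is actually tested, which the claim does not specify.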

A server that operates as the spare machine in a cluster system composed of an active machine and a spare machine, each having fault monitoring means for monitoring a failure state, state management means for managing, based on the failure state, a cluster state indicating the service operating state of the active machine and the spare machine, and state management information storage means for storing failure state information and a cluster state that includes a state in which the service is running (ACT), a state from which a transition to ACT is possible, or a state in which it is unknown whether a transition to ACT is possible (SBY: online), and of a shared disk shared by the active machine and the spare machine, the server operating while the active machine is incorporated into the cluster configuration and the spare machine is in the state in which it is unknown whether a transition to ACT is possible (SBY: online), the server comprising:
continuity confirmation means for checking continuity to hardware control means on the active machine side when a continuity check is instructed from a maintenance terminal; and
failure count clearing means for, when continuity is confirmed by the continuity confirmation means, treating the condition as a failure of the forced power-off function on the spare machine side, clearing the failure count of the forced power-off function monitoring resource in the state management information storage means, and transitioning to a state (SBY: online) in which system switching from the active machine is possible.
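
For illustration only, and not as part of the claims, the spare-machine recovery recited above amounts to a guarded state update: the failure count is cleared and the state changes only after the continuity check succeeds. The Python sketch below assumes hypothetical names throughout (`SpareState`, `StateStore`, `recover_spare`, the `hw_control_reachable` callback, and the resource key `forced_power_off_monitor`); the two enum members stand in for the two states that the translated claim both labels "SBY: online".

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Dict


class SpareState(Enum):
    """Hypothetical labels for the two spare-machine states in the claim."""
    UNKNOWN = auto()      # unknown whether a transition to ACT is possible
    SWITCHABLE = auto()   # system switching from the active machine is possible


@dataclass
class StateStore:
    """Stand-in for the state management information storage means."""
    state: SpareState = SpareState.UNKNOWN
    fail_counts: Dict[str, int] = field(default_factory=dict)


def recover_spare(store: StateStore,
                  hw_control_reachable: Callable[[], bool],
                  resource: str = "forced_power_off_monitor") -> bool:
    """Run the continuity check requested from the maintenance terminal."""
    if not hw_control_reachable():
        # No continuity: leave the failure count and state untouched.
        return False
    store.fail_counts[resource] = 0       # clear the failure count
    store.state = SpareState.SWITCHABLE   # allow system switching again
    return True
```

A caller would invoke `recover_spare` once per continuity-check instruction and treat a `False` result as "still not recoverable".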

A program for causing a computer to function as each of the means constituting the server according to claim 7 or 8.