JP6291326B2

JP6291326B2 - Redundant system and alarm management method

Info

Publication number: JP6291326B2
Application number: JP2014079339A
Authority: JP
Inventors: 見善竹中
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-04-08
Filing date: 2014-04-08
Publication date: 2018-03-14
Anticipated expiration: 2034-04-08
Also published as: JP2015201031A

Description

The present invention relates to the management of alarms such as failure alarms, and is applicable to a redundant system and an alarm management method.

In general, when a service provided by a system must continue even if communication equipment suffers a serious breakdown, a redundant system composed of an active system and a standby system is adopted. During normal operation the active device runs and the standby device waits. When the active device undergoes maintenance work or breaks down, the standby device takes over as the active device.

For example, Patent Document 1 discloses a technique for easily and reliably performing non-stop maintenance by switching from an active system to a standby system without increasing system scale or system cost. Patent Document 2 discloses a technique for switching from an active server to a standby server in a short time, and Patent Document 3 discloses a technique for preventing an external device from erroneously judging that the line has been disconnected when switching to the standby system.

Patent Document 4 discloses a technique for switching to the standby system before a fatal failure, such as the active system going down, occurs. Patent Documents 5, 6, and 7 disclose techniques for transferring the data needed to continue services that require real-time performance from the active server to the standby server and synchronizing it. Techniques have also been proposed for switching from the active server to the standby server without stopping application processes, by synchronizing data in shared memory.

Although various techniques for redundant systems have thus been proposed, none of them addresses alarm information such as failure alarms. No technique for alarm management in a redundant system has yet been disclosed.

Patent Document 1: JP 2009-118063 A
Patent Document 2: JP 2009-98715 A
Patent Document 3: JP 2009-225136 A
Patent Document 4: JP 2008-226153 A
Patent Document 5: JP 2009-118063 A
Patent Document 6: JP 2006-285448 A
Patent Document 7: JP 2000-22695 A

The alarm management procedure in a system configuration that is not redundant is briefly described with reference to FIGS. 16(a) and 16(b). The system consists of a user terminal 900, communication equipment 910, and a monitoring server 920. The user terminal 900 and the communication equipment 910 are connected via a network (not shown), as are the communication equipment 910 and the monitoring server 920. The communication equipment 910 is, for example, a server that performs subscriber control, and is referred to here as device A. The monitoring server 920 includes an alarm correspondence table 9201 and an alarm management table 9200.

The alarm correspondence table 9201 associates each failure alarm number with its recovery alarm number. When a recovery alarm is received, the table is used to determine which failure alarm the recovery alarm refers to. That is, the table is looked up with the recovery alarm number carried in the recovery alarm as the key, the failure alarm to be canceled is extracted, and that failure alarm is canceled. For example, if failure alarm number 001 and recovery alarm number 002 are associated in the alarm correspondence table 9201, then on receiving recovery alarm number 002 the table shows that failure alarm 001 has been recovered.
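The lookup described above can be sketched as follows. This is only an illustration: the dictionary layout and the function name `failure_for_recovery` are assumptions, and only the numbers 001 and 002 come from the example in the text.

```python
from typing import Dict, Optional

# Alarm correspondence table 9201: failure alarm number -> recovery alarm number.
# Only the 001/002 pair appears in the text; the dict layout is an assumption.
ALARM_CORRESPONDENCE: Dict[str, str] = {"001": "002"}

# Reverse index so that a received recovery alarm number can serve as the key.
RECOVERY_TO_FAILURE: Dict[str, str] = {
    recovery: failure for failure, recovery in ALARM_CORRESPONDENCE.items()
}


def failure_for_recovery(recovery_number: str) -> Optional[str]:
    """Return the failure alarm number that this recovery alarm clears, if any."""
    return RECOVERY_TO_FAILURE.get(recovery_number)
```

An unknown recovery number simply yields `None`, which leaves it to the caller to decide how to treat an unpaired recovery alarm.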

The alarm management table 9200 associates each received failure alarm with a device name. When failure alarm 001 is received from device A, device A and failure alarm 001 are stored together. The alarm management procedure is described in order below. First, the case where a failure occurs is described with reference to FIG. 16(a).

(1) Assume, for example, that registration of the user information of user 11 from the user terminal 900 fails. The parenthesized numbers correspond to those in FIGS. 16 and 17. (2) Device A issues a failure alarm to the monitoring server 920. The failure alarm carries a failure alarm number that identifies the failure (for example, 001) and the name of the device that issued it (for example, device A). (3) Device A then holds the failure alarm number. (4) On receiving the failure alarm, the monitoring server 920 registers the failure alarm number and the issuing device name in the alarm management table 9200.

The recovery case is described with reference to FIG. 16(b). (5) For example, the user retries and the user registration succeeds. (6) Device A then sends a recovery alarm to the monitoring server 920. The recovery alarm carries a recovery alarm number that identifies the recovered failure (for example, 002) and the name of the device that issued the recovery alarm.

(7) On receiving the recovery alarm, the monitoring server 920 uses its alarm number as the key into the alarm correspondence table 9201 and determines that the recovery corresponds to failure alarm number (001). It then determines from the alarm management table 9200 that the recovery applies to the failure alarm of device A, and cancels the alarm by deleting the failure alarm (001) of device A from the alarm management table 9200.
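Steps (4) and (7) on the monitoring server side can be sketched as below. The class name, method names, and the use of a set for the alarm management table are assumptions made for illustration; the patent describes only the tables and the behavior.

```python
from typing import Dict, Set, Tuple

# Recovery alarm number -> failure alarm number (from correspondence table 9201).
RECOVERY_TO_FAILURE: Dict[str, str] = {"002": "001"}


class MonitoringServer:
    """Hypothetical model of monitoring server 920 in the non-redundant case."""

    def __init__(self) -> None:
        # Alarm management table 9200: (device name, failure alarm number) pairs.
        self.management_table: Set[Tuple[str, str]] = set()

    def on_failure_alarm(self, device: str, alarm_number: str) -> None:
        # Step (4): register the failure alarm against the issuing device.
        self.management_table.add((device, alarm_number))

    def on_recovery_alarm(self, device: str, recovery_number: str) -> bool:
        # Step (7): map the recovery number to its failure number, then delete
        # the entry only if it is registered for the same device name.
        failure = RECOVERY_TO_FAILURE.get(recovery_number)
        if failure is None or (device, failure) not in self.management_table:
            return False
        self.management_table.remove((device, failure))
        return True
```

Because the entry is keyed by both device name and alarm number, a recovery alarm from a different device name does not clear it, which is exactly what causes trouble in the redundant case described next.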

The alarm management method shown in FIG. 16 cannot be applied as-is to a system with a redundant configuration. The problem that arises when it is applied unchanged to such a system is described with reference to FIG. 17.

Assume that device A (910; reference numeral omitted hereafter) and device B (910'; reference numeral omitted hereafter) form a redundant configuration in which device A is the active system and device B is the standby system. The monitoring server 920 and the user terminal 900 are the same as in the system described above, and steps (1) to (4) of the procedure are also the same. Here it is assumed that device A breaks down and operation is switched to the standby device B.

(5) When the breakdown occurs, (6) device A switches operation to device B. After the switchover, the user terminal 900 is connected to device B.

(7) Suppose that, after operation has switched to device B, the user retries and the user registration succeeds. Device B holds no failure information saying that user registration failed on device A, so it cannot determine that the failure has been recovered. Consequently no recovery alarm is sent to the monitoring server 920, and the monitoring server 920 concludes that the failure is still ongoing even though it has been recovered.
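The mismatch described above can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical `Device` class that issues a recovery alarm only for failures it holds itself; the alarm numbers 001 and 002 follow the example in the text.

```python
class Device:
    """Hypothetical device that only knows about failures it observed itself."""

    def __init__(self, name):
        self.name = name
        self.held_failures = set()  # failure alarm numbers held locally, step (3)

    def registration_failed(self):
        # Steps (1)-(3): hold the failure number and emit a failure alarm.
        self.held_failures.add("001")
        return (self.name, "001")

    def registration_succeeded(self):
        # A recovery alarm is issued only if this device holds the failure.
        if "001" in self.held_failures:
            self.held_failures.discard("001")
            return (self.name, "002")
        return None


monitor_table = set()  # alarm management table 9200 on the monitoring server
device_a, device_b = Device("A"), Device("B")

# Steps (1)-(4): device A fails to register the user; the monitor records it.
monitor_table.add(device_a.registration_failed())

# Steps (5)-(6): device A breaks down and device B takes over with no state.
# Step (7): the retry succeeds on device B, but B has nothing to recover.
recovery = device_b.registration_succeeded()
```

After this runs, `recovery` is `None` and `monitor_table` still contains the entry for device A, which is the stale alarm the invention sets out to eliminate.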

There is thus a problem in that the alarm management table 9200 held by the monitoring server 920 and the failure information held by the active device become inconsistent. The same problem arises when the redundant system is built on a virtualized system.

The present invention has been made in view of this problem, and its object is to provide a redundant system and an alarm management method in which no mismatch in failure information arises between the monitoring server and the active device.

The redundant system of the present invention includes an active device, a standby device, and a monitoring server that communicate via a network. The active device includes a failure alarm issuing unit, a failure event notification unit, and a device alarm issuing unit. The failure alarm issuing unit notifies the monitoring server of failure alarm information when a failure occurs. The failure event notification unit notifies the standby device of the failure event information it holds when the active device is switched to standby. The device alarm issuing unit notifies the monitoring server of device alarm information that includes the device name of the standby device that has been switched to active. The standby device includes a recovery alarm issuing unit, which notifies the monitoring server of the identifier included in the failure alarm information and the device name of the standby device as recovery alarm information. The monitoring server includes a re-registration unit and a failure alarm cancellation unit. On receiving the device alarm information, the re-registration unit rewrites the device name registered in association with the failure alarm information to the device name of the standby device included in the device alarm information, and re-registers it. The failure alarm cancellation unit cancels a failure alarm by deleting the failure alarm information that matches the identifier and device name included in the recovery alarm information.

In the alarm management method for a redundant system according to the present invention, the active device performs a failure alarm issuing process, a failure event notification process, and a device alarm issuing process. The standby device performs a recovery alarm issuing process. The monitoring server performs a re-registration process and a failure alarm cancellation process. The failure alarm issuing process notifies the monitoring server of failure alarm information when a failure occurs. The failure event notification process notifies the standby device of the held failure event information when the active device is switched to standby. The device alarm issuing process notifies the monitoring server of device alarm information that includes the device name of the standby device that has been switched to active. The recovery alarm issuing process, on recovery from a failure, notifies the monitoring server of the identifier included in the failure event information and the device name of the standby device as recovery alarm information. The re-registration process, on receiving the device alarm information, rewrites the device name registered in association with the failure alarm information to the device name of the standby device included in the device alarm information, and re-registers it. The failure alarm cancellation process deletes the failure alarm information that matches the identifier and device name included in the recovery alarm information.

With the redundant system and alarm management method of the present invention, when the active system is switched because of a breakdown, maintenance, or the like, the failure event information held by the old active device is passed to the new active device (the old standby device), so the new active device can detect recovery from a failure that occurred on the old active device. In addition, because the device name of the new active device is reported to the monitoring server as device alarm information at switchover, the monitoring server can determine that a failure that occurred on the old active device has been recovered on the new active device. As a result, the failure information held by the monitoring server and that held by the active device do not diverge.

FIG. 1 is a diagram showing a configuration example of a redundant system 100 according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing the redundant system 100 when a failure occurs.
FIG. 3 is a diagram showing the redundant system 100 when a breakdown occurs.
FIG. 4 is a diagram showing the redundant system 100 at failure recovery.
FIG. 5 is a diagram showing the operation sequence of the redundant system 100.
FIG. 6 is a diagram showing a more specific functional configuration example of the active device 20.
FIG. 7 is a diagram showing a more specific functional configuration example of the standby device 30.
FIG. 8 is a diagram showing a more specific functional configuration example of the monitoring server 40.
FIG. 9 is a diagram showing a redundant system 200 according to Embodiment 2 of the present invention when a failure occurs.
FIG. 10 is a diagram showing the redundant system 200 when a breakdown occurs in the active device 20.
FIG. 11 is a diagram showing the redundant system 200 when a breakdown occurs in the new active device 30 (old standby device).
FIG. 12 is a diagram showing the redundant system 200 when a failure is recovered on the standby device 31.
FIG. 13 is a diagram showing a redundant system 300 when a failure occurs.
FIG. 14 is a diagram showing the redundant system 300 when a breakdown occurs.
FIG. 15 is a diagram showing the redundant system 300 at failure recovery.
FIG. 16 is a diagram briefly illustrating the alarm management procedure in a system configuration that is not redundant, where (a) shows the time a failure occurs and (b) shows the time the failure is recovered.
FIG. 17 is a diagram illustrating the problem that arises when the alarm management method shown in FIG. 16 is applied unchanged to a system with a redundant configuration.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated.

[Embodiment 1]
FIG. 1 shows a configuration example of the redundant system 100 according to Embodiment 1. The redundant system 100 includes an active device 20, a standby device 30, and a monitoring server 40 that communicate via a network 930.

The active device 20 includes a failure alarm issuing unit 223, a failure event notification unit 224, and a device alarm issuing unit 226. The failure alarm issuing unit 223 notifies the monitoring server 40 of the failure event information at the time of a failure as failure alarm information. The failure event notification unit 224 notifies the standby device 30 of the held failure event information when the active device is switched to standby. Failure alarm information and failure event information are logical error information, such as a user information registration failure.

The device alarm issuing unit 226 notifies the monitoring server 40 of device alarm information that includes the device name of the standby device 30 that has been switched to active. Device alarm information is physical error information, such as a hardware breakdown of the device.

The standby device 30 includes a recovery alarm issuing unit 321. On recovery from a failure, the recovery alarm issuing unit 321 notifies the monitoring server of the identifier included in the failure event information and the device name of the standby device as recovery alarm information.

The monitoring server 40 includes a re-registration unit 424 and a failure alarm cancellation unit 425. On receiving a device alarm, the re-registration unit 424 rewrites the device name stored in association with the failure event information to the device name of the standby device included in the device alarm information, and re-registers it. The failure alarm cancellation unit 425 cancels a failure alarm by deleting the failure alarm information that matches the identifier and device name included in the recovery alarm information.
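The two monitoring server units can be sketched as methods on a small class. The class and method names are hypothetical, and one detail is an assumption: the text says the device alarm carries the new active device's name, and this sketch assumes the server also knows the name of the device that issued the alarm (here `old_device`), as it does for failure alarms.

```python
from typing import Dict, Set, Tuple

# Alarm correspondence table 422, indexed by recovery alarm number.
RECOVERY_TO_FAILURE: Dict[str, str] = {"002": "001"}


class MonitoringServer40:
    """Hypothetical model of monitoring server 40."""

    def __init__(self) -> None:
        # Alarm management table 420: (failure alarm number, device name) pairs.
        self.table: Set[Tuple[str, str]] = set()

    def register(self, alarm_number: str, device: str) -> None:
        self.table.add((alarm_number, device))

    def re_register(self, old_device: str, new_device: str) -> None:
        """Re-registration unit 424: rewrite entries to the new active device."""
        self.table = {
            (number, new_device if device == old_device else device)
            for number, device in self.table
        }

    def cancel(self, recovery_number: str, device: str) -> bool:
        """Failure alarm cancellation unit 425: delete the matching entry."""
        failure = RECOVERY_TO_FAILURE.get(recovery_number)
        if failure is None or (failure, device) not in self.table:
            return False
        self.table.discard((failure, device))
        return True
```

With this shape the server never tracks which device is currently active. It only rewrites names when told to, which matches the stated benefit that the monitoring server need not manage active and standby roles.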

In the redundant system 100 of Embodiment 1 described above, when the active device 20 is switched to standby because of a breakdown or the need for maintenance, the failure event information held by the active device 20 is passed to the new active device (the standby device) 30, so the new active device 30 obtains the information on failures that occurred on the old active device 20. The device name of the new active device 30 is also reported to the monitoring server 40, so past failures can be associated with the new active device 30.

As a result, when a failure that occurred on the old active device 20 is recovered on the new active device 30, the recovery can be reported to the monitoring server 40, and the monitoring server 40 can delete and cancel the failure alarm information using the identifier included in the recovery alarm information and the device name of the new active device 30. In this way the redundant system 100 prevents divergence between the failure information held by the monitoring server 40 and that held by the active device 30. Furthermore, with the alarm management method described above, the monitoring server does not need to manage which device is operating as active or standby.

Next, the operation of the redundant system 100 over time is described with reference to FIGS. 2 to 4. FIG. 2 shows the redundant system 100 when a failure occurs. The active device 20 and the standby device 30 form a redundant configuration, and the monitoring server 40 has an alarm management table 420 and an alarm correspondence table 422. The alarm management table 420 associates the alarm number that identifies a failure, carried in the failure alarm information, with a device name. The alarm correspondence table 422 associates each alarm number with its corresponding recovery alarm number. The alarm numbers and the failure/recovery pairings in the alarm correspondence table 422 are information that the monitoring server 40 holds in advance.

The active device 20 is, for example, a server that performs subscriber control and registers user information. Here it is assumed that a failure has occurred in registering the user information of user 11 entered from the user terminal 900. The operation is described with reference also to the operation sequence diagram shown in FIG. 5.

The active device 20 is operating as the active device (step S0). In this state, assume that a failure occurs in which the user information of user 11 cannot be registered normally, for example because the user information is in the wrong format (step S1). The active device 20 notifies the monitoring server 40 of the failure alarm information for the failure (step S2). The failure alarm information includes an alarm number that identifies the failure (for example, 001) and the name of the device that issued it (20).

On receiving the failure alarm information, the monitoring server 40 looks up the alarm number (001) it contains in the alarm correspondence table 422 and thereby detects that a failure has occurred on the active device 20 (step S4). It then registers the alarm number (001) and the issuing device name (20) in the alarm management table 420 (step S5).

FIG. 3 shows the redundant system 100 when a breakdown occurs. Assume, for example, that a breakdown occurs in the active device 20 (step S6) and operation has to be switched to the standby device 30 (step S7). Alternatively, assume that the active device 20 is stopped for maintenance and operation is switched to the standby device 30 (step S7). If no switchover takes place, the active device 20 continues operating ("No" at step S6).

When switching operation to the standby device 30, the active device 20 notifies the standby device 30 of the failure event information it holds (step S8). The standby device 30 holds the received failure event information (step S9). The failure event information is basically the same as the failure alarm information described above. The active device 20 also notifies the monitoring server 40 of device alarm information that includes the device name of the new active device 30 and an alarm number (003) that identifies the switchover (step S10). After issuing the device alarm information, the active device 20 operates as a standby device. The device alarm information is sent only to the monitoring server 40.
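The handover in steps S8 and S10 can be sketched from the active device's side. All class and field names here are hypothetical; in particular the `issuer` field on the device alarm is an assumption, since the text only states that the alarm carries the new active device's name and the alarm number (003).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class FailureEvent:
    alarm_number: str  # e.g. "001"
    device_name: str   # device that issued the original failure alarm


@dataclass(frozen=True)
class DeviceAlarm:
    alarm_number: str       # "003" identifies the switchover in the example
    issuer: str             # assumption: the old active device names itself
    new_active_device: str  # device name the monitoring server re-registers to


@dataclass
class ActiveDevice:
    name: str
    held_events: List[FailureEvent] = field(default_factory=list)
    role: str = "active"

    def switch_over(self, standby_name: str) -> Tuple[List[FailureEvent], DeviceAlarm]:
        # Step S8: hand every held failure event to the standby device.
        handover = list(self.held_events)
        # Step S10: tell the monitoring server which device is now active.
        alarm = DeviceAlarm("003", self.name, standby_name)
        # After issuing the device alarm, this device operates as standby.
        self.role = "standby"
        return handover, alarm
```

Returning the two messages separately mirrors the text: the failure events go to the standby device, while the device alarm goes only to the monitoring server.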

On receiving the device alarm information, the monitoring server 40 looks up the alarm number (003) it contains in the alarm correspondence table 422 and thereby detects that a breakdown has occurred in the active device 20 (step S11). Having learned that the active device has switched to device 30, the monitoring server 40 replaces the alarm number (001) and issuing device name (20) registered in step S5 with alarm number (001) and device name (30) (step S12). This re-registration means that the failure alarm information generated on the active device 20 has been taken over by the standby device 30.

FIG. 4 shows the redundant system 100 at failure recovery. It illustrates the case where a failure that occurred on the old active device 20 is recovered on the new active device 30 (the old standby device). Assume, for example, that user 11 retries and the registration of the user information of user 11 succeeds.

When the user registration of user 11 succeeds, the standby device 30 determines from the failure event information it holds that this is the recovery corresponding to alarm number (001), and thereby detects recovery from the failure (step S13). Having detected the recovery, the new active device 30 notifies the monitoring server 40 of the recovery alarm number (002) corresponding to the alarm number (001) in the failure event information, together with its own device name 30, as recovery alarm information (step S14).

On receiving the recovery alarm information, the monitoring server 40 determines from the recovery alarm number (002) that the failure with alarm number (001) has been recovered on the device named 30 (step S15). The monitoring server 40 then cancels the failure alarm by deleting from the alarm management table 420 the failure alarm information registered under alarm number (001) and device name 30 (step S16).
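Steps S9, S13, and S14 on the new active device can be sketched as follows. The text does not say how a successful operation is matched to a held failure event, so the `operation` label used as the key is purely an assumption for illustration, as are the class and method names.

```python
from typing import Dict, Optional, Tuple

# Assumed copy of the failure -> recovery pairing available to the device.
FAILURE_TO_RECOVERY: Dict[str, str] = {"001": "002"}


class NewActiveDevice:
    """Hypothetical model of the old standby device 30 after switchover."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Held failure events, keyed by a hypothetical operation label.
        self.held: Dict[str, str] = {}

    def receive_handover(self, operation: str, alarm_number: str) -> None:
        # Step S9: keep the failure event received from the old active device.
        self.held[operation] = alarm_number

    def operation_succeeded(self, operation: str) -> Optional[Tuple[str, str]]:
        # Step S13: a success clears a held failure for the same operation.
        alarm_number = self.held.pop(operation, None)
        if alarm_number is None:
            return None  # nothing was pending, so no recovery alarm is issued
        # Step S14: recovery alarm = (recovery alarm number, own device name).
        return FAILURE_TO_RECOVERY[alarm_number], self.name
```

The returned pair carries the device's own name (30) rather than the original issuer (20), so it matches the entry that the monitoring server re-registered at step S12.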

In this way, when operation switches from the active device 20 to the standby device 30 because of a breakdown, maintenance, or the like, the failure event information sent from the active device 20 to the standby device hands the failure information over to the standby device 30, so recovery from a failure that occurred on the active device 20 can be detected on the standby device (the new active device). In addition, the device alarm information tells the monitoring server 40 that the active device 20 has been switched, so the monitoring server 40 can detect recovery from the failure using both the recovery alarm number and the device name.

As described above, the redundant system 100 prevents a mismatch in failure information between the monitoring server and the active device. That is, the failure information held by the monitoring server and that held by the active device do not diverge. Moreover, the monitoring server 40 does not need to manage which device is operating as active or standby.

Next, Embodiment 1 is described in more detail using a more specific functional configuration example of each device constituting the redundant system 100.

[Active device]
FIG. 6 shows a more specific functional configuration example of the active device 20. The active device 20 includes a communication interface 21 and a control unit 22. The control unit 22 includes the following functional components: a user information registration unit 220, a user information recording unit 221, a failure detection holding unit 222, a failure alarm issuing unit 223, a failure event notification unit 224, a system switching signal generation unit 225, a device alarm issuing unit 226, a failure recovery detection unit 227, and a recovery alarm issuing unit 321. Each functional component is connected through the communication interface 21 and the network 930. The active device 20 is realized, for example, by loading a predetermined program into a computer composed of a ROM, a RAM, a CPU, and the like, and having the CPU execute that program.

The user information registration unit 220 is a functional component that is needed when the active device 20 is, for example, a server (device) that performs subscriber control; it registers user information entered from the user terminal 900 in the user information recording unit 221. When the active device 20 is used for another purpose, the user information registration unit 220 and the user information recording unit 221 may be omitted. That is, they are not essential functional components of Embodiment 1.

The failure detection holding unit 222 is the functional component that detects the failure in step S1 shown in FIG. 5. It detects, for example, a case where the user information of user 11 is in the wrong format because of an input error. The failure alarm issuing unit 223 notifies the monitoring server 40 of failure alarm information when a failure occurs (step S2).

When the active device 20 is switched to the standby system, the failure event notification unit 224 notifies the standby device 30 of the failure event information held by the failure detection holding unit 222 (step S8). The switch to the standby system is made, for example, on the basis of an externally input signal when the active device 20 is stopped for maintenance. Alternatively, the active device 20 may include a system switching signal generation unit 225 that detects an internal fault of the active device 20, and the switch to the standby system may be made on the basis of its output signal.

The device alarm issuing unit 226 notifies the monitoring server 40 of device alarm information that includes the device name of the standby device 30 that has been switched to the active system (step S10).

While the active device 20 is operating, the failure recovery detection unit 227 detects recovery from a failure that corresponds to the failure alarm information held in the failure detection holding unit 222. The recovery alarm issuing unit 321 notifies the monitoring server 40 of the failure recovery detected by the failure recovery detection unit 227 as a recovery alarm. The processing of the failure recovery detection unit 227 and the recovery alarm issuing unit 321 is omitted from FIG. 5 described above. These processing steps are inserted between step S3 and step S6 of FIG. 5 when the active device 20 does not need to switch systems because of a device fault, maintenance, or the like (the "No" loop of step S6).
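The division of work among the units of the active device can be illustrated with a minimal Python sketch. The class name `ActiveDevice`, its method names, and the tuple layout of the alarm messages are assumptions made for illustration only; the description does not define a message format. The alarm numbers 001 to 003 follow the examples used in the text.

```python
class ActiveDevice:
    """Hypothetical model of the active device 20 (FIG. 6)."""

    def __init__(self, name):
        self.name = name
        self.held = {}    # unit 222: failure alarm number -> recovery alarm number
        self.outbox = []  # alarm information sent to the monitoring server 40

    def detect_failure(self, alarm_no, recovery_no):
        # Steps S1-S2: hold the failure, then issue failure alarm information.
        self.held[alarm_no] = recovery_no
        self.outbox.append(("failure", alarm_no, self.name))

    def detect_recovery(self, alarm_no):
        # No-switchover path (units 227 and 321): issue a recovery alarm.
        recovery_no = self.held.pop(alarm_no, None)
        if recovery_no is not None:
            self.outbox.append(("recovery", recovery_no, self.name))

    def switch_over(self, standby_name):
        # Step S8: the held failures become the failure event notification.
        notification = dict(self.held)
        # Step S10: device alarm information names the new active device.
        self.outbox.append(("device", "003", standby_name))
        return notification


device = ActiveDevice("20")
device.detect_failure("001", "002")
notification = device.switch_over("30")
```

In this sketch `notification` stands in for the failure event information delivered to the standby device, and `outbox` stands in for the traffic to the monitoring server 40.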

[Standby device]
FIG. 7 shows a more specific functional configuration example of the standby device 30. Functional components bearing the same reference numerals as those of the active device 20 shown in FIG. 6 are identical. The standby device 30 does not include the failure event notification unit 224, the system switching signal generation unit 225, or the device alarm issuing unit 226 of the active device 20. A configuration that includes them is described in Embodiment 2.

Conversely, the standby device 30 includes a failure event information storage unit 320, which the active device 20 does not. The failure event information storage unit 320 stores the failure event information notified from the active device 20 (step S9). As noted above, the failure event information is the same as the failure alarm information, so the failure event information storage unit 320 holds information on failures that occurred in the active device 20.

The failure recovery detection unit 227′ of the standby device 30 differs in that it detects recovery of both kinds of failure: failures held in the failure detection holding unit 222, which occurred while the standby device 30 was operating as the active device, and failures recorded in the failure event information held in the failure event information storage unit 320. The recovery alarm issuing unit 321 of the standby device 30 can therefore notify the monitoring server 40 that a failure that occurred in the old active device 20 has been recovered at the standby device 30.
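The two sources consulted by the failure recovery detection unit 227′ can be sketched in Python as follows. The class `StandbyDevice`, its method names, and the returned tuple are hypothetical; only the existence of the two stores (units 222 and 320) and the use of the device's own name in the recovery alarm come from the description.

```python
class StandbyDevice:
    """Hypothetical model of the standby device 30 (FIG. 7)."""

    def __init__(self, name):
        self.name = name
        self.held = {}       # unit 222: failures detected by this device itself
        self.inherited = {}  # unit 320: failures handed over by the old active device

    def store_failure_events(self, notification):
        # Step S9: keep the failure event information received at switchover.
        self.inherited.update(notification)

    def detect_failure(self, alarm_no, recovery_no):
        self.held[alarm_no] = recovery_no

    def detect_recovery(self, alarm_no):
        # Unit 227': search the device's own failures and the inherited ones.
        for store in (self.held, self.inherited):
            if alarm_no in store:
                recovery_no = store.pop(alarm_no)
                # Unit 321: the recovery alarm carries this device's own name.
                return ("recovery", recovery_no, self.name)
        return None


standby = StandbyDevice("30")
standby.store_failure_events({"001": "002"})
alarm = standby.detect_recovery("001")
```

Here `alarm` carries the recovery alarm number 002 together with device name 30, even though failure 001 was originally detected by device 20.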

[Monitoring server]
FIG. 8 shows a more specific functional configuration example of the monitoring server 40. The monitoring server 40 includes a communication interface 41 and a control unit 42. The control unit 42 includes the following functional components: an alarm management table 420, an alarm management table registration unit 421, an alarm correspondence table 422, an alarm number inquiry unit 423, a re-registration unit 424, and a failure alarm canceling unit 425. Each functional component is connected through the communication interface 41 and the network 930. The monitoring server 40 is realized, for example, by loading a predetermined program into a computer composed of a ROM, a RAM, a CPU, and the like, and having the CPU execute that program.

The alarm correspondence table 422 records the correspondences for the failure alarm number (for example, 001) included in the failure alarm information and for the alarm number (for example, 003) included in the device alarm information. The alarm correspondence table 422 manages failure alarm information and device alarm information separately.

The alarm management table 420 records the alarm number (for example, 001) included in the failure alarm information in association with either the device name (20) of the active device 20 or the device name (30) of the standby device 30 included in the device alarm information.

The alarm number inquiry unit 423 determines the meaning of an alarm number included in alarm information (for example, 001 to 004) by looking it up in the alarm correspondence table 422 (steps S4 and S11).

The alarm management table registration unit 421 registers in the alarm management table 420 the pair consisting of the identifier that identifies the failure, included in the failure alarm information, and the device name (steps S5 and S12). The identifier that identifies the failure is the alarm number (for example, 001).

On receiving device alarm information, the re-registration unit 424 rewrites the device name registered for the failure alarm information to the device name of the standby device included in that device alarm information, and re-registers it (step S12). In this example, the entry "001: device name 20" is re-registered as "001: device name 30".

The failure alarm canceling unit 425 cancels a failure alarm by deleting the entry in the alarm management table 420 that corresponds to the recovery alarm number and device name included in the recovery alarm information (step S16). In this example, the failure alarm is canceled by deleting the entry registered as "001: device name 30".
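The interplay of the two tables and the four units of the monitoring server can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the class `MonitoringServer`, the `receive` method, and the contents of `CORRESPONDENCE` are hypothetical, and re-registration here rewrites every registered entry, because the device alarm information described above carries only the new device name.

```python
class MonitoringServer:
    """Hypothetical model of the monitoring server 40 (FIG. 8)."""

    # Alarm correspondence table 422: alarm number -> (meaning, paired number).
    CORRESPONDENCE = {
        "001": ("failure", "002"),
        "002": ("recovery", "001"),
        "003": ("device_fault", "004"),
        "004": ("device_recovery", "003"),
    }

    def __init__(self):
        self.table = set()  # alarm management table 420: (alarm number, device name)

    def receive(self, alarm_no, device):
        meaning, paired = self.CORRESPONDENCE[alarm_no]  # unit 423 (S4, S11)
        if meaning == "failure":
            self.table.add((alarm_no, device))           # unit 421 (S5)
        elif meaning == "device_fault":
            # Unit 424 (S12): rewrite registered names to the new active device.
            self.table = {(number, device) for number, _ in self.table}
        elif meaning == "recovery":
            self.table.discard((paired, device))         # unit 425 (S16)
        # Handling of "device_recovery" (004) is not described in this section.


server = MonitoringServer()
server.receive("001", "20")  # failure alarm from device 20
server.receive("003", "30")  # device alarm: device 30 is now active
after_switch = set(server.table)
server.receive("002", "30")  # recovery alarm from device 30
```

Because cancellation matches on both the alarm number and the device name, a recovery alarm from device 30 would leave an entry registered under device 20 untouched. The re-registration in step S12 is what lets the match succeed after a switchover.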

[Embodiment 2]
The redundant system 100 of Embodiment 1 was described using a duplex configuration with one active device and one standby device, but the idea shown for the redundant system 100 can also be applied to an n-fold redundant system (n ≥ 2), such as a triplex or quadruplex configuration. The case of n = 3 is described with reference to FIGS. 9 to 12.

FIG. 9 shows the redundant system 200 of Embodiment 2 when a failure occurs. The redundant system 200 adds one standby device 31 to the configuration of the redundant system 100 to form a triplex configuration. FIG. 9 corresponds to FIG. 2 for the redundant system 100 of Embodiment 1: the monitoring server 40 looks up the alarm number (001) included in the failure alarm information in the alarm correspondence table 422, thereby detecting that a failure has occurred in the active device 20 (step S4), and registers the alarm number (001) and the name (20) of the device that issued the alarm in the alarm management table 420 (step S5). The processing up to step S5 is the same as in FIG. 2, so the same step numbers are shown in FIG. 9 and their description is omitted.

FIG. 10 shows the redundant system 200 when a device fault occurs in the active device 20. As in FIG. 3, the monitoring server 40, on being informed that the active device has switched to device 30, re-registers the alarm number (001) and issuing device name (20) registered in step S5 as alarm number (001) and device name (30) (step S12). The processing up to this re-registration is the same as in FIG. 3, so the same step numbers are shown in FIG. 10 and their description is omitted.

FIG. 11 shows the redundant system 200 when a device fault occurs in the new active device 30 (the old standby device). Here it is assumed that the new active device 30 develops a fault while the failure that occurred in the old active device 20 has not yet been recovered. The subsequent operations are described with new step numbers starting from S26.

Suppose, for example, that a device fault occurs in the active device 30 (step S26) and operation has to be switched to the standby device 31. Alternatively, suppose that the active device 30 is stopped for maintenance and operation is switched to the standby device 31 (step S27).

The active device 30 notifies the standby device 31 of the failure event information it holds (step S28). The standby device 31 holds the received failure event information (step S29). The active device 30 also notifies the monitoring server 40 of device alarm information that includes the device name of the new active device 31 to which operation has been switched, the alarm number (003) that identifies the switchover, and the device recovery alarm number (004) corresponding to that fault alarm number (step S30). After issuing the device alarm information, the active device 30 operates as a standby device.

On receiving the device alarm information, the monitoring server 40 looks up the alarm number (003) included in it in the alarm correspondence table 422 and thereby detects that a device fault has occurred in the active device 30 (step S31). Having been informed that the active device has switched to device 31, the monitoring server 40 then re-registers the alarm number (001) and issuing device name (30) re-registered in step S12 as alarm number (001) and device name (31) (step S32). This re-registration means that the failure alarm information held for the active device 30 has been handed over to the standby device 31.

FIG. 12 shows the redundant system 200 when the failure is recovered at the standby device 31. It shows the case where the failure that occurred in the old active device 20 is recovered at the new active device 31 (an old standby device). Assume, for example, that user 11 retries and the registration of user 11's user information succeeds. When that registration succeeds, the standby device 31 determines from the failure event information it holds that this is the recovery corresponding to failure alarm number (001), and detects recovery of the failure (step S33). Having detected the recovery, the new active device 31 notifies the monitoring server 40, as recovery alarm information, of the recovery alarm number (002) corresponding to the alarm number (001) included in the failure alarm information and of the device name 31 of the new active device (step S34).

On receiving the recovery alarm information, the monitoring server 40 determines from the recovery alarm number (002) that the failure with alarm number (001) has been recovered at device name 31 (step S35). The monitoring server 40 then cancels the failure alarm by deleting from the alarm management table 420 the failure alarm information registered under alarm number (001) and device name 31 (step S36).
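The n = 3 sequence of FIGS. 9 to 12 can be traced as successive states of the alarm management table 420. The Python sketch below replays the four messages of that sequence; the function `replay`, the message tuples, and the `recovery_to_failure` mapping are hypothetical stand-ins for the server-side processing, written only to make the table states explicit.

```python
def replay(messages):
    """Return a snapshot of the alarm management table after each message."""
    table = {}  # failure alarm number -> registered device name
    recovery_to_failure = {"002": "001"}  # assumed excerpt of table 422
    history = []
    for kind, number, device in messages:
        if kind == "failure":
            table[number] = device                      # register (S5)
        elif kind == "device":
            table = {no: device for no in table}        # re-register (S12, S32)
        elif kind == "recovery":
            failure_no = recovery_to_failure[number]
            if table.get(failure_no) == device:         # number and name must match
                del table[failure_no]                   # cancel (S36)
        history.append(dict(table))
    return history


messages = [
    ("failure", "001", "20"),   # FIG. 9: failure detected on device 20
    ("device", "003", "30"),    # FIG. 10: switchover from 20 to 30
    ("device", "003", "31"),    # FIG. 11: switchover from 30 to 31
    ("recovery", "002", "31"),  # FIG. 12: recovery reported by device 31
]
history = replay(messages)
```

The snapshots in `history` show the entry for alarm 001 moving from device 20 to 30 to 31 before it is deleted, which is the handover chain described for steps S5, S12, S32, and S36.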

As described above, the idea shown for the redundant system 100 can be applied to the n-fold (n ≥ 2) redundant system 200.

Each of the standby devices 30 and 31 in the redundant system 200 is a device that includes all of the functional components of both the active device 20 (FIG. 6) and the standby device 30 (FIG. 7) described above; the redundant system 200 is formed by providing n such devices.

That is, the device includes the user information registration unit 220, the user information recording unit 221, the failure detection holding unit 222, the failure alarm issuing unit 223, the failure event notification unit 224, the system switching signal generation unit 225, the device alarm issuing unit 226, the failure recovery detection unit 227, the recovery alarm issuing unit 321, and the failure event information storage unit 320.
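A device that combines both sets of functional components can pass failure information down a chain of switchovers. The Python sketch below is one possible reading: the class `Device` and its members are hypothetical, and the description does not state whether a device discards its own copy after handing it over, so this sketch simply leaves it in place.

```python
class Device:
    """Hypothetical unified device holding the units of FIG. 6 and FIG. 7."""

    def __init__(self, name):
        self.name = name
        self.active = False
        self.held = {}       # unit 222: failures detected while active
        self.inherited = {}  # unit 320: failures received at switchover

    def detect_failure(self, alarm_no, recovery_no):
        self.held[alarm_no] = recovery_no

    def switch_to(self, standby):
        # Unit 224: pass on every failure this device knows about.
        standby.inherited.update(self.inherited)
        standby.inherited.update(self.held)
        self.active, standby.active = False, True
        # Unit 226: device alarm information for the monitoring server.
        return ("device", "003", standby.name)


d20, d30, d31 = Device("20"), Device("30"), Device("31")
d20.active = True
d20.detect_failure("001", "002")
d20.switch_to(d30)
device_alarm = d30.switch_to(d31)
```

After the second switchover, device 31 holds the failure first detected by device 20, so it can later report the matching recovery under its own device name.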

[Embodiment 3]
The redundant systems 100 and 200 of Embodiments 1 and 2 can also be built on a virtualization system. The redundant system 300 of Embodiment 3, built on a virtualization system, consists of an active virtualized device 320, standby virtualized devices 330 and 331, and the monitoring server 40. The active and standby devices of the redundant system 300 are configured as virtual devices in the virtualization system. Virtualization systems and virtualized devices are well-known technology.

The active virtualized device 320 of Embodiment 3 differs from the active device 20 of Embodiment 1 or 2, and the standby virtualized devices 330 and 331 of Embodiment 3 differ from the standby devices 30 and 31 of Embodiment 1 or 2, only in that the devices are implemented in a virtualization system; the processing performed by each device in Embodiment 3 is the same as that described in Embodiments 1 and 2. FIG. 13 shows the redundant system 300 of Embodiment 3 when a failure occurs, FIG. 14 shows it when a device fault occurs, and FIG. 15 shows it at failure recovery.

FIG. 13 corresponds to FIG. 2, FIG. 14 to FIG. 3, and FIG. 15 to FIG. 4. The same step numbers as above are shown in FIGS. 13 to 15, and their description is omitted.

As described above, the redundant systems 100, 200, and 300 of the present embodiments have three characteristic features. First, when the active device is switched to the standby device, the failure event information held by the active device is sent to the standby device as a failure event notification. Second, at that time, the device name of the standby device that newly becomes the active device is sent to the monitoring server as device alarm information. Third, the monitoring server rewrites the device name it has registered for the failure alarm information to the device name included in the device alarm information and re-registers it. With these configurations, the redundant system of the present embodiments keeps the failure information held by the monitoring server and that held by the active device from diverging.

When the processing means of the above devices are realized by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer realizes the processing means of each device on the computer.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

Each means may be implemented by executing a predetermined program on a computer, or at least part of the processing may be realized in hardware.

As described above, the present invention is not limited to the embodiments described above, and various modifications are possible within the scope of its gist.

100: Redundant system
930: Network
900: User terminal
20: Active device
223: Failure alarm issuing unit
224: Failure event notification unit
226: Device alarm issuing unit
30: Standby device
321: Recovery alarm issuing unit
40: Monitoring server
424: Re-registration unit
425: Failure alarm canceling unit

Claims

A redundant system comprising an active device, a standby device, and a monitoring server that communicate via a network, wherein
the active device comprises:
a failure alarm issuing unit that notifies the monitoring server of failure alarm information when a failure occurs;
a failure event notification unit that, when the active device is switched to the standby system, notifies the standby device of the failure event information that the active device holds; and
a device alarm issuing unit that notifies the monitoring server of device alarm information including the device name of the standby device that has been switched to the active system,
the standby device comprises:
a recovery alarm issuing unit that, at the time of failure recovery, notifies the monitoring server of the identifier included in the failure event information and the device name of the standby device as recovery alarm information, and
the monitoring server comprises:
a re-registration unit that, on receiving the device alarm information, rewrites the device name registered in association with the failure alarm information to the device name of the standby device included in the device alarm information and re-registers it; and
a failure alarm canceling unit that cancels the failure alarm information corresponding to the identifier and device name included in the recovery alarm information.

The redundant system according to claim 1, wherein
the active device further comprises:
a failure detection holding unit that detects and holds failure event information when a failure occurs,
the standby device further comprises:
a failure event information storage unit that stores the failure event information notified from the active device; and
a failure recovery detection unit that detects recovery both of the failure event information stored in the failure event information storage unit and of a newly occurring failure, and
the monitoring server further comprises:
an alarm correspondence table that records the correspondence between the alarm number and the recovery alarm number included in the failure alarm information, and the correspondence between the alarm number and the device recovery alarm number included in the device alarm information;
an alarm management table that records the alarm number included in the failure alarm information in association with the device name of the active device or the device name of the standby device included in the device alarm information;
an alarm number inquiry unit that determines the meaning of an alarm number by making an inquiry with the alarm number included in the failure alarm information and the alarm number included in the device alarm information; and
an alarm management table registration unit that registers in the alarm management table the pair of the identifier identifying the failure, included in the failure alarm information, and the device name.

The redundant system according to claim 1 or 2, wherein
a plurality of the standby devices are provided, and
each of the standby devices comprises the failure alarm issuing unit, the failure event notification unit, the device alarm issuing unit, the recovery alarm issuing unit, the failure detection holding unit, the failure event information storage unit, and the failure recovery detection unit.

The redundant system according to claim 1 or 2, wherein
the active device and the standby device are configured as virtual devices in a virtualization system.

An alarm management method executed by a redundant system comprising an active device, a standby device, and a monitoring server that communicate via a network, wherein
the active device performs:
a failure alarm issuing process of notifying the monitoring server of failure alarm information when a failure occurs;
a failure event notification process of notifying the standby device, when the active device is switched to the standby system, of the failure event information that the active device holds; and
a device alarm issuing process of notifying the monitoring server of device alarm information including the device name of the standby device that has been switched to the active system,
the standby device performs:
a recovery alarm issuing process of notifying the monitoring server, at the time of failure recovery, of the identifier included in the failure event information and the device name of the standby device as recovery alarm information, and
the monitoring server performs:
a re-registration process of, on receiving the device alarm information, rewriting the device name registered in association with the failure alarm information to the device name of the standby device included in the device alarm information and re-registering it; and
a failure alarm canceling process of canceling a failure alarm by deleting the failure alarm information corresponding to the identifier and device name included in the recovery alarm information.

The alarm management method executed by the redundant system according to claim 5, wherein
the active device further performs:
a failure detection holding process of detecting and holding failure event information when a failure occurs,
the standby device further performs:
a failure event information storage process of storing the failure event information notified from the active device in a failure event information storage unit; and
a failure recovery detection process of detecting recovery both of the failure event information stored in the failure event information storage unit and of a newly occurring failure, and
the monitoring server further performs:
an alarm number inquiry process of determining the meaning of an alarm number by querying an alarm management table, which is a table that associates alarm numbers with device names, with the alarm number included in the failure alarm information and the alarm number included in the device alarm information; and
an alarm management table registration process of registering in the alarm management table the pair of the identifier identifying the failure, included in the failure alarm information, and the device name.

The alarm management method executed by the redundant system according to claim 5 or 6, wherein
a plurality of the standby devices are provided, and
each of the standby devices performs the failure alarm issuing process, the failure event notification process, the device alarm issuing process, the recovery alarm issuing process, the failure detection holding process, the failure event information storage process, and the failure recovery detection process.

The alarm management method executed by the redundant system according to claim 5 or 6, wherein
the active device and the standby device are configured as virtual devices in a virtualization system, and each process performed by the active device and the standby device is performed by the virtualization system.