JP5343863B2 - Monitoring manager, general manager and node monitoring system

Info

Publication number: JP5343863B2
Application number: JP2009553409A
Authority: JP
Inventors: 能史小角; 寛達大崎; 貴裕曽小川; 隆寿岩間; 宏順須賀田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-02-13
Filing date: 2009-02-06
Publication date: 2013-11-13
Anticipated expiration: 2029-02-06
Also published as: WO2009101908A1; JPWO2009101908A1

Description

The present invention relates to a system in which a plurality of managers monitor the data processing of nodes over a network. More specifically, it relates to a node monitoring manager, a general manager, data processing methods for these managers, computer programs for the monitoring manager and the general manager, and a node monitoring system in which the monitoring manager and the general manager are connected via a network.

A recent example of a system that performs state monitoring and failover is described in Japanese Patent Application Laid-Open No. 2000-047894. The system described there consists of nodes, each including a monitoring agent, and a shared disk that includes a monitoring information repository. A conventional system with this configuration operates as follows.

The monitoring agent periodically monitors information such as the CPU load of each node, and the load information of all nodes is aggregated and stored in the monitoring information repository. When a failure occurs in any of the nodes, the load information is used to determine the failover destination node.

Other systems that perform state monitoring and failover of this kind are disclosed in Japanese Patent Application Laid-Open No. 2006-079161 and Japanese Patent Application Laid-Open No. 09-160884.

However, the technique described in Japanese Patent Application Laid-Open No. 2000-047894 has the following problems when the number of nodes is too large for the processing capacity of a single monitoring manager and the processing is therefore divided among a plurality of monitoring managers.

The first problem is that, unless communication for sharing the load information of each node among the monitoring managers is performed periodically, the time required for failover may become long. The reason is that, when a failure occurs in a node, every existing monitoring manager must be asked whether it has a node with a low load.

The second problem is that, if the monitoring managers are not queried when a failure occurs, the network traffic during normal node operation becomes large. The reason is that communication for sharing, among the monitoring managers, information on the lightly loaded nodes that each monitoring manager manages occurs periodically.

An object of the present invention is to provide a monitoring manager, a general manager, data processing methods for them, computer programs for these data processing devices, and a monitoring system that, even when the processing of a plurality of nodes is monitored by a plurality of monitoring managers, can shorten the processing time required for failover when a node fails while reducing the load on the network.

To achieve the above object, the present invention has:
accepting means for accepting, from a node that executes data processing, load information indicating the load involved in executing the data processing, together with a node identifier that identifies the node;
determining means for determining whether the load information accepted by the accepting means is equal to or greater than a predetermined threshold; and
information communication means for, when the determining means determines that the load information is less than the threshold, transmitting the load information determined to be less than the threshold, in association with the node identifier accepted together with that load information by the accepting means, to a general manager connected to a plurality of monitoring managers via a network.
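The accept, determine, and transmit sequence above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `LoadReport` and `handle_load`, the use of a percentage for the load, and the callback standing in for the network transmission are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LoadReport:
    node_id: str   # node identifier accepted together with the load information
    load: float    # load information, assumed here to be a percentage


def handle_load(
    report: LoadReport,
    threshold: float,
    send_to_general_manager: Callable[[LoadReport], None],
) -> bool:
    """Forward the report only when its load is below the threshold.

    Returns True when the report was forwarded to the general manager.
    """
    if report.load >= threshold:
        return False  # at or above the threshold: nothing is transmitted
    send_to_general_manager(report)
    return True


# Hypothetical usage: a list stands in for the general manager's inbox.
sent: List[LoadReport] = []
handle_load(LoadReport("node-5", 30.0), 50.0, sent.append)  # forwarded
handle_load(LoadReport("node-2", 80.0), 50.0, sent.append)  # suppressed
```

Because only lightly loaded nodes are reported, traffic toward the general manager stays limited to entries that could actually serve as failover candidates.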

The present invention also has:
receiving means for receiving, from a first monitoring manager connected via a network, load information indicating the load of a node monitored by the first monitoring manager, in association with a node identifier that identifies the node;
received information storage means for storing the load information and the node identifier received by the receiving means;
request accepting means for accepting, from a second monitoring manager connected via the network, a request to determine whether there is a node whose load information satisfies a predetermined threshold;
search means for comparing, in response to the request accepted by the request accepting means, the load information stored in the received information storage means with the predetermined threshold; and
response communication means for transmitting, when there is load information that satisfies the predetermined threshold, the node identifier corresponding to that load information to the second monitoring manager.
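The general manager side described above can be sketched in the same spirit. The class and method names below are hypothetical, and the sketch assumes that "satisfying the threshold" means a load strictly below it, which matches the failover condition stated later in the description.

```python
from typing import Dict, List


class GeneralManager:
    """Illustrative stand-in for the general manager (names are assumptions)."""

    def __init__(self) -> None:
        # Received information storage: node identifier -> load information.
        self._loads: Dict[str, float] = {}

    def receive(self, node_id: str, load: float) -> None:
        """Store load information reported by a monitoring manager."""
        self._loads[node_id] = load

    def find_candidates(self, threshold: float) -> List[str]:
        """Return identifiers of stored nodes whose load is below the threshold."""
        return sorted(n for n, load in self._loads.items() if load < threshold)


# Hypothetical usage: one manager reports, another one queries.
gm = GeneralManager()
gm.receive("node-5", 30.0)
gm.receive("node-7", 45.0)
candidates = gm.find_candidates(40.0)
```

A single lookup against the general manager replaces a round of queries to every monitoring manager when a failure occurs.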

The present invention is also a node monitoring system in which a monitoring manager that monitors nodes and a general manager are connected via a network, wherein
the monitoring manager has:
accepting means for accepting, from a node that executes data processing, load information indicating the load involved in executing the data processing, together with a node identifier that identifies the node;
determining means for determining whether the load information accepted by the accepting means is equal to or greater than a predetermined threshold; and
information communication means for, when the determining means determines that the load information is less than the threshold, transmitting the load information determined to be less than the threshold, in association with the node identifier accepted together with that load information by the accepting means, to the general manager connected to a plurality of monitoring managers via the network, and
the general manager has:
receiving means for receiving the load information for each node; and
received information storage means for storing the load information received by the receiving means in association with the node identifier that identifies the node.

The present invention also includes the steps of:
accepting, from a node that executes data processing, load information indicating the load involved in executing the data processing, together with a node identifier that identifies the node;
determining whether the accepted load information is equal to or greater than a predetermined threshold; and
when the accepted load information is determined to be less than the threshold, transmitting the load information determined to be less than the threshold, in association with the node identifier accepted together with that load information, to a general manager connected to a plurality of monitoring managers via a network.

The present invention is also a computer program for a monitoring manager, the program causing a computer to execute:
an accepting procedure for accepting, from a node that executes data processing, load information indicating the load on the node involved in executing the data processing, together with a node identifier that identifies the node;
a determining procedure for determining whether the load information accepted from the node is equal to or greater than a predetermined threshold; and
an information communication procedure for, when the accepted load information is determined to be less than the threshold, transmitting the load information in association with the node identifier to a general manager connected to a plurality of monitoring managers via a network.

The present invention also includes the steps of:
receiving, from a first monitoring manager connected via a network, load information indicating the load of a node monitored by the first monitoring manager, in association with a node identifier that identifies the node;
storing the received load information and the node identifier;
accepting, from a second monitoring manager connected via the network, a request to determine whether there is a node whose load information satisfies a predetermined threshold;
comparing, in response to the request, the stored load information with the predetermined threshold; and
when there is load information that satisfies the predetermined threshold, transmitting the node identifier corresponding to that load information to the second monitoring manager.

The present invention is also a computer program for a general manager, the program causing a computer to execute:
a receiving procedure for receiving, from a first monitoring manager connected via a network, load information indicating the load of a node monitored by the first monitoring manager, in association with a node identifier that identifies the node;
a received information storage procedure for storing the received load information and the node identifier;
a request accepting procedure for accepting, from a second monitoring manager connected via the network, a request to determine whether there is a node whose load information satisfies a predetermined threshold;
a search procedure for comparing, in response to the request, the stored load information with the predetermined threshold; and
a response communication procedure for transmitting, when there is load information that satisfies the predetermined threshold, the node identifier corresponding to that load information to the second monitoring manager.

The various components of the present invention need only be formed so as to realize their functions. For example, they can be realized as dedicated hardware that provides a predetermined function, as managers given a predetermined function by a computer program, as predetermined functions realized in each manager by a computer program, or as any combination of these.

The various components of the present invention also need not exist independently of one another. A plurality of components may be formed as a single member, a single component may be formed of a plurality of members, one component may be part of another component, and part of one component may overlap part of another component.

Although the data processing method of the present invention describes a plurality of steps in order, the order of description does not limit the order in which the steps are executed. When the data processing method of the present invention is carried out, the order of the steps can therefore be changed to the extent that this causes no problem in substance.

The steps of the data processing method of the present invention are also not limited to being executed at mutually different times. Another step may therefore occur while one step is being executed, and the execution timing of one step may partly or wholly overlap that of another step.

The monitoring manager and the general manager of the present invention can be implemented as hardware built from general-purpose devices such as a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), and I/F (Interface) unit so that they can read a computer program and execute the corresponding data processing, as dedicated logic circuits built to execute predetermined data processing, or as a combination of these.

According to the present invention, even when the processing of a plurality of nodes is monitored by a plurality of monitoring managers, the processing time required for failover when a node fails is shortened while the load on the network is reduced.

A diagram illustrating the configuration of the node monitoring system of the present embodiment.
A schematic block diagram showing the logical structure of the monitoring manager shown in FIG. 1.
A schematic block diagram showing the logical structure of the general manager shown in FIG. 1.
A schematic block diagram showing the logical structure of the monitoring manager shown in FIG. 1.
An example of the data structure held in the node information holding unit shown in FIG. 4.
A diagram illustrating the configuration of the node monitoring system according to the embodiment.
A flowchart illustrating the data processing method of the monitoring manager of the present embodiment.
A flowchart illustrating the data processing method of the monitoring manager of the present embodiment.
A diagram illustrating the configuration of the node monitoring system of the second embodiment.
An example of the data structure stored in the received information storage unit.
A flowchart illustrating the data processing method in the present embodiment.

Embodiments of the present invention are described below with reference to the drawings. In all the drawings, like components are given like reference numerals, and their description is omitted where appropriate.

(First embodiment)
FIG. 1 is a diagram illustrating the configuration of the node monitoring system of the present embodiment.

As shown in FIG. 1, in the node monitoring system of the present embodiment, a monitoring manager 1 that monitors processing nodes 2 and 3, a monitoring manager 4 that monitors a processing node 5, and a general manager 6 are connected via a network 1000.

The network 1000 may be anything that can mediate data communication between the monitoring managers 1 and 4 and the general manager 6, and may be wired, wireless, or a combination of the two.

FIG. 2 is a schematic block diagram showing the logical structure of the monitoring manager 4 shown in FIG. 1.

As shown in FIG. 2, the monitoring manager 4 shown in FIG. 1 includes: an accepting unit 401 that accepts, from the processing node 5 whose data processing it monitors, load information indicating the load involved in executing the data processing, together with a node identifier that identifies the processing node 5; a determination unit 405 that determines whether the accepted load information is equal to or greater than a predetermined threshold; an information communication unit 407 that, when the determination unit 405 determines that the accepted load information is less than the threshold, transmits the load information determined to be less than the threshold, in association with the node identifier accepted together with it by the accepting unit 401, to the general manager 6, which is connected to the monitoring managers 1 and 4 via the network 1000; a node failure detection unit 409 that detects a failure of the processing node 5; a node control unit 411 that controls the execution of data processing in the node; and a node information holding unit 403. The node control unit 411 is identified by a control identifier. The information communication unit 407 may therefore transmit the accepted load information in association with the control identifier of the node control unit 411 that controls the corresponding processing node 5.

When the determination unit 405 determines that the load information accepted from the processing node 5 is equal to or greater than the predetermined threshold, the monitoring manager 4 ends the processing.

Load information is information indicating the load placed on hardware when the processing nodes 2, 3, and 5 execute data processing. Specifically, it indicates the load on the CPU, memory, disk capacity, and so on. The CPU load can be calculated in various ways; for example, it can be obtained from the CPU usage rate (Processor/% Processor Time) and the CPU idle rate (Processor/% Idle Time). Each node itself has the function of obtaining its load information.
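The description does not give a formula for the CPU load. As one possible reading, the load over a sampling interval can be taken as the share of busy time in the total of busy and idle time. The function name and its parameters below are assumptions made for illustration only.

```python
def cpu_load_percent(busy_delta: float, idle_delta: float) -> float:
    """Return the CPU load as a percentage of a sampling interval.

    busy_delta and idle_delta are the busy and idle times accumulated
    during the interval, in the same unit (for example, seconds).
    """
    if busy_delta < 0 or idle_delta < 0:
        raise ValueError("time deltas must be non-negative")
    total = busy_delta + idle_delta
    if total == 0:
        raise ValueError("sampling interval must be non-empty")
    return 100.0 * busy_delta / total


# Hypothetical sample: 3 s busy and 1 s idle during a 4 s interval.
load = cpu_load_percent(3.0, 1.0)
```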

FIG. 3 is a schematic block diagram showing the logical structure of the general manager 6 shown in FIG. 1.

As shown in FIG. 3, the general manager 6 shown in FIG. 1 includes: a receiving unit 601 that receives, from the monitoring manager 4 connected via the network 1000, load information indicating the load of the processing node 5 monitored by the monitoring manager 4, in association with a node identifier that identifies the processing node 5; a received information storage unit 603 that stores the load information and the node identifier received by the receiving unit 601; a request accepting unit 605 that accepts, from the monitoring manager 1 connected via the network 1000, a request to determine whether there is a node whose load information satisfies a predetermined threshold; a search unit 607 that, in response to the request accepted by the request accepting unit 605, compares the load information stored in the received information storage unit 603 with the predetermined threshold; and a response communication unit 609 that, when there is load information that satisfies the predetermined threshold, transmits the node identifier corresponding to that load information to the monitoring manager 1.

The receiving unit 601 receives load information indicating the load of the processing node 2 from the monitoring manager 1, in association with a node identifier that identifies the processing node 2. The receiving unit 601 likewise receives load information indicating the load of the processing node 3 from the monitoring manager 1, in association with a node identifier that identifies the processing node 3. Further, the receiving unit 601 receives load information indicating the load of the processing node 5 from the monitoring manager 4, in association with a node identifier that identifies the processing node 5.

The received information storage unit 603 stores the load information and the node identifier received by the receiving unit 601. Its data structure is the same as that of the node information holding unit 103 and the node information holding unit 403. The received information storage unit 603 can therefore also have the data structure shown in FIG. 5, described later.

The request accepting unit 605 accepts from the monitoring manager 1 a predetermined threshold together with a request to determine whether there is a node whose load information satisfies that threshold. The request accepting unit 605 may also accept from the monitoring manager 4 a predetermined threshold together with a request to determine whether there is a node whose load information satisfies that threshold.

In response to the request accepted by the request accepting unit 605, the search unit 607 refers to the received information storage unit 603 and compares the stored load information with the accepted threshold.

When the search unit 607 extracts load information that satisfies the predetermined threshold, the response communication unit 609 transmits the node identifier corresponding to that load information to the monitoring manager that issued the request.

FIG. 4 is a schematic block diagram showing the logical structure of the monitoring manager 1 shown in FIG. 1.

As shown in FIG. 4, the monitoring manager 1 shown in FIG. 1 consists of an accepting unit 101, a node information holding unit 103, a determination unit 105, an information communication unit 107, a node failure detection unit 109, and a node control unit 111.

The accepting unit 101 accepts the load information of the processing node 2 from the processing node 2, together with a node identifier that identifies the processing node 2. The accepting unit 101 likewise accepts the load information of the processing node 3 from the processing node 3, together with a node identifier that identifies the processing node 3. A node identifier is information that individually identifies the processing nodes 2 and 3. The data processing of the processing nodes 2 and 3 is monitored by the monitoring manager 1. Load information is the amount of computer resources consumed in a processing node as a result of executing data processing.

The node information holding unit 103 holds the accepted load information in association with the node identifier.

FIG. 5 shows an example of the data structure held in the node information holding unit 103 shown in FIG. 4.

In FIG. 5, the "processing node name" is an example of a node identifier.

The node control unit 111 controls the processing of the processing nodes 2 and 3. In accordance with external commands, it performs start and stop control of the processing nodes 2 and 3. As shown in FIG. 5, the node information holding unit 103 identifies each node control unit 111 and holds it in association with the processing node and its load information. In FIG. 5, the "processing node control means name" serves as the control identifier.
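One way to picture the record described for FIG. 5 is a small table keyed by processing node name. The field names, class names, and sample values below are hypothetical; the source only states that a processing node name, a processing node control means name, and load information are held in association with one another.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class NodeRecord:
    node_name: str           # "processing node name" (node identifier)
    control_means_name: str  # "processing node control means name" (control identifier)
    load_percent: float      # load information, assumed to be a percentage


class NodeInfoStore:
    """Illustrative stand-in for a node information holding unit."""

    def __init__(self) -> None:
        self._records: Dict[str, NodeRecord] = {}

    def upsert(self, record: NodeRecord) -> None:
        """Insert the record, or overwrite the previous one for the same node."""
        self._records[record.node_name] = record

    def get(self, node_name: str) -> NodeRecord:
        return self._records[node_name]


# Hypothetical usage with made-up names and values.
store = NodeInfoStore()
store.upsert(NodeRecord("node-2", "controller-1", 80.0))
store.upsert(NodeRecord("node-2", "controller-1", 65.0))  # newer sample replaces older
```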

The determination unit 105 determines whether the load information accepted from the processing nodes 2 and 3 is equal to or greater than a predetermined threshold. The threshold can be set either by using a value fixed in advance or by specifying it dynamically; in the present embodiment, a fixed value is assumed to have been determined in advance. A processing node used for failover must satisfy the condition that its load information is smaller than the threshold.

When the accepted load information is determined to be less than the threshold, the information communication unit 107 transmits the load information to the general manager 6 in association with the node identifier.

The node failure detection unit 109 detects failures of the processing nodes 2 and 3.

The units of the monitoring manager 1 shown in FIG. 4 correspond to those of the monitoring manager 4 shown in FIG. 2 as follows: the accepting unit 101 to the accepting unit 401, the node information holding unit 103 to the node information holding unit 403, the determination unit 105 to the determination unit 405, the information communication unit 107 to the information communication unit 407, the node failure detection unit 109 to the node failure detection unit 409, and the node control unit 111 to the node control unit 411.

ノード故障検知部１０９が監視する処理ノード２の故障を検知した場合、判断部１０５は、保持された負荷情報と、記憶された閾値とを比較する。ノード情報保持部４０３には、処理ノード２、３の負荷情報が保持されており、たとえば、処理ノード２の負荷情報は８０％、処理ノード３の処理情報は７０％とする。閾値を５０％とすると、保持されたすべての処理ノードの負荷情報が閾値以上と判断される。このとき、情報通信部１０７が、所定の閾値（５０％）を送信するとともに、送信する所定の閾値（５０％）を満たす負荷情報を有するノードがあるか否かの判断要求を統括マネージャ６に送信する。 When the failure of the processing node 2 monitored by the node failure detection unit 109 is detected, the determination unit 105 compares the stored load information with the stored threshold value. The node information holding unit 403 holds the load information of the processing nodes 2 and 3. For example, the load information of the processing node 2 is 80% and the processing information of the processing node 3 is 70%. If the threshold value is 50%, it is determined that the load information of all stored processing nodes is greater than or equal to the threshold value. At this time, the information communication unit 107 transmits a predetermined threshold value (50%) and sends a determination request to the general manager 6 as to whether there is a node having load information that satisfies the predetermined threshold value (50%) to be transmitted. Send.

一方、保持された処理ノード３の負荷情報が閾値未満と判断された場合、ノード制御部１１１は処理ノード２で実行されていたデータ処理を処理ノード３に実行させる。 On the other hand, when it is determined that the stored load information of the processing node 3 is less than the threshold value, the node control unit 111 causes the processing node 3 to execute the data processing that has been executed in the processing node 2.

統括マネージャ６の受信情報記憶部６０３は、監視マネージャ４から受け付けた負荷情報をノード識別子と対応づけて保持する。応答通信部６０９は、要求に応じて、保持された負荷情報と、所定の閾値とを比較して、所定の閾値を満たす負荷情報があった場合、対応するノード識別子を監視マネージャ１に送信する。 The reception information storage unit 603 of the overall manager 6 holds the load information received from the monitoring manager 4 in association with the node identifier. In response to the request, the response communication unit 609 compares the stored load information with a predetermined threshold, and when there is load information that satisfies the predetermined threshold, transmits a corresponding node identifier to the monitoring manager 1. .

監視マネージャ１の受付部１０１が、統括マネージャ６からノード識別子を受信すると、ノード制御部１１１は、ノード故障検知部１０９による命令によって、ノード故障検知部１０９にて故障を検出した処理ノード２のデータ処理を、受け付けたノード識別子に対応する処理ノード５に実行させる。 When the reception unit 101 of the monitoring manager 1 receives the node identifier from the general manager 6, the node control unit 111, on an instruction from the node failure detection unit 109, causes the processing node 5 corresponding to the received node identifier to execute the data processing of processing node 2, whose failure the node failure detection unit 109 detected.

監視マネージャ１は、ノード故障検知部１０９が一のノードの故障を検知した場合、判断部１０５は、保持された負荷情報と、閾値とを比較する。保持されたすべての負荷情報が閾値以上と判断された場合、情報通信部１０７は、閾値を送信して、統括マネージャ６に閾値を満たす他のノードの負荷情報を問い合わせる。なお、閾値がシステム全体で固定であれば閾値自体は送信しなくてよい。 In the monitoring manager 1, when the node failure detection unit 109 detects a failure of one node, the determination unit 105 compares the held load information with the threshold. If all held load information is judged to be at or above the threshold, the information communication unit 107 transmits the threshold and asks the general manager 6 for the load information of other nodes that satisfy it. If the threshold is fixed across the whole system, the threshold itself need not be transmitted.

一方、保持されたいずれかの負荷情報が閾値未満であると判断された場合、閾値未満と判断された他のノードの閾値に、故障を検知した一のノードで実行されているデータ処理を実行させる。 On the other hand, if any of the held load information is judged to be below the threshold, the other node judged to be below the threshold is made to execute the data processing that was running on the node whose failure was detected.

上述のような監視マネージャの各部は、必要により各種のハードウェアを利用して実現される。しかし、監視マネージャが実装されているコンピュータプログラムに対応して機能することにより実現されている。 Each unit of the monitoring manager described above is realized using various kinds of hardware as necessary. In this embodiment, however, the units are realized by the monitoring manager functioning in accordance with the computer program installed in it.

このようなコンピュータプログラムは、例えば、データ処理を実行するノードからデータ処理の実行にかかるノードの負荷を示す負荷情報を、ノードを識別するノード識別子とともに受け付ける受付処理、ノードから受け付けた負荷情報が所定の閾値以上か否かを判断する判断処理、受け付けた負荷情報が閾値未満であると判断された場合、ネットワークを介して複数の監視マネージャと接続している統括マネージャに、負荷情報をノード識別子と対応づけて送信する情報通信処理、等の処理動作をＣＰＵ等に実行させるためのソフトウェアとしてＲＡＭ等の情報記憶媒体に格納されている。 Such a computer program is stored in an information storage medium such as a RAM as software for causing a CPU or the like to execute processing operations such as: a reception process that receives, from a node executing data processing, load information indicating the load on the node due to that data processing, together with a node identifier identifying the node; a determination process that determines whether the load information received from the node is at or above a predetermined threshold; and an information communication process that, when the received load information is determined to be below the threshold, transmits the load information in association with the node identifier to a general manager connected to a plurality of monitoring managers via a network.

また、上述のような統括マネージャの各部は、必要により各種のハードウェアを利用して実現される。しかし、統括マネージャが実装されているコンピュータプログラムに対応して機能することにより実現されている。 Likewise, each unit of the general manager described above is realized using various kinds of hardware as necessary. In this embodiment, however, the units are realized by the general manager functioning in accordance with the computer program installed in it.

このようなコンピュータプログラムは、例えば、ネットワークを介して接続している第一および第二の監視マネージャが監視しているノードの負荷を示す負荷情報を、第一の監視マネージャからノードを識別するノード識別子と対応づけて受信する受信処理、受信した負荷情報とノード識別子とを記憶する受信情報記憶処理、第二の監視マネージャから、所定の閾値を満たす負荷情報を有するノードがあるか否かの判断要求を受け付ける要求受付処理、要求に応じて、記憶された負荷情報と、所定の閾値とを比較する検索処理、所定の閾値を満たす負荷情報があった場合、その負荷情報に対応するノード識別子を第二の監視マネージャに送信する応答通信処理、等の処理動作をＣＰＵ等に実行させるためのソフトウェアとしてＲＡＭ等の情報記憶媒体に格納されている。 Such a computer program is stored in an information storage medium such as a RAM as software for causing a CPU or the like to execute processing operations such as: a reception process that receives, from the first monitoring manager, load information indicating the load of a node monitored by the first and second monitoring managers connected via a network, in association with a node identifier identifying the node; a reception information storage process that stores the received load information and node identifier; a request acceptance process that accepts, from the second monitoring manager, a request to determine whether any node has load information satisfying a predetermined threshold; a search process that, in response to the request, compares the stored load information with the predetermined threshold; and a response communication process that, when load information satisfying the predetermined threshold exists, transmits the node identifier corresponding to that load information to the second monitoring manager.

以下、本実施形態のノード監視システムについてより詳細に説明する。 Hereinafter, the node monitoring system of this embodiment will be described in more detail.

図６は、実施の形態に係るノード監視システムの構成を説明する図である。 FIG. 6 is a diagram illustrating the configuration of the node monitoring system according to the embodiment.

図６を参照すると、監視マネージャ１と、監視マネージャ１の監視対象である処理ノード２，３と、監視マネージャ１と同じ構成である監視マネージャ４と、監視マネージャ４の監視対象である処理ノード５と、統括マネージャ６とから構成される。監視マネージャ１は処理ノード制御手段１１（ノード制御部１１１に対応）とノード情報保存手段１２（ノード情報保持部１０３に対応）とノード故障検知手段１３（ノード故障検知部１０９に対応）を含む。監視マネージャ４は処理ノード制御手段４１（ノード制御部４１１に対応）とノード情報保存手段４２（ノード情報保持部４０３に対応）とノード故障検知手段４３（ノード故障検知部４０９に対応）とを含む。統括マネージャ６はノード情報保存手段６１（受信情報記憶部６０３に対応）を含む。 Referring to FIG. 6, the node monitoring system comprises the monitoring manager 1; processing nodes 2 and 3, which the monitoring manager 1 monitors; the monitoring manager 4, which has the same configuration as the monitoring manager 1; processing node 5, which the monitoring manager 4 monitors; and the general manager 6. The monitoring manager 1 includes processing node control means 11 (corresponding to the node control unit 111), node information storage means 12 (corresponding to the node information holding unit 103), and node failure detection means 13 (corresponding to the node failure detection unit 109). The monitoring manager 4 includes processing node control means 41 (corresponding to the node control unit 411), node information storage means 42 (corresponding to the node information holding unit 403), and node failure detection means 43 (corresponding to the node failure detection unit 409). The general manager 6 includes node information storage means 61 (corresponding to the reception information storage unit 603).

これらの手段はそれぞれ概略次のように動作する。 Each of these means generally operates as follows.

処理ノード２と処理ノード３と処理ノード５はそれぞれを制御するノード制御部１１１，４１１により決められた処理を実行する。 The processing node 2, the processing node 3, and the processing node 5 execute processing determined by the node control units 111 and 411 that control them.

ノード制御部１１１は、外部からの命令に従い処理ノード２，３の起動終了制御を行う。 The node control unit 111 performs start / end control of the processing nodes 2 and 3 in accordance with an external command.

ノード情報保持部１０３は、定期的または任意のタイミングで処理ノード２，３の負荷情報を取得して保存する。各処理ノードの負荷が閾値よりも小さい場合はノード情報保存手段６１に同一の負荷情報を送信する。 The node information holding unit 103 acquires and stores the load information of the processing nodes 2 and 3 periodically or at an arbitrary timing. When the load of each processing node is smaller than the threshold value, the same load information is transmitted to the node information storage unit 61.

また、ノード情報保持部１０３は、ノード故障検知部１０９からの問い合わせに従い、負荷が閾値よりも小さい処理ノードが存在する場合はその処理ノードの情報を返す。 Further, in response to the inquiry from the node failure detection unit 109, the node information holding unit 103 returns information on the processing node when there is a processing node having a load smaller than the threshold value.

ノード故障検知部１０９は、処理ノード２，３を監視して、どちらかの処理ノードに故障が発生した場合にノード情報保持部１０３に問い合わせを行う。ノード情報保持部１０３に負荷が閾値よりも小さい処理ノードの情報が存在する場合には、その処理ノードで故障が発生した処理ノードで実行していた処理を続行するようにノード制御部１１１に命令する。ノード故障検知部１０９は、ノード情報保持部１０３に問い合わせた結果、負荷が閾値よりも小さい処理ノードが存在しない場合に、受信情報記憶部６０３に問い合わせを行う。負荷が閾値よりも小さい処理ノードが存在する場合は、その処理ノードを監視している監視マネージャに含まれる処理ノード制御手段に対して、その処理ノードで故障が発生した処理ノードで実行していた処理を続行するように命令する。 The node failure detection unit 109 monitors processing nodes 2 and 3 and, when a failure occurs in either of them, queries the node information holding unit 103. If the node information holding unit 103 holds information on a processing node whose load is below the threshold, the node failure detection unit 109 instructs the node control unit 111 to continue, on that processing node, the processing that was running on the failed processing node. If the query shows that no processing node has a load below the threshold, the node failure detection unit 109 queries the reception information storage unit 603. If that query finds a processing node whose load is below the threshold, the node failure detection unit 109 instructs the processing node control means of the monitoring manager that monitors that processing node to continue, on that processing node, the processing that was running on the failed processing node.

監視マネージャ４と、監視マネージャ４に含まれるノード制御部４１１と、ノード情報保持部４０３と、ノード故障検知部４０９とは、それぞれ監視マネージャ１と、ノード制御部１１１と、ノード情報保持部１０３と、ノード故障検知部１０９と同じ動作をする。 The monitoring manager 4 and its node control unit 411, node information holding unit 403, and node failure detection unit 409 operate in the same way as the monitoring manager 1 and its node control unit 111, node information holding unit 103, and node failure detection unit 109, respectively.

受信情報記憶部６０３は、各監視マネージャ内に含まれるノード情報保持部１０３，４０３から送信された処理ノードの負荷情報を保存し、各監視マネージャ内のノード故障検知部１０９，４０９から問い合わせがあった場合に、負荷が閾値よりも低い処理ノードの情報を問い合わせ元のノード故障検知部１０９，４０９に送信する。 The reception information storage unit 603 stores the processing node load information transmitted from the node information holding units 103 and 403 included in each monitoring manager, and receives an inquiry from the node failure detection units 109 and 409 in each monitoring manager. In this case, the information of the processing node whose load is lower than the threshold is transmitted to the node failure detection units 109 and 409 that are the inquiry sources.

次に、図７及び図８のフローチャートを参照して本実施の形態のデータ処理方法について詳細に説明する。 Next, the data processing method of this embodiment will be described in detail with reference to the flowcharts of FIGS.

図７のフローチャートでは、処理ノード５の負荷情報を統括マネージャ６に含まれるノード情報保存手段６１に通知するまでの処理を表している。図８のフローチャートでは、処理ノード２に故障が発生した場合にフェイルオーバーして処理を続行させるまでの処理ノードを特定するまでの処理を表している。 The flowchart of FIG. 7 shows the processing up to notifying the node information storage means 61 in the general manager 6 of the load information of processing node 5. The flowchart of FIG. 8 shows the processing, when a failure occurs in processing node 2, up to identifying the processing node that will take over and continue the processing by failover.

図７は、本実施形態の監視マネージャ４のデータ処理方法について説明するフローチャートである。 FIG. 7 is a flowchart for explaining the data processing method of the monitoring manager 4 of this embodiment.

処理ノード５は、ノード情報保持部４０３に対して処理ノード５の負荷情報を送信する（ステップＳ１）。次に、ノード情報保持部４０３は、処理ノード５の負荷情報を内部に保存する（ステップＳ２）。さらに、ノード情報保持部４０３は処理ノード５の負荷情報が閾値よりも小さいか否かを判断する（ステップＳ３）。処理ノード５の負荷情報が閾値以上の場合（ステップＳ３のＮｏ）には処理を終了する（ステップＳ６）。 The processing node 5 transmits the load information of the processing node 5 to the node information holding unit 403 (step S1). Next, the node information holding unit 403 stores therein the load information of the processing node 5 (Step S2). Further, the node information holding unit 403 determines whether or not the load information of the processing node 5 is smaller than the threshold value (step S3). If the load information of the processing node 5 is greater than or equal to the threshold value (No in step S3), the process ends (step S6).

一方、処理ノード５の負荷情報が閾値未満の場合（ステップＳ３のＹｅｓ）には、ノード情報保持部４０３は受信情報記憶部６０３に対して処理ノード５の負荷情報を送信する（ステップＳ４）。 On the other hand, when the load information of the processing node 5 is less than the threshold value (Yes in step S3), the node information holding unit 403 transmits the load information of the processing node 5 to the reception information storage unit 603 (step S4).

送信する負荷情報は、ノード名と、処理ノード制御手段名と、負荷情報とから構成される。 The load information to be transmitted includes a node name, a processing node control means name, and load information.

処理ノード５の負荷情報を受け取った受信情報記憶部６０３は、処理ノード５の負荷情報を内部に保存する（ステップＳ５）。 The reception information storage unit 603 that has received the load information of the processing node 5 stores the load information of the processing node 5 therein (step S5).
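Steps S1 to S6 of FIG. 7 can be sketched as a single handler on the monitoring manager side. This is a hedged illustration only: the function name `handle_load_report`, the use of plain dictionaries for the node information holding unit 403 and the reception information storage unit 603, and the string identifiers are assumptions made for the example.

```python
from typing import Dict


def handle_load_report(
    node_name: str,
    controller_name: str,
    load: float,
    threshold: float,
    local_store: Dict[str, float],
    general_store: Dict[str, dict],
) -> bool:
    """Process one load report; return True if it was forwarded upward."""
    # S2: the monitoring manager always keeps the report locally.
    local_store[node_name] = load
    # S3: only loads strictly below the threshold are forwarded.
    if load >= threshold:
        return False  # S6: end without contacting the general manager
    # S4/S5: forward node name, control means name, and load; the general
    # manager stores the record keyed by node name.
    general_store[node_name] = {"controller": controller_name, "load": load}
    return True
```

Note that this section does not describe how an entry is removed from the general manager once a node's load later rises to or above the threshold, so the sketch leaves such entries in place.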

図８は、本実施形態の監視マネージャ１のデータ処理方法について説明するフローチャートである。 FIG. 8 is a flowchart for explaining the data processing method of the monitoring manager 1 of this embodiment.

処理ノード２に故障が発生すると（ステップＳ７）、ノード故障検知部１０９は、処理ノード２の故障を検知する（ステップＳ８）。ノード故障検知部１０９は、判断部１０５を介してフェイルオーバーにより処理を続行させるために負荷が閾値未満の処理ノードが存在するか否かをノード情報保持部１０３に問い合わせる（ステップＳ９）。判断部１０５は、負荷が閾値よりも小さい処理ノードが存在するかどうかを判断する（ステップＳ１０）。負荷情報が閾値よりも小さい処理ノード３が存在する場合（ステップＳ１０のＹｅｓ）、判断部１０５は、ノード故障検知部１０９に負荷が閾値よりも小さい処理ノード３の存在を通知する（ステップＳ１６）。ノード故障検知部１０９は、処理ノード２で実行していた処理を処理ノード３で続行させるようにノード制御部１１１に命令する（ステップＳ１７）。一方、ノード情報保持部１０３の中に負荷情報が閾値よりも小さい処理ノードの負荷情報が存在しない場合（ステップＳ１０のＮｏ）、判断部１０５は、情報通信部１０７を介して受信情報記憶部６０３に負荷情報が閾値より小さい処理ノードが存在するか否かを問い合わせる（ステップＳ１１）。受信情報記憶部６０３に負荷情報が閾値未満の処理ノードの負荷情報が存在しない場合（ステップＳ１２のＮｏ）、負荷情報が閾値よりも小さい処理ノードを利用したフェイルオーバーをあきらめる（ステップＳ１５）。受信情報記憶部６０３に負荷情報が閾値よりも小さい処理ノード５の負荷情報が存在する場合（ステップＳ１２のＹｅｓ）、検索部６０７は、受信情報記憶部６０３から負荷情報が閾値よりも小さい処理ノード５のノード識別子と処理ノード５を制御する処理ノード制御手段４１とを抽出し、応答通信部６０９から受付部１０１に処理ノード５の存在を通知する（ステップＳ１３）。最後に、ノード故障検知部１０９は処理ノード２で実行していた処理を処理ノード５で続行するようにノード制御部４１１に対して命令する（ステップＳ１４）。 When a failure occurs in processing node 2 (step S7), the node failure detection unit 109 detects it (step S8). So that processing can continue by failover, the node failure detection unit 109 asks the node information holding unit 103, via the determination unit 105, whether any processing node has a load below the threshold (step S9). The determination unit 105 determines whether such a processing node exists (step S10). If processing node 3 has a load below the threshold (Yes in step S10), the determination unit 105 notifies the node failure detection unit 109 of processing node 3 (step S16), and the node failure detection unit 109 instructs the node control unit 111 to continue on processing node 3 the processing that was running on processing node 2 (step S17). If, on the other hand, the node information holding unit 103 holds no load information below the threshold (No in step S10), the determination unit 105 asks the reception information storage unit 603, via the information communication unit 107, whether any processing node has a load below the threshold (step S11). If the reception information storage unit 603 holds no such load information (No in step S12), failover using a processing node with a load below the threshold is abandoned (step S15). If the reception information storage unit 603 holds load information for processing node 5 that is below the threshold (Yes in step S12), the search unit 607 extracts from the reception information storage unit 603 the node identifier of processing node 5 and the processing node control means 41 that controls it, and the response communication unit 609 notifies the reception unit 101 of processing node 5 (step S13). Finally, the node failure detection unit 109 instructs the node control unit 411 to continue on processing node 5 the processing that was running on processing node 2 (step S14).
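The two-stage lookup of FIG. 8 (local check first, general manager second) can be sketched as below. The function name `choose_failover_node` and the callback `ask_general_manager` are hypothetical; the callback stands in for the request and response exchanged between the information communication unit 107 and the response communication unit 609.

```python
from typing import Callable, Dict, Optional


def choose_failover_node(
    failed_node: str,
    local_loads: Dict[str, float],
    threshold: float,
    ask_general_manager: Callable[[float], Optional[str]],
) -> Optional[str]:
    """Return the node that should take over, or None if failover is abandoned."""
    # S9/S10: first inquiry, against the load information held locally.
    for node_id, load in local_loads.items():
        if node_id != failed_node and load < threshold:
            return node_id  # S16/S17: fail over within the same manager
    # S11/S12: second and last inquiry, against the general manager.
    # A node identifier means S13/S14; None means S15 (give up).
    return ask_general_manager(threshold)
```

With the loads used earlier in the description (processing node 2 at 80%, processing node 3 at 70%, threshold 50%), the local inquiry finds nothing and the result is whatever the general manager returns, so at most two inquiries are made.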

次に、本実施形態のデータ処理方法の動作をさらに具体的に説明する。 Next, the operation of the data processing method of this embodiment will be described more specifically.

図１に示すように、本実施例では監視マネージャ１と、監視マネージャ４と、統括マネージャ６がネットワーク１０００により結合しており、監視マネージャ１は処理ノード２と処理ノード３を監視し、監視マネージャ４は処理ノード５を監視している。 As shown in FIG. 1, in this embodiment, a monitoring manager 1, a monitoring manager 4, and an overall manager 6 are connected by a network 1000. The monitoring manager 1 monitors the processing nodes 2 and 3, and the monitoring manager 4 monitors the processing node 5.

監視マネージャ１，４と統括マネージャ６とはネットワークで接続されたコンピュータであり、処理ノード２，３，５はそれぞれ任意のプログラムをノード制御部１１１またはノード制御部４１１の命令に従って実行することができる。 The monitoring managers 1 and 4 and the general manager 6 are computers connected by a network, and the processing nodes 2, 3, and 5 can execute arbitrary programs according to instructions of the node control unit 111 or the node control unit 411, respectively. .

処理の流れは負荷情報を収集する処理と処理ノードに故障が発生した場合の処理に分けられる。まず初めに負荷情報を収集する処理について説明する。 The process flow is divided into a process for collecting load information and a process when a failure occurs in a processing node. First, processing for collecting load information will be described.

ノード情報保持部１０３には定期的に処理ノード２および処理ノード３から各処理ノードの負荷情報が送信される。負荷情報の送信には各処理ノード内で動作するエージェント機能から一定の間隔で送信される場合や、一定の間隔で各処理ノードに対してノード情報保持部１０３から問い合わせを行う場合がある。同様にノード情報保持部４０３には定期的に処理ノード５の負荷情報が送信される。 Load information for each processing node is sent periodically from processing nodes 2 and 3 to the node information holding unit 103. The load information may be sent at fixed intervals by an agent function running in each processing node, or the node information holding unit 103 may query each processing node at fixed intervals. Similarly, the load information of processing node 5 is sent periodically to the node information holding unit 403.

ノード情報保持部１０３，４０３では、受信した監視対象のノードの負荷情報を内部に保存する。ノード情報保持部１０３、４０３の内部に保存する情報には、処理ノードを一意に認識するためのノード識別子と、処理ノードの制御を行う処理ノード制御手段名と、実際の処理ノードの負荷となる負荷情報が少なくとも含まれる。 The node information holding units 103 and 403 store the received load information of the monitored nodes internally. The stored information includes at least a node identifier that uniquely identifies the processing node, the name of the processing node control means that controls the processing node, and the load information giving the actual load of the processing node.

ノード情報保持部１０３，４０３はそれぞれ受信した負荷情報が何らかの手段により定められた閾値よりも大きいか小さいかを判断し、負荷情報が閾値よりも小さい場合はその処理ノードの情報を統括マネージャ６に含まれる受信情報記憶部６０３に送信する。閾値を定める手段にはあらかじめ固定の値を利用する手段と、動的に指定する手段とがありうる。例えば、閾値を５０％とした場合に、各処理ノードの負荷情報が図５に示したとおりだとすると、処理ノード５の情報のみが統括マネージャ６に送信される。 Each of the node information holding units 103 and 403 determines whether the received load information is larger or smaller than a threshold set by some means and, if it is smaller, sends the information on that processing node to the reception information storage unit 603 in the general manager 6. The threshold may be a fixed value set in advance or may be specified dynamically. For example, with a threshold of 50% and the load information of each processing node as shown in FIG. 5, only the information on processing node 5 is sent to the general manager 6.

受信部６０１は、受信した各処理ノードの負荷情報を受信情報記憶部６０３に内部に保存する。ここまでの処理により受信情報記憶部６０３は全ての監視マネージャが監視している全ての処理ノードのうち、負荷情報が閾値よりも小さい全ての処理ノードの情報を内部に保存することができ、各監視マネージャ１，４内のノード情報保持部１０３，４０３には各監視マネージャ１，４が監視対象としている処理ノードの内、負荷情報が閾値よりも小さい全ての処理ノードの負荷情報を保存することができる。 The reception unit 601 stores the received load information of each processing node in the reception information storage unit 603. As a result of the processing so far, the reception information storage unit 603 can hold the information on every processing node whose load information is below the threshold, out of all processing nodes monitored by all monitoring managers, and the node information holding units 103 and 403 in the monitoring managers 1 and 4 can hold the load information of every processing node whose load information is below the threshold, out of the processing nodes that each monitoring manager monitors.

つづいて処理ノード２に故障が発生した場合の処理を説明する。 Next, processing when a failure occurs in the processing node 2 will be described.

処理ノード２に故障が発生すると、処理ノード２を監視する監視マネージャ１に含まれるノード故障検知部１０９が故障を検知する。故障を検知するには、定期的に問い合わせを行い、一定時間以内に反応が無いことで故障を検知する方法と、処理ノード２から一定の間隔で生存信号をノード故障検知部１０９に対して送信し、ノード故障検知部１０９が前の生存信号を受信してから一定以上の時間を待っても次の生存信号を受信できないことで故障を検知する方法などがある。ノード故障検知部１０９は、処理ノード２の故障を検知すると、フェイルオーバーを行い処理ノード２で実行していた処理を続行するための、別の処理ノードを探す。 When a failure occurs in processing node 2, the node failure detection unit 109 in the monitoring manager 1, which monitors processing node 2, detects it. A failure can be detected, for example, by querying the node periodically and treating the absence of a response within a fixed time as a failure, or by having processing node 2 send a keep-alive signal to the node failure detection unit 109 at fixed intervals and treating a failure to receive the next signal within a fixed time after the previous one as a failure. On detecting the failure of processing node 2, the node failure detection unit 109 looks for another processing node to fail over to, so that the processing that was running on processing node 2 can continue.
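The second detection method above (declaring a failure when the next keep-alive signal does not arrive in time) can be sketched as follows. The class name `HeartbeatMonitor`, the method names, and the use of caller-supplied timestamps in seconds are assumptions for illustration; the description does not fix a timeout value or a clock source.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Hypothetical keep-alive tracker for a node failure detection unit."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout  # seconds allowed between keep-alive signals
        self._last_seen: Dict[str, float] = {}

    def record(self, node_id: str, now: float) -> None:
        """Note that a keep-alive signal from node_id arrived at time `now`."""
        self._last_seen[node_id] = now

    def failed_nodes(self, now: float) -> List[str]:
        """Return nodes whose last signal is older than the timeout."""
        return [
            node_id
            for node_id, seen in self._last_seen.items()
            if now - seen > self.timeout
        ]
```

Passing timestamps in explicitly keeps the sketch deterministic; a real deployment would more likely read a monotonic clock such as `time.monotonic()`.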

フェイルオーバーで利用する処理ノードには負荷情報が閾値よりも小さいという条件があり、まずはノード故障検知部１０９が含まれる監視マネージャ１内に存在するノード情報保持部１０３に対して負荷情報が閾値よりも小さい処理ノードが存在するか問い合わせる。 A processing node used for failover must have load information below the threshold. The node failure detection unit 109 therefore first asks the node information holding unit 103, which resides in the same monitoring manager 1, whether any processing node has load information below the threshold.

閾値が５０％であり、各処理ノードの負荷情報が図５に示したとおりの場合には、ノード情報保持部１０３には条件を満たす処理ノードの情報が含まれていないこととなる。そのため、フェイルオーバー可能な処理ノードが存在しないという情報が判断部１０５に送出される。 When the threshold value is 50% and the load information of each processing node is as shown in FIG. 5, the node information holding unit 103 does not include processing node information that satisfies the condition. Therefore, information indicating that there is no processing node that can be failed over is sent to the determination unit 105.

判断部１０５は、ノード情報保持部１０３にフェイルオーバー可能な処理ノードが存在しないことを知ると、続いて情報通信部１０７を介し、上位の統括マネージャ６に対して前出の条件を満たす処理ノードが存在するかを問い合わせる。 On learning that the node information holding unit 103 holds no processing node that can take over, the determination unit 105 then asks the higher-level general manager 6, via the information communication unit 107, whether any processing node satisfies the above condition.

すると、条件を満たす処理ノード５の負荷情報が受信情報記憶部６０３の中に存在するため、受信情報記憶部６０３は処理ノード５の情報と、処理ノード５を制御するノード制御部４１１の情報を応答通信部６０９を介して監視マネージャ１に送信する。 Because the load information of processing node 5, which satisfies the condition, exists in the reception information storage unit 603, the reception information storage unit 603 sends the information on processing node 5 and on the node control unit 411 that controls it to the monitoring manager 1 via the response communication unit 609.

受付部１０１が統括マネージャ６の応答通信部６０９から受信した前出の情報により判断部１０５は処理ノード５を利用してフェイルオーバーすることを決定する。判断部１０５は、ノード故障検知部１０９を介して処理ノード５を制御するノード制御部４１１に対して、処理ノード５を利用して処理ノード２で実行していた処理を続行するように命令を出す。ノード制御部４１１は受信した命令に従い、処理ノード５で指定された処理を実行させる。 Based on this information, which the reception unit 101 receives from the response communication unit 609 of the general manager 6, the determination unit 105 decides to fail over to processing node 5. Via the node failure detection unit 109, the determination unit 105 instructs the node control unit 411, which controls processing node 5, to continue on processing node 5 the processing that was running on processing node 2. The node control unit 411 follows the received instruction and has processing node 5 execute the specified processing.

以上の処理により処理を実行中の処理ノード２に故障が発生して、実行中の処理を続行できなくなった場合に、処理ノード５を利用してその処理を続行できるようになる。 Through the above processing, when a failure occurs in processing node 2 while it is executing processing and that processing can no longer continue there, the processing can be continued on processing node 5.

次に、本実施の形態の効果について説明する。 Next, the effect of this embodiment will be described.

本実施形態のノード監視システムによれば、データ処理を監視するノードから負荷情報をノード識別子とともに受け付け、受け付けた負荷情報が所定の閾値未満である場合、統括マネージャに記憶させる。これにより、ノードの負荷情報を監視し、負荷情報が閾値よりも小さいノードの情報のみを統括マネージャに管理させることができる。したがって、複数の監視マネージャで複数のノードの処理を監視する場合においても、ネットワークの負荷を低減させつつ、ノードの故障時のフェイルオーバーに必要な処理時間を軽減する。 According to the node monitoring system of this embodiment, load information is received together with a node identifier from a node whose data processing is monitored and, when the received load information is below a predetermined threshold, is stored in the general manager. The load information of the nodes is thus monitored, and only the information on nodes whose load information is below the threshold is managed by the general manager. Therefore, even when a plurality of monitoring managers monitor the processing of a plurality of nodes, the processing time required for failover on a node failure is reduced while the network load is also reduced.

本実施の形態では、監視マネージャ１のノード情報保持部１０３または監視マネージャ４のノード情報保持部４０３と、統括マネージャ６の受信情報記憶部６０３とで、階層的に構成されている。したがって、下位階層のノード情報保持部１０３，４０３で各監視マネージャ１，４が監視している処理ノードの負荷情報のみを管理し、上位階層の受信情報記憶部６０３でシステム全体の負荷情報のうち負荷情報が閾値よりも小さい処理ノードの負荷情報を管理することができる。よって、フェイルオーバー時の問い合わせ回数が最大２回までにしながら、処理を続行させる処理ノードを特定することができる。 In this embodiment, the node information holding unit 103 of the monitoring manager 1 or the node information holding unit 403 of the monitoring manager 4 and the reception information storage unit 603 of the overall manager 6 are hierarchically configured. Therefore, only the load information of the processing nodes monitored by the monitoring managers 1 and 4 is managed by the lower layer node information holding units 103 and 403, and the received information storage unit 603 of the upper layer includes the load information of the entire system. It is possible to manage the load information of the processing node whose load information is smaller than the threshold value. Therefore, it is possible to specify a processing node for continuing processing while the number of inquiries at the time of failover is up to two times.

（第２の実施形態） (Second Embodiment)
次に、本発明の第２の実施形態について図面を参照して詳細に説明する。 Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

図９は、第２の実施形態のノード監視システムの構成を説明する図である。 FIG. 9 is a diagram illustrating the configuration of the node monitoring system according to the second embodiment.

本発明の第２の発明を実施するための最良の形態は、監視マネージャ４の監視対象となる処理ノードに処理ノード７が追加されていることが第１の実施形態と異なり、その他の構成要素については第１の実施の形態と同様である。第１の実施の形態と同様の構成要素については図１と同一の符号を付し、詳細な説明を省略する。 The best mode for carrying out the second invention of the present invention is different from the first embodiment in that the processing node 7 is added to the processing node to be monitored by the monitoring manager 4, and other components. This is the same as in the first embodiment. Constituent elements similar to those in the first embodiment are denoted by the same reference numerals as those in FIG. 1, and detailed description thereof is omitted.

本実施形態において、ノード情報保存手段１２（図４のノード情報保持部１０３に対応）は、ノードの属性を示す属性情報としてノードグループ名と、ノードの識別子としてノード識別子とを対応づけて記憶する。情報通信部１０７は、受け付けた負荷情報と、対応する属性情報とを対応づけて送信する。 In the present embodiment, the node information storage unit 12 (corresponding to the node information holding unit 103 in FIG. 4) stores a node group name as an attribute information indicating a node attribute and a node identifier as a node identifier in association with each other. . The information communication unit 107 transmits the received load information and the corresponding attribute information in association with each other.

第２の実施の形態の全体の動作については、図８に示した負荷情報の構成にノードグループ名が追加されていることのみが第１の実施の形態と異なり、その他の動作内容については第１の実施の形態と同様である。第１の実施の形態と同様の動作については、図７のフローチャート、図８のフローチャートと同一の符号を付し、第１の実施の形態と同一の動作詳細な説明を省略する。 The overall operation of the second embodiment differs from the first embodiment only in that a node group name is added to the structure of the load information shown in FIG. 8; the rest of the operation is the same as in the first embodiment. Operations that are the same as in the first embodiment are given the same reference signs as in the flowcharts of FIG. 7 and FIG. 8, and their detailed description is omitted.

第２の実施の形態では各処理ノードの負荷情報をノード情報保存手段に保存する処理として、図７のフローチャートにおいて処理ノード５を処理ノード７で置き換えた処理が行われる。これにより、処理ノード５と処理ノード７の負荷情報が閾値よりも小さい場合、受信情報記憶部６０３には処理ノード５と処理ノード７の負荷情報が保存される。 In the second embodiment, as processing for storing the load information of each processing node in the node information storage unit, processing in which the processing node 5 is replaced with the processing node 7 in the flowchart of FIG. 7 is performed. Thereby, when the load information of the processing node 5 and the processing node 7 is smaller than the threshold value, the received information storage unit 603 stores the load information of the processing node 5 and the processing node 7.

ノードグループ名とは、１つの装置を動作させるための複数のノードのグループや同一の特性をもつノードのグループの名称を示す。 The node group name indicates a name of a group of a plurality of nodes for operating one device or a group of nodes having the same characteristics.

図１０は、受信情報記憶部６０３に保存されるデータ構造の一例である。 FIG. 10 is an example of a data structure stored in the reception information storage unit 603.

図１０に示すように、負荷情報は、ノード識別子であるノード名と、制御識別子である処理ノード制御手段名と、属性情報であるノードグループ名と、負荷情報から構成される。 As shown in FIG. 10, the load information includes a node name that is a node identifier, a processing node control means name that is a control identifier, a node group name that is attribute information, and load information.

図１１は、本実施形態におけるデータ処理方法を説明するフローチャートであり、処理ノード２に障害が発生した後の処理を示している。 FIG. 11 is a flowchart for explaining the data processing method in the present embodiment, and shows processing after a failure has occurred in the processing node 2.

図１１のフローチャートではステップＳ９’と、ステップＳ１１’と、ステップＳ１３’と、ステップＳ１６’が図８のフローチャートと異なり、そのほかのステップは図８に示した第１の実施の形態と同様である。 In the flowchart of FIG. 11, step S9 ′, step S11 ′, step S13 ′, and step S16 ′ are different from the flowchart of FIG. 8, and other steps are the same as those in the first embodiment shown in FIG. .

ステップＳ９’では、判断部１０５がノード情報保持部１０３に負荷情報が閾値よりも小さく、故障が発生した処理ノード２と同じノードグループ名である処理ノードの負荷情報が保存されているかを問い合わせる。 In step S9′, the determination unit 105 asks the node information holding unit 103 whether it holds load information for a processing node whose load is below the threshold and whose node group name is the same as that of the failed processing node 2.

処理ノード３の情報が負荷情報が閾値よりも小さく、処理ノード２と同じノードグループ名であるという条件を満たす場合（ステップＳ１０のＹｅｓ）、ノード情報保持部１０３は処理ノード３の存在と処理ノード３を制御するノード制御部１１１を判断部１０５に通知する（ステップＳ１６’）。 If processing node 3 satisfies the conditions that its load information is below the threshold and its node group name is the same as that of processing node 2 (Yes in step S10), the node information holding unit 103 notifies the determination unit 105 of processing node 3 and of the node control unit 111 that controls it (step S16′).

負荷情報が閾値よりも小さく、処理ノード２と同じノードグループ名であるという処理ノードの負荷情報がノード情報保持部１０３に保存されていない場合（ステップＳ１０’のＮｏ）、前記条件を満たす処理ノードが存在するかを、情報通信部１０７を介して統括マネージャ６に問い合わせる（ステップＳ１１’）。 When the load information of the processing node that the load information is smaller than the threshold and has the same node group name as the processing node 2 is not stored in the node information holding unit 103 (No in step S10 ′), the processing node that satisfies the above condition Is inquired to the general manager 6 via the information communication unit 107 (step S11 ′).

If the reception information storage unit 603 stores load information of a processing node 7 that satisfies the conditions, the monitoring manager 1 is notified of the existence of processing node 7 and of the node control unit 411 that controls processing node 7 (step S13'). The other processing is the same as in the first embodiment, so a detailed description is omitted.
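The selection logic of steps S9' to S13' can be sketched as a two-stage lookup: first among the records held locally by the monitoring manager, then among the records held by the general manager. This is an illustrative sketch under stated assumptions, not the patented implementation: `LoadRecord`, `find_candidate`, and `select_failover_target` are hypothetical names, and the network inquiry to the general manager is replaced by a plain in-memory list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class LoadRecord:
    node_name: str           # node identifier
    control_means_name: str  # control identifier
    node_group: str          # attribute information (node group name)
    load: float              # load value (unit assumed)


def find_candidate(records: Iterable[LoadRecord], failed: LoadRecord,
                   threshold: float) -> Optional[LoadRecord]:
    """Return a record whose load is below the threshold and whose node
    group matches the failed node, or None if no record qualifies."""
    for r in records:
        if (r.node_name != failed.node_name
                and r.load < threshold
                and r.node_group == failed.node_group):
            return r
    return None


def select_failover_target(local: Iterable[LoadRecord],
                           general: Iterable[LoadRecord],
                           failed: LoadRecord,
                           threshold: float) -> Tuple[Optional[LoadRecord], str]:
    """First inquiry: the monitoring manager's own records (step S9').
    Second inquiry: the general manager's records (step S11')."""
    hit = find_candidate(local, failed, threshold)
    if hit is not None:
        return hit, "local"
    hit = find_candidate(general, failed, threshold)
    if hit is not None:
        return hit, "general"
    return None, "none"


# Hypothetical data: node 2 has failed; no local node qualifies, node 7 does.
failed = LoadRecord("node2", "control111", "groupA", 0.0)
local = [LoadRecord("node3", "control111", "groupB", 0.2),   # wrong group
         LoadRecord("node4", "control111", "groupA", 0.9)]   # load too high
general = [LoadRecord("node7", "control411", "groupA", 0.3)]
target, source = select_failover_target(local, general, failed, threshold=0.5)
```

In this example `target` is the record for `node7` and `source` is `"general"`, mirroring the case in which the general manager reports processing node 7 together with its node control unit.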

Next, the effects of the second embodiment for carrying out the present invention will be described.

In the second embodiment, compared with the first embodiment, an element called the node group name is added to the load information as attribute information. Thus, in addition to the condition that the load information be smaller than the threshold, a further condition can be imposed, for example that a program performing specific processing is installed. This makes it possible to identify a processing node that can be used to continue processing after failover.

The embodiments of the present invention have been described above with reference to the drawings, but they are examples of the present invention, and various configurations other than the above may also be adopted.

For example, the following configurations are also applicable to the present invention.
(1) A monitoring manager that monitors load information of processing nodes, for a computer monitoring system in which the processing nodes are controlled by a general manager that receives the load information from the monitoring manager, the monitoring manager comprising:
a receiving unit that receives load information from a processing node; and
a transmitting unit that does not transmit the load information to the general manager when the load information is greater than a threshold.
(2) The monitoring manager according to (1), further comprising:
node failure detection means for detecting a failure when one occurs in a processing node; and
node information storage means for transmitting, in response to an inquiry, node information whose load information is smaller than a threshold.
(3) The monitoring manager according to (2), wherein, in the node information storage means, the load information handled includes:
an identifier for identifying a node;
an identifier for identifying the processing node control means; and
load information.
(4) The monitoring manager according to (3), wherein, in the node information storage means, the load information handled includes an identifier indicating whether failover is possible for each node or each program.
(5) A general manager comprising node information storage means that internally stores the load information of processing nodes notified from a subordinate monitoring manager and returns the corresponding information when it receives an inquiry about the existence of a processing node satisfying a specified condition.

The above configurations can be used as a system combining (1) to (4) with (5), and as the corresponding methods and programs.

According to the above configurations, a system and a method for performing state monitoring and failover in a distributed computing environment are provided. The above invention relates to a distributed computer system and, in particular, can provide a method of monitoring computers when a large number of computers are used.

With this configuration, a general manager is placed above a plurality of monitoring managers, and, from among the processing nodes handled by a monitoring manager, only the load information of processing nodes whose load is smaller than the threshold can also be stored in the monitoring manager. The steady-state network load can therefore be reduced. In addition, even when a failure actually occurs in a monitored node and failover becomes necessary, the processing node to be used for failover can be identified with at most two inquiries. The processing time required for failover is therefore also reduced.
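The reduction in steady-state traffic follows from the filtering rule of configuration (1): a monitoring manager keeps every report locally but forwards a report to the general manager only when the load is not greater than the threshold. The sketch below illustrates that rule together with the store-and-answer role of configuration (5). The class and method names are hypothetical, method calls stand in for network messages, and removal of stale entries when a node's load later rises is deliberately left out.

```python
from typing import Dict, Optional


class GeneralManager:
    """Stores forwarded load information and answers existence inquiries."""

    def __init__(self) -> None:
        self._loads: Dict[str, float] = {}

    def receive(self, node_name: str, load: float) -> None:
        self._loads[node_name] = load

    def find_node_below(self, threshold: float) -> Optional[str]:
        # Return the name of any stored node whose load is below the threshold.
        for name, load in self._loads.items():
            if load < threshold:
                return name
        return None


class MonitoringManager:
    """Keeps all reports locally; forwards only low-load reports upward."""

    def __init__(self, general: GeneralManager, threshold: float) -> None:
        self.general = general
        self.threshold = threshold
        self.local_loads: Dict[str, float] = {}

    def on_load_report(self, node_name: str, load: float) -> bool:
        """Record a report; return True if it was forwarded."""
        self.local_loads[node_name] = load
        if load > self.threshold:
            # Configuration (1): do not transmit when load exceeds the threshold.
            return False
        self.general.receive(node_name, load)
        return True


gm = GeneralManager()
mm = MonitoringManager(gm, threshold=0.5)
sent_busy = mm.on_load_report("busy-node", 0.8)   # kept locally, not forwarded
sent_idle = mm.on_load_report("idle-node", 0.2)   # forwarded to the general manager
```

Because heavily loaded nodes are never reported upward, the general manager holds only failover candidates, so a single inquiry to it is enough once the local lookup has failed.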

The present embodiment illustrated the case in which each unit of the monitoring manager and the general manager is logically realized as a function by a computer program. However, each of these units can also be formed as dedicated hardware, or realized as a combination of software and hardware.

In the above embodiments, the present-day Internet was given as an example of the network, but the network may also be an NGN (Next Generation Network), the next-generation Internet.

Naturally, the above-described embodiments and modifications can be combined to the extent that their contents do not conflict. In addition, although the structure of each unit has been described concretely in the above embodiments and modifications, these structures can be changed in various ways within a scope that satisfies the present invention.

The present invention has been described above with reference to examples, but it is not limited to those examples. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

This application claims priority based on Japanese Patent Application No. 2008-032041, filed on February 13, 2008, the entire disclosure of which is incorporated herein.

Claims

Receiving means for receiving, from a node executing data processing, load information indicating a load required to execute the data processing together with a node identifier for identifying the node;
Determining means for determining whether or not the load information received by the receiving means is equal to or greater than a predetermined threshold;
Information communication means for transmitting, when the determination means determines that the load information is less than the threshold, the load information so determined, in association with the node identifier received together with it by the receiving means, to a general manager connected to a plurality of monitoring managers via a network;
Node failure detection means for detecting a failure of the node;
When the node failure detection means detects a failure of one node, the determination means compares the stored load information with the threshold and determines whether all the stored load information is equal to or greater than the threshold;
A monitoring manager in which the information communication means, when the determination means determines that all the stored load information is equal to or greater than the threshold, inquires of the general manager about load information of another node that satisfies the threshold.

The monitoring manager according to claim 1,
Further comprising node control means, identified by a control identifier, for controlling execution of the data processing in the node;
When the node failure detection means detects a failure of one node, the determination means compares the stored load information with the stored threshold and determines whether any of the stored load information is less than the threshold;
A monitoring manager in which, when the determination means determines that any of the stored load information is less than the threshold, the node control means causes the other node determined to be less than the threshold to execute the data processing that was being executed by the one node whose failure was detected.

In the monitoring manager according to claim 1 or 2,
Further comprising node information holding means for storing attribute information indicating an attribute of the node and an identifier of the node in association with each other;
A monitoring manager in which the information communication means transmits the received load information in association with the corresponding attribute information.

Receiving means for receiving, from a first monitoring manager connected via a network, load information indicating the load of a node monitored by the first monitoring manager, in association with a node identifier for identifying the node;
Received information storage means for storing the load information and the node identifier received by the receiving means;
Request accepting means for accepting, from a second monitoring manager connected via the network, a determination request as to whether there is a node whose load information satisfies a predetermined threshold;
Search means for comparing, in response to the request accepted by the request accepting means, the load information stored in the received information storage means with the predetermined threshold;
A general manager comprising response communication means for transmitting the node identifier corresponding to the load information to the second monitoring manager when there is the load information satisfying the predetermined threshold.

A node monitoring system in which a monitoring manager that monitors a node and a general manager are connected via a network,
The monitoring manager
Receiving means for receiving, from a node executing data processing, load information indicating a load required to execute the data processing together with a node identifier for identifying the node;
Determining means for determining whether or not the load information received by the receiving means is equal to or greater than a predetermined threshold;
Information communication means for transmitting, when the determination means determines that the load information is less than the threshold, the load information so determined, in association with the node identifier received together with it by the receiving means, to a general manager connected to a plurality of monitoring managers via a network;
The general manager is
Receiving means for receiving the load information for each node;
Received information storage means for storing the load information received by the receiving means in association with a node identifier for identifying the node;
The monitoring manager comprises a first monitoring manager and a second monitoring manager,
The first monitoring manager is:
The information communication means transmits the load information together with the corresponding node identifier to the general manager,
The second monitoring manager is
A node failure detecting means for detecting a failure of the node;
When the node failure detection means detects a failure of a monitored node, the determination means compares the stored load information with a predetermined threshold and determines whether all the stored load information is equal to or greater than the threshold;
When the determination means determines that all the stored load information is equal to or greater than the threshold, the information communication means transmits a determination request as to whether there is a node whose load information satisfies the threshold;
The general manager is
The received information storage means holds the load information received from the first monitoring manager in association with a node identifier;
A node monitoring system comprising response communication means for comparing, in response to the request, the load information held in the received information storage means with the predetermined threshold and, if there is load information satisfying the predetermined threshold, transmitting the node identifier corresponding to that load information to the second monitoring manager.

The node monitoring system according to claim 5,
The monitoring manager comprises a first monitoring manager and a second monitoring manager,
The first monitoring manager is:
The information communication means transmits the load information together with the corresponding node identifier to the general manager,
The second monitoring manager is
A node failure detecting means for detecting a failure of the node;
When the node failure detection means detects a failure of a monitored node, the determination means compares the stored load information with a predetermined threshold and determines whether all the stored load information is equal to or greater than the threshold;
When the determination means determines that all the stored load information is equal to or greater than the threshold, the information communication means transmits a determination request as to whether there is a node whose load information satisfies the threshold;
The general manager is
The received information storage means holds the load information received from the first monitoring manager in association with a node identifier;
Response communication means for comparing, in response to the request, the load information held in the received information storage means with the predetermined threshold and, if there is load information satisfying the predetermined threshold, transmitting the node identifier corresponding to that load information to the second monitoring manager;
The second monitoring manager is
The accepting means accepts the node identifier from the general manager;
A node monitoring system comprising node control means for causing the node corresponding to the node identifier accepted by the accepting means to execute the data processing of the node whose failure was detected by the node failure detecting means.

Receiving, from a node executing data processing, load information indicating a load required to execute the data processing together with a node identifier for identifying the node;
Determining whether the received load information is greater than or equal to a predetermined threshold;
When it is determined that the received load information is less than the threshold, transmitting the load information so determined, in association with the node identifier received together with it, to a general manager connected to a plurality of monitoring managers via a network;
Detecting a failure of the node;
When a failure of one node is detected, comparing the stored load information with the threshold value, and determining whether or not all the stored load information is equal to or greater than the threshold value;
A data processing method of a monitoring manager, including a step of inquiring of the general manager about load information of another node satisfying the threshold when it is determined that all of the stored load information is equal to or greater than the threshold.

A computer program for a monitoring manager,
On the computer,
A reception procedure for receiving load information indicating a load of a node related to execution of the data processing from a node executing the data processing together with a node identifier for identifying the node;
A first determination procedure for determining whether or not the load information received from the node is equal to or greater than a predetermined threshold;
When it is determined that the received load information is less than the threshold value, information for transmitting the load information in association with the node identifier to a general manager connected to a plurality of monitoring managers via a network Communication procedure;
A failure detection procedure for detecting a failure of the node;
A second determination procedure for comparing, when a failure of one node is detected, the stored load information with the threshold and determining whether all the stored load information is equal to or greater than the threshold;
A computer program for causing the computer to execute an inquiry procedure for inquiring of the general manager about load information of another node satisfying the threshold when it is determined that all the stored load information is equal to or greater than the threshold.

Receiving load information indicating a load of a node monitored by a first monitoring manager connected via a network in association with a node identifier identifying the node from the first monitoring manager;
Storing the received load information and the node identifier;
Receiving, from a second monitoring manager connected via the network, a determination request as to whether there is a node whose load information satisfies a predetermined threshold;
In response to the request, comparing the stored load information with the predetermined threshold;
A data processing method of a general manager, including the step of transmitting the node identifier corresponding to the load information to the second monitoring manager when there is the load information satisfying the predetermined threshold.

A computer program for a general manager,
On the computer,
A reception procedure for receiving, from a first monitoring manager connected via a network, load information indicating the load of a node monitored by the first monitoring manager, in association with a node identifier for identifying the node;
A received information storage procedure for storing the received load information and the node identifier;
A request reception procedure for receiving, from a second monitoring manager connected via the network, a determination request as to whether there is a node whose load information satisfies a predetermined threshold;
A search procedure for comparing the stored load information with the predetermined threshold in response to the request;
A computer program for executing a response communication procedure for transmitting the node identifier corresponding to the load information to the second monitoring manager when there is the load information satisfying the predetermined threshold.