JP2014049045A

JP2014049045A - Counter-failure system for job management system and program therefor

Info

Publication number: JP2014049045A
Application number: JP2012193458A
Authority: JP
Inventors: Motosuke Murai; 基祐村井; Yasunori Hayashida; 安規林田
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2012-09-03
Filing date: 2012-09-03
Publication date: 2014-03-17

Abstract

[Problem] The present invention provides a technique for accumulating information on failures that occurred in the past in a job management system, together with information on how they were handled, and for presenting a handling method when a failure occurs.
[Solution] The present invention is a failure handling system for a job management system. The failure handling system includes: input/output means; storage means; job execution control means that causes at least one information processing apparatus to execute a job; operation status monitoring means that monitors the operation status of the at least one information processing apparatus; failure monitoring means that determines, based on job execution result information, a failure that has occurred in the at least one information processing apparatus; and failure handling means that stores information on the countermeasure taken for the failure in the storage means as failure handling information. The failure handling means acquires, from the failure handling information, information that matches the failure information of the at least one information processing apparatus determined to have failed by the failure monitoring means, and displays the acquired information on the input/output means.
[Selected drawing] Figure 1

Description

The present invention relates to a failure handling system for a job management system that executes specified business programs at specified dates and times, and to a program for the failure handling system.

In computer systems that carry out various kinds of business operations, job management systems are used to schedule daily operations systematically and run them automatically. A job management system can specify an execution schedule for each job and can distribute jobs across multiple computers for execution.

On the other hand, the configuration of a job management system grows more complex in proportion to its scale, and dealing with a failure has required a great deal of effort: identifying the failure location, investigating the scope of impact, and working out a recovery method. To cope with failures, systems that collect system performance information and monitor it daily are therefore generally used.

JP 2001-175492 A; JP 2004-287980 A; JP 2007-114908 A

Many of the failures that occur in a job management system are caused by changes in the environment. Here, a "failure" means the abnormal termination or delay of a job, and a "change in the environment" means an increase in job volume, insufficient disk capacity, insufficient physical memory, a communication failure, or the like. When a failure actually occurs, often only the surface symptom (a job ending abnormally or running late) is visible, and it is difficult to investigate the environmental change that is the root cause. Users therefore had to investigate the environmental change behind the failure themselves and decide for themselves how to handle it. As a result, responding to a failure took a long time.

The present invention has been made in view of this situation, and provides a technique for accumulating information on failures that occurred in the past in a job management system, together with information on how they were handled, and for presenting a handling method when a failure occurs.

To solve the above problem, an embodiment of the present invention provides a failure handling system for a job management system that causes at least one information processing apparatus to execute a job. The failure handling system includes: input/output means; storage means that stores job execution result information indicating the execution result of the job, operation information indicating the operation status of the at least one information processing apparatus, failure history information indicating information on failures that have occurred in the at least one information processing apparatus, and failure handling information indicating information on countermeasures taken for those failures; job execution control means that causes the at least one information processing apparatus to execute the job and stores the execution result of the job in the storage means as the job execution result information; operation status monitoring means that monitors the operation status of the at least one information processing apparatus and stores the monitored operation status in the storage means as the operation information; failure monitoring means that determines, based on the job execution result information, a failure that has occurred in the at least one information processing apparatus and stores information on the failure in the storage means as the failure history information; and failure handling means that stores information on the countermeasure taken for the failure in the storage means as the failure handling information. The failure handling means acquires, from the failure handling information, information that matches the failure information of the at least one information processing apparatus determined to have failed by the failure monitoring means, and displays the acquired information on the input/output means.

Another embodiment of the present invention provides a program that causes a computer including computing means, storage means, and input/output means to execute failure handling processing for a job management system that causes at least one information processing apparatus to execute a job. The program causes the computing means to execute: job execution control processing that causes the at least one information processing apparatus to execute the job and stores the execution result of the job in the storage means as job execution result information; operation status monitoring processing that monitors the operation status of the at least one information processing apparatus and stores the monitored operation status in the storage means as operation information; failure monitoring processing that determines, based on the job execution result information, a failure that has occurred in the at least one information processing apparatus and stores information on the failure in the storage means as failure history information; failure handling processing that stores information on the countermeasure taken for the failure in the storage means as failure handling information; and display processing that acquires, from the failure handling information, information that matches the failure information of the at least one information processing apparatus determined to have failed by the failure monitoring processing, and displays the acquired information on the input/output means.

According to the present invention, information on failures that occurred in the past in a job management system, together with information on how they were handled, can be accumulated, and a handling method can be presented when a failure occurs. This makes it possible to respond quickly when a failure occurs.

Further features related to the present invention will become apparent from the description in this specification and the accompanying drawings. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

FIG. 1 is a system configuration diagram showing an embodiment of the present invention.
FIG. 2 is a configuration diagram of the job management manager in an embodiment of the present invention.
FIG. 3 is a configuration diagram of the job execution agent in an embodiment of the present invention.
FIG. 4 is a configuration diagram of the failure handling system in an embodiment of the present invention.
FIG. 5 is a diagram showing an example of the schedule information table stored in the job schedule definition DB and of the job schedule definition file.
FIG. 6 is a diagram showing an example of the job definition table stored in the job definition DB and of the job definition file.
FIG. 7 is a diagram showing an example of the execution result table stored in the execution result information DB.
FIG. 8 is a diagram showing an example of the operation information table stored in the operation information DB for the job management manager.
FIG. 9 is a diagram showing an example of the operation information table stored in the operation information DB for the job execution agents.
FIG. 10 is a diagram showing an example of the failure history table stored in the failure history DB.
FIG. 11 is a diagram showing an example of the failure handling table stored in the failure handling DB.
FIG. 12 is a flowchart explaining the process of accumulating data in the failure history DB in the failure handling system.
FIG. 13 is a flowchart explaining the process of accumulating data in the failure handling DB in the failure handling system.
FIG. 14 is an example of the screen displayed on the terminal in step 1301 of FIG. 13.
FIG. 15 is a flowchart explaining the process of presenting a failure handling method in the failure handling system.
FIG. 16 is a flowchart explaining the advance failure detection process in the failure handling system.
FIG. 17 is an example of the screen on the terminal that displays the advance failure detection result and updates the information in the failure handling table.

Embodiments of the failure handling system for a job management system according to the present invention, and of its program, are described in detail below with reference to the accompanying drawings. In the following embodiments, a "job" means a unit of work processed by a computer (information processing apparatus), and is defined as an external program corresponding to some business task. A "job net" is a group of multiple jobs that also defines the order in which those jobs are executed.

<System configuration>
FIG. 1 is a system configuration diagram showing an embodiment of the present invention. As shown in FIG. 1, in this embodiment host machines 101A to 101E and a terminal 105 are connected to one another via a network 106. The host machines 101A to 101E and the terminal 105 are information processing apparatuses such as personal computers or workstations. Each of them includes a central processing unit such as a CPU (Central Processing Unit), a memory, a hard disk (storage device), an input device such as a keyboard, and an output device such as a display.

As shown in FIG. 1, a job management manager 102 is installed on the host machine 101A. Job execution agents 103B to 103D are installed on the host machines 101B to 101D, respectively. A failure handling system 104 is installed on the host machine 101E.

The job execution environment consists of one job management manager 102 and three job execution agents (103B to 103D). The job management manager 102 distributes the jobs it manages to the job execution agents (103B to 103D), and each job execution agent executes the jobs it receives.

The distribution destination of a job depends on the business content of the job and can be specified freely in the job management manager 102. When execution of a job finishes, the job execution agent (103B to 103D) notifies the job management manager 102 of the execution result. On receiving the result notification, the job management manager 102 ends execution of the corresponding job. From the terminal 105, a user can input data to, and receive output from, the job management manager 102 and the failure handling system 104.

The specific configurations of the job management manager 102, the job execution agents 103B to 103D, and the failure handling system 104 are described below. These configurations (the processing units shown in FIGS. 2 to 4) may be realized by software program code that implements the functions of the embodiment. That is, this embodiment can be realized by storing predetermined programs in memory as program code and having the central processing unit execute that code.

FIG. 2 is a configuration diagram of the job management manager 102. The job management manager 102 includes a job schedule definition DB 201, a job execution control unit 202, a job definition DB 206, an execution result information DB 207, an operation information DB 208, a job definition update unit 211, and an operation status monitoring unit 212.

The job schedule definition DB 201 stores the schedule information of job nets. The job execution control unit 202 includes a schedule calculation unit 203, a job definition reading unit 204, and a job execution processing unit 205, and controls them. The schedule calculation unit 203 reads the schedule information of a job net from the job schedule definition DB 201 and calculates the schedule of each job in the job net. When a job net is started, the job definition reading unit 204 reads the data of each job in the job net from the job definition DB 206 into memory.

The job execution processing unit 205 distributes execution requests for jobs in the job net that have not yet been executed to the job execution agents (103B to 103D), and receives the job execution results from them. The job execution processing unit 205 stores the received execution results in the execution result information DB 207. The execution result information DB 207 stores the job execution results received by the job execution processing unit 205 (job status, start time, end time, statistical information, and so on).

The operation information DB 208 stores the operation status of the host machine 101A. The operation status monitoring unit 212 stores the results of monitoring the operation status of the host machine 101A in the operation information DB 208.

The job management manager 102 further includes a job schedule definition file 209 and a job definition file 210. The job schedule definition file 209 describes, in CSV format, the job schedule definitions to be registered in the job schedule definition DB 201. The job definition file 210 describes, in CSV format, the job definitions to be registered in the job definition DB 206. The job definition update unit 211 reads the data in the job schedule definition file 209 and the job definition file 210 and stores it in the job schedule definition DB 201 and the job definition DB 206. When a job schedule or job definition changes, the DBs can therefore be updated through the job schedule definition file 209 and the job definition file 210 as described above.

FIG. 3 is a configuration diagram of the job execution agents (103B to 103D). Each job execution agent includes a job execution unit 301, an operation status monitoring unit 302, and an operation information DB 303.

The job execution unit 301 accepts job execution requests from the job execution processing unit 205 of the job management manager 102, and creates and runs a job process for each accepted request. The job execution unit 301 notifies the job execution processing unit 205 of "normal end" if the created process ends normally, and of "abnormal end" if the job ends abnormally in the course of creating the process. The operation information DB 303 stores the operation status of the host machines (101B to 101D). The operation status monitoring unit 302 stores the results of monitoring the operation status of the host machines (101B to 101D) in the operation information DB 303.
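The behaviour of the job execution unit 301 described above can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions: the function name `run_job` is hypothetical, and treating a non-zero exit code as an abnormal end is an assumption, since the text only distinguishes a process that ends normally from one that ends abnormally.

```python
import subprocess
import sys


def run_job(command: list[str]) -> str:
    """Run one job as a child process and return the status to report.

    Hypothetical helper: the status strings follow the text, but the
    exit-code rule below is an assumption made for illustration.
    """
    try:
        completed = subprocess.run(command, check=False)
    except OSError:
        # The process could not be created at all (e.g. program not found).
        return "abnormal end"
    # Assumption: exit code 0 means the job process ended normally.
    return "normal end" if completed.returncode == 0 else "abnormal end"


if __name__ == "__main__":
    # Run a trivial child process with the current Python interpreter.
    print(run_job([sys.executable, "-c", "pass"]))
```

In a real agent the returned status would be sent back to the job execution processing unit 205 rather than printed.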

FIG. 4 is a configuration diagram of the failure handling system 104. The failure handling system 104 includes a failure monitoring unit 401, a failure handling unit 402, an advance failure detection unit 403, a failure information input unit 404, a failure history DB 405, and a failure handling DB 406.

The failure history DB 405 stores information on failures that occurred in the past, and the failure handling DB 406 stores information such as the countermeasures taken for those failures. The failure monitoring unit 401 monitors the failure status of each of the host machines 101A to 101D and, if a failure has occurred, stores information on the failure in the failure history DB 405. If a failure has occurred, the failure handling unit 402 compares the failure information with the failure handling DB 406 to search for a way of handling the failure, and sends the search result to the terminal 105. The failure information input unit 404 accepts input describing a failure that occurred and the countermeasure taken for it, and stores the input in the failure handling DB 406. The advance failure detection unit 403 detects failures in advance by comparing the operation status obtained from the operation information DBs (208, 303) of the host machines 101A to 101D with the information in the failure handling DB 406.
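The lookup performed by the failure handling unit 402 can be pictured roughly as follows. This is a hedged sketch rather than the patented matching logic: the record fields (`job_name`, `symptom`, `countermeasure`), the sample data, and the exact-match rule are all assumptions introduced for illustration, because the layout of the failure handling DB 406 is not given in this part of the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlingRecord:
    """One hypothetical row of the failure handling DB."""
    job_name: str
    symptom: str          # e.g. "abnormal end" or "delay"
    countermeasure: str   # free-text description of what was done


def find_countermeasures(failure: dict, handling_db: list[HandlingRecord]) -> list[str]:
    """Return the countermeasures recorded for failures matching `failure`.

    Assumption: a record matches when both job name and symptom are equal.
    """
    return [
        record.countermeasure
        for record in handling_db
        if record.job_name == failure["job_name"]
        and record.symptom == failure["symptom"]
    ]


# Hypothetical sample data, not taken from the patent.
HANDLING_DB = [
    HandlingRecord("JobA-1", "abnormal end", "Free disk space on the agent host"),
    HandlingRecord("JobA-1", "delay", "Reduce concurrent jobs on the agent host"),
    HandlingRecord("JobA-2", "abnormal end", "Restart the agent"),
]

if __name__ == "__main__":
    detected = {"job_name": "JobA-1", "symptom": "abnormal end"}
    for text in find_countermeasures(detected, HANDLING_DB):
        print(text)
```

An empty result would correspond to a failure with no recorded countermeasure, which is the case where the failure information input unit 404 would be used to add one.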

<Database configuration>
The configuration of each DB used in this embodiment is described below. The following description presents the information of the present invention using a "table" structure, but this information does not have to be represented as tables and may be represented by other data structures. To indicate that it does not depend on a particular data structure, it may simply be called "information" below. Each DB described below is stored on the hard disk (storage device) of the corresponding information processing apparatus.

FIG. 5 is a diagram showing an example of the schedule information table stored in the job schedule definition DB 201 and of the job schedule definition file. The schedule information table 501 contains a job net name 502, a start time 503, and a processing cycle 504 as its items. The job net name 502 is a name or the like that uniquely identifies a job net. The start time 503 is the start time set for each job net. The processing cycle 504 is the processing cycle of the job net.

The job schedule definition file 505 stores the information corresponding to each item of the schedule information table 501 in comma-separated (CSV) format. When the job definition update unit 211 reads the job schedule definition file 505 and performs import processing, each line of the file is stored as a record in the schedule information table 501. In the illustrated example, the schedule starts JobNetA every day at 00:00, JobNetB on the 15th of every month at 10:15, and JobNetC every Thursday at 03:00.
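The import of a job schedule definition file can be sketched in Python as below. The column order (job net name, start time, processing cycle) follows the items of the schedule information table 501, but the way the processing cycle is encoded (`daily`, `monthly:15`, `weekly:Thu`) is an assumption made for this sketch; the actual file contents appear only in FIG. 5. The names `ScheduleEntry`, `load_schedule`, and `runs_on` are likewise hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, time


@dataclass(frozen=True)
class ScheduleEntry:
    jobnet_name: str
    start_time: time
    cycle: str  # assumed encoding: "daily", "monthly:<day>", "weekly:<Mon..Sun>"


WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical file contents mirroring the schedule described in the text.
SAMPLE = "JobNetA,00:00,daily\nJobNetB,10:15,monthly:15\nJobNetC,03:00,weekly:Thu\n"


def load_schedule(text: str) -> list[ScheduleEntry]:
    """Parse CSV text into schedule entries, one per non-empty line."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        name, start, cycle = (field.strip() for field in row)
        hour, minute = map(int, start.split(":"))
        entries.append(ScheduleEntry(name, time(hour, minute), cycle))
    return entries


def runs_on(entry: ScheduleEntry, day: date) -> bool:
    """Return True if the job net is scheduled to start on the given day."""
    kind, _, arg = entry.cycle.partition(":")
    if kind == "daily":
        return True
    if kind == "monthly":
        return day.day == int(arg)
    if kind == "weekly":
        return WEEKDAYS[day.weekday()] == arg
    raise ValueError(f"unknown processing cycle: {entry.cycle!r}")


if __name__ == "__main__":
    thursday = date(2012, 9, 13)
    for entry in load_schedule(SAMPLE):
        if runs_on(entry, thursday):
            print(entry.jobnet_name, entry.start_time.strftime("%H:%M"))
```

Keeping the cycle as a small tagged string keeps the CSV one line per job net, which matches the one-line-per-record import described above.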

FIG. 6 is a diagram showing an example of the job definition table stored in the job definition DB 206 and of the job definition file. The job definition table 601 contains a job net/job name 602, an upper job net name 603, a preceding job name 604, an execution destination agent name 605, and an execution program 606 as its items.

The job net/job name 602 is a name or the like that uniquely identifies a job net or a job. The upper job net name 603 indicates the name of the upper-level job net that contains the job. The preceding job name 604 indicates the job that precedes the job. The execution destination agent name 605 indicates where the job is executed; in this embodiment it is one of the host machines (101B to 101D). The execution program 606 indicates, for example, the path of the external program used to execute the job.

The job definition file 607 stores the information corresponding to each item of the job definition table 601 in comma-separated (CSV) format. When the job definition update unit 211 reads the job definition file 607 and performs import processing, each line of the file is stored as a record in the job definition table 601. In the illustrated example, the jobs "JobA-1" and "JobA-2" belong to the job net "JobNetA". The job "JobA-1" is the first job in the job net; "host machine 101B" is specified as its execution destination agent name 605 and "/opt/xxxx/xx.sh" as its execution program 606. The job "JobA-2" has "JobA-1" as its preceding job; "host machine 101D" is specified as its execution destination agent name 605 and "/opt/yyyy/yy.sh" as its execution program 606. When an item has nothing to set, as with the preceding job name 604 of the job "JobA-1", no value is set.
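How the preceding job name determines execution order within a job net can be shown with a short sketch. The CSV column order follows the items of the job definition table 601, but the concrete rows, the host labels, and the assumption that each job has at most one preceding job are illustrative only. The ordering itself uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), which is a choice made for this sketch and not something the source specifies.

```python
import csv
import io
from graphlib import TopologicalSorter

# Hypothetical file contents; JobA-2 is listed first on purpose to show
# that the order comes from the preceding-job column, not from line order.
SAMPLE = (
    "JobNetA,,,,\n"
    "JobA-2,JobNetA,JobA-1,host101D,/opt/yyyy/yy.sh\n"
    "JobA-1,JobNetA,,host101B,/opt/xxxx/xx.sh\n"
)


def execution_order(csv_text: str, jobnet: str) -> list[str]:
    """Return the jobs of one job net in an order that respects predecessors."""
    graph: dict[str, set[str]] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        name, parent, preceding, _agent, _program = row
        if parent != jobnet:
            continue  # skip the job net's own row and jobs of other job nets
        # An empty preceding-job field means the job has no predecessor.
        graph[name] = {preceding} if preceding else set()
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(SAMPLE, "JobNetA"))
```

A cyclic definition (two jobs naming each other as predecessors) would make `static_order()` raise `graphlib.CycleError`, which is a convenient place to reject an invalid job definition file.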

FIG. 7 is a diagram showing an example of the execution result table stored in the execution result information DB 207. The execution result table 701 contains a job net/job name 702, a status 703, a start time 704, and an end time 705 as its items.

The job net/job name 702 is a name or the like that uniquely identifies a job net or a job. The status 703 indicates the status of the job net or job; for example, "running", "normal end", or "abnormal end" is set. The start time 704 indicates the time at which the job net or job was started. The end time 705 indicates the time at which the job net or job ended. The job net/job name 702, the status 703, and the start time 704 are stored when the job net or job starts, and the status 703 and the end time 705 are updated when it ends.
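The two-step life cycle of an execution result record (written at start, updated at end) can be sketched with an in-memory SQLite table. The table and column names are hypothetical, times are kept as plain strings for brevity, and using the job name as the primary key is a simplifying assumption that allows only one record per job.

```python
import sqlite3


def open_result_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the execution result table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE execution_result ("
        " name TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " start_time TEXT NOT NULL,"
        " end_time TEXT)"
    )
    return conn


def record_start(conn: sqlite3.Connection, name: str, start_time: str) -> None:
    """Store name, status and start time when the job (net) starts."""
    conn.execute(
        "INSERT INTO execution_result (name, status, start_time)"
        " VALUES (?, 'running', ?)",
        (name, start_time),
    )


def record_end(conn: sqlite3.Connection, name: str, status: str, end_time: str) -> None:
    """Update status and end time when the job (net) ends."""
    conn.execute(
        "UPDATE execution_result SET status = ?, end_time = ? WHERE name = ?",
        (status, end_time, name),
    )


if __name__ == "__main__":
    db = open_result_db()
    record_start(db, "JobA-1", "00:00:00")
    record_end(db, "JobA-1", "normal end", "00:03:10")
    print(db.execute("SELECT * FROM execution_result").fetchall())
```

Because the end time stays NULL until the job finishes, a query for rows with a NULL end time gives the jobs that are still running, which is one way a delay check could be built on this table.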

FIG. 8 is a diagram showing an example of the operation information table stored in the operation information DB 208 for the job management manager 102. The operation information table 801 contains a monitoring time period 802, an average CPU usage rate 803, an average memory usage rate 804, an average disk I/O 805, a job distribution count 806, a user operation amount 807, a DB usage rate 808, and a DB record count 809 as its items. The monitoring time period 802 indicates the time period over which the operation information of the job management manager 102 was monitored. In this embodiment one record is stored every 5 minutes, but records may be stored at other intervals.

The average CPU usage rate 803 is the average CPU usage rate of the host machine 101A during the time period. The average memory usage rate 804 is the average memory usage rate of the host machine 101A during the time period. The average disk I/O 805 indicates the average proportion of time spent on disk I/O on the host machine 101A during the time period. The job distribution count 806 indicates the total number of jobs that the job execution processing unit 205 of the job management manager 102 distributed to the job execution agents (103B to 103D) during the time period.

The user operation amount 807 indicates the total amount of operations that users performed on the job management manager 102 during the corresponding time period. The DB usage rate 808 indicates the average proportion of used space in the job schedule definition DB 201, the job definition DB 206, the execution result information DB 207, and the operation information DB 208 during the period. The DB record count 809 indicates the number of records stored in those same DBs during the period. For each item of the operation information table 801, the operation status monitoring unit 212 collects the corresponding data from the host machine 101A and the job management manager 102 at regular intervals and stores the collected data in the operation information table 801 one record (row) at a time.
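The layout of one row of the operation information table 801 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the field names, the shape of the raw samples, and the choice to take the DB record count from the last sample of the period are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ManagerOperationRecord:
    """One row of the operation information table 801 (field names are illustrative)."""
    period_start: datetime   # monitoring time period 802
    avg_cpu_pct: float       # average CPU usage rate 803
    avg_mem_pct: float       # average memory usage rate 804
    avg_disk_io_pct: float   # average disk I/O 805
    jobs_distributed: int    # job distribution count 806
    user_operations: int     # user operation amount 807
    db_usage_pct: float      # DB usage rate 808
    db_record_count: int     # DB record count 809


def period_start(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the start of its monitoring period."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def summarize(samples: list[dict], ts: datetime) -> ManagerOperationRecord:
    """Collapse the raw samples of one period into a single record (hypothetical helper)."""
    n = len(samples)

    def mean(key: str) -> float:
        return sum(s[key] for s in samples) / n

    return ManagerOperationRecord(
        period_start=period_start(ts),
        avg_cpu_pct=mean("cpu"),
        avg_mem_pct=mean("mem"),
        avg_disk_io_pct=mean("disk_io"),
        jobs_distributed=sum(s["jobs"] for s in samples),
        user_operations=sum(s["ops"] for s in samples),
        db_usage_pct=mean("db_usage"),
        db_record_count=samples[-1]["db_records"],  # assumption: count at the end of the period
    )
```

Rates are averaged over the period while counts are summed, which mirrors the distinction the text draws between the "average" items and the "total" items.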

FIG. 9 is a diagram illustrating an example of the operation information table stored in the operation information DB 303 for the job execution agents (103B to 103D). The operation information table 901 includes, as configuration items, a monitoring time period 902, a CPU usage rate 903, a memory usage rate 904, a disk I/O 905, and a job execution count 906.

The monitoring time period 902 indicates the time period during which the operation information of the job execution agents (103B to 103D) was monitored. The CPU usage rate 903 is the average CPU usage rate of the host machine (101B to 101D) during the corresponding time period. The memory usage rate 904 is the average memory usage rate of the host machine (101B to 101D) during that period.

The disk I/O 905 indicates the average proportion of time the host machine (101B to 101D) spent on disk I/O during the corresponding time period. The job execution count 906 indicates the total number of jobs executed by the job execution unit 301 of the job execution agent (103B to 103D) during the period. For each item of the operation information table 901, the operation status monitoring unit 302 collects the corresponding data from the host machines (101B to 101D) and the job execution agents (103B to 103D) at regular intervals and stores the collected data in the operation information table 901 one record (row) at a time.

FIG. 10 is a diagram illustrating an example of the failure history table stored in the failure history DB 405. The failure history table 1001 includes, as configuration items, a failure occurrence host machine 1002, a failure occurrence date and time 1003, and a phenomenon 1004. The failure occurrence host machine 1002 indicates the host machine (101A to 101D) on which the failure occurred. The failure occurrence date and time 1003 indicates the time at which the failure occurred on that host machine. The phenomenon 1004 indicates the specific phenomenon that occurred on that host machine. Each item of the failure history table 1001 is stored when the failure monitoring unit 401, which monitors the failure status of the host machines 101A to 101D, determines that a failure has occurred.

FIG. 11 is a diagram illustrating an example of the failure handling table stored in the failure handling DB 406. The failure handling table 1101 includes, as configuration items, a host machine 1102, a phenomenon 1103, a cause 1104, a failure pre-determination condition 1105, and a coping method 1106. The host machine 1102 indicates the host machine that is the target of failure handling. The phenomenon 1103 indicates the specific phenomenon that occurred on that host machine. The cause 1104 indicates the cause of the phenomenon that occurred on that host machine.

The failure pre-determination condition 1105 holds a condition for taking action before a failure occurs, that is, while operation status monitoring has not yet detected any failure. Here, conditions on the job distribution count, the job execution count, the memory usage rate, the DB usage rate, and the like are set. The conditions are not limited to these; for example, a condition on any configuration item of the operation information tables 801 and 901 can be set as the failure pre-determination condition 1105. The coping method 1106 indicates the action to take when a failure occurs or when the failure pre-determination condition 1105 is satisfied. For each item of the failure handling table 1101, the failure information input unit 404 accepts input information from the terminal 105 and stores it in the failure handling table 1101.

<Processing in the failure handling system>
Next, processing in the failure handling system 104 will be described. FIG. 12 is a flowchart explaining the process of accumulating the failure history DB 405 in the failure handling system 104. The following describes the processing in which the failure handling system 104 monitors each host machine (101A to 101D) for failures, stores information in the failure history DB 405, and notifies the terminal 105 of the failure.

In step 1201, the failure monitoring unit 401 refers to the execution result information DB 207 of the host machine 101A and acquires the job execution results. Next, in step 1202, the failure monitoring unit 401 determines, based on the information in the execution result table 701, whether a failure has occurred. For example, if the status 703 in the execution result table 701 is "abnormal end", the job can be determined to have ended abnormally. Alternatively, the failure monitoring unit 401 may acquire the start time 503 from the schedule information table 501 and compare it with the start time 704 in the execution result table 701 to detect a delay. If no record has been created in the execution result table 701 even though the start time has arrived, the start of the job net itself can be determined to be delayed. The failure monitoring unit 401 may also compare against the times at which past processing cycles in the execution result table 701 ended normally to determine whether each job in the current processing cycle is delayed. If a failure has occurred, the process proceeds to step 1203; otherwise, the process ends.
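The determination in step 1202 can be sketched as below. This is only an illustration under stated assumptions: the record type, the function name `judge_failure`, the `tolerance` parameter, and the returned phenomenon strings are hypothetical, and the comparison against past processing cycles is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ExecutionResult:
    """One row of the execution result table 701 (field names are illustrative)."""
    name: str                            # job net / job name 702
    status: str                          # status 703
    start_time: datetime                 # start time 704
    end_time: Optional[datetime] = None  # end time 705


def judge_failure(scheduled_start: datetime,
                  result: Optional[ExecutionResult],
                  now: datetime,
                  tolerance: timedelta = timedelta(0)) -> Optional[str]:
    """Return a phenomenon string if a failure is detected, otherwise None."""
    if result is None:
        # No record exists yet: a failure only if the scheduled start has passed.
        return "not started" if now > scheduled_start + tolerance else None
    if result.status == "abnormal end":
        return "abnormal end"
    if result.start_time > scheduled_start + tolerance:
        # Compare start time 704 with the scheduled start time 503.
        return "start delayed"
    return None
```

A non-`None` result corresponds to the branch that proceeds to step 1203, and the returned string plays the role of the phenomenon 1004.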

In step 1203, the failure monitoring unit 401 stores information on the failure in the failure history DB 405. The failure monitoring unit 401 can acquire information from the job definition DB 206 of the host machine 101A to fill in the failure occurrence host machine 1002 item of the failure history table 1001. The phenomenon 1004 of the failure history table 1001 stores the phenomenon determined in step 1202.

Next, in step 1204, the failure monitoring unit 401 notifies the terminal 105 of the failure. At this time, the failure monitoring unit 401 acquires the failure information stored in the failure history DB 405, acquires the operation information for the time period of the failure from the operation information DB (208 or 303) of the host machine on which the failure occurred, and transmits both to the terminal 105.
By performing the above processing periodically, failures can be notified to the terminal 105 while failure information accumulates in the failure history DB 405.
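Steps 1203 and 1204 can be sketched together as follows. The in-memory list and dictionary stand in for the failure history DB 405 and the operation information DBs (208, 303); those stand-ins, the key layout `(host, period start)`, and the function name are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FailureRecord:
    """One row of the failure history table 1001 (field names are illustrative)."""
    host: str              # failure occurrence host machine 1002
    occurred_at: datetime  # failure occurrence date and time 1003
    phenomenon: str        # phenomenon 1004


def record_and_notify(history: list, operation_db: dict, failure: FailureRecord,
                      period_minutes: int = 5) -> dict:
    """Store the failure (step 1203) and build the notification payload (step 1204)."""
    history.append(failure)
    t = failure.occurred_at
    # Find the monitoring period that contains the failure time.
    start = t.replace(minute=t.minute - t.minute % period_minutes, second=0, microsecond=0)
    operation: Optional[dict] = operation_db.get((failure.host, start))
    return {"failure": failure, "operation": operation}
```

The returned dictionary represents what would be sent to the terminal 105; `operation` is `None` when no operation record exists for that host and period.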

Next, another process in the failure handling system 104 will be described. FIG. 13 is a flowchart explaining the process of accumulating the failure handling DB 406 in the failure handling system 104. The following describes the processing in which the failure handling system 104 receives input information from the terminal 105 and stores it in the failure handling DB 406.

In step 1301, the failure information input unit 404 receives from the terminal 105 the cause of the failure that occurred, the method used to deal with it, and the failure pre-determination condition. The screen used on the terminal 105 is described later. Next, in step 1302, the failure information input unit 404 stores the received information in the failure handling DB 406. Through this processing, information on failure handling accumulates in the failure handling DB 406.

FIG. 14 is an example of the screen displayed on the terminal 105 in step 1301 of FIG. 13. The screen of FIG. 14 includes a failure history display section 1401, a failure-time operation status display section 1402, and a failure handling information display section 1403.

The failure history display section 1401 displays information corresponding to a record of the failure history table 1001 in the failure history DB 405. That is, it displays a failure occurrence host machine 1413, a failure occurrence date and time 1414, and a phenomenon 1415. The failure-time operation status display section 1402 displays information corresponding to the record of the operation information table (801 or 901) of the host machine on which the failure occurred.

The failure handling information display section 1403 displays text boxes corresponding to a record of the failure handling table 1101: a host machine 1404, a phenomenon 1405, a failure cause 1407, a failure pre-determination condition 1408, and a coping method 1409. The failure handler (the user of the terminal 105) fills in the failure handling information display section 1403 based on the information in the failure history display section 1401 and the failure-time operation status display section 1402. The host machine 1404 and the phenomenon 1405 are already filled in, because they are known from the failure occurrence host machine 1002 and the phenomenon 1004 of the failure history table 1001.

The failure handler enters the failure cause 1407, which is the cause of the failure; the failure pre-determination condition 1408, which is the condition for giving advance notice before the same phenomenon occurs again; and the coping method 1409 to be used when the phenomenon occurs (the action taken for the current failure). In this example, the failure cause 1407, the failure pre-determination condition 1408, and the coping method 1409 are text boxes, but they may instead be selected from a set of options prepared in advance.

Finally, after entering the data, the failure handler presses the [Register failure handling information] button 1410 at the bottom of the screen. The entered information is then transmitted from the terminal 105 to the failure information input unit 404 of the failure handling system 104, and the failure information input unit 404 stores the received information in the failure handling DB 406.

Next, another process in the failure handling system 104 will be described. FIG. 15 is a flowchart explaining the coping method presentation process executed by the failure handling unit 402 of the failure handling system 104.

First, in step 1501, the failure handling unit 402 acquires failure information on the host machines (101A to 101D). The failure handling unit 402 can acquire this information in conjunction with the failure determination process in the failure monitoring unit 401 (the process of FIG. 12). For example, the failure handling unit 402 may acquire the failure information from the failure monitoring unit 401 at a stage after the failure monitoring unit 401 has determined that a failure occurred (such as step 1203 in FIG. 12).

Next, in step 1503, the failure handling unit 402 reads the failure handling table 1101 of the failure handling DB 406. Next, in step 1504, the failure handling unit 402 compares the host machine and phenomenon in the failure information acquired in step 1501 with the host machine 1102 and phenomenon 1103 of a record read from the failure handling table 1101 in step 1503, and determines whether both match.

Next, in step 1505, if the determination in step 1504 found a matching record in the failure handling table 1101, the cause 1104 and coping method 1106 of that record are notified to the terminal 105. If more than one record matches, multiple coping methods may be notified to the terminal 105. The terminal 105 displays the notified cause 1104 and coping method 1106 on its display device. When the notification to the terminal 105 in step 1505 is complete, or when the determination in step 1504 found no match, the process returns to step 1502. The failure handling unit 402 repeats steps 1503 to 1505 until the records are exhausted.
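The matching loop of steps 1503 to 1505 can be sketched as follows, assuming the failure handling table 1101 has been read into a list of records. The record type and function name are hypothetical, and matching is done by exact string equality, which the text implies but does not spell out.

```python
from dataclasses import dataclass


@dataclass
class HandlingRecord:
    """One row of the failure handling table 1101 (field names are illustrative)."""
    host: str            # host machine 1102
    phenomenon: str      # phenomenon 1103
    cause: str           # cause 1104
    pre_condition: str   # failure pre-determination condition 1105
    coping_method: str   # coping method 1106


def find_coping_methods(host: str, phenomenon: str,
                        table: list[HandlingRecord]) -> list[tuple[str, str]]:
    """Return (cause, coping method) for every record whose host and phenomenon both match."""
    return [
        (rec.cause, rec.coping_method)
        for rec in table  # iterate until the records are exhausted
        if rec.host == host and rec.phenomenon == phenomenon  # step 1504
    ]
```

Returning every match, rather than stopping at the first, corresponds to the option of notifying multiple coping methods to the terminal 105.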

The cause 1104 and coping method 1106 notified to the terminal 105 may be displayed together with the screen shown in FIG. 14. Alternatively, the terminal 105 may display the cause 1104 and coping method 1106 on a screen separate from that of FIG. 14. In that case, the terminal 105 may display, on its display device, the record of the failure history table 1001 corresponding to the failure, the record of the operation information table (801 or 901) of the host machine 101A to 101D on which the failure occurred, and the notified cause 1104 and coping method 1106.

Next, another process in the failure handling system 104 will be described. FIG. 16 is a flowchart explaining the failure pre-detection process executed by the failure pre-detection unit 403 of the failure handling system 104.

First, in step 1601, the failure pre-detection unit 403 acquires operation information from the operation information DB (208, 303) of each of the host machines 101A to 101D. For example, the failure pre-detection unit 403 acquires the operation information at the same time interval at which the operation status monitoring units (212, 302) of the host machines 101A to 101D store information in the operation information DBs (208, 303).

Next, in step 1603, the failure pre-detection unit 403 reads the failure handling table 1101 of the failure handling DB 406. Next, in step 1604, the failure pre-detection unit 403 determines whether the host machine of a record of the operation information table (801, 901) acquired in step 1601 matches the host machine 1102 of the failure handling table 1101 and, in addition, whether the values of the items in that operation information record satisfy the failure pre-determination condition 1105 of the failure handling table 1101.

Next, in step 1605, if a record of the failure handling table 1101 satisfying the determination in step 1604 is found, the failure pre-detection unit 403 judges that the operating status of that host machine 101A to 101D is in a dangerous state and notifies the terminal 105 of the information in that record of the failure handling table 1101 (the host machine 1102, phenomenon 1103, cause 1104, failure pre-determination condition 1105, and coping method 1106). The terminal 105 displays the notified information on its display device. If more than one record matches, multiple detection results may be notified to the terminal 105. When the notification to the terminal 105 in step 1605 is complete, or when the determination in step 1604 found no match, the process returns to step 1602. The failure pre-detection unit 403 repeats steps 1603 to 1605 until the records are exhausted.
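The check in step 1604 can be sketched as below. The text does not say how the failure pre-determination condition 1105 is encoded, so this sketch assumes a structured form, a list of `(metric, operator, threshold)` tuples that must all hold; that encoding, the metric names, and the function names are hypothetical.

```python
import operator

# Comparison operators allowed in a condition (an assumed encoding).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def satisfies(conditions: list[tuple[str, str, float]], metrics: dict) -> bool:
    """True if every (metric, op, threshold) condition holds for the operation record."""
    return all(
        name in metrics and OPS[op](metrics[name], threshold)
        for name, op, threshold in conditions
    )


def pre_detect(host: str, metrics: dict, table: list[dict]) -> list[dict]:
    """Return the handling records whose host matches and whose condition is satisfied."""
    return [
        rec for rec in table
        if rec["host"] == host and satisfies(rec["pre_condition"], metrics)
    ]
```

Each returned record corresponds to one notification to the terminal 105 in step 1605. Note that with this encoding an empty condition list is trivially satisfied; a real system would probably treat a record without a condition as never matching.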

The information notified to the terminal 105 is displayed on a screen separate from the one shown in FIG. 14. In this case, the terminal 105 displays the record of the operation information table (801, 901) of the host machine 101A to 101D whose operating status was judged to be dangerous, together with the host machine 1102, cause 1104, failure pre-determination condition 1105, and coping method 1106 of the failure handling table 1101.

FIG. 17 is an example of a screen on the terminal 105 that displays the failure pre-detection result and is used to update the information in the failure handling table 1101. The screen of FIG. 17 includes a failure pre-detection result display section 1701, a pre-detection-time operation status display section 1702, and a failure handling information display section 1703.

The failure pre-detection result display section 1701 shows information on the failure detected in advance by the failure pre-detection unit 403. It displays the host machine 1710 whose operating status was judged to be dangerous, the time period 1711 in which the pre-detection occurred (information corresponding to the monitoring time periods 802 and 902 of the operation information tables 801 and 901), and the phenomenon 1712 (information corresponding to the phenomenon 1103 of the failure handling table 1101). The pre-detection-time operation status display section 1702 displays the record of the operation information table (801 or 901) of the host machine 1710 for the time period 1711.

The failure handling information display section 1703 displays text boxes corresponding to a record of the failure handling table 1101. That is, the failure handling information display section 1703 displays a host machine 1704, a phenomenon 1705, a failure cause 1706, a failure pre-determination condition 1707, and a coping method 1708. Based on the information in the failure pre-detection result display section 1701 and the pre-detection-time operation status display section 1702, the failure handler (the user of the terminal 105) can decide whether action against the failure is needed.

If the information displayed in the failure handling information display section 1703 needs to be changed, the failure handler edits it there and then presses the [Update failure handling information] button 1709 at the bottom of the screen, which transmits the updated information to the failure information input unit 404 of the failure handling system 104. The failure information input unit 404 then updates the corresponding record of the failure handling table 1101. In this way, the failure pre-determination condition for the host machine 101A to 101D on which the failure was detected in advance can be reset to a more appropriate condition, and the accuracy of failure pre-detection can be improved while the system is in operation. The cause of a failure, the coping method, and so on can likewise be updated to more appropriate content as they accumulate.
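The update performed by the failure information input unit 404 can be sketched as follows. The text does not state how the "corresponding record" is identified; this sketch assumes the pair (host machine, phenomenon) acts as the key, and the record type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlingRecord:
    """One row of the failure handling table 1101 (field names are illustrative)."""
    host: str            # host machine 1102
    phenomenon: str      # phenomenon 1103
    cause: str           # cause 1104
    pre_condition: str   # failure pre-determination condition 1105
    coping_method: str   # coping method 1106


def update_handling_record(table: list, edited: HandlingRecord) -> bool:
    """Replace the record with the same (host, phenomenon); return False if none exists."""
    for i, rec in enumerate(table):
        if (rec.host, rec.phenomenon) == (edited.host, edited.phenomenon):  # assumed key
            table[i] = edited
            return True
    return False
```

For example, a handler who finds that warnings arrive too late could resubmit the same record with a lower threshold in `pre_condition`, which is the tuning loop the paragraph above describes.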

According to this embodiment, the job management system can determine failures of the host machines 101A to 101D from job execution results and notify the terminal 105 of the failure information. The failure handler can therefore be notified of a failure promptly. In addition, when a failure occurs on the host machines 101A to 101D, information such as how the failure was dealt with can be accumulated in the failure handling table 1101. When a failure occurs, the failure handling table 1101 is searched for records corresponding to that failure, and the matching records are output to the terminal 105. The failure handler can thus consider how to deal with the current failure based on the history of actions taken for past failures, which makes it possible to respond quickly when a failure occurs.

According to this embodiment, the terminal 105 also displays the information on what caused the failure (the cause 1104) along with the coping method 1106 of the failure handling table 1101. The failure handler can therefore also judge the cause of the failure quickly. Furthermore, the record of the operation information table (801, 901) for the time of the failure is displayed on the terminal 105 as well, so changes in the environment that may have caused the failure can be investigated easily.

According to this embodiment, when a failure occurs, the failure pre-determination condition 1105 is stored in the failure handling table 1101 together with the coping method 1106. With this configuration, the operation information of the host machines 101A to 101D is collected automatically while the job management system is in operation, the collected operation information is checked against the failure pre-determination condition 1105, and a dangerous state of a host machine 101A to 101D can be detected in advance. Because the terminal 105 displays the coping method 1106 as well as the pre-detection result, problems similar to past failures can be prevented from occurring, or dealt with quickly when they do occur.

According to this embodiment, the contents of the failure handling table 1101 can also be updated when a failure is detected in advance. The failure pre-determination condition for the host machine 101A to 101D on which the failure was detected in advance can thus be reset to a more appropriate condition, and the accuracy of failure pre-detection can be improved while the system is in operation. The cause of a failure, the coping method, and so on can likewise be updated to more appropriate content as they accumulate.

As described above, the present invention is characterized by automatically collecting various kinds of environment information when a failure occurs, which makes investigation at the time of a failure easier. Furthermore, the present invention has a user interface for recording the criteria for judging failure causes and the coping methods found through failure investigation, and it can monitor for failure causes during operation, so that similar problems can be prevented from occurring or dealt with quickly when they do occur.

<Modifications>
The present invention is not limited to the embodiment described above and includes various modifications. For example, the terminal 105 used for failure notification and input may consist of a plurality of terminals. In the embodiment of FIG. 1, the job management manager 102, the job execution agents (103B to 103D), the failure handling system 104, and so on are installed on separate information processing apparatuses, but they may instead be installed on a single information processing apparatus.

The failure handling table 1101 may also be given a field that stores how often each record has been referenced, so that the host machines 101A to 101D on which failures occur frequently can be output to the terminal 105. The value of this frequency field may also be used to present the coping methods of more frequently referenced records with higher priority.
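This modification can be sketched as follows. The field name `ref_count`, the dictionary-based records, and both function names are assumptions; the text only states that a reference-frequency field exists and may drive the two outputs.

```python
from collections import Counter


def hosts_by_reference_count(table: list[dict]) -> list[str]:
    """Rank host machines by the total reference count of their handling records."""
    totals = Counter()
    for rec in table:
        totals[rec["host"]] += rec["ref_count"]
    return [host for host, _count in totals.most_common()]


def present_by_reference_count(matches: list[dict]) -> list[str]:
    """Return coping methods, most referenced first, and count this lookup as a reference."""
    ordered = sorted(matches, key=lambda rec: rec["ref_count"], reverse=True)
    for rec in ordered:
        rec["ref_count"] += 1  # update the frequency field on every presentation
    return [rec["coping_method"] for rec in ordered]
```

Incrementing the counter at presentation time is one possible design; counting only the records the failure handler actually applies would be an equally plausible reading.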

Furthermore, the failure handling table 1101 may be provided with a field that stores an "importance", which is entered when a failure is handled. The "importance" may also be displayed on the terminal 105 when a failure occurs, so that the person handling the failure can more easily decide the priority of the response.
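One way to picture the "importance" field is a short sketch that orders failure notices before they are shown on the terminal. The three-level scale, the class names, and the sample summaries are assumptions for illustration; the text does not define how importance is represented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Importance(IntEnum):
    """Assumed three-level scale; the source does not define the levels."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class FailureNotice:
    """A failure to be shown on the terminal (field names are hypothetical)."""
    host: str
    summary: str
    importance: Importance


def format_for_terminal(notices):
    """Return display lines, most important failure first."""
    ordered = sorted(notices, key=lambda n: n.importance, reverse=True)
    return [f"[{n.importance.name}] {n.host}: {n.summary}" for n in ordered]


# Illustrative data only.
notices = [
    FailureNotice("101C", "job ended abnormally", Importance.MEDIUM),
    FailureNotice("101B", "agent not responding", Importance.HIGH),
    FailureNotice("101D", "job finished late", Importance.LOW),
]
lines = format_for_terminal(notices)
```

Sorting by importance is only one presentation choice; showing the value next to each failure without reordering would equally satisfy the modification described above.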

The embodiments above have been described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations having all of the described elements. For part of the configuration of each embodiment, other configurations may be added, deleted, or substituted. For example, for the purpose of presenting a countermeasure based on the history of past failure handling, it suffices that the failure handling table 1101 stores at least the countermeasure information and that the failure handling unit 402 notifies the terminal 105 of at least the countermeasure. A configuration including all of the embodiments described above is a more preferable form of the present invention, and the invention can of course be configured with part of the described configuration omitted.

Those skilled in the art will also appreciate that numerous combinations of hardware, software, and firmware are suitable for practicing the present invention. For example, the program code that realizes the functions described in this embodiment can be implemented in a wide range of programming or scripting languages, such as assembler, C/C++, perl, Shell, PHP, and Java (registered trademark).

101A: Host machine
101B: Host machine
101C: Host machine
101D: Host machine
101E: Host machine
102: Job management manager
103B: Job execution agent
104: Failure handling system
105: Terminal
106: Network
201: Job schedule definition DB
202: Job execution control unit
203: Schedule calculation unit
204: Job definition reading unit
205: Job execution processing unit
206: Job definition DB
207: Execution result information DB
208: Operation information DB
209: Job schedule definition file
210: Job definition file
211: Job definition update unit
212: Operation status monitoring unit
301: Job execution unit
302: Operation status monitoring unit
303: Operation information DB
401: Failure monitoring unit
402: Failure handling unit
403: Failure prior detection unit
404: Failure information input unit
405: Failure history DB
406: Failure handling DB

Claims

A failure handling system for a job management system that causes at least one information processing apparatus to execute a job, the failure handling system comprising:
input/output means;
storage means for storing job execution result information indicating an execution result of the job, operation information indicating an operation status of the at least one information processing apparatus, failure history information indicating information on a failure that has occurred in the at least one information processing apparatus, and failure handling information indicating information on a countermeasure taken for a failure that has occurred in the at least one information processing apparatus;
job execution control means for causing the at least one information processing apparatus to execute the job and storing the execution result of the job in the storage means as the job execution result information;
operation status monitoring means for monitoring the operation status of the at least one information processing apparatus and storing information on the monitored operation status in the storage means as the operation information;
failure monitoring means for determining, based on the job execution result information, a failure that has occurred in the at least one information processing apparatus and storing information on the failure in the storage means as the failure history information; and
failure handling means for storing information on the countermeasure taken for the failure in the storage means as the failure handling information,
wherein the failure handling means acquires, from the failure handling information, information that matches the failure information of the at least one information processing apparatus determined to have a failure by the failure monitoring means, and causes the input/output means to display the acquired information.

The failure handling system according to claim 1,
wherein the storage means stores, as the failure handling information, information on a countermeasure implemented for the failure and information on the cause of the failure, and
the failure handling means causes the input/output means to display the information on the countermeasure and the information on the cause of the failure.

The failure handling system according to claim 1 or 2,
further comprising failure prior detection means for detecting in advance a failure that occurs in the at least one information processing apparatus,
wherein the storage means further stores, as the failure handling information, determination condition information for detecting the failure in advance, and
the failure prior detection means
determines whether the information on the operation status of the at least one information processing apparatus monitored by the operation status monitoring means satisfies the determination condition information, and
when the determination condition information is satisfied, causes the input/output means to display the failure handling information.

The failure handling system according to claim 3,
wherein the failure handling means receives update information for the failure handling information displayed on the input/output means, and stores the update information in the storage means as the failure handling information.

The failure handling system according to any one of claims 1 to 4,
wherein the failure handling means causes the input/output means to display, together with the acquired information, the operation information corresponding to the time at which the failure occurred.

A program for causing a computer including calculation means, storage means, and input/output means to execute failure handling processing for a job management system that causes at least one information processing apparatus to execute a job,
the program causing the calculation means to execute:
a job execution control process of causing the at least one information processing apparatus to execute the job and storing the execution result of the job in the storage means as job execution result information;
an operation status monitoring process of monitoring the operation status of the at least one information processing apparatus and storing information on the monitored operation status in the storage means as operation information;
a failure monitoring process of determining, based on the job execution result information, a failure that has occurred in the at least one information processing apparatus and storing information on the failure in the storage means as failure history information;
a failure handling process of storing information on a countermeasure taken for the failure in the storage means as failure handling information; and
a display process of acquiring, from the failure handling information, information that matches the failure information of the at least one information processing apparatus determined to have a failure by the failure monitoring process, and displaying the acquired information on the input/output means.