JP3964629B2

JP3964629B2 - Data path abnormality detection method by patrol of disk array device and computer system provided with disk array device

Info

Publication number: JP3964629B2
Application number: JP2001132947A
Authority: JP
Inventors: 進廣藤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-04-27
Filing date: 2001-04-27
Publication date: 2007-08-22
Anticipated expiration: 2021-04-27
Also published as: JP2002328850A

Description

【０００１】
【発明の属する技術分野】
本発明は、ホストコンピュータとディスクアレイ装置との間のデータパスなどの異常を検出するのに好適なディスクアレイ装置のパトロールによるデータパス異常検出方法及びディスクアレイ装置を備えたコンピュータシステムに関する。
【０００２】
【従来の技術】
複数のディスク記憶装置を有し、ホストコンピュータとの間でデータを授受するディスクアレイ装置では、ホストコンピュータとのインターフェース（ホストインターフェース）に、ＳＣＳＩ（Small Computer System Interface）等の標準化されたインターフェースが使用されるのが一般的である。このようなホストコンピュータとディスクアレイ装置とが接続されたコンピュータシステムでは、データ保護の機構として、ホストインターフェースがＳＣＳＩインターフェースの例では、パリティチェック回路を使用している。
【０００３】
【発明が解決しようとする課題】
上記したように従来は、ホストコンピュータとの間でデータを授受するディスクアレイ装置を備えたコンピュータシステムでは、データ保護の機構として、ホストインターフェースがＳＣＳＩインターフェースの例では、パリティチェック回路を使用するのが一般的であった。
【０００４】
しかしながら、パリティによる保護は、１データ単位（例えば１バイト単位）の確認であることから、データブロック（例えばセクタサイズのデータブロック）として考えた場合のデータずれ、データ抜けなどに対する検出には効果がないという問題があった。
【０００５】
そこで、データブロック単位で、ＣＲＣ（Cyclic Redundancy Check）等の冗長コード（データチェックコード）を付加する方法も考えられる。しかし、この方式では、標準化されたインターフェースの中で、特別な動作を要求することになり、動作に制限が加えられることになる。
【０００６】
本発明は上記事情を考慮してなされたものでその目的は、ホストコンピュータから転送されるライトデータ自身に、データ保護のための冗長コードがついていない場合でも、ホストコンピュータとディスクアレイ装置内のディスク記憶装置との間におけるターゲットとなるデータパスの異常によりデータ誤りを起こした場合の検出率を向上させることができるディスクアレイ装置のパトロールによるデータパス異常検出方法及びディスクアレイ装置を備えたコンピュータシステムを提供することにある。
【０００７】
【課題を解決するための手段】
本発明は、監視プログラムが動作するホストコンピュータと、複数のディスク記憶装置を含むディスクアレイ、及びキャッシュメモリを内蔵するディスクアレイコントローラから構成されたディスクアレイ装置とを備えたコンピュータシステムにおいて、上記ホストコンピュータからディスクアレイ装置に発行されたライト要求が予め設定されたタイミングに合致する場合に、当該ライト要求に応じてホストコンピュータからディスクアレイ装置に転送されてキャッシュメモリに書き込まれたライトデータの第１の冗長コードを予め設定されたサイズのデータブロックを単位に生成しておき、次にホストコンピュータからディスクアレイ装置に対し、ホストコンピュータとディスクアレイ装置との間のデータパス監視のための問い合わせが上記監視プログラムに従って上記タイミングに対応してなされた場合に、その際に生成されている第１の冗長コード及び対応するデータを指定するためのデータ指定情報をディスクアレイ装置からホストコンピュータに通知し、この通知を受けたホストコンピュータでは、通知されたデータ指定情報の指定するデータを上記キャッシュメモリからホストコンピュータに読み出して、そのデータの第２の冗長コードを上記監視プログラムに従って上記データブロックを単位に生成し、生成した第２の冗長コードを上記通知された第１の冗長コードと比較することで、少なくとも当該比較結果に基づいてホストコンピュータとディスクアレイ装置との間のデータパスの異常の有無を判定することを特徴とする。
【０００８】
このような構成において、ホストコンピュータから例えばアプリケーションプログラムに従って発行されるライト要求が予め設定されたタイミングに合致する場合にだけ、そのライト要求に応じてホストコンピュータからディスクアレイ装置に転送されてディスクアレイコントローラ内のキャッシュメモリに一時記憶されたデータの第１の冗長コードが生成される。また、ホストコンピュータからディスクアレイ装置には上記のタイミングに対応してデータパス監視のための問い合わせが送られ、この問い合わせに対して上記第１の冗長コードとデータ指定情報とがホストコンピュータに通知される。これによりホストコンピュータは、通知されたデータ指定情報の指定するデータ、即ち上記第１の冗長コードの生成に用いられた上記キャッシュメモリ上のデータを読み出して、そのデータの第２の冗長コードを生成し、上記第１の冗長コードと比較することで、ホストコンピュータとディスクアレイ装置との間のデータパス（更に詳細に述べるならば、ホストコンピュータとディスクアレイコントローラ内のキャッシュメモリとの間のデータパスの異常、つまり当該パス上のいずれかの回路の異常）を判定する。
【０００９】
このように本発明においては、ディスクアレイ装置側でのライトデータに対する第１の冗長コード生成は、監視プログラムからの設定に従ってアプリケーションプログラムから独立に行われ、またホストコンピュータ側でのディスクアレイ装置に対する問い合わせと当該問い合わせに対する応答から取得したデータ指定情報の指定するデータの読み出しと第２の冗長コードの生成、及び第１及び第２の冗長コードの比較は、監視プログラムに従って、アプリケーションプログラムとの間で通信を行うことなく当該アプリケーションプログラムから独立に行われる。
【００１０】
なお、一般には、ホストコンピュータ上で動作する監視プログラムにより当該ホストコンピュータとディスクアレイ装置との間のデータパスをチェックするには、ホストコンピュータからディスクアレイ装置に対してライト・リードコンペアテストを行うことが最も確実で簡便である。しかし、データを確実に保護するためには、アプリケーションプログラムとデータの情報を授受する処理が必要となる。これに対して本発明では、アプリケーションプログラムの動作と関係せずに、データパスの確認を行うことができる。
【００１１】
ここで、アプリケーションプログラムの動作と関係せずに、データパスの確認を行うには、上記タイミングとして、監視プログラムに従って予め設定される時間間隔またはライト要求回数の間隔を適用するとよい。
【００１２】
また、アプリケーションプログラムの動作と関係せずにデータパスが確認できるようにしていることから、第１の冗長コードの生成に用いられたデータが第２の冗長コードの生成に用いられる前に更新される可能性がある。この場合、第２の冗長コードは更新後のデータをもとに生成されることになるから、上記データパスの異常の有無に無関係に、第１及び第２の冗長コードは一致しなくなる。そこで、第１及び第２の冗長コードが不一致の場合、ホストコンピュータからディスクアレイ装置に対してキャッシュメモリ上の元のデータが更新されているか否かを問い合わせ、更新されていない旨の応答が返された場合に限り、上記データパスの異常を判定するとよい。
【００１３】
また本発明は、第１の冗長コードの生成に用いられたキャッシュメモリ上のデータの写しを、ホストコンピュータからの監視プログラムに従う専用のリード要求によってのみ読み出しが可能な当該キャッシュメモリ上の別の特定領域に保持しておき、第１の冗長コードの生成に用いられたキャッシュメモリ上の元のデータではなくて、そのデータの写し、即ち上記特定領域のデータを上記専用のリード要求を用いて読み出して、その読み出したデータから第２の冗長コードを生成するようにしたことをも特徴とする。
【００１４】
このような構成においては、第２の冗長コードの生成に用いられるデータ、即ち上記特定領域のデータがアプリケーションプログラム等に従うライト要求により更新されるのを防止できる。
【００１５】
また本発明は、ホストコンピュータから転送されてキャッシュメモリに書き込まれたライトデータをディスク記憶装置に書き込む際に、当該データの写しを当該ディスク記憶装置の別の特定領域に書き込むと共にキャッシュメモリ上の上記データをアプリケーションプログラムから見えない無効データで且つチェック用データとし、上記ディスク記憶装置の特定領域に書き込まれたデータをディスクアレイコントローラに読み出して、キャッシュメモリ上の元のデータと比較することで、ディスクアレイコントローラと前記ディスク記憶装置との間のデータパスの異常の有無を判定することをも特徴とする。ここでも、全ライトデータでなく、キャッシュメモリからディスク記憶装置へのライトデータの書き込みが予め設定された間隔（時間間隔または書き込み回数の間隔）に合致した際のライトデータについて、処理するとよい。
【００１６】
このような構成においては、ディスク記憶装置に書き込まれたキャッシュメモリ上のデータをアプリケーションプログラムから見えない無効データで且つチェック用データとして、更新から保護すると共に、ディスク記憶装置に書き込まれたデータの写しを当該ディスク記憶装置の別の特定領域に保存して、当該データの写しを更新から保護するようにしたので、つまりデータを破壊する要因をなくすようにしたので、この特定領域のデータを読み出してキャッシュメモリ上の元のデータ（チェック用データ）と比較することで、ディスクアレイコントローラとディスク記憶装置との間のデータパスの異常の有無を正しく判定することが可能となる。
【００１７】
この他に、ホストコンピュータから転送されてキャッシュメモリに書き込まれたライトデータをディスク記憶装置に書き込む際に、当該データの第３の冗長コードを生成してディスクアレイコントローラ内に保持しておく一方、この第３の冗長コードの生成に用いられたディスク記憶装置内のデータをディスクアレイコントローラに読み出して当該データの第４の冗長コードを生成し、この第４の冗長コードを先に生成されている第３の冗長コードと比較することでも、ディスクアレイコントローラとディスク記憶装置との間のデータパスの異常の有無を判定することが可能である。
【００１８】
以上のディスクアレイ装置のパトロールによるデータパス異常検出方法に係る本発明は、ホストコンピュータとディスクアレイ装置とから構成されるコンピュータシステムに係る発明、或いはディスクアレイコントローラに係る発明としても成立する。
【００１９】
【発明の実施の形態】
以下、本発明の実施の形態につき図面を参照して説明する。
【００２０】
図１は本発明の一実施形態に係るコンピュータシステムの構成を示すブロック図である。
図１のコンピュータシステムは、ディスクアレイ装置１０と当該ディスクアレイ装置１０を利用するホストコンピュータ２０とから構成される。
【００２１】
ディスクアレイ装置１０は、複数のディスク記憶装置、例えばハードディスク装置（以下、ＨＤＤと称する）１１０を備えたディスクアレイ１１と当該ディスクアレイ１１を制御するコントローラ（ディスクアレイコントローラ）１２とから構成される。
【００２２】
ディスクアレイコントローラ１２は、主制御部１２０と、ホストインターフェース１２１と、各ＨＤＤ１１０毎に設けられたディスクインターフェース１２２と、キャッシュメモリ装置１２３とを備えている。この主制御部１２０、ホストインターフェース１２１、ディスクインターフェース１２２及びキャッシュメモリ装置１２３は、データ転送用バス１２４により相互接続されている。データ転送用バス１２４は、例えば標準バスとして知られているＰＣＩバス（Peripheral Component Interconnect Bus）である。
【００２３】
主制御部１２０は、マイクロプロセッサ１２５と、当該マイクロプロセッサ１２５が実行する制御プログラム（ファームウェア）を記憶したＲＯＭ（Read Only Memory）１２６と、当該プログラムの実行時に使用されるＲＡＭ（Random Access Memory）１２７とを有するマイクロプロセッサ部である。マイクロプロセッサ１２５は、各インターフェース１２１，１２２とキャッシュメモリ装置１２３との間のデータ転送の制御を行う。
【００２４】
ホストインターフェース１２１は、ホストコンピュータ２０と接続されており、当該ホストコンピュータ２０とのインターフェースをなす。ホストインターフェース１２１は、ホストコンピュータ２０との間のデータ転送の制御と、ホストコンピュータ２０からのデータをデータ転送用バス１２４を介してキャッシュメモリ装置１２３または（マイクロプロセッサ１２５が使用する）ＲＡＭ１２７に転送する制御と、キャッシュメモリ装置１２３またはＲＡＭ１２７のデータをホストコンピュータ２０に転送する制御とを行う。
【００２５】
ディスクインターフェース１２２は、ＨＤＤ１１０と接続されており、データ転送用バス１２４とＨＤＤ１１０との間のデータ転送を行うための手順等の制御を司る。具体的には、ディスクインターフェース１２２は、ＨＤＤ１１０からのデータをキャッシュメモリ装置１２３またはＲＡＭ１２７に転送する制御と、キャッシュメモリ装置１２３またはＲＡＭ１２７のデータをＨＤＤ１１０に転送する制御とを行う。
【００２６】
キャッシュメモリ装置１２３は、ＨＤＤ１１０とホストコンピュータ２０との間の転送データを一時記憶するためのキャッシュメモリ１２８と、当該キャッシュメモリ１２８を制御するキャッシュ制御ＬＳＩ１２９とを有する。キャッシュメモリ１２８には特定領域１２８ａが確保されている。この領域１２８ａは、ＨＤＤ１１０とホストコンピュータ２０との間の転送データを記憶するためのキャッシュ領域とは別に管理される。領域１２８ａは、後述するデータアドレス、サイズ及びＣＲＣコード列を一時保持するのに用いられる。
【００２７】
ホストコンピュータ２０上では、ユーザが目的とする処理を行うために使用される各種アプリケーションプログラム２１と、サポートプログラム２２とが並行して動作する。このサポートプログラム２２は、ホストコンピュータ２０をディスクアレイ装置１０の監視及び当該ディスクアレイ装置１０に対する各種の設定を行う監視手段として機能させるための監視ソフトウェアである。
【００２８】
次に、図１の構成の動作について、ホストコンピュータ２０からのディスクアレイ装置１０を対象とするパトロールにより、当該ホストコンピュータ２０とディスクアレイ装置１０との間のデータパスの異常を検出する処理（第１のデータパス異常検出処理）と、ディスクアレイ装置１０内のディスクアレイコントローラ１２とＨＤＤ１１０との間のデータパスの異常を検出する処理（第２のデータパス異常検出処理）を例に説明する。
【００２９】
まず、データパス異常検出処理の理解を容易にするために、ホストコンピュータ２０からディスクアレイ装置１０にデータのライト／リードを要求した際の従来から知られているデータの流れについて、図２を参照して説明する。
【００３０】
＜ライトデータの流れ＞
ホストコンピュータ２０からのライト要求（ライトコマンド）、例えばアプリケーションプログラム２１に従うライト要求は、図２中のパスＡ１を介して、即ちディスクアレイ装置１０内のディスクアレイコントローラ１２のホストインターフェース１２１、及びデータ転送用バス１２４を介して主制御部１２０内のＲＡＭ１２７に転送される。
【００３１】
また、ホストコンピュータ２０からのデータ（ライトデータ）は、図２中のパスＡ２を介して、即ちディスクアレイコントローラ１２のホストインターフェース１２１、及びデータ転送用バス１２４を介してキャッシュメモリ装置１２３に転送され、当該キャッシュメモリ装置１２３内のキャッシュ制御ＬＳＩ１２９の制御によりキャッシュメモリ１２８に一時記憶される。
【００３２】
さて、図１中のディスクアレイ装置１０では、ホストコンピュータ２０からのライト要求に対し、ライトデータをキャッシュメモリ１２８に書き込んだ段階で、ライト要求の実行完了をホストコンピュータ２０に返す、いわゆるライトバックキャッシュ方式を適用している。
【００３３】
そこでディスクアレイコントローラ１２内のマイクロプロセッサ１２５は、ライト要求の実行完了がホストコンピュータ２０に返された後、当該ライト要求を解釈して、ライトデータの分割先ＨＤＤ１１０を決定する。そしてマイクロプロセッサ１２５は、ディスクアレイコントローラ１２内の対応するディスクインターフェース１２２にライト指示を送ることで、キャッシュメモリ１２８上の該当するライトデータを、図２のパス中のパスＡ３を介して、即ちデータ転送用バス１２４、ディスクインターフェース１２２を介してＨＤＤ１１０に転送させる。
【００３４】
＜リードデータの流れ＞
ホストコンピュータ２０からのリード要求（リードコマンド）は、図２中のパスＡ１を介してディスクアレイコントローラ１２内のＲＡＭ１２７に転送される。
【００３５】
ディスクアレイコントローラ１２内のマイクロプロセッサ１２５は、ホストコンピュータ２０からのリード要求を解釈し、当該要求で指定されたデータがキャッシュメモリ１２８上に存在するか否かをキャッシュ制御ＬＳＩ１２９により判定させる。
【００３６】
もし、リード要求で指定されたデータがキャッシュメモリ１２８上に存在しないならば、例えば複数のＨＤＤ１１０に分割して格納されているデータを当該ＨＤＤ１１０から読み出すように対応するディスクインターフェース１２２に指示する。
【００３７】
これによりディスクインターフェース１２２は、マイクロプロセッサ１２５から指示されたデータ、即ちホストコンピュータ２０から要求されたデータを、図２中のパスＡ３を介してＨＤＤ１１０からキャッシュメモリ１２８に転送する。このとき、後続する連続データの読み出しが続けて要求されると予測される場合には、当該後続する連続データもキャッシュメモリ１２８に転送されるのが一般的である。但し、この予測の方法は本発明に直接関係しないため説明を省略する。ＨＤＤ１１０からキャッシュメモリ１２８に転送されたデータは、図２中のパスＡ２を介してホストコンピュータ２０に転送される。
【００３８】
＜第１のデータパス異常検出処理＞
次に、本発明に関係する第１のデータパス異常検出処理について、図３を参照して説明する。
【００３９】
まずディスクアレイ装置１０では、ホストコンピュータ２０からアプリケーションプログラム２１に従うライト要求が与えられると、要求されたライトデータをキャッシュメモリ１２８に書き込むデータライトが行われる（ステップＢ１）。
【００４０】
マイクロプロセッサ１２５は、キャッシュメモリ１２８に書き込まれたライトデータを読み込んで、ホストコンピュータ２０上で動作しているサポートプログラム２２からの指定により予め定められたデータブロック（例えばＨＤＤ１１０の記録単位であるセクタのサイズに一致するデータブロック）を単位に冗長コード、例えばＣＲＣ（Cyclic Redundancy Check）コードを生成し、その冗長コードの列を対応するデータのアドレス（ここでは先頭のディスクアドレス）及びサイズと共にキャッシュメモリ１２８上の領域１２８ａに保持する（ステップＢ２）。
【００４１】
本実施形態において、上記のＣＲＣコード生成は、ホストコンピュータ２０からライト要求が送られる都度行われる訳ではなく、例えば図４（ａ）に示すように、一定時間間隔Ｔで、その際のライト要求に応じてライトされたデータに対してのみ行われる。但し、一定時間間隔Ｔのタイミングでライト要求が実行されなかった場合には、その後の最初のライト要求に応じてライトされたデータに対してＣＲＣコードが生成される。時間間隔Ｔは、サポートプログラム２２からの指定により予め設定される。なお、図４（ｂ）に示すように、一定回数（個数）のライト要求毎に（ここでは、２個のライト要求毎に）、つまり一定回数（個数）のライト要求の間隔で、その際のライト要求に応じてライトされたデータに対してＣＲＣコード生成を行うようにしても構わない。
【００４２】
ホストコンピュータ２０は、サポートプログラム２２に従い、上記一定時間間隔（または、一定回数のライト要求の間隔）で、ディスクアレイ装置１０に対してデータパス監視のための問い合わせを行う（ステップＢ３）。
これを受けてディスクアレイ装置１０内のマイクロプロセッサ１２５は、その際にキャッシュメモリ１２８上の領域１２８ａに保持されているデータアドレス、サイズ及び冗長コード列とをホストコンピュータ２０に通知する（ステップＢ４）。
【００４３】
ホストコンピュータ２０はディスクアレイ装置１０からデータアドレス、サイズ及び冗長コード列がサポートプログラム２２に通知されると、当該サポートプログラム２２に従い、通知されたデータアドレス及びサイズで指定されるデータを読み出すためのリード要求をディスクアレイ装置１０に送出する（ステップＢ５）。
【００４４】
これを受けてディスクアレイ装置１０内のマイクロプロセッサ１２５は、キャッシュ制御ＬＳＩ１２９により該当するデータをキャッシュメモリ１２８から読み出させて、データ転送用バス１２４及びホストインターフェース１２１を介してホストコンピュータ２０に転送させる（ステップＢ６）。
【００４５】
ホストコンピュータ２０は、サポートプログラム２２に従うリード要求に対してディスクアレイ装置１０から該当するデータが返された場合、サポートプログラム２２に従って、当該データを対象にデータブロック単位でＣＲＣコードを生成し、先にディスクアレイ装置１０から通知されたＣＲＣコードと比較する（ステップＢ７，Ｂ８）。
【００４６】
ホストコンピュータ２０は、上記ＣＲＣコード比較の結果、上記両コードが一致しているならば（ステップＢ９）、ホストコンピュータ２０とディスクアレイ装置１０との間のデータパス（図２中のパスＡ１に相当）上の回路は正常であると判定し、そのまま次の異常検出のタイミングを待つ。
【００４７】
これに対し、上記ＣＲＣコード比較の結果、上記両コードが一致していないならば（ステップＢ９）、その不一致が、該当するデータが上書き（更新）されたことに起因するのか、或いはディスクアレイ装置１０との間のデータパス上の回路が異常であることに起因するのかを判別するために、ホストコンピュータ２０はサポートプログラム２２に従って、ディスクアレイ装置１０に対して当該データが更新されたか否かを問い合わせる（ステップＢ１０）。
【００４８】
この問い合わせに対し、ディスクアレイ装置１０内のマイクロプロセッサ１２５は、キャッシュメモリ１２８上の該当するデータが、アプリケーションプログラム２１に従うホストコンピュータ２０からの新たなライト要求により更新（上書き）されているか否かを通知する応答をサポートプログラム２２に返す（ステップＢ１１）。
【００４９】
ホストコンピュータ２０は、この応答により、サポートプログラム２２に従うリード要求に対し、該当するデータが更新（上書き）されていないことが通知された場合（ステップＢ１２）、上記ＣＲＣコードの不一致の要因が、ホストコンピュータ２０とディスクアレイ装置１０との間のデータパスの異常、更に詳細に述べるならば当該データパス上のいずれかの回路の異常にあると判別する。この場合、ホストコンピュータ２０はディスクアレイ装置１０との間のデータパスの異常を、サポートプログラム２２に従ってディスクアレイ装置１０及びアプリケーションプログラム２１に通知し、しかる後にディスクアレイ装置１０を切り離すよう動作する（ステップＢ１３，Ｂ１４）。
【００５０】
一方、ディスクアレイ装置１０から該当するデータが更新（上書き）されていることが通知された場合には（ステップＢ１２）、ホストコンピュータ２０は上記ＣＲＣコードの不一致の要因がディスクアレイ装置１０との間のデータパスの異常にあるか否か判別不可能であるとして、そのまま、次の異常検出（パトロール）のタイミングを待つ。
【００５１】
なお、以上の説明では、ホストコンピュータ２０からのデータ更新の有無（データ更新確認）の問い合わせに対して、ディスクアレイ装置１０から応答を返すものとしたが、これに限るものではない。例えば、ステップＢ２でのＣＲＣコードの生成に用いられたデータが、ホストコンピュータ２０からサポートプログラム２２に従う当該データを対象とするリード要求を受け取るまでの間に更新された場合に、その旨を、その更新された時点で返すか、或いは当該リード要求が与えられた時点で、要求されたデータに代えて返すようにしてもよい。
【００５２】
また、ＣＲＣコードの生成に用いられたデータがアプリケーションプログラム２１により上書きされるのを防ぐために、当該データの写しをチェック（監視）用データとしてキャッシュメモリ１２８上の例えば上記領域１２８ａに、データアドレス、サイズ及びＣＲＣコード列と共に保持し、元のデータとは別管理とするようにしてもよい。但し、このようにすると、ホストコンピュータ２０からは、通常のリード要求では該当するデータ（チェック用データ）を読み出すことができない。そこで上記ステップＢ５では、専用のリード要求（リードコマンド）によって通常のリード要求と区別することにより、チェック用データを読み出すことを可能とするとよい。
【００５３】
＜第２のデータパス異常検出処理＞
次に、本発明に関係する第２のデータパス異常検出処理について、図５を参照して説明する。
【００５４】
本実施形態における第２のデータパス異常検出処理の特徴は、キャッシュメモリ１２８からＨＤＤ１１０へのデータ書き込みのうち、予め設定された間隔（ここでは時間間隔）に合致した書き込みの際に、ディスクアレイ装置１０内のマイクロプロセッサ１２５からＨＤＤ１１０に確保されたマイクロプロセッサ１２５の管理領域を対象として該当するデータの写しのライト・リードを実施することで、ディスクアレイ１１とＨＤＤ１１０との間のデータパスにおいて、データ化けのエラーが発生していないことを確認する点にある。この第２のデータパス異常検出処理の詳細は次の通りである。
【００５５】
まず、マイクロプロセッサ１２５は、キャッシュメモリ１２８上のデータをＨＤＤ１１０に書き込む（ライトバックする）（ステップＣ１）。この書き込みが上記間隔に合致している場合、マイクロプロセッサ１２５は、ＨＤＤ１１０に書き込んだデータの写しを当該ＨＤＤ１１０の別の特定領域に書き込む（ステップＣ２）。この特定領域は、ユーザに開放されない（使用させない）マイクロプロセッサ１２５の管理領域である。したがって、この管理領域上のデータが、ユーザのデータで上書きされることはない。また、マイクロプロセッサ１２５は、キャッシュメモリ１２８上の元のデータをアプリケーションプログラム２１（ユーザ）からは見えない無効データとすると共に、当該データをＨＤＤチェック用データとして保持する。
【００５６】
次にマイクロプロセッサ１２５は、ＨＤＤ１１０の特定領域から、当該特定領域に格納されているデータをＲＡＭ１２７上に確保されている作業領域（図示せず）に読み込み、キャッシュメモリ１２８上の該当する元のデータであるＨＤＤチェック用データと比較することで、その一致の有無により、キャッシュメモリ１２８とＨＤＤ１１０との間のデータパス（図２中のパスＡ３に相当）、つまりディスクアレイコントローラ１２とＨＤＤ１１０との間のデータパスが正常であるか異常であるかを判定する（ステップＣ３）。
【００５７】
ステップＣ３で異常が判定された場合、マイクロプロセッサ１２５はその旨を通知するためのステータスを、ホストコンピュータ２０からサポートプログラム２２に従って読み込み可能なように、ディスクインターフェース１２２に設定する。また、ホストコンピュータ２０からの要求に対してディスクアレイコントローラ１２が応答しない構成とすることも可能である。
【００５８】
なお、ＨＤＤ１１０の特定領域から読み出したデータをＲＡＭ１２７上の作業領域ではなくて、キャッシュメモリ１２８上の元のデータとは別領域に読み込み、この別領域上のデータとキャッシュメモリ１２８上の元のデータ（ＨＤＤチェック用データ）とをマイクロプロセッサ１２５が比較するようにしてもよい。
【００５９】
また、上記別領域上のデータとキャッシュメモリ１２８上のＨＤＤチェック用データとを、キャッシュ制御ＬＳＩ１２９により比較するようにしてもよい。このためには、マイクロプロセッサ１２５からキャッシュ制御ＬＳＩ１２９に対して、上記別領域上のデータが記憶されているキャッシュメモリ１２８上の先頭アドレスとサイズを指定すると共に、上記ＨＤＤチェック用データが記憶されているキャッシュメモリ１２８上の先頭アドレスとサイズを指定して、当該キャッシュ制御ＬＳＩ１２９を起動する構成とすればよい。このようにすると、キャッシュ制御ＬＳＩ１２９自身が、先頭アドレスとサイズで指定されたキャッシュメモリ１２８上の領域のデータを比較することが可能となる。このように、マイクロプロセッサ１２５（のプログラム処理）による比較処理に代えて、キャッシュ制御ＬＳＩ１２９（ハードウェア）による比較処理を行うことにより、処理の高速化を図ることが可能となる。
【００６０】
また、キャッシュメモリ１２８上のデータをＨＤＤ１１０に書き込んだ際に当該データのＣＲＣコードを生成すると共に、当該データをＨＤＤ１１０から読み込んだ際にもＣＲＣコードを生成して、両ＣＲＣコードを比較するようにしても、キャッシュメモリ１２８とＨＤＤ１１０との間のデータパスの異常検出が可能となる。このＣＲＣコード比較によるデータパス異常検出処理（第２のデータパス異常検出処理）の手順の詳細を図６のフローチャートを参照して説明する。
【００６１】
まずマイクロプロセッサ１２５は、キャッシュメモリ１２８上のデータをＨＤＤ１１０に書き込む（ライトバックする）際、この書き込みが上記間隔に合致している場合には、当該データのＣＲＣコードＣＲＣ１をサポートプログラム２２に従って設定されたサイズのデータブロック（ここではセクタサイズに一致するデータブロック）を単位に生成して、これをＲＡＭ１２７上の作業領域（またはキャッシュメモリ１２８上の特定領域１２８ａ）に保持する（ステップＤ１）。
【００６２】
次にマイクロプロセッサ１２５は、ステップＤ１でＨＤＤ１１０に書き込んだデータを読み込んで、当該データのＣＲＣコードＣＲＣ２を上記データブロック単位に生成する（ステップＤ２）。そしてマイクロプロセッサ１２５は、生成したＣＲＣコードＣＲＣ２を、先のＨＤＤ１１０への書き込み時に生成したＣＲＣコードＣＲＣ１と比較することで、キャッシュメモリ１２８とＨＤＤ１１０との間のデータパスが正常であるか異常であるかを判定する（ステップＤ３）。
【００６３】
なお、以上に述べた実施形態においては、ディスクアレイコントローラ１２におけるＣＲＣコードの生成がマイクロプロセッサ１２５により行われるものとして説明したが、これに限るものではない。例えばキャッシュ制御ＬＳＩ１２９に以下に述べるＣＲＣコード生成機能（１）または（２）を持たせ、当該キャッシュ制御ＬＳＩ１２９によりＣＲＣコードが生成されるようにしてもよい。
【００６４】
＜ＣＲＣコード生成機能（１）＞
ＣＲＣコード生成機能（１）は、ホストコンピュータ２０のサポートプログラム２２から指定された例えば時間間隔でデータをキャッシュメモリ１２８から読み出す際、即ちキャッシュメモリ１２８からホストコンピュータ２０へのリードデータの転送時、或いはキャッシュメモリ１２８からＨＤＤ１１０へのライトデータの転送時で、且つ上記時間間隔に合致するタイミングの転送時に、ホストコンピュータ２０のサポートプログラム２２から予め設定されたデータサイズ分のＣＲＣコードを演算し、予め指定したメモリまたはキャッシュ制御ＬＳＩ１２９のレジスタ内に、演算結果を格納しておく機能である。
【００６５】
このＣＲＣコード生成機能（１）は、例えば図６中のステップＤ１でのＣＲＣコード生成、即ちキャッシュメモリ１２８上のデータを読み出してＨＤＤ１１０に書き込む際の当該データのＣＲＣコードの生成に利用できる。
【００６６】
＜ＣＲＣコード生成機能（２）＞
ＣＲＣコード生成機能（２）は、ホストコンピュータ２０のサポートプログラム２２から指定された例えば時間間隔でデータをキャッシュメモリ１２８に書き込む際、即ちホストコンピュータ２０からキャッシュメモリ１２８へのライトデータの転送時、或いはＨＤＤ１１０からキャッシュメモリ１２８へのリードデータの転送時で、且つ上記時間間隔に合致するタイミングの転送時に、ホストコンピュータ２０のサポートプログラム２２から予め設定されたデータサイズ分のＣＲＣコードを演算し、予め指定したメモリまたはキャッシュ制御ＬＳＩ１２９のレジスタ内に、演算結果を格納しておく機能である。
【００６７】
このＣＲＣコード生成機能（２）を利用して、ホストコンピュータ２０からディスクアレイ装置１０に転送されたデータをキャッシュメモリ１２８に書き込む際に、当該データのＣＲＣコードを生成するならば、当該ＣＲＣコードを前記第１のデータパス異常検出処理におけるステップＢ４でホストコンピュータ２０（上のサポートプログラム２２）に通知することができる。
【００６８】
なお、本発明は、上記実施形態に限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。更に、上記実施形態には種々の段階の発明が含まれており、開示される複数の構成要件における適宜な組み合わせにより種々の発明が抽出され得る。例えば、実施形態に示される全構成要件から幾つかの構成要件が削除されても、発明が解決しようとする課題の欄で述べた課題が解決でき、発明の効果の欄で述べられている効果（の少なくとも１つ）が得られる場合には、この構成要件が削除された構成が発明として抽出され得る。
【００６９】
【発明の効果】
以上詳述したように本発明によれば、ディスクアレイ装置側でのライトデータに対する第１の冗長コード生成が、監視プログラムからの設定に従ってアプリケーションプログラムから独立に行われ、またホストコンピュータ側でのディスクアレイ装置に対する問い合わせと当該問い合わせに対する応答から取得したデータ指定情報の指定するデータの読み出しと第２の冗長コードの生成、及び第１及び第２の冗長コードの比較が、監視プログラムに従って、アプリケーションプログラムとの間で通信を行うことなくアプリケーションプログラムから独立に行われる。したがって本発明によれば、ホストコンピュータから転送されるライトデータ自身に、データ保護のための冗長コードがついていない場合でも、ホストコンピュータとディスクアレイコントローラとの間のデータパスの異常によりデータ誤りを起こした場合の検出率を向上させることができ、しかもアプリケーションプログラムとデータの情報を授受する必要がない。
【００７０】
また本発明によれば、ディスクアレイコントローラにおいて、ホストコンピュータから転送されてキャッシュメモリに書き込まれたライトデータをディスク記憶装置に書き込む際に、そのライトデータを利用することで、ディスクアレイコントローラとディスク記憶装置との間のデータパスの異常を監視し、当該データパスの異常によりデータ誤りを起こした場合の検出率を向上させることができる。
【００７１】
このように本発明によれば、ホストコンピュータとディスクアレイ装置内のディスク記憶装置との間におけるターゲットとなるデータパスの異常によりデータ誤りを起こした場合の検出率を向上させることができる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係るコンピュータシステムの構成を示すブロック図。
【図２】ホストコンピュータ２０からディスクアレイ装置１０にデータのライト／リードを要求した際のデータの流れを説明するための図。
【図３】同実施形態における第１のデータパス異常検出処理を説明するためのフローチャート。
【図４】上記第１のデータパス異常検出処理におけるＣＲＣコード生成条件例を説明するための図。
【図５】同実施形態における第２のデータパス異常検出処理を説明するためのフローチャート。
【図６】上記第２のデータパス異常検出処理の変形例を説明するためのフローチャート。
【符号の説明】
１０…ディスクアレイ装置
１２…ディスクアレイコントローラ
２０…ホストコンピュータ
２１…アプリケーションプログラム
２２…サポートプログラム（監視プログラム）
１１０…ＨＤＤ（ディスク記憶装置）
１２０…主制御部
１２１…ホストインターフェース
１２２…ディスクインターフェース
１２３…キャッシュメモリ装置
１２４…データ転送用バス
１２５…マイクロプロセッサ
１２６…ＲＯＭ
１２７…ＲＡＭ
１２８…キャッシュメモリ
１２８ａ…特定領域
１２９…キャッシュ制御ＬＳＩ
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a method of detecting data path abnormalities by patrolling a disk array device, suitable for detecting abnormalities in, for example, the data path between a host computer and the disk array device, and to a computer system provided with such a disk array device.
[0002]
[Prior art]
In a disk array device that has multiple disk storage devices and exchanges data with a host computer, a standardized interface such as SCSI (Small Computer System Interface) is generally used as the interface with the host computer (the host interface). In a computer system in which such a host computer and disk array device are connected, a parity check circuit is used as the data protection mechanism when, for example, the host interface is a SCSI interface.
[0003]
[Problems to be solved by the invention]
As described above, in a conventional computer system including a disk array device that exchanges data with a host computer, a parity check circuit has generally been used as the data protection mechanism when, for example, the host interface is a SCSI interface.
[0004]
However, parity protection checks one data unit at a time (for example, one byte). It is therefore ineffective at detecting errors such as data shifts or missing data when the data is considered as a data block (for example, a sector-sized data block).
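The limitation described above can be illustrated with a minimal sketch. It assumes a simplified bus model in which every byte travels with its own parity bit, and it uses CRC-32 from Python's `zlib` module as a stand-in for a block-level redundant code; the function names and the fault (one byte lost, a zero byte padding the block) are hypothetical and are not taken from the patent.

```python
import zlib

def byte_parity(value: int) -> int:
    """Parity bit for one byte: 1 if the byte has an odd number of set bits."""
    return bin(value).count("1") & 1

def transmit(block: bytes) -> list[tuple[int, int]]:
    """Model the bus: every byte travels together with its own parity bit."""
    return [(b, byte_parity(b)) for b in block]

def parity_ok(frames: list[tuple[int, int]]) -> bool:
    """Per-byte check: each byte is verified only against its own parity bit."""
    return all(byte_parity(b) == p for b, p in frames)

sent = bytes([1, 2, 3, 4, 5, 6, 7, 8])
frames = transmit(sent)

# Assumed fault: the byte at index 5 is lost and a zero byte pads the block.
damaged = frames[:5] + frames[6:] + [(0, byte_parity(0))]
received = bytes(b for b, _ in damaged)

# Every surviving byte still carries a valid parity bit, so parity passes.
parity_passes = parity_ok(damaged)
# A code computed over the whole block sees that the content has changed.
crc_matches = zlib.crc32(received) == zlib.crc32(sent)
```

In this model the per-byte check cannot notice the omission because no individual byte was altered, whereas the block-level code changes as soon as the block content shifts.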
[0005]
Therefore, a method of adding a redundant code (data check code) such as a CRC (Cyclic Redundancy Check) to each data block is also conceivable. However, this method requires special operations within a standardized interface, which places restrictions on operation.
[0006]
The present invention has been made in view of the above circumstances. Its object is to provide a data path abnormality detection method by patrol of a disk array device, and a computer system provided with the disk array device, that can improve the detection rate of data errors caused by an abnormality in the target data path between the host computer and the disk storage devices in the disk array device, even when the write data transferred from the host computer carries no redundant code for data protection.
[0007]
[Means for Solving the Problems]
The present invention is directed to a computer system comprising a host computer on which a monitoring program runs, and a disk array device that consists of a disk array including a plurality of disk storage devices and a disk array controller with a built-in cache memory. When a write request issued from the host computer to the disk array device matches a preset timing, a first redundant code is generated, in units of data blocks of a preset size, for the write data that was transferred from the host computer to the disk array device and written to the cache memory in response to that write request. Next, when the host computer, in accordance with the monitoring program and at a time corresponding to the above timing, sends the disk array device an inquiry for monitoring the data path between the host computer and the disk array device, the disk array device notifies the host computer of the first redundant code generated at that point and of data designation information that designates the corresponding data. On receiving this notification, the host computer reads the data designated by the notified data designation information from the cache memory, generates a second redundant code for that data in units of the data block in accordance with the monitoring program, and compares the generated second redundant code with the notified first redundant code. The presence or absence of an abnormality in the data path between the host computer and the disk array device is determined based at least on the result of this comparison.
[0008]
In such a configuration, the first redundant code is generated only when a write request issued from the host computer, for example in accordance with an application program, matches the preset timing. It is generated for the data that was transferred from the host computer to the disk array device in response to that write request and temporarily stored in the cache memory in the disk array controller. In addition, an inquiry for data path monitoring is sent from the host computer to the disk array device at a time corresponding to the above timing, and in response to this inquiry the first redundant code and the data designation information are notified to the host computer. The host computer then reads the data designated by the notified data designation information, that is, the data on the cache memory that was used to generate the first redundant code, generates a second redundant code for that data, and compares it with the first redundant code. In this way it determines whether there is an abnormality in the data path between the host computer and the disk array device (more precisely, in the data path between the host computer and the cache memory in the disk array controller, that is, in any circuit on that path).
[0009]
As described above, in the present invention, generation of the first redundant code for the write data on the disk array device side is performed independently of the application program, in accordance with settings from the monitoring program. Likewise, on the host computer side, the inquiry to the disk array device, the reading of the data designated by the data designation information obtained from the response to that inquiry, the generation of the second redundant code, and the comparison of the first and second redundant codes are all performed in accordance with the monitoring program, independently of the application program and without communicating with it.
[0010]
In general, the most reliable and simplest way for a monitoring program running on the host computer to check the data path between the host computer and the disk array device is for the host computer to perform a write/read compare test on the disk array device. However, to protect data reliably, that approach requires processing for exchanging information about the data with the application program. In contrast, the present invention can check the data path independently of the operation of the application program.
[0011]
Here, in order to check the data path independently of the operation of the application program, the timing is preferably a time interval, or an interval counted in write requests, that is set in advance in accordance with the monitoring program.
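The two kinds of timing mentioned above can be sketched as a small helper that decides, for each write request, whether a redundant code should be generated. This is only an illustration: the class name `SamplingTrigger`, its parameters, and the use of seconds measured from time zero are assumptions. For the time-based case it follows the behaviour described in the embodiment, where the first write request after an elapsed interval is the one that is sampled.

```python
from typing import Optional

class SamplingTrigger:
    """Decides which write requests get a redundant code (hypothetical helper)."""

    def __init__(self, interval_s: Optional[float] = None,
                 every_n: Optional[int] = None):
        if (interval_s is None) == (every_n is None):
            raise ValueError("set exactly one of interval_s or every_n")
        self.interval_s = interval_s
        self.every_n = every_n
        self._next_due = interval_s  # first sampling point, measured from t = 0
        self._count = 0

    def on_write(self, now: float) -> bool:
        """Return True if the write request arriving at time `now` is sampled."""
        if self.every_n is not None:
            # Count-based timing: sample every N-th write request.
            self._count += 1
            if self._count == self.every_n:
                self._count = 0
                return True
            return False
        # Time-based timing: sample the first write at or after a sampling point.
        if now >= self._next_due:
            # Skip sampling points that passed while no write arrived.
            while self._next_due <= now:
                self._next_due += self.interval_s
            return True
        return False
```

For example, with `SamplingTrigger(interval_s=10.0)` and writes arriving at 3, 9, 12 and 14 seconds, only the write at 12 seconds is sampled; with `SamplingTrigger(every_n=2)`, every second write is sampled.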
[0012]
Because the data path can be checked independently of the operation of the application program, the data used to generate the first redundant code may be updated before it is used to generate the second redundant code. In that case, the second redundant code is generated from the updated data, so the first and second redundant codes will not match regardless of whether the data path is abnormal. Therefore, when the first and second redundant codes do not match, the host computer preferably asks the disk array device whether the original data in the cache memory has been updated, and determines that the data path is abnormal only when a response indicating that the data has not been updated is returned.
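The decision rule in the paragraph above can be written down as a short sketch. The names `Verdict` and `judge` are hypothetical, and the inquiry to the disk array device is modelled as a callable so that it is made only when the codes disagree.

```python
from enum import Enum
from typing import Callable, Sequence

class Verdict(Enum):
    PATH_OK = "path ok"
    PATH_ABNORMAL = "path abnormal"
    INCONCLUSIVE = "inconclusive"  # data was overwritten; wait for the next patrol

def judge(first_codes: Sequence[int],
          second_codes: Sequence[int],
          data_was_updated: Callable[[], bool]) -> Verdict:
    """Compare the two redundant code sequences and classify the result.

    `data_was_updated` stands in for the update inquiry sent to the disk
    array device; it is called only when the codes do not match.
    """
    if list(first_codes) == list(second_codes):
        return Verdict.PATH_OK
    if data_was_updated():
        # The mismatch may simply reflect a newer write, so no judgment is made.
        return Verdict.INCONCLUSIVE
    return Verdict.PATH_ABNORMAL
```

Keeping the update inquiry lazy mirrors the described flow, in which the host computer asks about updates only after a mismatch has been found.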
[0013]
The present invention is also characterized in that a copy of the data on the cache memory used to generate the first redundant code is held in a separate specific area of the cache memory that can be read only by a dedicated read request issued from the host computer in accordance with the monitoring program. Instead of the original data on the cache memory used to generate the first redundant code, the copy, that is, the data in the specific area, is read using the dedicated read request, and the second redundant code is generated from the data thus read.
[0014]
In such a configuration, the data used to generate the second redundant code, that is, the data in the specific area, can be prevented from being updated by a write request issued in accordance with an application program or the like.
[0015]
The present invention is further characterized in that, when write data that was transferred from the host computer and written to the cache memory is written to a disk storage device, a copy of the data is written to a separate specific area of that disk storage device, and the data on the cache memory is made invalid data that is invisible to the application program and is kept as check data. The data written to the specific area of the disk storage device is then read back to the disk array controller and compared with the original data on the cache memory to determine whether there is an abnormality in the data path between the disk array controller and the disk storage device. Here too, rather than processing all write data, it is preferable to process only the write data whose write from the cache memory to the disk storage device matches a preset interval (a time interval or an interval counted in writes).
[0016]
In such a configuration, the data on the cache memory that has been written to the disk storage device is protected from updates by being made invalid data that is invisible to the application program and kept as check data, and the copy of the data written to the disk storage device is protected from updates by being stored in a separate specific area of that disk storage device. In other words, the factors that could corrupt the data are eliminated. Consequently, by reading the data in this specific area and comparing it with the original data (check data) on the cache memory, it is possible to correctly determine whether there is an abnormality in the data path between the disk array controller and the disk storage device.
[0017]
Alternatively, when write data that was transferred from the host computer and written to the cache memory is written to a disk storage device, a third redundant code for the data may be generated and held in the disk array controller. The data in the disk storage device that was used to generate the third redundant code is then read back to the disk array controller, a fourth redundant code is generated for it, and the fourth redundant code is compared with the previously generated third redundant code. This also makes it possible to determine whether there is an abnormality in the data path between the disk array controller and the disk storage device.
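The comparison of the third and fourth redundant codes can be sketched as follows. This is an illustrative model only: the disk path is represented by two caller-supplied functions, CRC-32 from `zlib` stands in for the redundant code, and all names (`block_crcs`, `write_back_and_verify`, `write_to_disk`, `read_from_disk`) are assumptions rather than anything defined in the patent.

```python
import zlib
from typing import Callable

def block_crcs(data: bytes, block_size: int) -> list[int]:
    """One CRC-32 value per data block of `block_size` bytes."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def write_back_and_verify(cache_data: bytes,
                          block_size: int,
                          write_to_disk: Callable[[bytes], None],
                          read_from_disk: Callable[[int], bytes]) -> bool:
    """Return True if the controller-to-disk path looks normal.

    The third code is computed from the cache data at write time, the
    fourth from the data read back from the disk, and the two are compared.
    """
    third = block_crcs(cache_data, block_size)
    write_to_disk(cache_data)
    read_back = read_from_disk(len(cache_data))
    fourth = block_crcs(read_back, block_size)
    return third == fourth
```

A caller would pass functions that move the data across the path under test; if the path flips even a single bit, the codes for the affected block differ and the function returns `False`.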
[0018]
The present invention, described above as a data path abnormality detection method by patrol of a disk array device, also holds as an invention relating to a computer system composed of a host computer and a disk array device, or as an invention relating to a disk array controller.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0020]
FIG. 1 is a block diagram showing a configuration of a computer system according to an embodiment of the present invention.
The computer system of FIG. 1 includes a disk array device 10 and a host computer 20 that uses the disk array device 10.
[0021]
The disk array device 10 includes a disk array 11 having a plurality of disk storage devices, for example, hard disk devices (hereinafter referred to as HDDs) 110, and a controller (disk array controller) 12 that controls the disk array 11.
[0022]
The disk array controller 12 includes a main control unit 120, a host interface 121, a disk interface 122 provided for each HDD 110, and a cache memory device 123. The main control unit 120, host interface 121, disk interfaces 122, and cache memory device 123 are interconnected by a data transfer bus 124. The data transfer bus 124 is, for example, a PCI (Peripheral Component Interconnect) bus, which is known as a standard bus.
[0023]
The main control unit 120 is a microprocessor unit that includes a microprocessor 125, a ROM (Read Only Memory) 126 storing a control program (firmware) executed by the microprocessor 125, and a RAM (Random Access Memory) 127 used when the program is executed. The microprocessor 125 controls data transfer between the interfaces 121 and 122 and the cache memory device 123.
[0024]
The host interface 121 is connected to the host computer 20 and serves as the interface with the host computer 20. The host interface 121 controls data transfer with the host computer 20, controls the transfer of data from the host computer 20 to the cache memory device 123 or the RAM 127 (used by the microprocessor 125) via the data transfer bus 124, and controls the transfer of data in the cache memory device 123 or the RAM 127 to the host computer 20.
[0025]
The disk interface 122 is connected to the HDD 110, and controls a procedure for performing data transfer between the data transfer bus 124 and the HDD 110. Specifically, the disk interface 122 performs control to transfer data from the HDD 110 to the cache memory device 123 or the RAM 127 and control to transfer data in the cache memory device 123 or RAM 127 to the HDD 110.
[0026]
The cache memory device 123 includes a cache memory 128 for temporarily storing transfer data between the HDD 110 and the host computer 20, and a cache control LSI 129 that controls the cache memory 128. The cache memory 128 has a specific area 128a. This area 128a is managed separately from a cache area for storing transfer data between the HDD 110 and the host computer 20. The area 128a is used to temporarily hold a data address, a size, and a CRC code string to be described later.
[0027]
On the host computer 20, various application programs 21 used for performing processing intended by the user and the support program 22 operate in parallel. The support program 22 is monitoring software for causing the host computer 20 to function as monitoring means for monitoring the disk array device 10 and making various settings for the disk array device 10.
[0028]
Next, the operation of the configuration of FIG. 1 will be described using two examples: processing in which the host computer 20 patrols the disk array device 10 to detect an abnormality in the data path between the host computer 20 and the disk array device 10 (first data path abnormality detection processing), and processing that detects an abnormality in the data path between the disk array controller 12 in the disk array device 10 and the HDDs 110 (second data path abnormality detection processing).
[0029]
First, to make the data path abnormality detection processing easier to understand, the conventionally known data flow when the host computer 20 requests a data write or read from the disk array device 10 will be described with reference to FIG. 2.
[0030]
<Flow of write data>
A write request (write command) from the host computer 20, for example a write request issued in accordance with the application program 21, is transferred to the RAM 127 in the main control unit 120 via the path A1 in FIG. 2, that is, via the host interface 121 of the disk array controller 12 in the disk array device 10 and the data transfer bus 124.
[0031]
Data (write data) from the host computer 20 is transferred to the cache memory device 123 via the path A2 in FIG. 2, that is, the host interface 121 of the disk array controller 12 and the data transfer bus 124. The data is temporarily stored in the cache memory 128 under the control of the cache control LSI 129 in the cache memory device 123.
[0032]
The disk array device 10 in FIG. 1 uses a so-called write-back cache scheme: in response to a write request from the host computer 20, it reports completion of the write request to the host computer 20 as soon as the write data has been written to the cache memory 128.
[0033]
Therefore, after completion of the write request has been reported to the host computer 20, the microprocessor 125 in the disk array controller 12 interprets the write request and determines the HDDs 110 across which the write data is to be divided. The microprocessor 125 then sends a write instruction to the corresponding disk interface 122 in the disk array controller 12, causing the relevant write data on the cache memory 128 to be transferred to the HDD 110 via the path A3 in FIG. 2, that is, via the data transfer bus 124 and the disk interface 122.
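The write-back behaviour described above, in which completion is reported once the data is in the cache and the transfer to an HDD happens afterwards, can be modelled with a minimal sketch. The class `WriteBackController`, its method names, and in particular the placement rule (address modulo the number of HDDs) are assumptions made purely for illustration; the source does not specify how the destination HDDs are chosen.

```python
class WriteBackController:
    """Toy model of a write-back cache in front of several HDDs (hypothetical)."""

    def __init__(self, num_hdds: int):
        self.cache: dict[int, bytes] = {}   # address -> data held in cache
        self.dirty: list[int] = []          # addresses not yet written to an HDD
        self.hdds: list[dict[int, bytes]] = [dict() for _ in range(num_hdds)]

    def write(self, address: int, data: bytes) -> str:
        """Store the data in the cache and report completion immediately."""
        self.cache[address] = data
        if address not in self.dirty:
            self.dirty.append(address)
        return "complete"

    def flush(self) -> int:
        """Later step: move dirty cache data to the HDDs; return the count."""
        flushed = 0
        while self.dirty:
            address = self.dirty.pop(0)
            # Assumed placement rule, for illustration only.
            hdd = self.hdds[address % len(self.hdds)]
            hdd[address] = self.cache[address]
            flushed += 1
        return flushed
```

In this model a call to `write` returns `"complete"` while the HDDs are still empty, and the data reaches an HDD only when `flush` runs, which is the window the second data path abnormality detection process is concerned with.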
[0034]
<Flow of read data>
A read request (read command) from the host computer 20 is transferred to the RAM 127 in the disk array controller 12 via the path A1 in FIG. 2.
[0035]
The microprocessor 125 in the disk array controller 12 interprets the read request from the host computer 20 and causes the cache control LSI 129 to determine whether or not the data specified by the request exists on the cache memory 128.
[0036]
If the data specified by the read request does not exist on the cache memory 128, the microprocessor 125 instructs the corresponding disk interface 122 to read the data, which is stored, for example, across a plurality of HDDs 110, from the HDD 110.
[0037]
The disk interface 122 then transfers the data designated by the microprocessor 125, that is, the data requested by the host computer 20, from the HDD 110 to the cache memory 128 via the path A3 in FIG. 2. At this time, if it is predicted that the data following the requested data will also be read, that subsequent data is generally transferred to the cache memory 128 as well. Since this prediction method is not directly related to the present invention, its description is omitted. The data transferred from the HDD 110 to the cache memory 128 is transferred to the host computer 20 via the path A2 in FIG. 2.
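As a rough illustration of the read flow, the following Python sketch stages data into the cache on a miss and always serves the host from the cache. The function `read` and the plain dictionaries are hypothetical; read-ahead prediction is deliberately left out, as in the text.

```python
def read(address, cache, disk):
    """Return the data for address, staging it into the cache on a miss."""
    if address not in cache:
        # Cache miss: path A3, HDD -> cache memory.
        cache[address] = disk[address]
    # Path A2: cache memory -> host.
    return cache[address]
```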
[0038]
<First data path error detection process>
Next, the first data path abnormality detection process related to the present invention will be described with reference to the flowchart of FIG. 3.
[0039]
First, in the disk array device 10, when a write request according to the application program 21 is given from the host computer 20, data write for writing the requested write data into the cache memory 128 is performed (step B1).
[0040]
The microprocessor 125 reads the write data written in the cache memory 128 and generates a redundant code, for example a CRC (Cyclic Redundancy Check) code, in units of data blocks whose size (for example, the size of a sector, which is the recording unit of the HDD 110) has been set in advance by designation from the support program 22 operating on the host computer 20. It then holds the resulting redundant code string, together with the address of the corresponding data (here, the start disk address) and its size, in the specific area 128a on the cache memory 128 (step B2).
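Step B2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not specify a CRC polynomial, so CRC-32 from the standard `zlib` module is used, a 512-byte sector is assumed as the block size, and the names `block_crcs` and `make_check_record` are hypothetical.

```python
import zlib

SECTOR_SIZE = 512  # assumed block size; the text only says "for example, a sector"


def block_crcs(data: bytes, block_size: int = SECTOR_SIZE) -> list[int]:
    """Return one CRC-32 value per block_size chunk of data."""
    return [
        zlib.crc32(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def make_check_record(address: int, data: bytes) -> dict:
    """Build the record held in area 128a: address, size, and CRC code string."""
    return {"address": address, "size": len(data), "crcs": block_crcs(data)}
```

Because each code covers a whole block, two blocks that arrive swapped or shifted produce a different code string even when every individual byte would pass a per-byte parity check.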
[0041]
In the present embodiment, CRC code generation is not performed every time a write request is sent from the host computer 20. Instead, as shown for example in FIG. 4, it is performed only on data written in response to a write request executed at the timing of a fixed time interval T. However, if no write request is executed at that timing, a CRC code is generated for the data written in response to the first subsequent write request. The time interval T is set in advance by designation from the support program 22. Alternatively, as shown in FIG. 4B, CRC code generation may be performed at intervals of a predetermined number of write requests (here, every two write requests), on the data written in response to the write request concerned.
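The two timing conditions can be sketched as a small Python class. The class `CrcTrigger` and its parameters `interval_t` and `every_n` are hypothetical names; the sketch assumes that deadlines fall at T, 2T, 3T and so on, and that a write arriving at or after a deadline is the "first subsequent write request" for that deadline.

```python
class CrcTrigger:
    """Decide, per write request, whether a CRC code should be generated."""

    def __init__(self, interval_t=None, every_n=None):
        if (interval_t is None) == (every_n is None):
            raise ValueError("set exactly one of interval_t and every_n")
        self.interval_t = interval_t
        self.every_n = every_n
        self.next_due = interval_t  # next deadline for the time-based mode
        self.count = 0              # writes seen since the last CRC generation

    def should_generate(self, now):
        """Call once per write request; now is the time of that request."""
        if self.every_n is not None:
            # Count-based mode: every every_n-th write request.
            self.count += 1
            if self.count == self.every_n:
                self.count = 0
                return True
            return False
        # Time-based mode: the first write at or after the deadline.
        if now >= self.next_due:
            while self.next_due <= now:
                self.next_due += self.interval_t
            return True
        return False
```

With `interval_t=10`, writes at times 3, 9, 12, 14 and 31 trigger CRC generation only at 12 and 31; with `every_n=2`, every second write triggers it.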
[0042]
In accordance with the support program 22, the host computer 20 makes an inquiry for data path monitoring to the disk array device 10 at the fixed time interval (or at a fixed number of write request intervals) (step B3).
In response to this, the microprocessor 125 in the disk array device 10 notifies the host computer 20 of the data address, size, and redundant code string held in the area 128a on the cache memory 128 (step B4).
[0043]
When the data address, size, and redundant code string are notified from the disk array device 10 to the support program 22, the host computer 20 sends to the disk array device 10, according to the support program 22, a read request for the data specified by the notified data address and size (step B5).
[0044]
In response to this, the microprocessor 125 in the disk array device 10 causes the cache control LSI 129 to read the corresponding data from the cache memory 128 and transfer it to the host computer 20 via the data transfer bus 124 and the host interface 121 (step B6).
[0045]
When the corresponding data is returned from the disk array device 10 in response to the read request issued according to the support program 22, the host computer 20 generates a CRC code for the data in units of data blocks according to the support program 22, and compares it with the CRC code previously notified from the disk array device 10 (steps B7 and B8).
[0046]
If the two codes match as a result of the CRC code comparison (step B9), the host computer 20 determines that the circuits on the data path between the host computer 20 and the disk array device 10 (corresponding to the path A1 in FIG. 2) are normal, and waits for the next abnormality detection timing.
[0047]
On the other hand, if the two codes do not match as a result of the CRC code comparison (step B9), the host computer 20 must determine whether the mismatch was caused by overwriting (updating) of the corresponding data or by an abnormality in a circuit on the data path to the disk array device 10. To do so, the host computer 20 inquires of the disk array device 10, according to the support program 22, whether or not the data has been updated (step B10).
[0048]
In response to this inquiry, the microprocessor 125 in the disk array device 10 returns to the support program 22 a response indicating whether or not the corresponding data in the cache memory 128 has been updated (overwritten) by a new write request issued from the host computer 20 according to the application program 21 (step B11).
[0049]
If this response indicates that the corresponding data, that is, the data read by the read request issued according to the support program 22, has not been updated (overwritten) (step B12), the host computer 20 determines that there is an abnormality in the data path between the host computer 20 and the disk array device 10, more specifically an abnormality in one of the circuits on the data path. In this case, the host computer 20 notifies the disk array device 10 and the application program 21 of the data path abnormality according to the support program 22, and then performs an operation to disconnect the disk array device 10 (steps B13 and B14).
[0050]
On the other hand, when the disk array device 10 reports that the corresponding data has been updated (overwritten) (step B12), the host computer 20 cannot determine whether the CRC code mismatch was caused by an abnormality in the data path to the disk array device 10, and therefore waits for the next abnormality detection (patrol) timing.
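The host-side decision in steps B7 to B12 can be summarized in a short Python sketch. The names `Verdict`, `block_crcs`, and `judge_data_path` are hypothetical, CRC-32 from `zlib` stands in for the unspecified CRC code, and the update inquiry of steps B10 and B11 is modeled as a callable that is invoked only when the codes do not match.

```python
import zlib
from enum import Enum


class Verdict(Enum):
    NORMAL = "normal"              # codes match
    ABNORMAL = "abnormal"          # mismatch and the data was not updated
    UNDETERMINED = "undetermined"  # mismatch, but the data was overwritten


def block_crcs(data, block_size):
    """Return one CRC-32 value per block_size chunk of data."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def judge_data_path(notified_crcs, read_data, data_was_updated, block_size=512):
    """Compare the notified codes with codes computed from the data read back."""
    host_crcs = block_crcs(read_data, block_size)  # step B7
    if host_crcs == notified_crcs:                 # steps B8, B9
        return Verdict.NORMAL
    if data_was_updated():                         # steps B10 to B12
        return Verdict.UNDETERMINED                # wait for the next patrol
    return Verdict.ABNORMAL                        # leads to steps B13, B14
```

A one-byte slip in the returned data, for example, changes the first block's code, so the verdict is `ABNORMAL` unless the array reports that the application overwrote the data in the meantime.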
[0051]
In the above description, the disk array device 10 returns a response to an inquiry from the host computer 20 about whether the data has been updated (data update confirmation), but the present invention is not limited to this. For example, when the data used to generate the CRC code in step B2 is updated before a read request for that data is received from the host computer 20 according to the support program 22, the fact that the data has been updated may be reported at the time of the update, or may be returned in place of the requested data when the read request is given.
[0052]
In addition, to prevent the data used to generate the CRC code from being overwritten by the application program 21, a copy of the data may be held as check (monitoring) data, for example together with the data address, size, and CRC code string, and managed separately from the original data. In this case, however, the host computer 20 cannot read the corresponding data (check data) with a normal read request. Therefore, in step B5, it is preferable that the check data be read by a dedicated read request (read command) that is distinguished from a normal read request.
[0053]
<Second data path error detection process>
Next, the second data path abnormality detection process related to the present invention will be described with reference to the flowchart of FIG. 5.
[0054]
The feature of the second data path abnormality detection process in the present embodiment is as follows. Among the data writes from the cache memory 128 to the HDD 110, at a write that matches a preset interval (here, a time interval), the microprocessor 125 in the disk array device 10 writes data to and reads data from a management area of the microprocessor 125 reserved in the HDD 110, and thereby confirms that no error such as garbled data has occurred on the data path between the disk array controller 12 and the HDD 110. The details of the second data path abnormality detection process are as follows.
[0055]
First, the microprocessor 125 writes (writes back) the data on the cache memory 128 to the HDD 110 (step C1). If this write matches the above interval, the microprocessor 125 writes a copy of the data written to the HDD 110 into another specific area of the HDD 110 (step C2). This specific area is a management area of the microprocessor 125 that is not made available to (not used by) users. Therefore, the data in the management area is never overwritten with user data. The microprocessor 125 sets the original data on the cache memory 128 as invalid data that cannot be seen by the application program 21 (user), and holds it as HDD check data.
[0056]
Next, the microprocessor 125 reads the data stored in the specific area of the HDD 110 into a work area (not shown) secured on the RAM 127, and compares it with the corresponding original data on the cache memory 128, that is, with the HDD check data. Based on this comparison, the microprocessor 125 determines whether the data path between the cache memory 128 and the HDD 110 (corresponding to the path A3 in FIG. 2), that is, the data path between the disk array controller 12 and the HDD 110, is normal or abnormal (step C3).
[0057]
If an abnormality is determined in step C3, the microprocessor 125 sets a status indicating the abnormality in the disk interface 122 so that the status can be read from the host computer 20 according to the support program 22. A configuration in which the disk array controller 12 stops responding to requests from the host computer 20 is also possible.
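Steps C2 and C3 amount to a write, read-back, and compare cycle against a reserved area. The Python sketch below illustrates this; `ManagementArea` is a hypothetical stand-in for the reserved area of the HDD 110, and its `fault` parameter is an assumption added only so that a faulty data path can be simulated.

```python
class ManagementArea:
    """Stand-in for the reserved HDD area; fault optionally corrupts writes."""

    def __init__(self, fault=None):
        self._stored = b""
        self._fault = fault or (lambda data: data)

    def write(self, data: bytes) -> None:
        # The transfer to the disk passes through the (possibly faulty) path.
        self._stored = self._fault(data)

    def read(self) -> bytes:
        return self._stored


def disk_path_is_normal(check_data: bytes, area: ManagementArea) -> bool:
    """Write a copy to the management area, read it back, and compare."""
    area.write(check_data)          # step C2: copy into the management area
    read_back = area.read()         # read into a work area
    return read_back == check_data  # step C3: True means the path is normal
```

Because the management area is never overwritten by user data, a mismatch here cannot be explained by an application update, which is the ambiguity the first process has to resolve with a separate inquiry.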
[0058]
Note that the data read from the specific area of the HDD 110 may be read not into the work area on the RAM 127 but into an area on the cache memory 128 separate from that of the original data, and the microprocessor 125 may compare the data in this separate area with the original data (HDD check data) on the cache memory 128.
[0059]
Further, the cache control LSI 129 may compare the data in the separate area with the HDD check data on the cache memory 128. For this purpose, the microprocessor 125 may activate the cache control LSI 129 after designating to it the start address and size of the area on the cache memory 128 in which the data of the separate area is stored, and the start address and size of the area on the cache memory 128 in which the HDD check data is stored. In this way, the cache control LSI 129 itself can compare the data in the areas on the cache memory 128 specified by the start addresses and sizes. Replacing the comparison by the microprocessor 125 (program processing) with a comparison by the cache control LSI 129 (hardware) makes it possible to increase the processing speed.
[0060]
Alternatively, a CRC code of the data may be generated when the data on the cache memory 128 is written to the HDD 110, and a CRC code may also be generated when the data is read from the HDD 110; an abnormality in the data path between the cache memory 128 and the HDD 110 can also be detected by comparing the two CRC codes. The details of the data path abnormality detection process by CRC code comparison (a modification of the second data path abnormality detection process) will be described with reference to the flowchart of FIG. 6.
[0061]
First, when the microprocessor 125 writes (writes back) the data on the cache memory 128 to the HDD 110 and this write matches the above interval, it generates the CRC code CRC1 of the data in units of data blocks of the size set according to the support program 22 (here, data blocks matching the sector size), and holds it in a work area on the RAM 127 (or in the specific area 128a on the cache memory 128) (step D1).
[0062]
Next, the microprocessor 125 reads the data written to the HDD 110 in step D1, and generates a CRC code CRC2 of the data for each data block (step D2). The microprocessor 125 compares the generated CRC code CRC2 with the CRC code CRC1 generated at the time of the earlier write to the HDD 110, and thereby determines whether the data path between the cache memory 128 and the HDD 110 is normal or abnormal (step D3).
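Steps D1 to D3 can be sketched in Python as follows. The function names are hypothetical, CRC-32 from `zlib` is assumed in place of the unspecified CRC code, and the disk is modeled by two callables, `hdd_write` and `hdd_read`, so that a faulty path can be substituted.

```python
import zlib


def block_crcs(data, block_size=512):
    """Return one CRC-32 value per block_size chunk of data."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def write_back_and_verify(data, hdd_write, hdd_read, block_size=512):
    """Return True if the codes before the write and after the read agree."""
    crc1 = block_crcs(data, block_size)        # step D1: code at write time
    hdd_write(data)
    crc2 = block_crcs(hdd_read(), block_size)  # step D2: code at read time
    return crc1 == crc2                        # step D3: True means normal
```

Compared with keeping a full copy as check data, this variant only needs to retain the short code string between the write and the read-back.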
[0063]
In the above-described embodiment, the CRC code generation in the disk array controller 12 has been described as being performed by the microprocessor 125. However, the present invention is not limited to this. For example, the cache control LSI 129 may have the CRC code generation function (1) or (2) described below, and the cache control LSI 129 may generate a CRC code.
[0064]
<CRC code generation function (1)>
The CRC code generation function (1) operates when data is read from the cache memory 128, that is, when read data is transferred from the cache memory 128 to the host computer 20 or when write data is transferred from the cache memory 128 to the HDD 110, at a timing matching, for example, a time interval designated by the support program 22 of the host computer 20. The function calculates a CRC code for each unit of the data size set in advance by the support program 22 of the host computer 20, and stores the calculation result in a memory designated in advance or in a register of the cache control LSI 129.
[0065]
This CRC code generation function (1) can be used, for example, to generate a CRC code at step D1 in FIG. 6, that is, to generate a CRC code of the data when the data on the cache memory 128 is read and written to the HDD 110.
[0066]
<CRC code generation function (2)>
The CRC code generation function (2) operates when data is written to the cache memory 128, that is, when write data is transferred from the host computer 20 to the cache memory 128 or when read data is transferred from the HDD 110 to the cache memory 128, at a timing matching, for example, a time interval designated by the support program 22 of the host computer 20. The function calculates a CRC code for each unit of the data size set in advance by the support program 22 of the host computer 20, and stores the calculation result in a memory designated in advance or in a register of the cache control LSI 129.
[0067]
If the CRC code generation function (2) is used to generate the CRC code of the data when the data transferred from the host computer 20 to the disk array device 10 is written into the cache memory 128, that CRC code can be notified to the host computer 20 (to the support program 22 running on it) in step B4 of the first data path abnormality detection process.
[0068]
The present invention is not limited to the above embodiment, and can be modified in various ways at the implementation stage without departing from its gist. Further, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining a plurality of the disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as the problem described in the section on the problem to be solved by the invention can be solved and (at least one of) the effects described in the section on the effect of the invention can be obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.
[0069]
[Effect of the invention]
As described above in detail, according to the present invention, generation of the first redundant code for the write data on the disk array device side is performed independently of the application program, according to settings made by the monitoring program. On the host computer side, the inquiry to the disk array device, the reading of the data specified by the data designation information obtained in response to the inquiry, the generation of the second redundant code, and the comparison of the first and second redundant codes are all performed according to the monitoring program, independently of the application program and without any communication with it. Therefore, according to the present invention, even if the write data transferred from the host computer does not itself carry a redundant code for data protection, the detection rate can be improved when a data error occurs due to an abnormality in the data path between the host computer and the disk array controller, and there is no need to exchange data with the application program.
[0070]
Further, according to the present invention, when the write data transferred from the host computer and written in the cache memory is written to the disk storage device, the disk array controller uses that write data to monitor for an abnormality in the data path between the disk array controller and the disk storage device, so that the detection rate can be improved when a data error occurs due to an abnormality in that data path.
[0071]
As described above, according to the present invention, it is possible to improve the detection rate when a data error occurs due to an abnormality in a target data path between the host computer and the disk storage device in the disk array device.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a computer system according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the data flow when a data write or read is requested from the host computer 20 to the disk array device 10.
FIG. 3 is a flowchart for explaining the first data path abnormality detection process in the embodiment.
FIG. 4 is a diagram for explaining an example of CRC code generation conditions in the first data path abnormality detection process.
FIG. 5 is a flowchart for explaining the second data path abnormality detection process in the embodiment.
FIG. 6 is a flowchart for explaining a modification of the second data path abnormality detection process.
[Explanation of symbols]
10 ... Disk array device
12 ... Disk array controller
20 ... Host computer
21 ... Application program
22 ... Support program (monitoring program)
110 ... HDD (disk storage device)
120 ... Main control unit
121 ... Host interface
122 ... Disk interface
123 ... Cache memory device
124 ... Data transfer bus
125 ... Microprocessor
126 ... ROM
127 ... RAM
128 ... Cache memory
128a ... Specific area
129 ... Cache control LSI

Claims

1. A data path abnormality detection method by patrol of a disk array device, applied to a computer system comprising a host computer and a disk array device, the disk array device being composed of a disk array including a plurality of disk storage devices and a disk array controller that controls the disk array and incorporates a cache memory for temporarily storing data transferred between the host computer and the disk storage devices, in which monitoring of the disk array device and various settings for the disk array device are performed by a monitoring program that operates in parallel with an application program on the host computer, the method comprising:
a step of generating, when a write request issued from the host computer to the disk array device matches a preset timing, a first redundant code of the write data transferred from the host computer to the disk array device and written to the cache memory in response to the write request, in units of data blocks of a preset size;
a step of making an inquiry for monitoring the data path between the host computer and the disk array device from the host computer to the disk array device, according to the monitoring program, at intervals corresponding to the timing;
a step of notifying the host computer, in response to the inquiry, of the first redundant code generated at that time and data designation information for designating the corresponding data;
a step of reading the data designated by the data designation information notified from the disk array device from the cache memory to the host computer;
a step of generating a second redundant code of the data read by the host computer in units of the data block according to the monitoring program; and
a step of comparing the first redundant code notified from the disk array device with the second redundant code, and determining, based on at least the comparison result, whether there is an abnormality in the data path between the host computer and the disk array device.

2. The data path abnormality detection method by patrol of a disk array device according to claim 1, wherein the timing is a time interval set in advance according to the monitoring program or an interval of the number of write requests.

3. The data path abnormality detection method by patrol of a disk array device according to claim 1, further comprising a step in which, if the first redundant code and the second redundant code do not match, the host computer inquires of the disk array device whether or not the original data in the cache memory has been updated,
wherein, when a response indicating that the original data has not been updated is returned, it is determined that there is an abnormality in the data path between the host computer and the disk array device.

4. The data path abnormality detection method by patrol of a disk array device according to claim 1, further comprising a step of holding a copy of the data on the cache memory used to generate the first redundant code in another specific area on the cache memory that can be read only by a dedicated read request issued from the host computer according to the monitoring program,
wherein the step of reading the data designated by the data designation information notified from the disk array device from the cache memory to the host computer reads the data in the specific area using the dedicated read request.

5. The data path abnormality detection method by patrol of a disk array device according to claim 1, further comprising:
a step of, when writing the write data transferred from the host computer and written to the cache memory to the disk storage device, writing a copy of the data to another specific area of the disk storage device and setting the data on the cache memory as invalid data that cannot be seen from the application program; and
a step of determining whether there is an abnormality in the data path between the disk array controller and the disk storage device by reading the data written in the specific area of the disk storage device to the disk array controller and comparing it with the original data on the cache memory.

6. The data path abnormality detection method by patrol of a disk array device according to claim 1, further comprising:
a step of, when writing the write data transferred from the host computer and written in the cache memory to the disk storage device, generating a third redundant code of the data in units of the data block and holding it in the disk array controller;
a step of reading the data in the disk storage device used to generate the third redundant code to the disk array controller and generating a fourth redundant code of the data in units of the data block; and
a step of determining whether there is an abnormality in the data path between the disk array controller and the disk storage device by comparing the fourth redundant code with the third redundant code held in the disk array controller.

7. A computer system comprising a host computer on which various application programs operate, and a disk array device composed of a disk array including a plurality of disk storage devices and a disk array controller that controls the disk array and incorporates a cache memory for temporarily storing data transferred between the host computer and the disk storage devices, wherein
the disk array controller comprises:
first redundant code generation means for generating, when a write request issued from the host computer to the disk array device matches a preset timing, a first redundant code of the write data transferred from the host computer to the disk array device and written to the cache memory in response to the write request, in units of data blocks of a preset size; and
means for notifying the host computer, in response to an inquiry from the host computer for monitoring the data path between the host computer and the disk array device, of the first redundant code generated by the first redundant code generation means at that time and data designation information for designating the write data used to generate the first redundant code, and
the host computer comprises:
means for making the inquiry to the disk array device at intervals corresponding to the timing;
means for receiving the first redundant code and the data designation information for designating the write data used to generate the first redundant code, both notified from the disk array device in response to the inquiry, and for reading the data from the disk array device by a read request that requests the data on the cache memory designated by the data designation information;
second redundant code generation means for generating a second redundant code of the data read from the disk array device in response to the read request, in units of the data block; and
means for comparing the second redundant code generated by the second redundant code generation means with the first redundant code notified from the disk array device in response to the inquiry, and determining, based on at least the comparison result, whether there is an abnormality in the data path between the host computer and the disk array device.

8. The computer system according to claim 7, wherein the disk array controller further comprises:
write data storage means for, when writing the write data transferred from the host computer and written to the cache memory to the disk storage device, writing a copy of the data to another specific area of the disk storage device and setting the data on the cache memory as check data that is invalid data invisible to the application program; and
means for determining whether there is an abnormality in the data path between the disk array controller and the disk storage device by reading the data written in the specific area of the disk storage device by the write data storage means to the disk array controller and comparing it with the original data on the cache memory.

9. The computer system according to claim 7, wherein the disk array controller further comprises:
third redundant code generation means for generating a third redundant code of the data in units of the data block when writing the write data transferred from the host computer and written in the cache memory to the disk storage device;
fourth redundant code generation means for reading the data in the disk storage device used to generate the third redundant code and generating a fourth redundant code of the data in units of the data block; and
means for determining whether there is an abnormality in the data path between the disk array controller and the disk storage device by comparing the fourth redundant code generated by the fourth redundant code generation means with the third redundant code generated by the third redundant code generation means.

10. A disk array controller that controls a disk array including a plurality of disk storage devices and incorporates a cache memory for temporarily storing data transferred between a host computer on which various application programs operate and the disk storage devices, the disk array controller comprising:
write data storage means for, when writing the write data transferred from the host computer and written to the cache memory to the disk storage device, writing a copy of the data to another specific area of the disk storage device and setting the data on the cache memory as check data that is invalid data invisible to the application program; and
means for determining whether there is an abnormality in the data path between the disk array controller and the disk storage device by reading the data written in the specific area of the disk storage device by the write data storage means to the disk array controller and comparing it with the original data on the cache memory.

11. A disk array controller that controls a disk array including a plurality of disk storage devices and incorporates a cache memory for temporarily storing data transferred between a host computer on which various application programs operate and the disk storage devices, the disk array controller comprising:
first redundant code generation means for generating a first redundant code of the data in units of data blocks of a predetermined size when writing the write data transferred from the host computer and written in the cache memory to the disk storage device;
second redundant code generation means for reading the data in the disk storage device used to generate the first redundant code and generating a second redundant code of the data in units of the data block; and
means for determining whether there is an abnormality in the data path between the disk array controller and the disk storage device by comparing the second redundant code generated by the second redundant code generation means with the first redundant code generated by the first redundant code generation means.