JP4852118B2

JP4852118B2 - Storage device and logical disk management method

Info

Publication number: JP4852118B2
Application number: JP2009072593A
Authority: JP
Inventors: 邦保清水; 達也伊藤; 義光上山
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2009-03-24
Filing date: 2009-03-24
Publication date: 2012-01-11
Anticipated expiration: 2029-03-24
Also published as: JP2010224954A

Description

The present invention relates to a storage apparatus and a logical disk management method that remain usable even when a temporary failure such as a timeout occurs.

Conventionally, when a failure occurs in a physical disk that constitutes a logical disk in a storage apparatus, that physical disk is disconnected and the apparatus continues in degraded operation. For example, in a RAID-configured storage apparatus, a temporary failure such as a timeout during access to a physical disk is treated as a fault. The physical disk on which the temporary failure occurred is then disconnected and degraded operation begins.

However, a physical disk that was disconnected because a temporary failure was detected may turn out, on later inspection, not to be faulty. For example, power-cycling the physical disk may return it to a normal state. In such a case the physical disk is treated as faulty even though it is not, which inflates the observed fault rate.

To address this, there is a method in which a physical disk on which a temporary failure occurred is first disconnected from the logical disk and then diagnosed, and, if no fault is found, is incorporated again as a member disk of the logical disk (see, for example, Non-Patent Document 1).
Patent Document 1: JP 2006-164304 A
Non-Patent Document 1: "Phoenix Technology", [online], [retrieved June 19, 2008], Internet <URL: http://www.nec.co.jp/products/istorage/words/m036.shtml>

However, even with this conventional method, the logical disk loses its redundancy from the moment the physical disk is disconnected until the logical disk is reconfigured.

In addition, a storage apparatus with a RAID-5 configuration, for example, does not tolerate failures of multiple physical disks: while one physical disk has a failure, it cannot cope with a failure in another physical disk.

Consequently, if another physical disk fails between the time the physical disk judged to be faulty is disconnected and the time the logical disk is reconfigured, operation of the logical disk stops and data is lost.

The present invention has been made in view of the above circumstances, and its object is to provide a storage apparatus and a logical disk management method that remain usable even when a temporary failure such as a timeout occurs.

To solve the above problem, the present invention provides a storage apparatus that comprises one or more physical disks constituting a logical disk, a hot spare disk, and a disk controller, and that stores data in response to requests from a host device connected via a network, wherein the disk controller comprises: temporary failure detection means for detecting a temporary failure occurring in a physical disk; failure recovery means for performing, when the failure is detected, failure recovery processing on the failed physical disk on which the temporary failure occurred; data replication means for replicating the data of the failed physical disk to the hot spare disk; means for monitoring the failed physical disk for a fixed period after the start of the failure recovery processing and recording response data of the failed physical disk to commands from the host device; failure determination means for comparing the recorded response data with reference response data to determine whether the failed physical disk is faulty; and a logical disk reconfiguration unit that, when the failure determination means determines that the disk is faulty, makes the hot spare disk a physical disk constituting the logical disk in place of the failed physical disk.

According to the present invention, it is possible to provide a storage apparatus and a logical disk management method that remain usable even when a temporary failure such as a timeout occurs.

Embodiments of the present invention are described below with reference to the drawings.

<First Embodiment>
FIG. 1 is a schematic diagram showing the configuration of a storage apparatus 10 according to the first embodiment of the present invention. The storage apparatus 10 includes one or more physical disks 21 that constitute a logical disk 20, a hot spare disk 22, and a disk controller 30. The storage apparatus 10 is connected to a host device 5 via a network using SCSI (small computer system interface), FC (Fibre Channel), or the like, and is a secondary storage device that stores data in response to requests from the host device 5. The physical disk 21 is generally an HDD (hard disk drive), but is not limited to this and may be any storage device, including a semiconductor disk. The hot spare disk 22 substitutes for a physical disk in which a failure has occurred.

The disk controller 30 has a memory 40 and a processor 50. When the "logical disk management program" stored in the memory 40 is loaded by the processor 50, the disk controller 30 functions as a logical disk setting unit 51, a temporary failure detection unit 52, a failure recovery unit 53, a data replication unit 54, a disk monitoring unit 55, a failure determination unit 56, a logical disk reconfiguration unit 57, and a warning data output unit 58. FIG. 1 shows the processing units 51 to 58 inside the processor, but this is only a convenient representation: each of the processing units 51 to 58 is programmed as part of the logical disk management program and is realized when the processor 50 executes that program.

The memory 40 is a storage device that stores the data processed by the disk controller 30. Preset "reference response data" is stored in the memory 40. "Response data" is also written to the memory 40 by the disk monitoring unit 55, described later.

The normal average response time differs between HDD models and, even for the same HDD, varies with the I/O load (queue depth), so values that take this into account are used as the reference response data. For example, table data for each HDD model supported by the storage apparatus 10, or a value that can reliably be judged abnormal for any model and environment, is used as the reference response data.

The logical disk setting unit 51 groups one or more physical disks 21 and sets them up as the logical disk 20. This implements the RAID (redundant array of inexpensive disks) function. Depending on the RAID type, the logical disk 20 may or may not have redundancy.
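The reference response data described above (a per-model table that varies with queue depth, plus a value that is abnormal in any environment) could be represented as in the following minimal sketch. The model name, the queue-depth bands, and all millisecond values are hypothetical placeholders and do not come from the source.

```python
# Hypothetical reference table: model -> list of (max queue depth, normal average response in ms).
REFERENCE_MS = {
    "model-a": [(1, 8.0), (8, 20.0), (32, 60.0)],
}

# Hypothetical catch-all: a response time assumed abnormal for any model and load.
CATCH_ALL_MS = 1000.0


def reference_response_ms(model: str, queue_depth: int) -> float:
    """Return the reference response time for a model at a given queue depth."""
    for max_depth, ms in REFERENCE_MS.get(model, []):
        if queue_depth <= max_depth:
            return ms
    # Unknown model, or a queue depth beyond the table: fall back to the catch-all value.
    return CATCH_ALL_MS
```

The fallback mirrors the second option in the text: when no per-model entry applies, only a value that is certain to indicate an abnormality is used.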

The temporary failure detection unit 52 detects a temporary failure occurring in a physical disk 21. For example, the temporary failure detection unit 52 detects a temporary failure from a timeout or the like that occurs when a write request is received from the host device 5.

When the temporary failure detection unit 52 detects a temporary failure, the failure recovery unit 53 performs failure recovery processing on the physical disk on which the temporary failure occurred (hereinafter, the failed physical disk 21X). For example, the failure recovery unit 53 performs failure recovery processing by a device reset, by powering the physical disk off and on, or the like. While the failure recovery unit 53 is performing failure recovery processing, access from the host device 5 to the failed physical disk 21X is suspended, and I/O processing resumes when the failure recovery processing completes.

The data replication unit 54 replicates the data of the failed physical disk 21X to the hot spare disk 22 so that the two are mirrored. Because the data on the hot spare disk 22 is to be a mirror of the failed physical disk 21X, the data replication unit 54 can copy the data in full from the failed physical disk 21X. Alternatively, when the logical disk 20 has a redundant RAID configuration, the data replication unit 54 can, as shown in FIG. 2, rebuild the data of the hot spare disk 22 from the member disks of the logical disk 20 other than the failed physical disk 21X.
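Rebuilding the hot spare's contents from the other member disks, as described above, can be illustrated under the assumption of a single-parity XOR layout such as RAID-5. The source names RAID-5 only as an example of a redundant configuration, so the function below is a sketch of the general technique rather than the patent's implementation, and `rebuild_block` is a hypothetical name.

```python
from functools import reduce


def rebuild_block(surviving_blocks: list[bytes]) -> bytes:
    """Recover the missing block of a stripe by XOR-ing the blocks of all other members.

    Assumes single-parity XOR redundancy: the XOR of every block in a stripe
    (data and parity together) is zero, so any one block equals the XOR of the rest.
    """
    if not surviving_blocks:
        raise ValueError("need at least one surviving block")
    length = len(surviving_blocks[0])
    if any(len(block) != length for block in surviving_blocks):
        raise ValueError("all blocks in a stripe must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*surviving_blocks))
```

For a stripe with data blocks `d0` and `d1` and parity `p = d0 XOR d1`, calling `rebuild_block([d1, p])` yields `d0` without reading the failed physical disk 21X at all.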

The disk monitoring unit 55 monitors the failed physical disk 21X for a fixed period after the start of the failure recovery processing, and has the function of recording in the memory 40 the response data of the failed physical disk 21X to commands from the host device 5. Specifically, the disk monitoring unit 55 records the I/O pattern and other behavior of the failed physical disk 21X from the time the first temporary failure occurs on the physical disk 21, throughout the restoration of data to the hot spare disk 22, and for a fixed period (for example, 24 hours) after the restoration completes.

The failure determination unit 56 compares the response data recorded after the failure recovery processing with the reference response data stored in the memory 40, and determines whether the failed physical disk 21X is faulty. For example, the failure determination unit 56 makes this determination from I/O response delays or other abnormal behavior of the physical disk 21.

Once replication of the data to the hot spare disk 22 has completed, the failed physical disk 21X and the hot spare disk 22 operate with their data mirrored.

As shown in FIG. 3, when the failure determination unit 56 determines that the failed physical disk 21X is faulty, the logical disk reconfiguration unit 57 incorporates the hot spare disk 22 as a member disk of the logical disk 20 in place of the failed physical disk 21X.

When it has incorporated the hot spare disk 22 as a member disk of the logical disk 20, the logical disk reconfiguration unit 57 also cancels the monitoring of the failed physical disk 21X by the disk monitoring unit 55 and ends the recording of response data. At this point the fault in the failed physical disk 21X is confirmed.

If a temporary failure is detected again on the monitored failed physical disk 21X while data is being restored to the hot spare disk 22, or during the fixed period after restoration, the logical disk reconfiguration unit 57 disconnects the failed physical disk 21X from the logical disk 20 at that point and assigns the hot spare disk 22 as a member disk of the logical disk 20.

If no abnormal behavior is observed on the failed physical disk 21X, then, as shown in FIG. 4, the logical disk reconfiguration unit 57 regards the failure that occurred on the failed physical disk 21X as having been temporary, cancels the monitoring by the disk monitoring unit 55, and ends the recording of response data. The logical disk reconfiguration unit 57 then dissolves the mirror configuration of the failed physical disk 21X and the hot spare disk 22. The logical disk 20 thereby returns to its original state.

The warning data output unit 58 outputs warning data when the failure determination unit 56 determines that the failed physical disk 21X is faulty.

Next, the operation of the storage apparatus 10 according to this embodiment is described with reference to the flowchart of FIG. 5. In the disk controller 30 the temporary failure detection unit 52 runs at all times, and when a temporary failure occurs on a physical disk 21, the temporary failure detection unit 52 detects it (S1-Yes). The failure recovery unit 53 then executes failure recovery processing on the failed physical disk 21X (S2). The failure recovery processing consists of a device reset or a power cycle.

When the failure recovery unit 53 starts the failure recovery processing of the failed physical disk 21X, the data replication unit 54 assigns the hot spare disk 22 to the failed physical disk 21X (S3). The hot spare disk 22 is thereby configured as the mirror disk of the failed physical disk 21X. The data replication unit 54 then replicates the data of the failed physical disk 21X to the hot spare disk 22.

Also, when the failure recovery unit 53 starts the failure recovery processing of the failed physical disk 21X, the disk monitoring unit 55 records in the memory 40 the response data of the failed physical disk 21X to commands from the host device 5 (S4).

Next, the failure determination unit 56 compares the response data recorded after the failure recovery processing with the reference response data stored in advance in the memory 40, and determines whether the failed physical disk 21X is faulty (S5).

If the failure determination unit 56 determines that the disk is faulty (S5-Yes), the logical disk reconfiguration unit 57 reconfigures the logical disk 20 with the hot spare disk 22 as a member disk in place of the failed physical disk 21X (S6). Once the logical disk 20 has been reconfigured, the failed physical disk 21X is disconnected (S7).

If, on the other hand, the failure determination unit 56 does not determine that the disk is faulty, the failed physical disk 21X continues to be used (S5-No, S8). Because the failed physical disk 21X and the hot spare disk 22 are mirrored, it is also possible, instead of continuing to use the failed physical disk 21X, to incorporate the hot spare disk 22 into the logical disk and use the disk that had been suspected of failure as the hot spare disk.
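The sequence S2 to S8 of FIG. 5 described above can be summarized in the following sketch, which starts once a temporary failure has been detected (S1-Yes). The four callables are hypothetical stand-ins for the failure recovery unit 53, data replication unit 54, disk monitoring unit 55, and failure determination unit 56; their signatures are assumptions made for illustration.

```python
from enum import Enum
from typing import Any, Callable


class Outcome(Enum):
    KEEP_ORIGINAL = "failed physical disk 21X stays a member (S8)"
    SWAP_TO_SPARE = "hot spare disk 22 becomes a member, 21X is disconnected (S6, S7)"


def handle_temporary_failure(
    recover: Callable[[], None],           # S2: device reset or power cycle
    mirror_to_spare: Callable[[], None],   # S3: assign hot spare and replicate data
    record_responses: Callable[[], Any],   # S4: collect response data during monitoring
    is_faulty: Callable[[Any], bool],      # S5: compare with reference response data
) -> tuple[Outcome, list[str]]:
    """Run steps S2 to S8 and return the outcome plus the steps that were taken."""
    steps = []
    recover()
    steps.append("S2")
    mirror_to_spare()
    steps.append("S3")
    response_log = record_responses()
    steps.append("S4")
    steps.append("S5")
    if is_faulty(response_log):
        # The logical disk is reconfigured around the spare before 21X is detached.
        steps += ["S6", "S7"]
        return Outcome.SWAP_TO_SPARE, steps
    steps.append("S8")
    return Outcome.KEEP_ORIGINAL, steps
```

Note that in both branches the mirror built in S3 exists before any disk is detached, which is the property the embodiment relies on to avoid a window without redundancy.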

As described above, in the storage apparatus 10 according to this embodiment, the disk controller 30 includes the temporary failure detection unit 52, which detects a temporary failure occurring in a physical disk 21; the failure recovery unit 53, which performs failure recovery processing on the failed physical disk 21X on which the temporary failure occurred; and the data replication unit 54, which replicates the data of the failed physical disk 21X to the hot spare disk 22. By mirroring the failed physical disk 21X and the hot spare disk 22, the apparatus can remain in use even when a temporary failure such as a timeout occurs.

In addition, the disk controller 30 monitors the failed physical disk 21X for a fixed period after the start of the failure recovery processing (after the failed physical disk 21X and the hot spare disk 22 have been mirrored) and records the response data of the failed physical disk 21X to commands from the host device 5. It can therefore provide the information needed to judge whether the failed physical disk 21X is faulty and what kind of fault has occurred. For example, a system administrator can analyze the cause of the fault from the log of this response data, and by reproducing the I/O sequence that led up to the temporary failure on a failed physical disk 21X that was ultimately disconnected as faulty, the cause of the fault can be verified easily.

Furthermore, because the disk controller 30 includes the failure determination unit 56, which determines whether the failed physical disk 21X is faulty, it can compare the response data recorded after the failure recovery processing with the reference response data and check for a fault without disconnecting the failed physical disk 21X from the logical disk 20. As a result, the loss of redundancy of the logical disk 20 that would otherwise accompany fault determination is avoided.

Moreover, when the failure determination unit 56 determines that the failed physical disk 21X is faulty, the hot spare disk 22 is incorporated as a member disk of the logical disk 20 in place of the failed physical disk 21X, so the storage apparatus 10 can remain in use even when a fault is found.

In a conventional storage apparatus, as shown in FIG. 6(A), when a failure occurs on a physical disk 21 that was in the normal state (A1), the failed physical disk 21X is disconnected (A2) and the logical disk 20 is rebuilt with the hot spare disk 22 as a member disk (A3). The problem was that the redundancy of the logical disk 20 could not be maintained until the rebuild onto the hot spare disk 22 completed (A4).

In contrast, as shown in FIG. 6(B), when a failure occurs on a physical disk 21 that was in the normal state (B1), the storage apparatus 10 according to this embodiment copies the data of the failed physical disk 21X to the hot spare disk 22 (B2) and operates the failed physical disk 21X and the hot spare disk 22 as a mirror for a fixed period. It records the I/O pattern and other responses to commands from the host device 5 during the mirror operation (B3) and determines whether the failed physical disk 21X is faulty (B4, B5). Because a merely temporary failure does not cause the failed physical disk 21X to be disconnected, and the disk instead continues in use mirrored with the hot spare disk 22, the logical disk 20 can be used without losing redundancy.

In addition, when the failure determination unit 56 determines that the failed physical disk 21X is faulty, the disk controller 30 outputs warning data, which makes it possible to prompt the system administrator to decide whether to continue using the failed physical disk 21X or to switch to the hot spare disk 22.

Variations of the monitoring by the disk monitoring unit 55 and the fault determination by the failure determination unit 56 are described below.
(Monitoring method 1: I/O pattern monitoring)
In monitoring method 1, the I/O pattern of every command issued to the failed physical disk 21X during the monitoring period (or at least of those issued in the most recent several tens of seconds) is recorded in memory. Here, the I/O pattern of a command comprises the command type (CDB image), issue time, completion time, and completion status (sense information). The sense information records error recovery information, such as the fact that execution of the command failed once but succeeded on a retry inside the HDD.

In monitoring method 1, the failure determination unit 56 determines that a fault has occurred when the proportion of commands that succeeded only on retry, among all issued commands, exceeds a predetermined threshold. Even when this criterion is not used, the I/O pattern information is needed if a fault is determined by another monitoring method. For example, when the failed physical disk 21X is retrieved for a reproduction test, the I/O patterns of the commands are needed to reproduce the command sequence that led up to the temporary failure. It is therefore desirable to record the I/O pattern even when one of the monitoring methods below is used.
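The record layout and retry-ratio criterion of monitoring method 1 can be sketched as follows. The field names and the 5% default threshold are hypothetical; the source specifies only which items an I/O pattern contains and that the threshold is predetermined.

```python
from dataclasses import dataclass


@dataclass
class IoRecord:
    cdb: bytes                 # command type (CDB image)
    issued_at: float           # issue time, seconds
    completed_at: float        # completion time, seconds
    recovered_by_retry: bool   # sense information reported success after an internal retry


def retry_ratio_exceeded(records: list[IoRecord], threshold: float = 0.05) -> bool:
    """Return True when retried-then-successful commands exceed the threshold share."""
    if not records:
        return False  # nothing observed yet, so there is no basis for declaring a fault
    retried = sum(1 for record in records if record.recovered_by_retry)
    return retried / len(records) > threshold
```

Keeping the full `IoRecord` list, rather than only a running counter, is what allows the command sequence to be replayed later in a reproduction test.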

(Monitoring method 2: Response time monitoring)
In monitoring method 2, detailed statistics on the response time to copy commands for the hot spare disk 22 or to commands from the host device 5 are recorded in memory. If there is enough capacity to keep the I/O pattern records described above for the entire monitoring period, it suffices to process that information statistically.

In monitoring method 2, the failure determination unit 56 compares the response times obtained by the disk monitoring unit 55 with the reference response time stored in advance in the memory 40, and determines whether a fault has occurred according to the proportion of commands that exceed a given threshold. For example, if more than 10% of commands have a response time longer than 1 second, it determines that a fault has occurred.
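The example criterion above (more than 10% of commands slower than 1 second) can be written as a short check. The defaults take the two figures from the text; the function name and the handling of an empty sample are assumptions.

```python
def response_time_fault(
    response_times_s: list[float],
    limit_s: float = 1.0,        # reference response time from the example
    max_fraction: float = 0.10,  # tolerated share of slow commands from the example
) -> bool:
    """Return True when the share of commands slower than limit_s exceeds max_fraction."""
    if not response_times_s:
        return False  # assumption: no samples means no fault can be declared
    slow = sum(1 for t in response_times_s if t > limit_s)
    return slow / len(response_times_s) > max_fraction
```

Both comparisons are strict, so a sample in which exactly 10% of commands are slow is not yet treated as a fault.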

(Monitoring method 3: Throughput monitoring)
In monitoring method 3, the read throughput is measured for the full-disk read requests issued to the failed physical disk 21X when data is copied to the hot spare disk 22. Because the failed physical disk 21X is also receiving commands from the host device 5 in parallel, a correction for the resulting change in the data is performed separately.

The failure determination unit 56 holds, as table data, the throughput performance that the failed physical disk 21X should have, and determines whether the disk is faulty from the difference between the table data and the measured value. For example, it determines that the disk is faulty if the measured value falls below a certain threshold (for example, 50%) of the tabulated performance.
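A minimal sketch of the throughput comparison in monitoring method 3 follows. It reads the 50% example as "measured throughput below half of the tabulated value", which is an interpretation, and the model name and 150 MB/s figure are hypothetical. The correction for concurrent host I/O mentioned above is deliberately left out.

```python
# Hypothetical table of the sequential read throughput each model should deliver, in MB/s.
EXPECTED_READ_MBPS = {"model-a": 150.0}


def throughput_fault(model: str, bytes_read: int, elapsed_s: float, min_ratio: float = 0.5) -> bool:
    """Return True when measured read throughput is below min_ratio of the expected value."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    expected_mbps = EXPECTED_READ_MBPS[model]
    measured_mbps = bytes_read / elapsed_s / 1e6
    return measured_mbps < expected_mbps * min_ratio
```

Because the copy to the hot spare disk 22 already reads the whole of the failed physical disk 21X, this measurement needs no additional test I/O.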

(Monitoring method 4: Monitoring error information from the SMART function)
In monitoring method 4, error information from the HDD's SMART function is acquired. Typical HDDs have a SMART function, which monitors the internal errors of the HDD itself. Because this error information can be read from outside, the disk monitoring unit 55 acquires it. Error information obtainable through the SMART function includes the read error rate, write error rate, seek error rate, number of remaining spare sectors, spin-up time, G-list update frequency, and device temperature. The spin-up time, however, is recorded at power-on and should therefore be excluded from monitoring by the disk monitoring unit 55.

The failure determination unit 56 reads each parameter other than the spin-up time periodically (for example, every minute) and determines that the disk is faulty when the value of a parameter, or its increment, exceeds a threshold.
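The periodic check in monitoring method 4 compares both the current value of each parameter and its increment since the previous poll. The sketch below works on plain dictionaries of already-read values; it does not query a drive, and the parameter names and limits are hypothetical.

```python
def smart_fault(
    previous: dict[str, float],
    current: dict[str, float],
    value_limits: dict[str, float],
    delta_limits: dict[str, float],
    ignored: tuple[str, ...] = ("spin_up_time",),  # recorded at power-on, so not monitored
) -> bool:
    """Return True when any monitored parameter, or its increment, exceeds its limit."""
    for name, value in current.items():
        if name in ignored:
            continue
        if name in value_limits and value > value_limits[name]:
            return True
        if name in delta_limits and name in previous:
            if value - previous[name] > delta_limits[name]:
                return True
    return False
```

A caller would invoke this once per polling interval (every minute in the example above), passing the previous snapshot as `previous`.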

(Monitoring method 5: Monitoring for the same error information as the first temporary failure)
In monitoring method 5, the disk is monitored for a recurrence of the same error information as the first temporary failure, using the same criterion that the temporary failure detection unit 52 uses to detect a temporary failure (for example, a command was retried because of a timeout or error response and the retries were exhausted). The failure determination unit 56 determines that the disk is faulty when the disk monitoring unit 55 again obtains the same error information as the first temporary failure.

With monitoring method 5, when a temporary failure of a kind that does not normally occur arises on a physical disk 21, a second temporary failure is very rarely detected, so the number of times a disk is treated as faulty can be reduced.
When monitoring method 5 is combined with monitoring method 1 and a detailed log of the I/O pattern is collected, the cause of the failure can be investigated easily.
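Monitoring method 5 amounts to a small state machine: a second temporary failure inside the monitoring window confirms a fault, and an uneventful window clears the disk. The sketch below simplifies the window to a single duration counted from a start time (24 hours by default, following the earlier example); the class name and state strings are hypothetical.

```python
class RecurrenceMonitor:
    """Track whether a temporary failure recurs on disk 21X within the monitoring window."""

    def __init__(self, started_at: float, window_s: float = 24 * 3600):
        self.deadline = started_at + window_s
        self.state = "monitoring"

    def observe(self, now: float, temporary_failure: bool) -> str:
        """Feed one observation and return 'monitoring', 'faulty', or 'cleared'."""
        if self.state != "monitoring":
            return self.state  # the decision is final once made
        if temporary_failure and now <= self.deadline:
            self.state = "faulty"   # recurrence inside the window confirms the fault
        elif now > self.deadline:
            self.state = "cleared"  # window elapsed with no recurrence
        return self.state
```

A failure reported after the deadline leaves this monitor in the cleared state; in the apparatus it would be handled as a fresh first temporary failure.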

<Others>
The present invention is not limited to the above embodiment as it stands; at the implementation stage its constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by suitably combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be removed from the full set shown in the embodiment. Constituent elements from different embodiments may also be combined as appropriate.

Note that the methods described in the above embodiment can also be stored, as a program executable by a computer, in a storage medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), a magneto-optical disk (MO), a semiconductor memory, or a semiconductor disk, and distributed.

The storage medium may use any storage format, as long as it can store the program and can be read by a computer.

In addition, an OS (operating system) running on the computer, or MW (middleware) such as database management software or network software, may execute part of each process for realizing the above embodiment, based on instructions from the program installed on the computer from the storage medium.

Furthermore, the storage medium in the present invention is not limited to a medium independent of the computer; it also includes a storage medium that stores, or temporarily stores, a downloaded program transmitted over a LAN, the Internet, or the like.

The number of storage media is not limited to one. A case in which the processing of the above embodiment is executed from a plurality of media is also covered by the storage medium of the present invention, and the media may have any configuration.

The computer in the present invention executes each process of the above embodiment based on the program stored in the storage medium, and may have any configuration, such as a single device (for example, a personal computer) or a system in which a plurality of devices are connected via a network.

The computer in the present invention is not limited to a personal computer; it also includes an arithmetic processing unit, a microcomputer, and the like contained in information processing equipment, and is a generic term for equipment and devices capable of realizing the functions of the present invention by means of a program.

Schematic diagram showing the configuration of the storage apparatus 10 according to the first embodiment of the present invention.
Schematic diagram for explaining the function of the data replication unit 54 according to the embodiment.
Schematic diagram for explaining the function of the logical disk reconstruction unit 57 according to the embodiment.
Schematic diagram for explaining the function of the logical disk reconstruction unit 57 according to the embodiment.
Flowchart for explaining the operation of the storage apparatus 10 according to the embodiment.
Schematic diagram for explaining the effect of the storage apparatus 10 according to the embodiment.

5 ... host device, 10 ... storage apparatus, 20 ... logical disk, 21 ... physical disk, 22 ... hot spare disk, 30 ... disk controller, 40 ... memory, 50 ... processor, 51 ... logical disk setting unit, 52 ... temporary failure detection unit, 53 ... failure recovery unit, 54 ... data replication unit, 55 ... disk monitoring unit, 56 ... failure determination unit, 57 ... logical disk reconstruction unit, 58 ... warning data output unit.

Claims

1. A storage device that comprises one or more physical disks constituting a logical disk, a hot spare disk, and a disk controller, and that stores data in response to a request from a host device connected via a network,
wherein the disk controller comprises:
temporary failure detection means for detecting a temporary failure occurring in the physical disk;
failure recovery means for performing, when the temporary failure is detected, failure recovery processing on the failed physical disk in which the temporary failure has occurred;
data duplicating means for duplicating the data of the failed physical disk to the hot spare disk;
means for monitoring the failed physical disk for a certain period after the start of the failure recovery processing, and recording response data of the failed physical disk in response to a command from the host device;
failure determination means for comparing the recorded response data with reference response data and determining whether the failed physical disk is faulty; and
logical disk reconfiguration means for using, when the failure determination means determines that a failure has occurred, the hot spare disk instead of the failed physical disk as a physical disk constituting the logical disk.

2. The storage device according to claim 1, wherein when the failure determination means does not determine that a failure has occurred, the failed physical disk continues to be used.

3. A logical disk management method used in a storage device that comprises one or more physical disks constituting a logical disk, a hot spare disk, and a disk controller, and that stores data in response to a request from a host device connected via a network,
wherein the disk controller:
performs, when a temporary failure occurring in the physical disk is detected, failure recovery processing on the failed physical disk in which the temporary failure has occurred, and duplicates the data of the failed physical disk to the hot spare disk;
monitors the failed physical disk for a certain period after the start of the failure recovery processing, and records response data of the failed physical disk in response to a command from the host device;
compares the response data recorded after the failure recovery processing with reference response data to determine whether the failed physical disk is faulty; and
uses, if the result of the determination is that there is a failure, the hot spare disk instead of the failed physical disk as a physical disk constituting the logical disk.

4. The logical disk management method according to claim 3, wherein if the result of the determination is that there is no failure, the disk controller continues to use the failed physical disk.