JP6247314B2 - Computer system and computer system control method

Info

Publication number: JP6247314B2
Application number: JP2015551349A
Authority: JP
Inventors: 真生濱本; 山岡　雅直
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2013-12-06
Filing date: 2013-12-06
Publication date: 2017-12-13
Anticipated expiration: 2033-12-06
Also published as: WO2015083276A1; JPWO2015083276A1

Description

The present invention relates to an information processing system including a semiconductor memory and to a method of controlling it. In particular, it relates to a technique for realizing an information processing system that achieves low power consumption while satisfying a predetermined level of reliability.

As semiconductor processes shrink, computer system performance improves, but variation in transistor characteristics increases. This variation lowers the reliability of storage devices, SRAM (Static Random Access Memory) in particular, and can corrupt the data they hold. Because data corruption can bring a system down, techniques that compensate for it have become a major issue in recent years. The same applies not only to SRAM but also to storage devices such as DRAM (Dynamic Random Access Memory); in DRAM, for example, the data retention time becomes shorter.

For this reason, as a technique for maintaining the reliability of a storage device, Patent Document 1 discloses correcting errors in stored data by error correction coding (ECC) or by data multiplexing. Patent Document 2 discloses a technique that, to cope with differences in the required threshold voltage caused by memory chip degradation, changes the parameters that determine the electrical characteristics of the signals used to write data to or read data from the memory chip and sets them to appropriate values.

Patent Document 1: JP 2008-521160 A (published Japanese translation of a PCT application)
Patent Document 2: JP 2012-68825 A

When errors are completely corrected for all data, as in Patent Document 1 and Patent Document 2, the cost of maintaining reliability increases. For example, applying ECC consumes power for encoding and correction. Raising the voltage of an SRAM, or raising the refresh rate of a DRAM, widens the operating margin but also increases power consumption. Maintaining the reliability of a storage device thus carries a large power cost, and that cost grows as semiconductor processes shrink further.

In addition, applications such as learning and recognition processing on large-scale data are expected to become prominent. Such applications perform a large amount of computation and therefore require large-capacity storage devices. The increase in reliability maintenance cost that accompanies larger storage capacity therefore becomes a particular problem.

However, some applications, such as those based on learning and recognition processing, are highly tolerant of errors in calculation results. For example, suppose the correct result in recognizing a person is a confidence of 90%. Even if a data error lowers the confidence to 88%, there is no problem as long as the conclusion that the person is Mr. A does not change. Current computer systems, however, assume that storage devices are highly reliable and handle processor instructions and data in the same way, so a data error in a storage device may bring down the entire computer system.

Therefore, in the computer system according to the embodiments of the present invention, data of high importance is held in the storage device at high reliability (a level at which errors can be corrected). This is data involved in controlling the entire system, such as instruction data and pointers, for which an error would lead to a system failure. In contrast, data of low importance is held at low reliability (a level at which one or more bits of data cannot be corrected even with ECC). This is data such as input images and text or intermediate calculation data, for which an error does not stop the entire system. As a result, most of the storage device is used in a low-reliability (in other words, low-power) state, while fatal errors of the computer system, such as a stop of the entire system, are avoided.

Specifically, a computer system in one example of the embodiments includes a memory and first and second processors connected to the memory. The first processor accesses the memory in a first operating state and stops accessing the memory in a second operating state, in which the data error rate of the memory is higher than in the first operating state. The second processor, in contrast, accesses the memory in the second operating state.

The first processor stores work instructions for the second processor in the memory in the first operating state, and the second processor reads the work instructions stored in the memory and executes the processing in the second operating state. The first processor also checks, in the first operating state, whether the second processor is operating and restarts the second processor if it is not.

When the memory is an SRAM, the first and second operating states are determined by the operating voltage of the SRAM. In this case, the operating voltage in the second operating state is lower than the operating voltage in the first operating state.

When the memory is a DRAM, the first and second operating states are determined by the refresh rate of the DRAM. In this case, the refresh rate in the second operating state is lower than the refresh rate in the first operating state.

The present invention makes it possible to provide a computer system that reduces the power consumption of a storage device while maintaining a predetermined level of reliability.

FIG. 1 is a diagram showing a configuration example of a processor provided with an SRAM.
FIG. 2 is a diagram showing the data held in the SRAM.
FIG. 3 is a diagram explaining the parallel processing of the processor.
FIG. 4 is a diagram showing the data arrangement of the address area on the SRAM used by a worker.
FIG. 5 is a diagram showing the data arrangement of the address area on the SRAM used by a worker.
FIG. 6 is a diagram showing the data arrangement of the address area on the SRAM used by a worker.
FIG. 7 is a flowchart of the operation of the master in parallel processing.
FIG. 8 is a flowchart of the operation of a worker in parallel processing.
FIG. 9 is a diagram showing a configuration example of a computer system.
FIG. 10 is a diagram showing the data held in the memory.
FIG. 11 is a processing flowchart of the parameter adjustment program of the computer system.
FIG. 12 is a diagram showing a configuration example of a computer system.
FIG. 13 is a diagram showing a configuration example of a memory.
FIG. 14 is a diagram showing the data held by the control unit of the memory.
FIG. 15 is a diagram showing the data held in the high-reliability area of the storage unit of the memory.
FIG. 16 is a diagram showing the data held in the low-power area of the storage unit of the memory.
FIG. 17 is a flowchart of the operation of the computer system.

Embodiment 1 describes an example of a processor in which the power consumption of the SRAM is reduced.

FIG. 1 is a block diagram showing the configuration of a processor 10 provided with an SRAM. The processor 10 is a multi-core processor with a plurality of processor cores and includes a CPU 110, a CPU 120, a bus 130, an input/output unit 140, an SRAM 150, a timer 160, and a voltage/frequency control unit 170.

The CPU 110 is the arithmetic core that acts as the master in master-worker parallel processing, and the CPU 120 is an arithmetic core that acts as a worker. The CPU 120 has an instruction cache 121 and a load/store unit 122.

The instruction cache 121 is a cache memory that stores instruction data. It is built to operate with high reliability even at low voltage, for example by using memory cells with larger transistors or with more transistors. The load/store unit 122 writes data from the CPU 120 to the SRAM 150 and reads data from the SRAM 150 into the CPU 120.

The bus 130 connects the modules in the processor 10. The input/output unit 140 connects the processor 10 to external systems. The SRAM 150 is a shared memory that stores the data used in calculations by the CPU 110 (master) and the CPU 120 (worker); for example, it stores the data shown in FIG. 2.

The timer 160 counts time. Based on control information 111 received from the CPU 110, which includes the low voltage set value information 201 and the voltage change interval information 202, it outputs control information 161 containing a voltage change instruction to the voltage/frequency control unit 170, and it outputs interrupt information 162, which indicates that the voltage change has completed, to the CPU 110.

The voltage/frequency control unit 170 changes the operating voltage and operating frequency of the processor 10. In this embodiment, a single voltage/frequency control unit 170 controls the voltages of the CPU 110 and the CPU 120 in common, but separate voltage/frequency control units may control them independently.

FIG. 2 shows an example of the data stored in the SRAM 150. The low voltage set value information 201 specifies the operating voltage in the low-voltage state and the operating frequency at which the CPU 110 and the CPU 120 can operate at that voltage. The voltage change interval information 202 specifies the time interval at which the operating voltage of the processor 10 is changed. The address offset information 203 holds the address offsets of the storage areas on the SRAM 150 that the CPU 110 assigns to the CPU 120.

The task management information 204 is management information for the tasks that the CPU 110 gives to the CPU 120. It indicates, for example, which worker (CPU 120) is processing which task and how many tasks have been completed overall. The task queue 205 is the queue of tasks that the CPU 110 gives to the CPU 120; a worker (CPU 120) takes tasks from the task queue 205 and processes them until the task queue 205 is empty.

The task calculation result information 206 is information about the calculation results of tasks processed by the CPU 120 (worker), such as the addresses at which the results are placed, which the CPU 110 (master) uses to obtain the results. The master work data 207 is data that the CPU 110 (master) generates during processing. The worker work data 208 is data that the CPU 120 (worker) generates during processing.

The input data 209 is the input data to be computed on, for example image data used as training data for machine learning. The survival confirmation information 210 is information for checking whether each worker is alive. The target error number 211 is the target value for the number of erroneous data items that the processor 10 counts in a predetermined process during program execution. The allowable error number 212 is the threshold on that count that the application can tolerate.

FIG. 3 is a time chart showing an example of parallel processing executed in the processor 10 by the CPU 110 (master) and the CPU 120 (workers). First, the CPU 110 (master), at the standard voltage, performs process 301, which covers everything up to the start of parallel processing. Then, in process 302, it creates the task queue and performs worker activation 321. Each CPU 120 (worker) performs worker activation processing 311 and notifies the master when it has completed. Once the master has confirmed that all workers have started, it sets the low voltage set value information 201 and the voltage change interval information 202 in the timer 160 and performs sleep process 303.

Based on the low voltage set value information 201, the timer 160 outputs an instruction to change the operating voltage and operating frequency settings (control information 161) to the voltage/frequency control unit 170. The voltage/frequency control unit 170 changes the operating voltage and operating frequency as instructed, putting the processor 10 into the low-voltage state. Each worker takes a task from the task queue 205 and performs task processing 312 using the input data 209. When a worker finishes a task, it writes the storage address of the result to the SRAM 150 as task calculation result information 206, then takes a new task from the task queue 205 and processes it. The worker repeats this until the task queue 205 is empty.

After the time specified by the voltage change interval information 202 has elapsed, the timer 160 outputs an instruction to change the settings back to the standard voltage (control information 161) to the voltage/frequency control unit 170 and, once the voltage has changed, outputs interrupt information 162 to the master. On receiving the interrupt information 162, the master performs management process 304, in which it checks task progress and whether the workers are alive. Suppose that a worker (worker 2) accessed the SRAM 150 in the low-voltage state and suffered accident 313, stopping because, for example, pointer data was corrupted. In that case the master executes restart process 322 for worker 2. In restart process 322, the master changes the offset value of the address area on the SRAM 150 used by the restarted worker. The restarted worker (worker 2) therefore accesses a different address area than before and avoids stopping again for the same reason as accident 313.

The survival status can be checked, for example, by having each worker periodically increment the survival confirmation information 210 on the SRAM 150 and having the master observe it. If not all tasks are complete in management process 304, the master outputs the control information 111 to the timer 160, as in process 302, and performs sleep process 303. If all tasks are complete in management process 304, the master sends task end notification 323 to the workers and performs post-processing 305.
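The survival check described above can be sketched as a comparison of heartbeat counters between two consecutive management processes. This is a minimal illustration only: the function name `find_stalled_workers` and the dictionary layout are assumptions made for the example, not part of the patent.

```python
def find_stalled_workers(previous, current):
    """Return the IDs of workers whose heartbeat counter did not advance.

    previous / current: dicts mapping a worker ID to the counter value the
    master read during the previous and the current management process
    (hypothetical layout for the survival confirmation information 210).
    """
    stalled = []
    for worker_id in sorted(current):
        # A worker seen for the first time has no previous value and is
        # treated as alive; an unchanged counter means it made no progress.
        if worker_id in previous and current[worker_id] == previous[worker_id]:
            stalled.append(worker_id)
    return stalled


# Worker 2 stopped during the low-voltage interval, so its counter is stuck.
previous = {1: 5, 2: 7}
current = {1: 9, 2: 7}
print(find_stalled_workers(previous, current))
```

The master would restart every worker returned by this check before going back to sleep.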

By ensuring in this way that the SRAM 150 is always at the standard voltage whenever the master accesses it, the data held by the master is retained correctly. In addition, the task queue 205, the task calculation result information 206, and the survival confirmation information 210 are held in triplicate on the SRAM 150, so they can be accessed with high reliability (a state in which the data can be completely restored by correction) even in the low-voltage state. When the workers access the SRAM 150, on the other hand, the SRAM 150 is put into the low-voltage state, which reduces its power consumption.
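One common way to realize triplicated storage is to write three copies and take a bitwise majority vote on read. The sketch below assumes that scheme; the patent only states that the data is held in triplicate and can be restored by correction, so the voting rule, the function names, and the `stride` layout are illustrative assumptions.

```python
def write_triplicated(mem, base, value, stride):
    """Store three copies of value at base, base + stride, base + 2 * stride."""
    for k in range(3):
        mem[base + k * stride] = value


def read_majority(mem, base, stride):
    """Read the three copies and return their bitwise majority."""
    a, b, c = (mem[base + k * stride] for k in range(3))
    # A result bit is 1 when at least two of the three copies have it set.
    return (a & b) | (a & c) | (b & c)


mem = [0] * 12                       # toy word-addressed memory
write_triplicated(mem, 1, 0b1011, stride=4)
mem[5] ^= 0b0100                     # flip one bit in the second copy
print(bin(read_majority(mem, 1, stride=4)))
```

The vote recovers the stored word as long as no bit position is corrupted in more than one copy.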

Next, the means by which the CPU 110 (master) changes the data arrangement of the address area on the SRAM 150 used by a CPU 120 (worker) that it restarts is described with reference to FIGS. 4, 5, and 6, which show the data arrangement of the address areas used by the workers. When starting a worker, the master reserves an address area for the worker's work that is larger than the size actually allocated, sets an offset value and an index value for the address area on the SRAM 150 in the worker, and thereby allocates the address area the worker can use. The offset value is the physical start address of the address area allocated to the worker, and the index value is the logical start address within that allocated area.

As shown in FIG. 4, the master reserves, for example, address area 401 for worker 1 and allocates address area 410 to worker 1 by setting address 411 as the offset value (start address); the remaining address area 451 serves as a margin area. Similarly, it reserves address area 402 for worker 2 and allocates address area 420 by setting address 421 as the offset value. The initial index value is zero. The master's information about the workers' address offsets is stored in the SRAM 150 as address offset information 203, and each worker's offset value and index value are stored in that worker's load/store unit 122.

When worker 2 is restarted, the master changes the offset value of worker 2 to address 422 and restarts it, as shown in FIG. 5. This changes the data arrangement of worker 2, which prevents worker 2 from stopping repeatedly for the same reason.

If worker 2 keeps stopping even after the offset value has been changed, the master changes the index value of worker 2, as shown in FIG. 6. The load/store unit 122 of worker 2 changes the data arrangement by ring-shifting, according to the changed index value, the addresses at which data is placed within address area 420. FIG. 6 (the source text says FIG. 4, apparently in error) shows an example in which the offset value of worker 2 is set to address 422 and address translation is performed so that address 422 becomes the logical start address of worker 2 in accordance with the changed index value.

Changing the data arrangement in this way prevents a restarted worker from stopping over and over for the same reason that it stopped before.
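The offset and index mechanism can be illustrated with a small address translation function. The patent does not give the translation formula, so the mapping below (add the index, wrap around the region size, then add the offset) is one plausible reading of "ring shift"; the function name, the region size, and the concrete addresses are assumptions for the example.

```python
def translate(logical, offset, index, region_size):
    """Map a worker-local logical address to a physical SRAM address.

    offset:      physical start address of the region assigned to the worker
    index:       ring-shift amount applied inside the region
    region_size: number of addressable words in the region
    """
    if not 0 <= logical < region_size:
        raise ValueError("logical address outside the worker's region")
    # Ring shift: addresses pushed past the end of the region wrap around.
    return offset + (logical + index) % region_size


# Initial placement: index 0, offset at the start of the region.
print(hex(translate(0, offset=0x400, index=0, region_size=16)))
# First restart: only the offset moves, so every word lands on a new cell.
print(hex(translate(0, offset=0x408, index=0, region_size=16)))
# Later restart: the index rotates the layout inside the same region.
print(hex(translate(14, offset=0x408, index=4, region_size=16)))
```

Because the master reserves a margin area beyond the allocated size, moving the offset stays inside the reserved range, while changing the index rearranges data without needing any extra space.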

Next, the parallel processing executed by the processor 10 is described with reference to FIGS. 7 and 8. FIG. 7 is a flowchart of the processing performed by the CPU 110 (master) of the processor 10. First, the master creates the task queue 205 (step S701). The information in the task queue 205 is written with increased reliability, for example by triplication, so that the workers can obtain accurate information from the task queue 205 even when the SRAM 150 is in the low-voltage state. Because the task queue 205 is very small compared with the data as a whole, the power overhead of triplication is very small. The master then performs worker activation (step S702), sets the low voltage set value information 201 and the voltage change interval information 202 in the timer 160 as the voltage change process (step S703), and enters the sleep process (step S704). When the master receives the interrupt information 162 from the timer 160 (step S705), it leaves the sleep process and proceeds to step S706, where it checks worker survival and restarts workers as needed. Next, in step S707, it refers to the task management information 204 and checks task progress. It then performs a branch (step S708): if all tasks in the task queue 205 have been processed, it outputs task end notification 323 to all CPUs 120 (workers) and proceeds to step S710; otherwise, it returns to step S703.

In step S710, the master checks whether the calculation results of the tasks processed by the workers satisfy a predetermined format. For example, in the K-means clustering algorithm, a kind of unsupervised learning, the number of the cluster to which each element of the input data belongs is always smaller than the number of clusters K. The master thus checks whether each worker result lies within the range of values that a valid result can take. This avoids fatal errors that would stop the system, such as an array overflow, when the master uses a worker result as an array index. Results that do not satisfy the predetermined format are discarded.
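For the K-means example, the format check of step S710 amounts to a range check on each cluster number. The sketch below is a minimal illustration; the function name `filter_cluster_ids` and the zero-based numbering of clusters are assumptions for the example.

```python
def filter_cluster_ids(results, k):
    """Keep only results that are valid cluster numbers for k clusters.

    Returns (valid_results, discarded_count). A valid cluster number is
    assumed here to be an integer in the range 0 .. k - 1.
    """
    valid = [r for r in results if isinstance(r, int) and 0 <= r < k]
    return valid, len(results) - len(valid)


# 7 and -1 cannot be cluster numbers when K = 4, so they are dropped
# before the master uses the results as array indices.
valid, discarded = filter_cluster_ids([0, 3, 7, -1, 2], k=4)
print(valid, discarded)
```

The discarded count is what the subsequent reliability adjustment compares against the target error number 211 and the allowable error number 212.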

In step S711, the master adjusts reliability so that the number of calculation results that do not satisfy the predetermined format approaches the target error number 211. Reliability is adjusted by changing the voltage value in the low voltage set value information 201. If the number of discarded results is larger than the target error number 211, the voltage is set to a higher value to improve the reliability of the SRAM 150. If the number of discarded results is smaller than the target error number 211, the voltage is set to a lower value to improve the power efficiency of the SRAM 150. Also in step S711, when the number of results that do not satisfy the predetermined format is equal to or greater than the allowable error number 212, the master discards all worker results and branches to step S703 to retry the calculation. The computer system including the processor 10 provides an API (Application Programming Interface) through which the user can easily set the low voltage set value information 201, the target error number 211, and the allowable error number 212. If fine adjustment of the error count is not needed to maintain accuracy, the computer system including the processor 10 may omit step S711.
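The adjustment rule of step S711 can be sketched as a simple step controller. The patent specifies only the direction of the voltage change and the retry condition; the step size, the voltage limits, the millivolt units, and the function name below are assumptions for illustration.

```python
def adjust_low_voltage(voltage_mv, discarded, target_errors, allowed_errors,
                       step_mv=10, v_min_mv=500, v_max_mv=1000):
    """Return (new_voltage_mv, retry) for one pass through step S711.

    step_mv, v_min_mv and v_max_mv are hypothetical tuning values.
    """
    # Too many errors to accept: discard all results and redo the calculation.
    retry = discarded >= allowed_errors
    if discarded > target_errors:
        voltage_mv += step_mv        # raise voltage to improve reliability
    elif discarded < target_errors:
        voltage_mv -= step_mv        # lower voltage to save power
    # Keep the setting inside the assumed safe operating range.
    voltage_mv = max(v_min_mv, min(v_max_mv, voltage_mv))
    return voltage_mv, retry


print(adjust_low_voltage(700, discarded=12, target_errors=5, allowed_errors=10))
print(adjust_low_voltage(700, discarded=2, target_errors=5, allowed_errors=10))
```

A fixed step is the simplest choice; a proportional step based on the distance from the target would converge faster but is not described in the source.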

FIG. 8 is a flowchart of the processing performed by a CPU 120 (worker) started by the CPU 110 (master) in S702 of FIG. 7. The started worker checks the task progress in the task queue 205 in step S801. In step S802, it proceeds to step S820 if all tasks are complete and to S810 if unprocessed tasks remain. In S820, the worker waits until it receives task end notification 323 from the master and then ends its processing. In S810, the worker takes a task from the task queue 205 and writes the identification number of the acquired task and its own worker identification number to the task management information 204, so that it is clear which worker took which task. In step S811, it processes the acquired task. In step S812, it outputs the calculation result to the SRAM 150 and writes the identification number of the completed task and its own worker identification number to the task management information 204, so that it is clear that the task has been completed. The data in the task management information 204 is written with increased reliability, for example by triplication. Throughout the flow from S801 to S820, the worker updates the survival confirmation information 210 at a predetermined interval.
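The worker flow of FIG. 8 can be sketched as a loop over a shared queue. This is a single-threaded illustration under stated assumptions: the data structures (`deque`, dictionaries) and the names `run_worker`, `task_table`, and `heartbeat` are hypothetical stand-ins for the task queue 205, the task management information 204, and the survival confirmation information 210, and the wait for the task end notification in S820 is omitted.

```python
from collections import deque


def run_worker(worker_id, task_queue, task_table, results, process, heartbeat):
    """Process tasks until the queue is empty (S801 to S812)."""
    while task_queue:                                  # S801/S802: tasks left?
        task_id, payload = task_queue.popleft()        # S810: take a task
        task_table[task_id] = (worker_id, "running")   # record who took it
        heartbeat[worker_id] = heartbeat.get(worker_id, 0) + 1
        results[task_id] = process(payload)            # S811: do the work
        task_table[task_id] = (worker_id, "done")      # S812: record completion
    # S820 (waiting for the master's task end notification) is not modelled.


queue = deque([(0, 2), (1, 3)])
table, results, heartbeat = {}, {}, {}
run_worker(1, queue, table, results, lambda x: x * x, heartbeat)
print(results, table, heartbeat)
```

In the patent, the bookkeeping writes go to triplicated areas of the SRAM 150, while the payload and results live in the low-voltage, uncorrected area.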

With the configuration and processing described above, a low-power processor 10 can be realized that avoids stopping the entire system without completely correcting faulty bits in the SRAM.

Next, the means for setting the low voltage set value information 201 and the target error number 211 of the processor 10 will be described with reference to FIGS. 9, 10, and 11. The reliability adjustment process of step S711 shown in FIG. 7 requires that the target error number 211 be set in the program. When a computer system including the processor 10 is provided to a user, it may be difficult for the user to set parameters such as the target error number 211 with the application program in mind. In such a case, the user can set optimal parameters without regard to the application program by running the parameter adjustment program 1003. The parameter adjustment program 1003 executes the application program on the processor 10 using parameter adjustment test data prepared by the user and preset accuracy target value information for the calculation result. It thereby obtains the low voltage set value information 201 that minimizes power within the range in which the accuracy of the calculation result meets the target value, and also obtains the target error number 211.

FIG. 9 is a diagram showing a configuration example of the computer system 1 including the processor 10. The memory 20 is a memory composed of DRAM or the like and stores the information shown in FIG. 10. The input/output unit 30 connects external systems to the computer system 1. The bus 40 connects the components of the computer system 1.

FIG. 10 shows an example of the data stored in the memory 20. The application program 1001 is the application program targeted for parameter adjustment. The test data 1002 is input test data for adjusting the parameters of the low voltage set value information 201 and the target error number 211. The parameter adjustment program 1003 is a program that searches for the optimal parameters of the application program 1001. The accuracy target value information 1004 is reference information that defines the allowable accuracy degradation.

Using the flowchart of the parameter adjustment program 1003 in FIG. 11, a method by which the user obtains the set values of the low voltage set value information 201 and the target error number 211 without regard to the application program will be described. First, the computer system 1 generates correct reference data (step S1101). The correct reference data is the calculation result of a high-reliability run, obtained by executing with the low voltage set value information 201 of the processor 10 set to the standard voltage value (that is, executing all processing at the standard voltage). It is used for comparison with the calculation result of a high-efficiency run that includes low-voltage operation.

In step S1102, the parameter of the low voltage set value information 201 is lowered by the voltage value update width information 1005. That is, at this point it is set to a value lower than the standard voltage by the voltage value update width information 1005. Next, in step S1103, the application program 1001 is executed to obtain the calculation result of a high-efficiency run that includes low-voltage operation. In step S1104, this result is compared with the correct reference data to obtain a calculation accuracy degradation value, which indicates how far the calculation accuracy has degraded in the high-efficiency run.

Then, in step S1105, the calculation accuracy degradation value is compared with the accuracy target value information 1004. If the target calculation accuracy is satisfied, the process returns to step S1102, and the value of the low voltage set value information 201 is lowered by a further voltage value update width information 1005. If the target calculation accuracy is not satisfied in step S1105 of the Nth trial, the low voltage set value information 201 of the (N-1)th trial is taken as the low voltage set value information 201 for the application program 1001. Furthermore, in step S1110, the average of the number of discarded data items counted in step S710 (data integrity check) during the (N-1)th trial is taken as the target error number 211.
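The search of FIG. 11 (steps S1101 to S1110) can be sketched as the loop below. This is a minimal sketch under stated assumptions: `run_app` is a hypothetical callback that runs the application at a given voltage and returns a numeric result together with a list of discarded-data counts, the degradation metric is assumed to be an absolute difference, and `min_voltage` is an added safety bound. None of these details are specified in the description.

```python
def tune_low_voltage(run_app, standard_voltage, step, accuracy_target, min_voltage=0):
    """Sketch of the FIG. 11 parameter search; run_app is a hypothetical callback
    returning (result, discarded_counts) for a given low-voltage setting."""
    # S1101: reference run with every process executed at the standard voltage.
    reference, best_discards = run_app(standard_voltage)
    best_voltage = standard_voltage
    voltage = standard_voltage - step               # S1102: lower by one update width
    while voltage >= min_voltage:
        result, discarded = run_app(voltage)        # S1103: high-efficiency run
        degradation = abs(result - reference)       # S1104: assumed accuracy metric
        if degradation > accuracy_target:           # S1105: target missed on trial N
            break                                   # keep the settings of trial N-1
        best_voltage, best_discards = voltage, discarded
        voltage -= step                             # back to S1102
    # S1110: average discard count of the last passing trial is the target error number.
    target_errors = sum(best_discards) / len(best_discards) if best_discards else 0.0
    return best_voltage, target_errors
```

Working in integer units such as millivolts avoids floating-point drift when the update width is subtracted repeatedly.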

With the above configuration and processing, the user can obtain the set values of the low voltage set value information 201 and the target error number 211 without regard to the application program, and a computer system 1 with reduced power consumption that still satisfies the required calculation accuracy can be realized. The example shown here lowers the low voltage set value information 201 gradually from the standard voltage, that is, it obtains the optimal parameters by moving from a high voltage value to progressively lower ones. It is also possible to obtain the optimal parameters by moving from a low voltage value to progressively higher ones.

The second embodiment describes an example of a computer system 3 in which the power consumption of the DRAM is reduced.

FIG. 12 shows a configuration example of the computer system 3 in this embodiment. The computer system 3 includes a processor 1810, a processor 1820, a bus 40, an input/output unit 30, and a DRAM 1830. Components identical to those in FIG. 9 are given the same reference numerals, and their description is omitted.

The processor 1810 and the processor 1820 are processors composed of CPUs or the like. As in the first embodiment, the computer system 3 performs calculations in a master-worker configuration: the processor 1810 acts as the master and the processor 1820 acts as the worker. The memory 1830 is the memory according to the present invention and is composed of a storage device, such as DRAM, that requires refreshing to prevent its data from being lost.

As shown in FIG. 13, the memory 1830 is composed of an input/output unit 1910, a control unit 1920, a bus 1940, and a storage unit 1930. The bus 1940 connects the components within the memory 1830. The input/output unit 1910 connects the bus 40 to the inside of the memory 1830 and handles communication protocol processing.

The control unit 1920 is the controller of the memory 1830. It performs data writes to and reads from the storage unit 1930, the associated ECC processing, and refresh processing. The control unit 1920 has a storage unit 1921.

As shown in FIG. 14, the storage unit 1921 holds first refresh rate information 2001 and second refresh rate information 2002. The first refresh rate information 2001 is the refresh rate of the high reliability area 1931 of the storage unit 1930, and the second refresh rate information 2002 is the refresh rate of the low power area 1932. Both are set by the processor 1810 (master). A higher refresh rate means more frequent refreshing, which improves the reliability of the storage unit but also increases power consumption. For this reason, the refresh rate of the low power area 1932 (second refresh rate information 2002) is set lower than the refresh rate of the high reliability area 1931 (first refresh rate information 2001).

The storage unit 1930 is a storage device composed of a DRAM array and has a high reliability area 1931 and a low power area 1932. The high reliability area 1931 is an address area operated so that the number of faulty bits in the held data stays within the range that the ECC performed by the control unit 1920 can correct. The low power area 1932 is an address area operated so that the number of faulty bits in the held data falls outside the range that the ECC performed by the control unit 1920 can correct. That is, data written to the low power area 1932 is output to the bus 40 with errors still present when it is read.

FIG. 15 shows the data held in the high reliability area 1931. In FIG. 15, data identical to that in FIG. 2 is given the same reference numerals, and its description is omitted. The rate change interval information 2102 specifies the interval at which the refresh rate is changed. FIG. 16 shows the data held in the low power area 1932. In FIG. 16, data identical to that in FIG. 2 is given the same reference numerals, and its description is omitted. The high reliability area 1931 is accessed by both the master and the worker and stores the data used to control the computer system. The low power area 1932, on the other hand, is accessed by the worker and stores input data such as images and text, intermediate calculation data, and the like.

Next, the processing flow of the computer system 3 will be described using the operation flowchart of the computer system 3 shown in FIG. 17. In FIG. 17, elements identical to those in FIG. 7 are given the same reference numerals, and their detailed description is omitted.

In the parallel processing, the master performs task queue creation (step S701) and worker startup (step S702), then sleeps for a predetermined time (step S704). In the task queue creation of this embodiment, the master stores the created task queue in the high reliability area 1931. Then, in step S705, the master transitions to the active state in response to received interrupt information or an internal timer, performs worker survival confirmation and restart processing (step S706), and checks task progress (step S707). If not all tasks are complete, the process returns to step S704. If all tasks are complete, a data integrity check (step S710) is performed on the results obtained. In step S2311, reliability adjustment is performed by the same means as step S711 in the first embodiment. However, whereas the computer system 1 of the first embodiment adjusts data reliability (that is, the number or proportion of faulty bits in the data) by changing the voltage, the computer system 3 of this embodiment adjusts it by changing the DRAM refresh rate (step S2311). That is, the computer system 3 performs the reliability adjustment by changing the second refresh rate information 2002, which determines the refresh rate of the low power area 1932. When the number of discarded data items is larger than the target error number 211, the refresh rate is set to a higher value to improve the reliability of the low power area 1932 of the DRAM. When the number of discarded data items is smaller than the target error number 211, the refresh rate is set to a lower value to improve the power efficiency of the low power area 1932 of the DRAM.
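The refresh-rate adjustment rule of step S2311 can be illustrated with the short sketch below. It is an assumption-laden example rather than the patented implementation: the function name, the fixed `step` size, and the `min_rate`/`max_rate` clamp are hypothetical, since the description states only the direction in which the rate is moved.

```python
def adjust_refresh_rate(rate, discarded, target_errors, step, min_rate, max_rate):
    """Sketch of step S2311: move the low power area's refresh rate toward the
    target error number. Step size and bounds are assumptions for illustration."""
    if discarded > target_errors:
        rate += step        # too many discarded items: refresh more often
    elif discarded < target_errors:
        rate -= step        # fewer errors than allowed: save refresh power
    # Keep the rate within the (hypothetical) limits supported by the device.
    return max(min_rate, min(max_rate, rate))
```

Called once per completed batch of tasks, this rule makes the discarded-data count hover around the target error number 211, trading refresh power against the amount of work that must be redone.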

In this embodiment as well, the worker started by the processing of S702 executes the series of processes in FIG. 8. It differs from the first embodiment in that the worker processes the input data 208 stored in the low power area 1932 and stores the processing result, the worker work data 209, in the low power area 1932.

With the above configuration and processing, a low-power computer system 3 can be realized that avoids a shutdown of the entire system without fully correcting the faulty bits in the DRAM. In a system using large-capacity DRAM, most of the power consumed by the DRAM goes to refreshing, so the computer system of this embodiment can greatly reduce DRAM power.

Claims

A computer system comprising:
a memory that transitions between a first operating state and a second operating state;
a first processor that accesses the memory in the first operating state, sets the memory to transition to the second operating state, in which the data error rate in the memory is higher than in the first operating state, and stops accessing the memory; and
a second processor that accesses the memory in the second operating state.

The computer system according to claim 1, wherein
in the first operating state, the first processor stores work instruction contents for the second processor in the memory, and
in the second operating state, the second processor reads the work instruction contents from the memory and executes processing.

The computer system according to claim 2, wherein
the second operating state is a state in which an uncorrectable data error of one or more bits occurs in the memory.

The computer system according to claim 2, wherein
the second operating state is determined based on a result of executing processing using a test pattern as input data.

The computer system according to claim 2, wherein
a first storage area of the memory is allocated to the second processor as a usable storage area, and
the first processor checks whether the second processor is operating and, if it is not, allocates a second storage area to the second processor in place of the first storage area and restarts the second processor.

The computer system according to claim 2, wherein
the first processor checks whether a processing result of the second processor satisfies a predetermined condition and, if the condition is not satisfied, causes the work instructed to the second processor to be re-executed.

The computer system according to claim 2, wherein
the memory is an SRAM (Static Random Access Memory), and
the first and second operating states are determined by an operating voltage of the memory, the operating voltage in the second operating state being lower than the operating voltage in the first operating state.

A control method for a computer system comprising a first processor, a second processor, and a memory that transitions between a first operating state and a second operating state, wherein
the first processor accesses the memory in the first operating state, sets the memory to transition to the second operating state, in which the data error rate in the memory is higher than in the first operating state, and stops accessing the memory, and
the second processor accesses the memory in the second operating state.

The control method for a computer system according to claim 8, wherein
in the first operating state, the first processor stores work instruction contents for the second processor in the memory, and
in the second operating state, the second processor reads the work instruction contents from the memory and executes processing.

The control method for a computer system according to claim 9, wherein
the memory is an SRAM (Static Random Access Memory), and
the first and second operating states are determined by an operating voltage of the memory, the operating voltage in the second operating state being lower than the operating voltage in the first operating state.