JP2011113457A

JP2011113457A - Simultaneous multi-threading processor, control method, program, compiling method, and information processor

Info

Publication number: JP2011113457A
Application number: JP2009271499A
Authority: JP
Inventors: Noritaka Hoshi; 宗王星
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-11-30
Filing date: 2009-11-30
Publication date: 2011-06-09

Abstract

PROBLEM TO BE SOLVED: To reduce thread execution time in a simultaneous multi-threading processor.

SOLUTION: The simultaneous multi-threading processor includes: a fetching means which fetches a plurality of instructions belonging to one thread, each with a thread identifier added to identify the thread in which the instruction is to be executed, so that the instructions can be executed in a plurality of threads after decoding; a decoding means which decodes the fetched instructions, generates the plurality of threads, and assigns each instruction to the thread indicated by its thread identifier; and an execution means which executes the instructions by operating the generated threads in parallel.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a technique for executing a plurality of threads by simultaneous multi-threading.

In recent years, several processors have adopted multi-threading, in which a plurality of threads are executed on a physically single processor. The processors disclosed in Patent Document 1 and Patent Document 2 adopt this approach; by switching among a plurality of threads, they can improve the utilization of the arithmetic units and hide memory access latency and similar delays.

Multi-threading schemes differ in how they control the timing of thread switching.

Examples include time-division multi-threading, which switches threads at fixed intervals; a scheme that switches to another thread when the processor would otherwise wait, for example on a cache miss; and simultaneous multi-threading, which decides for each individual instruction in the instruction streams of multiple threads whether that instruction can be executed.

Simultaneous multi-threading is described here. Non-Patent Document 1 describes Hyper-Threading (registered trademark) technology, which implements simultaneous multi-threading.

FIG. 17 is a block diagram showing the configuration of the simultaneous multi-threading processor 801 described in Non-Patent Document 1. As the figure shows, the processor 801 has a single set of execution resources 804, but the architecture state (802 and 803), such as the registers that hold process state, is replicated once per thread. Typically the architecture state is duplicated.

FIG. 18 is a diagram for explaining the pipeline structure of the processor described in Non-Patent Document 1. To simplify the explanation, the figure extracts only the portion from instruction fetch to the register write of out-of-order execution for the cache-hit case, and omits the rest.

The instruction pointer 901 is used to track the instruction execution status of each thread. The cache memory 902 caches the instruction sequence to be executed.

The processor fetches, from the instruction sequence cached in the cache memory 902, the instruction specified by the instruction pointer 901. The instruction fetch queue 903 holds the fetched instructions.

The decoding unit 904 interprets each instruction taken from the instruction fetch queue 903 and converts it into instructions expressed as micro-opcodes.

The opcode queue 905 holds the converted instruction sequence. In the register rename and allocation control stage 906, registers are renamed and resources are allocated to instructions so that out-of-order execution does not cause incorrect behavior. Instructions that have been allocated resources are stored in the queue 907, and the scheduler 908 decides, out of order, which of them to execute.

In the register read stage 909, data is read from the source registers; in the execution stage 910, the data is processed by the circuit corresponding to the instruction. In the register write stage 911, the result is written to the destination register.
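The three back-end stages just described (register read, execute, register write) can be sketched in a few lines. This is a minimal illustration, not the circuit of Non-Patent Document 1: the opcode names, the dictionary-based register file, and the function `run_back_end` are all assumptions made for the example.

```python
import operator

# Hypothetical opcode table; the source does not define an instruction set.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def run_back_end(regs, instr):
    """Pass one instruction through read, execute, and write stages."""
    opcode, dst, srcs = instr
    operands = [regs[s] for s in srcs]   # register read stage
    result = OPS[opcode](*operands)      # execution stage
    regs[dst] = result                   # register write stage
    return result

regs = {"r1": 2, "r2": 3, "r3": 0}
run_back_end(regs, ("add", "r3", ("r1", "r2")))   # r3 <- r1 + r2
```

A real pipeline overlaps these stages across many instructions; the sketch only shows the order in which a single instruction passes through them.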

Note that the processor described in Non-Patent Document 1 actually has a trace cache that stores decoded instructions, and in most cases the instructions of a program are fetched from this trace cache. In FIG. 18, however, the trace cache is omitted for simplicity: the processor reads the instruction indicated by the instruction pointer 901 directly from the cache memory 902 and, after decoding, places it in the opcode queue 905.

Likewise, after the execution stage 910 there are in fact stages that write to the store buffer and the L1D cache before the register write stage, but these are also omitted from FIG. 18 to simplify the explanation.

This sequence of instruction fetch, decode, and execution is the general operation of a processor, whether or not simultaneous multi-threading is used.

Next, the distinctive behavior when simultaneous multi-threading is adopted will be described. In this case the instruction pointer 901 is replicated, and the processor refers to these instruction pointers to fetch the instruction sequences of multiple threads into the instruction fetch queue 903, decodes each of them, and adds them to the opcode queue 905.

As shown in FIG. 18, the instruction pointer 901, the instruction fetch queue 903, and the opcode queue 905 are each duplicated.

In the register rename and allocation control stage 906, the processor manages registers per thread. That is, for an instruction of thread 1 the processor uses the registers belonging to thread 1, and for an instruction of thread 2 it uses the registers belonging to thread 2.

However, the physical register numbers produced by register renaming are not managed separately per thread. Consequently, in the register rename and allocation control stage 906, instructions need not be managed per thread, and instructions of thread 1 and thread 2 are intermixed.

In the stages from the scheduler 908 onward, registers are distinguished by physical register number, so the processor does not need to know whether the instruction being executed belongs to thread 1 or thread 2. It simply performs the operation using the registers designated by their physical register numbers.
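The idea that per-thread logical registers map onto one shared pool of physical registers can be illustrated with a small rename table. This is a sketch under stated assumptions, not the mechanism of Non-Patent Document 1: the class `Renamer`, its method names, and the free-list policy are hypothetical, and freeing of old mappings is left out.

```python
from collections import deque

class Renamer:
    """Toy rename table keyed by (thread id, logical register)."""

    def __init__(self, num_physical):
        self.free = deque(range(num_physical))  # shared physical pool
        self.table = {}                         # (thread, logical) -> physical

    def rename_dest(self, thread_id, logical_reg):
        # Allocate a fresh physical register for a destination operand.
        # Raises IndexError if the pool is exhausted (no stalling modeled).
        phys = self.free.popleft()
        self.table[(thread_id, logical_reg)] = phys
        return phys

    def lookup_src(self, thread_id, logical_reg):
        # Source operands read the thread's current mapping.
        return self.table[(thread_id, logical_reg)]

renamer = Renamer(num_physical=8)
p_thread1 = renamer.rename_dest(1, "r1")
p_thread2 = renamer.rename_dest(2, "r1")   # same logical name, other thread
```

After renaming, the two instructions carry different physical register numbers, so later stages can treat them uniformly without knowing which thread each came from.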

Following the procedure above, in a processor with simultaneous multi-threading the scheduler selects executable instructions from among those supplied by the multiple instruction streams and sends them to the execution stage.

When instructions are supplied from only a single instruction stream, the scheduler may find no executable instruction, leaving the execution stage idle. Even then, a simultaneous multi-threading processor may be able to execute instructions of another thread during that idle time. As a result, the utilization of the execution resources improves. Data wait time is thereby hidden, and the overall processing time, that is, the total processing time of the multiple threads, can be reduced.
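The effect described above, filling idle issue slots with instructions from another thread, can be seen in a toy simulation. The model is an assumption made for illustration: one shared issue slot per cycle, each instruction given as a hypothetical `(name, stall)` pair where `stall` is the number of cycles its thread must then wait for data, and a fixed priority that favors the lowest-numbered ready thread.

```python
from collections import deque

def run_smt(threads):
    """Return (finish_cycle, issue_order) for threads sharing one issue slot."""
    queues = [deque(t) for t in threads]
    ready_at = [0] * len(threads)      # earliest cycle each thread may issue
    order, cycle = [], 0
    while any(queues):
        for tid, queue in enumerate(queues):
            if queue and ready_at[tid] <= cycle:
                name, stall = queue.popleft()
                order.append((cycle, tid, name))
                ready_at[tid] = cycle + 1 + stall
                break                  # only one instruction issues per cycle
        cycle += 1
    return max(ready_at, default=0), order

thread_a = [("load", 2), ("add", 0)]
thread_b = [("load", 2), ("add", 0)]

alone, _ = run_smt([thread_a])                  # one thread by itself
together, order = run_smt([thread_a, thread_b]) # both threads interleaved
```

In this example each thread alone finishes in 4 cycles, so running them back to back takes 8, while interleaving them finishes in 5 because thread B issues during thread A's data wait.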

Patent Document 1: JP 2001-236221 A
Patent Document 2: JP 2006-39815 A

Non-Patent Document 1: Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, Michael Upton, "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal, Q1 2002, Vol. 6, Issue 1

However, the processor described in Non-Patent Document 1 has the problem that it is difficult to reduce the execution time of a thread.

As described above, adopting the simultaneous multi-threading scheme described in Non-Patent Document 1 improves the utilization of resources such as arithmetic units, so the processing capability of the processor can be used effectively.

In a simultaneous multi-threading processor, however, multiple threads share a physically single set of resources, so the execution time of any one thread, considered in isolation, increases.

This increase in execution time will be described with reference to FIG. 15. As shown in part (a) of the figure, let T1 be the execution time when thread 1 is executed alone. If the arithmetic unit is actually in use for a time t1 of this, then the arithmetic unit is unused for the time "T1 − t1", for reasons such as waiting for data to arrive.

Then, as shown in FIG. 15(b), if thread 2 is executed alone after thread 1, with an execution time of T2 during which the arithmetic unit is in use for a time t2, the arithmetic unit is unused for the time "T2 − t2".

A processor that can execute only a single thread needs an execution time of "T1 + T2" to complete the processing of thread 1 and thread 2.

A simultaneous multi-threading processor, by contrast, can execute instructions of thread 2 while thread 1 is not using the arithmetic unit. If the times t1 and t2 are sufficiently small relative to the times T1 and T2, the processor may be able to complete the other thread, thread 2, entirely within the wait time of the thread with the longer processing time, for example within the wait time "T1 − t1" of thread 1.

Therefore, if the times t1 and t2 are sufficiently small relative to the times T1 and T2, simultaneous multi-threading can improve performance substantially.

However, if the times t1 and t2 are not particularly small relative to the times T1 and T2, the processing of the other thread may not complete within the wait time of the thread with the longer processing time. In that case, even if the data wait time can be hidden, the performance improvement is marginal. In some cases performance is even lower than on a single-thread processor, because of the overhead of supplying instructions for two threads and contention between the threads in the cache memory.

Thus, when the arithmetic unit is in use for a long time relative to the data wait time, applying simultaneous multi-threading may increase execution time and fail to improve performance.
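The timing argument above can be made concrete with a little arithmetic. The sketch below is an idealized model and not part of the source: `smt_best_case` assumes a single shared arithmetic unit, no switching overhead, and no cache contention, so it gives only a lower bound on the two-thread completion time. The function names and sample numbers are illustrative.

```python
def serial_time(T1, T2):
    """Single-thread processor: the two threads run back to back."""
    return T1 + T2

def fits_in_idle_time(T1, t1, t2):
    """True if thread 2's arithmetic work fits in thread 1's wait time T1 - t1."""
    return t2 <= T1 - t1

def smt_best_case(T1, t1, T2, t2):
    """Idealized lower bound: neither thread beats its solo time, and the
    shared arithmetic unit is busy for t1 + t2 in total."""
    return max(T1, T2, t1 + t2)

# Wait-dominated threads: t1 and t2 are small relative to T1 and T2.
light = (serial_time(100, 80), smt_best_case(100, 10, 80, 10))

# Compute-dominated threads: t1 and t2 are close to T1 and T2.
heavy = (serial_time(100, 80), smt_best_case(100, 90, 80, 70))
```

With the wait-dominated numbers the bound drops from 180 to 100, while with the compute-dominated numbers it only drops from 180 to 160, before any overhead or cache contention is taken into account.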

Reducing the data wait time is an effective way to shorten the execution time of a single thread. Methods for reducing the data wait time include having the processor issue load instructions, which supply data to registers, as early as possible, and controlling the processor so that it can issue many load instructions at the same time.

However, the effectiveness of these methods is limited by the number of software-visible registers, that is, the registers available to a program. For example, if no free register is available to an instruction, the processor cannot issue a load instruction ahead of time in the first place. A sufficient number of software-visible registers is therefore required to reduce the data wait time.

Regarding the issuing of load instructions, one known technique provides many more physical registers than there are logical, software-visible registers and has the hardware rename the registers (register renaming), so that load instructions can be issued ahead of time.

However, register renaming does not eliminate the load and store instructions themselves. In cases where register spills occur, in which data in registers is explicitly swapped out, the reduction in data wait time is therefore limited.

For these reasons, it has been difficult to reduce the execution time of a thread, and in particular the data wait time, in a simultaneous multi-threading processor.

An object of the present invention is to provide a technique for reducing the execution time of a thread executed on a simultaneous multi-threading processor.

To achieve the above object, the simultaneous multi-threading processor of the present invention includes: fetch means for fetching a plurality of instructions belonging to a single thread, each instruction having added to it a thread identifier that identifies, among a plurality of threads, the thread in which that instruction is to be executed, so that the instructions are executed in the plurality of threads after decoding; decoding means for decoding the plurality of instructions fetched by the fetch means, generating the plurality of threads, and assigning each instruction to the thread indicated by the thread identifier added to it; and execution means for executing the plurality of instructions by operating the plurality of threads generated by the decoding means in parallel.

The control method for a simultaneous multi-threading processor of the present invention includes: fetching a plurality of instructions belonging to a single thread, each instruction having added to it a thread identifier that identifies, among a plurality of threads, the thread in which that instruction is to be executed, so that the instructions are executed in the plurality of threads after decoding; decoding the fetched instructions, generating the plurality of threads, and assigning each instruction to the thread indicated by the thread identifier added to it; and executing the plurality of instructions by operating the generated threads in parallel.

The program of the present invention is a program for performing compilation, and causes a computer to execute: a conversion procedure of converting source code into a plurality of instructions belonging to a single thread so that the instructions are executed in a plurality of threads after decoding; and an appending procedure of adding to each instruction a thread identifier that identifies, among the plurality of threads, the thread in which the instruction is to be executed.

The compilation method of the present invention converts source code into a plurality of instructions belonging to a single thread so that the instructions are executed in a plurality of threads after decoding, and adds to each instruction a thread identifier that identifies, among the plurality of threads, the thread in which the instruction is to be executed.

The information processing apparatus of the present invention includes: a compiler that converts source code into a plurality of instructions belonging to a single thread so that the instructions are executed in a plurality of threads after decoding, and adds to each instruction a thread identifier that identifies, among the plurality of threads, the thread in which the instruction is to be executed; and a simultaneous multi-threading processor that fetches the plurality of instructions to which the compiler has added thread identifiers, decodes the fetched instructions, generates the plurality of threads, assigns each instruction to the thread indicated by the thread identifier added to it, and executes the plurality of instructions by operating the generated threads in parallel.

According to the present invention, the simultaneous multi-threading processor fetches a plurality of instructions belonging to a single thread, generates a plurality of threads, and assigns each instruction to the thread indicated by its thread identifier. Instructions that belong to a single thread before decoding can therefore be processed by a plurality of threads after decoding. As a result, a single thread can use the resources of several threads, and the execution time of the thread is reduced.

A block diagram showing a configuration example of the information processing apparatus of the first embodiment of the present invention.
A flowchart showing the compilation method of the first embodiment of the present invention.
A diagram showing an example of the instruction word format of the first embodiment of the present invention.
A block diagram showing a configuration example of the processor of the first embodiment of the present invention.
A diagram showing an example of the instruction word format of the first embodiment of the present invention.
A block diagram showing a configuration example of the execution unit of the first embodiment of the present invention.
A diagram showing an example of the pipeline of the first embodiment of the present invention.
(a), (b) Diagrams for explaining the operation of the decoding unit of the first embodiment of the present invention.
(a), (b) Diagrams for explaining the operation of the decoding unit of the first embodiment of the present invention.
A diagram for explaining the operation of the decoding unit of the first embodiment of the present invention.
(a), (b) Diagrams showing examples of the operation results of the control unit and the execution unit of the first embodiment of the present invention.
(a), (b) Diagrams showing examples of the operation results of the control unit and the execution unit of the first embodiment of the present invention.
An example of the source code of the first embodiment of the present invention.
An example of the instruction sequence held in the instruction fetch queue of the first embodiment of the present invention.
(a), (b) Examples of the instruction sequences held in the opcode queue of the first embodiment of the present invention.
A diagram showing an example of the operation results of the control unit and the execution unit of the second embodiment of the present invention.
A block diagram showing the configuration of a typical processor.
A diagram showing a typical pipeline.
(a) to (f) Diagrams showing typical single-thread execution times.

(First embodiment)
A first embodiment for carrying out the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of the information processing apparatus 1 of this embodiment. As the figure shows, the information processing apparatus 1 includes a processor 10, a main memory 20, and a storage unit 30.

The processor 10 is a microprocessor capable of executing a plurality of threads by simultaneous multi-threading. The main memory 20 is a main storage device that temporarily holds data and programs. The storage unit 30 stores a compiler 301.

The compiler 301 will be described with reference to FIG. 2, which is a flowchart showing the compilation method of this embodiment. The flowchart is carried out by the processor 10 executing the compiler 301.

The processor 10 converts source code into object code containing an instruction sequence that the processor 10 can execute (step S1). In this embodiment, during this conversion the compiler 301 generates an instruction sequence belonging to a single thread, on the premise that the instructions will be executed by a plurality of threads after decoding.

The processor 10 then adds a thread identifier to each instruction (step S2). Here, a thread identifier is an identifier that identifies the thread in which the instruction is to be executed.

An example of how a thread identifier is added to each instruction is as follows. Suppose that a DO loop is to be executed by two threads after decoding. In this case, the processor 10 adds thread identifiers to the instructions of each iteration so that iterations with an odd loop variable are executed by one thread and iterations with an even loop variable are executed by the other thread.
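The odd/even assignment described for the DO loop can be sketched as follows. This is only an illustration of the tagging rule, not the compiler 301 itself: the function `tag_loop_iterations`, the tuple layout, and the use of `i % 2` as the identifier value are assumptions.

```python
def tag_loop_iterations(start, stop, body_ops, num_threads=2):
    """Tag every operation of each loop iteration with a thread identifier.

    Returns a single instruction stream of (thread_id, op, i) tuples.
    With two threads, odd i maps to identifier 1 and even i to identifier 0.
    """
    tagged = []
    for i in range(start, stop + 1):      # DO loop bounds are inclusive
        thread_id = i % num_threads
        for op in body_ops:
            tagged.append((thread_id, op, i))
    return tagged

stream = tag_loop_iterations(1, 4, ["load", "add"])
```

The result is still one stream in program order; only the identifier attached to each instruction says which thread should run it after decoding.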

Next, the format of the instruction word that describes each instruction will be explained. FIG. 3 is a diagram showing an example of the format of the instruction word 401 used in this embodiment. As the figure shows, the instruction word 401 is a fixed-length bit string containing an instruction code 402, a destination register designation field 403, a source register designation field 404, and a thread identifier 405.

The instruction code 402 holds a number that designates the instruction to be executed by the processor 10. The destination register designation field 403 holds the number of the register to which the result of the instruction is written. The source register designation field 404 holds the number of the register from which the data used in executing the instruction is read. The thread identifier 405 is the thread identifier corresponding to the instruction code 402.

The source register designation field 404 normally specifies one or more register numbers, and the number of registers varies with the instruction code. Depending on the instruction set architecture, registers with an implicitly restricted purpose may be used as destination or source registers, in which case they are not named explicitly in the instruction word. In either case, the registers that can be designated as destination or source registers are limited to the registers within the architecture state of a single thread.

In this embodiment, as shown in FIG. 3, the thread identifier 405 is stored in a field of the instruction word 401 and is decoded by the processor 10 together with the instruction code 402.
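A fixed-length instruction word with these four fields can be packed and unpacked with shifts and masks. The layout below is purely an assumption for illustration: the source does not give field widths or ordering, so this sketch uses a 32-bit word with four 8-bit fields and a single source register, and the names `encode_word` and `decode_word` are hypothetical.

```python
FIELD_BITS = 8                      # assumed width of every field
FIELD_MASK = (1 << FIELD_BITS) - 1

def encode_word(opcode, dst, src, thread_id):
    """Pack instruction code, destination, source, and thread identifier."""
    for value in (opcode, dst, src, thread_id):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("field value does not fit in 8 bits")
    return (opcode << 24) | (dst << 16) | (src << 8) | thread_id

def decode_word(word):
    """Unpack a word into (opcode, dst, src, thread_id)."""
    return (
        (word >> 24) & FIELD_MASK,
        (word >> 16) & FIELD_MASK,
        (word >> 8) & FIELD_MASK,
        word & FIELD_MASK,
    )

word = encode_word(opcode=0x12, dst=3, src=7, thread_id=1)
fields = decode_word(word)
```

Because the thread identifier sits in the same word as the instruction code, the decoder obtains both in one step, which matches the description that they are decoded together.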

By executing the compiler 301, the processor 10 generates object code containing an instruction sequence with thread identifiers added, and then executes that object code.

FIG. 4 is a block diagram showing a configuration example of the processor 10 of this embodiment. As the figure shows, the processor 10 includes an instruction fetch unit 101, a decoding unit 102, a control unit 103, and an execution unit 104.

The decoding unit 102 includes a thread identifier determination unit 1021, and the control unit 103 includes an output destination identifier determination unit 1031.

The instruction fetch unit 101 fetches, from the object code output by the compiler 301, the instruction specified by the instruction pointer.

The decoding unit 102 decodes the plurality of fetched instructions. The thread identifier determination unit 1021 generates a plurality of threads, refers to the thread identifier added to each instruction, and assigns the instruction to the thread corresponding to that identifier.

また、スレッド識別子判定部１０２１は、命令語に付加されたスレッド識別子に応じて、各命令語に出力先識別子を付加する。出力先識別子は、命令の実行結果の出力先とするレジスタを識別するための識別子である。より具体的には、出力先識別子は、複数のスレッドのうち、出力先のレジスタの属するスレッドを特定するための識別子である。 Further, the thread identifier determination unit 1021 adds an output destination identifier to each instruction word according to the thread identifier added to that instruction word. The output destination identifier identifies the register to which the execution result of an instruction is output. More specifically, the output destination identifier specifies, among the plurality of threads, the thread to which the output destination register belongs.

制御部１０３は、出力先識別子に従って、レジスタ・リネーミング、および割り当て制御を行う。 The control unit 103 performs register renaming and assignment control according to the output destination identifier.

図５は、出力先識別子が付加された、命令語４０１を示す図である。同図に示すように、命令語４０１には、スレッド識別子４０５の代わりに出力先識別子４０６が付加される。 FIG. 5 is a diagram illustrating an instruction word 401 to which an output destination identifier is added. As shown in the figure, an output destination identifier 406 is added to the instruction word 401 instead of the thread identifier 405.

実行部１０４について説明する。図６は、実行部１０４の一構成例を示すブロック図である。同図に示すように、実行部１０４は、レジスタ群１０４１、および１０４２と、演算器１０４３とを有する。 The execution unit 104 will be described. FIG. 6 is a block diagram illustrating a configuration example of the execution unit 104. As shown in the figure, the execution unit 104 includes register groups 1041 and 1042 and an arithmetic unit 1043.

レジスタ群１０４１、および１０４２は、それぞれ異なるスレッドに属し、演算器１０４３を共有する。 The register groups 1041 and 1042 belong to different threads and share the arithmetic unit 1043.

図６のレジスタ群１０４１、および１０４２は、図１７のアーキテクチャ・ステートに相当する。図６の演算器１０４３は、図１７の実行リソースの一例である。 The register groups 1041 and 1042 in FIG. 6 correspond to the architectural states in FIG. 17. The arithmetic unit 1043 in FIG. 6 is an example of the execution resources in FIG. 17.

実際には、レジスタファイル自体にスレッドの区別はなく、その都度、物理番号でスレッドに対応するレジスタが指定されるのだが、図６では、説明を容易にするため、論理的にスレッドごとに区分された状態のレジスタ群を表記している。 In practice, the register file itself makes no distinction between threads, and the register corresponding to a thread is specified each time by its physical number. For ease of explanation, however, FIG. 6 depicts the registers as groups logically partitioned by thread.

また、実行リソースには、通常、複数の演算器が含まれるが、図６では、説明を容易にするため、演算器１０４３以外の演算器を省略している。 In addition, the execution resource usually includes a plurality of arithmetic units, but in FIG. 6, arithmetic units other than the arithmetic unit 1043 are omitted for easy explanation.

実行部１０４は、アウト・オブ・オーダー実行で、スケジューリングを行う。実行部１０４は、命令に対応する演算器で、ソースレジスタから読み出したデータを処理し、処理結果をデスティネーションレジスタに格納する。 The execution unit 104 performs scheduling with out-of-order execution. Using the arithmetic unit corresponding to the instruction, the execution unit 104 processes the data read from the source register and stores the result in the destination register.

続いて、プロセッサ１０のパイプライン構造について詳細に説明する。図７は、プロセッサ１０のパイプライン構造の一例を示す図である。 Next, the pipeline structure of the processor 10 will be described in detail. FIG. 7 is a diagram illustrating an example of a pipeline structure of the processor 10.

キャッシュメモリ１０１２には、オブジェクトコード内の命令列が供給される。プロセッサ１０は、その命令列の中から、命令ポインタ１０１１で指定された命令をフェッチし、命令フェッチキューＣ１に追加する。命令フェッチキューＣ１は、フェッチされた命令を保持する。 The cache memory 1012 is supplied with an instruction sequence in the object code. The processor 10 fetches the instruction specified by the instruction pointer 1011 from the instruction sequence and adds it to the instruction fetch queue C1. The instruction fetch queue C1 holds fetched instructions.

デコード部１０２は、命令フェッチキューＣ１から取り出された命令を解釈し、マイクロオペコードで記述された命令に変換する。 The decoding unit 102 interprets the instruction fetched from the instruction fetch queue C1, and converts it into an instruction described in micro opcode.

スレッド識別子判定部１０２１は、スレッド識別子に応じて、どのオペコードキューに命令を追加するかを判定する。本実施形態では、オペコードキューは、Ｃ２、およびＣ３に多重化されており、オペコードキューＣ２、およびＣ３のうち、いずれか一方、または、両方に命令が追加される。 The thread identifier determination unit 1021 determines, according to the thread identifier, to which opcode queue the instruction is added. In this embodiment, the opcode queue is multiplexed into C2 and C3, and an instruction is added to one or both of the opcode queues C2 and C3.

また、スレッド識別子判定部１０２１は、命令に付加されたスレッド識別子に応じて、命令に出力先識別子を付加する。 Further, the thread identifier determination unit 1021 adds an output destination identifier to the instruction according to the thread identifier added to the instruction.

オペコードキューＣ２、およびＣ３は、追加された命令を、付加された出力先識別子とともに保持する。 The opcode queues C2 and C3 hold each added instruction together with the output destination identifier added to it.

出力先識別子判定部１０３１は、命令に付加された出力先識別子に基づいて、出力先のレジスタを、いずれのスレッドに属するレジスタにするかを決定する。言い換えれば、出力先識別子判定部１０３１は、出力先識別子に応じて、デスティネーションレジスタの属するスレッドを切り替える。 Based on the output destination identifier added to the instruction, the output destination identifier determination unit 1031 determines which thread the output destination register is to belong to. In other words, the output destination identifier determination unit 1031 switches the thread to which the destination register belongs according to the output destination identifier.

制御部１０３は、出力先識別子判定部１０３１の決定に従って、各命令について、レジスタ・リネーミング、およびレジスタの割り当てを行う。制御部１０３は、レジスタが割り当てられた命令をキューＣ４に追加する。キューＣ４は、追加された命令を保持する。 The control unit 103 performs register renaming and register allocation for each instruction according to the determination of the output destination identifier determination unit 1031. The control unit 103 adds the instruction to which the register is assigned to the queue C4. The queue C4 holds the added instruction.

実行部１０４内のスケジューラは、キューＣ４から取り出した命令について、アウト・オブ・オーダーで実行判定を行う（Ｓ１１）。 The scheduler in the execution unit 104 performs execution determination out-of-order with respect to the instruction fetched from the queue C4 (S11).

実行部１０４は、ソースレジスタからデータを読み出す（Ｓ１２）。実行部１０４は、読み出したデータを命令に対応する回路で処理することにより、命令を実行する（Ｓ１３）。実行部１０４は、命令の実行結果をデスティネーションレジスタに書き込む（Ｓ１４）。 The execution unit 104 reads data from the source register (S12). The execution unit 104 executes the instruction by processing the read data with the circuit corresponding to the instruction (S13). The execution unit 104 writes the execution result of the instruction to the destination register (S14).

図７で説明した本実施形態のパイプライン構造と、図１８で説明した、一般的なパイプライン構造との間の相違について整理して説明する。 Differences between the pipeline structure of the present embodiment described in FIG. 7 and the general pipeline structure described in FIG. 18 will be summarized and described.

まず、上述したように、本実施形態のデコード部１０２は、図７に示すようにスレッド識別子判定部１０２１を有しているのに対し、図１８のデコード部９０４は有していない。 First, as described above, the decoding unit 102 of this embodiment has the thread identifier determination unit 1021 as shown in FIG. 7, whereas the decoding unit 904 of FIG. 18 does not.

次に、本実施形態のオペコードキューＣ２、Ｃ３は、図７に示すように、命令のみならず、デコード前に命令に付加された出力先識別子を保持する。これに対して、図１８のオペコードキュー９０５には、出力先識別子が保持されない。 Next, as shown in FIG. 7, the opcode queues C2 and C3 of this embodiment hold not only instructions but also the output destination identifiers added to the instructions before decoding. In contrast, the opcode queue 905 of FIG. 18 does not hold an output destination identifier.

更に、本実施形態の制御部１０３は、図７に示すように出力先識別子判定部１０３１を有し、出力先識別子に従って、出力先のレジスタを指定するが、図１８のレジスタ・リネーム及び割り当て制御ステージ９０６では、出力先識別子に従ってレジスタの指定が行われることはない。 Furthermore, as shown in FIG. 7, the control unit 103 of this embodiment has the output destination identifier determination unit 1031 and designates the output destination register according to the output destination identifier, whereas the register renaming and allocation control stage 906 of FIG. 18 does not designate registers according to an output destination identifier.

ここで、本実施形態の命令ポインタ１０１１、および命令フェッチキューＣ１は、図７では多重化されていないように記載されているが、実際には、図１８の命令ポインタおよび命令フェッチキューのように物理的または論理的に多重化されている。本実施形態のプロセッサ１０は、単一スレッドの性能を向上させることを目的としており、以後の説明に不要であるため、図７では、命令ポインタ１０１１、および命令フェッチキューＣ１の多重化に関する記載を省略している。 Here, although FIG. 7 depicts the instruction pointer 1011 and the instruction fetch queue C1 of this embodiment as if they were not multiplexed, they are in fact physically or logically multiplexed, like the instruction pointer and instruction fetch queue of FIG. 18. Because the processor 10 of this embodiment aims to improve single-thread performance and the multiplexing is not needed for the following description, FIG. 7 omits the multiplexing of the instruction pointer 1011 and the instruction fetch queue C1.

命令ポインタ１０１１、および命令フェッチキューＣ１が多重化されている場合、多重化されたもののうち、少なくとも１つが有効に動作していれば、本発明を適用可能である。 When the instruction pointer 1011 and the instruction fetch queue C1 are multiplexed, the present invention can be applied if at least one of the multiplexed ones is operating effectively.

まとめると、本実施形態のパイプライン構造は、スレッド識別子判定部１０２１、および出力先識別子判定部１０３１を有し、オペコードキューが出力先識別子を更に保持する点で、図１８のパイプライン構造と異なる。 In summary, the pipeline structure of this embodiment differs from that of FIG. 18 in that it has the thread identifier determination unit 1021 and the output destination identifier determination unit 1031, and in that the opcode queues additionally hold the output destination identifier.

図８〜図１０を参照して、スレッド識別子判定部１０２１の動作について説明する。図８〜図１０は、スレッド識別子判定部１０２１の動作を説明するための図である。 The operation of the thread identifier determination unit 1021 will be described with reference to FIGS. 8 to 10. FIGS. 8 to 10 are diagrams for explaining the operation of the thread identifier determination unit 1021.

本実施形態において、スレッド識別子は３ビットで定義される。最上位のビットは、命令が割り当てられるスレッドの個数を示すビットとする。２ビット目は、１つのスレッドに命令を割り当てる場合に、２つのスレッドのうち、いずれのスレッドに割り当てるのかを示すビットとする。最下位のビットは、付加する出力先識別子を示すビットとする。 In this embodiment, the thread identifier is defined by 3 bits. The most significant bit is a bit indicating the number of threads to which an instruction is assigned. The second bit is a bit indicating which of the two threads is allocated when an instruction is allocated to one thread. The least significant bit is a bit indicating an output destination identifier to be added.

例えば、最上位のビットが「０」である場合、１つのスレッドに命令が割り当てられ、「１」である場合、２つのスレッドに割り当てられる。２ビット目が「０」である場合、ＯＣ２１が属するスレッドに命令が割り当てられ、「１」である場合、ＯＣ２２が属するスレッドに割り当てられる。最下位のビットが「０」である場合、出力先識別子「０」が付加され、「１」である場合、出力先識別子「１」が付加される。 For example, when the most significant bit is “0”, an instruction is assigned to one thread, and when it is “1”, it is assigned to two threads. When the second bit is “0”, an instruction is assigned to the thread to which the OC 21 belongs, and when it is “1”, the instruction is assigned to the thread to which the OC 22 belongs. When the least significant bit is “0”, the output destination identifier “0” is added, and when it is “1”, the output destination identifier “1” is added.

出力先識別子「０」は、デスティネーションレジスタを、ソースレジスタの属するスレッドと同じスレッドに属するレジスタにするための識別子とする。出力先識別子「１」は、デスティネーションレジスタを、ソースレジスタの属するスレッドと異なるスレッドに属するレジスタにするための識別子とする。 The output destination identifier “0” is an identifier for making the destination register a register belonging to the same thread as the thread to which the source register belongs. The output destination identifier “1” is an identifier for making the destination register a register belonging to a thread different from the thread to which the source register belongs.

出力先識別子は、命令列の示す処理内容や、スレッドの分割の仕方などに応じて、コンパイラ３０１が決定する。 The output destination identifier is determined by the compiler 301 according to the processing content indicated by the instruction sequence, the method of dividing the thread, and the like.

例えば、処理を複数スレッドに分割した後、それらのスレッドの結果を集約する場合、出力先識別子「１」が付加される。 For example, when processing is divided among a plurality of threads and the results of those threads are then aggregated, the output destination identifier “1” is added.

具体的には、ループ変数を１〜２５６の「I」として、A＝A＋B（I）の計算を行う場合を考える。コンパイラ３０１は、スレッド１で「I」が奇数のときのBの総和を求め、スレッド２で「I」が偶数のときのBの総和を求め、ループ外で両スレッドの実行結果を加算することを前提として、命令列を生成する。この場合、両スレッドの実行結果を加算する命令に出力先識別子「１」が付加される。 Specifically, consider computing A = A + B(I) with the loop variable “I” ranging from 1 to 256. The compiler 301 generates the instruction sequence on the premise that thread 1 obtains the sum of B for odd “I”, thread 2 obtains the sum of B for even “I”, and the execution results of both threads are added outside the loop. In this case, the output destination identifier “1” is added to the instruction that adds the execution results of both threads.
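The loop split described above can be illustrated with a minimal Python sketch. It assumes made-up data for B, accumulates the odd-indexed elements in one partial sum (standing in for thread 1) and the even-indexed elements in another (standing in for thread 2), and adds the two partial sums after the loop. The names `split_sum`, `partial_1`, and `partial_2` are illustrative and do not appear in the specification, and the two partial sums are computed sequentially here, whereas the processor would run them as parallel threads.

```python
def split_sum(b, a0=0):
    """Compute A = A + B(I) for I = 1..len(b) using two partial sums."""
    partial_1 = a0  # stands in for thread 1: odd I
    partial_2 = 0   # stands in for thread 2: even I
    for i in range(1, len(b) + 1, 2):
        partial_1 += b[i - 1]        # B(I) with I odd (b is 0-indexed)
        if i + 1 <= len(b):
            partial_2 += b[i]        # B(I + 1), where I + 1 is even
    # Outside the loop, the results of both threads are added together.
    return partial_1 + partial_2


B = list(range(1, 257))  # illustrative values for B(1)..B(256)
total = split_sum(B)
```

The final addition is the step that reads a value produced by the other thread, which is why the text attaches the output destination identifier “1” to it.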

図８（ａ）は、スレッド識別子「０００」が付加された命令をデコードする場合の、スレッド識別子判定部１０２１の動作を説明するための図である。最上位のビットが「０」なので、スレッド識別子判定部１０２１は、１つのスレッドに命令を割り当て、２ビット目が「０」なので、命令が割り当てられるスレッドは、ＯＣ２１が属するスレッドである。また、最下位のビットが「０」なので、スレッド識別子判定部１０２１は、出力先識別子「０」を付加する。 FIG. 8A is a diagram for explaining the operation of the thread identifier determination unit 1021 when an instruction to which the thread identifier “000” is added is decoded. Since the most significant bit is “0”, the thread identifier determination unit 1021 assigns an instruction to one thread, and since the second bit is “0”, the thread to which the instruction is assigned is a thread to which the OC 21 belongs. Since the least significant bit is “0”, the thread identifier determination unit 1021 adds the output destination identifier “0”.

図８（ｂ）は、スレッド識別子「００１」が付加された命令をデコードする場合の、スレッド識別子判定部１０２１の動作を説明するための図である。最上位のビットが「０」、２ビット目が「０」なので、スレッド識別子判定部１０２１は、ＯＣ２１が属するスレッドに命令を割り当てる。また、最下位のビットが「１」なので、スレッド識別子判定部１０２１は、出力先識別子「１」を付加する。 FIG. 8B is a diagram for explaining the operation of the thread identifier determination unit 1021 when an instruction to which the thread identifier “001” is added is decoded. Since the most significant bit is “0” and the second bit is “0”, the thread identifier determination unit 1021 assigns an instruction to the thread to which the OC 21 belongs. Since the least significant bit is “1”, the thread identifier determination unit 1021 adds the output destination identifier “1”.

図９（ａ）は、スレッド識別子「０１０」が付加された命令をデコードする場合の、スレッド識別子判定部１０２１の動作を説明するための図である。最上位のビットが「０」、２ビット目が「１」なので、スレッド識別子判定部１０２１は、ＯＣ２２が属するスレッドに命令を割り当てる。また、最下位のビットが「０」なので、スレッド識別子判定部１０２１は、出力先識別子「０」を付加する。 FIG. 9A is a diagram for explaining the operation of the thread identifier determination unit 1021 when decoding an instruction to which the thread identifier “010” is added. Since the most significant bit is “0” and the second bit is “1”, the thread identifier determination unit 1021 assigns an instruction to the thread to which the OC 22 belongs. Since the least significant bit is “0”, the thread identifier determination unit 1021 adds the output destination identifier “0”.

図９（ｂ）は、スレッド識別子「０１１」が付加された命令をデコードする場合の、スレッド識別子判定部１０２１の動作を説明するための図である。最上位のビットが「０」、２ビット目が「１」なので、スレッド識別子判定部１０２１は、ＯＣ２２が属するスレッドに命令を割り当てる。また、最下位のビットが「１」なので、スレッド識別子判定部１０２１は、出力先識別子「１」を付加する。 FIG. 9B is a diagram for explaining the operation of the thread identifier determination unit 1021 when decoding an instruction to which the thread identifier “011” is added. Since the most significant bit is “0” and the second bit is “1”, the thread identifier determination unit 1021 assigns an instruction to the thread to which the OC 22 belongs. Since the least significant bit is “1”, the thread identifier determination unit 1021 adds the output destination identifier “1”.

図１０は、スレッド識別子「１００」が付加された命令をデコードする場合の、スレッド識別子判定部１０２１の動作を説明するための図である。最上位のビットが「１」なので、スレッド識別子判定部１０２１は、２つのスレッドに命令を割り当てる。また、最下位のビットが「０」なので、スレッド識別子判定部１０２１は、出力先識別子「０」を付加する。 FIG. 10 is a diagram for explaining the operation of the thread identifier determination unit 1021 when an instruction to which the thread identifier “100” is added is decoded. Since the most significant bit is “1”, the thread identifier determination unit 1021 assigns instructions to two threads. Since the least significant bit is “0”, the thread identifier determination unit 1021 adds the output destination identifier “0”.
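The five cases of FIGS. 8 to 10 can be reproduced with a small decoder sketch in Python. It assumes the thread identifier is given as a three-character bit string, and it labels the two target threads by their opcode queues C2 and C3 (thread 1 and thread 2, respectively). The function name `decode_thread_identifier` is hypothetical, and ignoring the second bit when the most significant bit is “1” is an assumption, since the text defines that bit only for the single-thread case.

```python
def decode_thread_identifier(thread_id):
    """Decode a 3-bit thread identifier such as "010".

    Returns (queues, output_destination_identifier).
    """
    if len(thread_id) != 3 or any(bit not in "01" for bit in thread_id):
        raise ValueError("thread identifier must be three binary digits")
    msb, second, lsb = thread_id
    if msb == "1":
        queues = ("C2", "C3")  # instruction is assigned to both threads
    elif second == "0":
        queues = ("C2",)       # thread 1 only
    else:
        queues = ("C3",)       # thread 2 only
    # The least significant bit is passed on as the output destination identifier.
    return queues, int(lsb)
```

For instance, `decode_thread_identifier("011")` selects only queue C3 and yields output destination identifier 1, matching the case of FIG. 9(b).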

図１１および図１２を参照して、出力先識別子判定部１０３１の動作結果について説明する。図１１および図１２は、出力先識別子判定部１０３１の制御による、実行部１０４の動作を説明するための図である。 The result of the operation of the output destination identifier determination unit 1031 will be described with reference to FIGS. 11 and 12. FIGS. 11 and 12 are diagrams for explaining the operation of the execution unit 104 under the control of the output destination identifier determination unit 1031.

出力先識別子判定部１０３１は、命令に付加された出力先識別子に基づいて、レジスタ・リネーミングと、リソースの割り当てとを行う。 The output destination identifier determination unit 1031 performs register renaming and resource allocation based on the output destination identifier added to the instruction.

具体的には、出力先識別子が「０」の場合、出力先識別子判定部１０３１は、ソースレジスタの属するスレッドと、デスティネーションレジスタの属するスレッドが、同じスレッドとなるように、レジスタ・リネーミング、および割り当て制御を行う。 Specifically, when the output destination identifier is “0”, the output destination identifier determination unit 1031 performs register renaming, so that the thread to which the source register belongs and the thread to which the destination register belongs are the same thread. And perform allocation control.

出力先識別子が「１」の場合、出力先識別子判定部１０３１は、ソースレジスタの属するスレッドと、デスティネーションレジスタの属するスレッドが、異なるスレッドとなるように、レジスタ・リネーミング、および割り当て制御を行う。 When the output destination identifier is “1”, the output destination identifier determination unit 1031 performs register renaming and allocation control so that the thread to which the source register belongs and the thread to which the destination register belongs are different threads. .

出力先識別子が「０」で、ソースレジスタがレジスタ群１０４１に属するレジスタであった場合、図１１（ａ）に示すように、演算器１０４３は、実行結果を、同一のレジスタ群１０４１に属するレジスタへ出力する。 When the output destination identifier is “0” and the source register belongs to the register group 1041, the arithmetic unit 1043 outputs the execution result to a register belonging to the same register group 1041, as shown in FIG. 11A.

出力先識別子が「０」で、ソースレジスタがレジスタ群１０４２に属するレジスタであった場合、図１１（ｂ）に示すように、演算器１０４３は、実行結果を、同一のレジスタ群１０４２に属するレジスタへ出力する。 When the output destination identifier is “0” and the source register belongs to the register group 1042, the arithmetic unit 1043 outputs the execution result to a register belonging to the same register group 1042, as shown in FIG. 11B.

出力先識別子が「１」で、ソースレジスタがレジスタ群１０４１に属するレジスタであった場合、図１２（ａ）に示すように、演算器１０４３は、実行結果を、異なるレジスタ群１０４２に属するレジスタへ出力する。 When the output destination identifier is “1” and the source register belongs to the register group 1041, the arithmetic unit 1043 outputs the execution result to a register belonging to the other register group 1042, as shown in FIG. 12A.

出力先識別子が「１」で、ソースレジスタがレジスタ群１０４２に属するレジスタであった場合、図１２（ｂ）に示すように、演算器１０４３は、実行結果を、異なるレジスタ群１０４１に属するレジスタへ出力する。 When the output destination identifier is “1” and the source register belongs to the register group 1042, the arithmetic unit 1043 outputs the execution result to a register belonging to the other register group 1041, as shown in FIG. 12B.
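The four cases of FIGS. 11 and 12 amount to a simple selection rule, sketched below in Python. The sketch assumes exactly two register groups, identified by their reference numerals 1041 and 1042, and a 1-bit output destination identifier; the function name `destination_group` is illustrative only.

```python
REGISTER_GROUPS = (1041, 1042)


def destination_group(source_group, output_destination_id):
    """Return the register group that receives the execution result."""
    if source_group not in REGISTER_GROUPS:
        raise ValueError("unknown register group")
    if output_destination_id not in (0, 1):
        raise ValueError("output destination identifier must be 0 or 1")
    if output_destination_id == 0:
        return source_group  # same thread as the source register
    # Identifier 1: the other thread's register group.
    if source_group == REGISTER_GROUPS[0]:
        return REGISTER_GROUPS[1]
    return REGISTER_GROUPS[0]
```

With more than two threads, a single bit could no longer name the target, which is consistent with the later remark that the identifier may be defined with two or more bits.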

図１１、図１２において、レジスタ群から演算器への矢印は、処理対象の入力データを示している。また、演算器からレジスタ群への矢印は、実行結果のデータを示している。演算器１０４３は、例えば、加算器である。この場合、レジスタ群から演算器への２本の矢印は、被加算値、加算値のデータであり、演算器からレジスタ群への矢印は、加算結果のデータである。 In FIGS. 11 and 12, the arrows from a register group to the arithmetic unit indicate the input data to be processed, and the arrow from the arithmetic unit to a register group indicates the execution result data. The arithmetic unit 1043 is, for example, an adder. In this case, the two arrows from the register group to the arithmetic unit are the augend and addend data, and the arrow from the arithmetic unit to the register group is the addition result data.

続いて、図１３〜図１５を参照して、本実施形態のソースコード、オブジェクトコードの一例について説明する。 Next, an example of the source code and the object code of this embodiment will be described with reference to FIGS. 13 to 15.

図１３は、プロセッサ１０で所定の処理を実行するためのソースコードの一例である。このソースコードには、ループ変数「Ｉ」の値を変えながら、同じ処理を繰り返し実行するためのＤＯループ文が、記述されている。このソースコードは、単一のスレッドで処理を実行することを前提として記載されている。 FIG. 13 is an example of source code for executing predetermined processing by the processor 10. In this source code, a DO loop statement for repeatedly executing the same process while changing the value of the loop variable “I” is described. This source code is described on the assumption that processing is executed by a single thread.

プロセッサ１０は、コンパイラ３０１の実行により、単一のスレッドに属する命令列を、デコード後に複数のスレッドで実行できるように、図１３のソースコードをオブジェクトコードの命令列に変換し、各命令にスレッド識別子を付加する。 By executing the compiler 301, the processor 10 converts the source code of FIG. 13 into an object-code instruction sequence such that the instruction sequence belonging to a single thread can be executed by a plurality of threads after decoding, and adds a thread identifier to each instruction.

図１４は、図１３のソースコードから変換され、命令フェッチキューＣ１に入力される命令列の一例である。図１４の命令列は、前述したように、複数スレッドに割り当てることを前提として生成されている。例えば、ポインタを設定するための命令として、スレッド１用の命令ＯＣ１およびＯＣ２のほか、スレッド２用の命令ＯＣ３およびＯＣ４が生成される。 FIG. 14 is an example of an instruction sequence converted from the source code of FIG. 13 and input to the instruction fetch queue C1. As described above, the instruction sequence in FIG. 14 is generated on the assumption that it is assigned to a plurality of threads. For example, as instructions for setting a pointer, instructions OC3 and OC4 for thread 2 are generated in addition to instructions OC1 and OC2 for thread 1.

また、命令列内の命令コード４０２のそれぞれには、命令が実行されるスレッドに対応するスレッド識別子４０５が付加されている。 Also, a thread identifier 405 corresponding to the thread in which the instruction is executed is added to each instruction code 402 in the instruction sequence.

図１４において、デスティネーションレジスタ指定フィールド４０３、およびソースレジスタ指定フィールド４０４は省略されている。また、命令コード４０２の欄には、命令コード自体でなく、命令コードの定義が記述されている。 In FIG. 14, the destination register designation field 403 and the source register designation field 404 are omitted. In the column of the instruction code 402, the definition of the instruction code is described instead of the instruction code itself.

命令ＯＣ１および命令ＯＣ２は、スレッド１の使用する変数のポインタを初期設定するための命令である。これらポインタは、スレッド１のみで使用するため、上述したスレッド識別子の定義に従って、これらの命令にはスレッド識別子「０００」が付加される。 The instruction OC1 and the instruction OC2 are instructions for initializing variable pointers used by the thread 1. Since these pointers are used only by the thread 1, the thread identifier “000” is added to these instructions in accordance with the definition of the thread identifier described above.

命令ＯＣ３およびＯＣ４は、スレッド２の使用する変数のポインタを初期設定するための命令である。これらのポインタは、スレッド２のみで使用するため、これらの命令にはスレッド識別子「０１０」が付加される。 Instructions OC3 and OC4 are instructions for initializing the pointers of the variables used by thread 2. Since these pointers are used only by thread 2, the thread identifier “010” is added to these instructions.

命令ＯＣ５は、ループ変数「Ｉ」の初期設定をするための命令である。この命令は、スレッド１のみで実行すればよいので、この命令にはスレッド識別子「０００」が付加される。 The instruction OC5 is an instruction for initial setting of the loop variable “I”. Since this instruction only needs to be executed by the thread 1, the thread identifier “000” is added to this instruction.

命令ＯＣ６およびＯＣ７は、変数Ａ、Ｂをロードするための命令、命令ＯＣ８は、加算を実行するための命令、そして、命令ＯＣ９は、計算結果Ａのストアを実行するための命令である。これらの命令は、ポインタを変えて、スレッド１、２のそれぞれで実行される。具体的には、ループ変数「Ｉ」が奇数の場合の処理は、スレッド１で実行され、ループ変数「Ｉ」が偶数の場合の処理は、スレッド２で実行される。両スレッドで実行する必要があるので、これらの命令にはスレッド識別子「１００」が付加される。 The instructions OC6 and OC7 are instructions for loading the variables A and B, the instruction OC8 is an instruction for executing addition, and the instruction OC9 is an instruction for executing a store of the calculation result A. These instructions are executed in each of the threads 1 and 2 while changing the pointer. Specifically, the process when the loop variable “I” is an odd number is executed by the thread 1, and the process when the loop variable “I” is an even number is executed by the thread 2. Since it is necessary to execute in both threads, a thread identifier “100” is added to these instructions.

命令ＯＣ１１およびＯＣ１２は、各スレッドが使用する変数Ａ，Ｂのポインタを更新するための命令である。この命令は、両スレッドで実行する必要があるので、これらの命令にはスレッド識別子「１００」が付加される。 Instructions OC11 and OC12 are instructions for updating the pointers of the variables A and B used by each thread. Since these instructions need to be executed by both threads, the thread identifier “100” is added to them.

命令ＯＣ１３は、分岐命令であり、スレッド１のみで実行されるので、スレッド識別子「０００」が付加される。 Since the instruction OC13 is a branch instruction and is executed only by the thread 1, the thread identifier “000” is added.

スレッド識別子判定部１０２１は、スレッド識別子に応じて、オペコードキューＣ２、Ｃ３のうち、どのオペコードキューに命令を追加するかを判定する。また、スレッド識別子判定部１０２１は、命令に付加されたスレッド識別子に応じて、命令に出力先識別子を付加する。 The thread identifier determination unit 1021 determines, according to the thread identifier, to which of the opcode queues C2 and C3 the instruction is to be added. Further, the thread identifier determination unit 1021 adds an output destination identifier to the instruction according to the thread identifier added to the instruction.

図１５（ａ）は、オペコードキューＣ２に追加された命令列の一例である。図１５（ｂ）は、オペコードキューＣ３に追加された命令列の一例である。同図（ａ）、（ｂ）に示すように、命令列内の命令コード４０２のそれぞれに、出力先識別子４０６が付加されている。図１５において、デスティネーションレジスタ指定フィールド４０３、およびソースレジスタ指定フィールド４０４は省略されている。また、命令コード４０２の欄には、命令コード自体でなく、命令コードの定義が記述されている。 FIG. 15A is an example of the instruction sequence added to the opcode queue C2. FIG. 15B is an example of the instruction sequence added to the opcode queue C3. As shown in parts (a) and (b) of the figure, an output destination identifier 406 is added to each instruction code 402 in the instruction sequence. In FIG. 15, the destination register designation field 403 and the source register designation field 404 are omitted. In the column of the instruction code 402, the definition of the instruction code is written rather than the instruction code itself.

図１５（ａ）に示すように、スレッド１に対応するオペコードキューＣ２には、スレッド識別子「０００」または「１００」が付加されていた命令ＯＣ１、ＯＣ２、およびＯＣ５〜ＯＣ１３が入力される。図１５（ｂ）に示すように、スレッド２に対応するオペコードキューＣ３には、スレッド識別子「０１０」が付加されていた命令ＯＣ３、ＯＣ４が入力され、スレッド識別子「１００」が付加されていたＯＣ６〜ＯＣ１２が、複製して入力される。また、全ての命令に、出力先識別子「０」が付加されている。 As shown in FIG. 15A, instructions OC1, OC2, and OC5 to OC13, to which the thread identifier “000” or “100” is added, are input to the opcode queue C2 corresponding to the thread 1. As shown in FIG. 15B, instructions OC3 and OC4 to which the thread identifier “010” has been added are input to the opcode queue C3 corresponding to the thread 2, and OC6 to which the thread identifier “100” has been added. -OC12 is duplicated and input. Further, the output destination identifier “0” is added to all the instructions.
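The distribution from the single fetched sequence of FIG. 14 into the two opcode queues of FIG. 15 can be sketched in Python as follows. The sketch lists only the instructions whose thread identifiers are stated in the text (OC10 is left out because its identifier is not given), and the names `FIG14_SEQUENCE` and `distribute` are hypothetical.

```python
# (instruction, thread identifier) pairs as stated in the description.
FIG14_SEQUENCE = [
    ("OC1", "000"), ("OC2", "000"),    # pointer setup for thread 1
    ("OC3", "010"), ("OC4", "010"),    # pointer setup for thread 2
    ("OC5", "000"),                    # loop variable initialization
    ("OC6", "100"), ("OC7", "100"),    # loads of A and B
    ("OC8", "100"), ("OC9", "100"),    # addition and store
    ("OC11", "100"), ("OC12", "100"),  # pointer updates
    ("OC13", "000"),                   # branch
]


def distribute(sequence):
    """Split one instruction sequence into opcode queues C2 and C3."""
    c2, c3 = [], []
    for name, thread_id in sequence:
        entry = (name, int(thread_id[2]))  # (instruction, output destination id)
        if thread_id[0] == "1":
            c2.append(entry)   # duplicated into both queues
            c3.append(entry)
        elif thread_id[1] == "0":
            c2.append(entry)   # thread 1 only
        else:
            c3.append(entry)   # thread 2 only
    return c2, c3


queue_c2, queue_c3 = distribute(FIG14_SEQUENCE)
```

Every entry ends up with output destination identifier 0, in line with the statement that all instructions in FIG. 15 carry the identifier “0”.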

図１５に示したように、プロセッサ１０は、単一のスレッドに属する命令列を分解または複製し、複数スレッドのそれぞれに割り当てる。そして、プロセッサ１０は、各命令に、スレッドに対応するレジスタを割り当てる。例えば、オペコードキューＣ２、およびＣ３には、ＯＣ８など、同じ命令が入力されている。しかし、これらの命令は、別々のスレッドに属する命令なので、別々のスレッドに属するレジスタが割り当てられる。従って、ＯＣ８等についても、別々の演算が実行される。 As shown in FIG. 15, the processor 10 decomposes or duplicates an instruction sequence belonging to a single thread and assigns it to each of a plurality of threads. The processor 10 then assigns to each instruction the registers corresponding to its thread. For example, the same instruction, such as OC8, is input to both of the opcode queues C2 and C3. However, since these copies belong to different threads, they are assigned registers belonging to different threads. Consequently, separate operations are executed even for OC8 and the like.

図７で説明したキューＣ４以降は、単一のスレッドに属する命令列と同様に、本実施形態の命令列が処理される。しかし、図１８で示した方式と比較すると、本実施形態では、ソースコードにおける単一のスレッドに、レジスタ群１０４１、１０４２などの多重化されたアーキテクチャ・ステートを割り当てることができる。このため、利用できるソフト見えレジスタ数が増加し、データ待ち時間の隠蔽によって、ソースコードにおける単一のスレッドの実行時間を短縮することができる。 From the queue C4 described with reference to FIG. 7 onward, the instruction sequence of this embodiment is processed in the same manner as an instruction sequence belonging to a single thread. However, compared with the scheme shown in FIG. 18, this embodiment can assign multiplexed architectural states, such as the register groups 1041 and 1042, to a single thread of the source code. This increases the number of software-visible registers available, and hiding data wait time shortens the execution time of a single thread of the source code.

なお、図３では、命令コード４０２からソースレジスタ指定フィールド４０４までと、スレッド識別子４０５とを１つの命令語として定義する例を示したが、命令コード４０２からソースレジスタ指定フィールド４０４までと、スレッド識別子４０５とを別々の命令語として定義してもよい。その場合、スレッド識別子４０５を含む命令語は、後続の命令語のスレッド指定を変更する特殊命令として解釈される。 FIG. 3 shows an example in which the instruction code 402 to the source register designation field 404 and the thread identifier 405 are defined as one instruction word, but the instruction code 402 to the source register designation field 404 and the thread identifier 405 may be defined as separate instruction words. In that case, the instruction word including the thread identifier 405 is interpreted as a special instruction for changing the thread designation of the subsequent instruction word.
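The alternative encoding, in which the thread identifier travels in its own instruction word, can be pictured with the following Python sketch. Everything about the representation is assumed for illustration: the stream is modeled as `(kind, value)` tuples, the kinds `"SET_THREAD"` and `"OP"` are invented labels, and the initial thread identifier `"000"` is an arbitrary default that the specification does not define.

```python
def attach_thread_identifiers(stream, initial="000"):
    """Pair each ordinary instruction with the thread identifier in effect.

    A ("SET_THREAD", id) word changes the thread designation of the
    instruction words that follow it; an ("OP", name) word is an ordinary
    instruction.
    """
    current = initial
    tagged = []
    for kind, value in stream:
        if kind == "SET_THREAD":
            current = value            # special instruction: update designation
        elif kind == "OP":
            tagged.append((value, current))
        else:
            raise ValueError(f"unknown instruction word kind: {kind}")
    return tagged


stream = [
    ("OP", "OC1"),
    ("SET_THREAD", "010"), ("OP", "OC3"), ("OP", "OC4"),
    ("SET_THREAD", "100"), ("OP", "OC6"),
]
tagged = attach_thread_identifiers(stream)
```

A design of this kind would trade a wider instruction word for extra instruction words in the stream, which pays off when long runs of instructions share the same thread designation.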

本実施形態では、出力先識別子は、１ビットで定義されるが、２ビット以上で定義してもよいのは勿論である。単一のスレッドを、３つ以上のスレッドで処理する場合、出力先のレジスタが属するスレッドを指定するために、２ビット以上の出力先識別子が定義される。 In this embodiment, the output destination identifier is defined by 1 bit, but may be defined by 2 bits or more. When a single thread is processed by three or more threads, an output destination identifier of 2 bits or more is defined to specify a thread to which an output destination register belongs.

本実施形態では、制御部１０３に、出力先識別子判定部１０３１を設ける構成としているが、出力先のレジスタに対応するスレッドを切り替える必要がなければ、出力先識別子判定部１０３１を設けない構成としてもよい。 In this embodiment, the control unit 103 is provided with the output destination identifier determination unit 1031. However, if there is no need to switch the thread corresponding to the output destination register, the output destination identifier determination unit 1031 may not be provided. Good.

図１３に例示したソースコードには、繰り返し処理が記述されているが、繰り返し処理以外の処理が記述されたソースコードをコンパイラ３０１が、複数スレッドで実行するために変換する構成であってもよいのは、勿論である。 In the source code illustrated in FIG. 13, iterative processing is described. However, the compiler 301 may convert the source code in which processing other than the iterative processing is described to be executed by a plurality of threads. Of course.

以上、説明したように、本実施形態によれば、同時マルチスレッディングプロセッサは、単一のスレッドに属する複数の命令をフェッチし、複数のスレッドを生成して、スレッド識別子の示すスレッドに命令を割り当てるので、デコード前に単一のスレッドに属する複数の命令を、デコード後に複数のスレッドで処理することができる。このため、単一のスレッドが、複数のスレッド分のリソースを使用することができ、単一のスレッドの実行時間が削減される。 As described above, according to this embodiment, the simultaneous multi-threading processor fetches a plurality of instructions belonging to a single thread, generates a plurality of threads, and assigns the instruction to the thread indicated by the thread identifier. A plurality of instructions belonging to a single thread before decoding can be processed by a plurality of threads after decoding. For this reason, a single thread can use resources for a plurality of threads, and the execution time of the single thread is reduced.

そして、本発明を適用する場合、スケジューラ、レジスタファイル、および、演算器などの実行リソースへの改造は不要であり、デコード部、および制御部以外は、通常の同時マルチスレッディングプロセッサの機構をそのまま流用できる。 When the present invention is applied, it is not necessary to modify the execution resources such as the scheduler, the register file, and the arithmetic unit, and the mechanism of the normal simultaneous multithreading processor can be used as it is except for the decoding unit and the control unit. .

また、同時マルチスレッディングプロセッサは、出力先識別子に基づいて、出力先のレジスタが属するスレッドを切り替えるので、多様な処理内容の命令列を処理できる。 Further, the simultaneous multithreading processor switches the thread to which the output destination register belongs based on the output destination identifier, so that it can process instruction strings having various processing contents.

繰り返し処理は、複数のスレッドに分割しやすいので、コンパイラ３０１が対象とする処理を、単一のスレッドに属する繰り返し処理とすれば、コンパイラの負荷が軽減される。 Since the iterative process is easily divided into a plurality of threads, if the process targeted by the compiler 301 is an iterative process belonging to a single thread, the load on the compiler is reduced.

（第２の実施形態） (Second Embodiment)
本発明の第２の実施形態について図１６を参照して説明する。本実施形態のプロセッサは、あるスレッドに属する命令の実行結果を、多重化された全てのスレッドの、それぞれに属するレジスタへ書き込む点で、第１の実施形態のプロセッサと異なる。 A second embodiment of the present invention will be described with reference to FIG. 16. The processor of this embodiment differs from the processor of the first embodiment in that the execution result of an instruction belonging to a certain thread is written to registers belonging to each of all the multiplexed threads.

図１６は、図４に示した出力先識別子判定部１０３１の制御による、実行部１０４の本実施形態における動作を説明するための図である。他の構成については、第１の実施形態と同様であるため、本実施形態では、その詳細な説明を省略する。 FIG. 16 is a diagram for explaining the operation of the execution unit 104 in the present embodiment under the control of the output destination identifier determination unit 1031 illustrated in FIG. 4. Since other configurations are the same as those in the first embodiment, detailed description thereof is omitted in the present embodiment.

図１６に示すように、演算器１０４３は、レジスタ群１０４１、および１０４２の両方に、命令の実行結果を出力する。 As shown in FIG. 16, the arithmetic unit 1043 outputs the instruction execution result to both the register groups 1041 and 1042.

Of course, instead of writing to the registers of all the threads, the arithmetic unit may be configured to output the execution result to the registers belonging to any two or more of three or more threads.
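The write-back behavior of this embodiment, including the variation in which only some of three or more threads receive the result, can be sketched as follows. The function `write_back_multi`, the representation of the output destination as a set of thread indices, and the register-group sizes are assumptions for illustration and are not taken from the embodiment.

```python
def write_back_multi(register_groups, dest_reg, value, targets):
    """Write one execution result into the same register number
    of every register group listed in `targets`."""
    for thread_index in targets:
        register_groups[thread_index][dest_reg] = value


# Three threads with four registers each (assumed sizes).
groups = [[0] * 4 for _ in range(3)]

# Output the result to the registers of threads 0 and 2, but not thread 1.
write_back_multi(groups, dest_reg=2, value=7, targets={0, 2})
```

Passing `targets={0, 1, 2}` would correspond to writing to the registers of all multiplexed threads, as in FIG. 16 for the two-thread case.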

As described above, compared with the configuration of the first embodiment, this embodiment can process an instruction sequence more efficiently when the execution result of an instruction of a single thread is used by instructions of a plurality of threads.

DESCRIPTION OF SYMBOLS
1 Information processing apparatus
10 Processor
20 Main memory
30 Storage unit
101 Instruction fetch unit
102 Decoding unit
103 Control unit
104 Execution unit
301 Compiler
401 Instruction code
402 Destination register designation field
403 Source register designation field
404 Thread identifier
405 Output destination identifier
801 Processor
802, 803 Architecture state
804 Execution resource
901 Instruction pointer
902 Cache memory
903 Instruction fetch queue
904 Decoding unit
905 Opcode queue
906 Control unit
907 Queue
908 Scheduler
909 Register read stage
910 Execution stage
911 Register write stage
1011 Instruction pointer
1012 Cache memory
1021 Thread identifier determination unit
1031 Output destination identifier determination unit
1041, 1042 Register group
1043 Arithmetic unit
3011 Conversion unit
3012 Thread identifier addition unit
C1 Instruction fetch queue
C2, C3 Opcode queue
C4 Queue
S1 Scheduler
S2 Register read stage
S3 Execution stage
S4 Register write stage

Claims

1. A simultaneous multi-threading processor comprising:
fetch means for fetching a plurality of instructions that belong to a single thread and to each of which a thread identifier is added, the thread identifier identifying, among a plurality of threads, the thread in which the instruction is to be executed, so that the instructions are executed by the plurality of threads after decoding;
decoding means for decoding the plurality of instructions fetched by the fetch means, generating the plurality of threads, and assigning each instruction to the thread indicated by the thread identifier added to that instruction; and
execution means for executing the plurality of instructions by operating the plurality of threads generated by the decoding means in parallel.

2. The simultaneous multi-threading processor according to claim 1, wherein the decoding means adds the instruction to the queue that corresponds, among a plurality of queues holding instructions to be executed by the threads, to the thread indicated by the thread identifier added to the instruction.

3. The simultaneous multithreading processor according to claim 1, wherein
the decoding means adds to the instruction, based on the thread identifier added to the instruction, an output destination identifier for identifying, among a plurality of architecture units, the architecture unit that is the output destination of the execution result of the instruction, and
the execution means operates the plurality of threads in parallel using execution resources shared by the plurality of architecture units, executes the instructions belonging to the respective threads, and outputs the execution result of each instruction to the architecture unit indicated by the output destination identifier added to that instruction.

4. The simultaneous multithreading processor according to claim 3, further comprising control means for designating, as the output destination of the instruction, the architecture unit indicated by the output destination identifier added to the instruction by the decoding means,
wherein the execution means outputs the execution result of the instruction to the architecture unit designated by the control means.

5. The simultaneous multithreading processor according to claim 3 or 4, wherein the output destination identifier is an identifier that designates all of the plurality of architecture states as output destinations.

6. A method for controlling a simultaneous multi-threading processor, the method comprising:
fetching a plurality of instructions that belong to a single thread and to each of which a thread identifier is added, the thread identifier identifying, among a plurality of threads, the thread in which the instruction is to be executed, so that the instructions are executed by the plurality of threads after decoding;
decoding the plurality of fetched instructions;
generating the plurality of threads;
assigning each instruction to the thread indicated by the thread identifier added to that instruction; and
executing the plurality of instructions by operating the plurality of generated threads in parallel.

7. A program for compiling, the program causing a computer to execute:
a conversion procedure for converting source code into a plurality of instructions belonging to a single thread so that the instructions are executed by a plurality of threads after decoding; and
an addition procedure for adding, to each of the instructions, a thread identifier for identifying, among the plurality of threads, the thread in which the instruction is to be executed.

8. The program according to claim 7, wherein the source code is source code in which iterative processing is described.

9. A compiling method comprising:
converting source code into a plurality of instructions belonging to a single thread so that the instructions are executed by a plurality of threads after decoding; and
adding, to each of the instructions, a thread identifier for identifying, among the plurality of threads, the thread in which the instruction is to be executed.

10. An information processing apparatus comprising:
a compiler that converts source code into a plurality of instructions belonging to a single thread so that the instructions are executed by a plurality of threads after decoding, and that adds, to each of the instructions, a thread identifier for identifying, among the plurality of threads, the thread in which the instruction is to be executed; and
a simultaneous multi-threading processor that fetches the plurality of instructions to which the thread identifiers have been added by the compiler, decodes the plurality of fetched instructions, generates the plurality of threads, assigns each instruction to the thread indicated by the thread identifier added to that instruction, and executes the plurality of instructions by operating the plurality of generated threads in parallel.