JP3776652B2

JP3776652B2 - Vector arithmetic unit

Info

Publication number: JP3776652B2
Application number: JP32874699A
Authority: JP
Inventors: 洋子磯部
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 1999-11-18
Filing date: 1999-11-18
Publication date: 2006-05-17
Anticipated expiration: 2019-11-18
Also published as: JP2001147914A

Description

【０００１】
【発明の属する技術分野】
本発明は、ベクトル演算装置に関し、特に、ベクトル総和演算を行う際に指数合わせによる誤差を小さくして高精度な演算処理を行いうるようにしたベクトル演算装置に関する。
【０００２】
【従来の技術】
ベクトル総和演算を行う従来のベクトル演算装置を図５に示す。図５に示すとおり、従来のベクトル演算装置は、ベクトルデータを格納するベクトルレジスタ１１１と、総和演算処理を行う加算器１２１と、演算結果を格納するベクトルレジスタ１１２とを具える。図５において、命令制御部（図示せず）から総和命令、すなわち命令語３００が発行されると、命令語３００中の読み出しレジスタを指定するＲＲフィールド３０２によって指定されたベクトルレジスタ１１１からベクトルデータが読み出され、加算器１２１に入力する。加算器１２１は、入力データの総和演算処理を行い、命令語３００中の書き込みレジスタを指定するＷＲフィールド３０３によって指定されたベクトルレジスタ１１２に最終結果を格納する。加算器１２１における総和演算処理方法は以下の通りである。
【０００３】
まず、ベクトルデータを格納しているベクトルレジスタ１１１から、１クロック毎に１つのベクトルデータが読み出され、加算器１２１に順次入力される。加算器１２１は、最終結果以外の加算結果が再び加算器１２１に入力され、ベクトルレジスタから読み出されたベクトルデータと加算されるように構成されている。
【０００４】
図６は、図５に示すベクトル演算装置の加算器１２１内の処理を示す模式図である。例えば加算器１２１内の処理時間を４クロックとすると、ベクトルレジスタ１１１から読み出された要素は、４クロック後に再び加算器２１に入力され加算される。
【０００５】
ここで、ベクトルデータをａ１、ａ２、ａ３、ａ４、ａ５、ａ７、ａ６、ａ７、ａ８．．．とする。データａ１は、１クロック目でベクトルレジスタ１１１から読み出されて、２クロック目で加算器１２１内の加算処理１に入力され、即値”０”との加算が行われる。加算器内処理２、３が順次行われ、５クロック目には加算器内処理４によってａ１＋０の結果が求められる。そして次クロック（６クロック目）で、この結果ａ１＋０は再度加算器内処理１に入力される。
【０００６】
一方で、５クロック目にはベクトルレジスタ１１１からデータａ５が読み出されており、６クロック目でこのデータａ５が加算器内処理１に入力され、再度入力されたａ１＋０と加算されて、ａ１＋ａ５の処理が行われる。データａ２、ａ３、ａ４についてもａ１の場合と同様に、加算器内処理で即値０との加算が行われ（ａ２＋０、ａ３＋０、ａ４＋０）、その結果が７、８、９クロック目に再度加算器内処理１に入力され、ａ２＋ａ６、ａ３＋ａ７、ａ４＋ａ８の結果が求められる。
【０００７】
ここで、ａ１＋ａ５、ａ２＋ａ６、ａ３＋ａ７、ａ４＋ａ８の加算結果をそれぞれＳ１、Ｓ２、Ｓ３、Ｓ４とすると、加算器内処理１には、１０クロック目にＳ１とａ９、１１クロック目にＳ２とａ１０、１２クロック目にＳ３とａ１１、１３クロック目にＳ４とａ１２が入力され、Ｓ１＋ａ９、Ｓ２＋ａ１０、Ｓ３＋ａ１１、Ｓ４＋ａ１２の結果が得られる。
【０００８】
最終結果の算出方法は様々であるが、ここでは図７に示す方法で求めるものとする。図７では、ベクトル処理要素数が８の時の最終結果の算出方法を示す。Ｓ１、Ｓ２、Ｓ３、Ｓ４は上述した方法により算出される。Ｓ３が加算器内処理１に入力するタイミングで、Ｓ１が加算器内処理１に入力されるように演算器内で制御され、Ｓ１＋Ｓ３＝Ｓ５が計算されるものとする。同様に、Ｓ４が加算器内処理１に入力されるタイミングで、Ｓ２が加算器内処理１に入力されるように演算器内で制御され、Ｓ２＋Ｓ４＝Ｓ６が計算されるものとする。また、Ｓ６が加算器内処理１に入力されるタイミングでＳ５が加算器内処理１に入力されるように演算器内で制御され、Ｓ５＋Ｓ６＝総和が計算されるものとする。
【０００９】
このように、最終的に１つの総和結果が得られ、その結果はベクトルレジスタ１２に格納される。一般的に、ベクトル演算装置における総和処理は上述のように加算器内でデータを回帰させることによって処理の高速化を図るようにしている。スカラ処理を行う場合は、通常ａ１から要素順に加算されるが、ベクトル演算装置で処理する場合は、要素順には加算されないので、スカラ処理時とは加算順序が異なる。ところが、加算処理は公知のように加算処理前に以下のような指数合わせを行う必要がある。
【００１０】
計算機上で扱われる浮動小数点データは通常１６進数で表され、指数部と仮数部で示すことができる。加算処理において、一般には指数合わせは指数の大きい方へ指数の小さい方を合わせる。加算する２つのデータの指数部の差分だけ指数の小さい方の仮数部を右にシフトするので、シフトアウトしたビットは切捨てられる。このように、加算処理においては指数合わせを行う必要があるので、加算する２つのデータの指数差が大きい場合には、シフトアウトされるビット数が多くなり、加算結果が誤差を含むことになる。
【００１１】
この誤差について、計算機上で扱われるデータを１０進数で表して以下に簡単に説明する。加算器の有効桁数は小数点以下５桁であると仮定し、ベクトル処理要素数が８の場合の具体的な動作について説明する。ａ１〜ａ８の各データを、ａ１＝１．０００００Ｅ＋００、ａ２＝２．０００００Ｅ＋００、ａ３＝３．０００００Ｅ＋００、ａ４＝４．０００００Ｅ＋００、ａ５＝５．０００００Ｅ＋００、ａ６＝１．０００００Ｅ＋００、ａ７＝２．０００００Ｅ＋００、ａ８＝１．０００００Ｅ＋０６とすると、途中加算結果であるＳ１〜Ｓ６と最終結果（総和）は以下のようになる。
S1=a1+a5=1.00000E+00+5.00000E+00=6.00000E+00
S2=a2+a6=2.00000E+00+1.00000E+00=3.00000E+00
S3=a3+a7=3.00000E+00+2.00000E+00=5.00000E+00
S4=a4+a8=4.00000E+00+1.00000E+06=0.00000E+06+1.00000E+06=1.00000E+06
S5=S1+S3=6.00000E+00+5.00000E+00=1.10000E+01
S6=S2+S4=3.00000E+00+1.00000E+06=0.00000E+06+1.00000E+06=1.00000E+06
総和=S5+S6=1.10000E+01+1.00000E+06=0.00001E+06+1.00000E+06=1.00001E+06
総和処理の最終結果は、1.00001E+06になる。
【００１２】
同様の計算をスカラで処理した場合は、ａ１からａ８まで要素順に加算するので、最終結果（総和）は以下のようになる。
a1+a2=1.00000E+00+2.00000E+00=3.00000E+00…S1
S1+a3=3.00000E+00+3.00000E+00=6.00000E+00…S2
S2+a4=6.00000E+00+4.00000E+00=1.00000E+01…S3
S3+a5=1.00000E+01+5.00000E+00=1.50000E+01…S4
S4+a6=1.50000E+01+1.00000E+00=1.60000E+01…S5
S5+a7=1.60000E+01+2.00000E+00=1.80000E+01…S6
S6+a8=1.80000E+01+1.00000E+06=1.00002E+06…総和
このように、入力データに依存するものの、最終総和結果は加算順序によってベクトル処理とスカラ処理とで変わる可能性がある。
【００１３】
【発明が解決しようとする課題】
しかしながら、スカラ処理時とベクトル処理時とで総和処理結果が異なる場合は、ほとんどの場合スカラ処理時の結果が正しいとされている。これは、プログラム上は要素順に加算するように書かれているので、スカラ処理時における指数合わせによって発生する誤差は避けられないものであるからである。このように総和処理結果が異なる場合は、プログラムソース中の総和処理が発生する箇所のみスカラ処理を行うことが可能である。しかし、この場合は、ベクトル演算で処理できる箇所をあえてスカラで処理するため、処理時間が遅くなるという問題がある。
【００１４】
また、スカラ処理時以上に高精度の演算結果が必要な場合は、処理対象要素を昇順に並べ換えた後、スカラで処理を行うことによって指数合わせなどによる誤差を小さくすることが可能になる。しかし、この場合は、処理要素を並べ換えるためのプログラムソースが必要となり、人手がかかる上に、処理時間も遅くなってしまう。
【００１５】
上述したように、ベクトル演算処理では、総和演算処理の高速化を図っているために総和処理時の加算順序がスカラ処理時と異なることになるが、従来のベクトル演算装置では、ベクトル総和処理時に加算順序を変更することができないため、処理するデータによってはスカラ処理による総和結果と比べて、ベクトル処理の総和結果の精度が劣るという問題がある。更に、入力データを並べ換える手段がないため、スカラ処理で行う以上の高精度の演算結果は期待できない。
【００１６】
本発明は上記課題を解決すべくなされたものであり、ベクトル総和処理時に入力データを昇順に並べ換え、更に要素順に加算を行うことによって、高精度の総和処理結果を得ることができるベクトル演算装置を提供することを目的とする。
【００１７】
【課題を解決するための手段】
上記課題を解決するため、本発明のベクトル演算装置は、少なくとも１つのベクトルレジスタと、当該ベクトルレジスタ内のベクトルデータを総和演算処理する加算器とを具えるベクトル処理装置において、前記ベクトルレジスタ内のベクトルデータを並べ換えるデータ並べ替え手段と、当該データ並べ換え手段で並べ換えたデータを前記加算器に順次入力して、前記データをキー値が小さい要素順に加算して総和演算処理を行うようにしたことを特徴とする。
【００１８】
このように構成することによって、小さいデータから順に加算処理を行うことができるので、指数合わせにより発生する誤差を抑えることが可能となる。このため、高精度な総和処理結果を得ることができる。
【００１９】
又、本発明のベクトル演算装置は、前記データ並べ換え手段が、前記ベクトルデータを並べ換えるデータソート手段と、当該データソート手段により並べ換えたデータを格納するデータ格納手段と、当該データ格納手段に格納したベクトルデータを所定のタイミングで読み出して前記加算器に送出するデータ読み出しタイミング指示手段を具えることを特徴とする。
【００２０】
更に本発明のベクトル演算装置は、前記加算器が最終加算結果以外の途中加算結果が再度当該加算器に入力されるように構成されており、前記データ読み出しタイミング指示手段が、先行要素の加算結果が前記加算器に再度入力されるのと同時に次要素を前記データ格納手段から読み出して前記加算器に入力するように指示することを特徴とする。
【００２１】
このように構成することにより、ベクトルデータを並べ換え、その後に所定のタイミング、特に加算器における動作タイミングに同期してデータを読み出して加算器に送出することができるので、高精度で高速な総和演算処理を実行することができる。
【００２２】
又、本発明のベクトル演算装置は、前記ベクトルレジスタ内のベクトルデータを直接前記加算器に送り込む第１のデータラインと、前記ベクトルデータを前記データ並べ換え手段を介して前記加算器に送り込む第２のデータラインとを具え、前記加算器へのベクトルデータの供給ラインを前記第１のデータラインと第２のデータラインとの間で切り換えるデータ供給切り換え手段を具えることを特徴とする。
【００２３】
このように、ベクトルデータを並べ替える必要のない時は、データをそのままの順で加算処理することが可能となるので、その場合に応じた適切な加算処理を行うことができる。
【００２４】
又、本発明のベクトル演算装置は、前記ベクトルデータの並べ換えが昇順で行われることを特徴とする。
【００２５】
このように、ベクトルデータを昇順に並べ換えることによって、小さいデータから順に加算処理を行うことができるので、指数合わせによって発生する誤差を小さくすることが可能となる。
【００２６】
又、本発明のベクトル演算装置は、コンパイラが作成した命令語内に、前記ベクトルデータの並べ換え処理を行うか否かの指示信号が設定されていることを特徴とする。
【００２７】
又、本発明のベクトル演算装置は、プロセス状態語内に前記ベクトルデータの並べ換え処理を行うか否かの指示信号が設定されていることを特徴とする。
【００２８】
このように構成することによって、プログラム実行者は任意の箇所の総和処理のみデータを昇順に並べ換えて行うことができるので、全体のプログラムの実行時間を極端に遅くすることなく高精度な結果を得ることができる
【００２９】
又、本発明のベクトル演算装置は、当該演算装置内で行われる総和演算のうちの一部を指定して、当該指定した部分について前記並べ替え手段を介してベクトル総和演算処理を行うようにしたことを特徴とする。
【００３０】
このように構成することによって、全体のプログラムの実行時間を極端に遅くすることなく高精度な総和演算処理を実行することができる。
【００３１】
【発明の実施の形態】
本発明のベクトル処理装置を添付の図面を参照して詳細に説明する。図１は、本発明の第１の実施形態の構成を示すブロック図である。本発明のベクトル処理装置は、ベクトルデータを格納するベクトルレジスタ１１と、総和演算を処理する加算器２１と、ベクトルデータの並べ替えを行うか否かの切り替えをする入力切り替え手段３０と、ベクトルデータを昇順に並べ換えるデータ並べ換え手段４０と、演算結果を保存するベクトルレジスタ１２とを具える。
【００３２】
入力切り替え手段３０は、データソート指示信号５０にしたがってベクトルレジスタ１１から読み出したベクトルデータを加算器２１に直接入力するか、あるいはデータ並べ換え手段４０を介して加算器２１に供給するかを切り替える。データ並べ換え手段４０は、データソート手段４１と、データ格納手段４２と、読み出しタイミング制御手段４３とを具える。データソート手段４１は、入力切り替え手段３０を介して送出されてきた複数要素のベクトルデータを昇順に並べ換え、その結果をデータ格納手段４２に格納する。データ格納手段４２内のベクトルデータは、読み出しタイミング指示手段４３によって読み出され、その後加算器２１に供給される。読み出しタイミング指示手段４３は、各ベクトルデータをデータ格納手段４２から読み出すタイミングを指示するが、このタイミングは加算器２１内の処理時間によって決定し、先行要素の加算結果が加算器２１へ入力されるのと同時に次要素をデータ格納手段４２から読み出して加算器２１へ入力されるように指示する。尚、加算器２１は、最終演算結果以外の途中結果は再度加算器２１に入力され、最終結果のみをベクトルレジスタ１２に格納するように構成されている。
【００３３】
図２は、本発明のベクトル処理装置における命令語を示す図である。命令語１００において、ＯＰフィールド１０１は命令を示すオペレーションコードを示し、ＲＲフィールド１０３はデータを読み出すベクトルレジスタを指定し、ＷＲフィールドは１０４は演算結果を書き込むベクトルレジスタを指定する。また、ソート指示ビット１０２は、入力データを昇順に並べ換えるべきか並べ換えずに通常の総和処理を行うかを指定するビットである。
【００３４】
データの並べ替えを行わずに通常通りの総和処理を行う場合はソース指示ビット１０２を”０”に、入力データを並べ換える場合は”１”に指定するようにする。このビットは、コンパイラが命令語を生成する際に”０”か”１”かを指定する。尚、プログラム中の一部にコンパイラ指示行が書かれている場合は、指定された部分のみの総和演算命令を対象にして、並べ替えを行うようにする。一方、コンパイルオプションなどで指定する場合は、プログラム中のすべての総和演算命令を対象にして、並べ替えを行うことができる。このソート指示ビット１０２が”０”の場合は、図１に示すデータソート指示信号５０は”０”となり、通常の総和演算処理が行われ、ソート指示ビット１０２が”１”の場合は、ソート指示信号は”１”となって、データの並べ替えを行って演算処理が実行される。
【００３５】
又、上述の実施形態では、ソート指示信号５０は命令語１００中のソート指示ビットによって決定しているが、例えば図３に示すようなプロセス状態語２００内にソート指示信号を設け、ＨＷの初期設定値として設定し前総和演算命令を対象とするようにしてもよい。
【００３６】
次に、本発明に係るベクトル処理装置の動作の流れについて図１及び図２を参照しながら説明する。図１において、命令制御部（図示せず）から総和演算命令が発行されると、命令語１００（図２参照）中のＲＲフィールド１０３によってベクトルレジスタ１１が指定され、このレジスタからベクトルデータが読み出される。次いで、データ切り替え手段３０は、データソート指示信号５０が”０”の場合は、ベクトルレジスタ１１から読み出したベクトルデータを、加算器２１に直接入力するようにデータレートの切り替えを行う。一方、データソート指示信号５０が”１”の場合は、データ並べ換え手段４０にルートを切り替えて、ベクトルデータをデータ並べ換え手段４０入力させる。
【００３７】
データソート信号５０が”１”の場合は、ベクトルレジスタ１１からのベクトルデータはデータ並べ換え手段４０に送出され、データソート手段４１にて、複数要素のデータを昇順に並べ換える。この結果はデータ格納手段４２に格納される。読み出しタイミング指示手段４３にタイミング信号に応じてデータ格納手段４２から読み出されて、加算器２１に入力される。読み出しタイミング手段４３がベクトルデータを読み出すタイミングは、加算器２１内の処理時間によって決定し、先行要素の加算結果が加算器２１に入力されるのと同時に次要素がデータ格納手段４２から読み出され、加算器２１へ入力される。加算器２１での最終結果以外の途中結果は再度加算器２１に入力され、最終結果のみ命令語１００中のＷＲフィールド１０４で指定されたベクトルレジスタ、この場合ベクトルレジスタ１２に送出され、格納される。
【００３８】
データソート信号５０が”０”の場合、すなわちベクトルレジスタ１１から読み出したベクトルデータを加算器２１に直接入力する場合は、従来の総和処理と同様であるので、ここでの説明は省略する。又、加算器２１における浮動小数点加算における加算方法は、公知であるのでここでの説明は省略する。
【００３９】
次に、図４を参照してソート指示信号５０が”１”の場合の本発明の加算器２１内の処理動作を詳細に説明する。ここで、加算器２１内の処理時間を４クロックと仮定し、データ並べ換え手段４０によって昇順に並べ換えられたデータを、Ａ１、Ａ２、Ａ３、Ａ４．．．とする。
【００４０】
図４に示すとおり、クロック１においてデータ格納手段４２から第１要素（Ａ１）が読み出され、次のクロック２で加算器内処理１へ入力される。この時、加算器２１ではＡ１を即値”０”と加算する。次いで加算器内処理２、３を経て、５クロック目には加算器内処理４においてＡ１＋０の結果が求められる。この結果は次のクロック（６クロック目）で再度加算器内処理１に入力される。Ａ１＋０の結果が６クロック目に加算器内処理１に入力されることは予め予測することが可能であるので、読み出しタイミング指示手段４３は、５クロック目にデータ格納手段４２からＡ２を読み出して、６クロック目に次要素（Ａ２）が加算器内処理１に入力するようにする。すなわち、読み出しタイミング指示手段４３は、４クロック毎にデータ格納手段４２からデータを読み出すようにする。
【００４１】
図４に示すように、６クロック目には、加算器内処理１においてＡ１＋Ａ２（＝Ｓ１）が求められる。以後同様に、９クロック目にデータ格納手段４２から第３要素（Ａ３）が読み出され、１０クロック目にはＳ１とＡ３が加算器内処理１に入力され、加算器内処理２、３を経て１３クロック目にはＳ１＋Ａ３（＝Ｓ２）が算出される。このようにして、データソート手段４１によって昇順に並べ換えられたベクトルデータをデータの小さい順に加算することによって総和演算処理を行い、最終的に求められた総和をベクトルレジスタ１２に格納する。
【００４２】
【発明の効果】
上述したように、本発明のベクトル処理装置によれば、総和処理を行うベクトルデータを昇順に並べ換えた後、小さいデータから順に加算処理を行うことができるので、指数合わせによって発生する誤差を抑えることができる。このため、高精度な演算結果を得ることができる。
【００４３】
又、プログラム実行者が任意の箇所の総和演算処理のみについてデータを並べ換えて演算処理を行うようにすることができるので、全体のプログラムの実行時間を極端に遅くすることなく高精度な結果を得ることができる。
【図面の簡単な説明】
【図１】図１は、本発明のベクトル演算装置の構成を示すブロック図である。
【図２】図２は、本発明のベクトル演算装置における命令語を示す図である。
【図３】図３は、本発明のベクトル演算装置におけるプロセス状態語を示す図である。
【図４】図４は、本発明のベクトル演算装置の加算器内の処理動作を示す図である。
【図５】図５は、従来のベクトル演算装置の構成を示すブロック図である。
【図６】図６は、従来のベクトル演算装置の加算器内の処理動作を示す図である。
【図７】図７は、従来のベクトル演算装置の加算器内のもう一つの処理動作を示す図である。
【符号の説明】
１１、１２、１１１、１１２ベクトルレジスタ
２１、１２１加算器
３０入力切り替え手段
４０データ並べ換え手段
４１データソート手段
４２データ格納手段
４３読み出しタイミング指示手段
５０データソート信号
１００命令語
１０１、３０１ＯＰフィールド
１０２ソート指示ビット
１０３、３０３ＲＲフィールド
１０４、３０４ＷＲフィールド
２００プロセス状態語
２０１ソート指示ビット
３００命令語

[0001]
[Technical field of the invention]
The present invention relates to a vector arithmetic unit, and more particularly to a vector arithmetic unit that can perform high-precision arithmetic by reducing the error caused by exponent matching during vector summation.
[0002]
[Prior art]
FIG. 5 shows a conventional vector arithmetic unit that performs vector summation. As shown in FIG. 5, the conventional vector arithmetic unit includes a vector register 111 that stores vector data, an adder 121 that performs the summation, and a vector register 112 that stores the result. In FIG. 5, when a summation instruction, that is, an instruction word 300, is issued from an instruction control unit (not shown), vector data is read from the vector register 111 designated by the RR field 302, which specifies the read register in the instruction word 300, and is input to the adder 121. The adder 121 sums the input data and stores the final result in the vector register 112 designated by the WR field 303, which specifies the write register in the instruction word 300. The summation method in the adder 121 is as follows.
[0003]
First, one vector data element is read per clock from the vector register 111, which stores the vector data, and the elements are input sequentially to the adder 121. The adder 121 is configured so that every addition result other than the final result is fed back into the adder 121 and added to vector data read from the vector register.
[0004]
FIG. 6 is a schematic diagram showing the processing in the adder 121 of the vector arithmetic unit shown in FIG. 5. For example, if the processing time in the adder 121 is 4 clocks, an element read from the vector register 111 is fed back into the adder 121 four clocks later and added.
[0005]
Here, let the vector data be a1, a2, a3, a4, a5, a6, a7, a8, and so on. The data a1 is read from the vector register 111 at the first clock, is input to in-adder process 1 in the adder 121 at the second clock, and is added to the immediate value "0". In-adder processes 2 and 3 follow in sequence, and at the fifth clock in-adder process 4 yields the result of a1 + 0. At the next clock (the sixth clock), this result a1 + 0 is input to in-adder process 1 again.
[0006]
Meanwhile, the data a5 is read from the vector register 111 at the fifth clock and input to in-adder process 1 at the sixth clock, where it is added to the re-input a1 + 0, giving a1 + a5. As with a1, the data a2, a3, and a4 are first added to the immediate value 0 (a2 + 0, a3 + 0, a4 + 0); those results are input to in-adder process 1 again at the seventh, eighth, and ninth clocks, yielding a2 + a6, a3 + a7, and a4 + a8.
[0007]
Here, let the results of a1 + a5, a2 + a6, a3 + a7, and a4 + a8 be S1, S2, S3, and S4, respectively. In-adder process 1 then receives S1 and a9 at the 10th clock, S2 and a10 at the 11th clock, S3 and a11 at the 12th clock, and S4 and a12 at the 13th clock, yielding S1 + a9, S2 + a10, S3 + a11, and S4 + a12.
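The clock-by-clock feed described in paragraphs [0005] to [0007] can be sketched as follows. This is only an illustrative model, assuming the 4-clock adder of the example; the function name `conventional_schedule` and the notion of a numbered "lane" for each of the partial sums S1 to S4 are hypothetical and do not appear in the specification.

```python
ADDER_LATENCY = 4  # clocks spent inside the adder (assumption taken from the example)

def conventional_schedule(n_elements, latency=ADDER_LATENCY):
    """Return (enter_clock, lane, k) for each element a_k.

    a_k is read at clock k and enters in-adder process 1 at clock k + 1,
    where it meets the fed-back partial sum of lane (k - 1) % latency
    (or the immediate value 0 for the first `latency` elements).
    """
    schedule = []
    for k in range(1, n_elements + 1):
        read_clock = k               # one element is read per clock
        enter_clock = read_clock + 1
        lane = (k - 1) % latency     # lanes 0..3 correspond to S1..S4
        schedule.append((enter_clock, lane, k))
    return schedule

for clock, lane, k in conventional_schedule(12):
    partner = "0" if k <= ADDER_LATENCY else f"partial sum of lane {lane}"
    print(f"clock {clock:2d}: a{k} + {partner}")
```

In this model a5 meets the result of a1 + 0 at clock 6 and a9 meets S1 at clock 10, matching the description above.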
[0008]
There are various ways to compute the final result; here the method shown in FIG. 7 is used. FIG. 7 shows how the final result is computed when the number of vector elements is eight. S1, S2, S3, and S4 are calculated as described above. The arithmetic unit is controlled so that S1 is input to in-adder process 1 at the same time as S3, and S1 + S3 = S5 is calculated. Similarly, it is controlled so that S2 is input to in-adder process 1 at the same time as S4, and S2 + S4 = S6 is calculated. Finally, it is controlled so that S5 is input to in-adder process 1 at the same time as S6, and S5 + S6 = sum is calculated.
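The resulting order of additions (four interleaved partial sums, then S1 + S3, S2 + S4, and S5 + S6) can be sketched as below. This is a minimal sketch rather than the device's actual control logic; the names `lane_partials` and `pipelined_sum` are hypothetical, and the pairwise combination assumes the number of lanes is a power of two.

```python
def lane_partials(values, lanes=4):
    """Accumulate a1+a5+..., a2+a6+..., and so on, one running sum per lane."""
    partial = [0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] = partial[i % lanes] + v
    return partial

def pipelined_sum(values, lanes=4):
    """Combine the lane sums pairwise: (S1+S3), (S2+S4), then S5+S6 for 4 lanes."""
    partial = lane_partials(values, lanes)
    while len(partial) > 1:
        half = len(partial) // 2
        # Pair lane j with lane j + half, as in S1+S3 and S2+S4.
        partial = [partial[j] + partial[j + half] for j in range(half)]
    return partial[0]

print(lane_partials(range(1, 9)))   # S1..S4 for a1..a8 = 1..8
print(pipelined_sum(range(1, 9)))
```

With exact integers the grouping does not affect the total; it matters only once each addition rounds, as discussed next.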
[0009]
In this way, a single sum is finally obtained and stored in the vector register 112. In general, summation in a vector arithmetic unit is accelerated by feeding data back within the adder as described above. In scalar processing the elements are normally added in element order starting from a1, but in a vector arithmetic unit they are not, so the order of addition differs from that of scalar processing. As is well known, however, addition requires the following exponent matching before the operands are added.
[0010]
Floating-point data handled on a computer is usually represented in hexadecimal and consists of an exponent part and a mantissa part. In addition, exponent matching generally adjusts the operand with the smaller exponent to the larger exponent. The mantissa of the operand with the smaller exponent is shifted right by the difference between the two exponents, and the bits shifted out are discarded. Because addition requires this exponent matching, a large exponent difference between the two operands increases the number of bits shifted out, and the addition result contains an error.
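A minimal sketch of exponent matching with truncation is given below. It assumes a decimal format with one digit before and five digits after the decimal point (the format used in the worked example that follows) instead of the hexadecimal format mentioned above, and it handles non-negative operands only. The `(mantissa, exponent)` tuple representation and the name `add_aligned` are illustrative assumptions.

```python
DIGITS = 6  # one digit before the decimal point plus five after it

def add_aligned(a, b):
    """Add two non-negative numbers given as (mantissa, exponent),
    where value = mantissa * 10 ** (exponent - (DIGITS - 1))."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                       # make `a` the operand with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb //= 10 ** (ea - eb)            # shift right; shifted-out digits are discarded
    m, e = ma + mb, ea
    if m >= 10 ** DIGITS:             # carry out: renormalise, dropping one more digit
        m //= 10
        e += 1
    return m, e

# 4.00000E+00 + 1.00000E+06: the smaller operand is shifted out entirely.
print(add_aligned((400000, 0), (100000, 6)))
# 1.10000E+01 + 1.00000E+06: only the leading digit of 11 survives.
print(add_aligned((110000, 1), (100000, 6)))
```

The larger the exponent difference, the more digits of the smaller operand are lost in the `//=` shift.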
[0011]
This error is briefly illustrated below using decimal notation for the data handled on a computer. Assume that the adder keeps five digits after the decimal point, and consider the case of eight vector elements. Let the data a1 to a8 be a1 = 1.00000E+00, a2 = 2.00000E+00, a3 = 3.00000E+00, a4 = 4.00000E+00, a5 = 5.00000E+00, a6 = 1.00000E+00, a7 = 2.00000E+00, and a8 = 1.00000E+06. Then the intermediate results S1 to S6 and the final result (sum) are as follows.
S1 = a1 + a5 = 1.00000E+00 + 5.00000E+00 = 6.00000E+00
S2 = a2 + a6 = 2.00000E+00 + 1.00000E+00 = 3.00000E+00
S3 = a3 + a7 = 3.00000E+00 + 2.00000E+00 = 5.00000E+00
S4 = a4 + a8 = 4.00000E+00 + 1.00000E+06 = 0.00000E+06 + 1.00000E+06 = 1.00000E+06
S5 = S1 + S3 = 6.00000E+00 + 5.00000E+00 = 1.10000E+01
S6 = S2 + S4 = 3.00000E+00 + 1.00000E+06 = 0.00000E+06 + 1.00000E+06 = 1.00000E+06
Sum = S5 + S6 = 1.10000E+01 + 1.00000E+06 = 0.00001E+06 + 1.00000E+06 = 1.00001E+06
The final result of the summation is 1.00001E+06.
[0012]
When the same calculation is performed with scalar processing, a1 to a8 are added in element order, so the final result (sum) is as follows.
a1 + a2 = 1.00000E+00 + 2.00000E+00 = 3.00000E+00 ... S1
S1 + a3 = 3.00000E+00 + 3.00000E+00 = 6.00000E+00 ... S2
S2 + a4 = 6.00000E+00 + 4.00000E+00 = 1.00000E+01 ... S3
S3 + a5 = 1.00000E+01 + 5.00000E+00 = 1.50000E+01 ... S4
S4 + a6 = 1.50000E+01 + 1.00000E+00 = 1.60000E+01 ... S5
S5 + a7 = 1.60000E+01 + 2.00000E+00 = 1.80000E+01 ... S6
S6 + a8 = 1.80000E+01 + 1.00000E+06 = 1.00002E+06 ... Sum
Thus, although it depends on the input data, the final sum may differ between vector processing and scalar processing because of the order of addition.
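The two orderings can be compared with Python's `decimal` module limited to six significant digits. This is a sketch under stated assumptions: `decimal` rounds the exact sum to the context precision rather than shifting an operand first, and the helper names are hypothetical. Round-half-up is assumed because that mode reproduces both figures above (1.00001E+06 for the vector order and 1.00002E+06 for element order); with strict truncation (`ROUND_DOWN`) this particular data set gives 1.00001E+06 in element order as well.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

def sum_in_order(values, ctx):
    """Scalar-style summation: add the elements left to right."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def vector_order_sum(values, ctx, lanes=4):
    """Vector-style summation: four interleaved partial sums, then S1+S3, S2+S4, S5+S6."""
    partial = [Decimal(0)] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] = ctx.add(partial[i % lanes], v)
    s5 = ctx.add(partial[0], partial[2])
    s6 = ctx.add(partial[1], partial[3])
    return ctx.add(s5, s6)

data = [Decimal(x) for x in (1, 2, 3, 4, 5, 1, 2, 1_000_000)]
rounding_ctx = Context(prec=6, rounding=ROUND_HALF_UP)   # assumed rounding mode
truncating_ctx = Context(prec=6, rounding=ROUND_DOWN)    # strict discard of extra digits

print("vector order :", vector_order_sum(data, rounding_ctx))
print("element order:", sum_in_order(data, rounding_ctx))
print("element order, truncating:", sum_in_order(data, truncating_ctx))
```

The exact sum of this data is 1000018, so in this example the element-order result with rounding is the closer of the two.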
[0013]
[Problems to be solved by the invention]
However, when the summation results of scalar processing and vector processing differ, the scalar result is regarded as correct in most cases. This is because the program is written to add in element order, so the error caused by exponent matching during scalar processing is considered unavoidable. When the results differ in this way, it is possible to use scalar processing only at the places in the program source where summation occurs. In that case, however, a portion that could be handled by vector operations is deliberately processed in scalar mode, so processing becomes slower.
[0014]
Furthermore, when a result more accurate than that of scalar processing is required, errors due to exponent matching and the like can be reduced by sorting the elements into ascending order and then summing them in scalar mode. In that case, however, program source for sorting the elements is required, which takes manual effort and also slows processing.
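The benefit of sorting into ascending order before summing can be seen with ordinary double-precision floats. The data set below is an assumption chosen for illustration: ten values of 1.0 that are each too small to change 1e16 on their own. An explicit loop is used because the built-in `sum` in some Python versions applies compensated summation to floats, which would hide the effect.

```python
def sum_in_order(values):
    """Left-to-right accumulation with an explicit loop (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total

data = [1e16] + [1.0] * 10

naive = sum_in_order(data)               # each 1.0 is lost against 1e16
ascending = sum_in_order(sorted(data))   # the ten 1.0s are combined first, then added to 1e16

print(naive)
print(ascending)
print(ascending - naive)
```

Summing the small values first lets them accumulate into a value large enough to survive exponent matching against the large operand.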
[0015]
As described above, vector processing speeds up summation, so the addition order during summation differs from that of scalar processing. A conventional vector arithmetic unit cannot change the addition order during vector summation, so depending on the data, the vector summation result is less accurate than the scalar summation result. Moreover, since there is no means for rearranging the input data, a result more accurate than that of scalar processing cannot be expected.
[0016]
The present invention has been made to solve the above problems, and its object is to provide a vector arithmetic unit that obtains a high-precision summation result by rearranging the input data in ascending order during vector summation and then adding in element order.
[0017]
[Means for Solving the Problems]
To solve the above problems, the vector arithmetic unit of the present invention is a vector processing device comprising at least one vector register and an adder that sums the vector data in the vector register, characterized by comprising data rearranging means for rearranging the vector data in the vector register, wherein the data rearranged by the data rearranging means is input sequentially to the adder and summed in order of elements, starting from the smallest key value.
[0018]
With this configuration, addition proceeds in order from the smallest data, so errors caused by exponent matching can be suppressed. A high-precision summation result can therefore be obtained.
[0019]
Further, the vector arithmetic unit of the present invention is characterized in that the data rearranging means comprises data sorting means for rearranging the vector data, data storage means for storing the data rearranged by the data sorting means, and data read timing instruction means for reading the vector data stored in the data storage means at a predetermined timing and sending it to the adder.
[0020]
Further, the vector arithmetic unit of the present invention is characterized in that the adder is configured so that intermediate addition results other than the final addition result are input to the adder again, and the data read timing instruction means instructs that the next element be read from the data storage means and input to the adder at the same time as the addition result of the preceding element is input to the adder again.
[0021]
With this configuration, the vector data can be rearranged and then read out and sent to the adder at a predetermined timing, in particular in synchronization with the operation timing of the adder, so that high-precision, high-speed summation can be executed.
[0022]
Further, the vector arithmetic unit of the present invention is characterized by comprising a first data line that sends the vector data in the vector register directly to the adder, a second data line that sends the vector data to the adder via the data rearranging means, and data supply switching means for switching the supply line of vector data to the adder between the first data line and the second data line.
[0023]
Thus, when the vector data need not be rearranged, the data can be added in its original order, so the addition appropriate to each case can be performed.
[0024]
Further, the vector arithmetic unit of the present invention is characterized in that the vector data is rearranged in ascending order.
[0025]
By rearranging the vector data in ascending order in this way, addition proceeds in order from the smallest data, so the error caused by exponent matching can be reduced.
[0026]
Further, the vector arithmetic unit of the present invention is characterized in that an instruction signal indicating whether or not to rearrange the vector data is set in an instruction word generated by a compiler.
[0027]
Further, the vector arithmetic unit of the present invention is characterized in that an instruction signal indicating whether or not to rearrange the vector data is set in a process state word.
[0028]
With this configuration, the person running the program can have the data sorted in ascending order only for summations at arbitrary locations, so a high-precision result can be obtained without greatly increasing the execution time of the whole program.
[0029]
Further, the vector arithmetic unit of the present invention is characterized in that a part of the summations performed in the unit is designated, and the vector summation for the designated part is performed via the rearranging means.
[0030]
With this configuration, high-precision summation can be executed without greatly increasing the execution time of the whole program.
[0031]
[Embodiments of the invention]
The vector processing device of the present invention will now be described in detail with reference to the accompanying drawings. FIG. 1 is a block diagram showing the configuration of a first embodiment of the present invention. The vector processing device of the present invention includes a vector register 11 that stores vector data, an adder 21 that performs the summation, input switching means 30 that switches whether or not the vector data is rearranged, data rearranging means 40 that rearranges the vector data in ascending order, and a vector register 12 that stores the result.
[0032]
The input switching means 30 switches, according to the data sort instruction signal 50, between inputting the vector data read from the vector register 11 directly to the adder 21 and supplying it to the adder 21 via the data rearranging means 40. The data rearranging means 40 includes data sorting means 41, data storage means 42, and read timing instruction means 43. The data sorting means 41 rearranges the multi-element vector data sent via the input switching means 30 in ascending order and stores the result in the data storage means 42. The vector data in the data storage means 42 is read out under the direction of the read timing instruction means 43 and then supplied to the adder 21. The read timing instruction means 43 indicates when each vector data element is read from the data storage means 42. This timing is determined by the processing time in the adder 21, such that the next element is read from the data storage means 42 and input to the adder 21 at the same time as the addition result of the preceding element is input to the adder 21. The adder 21 is configured so that intermediate results other than the final result are input to the adder 21 again and only the final result is stored in the vector register 12.
[0033]
FIG. 2 shows the instruction word in the vector processing device of the present invention. In the instruction word 100, the OP field 101 holds the operation code identifying the instruction, the RR field 103 specifies the vector register from which data is read, and the WR field 104 specifies the vector register to which the result is written. The sort instruction bit 102 specifies whether the input data is to be rearranged in ascending order or summed normally without rearrangement.
[0034]
To perform normal summation without rearranging the data, the sort instruction bit 102 is set to "0"; to rearrange the input data, it is set to "1". The compiler sets this bit to "0" or "1" when it generates the instruction word. When a compiler directive line is written in part of the program, rearrangement is applied only to the summation instructions in the designated part. When it is specified by a compile option or the like, rearrangement can be applied to all summation instructions in the program. When the sort instruction bit 102 is "0", the data sort instruction signal 50 shown in FIG. 1 is "0" and normal summation is performed; when the sort instruction bit 102 is "1", the data sort instruction signal 50 is "1" and the summation is executed with the data rearranged.
[0035]
In the embodiment described above, the sort instruction signal 50 is determined by the sort instruction bit in the instruction word 100. Alternatively, a sort instruction signal may be provided in a process state word 200 as shown in FIG. 3, for example, and set as an initial hardware setting so that it applies to all summation instructions.
[0036]
Next, the operation flow of the vector processing device according to the present invention will be described with reference to FIGS. 1 and 2. In FIG. 1, when a summation instruction is issued from the instruction control unit (not shown), the RR field 103 in the instruction word 100 (see FIG. 2) designates the vector register 11, and vector data is read from that register. When the data sort instruction signal 50 is "0", the input switching means 30 switches the data route so that the vector data read from the vector register 11 is input directly to the adder 21. When the data sort instruction signal 50 is "1", it switches the route to the data rearranging means 40 so that the vector data is input to the data rearranging means 40.
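The routing decision made by the input switching means 30 can be sketched as follows. The specification does not give field widths or an encoding, so the `InstructionWord` dataclass, its field types, and the register dictionary are purely hypothetical stand-ins used to illustrate how the sort instruction bit selects the data route.

```python
from dataclasses import dataclass

@dataclass
class InstructionWord:
    op: str        # OP field 101 (operation code); the value "VSUM" below is made up
    sort_bit: int  # sort instruction bit 102: 0 = normal summation, 1 = rearrange first
    rr: int        # RR field 103: vector register to read
    wr: int        # WR field 104: vector register to write

def feed_order(instr, registers):
    """Return the elements in the order they are handed to the adder side."""
    data = list(registers[instr.rr])
    if instr.sort_bit == 1:
        # Route via the data rearranging means 40: ascending order.
        return sorted(data)
    # Direct route: elements are passed on unchanged.
    return data

registers = {11: [3.0, 1.0, 2.0], 12: []}
print(feed_order(InstructionWord("VSUM", 0, 11, 12), registers))
print(feed_order(InstructionWord("VSUM", 1, 11, 12), registers))
```

Only the input order is modelled here; the adder's own feedback behaviour on each route is described in the surrounding paragraphs.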
[0037]
When the data sort instruction signal 50 is "1", the vector data from the vector register 11 is sent to the data rearranging means 40, where the data sorting means 41 rearranges the multi-element data in ascending order. The result is stored in the data storage means 42. The data is then read from the data storage means 42 in accordance with the timing signal from the read timing instruction means 43 and input to the adder 21. The timing at which the read timing instruction means 43 has the vector data read is determined by the processing time in the adder 21: the next element is read from the data storage means 42 and input to the adder 21 at the same time as the addition result of the preceding element is input to the adder 21. Intermediate results other than the final result are input to the adder 21 again, and only the final result is sent to and stored in the vector register designated by the WR field 104 in the instruction word 100, in this case the vector register 12.
[0038]
When the data sort instruction signal 50 is "0", that is, when the vector data read from the vector register 11 is input directly to the adder 21, the operation is the same as conventional summation and its description is omitted here. The floating-point addition method used in the adder 21 is also well known and is not described here.
[0039]
Next, the processing in the adder 21 of the present invention when the sort instruction signal 50 is "1" will be described in detail with reference to FIG. 4. Here, the processing time in the adder 21 is assumed to be 4 clocks, and the data rearranged in ascending order by the data rearranging means 40 is denoted A1, A2, A3, A4, and so on.
[0040]
As shown in FIG. 4, the first element (A1) is read from the data storage means 42 at clock 1 and input to in-adder process 1 at the next clock, clock 2. At this time, the adder 21 adds A1 to the immediate value "0". After in-adder processes 2 and 3, in-adder process 4 yields the result of A1 + 0 at the fifth clock. This result is input to in-adder process 1 again at the next clock (the sixth clock). Since it can be known in advance that the result of A1 + 0 will be input to in-adder process 1 at the sixth clock, the read timing instruction means 43 has A2 read from the data storage means 42 at the fifth clock so that the next element (A2) is input to in-adder process 1 at the sixth clock. That is, the read timing instruction means 43 has data read from the data storage means 42 every four clocks.
[0041]
As shown in FIG. 4, the addition A1 + A2 (= S1) enters in-adder process 1 at the sixth clock. Likewise, the third element (A3) is read from the data storage means 42 at the ninth clock, S1 and A3 are input to in-adder process 1 at the tenth clock, and after in-adder processes 2 and 3, S1 + A3 (= S2) is obtained at the 13th clock. In this way, the vector data rearranged in ascending order by the data sorting means 41 is summed starting from the smallest data, and the final sum is stored in the vector register 12.
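The read timing in sorted mode can be tabulated with a small model. It assumes the 4-clock adder of the example and that element A1 is read at clock 1; the function name `sorted_mode_schedule` and the tuple layout are illustrative assumptions.

```python
def sorted_mode_schedule(n_elements, latency=4):
    """For each element A_k (k = 1, 2, ...) return (read, enter, done) clocks.

    read  : clock at which A_k is read from the data storage means
    enter : clock at which A_k and the fed-back partial sum enter in-adder process 1
    done  : clock at which in-adder process 4 yields the new partial sum
    """
    schedule = []
    for k in range(1, n_elements + 1):
        read = 1 + latency * (k - 1)   # one read every `latency` clocks
        enter = read + 1
        done = enter + latency - 1
        schedule.append((read, enter, done))
    return schedule

for k, (read, enter, done) in enumerate(sorted_mode_schedule(3), start=1):
    print(f"A{k}: read at clock {read}, enters process 1 at {enter}, result at {done}")
```

Under these assumptions the partial sum that includes the n-th element leaves in-adder process 4 at clock 4n + 1, whereas the conventional route reads one element per clock. This is the speed cost accepted in exchange for adding strictly in ascending order.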
[0042]
[Effects of the invention]
As described above, according to the vector processing device of the present invention, the vector data to be summed is rearranged in ascending order and then added starting from the smallest data, so errors caused by exponent matching can be suppressed. A high-precision result can therefore be obtained.
[0043]
In addition, since the person running the program can have the data rearranged only for summations at arbitrary locations, a high-precision result can be obtained without greatly increasing the execution time of the whole program.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of the vector arithmetic unit of the present invention.
FIG. 2 is a diagram showing the instruction word in the vector arithmetic unit of the present invention.
FIG. 3 is a diagram showing the process state word in the vector arithmetic unit of the present invention.
FIG. 4 is a diagram showing the processing in the adder of the vector arithmetic unit of the present invention.
FIG. 5 is a block diagram showing the configuration of a conventional vector arithmetic unit.
FIG. 6 is a diagram showing the processing in the adder of the conventional vector arithmetic unit.
FIG. 7 is a diagram showing further processing in the adder of the conventional vector arithmetic unit.
[Explanation of symbols]
11, 12, 111, 112 Vector register
21, 121 Adder
30 Input switching means
40 Data rearranging means
41 Data sorting means
42 Data storage means
43 Read timing instruction means
50 Data sort signal
100 Instruction word
101, 301 OP field
102 Sort instruction bit
103, 303 RR field
104, 304 WR field
200 Process status word
201 Sort instruction bit
300 Instruction word

Claims

1. A vector arithmetic unit that includes at least one vector register and an adder that changes the addition order of vector data by feeding back an addition output, and that performs summation processing on the vector data in the vector register using the adder, the vector arithmetic unit comprising:
data sorting means for rearranging the vector data in the vector register;
data storage means for storing the data rearranged by the data sorting means; and
data read timing instruction means for reading the next element of the rearranged vector data stored in the data storage means at a timing that matches the operation timing of the adder, sending it to the adder, and having it added to the fed-back addition output.

2. The vector arithmetic unit according to claim 1, wherein the adder is configured so that an intermediate addition result other than the final addition result is input to the adder again, and the data read timing instruction means reads the next element from the data storage means and inputs it to the adder at the same time that the addition result for the preceding element is input to the adder again.

3. A vector arithmetic unit that includes at least one vector register and an adder that performs a sum operation on the vector data in the vector register using floating-point arithmetic, the vector arithmetic unit comprising:
data rearranging means for rearranging the vector data in the vector register;
a first data line for sending the vector data in the vector register to the adder;
a second data line for sending the vector data to the adder via the data rearranging means; and
data supply switching means for switching the supply line of the vector data to the adder between the first data line and the second data line,
wherein, when the second data line is designated, the data rearranged by the data rearranging means is sequentially input to the adder and added in ascending order to perform the sum operation.

4. The vector arithmetic unit according to any one of claims 1 to 3, wherein the vector data is rearranged in ascending order.

5. A vector arithmetic unit that includes at least one vector register and an adder that performs summation processing on the vector data in the vector register using floating-point arithmetic, the vector arithmetic unit comprising:
data rearranging means for rearranging the vector data in the vector register; and input switching means for switching whether the vector data is input directly from the vector register to the adder or input via the data rearranging means,
wherein, when an instruction signal for rearranging the vector data is set in an instruction word created by a compiler, the input switching means sequentially inputs the vector data to the adder via the data rearranging means, and the data is added in ascending order to perform the summation processing.

6. A vector arithmetic unit that includes at least one vector register and an adder that performs summation processing on the vector data in the vector register using floating-point arithmetic, the vector arithmetic unit comprising:
data rearranging means for rearranging the vector data in the vector register; and input switching means for switching whether the vector data is input directly from the vector register to the adder or input via the data rearranging means,
wherein, when an instruction signal for rearranging the vector data is set in the process status word, the input switching means sequentially inputs the vector data to the adder via the data rearranging means, and the data is added in ascending order to perform the summation processing.