CN1107597A - Pipelined and systolic single-instruction multiple-data-stream (SIMD) array processing architecture and method - Google Patents
- Publication number
- CN1107597A (application CN94101719.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- register
- processing element
- control
- pipeline
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8007—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
- G06F15/8015—One dimensional arrays, e.g. rings, linear arrays, buses
Landscapes
- Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Complex Calculations (AREA)
- Image Processing (AREA)
Abstract
The array processing structure comprises pipelined processing units, registers and multiplexers. Several registers and multiplexers are added at the input and output of each processing unit for data transfer. Data are transmitted to and from each processing unit in a mixed broadcast and systolic mode under the control of a single controller. The structure can be used for high-speed data calculation, shifting and conversion. Each processing unit needs only a small memory, and control through a multi-port memory gives high memory utilization.
Description
The present invention relates to a pipelined and systolic single-instruction multiple-data-stream (SIMD) array processing architecture and its method, and in particular to an array architecture, together with its design method, in which a plurality of pipelined processing elements are connected in a mixed broadcasting and systolic manner so that a single instruction processes multiple data streams. It can be used in the design of computer parallel processors, image processors, digital signal processors and the like, handles data transmission and data conversion more efficiently, and can also be made as a single chip, which greatly increases its practicality.
The primary object of the pipelined and systolic SIMD array processing architecture and method of the present invention is to provide schemes for data input/output shifting, conversion and computation that make processing faster and more efficient.
A secondary object is that, because data input/output is used effectively, the architecture saves data lines and integrated-circuit pins (pin count), avoids the trouble caused by an excessive number of control lines, raises the utilization of the memory, and can be made as a single chip.
Another object is that the architecture and method can be applied to both one-dimensional and two-dimensional array structures.
A further object is that the architecture can be made as a single chip and installed directly in a computer or a television set to achieve a variety of image processing effects, which is practical and convenient and saves space.
To achieve the above objects, the present invention is composed of pipelined processing elements (PE), registers, multiplexers and the like. Registers and multiplexers are connected at the input and output ends of each processing element, and data are sent to the processing elements in a mixed broadcasting and systolic manner. Because registers and multiplexers are added at the I/O ends of the processing elements, and these registers and multiplexers are chained to one another under common control, the invention does not need to reload all the data when it operates, and in particular when data are updated. Only the missing data are loaded, while the data already present are converted and reused as required. This saves data loading time, reduces the number of data lines and control lines, and makes the invention easier to realize as an integrated circuit.
The detailed structure, operating principle, functions and effects of the present invention can be fully understood from the following description with reference to the accompanying drawings:
Fig. 1 is a block diagram of the pipelined and systolic SIMD array processing architecture of the present invention.
Fig. 2 is a block diagram of the internal structure of a processing element (PE) of the present invention.
Fig. 3 is the input/output truth table of the mode-control ROM (CROM, control read-only memory) of the processing element of the present invention.
Fig. 4 shows the internal structure of the processing element of the present invention in the first mode.
Fig. 5 shows the internal structure of the processing element of the present invention in the second mode.
Fig. 6 shows the internal structure of the processing element of the present invention in the third mode.
Fig. 7 shows the internal structure of the processing element of the present invention in the fourth mode.
Fig. 8 shows the internal structure of the processing element of the present invention in the fifth mode.
Fig. 9 shows the internal structure of the processing element of the present invention in the sixth mode.
Fig. 10 is a block diagram of the present invention applied to matrix computation.
Fig. 11 is a schematic diagram of data loading, clock cycle by clock cycle, when the present invention performs matrix computation.
Fig. 12 is a schematic diagram of data conversion, clock cycle by clock cycle, when the present invention performs matrix computation.
Fig. 13 is a block diagram of the present invention processing finite impulse response (FIR) filters.
Fig. 14 is a schematic diagram of data processing, clock cycle by clock cycle, when the present invention processes FIR filters.
Fig. 15 is a block diagram of the present invention processing infinite impulse response (IIR) filters.
Fig. 16 is a schematic diagram of data processing, clock cycle by clock cycle, when the present invention processes IIR filters.
Fig. 17 is a block diagram of the present invention processing edge detection and smoothing.
Fig. 18 is a schematic diagram of data processing, clock cycle by clock cycle, when the present invention processes edge detection and smoothing.
Fig. 19 is a schematic diagram of the action of each control signal, clock cycle by clock cycle, when the present invention processes edge detection and smoothing.
Fig. 20 is a block diagram of the present invention processing the two-dimensional discrete cosine transform (two-dimensional DCT).
Fig. 21 is a schematic diagram of the control and data signals for loading the constants of the two-dimensional DCT.
Figs. 22 and 23 are schematic diagrams of the control and data signals, clock cycle by clock cycle, when the present invention processes the two-dimensional DCT.
Fig. 24 shows an embodiment of the two-dimensional array architecture of the present invention.
Fig. 25 shows an embodiment of the two-dimensional array architecture of the present invention together with the internal structure of the processing elements.
Fig. 26 is a schematic diagram of loading the constants when the present invention processes the two-dimensional DCT with the two-dimensional array architecture.
Figs. 27 and 28 are schematic diagrams of the control and data signals, clock cycle by clock cycle, when the present invention processes the two-dimensional DCT with the two-dimensional array architecture.
Fig. 29 is a block diagram of the present invention processing image motion estimation and template matching with the two-dimensional array architecture.
Fig. 30 shows an embodiment of the two-dimensional array architecture, and of the internal structure of the processing elements, used by the present invention for image motion estimation and template matching.
Fig. 31 is a schematic diagram of the data signals, clock cycle by clock cycle, when the present invention processes image motion estimation and template matching with the two-dimensional array architecture.
Fig. 32 is a schematic diagram of the control signals, clock cycle by clock cycle, when the present invention processes image motion estimation and template matching with the two-dimensional array architecture.
Fig. 33 is an array architecture diagram of an embodiment of the present invention that adopts a stage-pipelined structure.
Fig. 34 is an array architecture diagram of an embodiment of the present invention for computing a 1008-point discrete Fourier transform.
Fig. 35 is an array architecture diagram of an embodiment of the present invention combined with a systolic architecture.
Fig. 36 is an array architecture diagram of an embodiment of the present invention applied to an image compression system.
Fig. 1 is the block diagram of the pipelined and systolic SIMD array processing architecture of the present invention. It comprises several processing elements PE1-PEn. Registers rs11-rs1n, rs21-rs2n and rb and multiplexers Mu11-Mu1n, Mu21-Mu2n and Mb are installed at the input ends of the processing elements PE1-PEn; registers ro1-ron and multiplexers Mo1-Mon and Mob are added at their output ends; and a multi-port memory M is placed in front of the I/O registers rs11-rs1n, rs21-rs2n, rb and ro1-ron. A controller C controls the whole pipelined and systolic SIMD array processing architecture, and its control signals are described below:
Control signal 1: load/shift control of the shift register array rs21-rs2n.
Control signal 2: clear control of the shift register array rs21-rs2n.
Control signal 3: load/shift control of the shift register array rs11-rs1n.
Control signal 4: clear control of the shift register array rs11-rs1n.
Control signal 5: data select control of the multiplexers Mu11-Mu1n.
Control signal 6: data select control of the multiplexers Mu21-Mu2n.
Control signal 7: broadcasting data select control of the multiplexer Mb.
Control signal 8: control of loading broadcasting data into register rb.
Control signal 9: function control of the processing elements PE1-PEn.
Control signal 10: reset control of the processing elements PE1-PEn.
Control signal 11: output shift control of the shift register array ro1-ron.
Control signal 12: data select control of the data select multiplexers Mo1-Mon of the output shift array.
Control signal 13: data select control of the data select multiplexer Mob of the output shift array.
Control signal 14: control of the multi-port memory, including addresses, read/write, enable, etc.
Data transmission control signal 15: controls data transmission to an additional pipelined and systolic SIMD array processing architecture.
Data transmission control signal 16: controls the data lines over which data are sent to other processors.
In the data processing operation of the present invention, the input data delivered to the processing elements PE1-PEn for processing are governed by control signals 1-8. When control signal 2 is 1 (high), registers rs21-rs2n are cleared to 0; when control signal 4 is 1 (high), registers rs11-rs1n are likewise cleared to 0. When control signal 1 is 1, register rs21 is loaded from the multi-port memory M (ms2), the previous content of rs21 moves to rs22, the previous content of rs22 moves to rs23, and so on, which forms the data shift. When control signal 3 is 1, data are shifted in the same way, with the data of the multi-port memory M (ms1) passed to rs11. Two multiplexers sit between each pair of adjacent registers (for example, multiplexers Mu11 and Mu21 lie between registers rs11 and rs12). Multiplexers Mu11-Mu1n are controlled by control signal 5 and multiplexers Mu21-Mu2n by control signal 6, so the input of register rs12 is the data is1 produced by the two multiplexers, the input of register rs13 is the data is2, and so on.
When control signal 5 is 1 (high), the outputs of multiplexers Mu11-Mu1n equal the outputs of registers rs11-rs1n; when it is 0 (low), the outputs of Mu11-Mu1n equal the outputs O1-On of the processing elements PE1-PEn. When control signal 6 is 1, the outputs of multiplexers Mu21-Mu2n equal the outputs of the preceding multiplexers Mu11-Mu1n; when it is 0, they equal the outputs of registers rs21-rs2n. Control signal 8 controls the loading of register rb: when it is 1, register rb is loaded from the multi-port memory M. Control signal 7 controls the data selection of multiplexer Mb: when it is 0, the output signal ib = Ob; when it is 1, ib equals the data in register rb.
On the output side, control is exercised by control signals 11-13 in the same manner: control signal 11 controls the data shift of the shift register array ro1-ron, control signal 12 controls the data selection of multiplexers Mo1-Mon, and control signal 13 controls the data selection of multiplexer Mob. Control signal 14 controls the multi-port memory M and carries out the various data read and write controls.
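The selection rules for the input stage can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the circuit of the invention: the function names `mu1`, `mu2`, `mb` and `shift_load` are hypothetical, and the priority of clear over load is assumed because the text does not state it.

```python
def mu1(rs1_out, pe_out, c5):
    """Multiplexer Mu1i: control signal 5 = 1 selects register rs1i, 0 selects the PE output Oi."""
    return rs1_out if c5 == 1 else pe_out


def mu2(mu1_out, rs2_out, c6):
    """Multiplexer Mu2i: control signal 6 = 1 selects the Mu1i output, 0 selects register rs2i."""
    return mu1_out if c6 == 1 else rs2_out


def mb(rb_out, ob, c7):
    """Multiplexer Mb: control signal 7 = 1 selects register rb, 0 selects Ob."""
    return rb_out if c7 == 1 else ob


def shift_load(regs, new_value, load, clear):
    """One clock of a shift register array such as rs21-rs2n.

    Assumption: clear (control signal 2 or 4) takes priority over
    load/shift (control signal 1 or 3); the text does not specify this.
    """
    if clear == 1:
        return [0] * len(regs)
    if load == 1:
        return [new_value] + regs[:-1]
    return list(regs)


# Loading X1 and then X0 leaves X0 in rs21 and X1 in rs22.
rs2 = [0, 0]
rs2 = shift_load(rs2, "X1", load=1, clear=0)
rs2 = shift_load(rs2, "X0", load=1, clear=0)
print(rs2)
```

The last lines reproduce the two-step load of X1 and X0 that the filter embodiments use.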
Fig. 2 shows the internal structure of the pipelined processing element (PE) of the present invention. It consists of a first-in first-out (FIFO) stack 100, a constant register file 101, multiplexers 102, 103, 108 and 114, registers 106, 107 and 110, a multiplier 104, an absolute-difference unit 105, an adder 109, a data register file 113, a tristate buffer 111 and a decoder 112. Its function control is carried out by control signal 9 of the controller C, and this function control can be further divided into:
FIFO control 91, mode control 92, register load control 93, adder control 94, identification code (ID) 95, constant register file control 96 and data register file control 97.
The mode control 92 contains a read-only memory (ROM) 921 that drives the output terminals C0-C7; its control scheme is shown in Fig. 3. Through the action of ROM 921 the output terminals C0-C7 can produce eight kinds of patterns. C0 and C1 select the input of multiplexer 102; C2, C3 and C4 control multiplexer 103; C5 and C6 control multiplexer 108; and C7 controls multiplexer 114. Under mode control 92 the internal processing mode of the processing element therefore changes, that is, its internal data transmission paths change, and so does the outward configuration of the processing element. The processing element of the present invention can thus be controlled into six configurations (shown in Figs. 4 to 9), and the resulting array structures can handle different array operations, so that the invention processes many kinds of computation more efficiently.
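The grouping of the ROM outputs into multiplexer select fields can be illustrated as follows. This is a minimal sketch under assumptions: bit i of the word is taken to be Ci, the lower-numbered line is treated as the least significant bit of each field, and the name `decode_mode_word` and the example word are hypothetical, because the truth table of Fig. 3 is not reproduced in the text.

```python
def decode_mode_word(word):
    """Split an 8-bit mode-control word (bit i = Ci) into multiplexer select fields."""
    if not 0 <= word <= 0xFF:
        raise ValueError("mode word must fit in 8 bits (C0-C7)")
    return {
        "mux102": word & 0b11,           # C0, C1
        "mux103": (word >> 2) & 0b111,   # C2, C3, C4
        "mux108": (word >> 5) & 0b11,    # C5, C6
        "mux114": (word >> 7) & 0b1,     # C7
    }


# Hypothetical word, not taken from Fig. 3: C0, C2, C5 and C7 are set.
example = decode_mode_word(0b10100101)
print(example)
```

A table of six such words, one per mode, would play the role of ROM 921 in this model.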
The purposes, control functions and reference numerals of the other control lines are as follows:
911 ... data read control signal of the FIFO stack 100.
912 ... data write control signal of the FIFO stack 100.
913 ... reset control signal of the FIFO stack 100.
931 ... data load control signal of register 106.
932 ... data load control signal of register 107.
933 ... data load control signal of register 110.
94 ... add control signal of adder 109.
95 ... identification code (ID) of the processing element, which is also the input signal of decoder 112.
961 ... data read control signal of the constant register file 101.
962 ... data read address signal of the constant register file 101.
963 ... data write control signal of the constant register file 101.
964 ... data write address signal of the constant register file 101.
971 ... data read control signal of the data register file 113.
972 ... data read address signal of the data register file 113.
973 ... data write control signal of the data register file 113.
974 ... data write address signal of the data register file 113.
Fig. 10 shows an embodiment of the array structure of the present invention applied to matrix computation. During matrix computation, under the control of the controller C, the internal structure of the processing elements is in the first mode (shown in Fig. 4). Control signal 5 for multiplexers Mu11-Mu1n, control signal 6 for multiplexers Mu21-Mu2n, control signal 7 for multiplexer Mb and control signal 13 for multiplexer Mob are all held at 1 (high), so the data paths selected by multiplexers Mu11-Mu1n, Mu21-Mu2n, Mb and Mob are as shown in Fig. 10. For simplicity, two processing elements are used in the explanation, and the calculation of the matrix product [yij] = [aij][xij], whose elements are written out further below, is taken as the example:
To carry out the above matrix computation, in this embodiment processing element PE1 is loaded with a00, a01, a02, a03, a20, a21, a22, a23 and processing element PE2 with a10, a11, a12, a13, a30, a31, a32, a33. The data are loaded through registers rs11 and rs12 in cooperation with the control signals (see Fig. 11). Control signal 3 is held at 1 (high), so registers rs11 and rs12 load the data as a shift array, with the data supplied by the multi-port memory M. Data a10 is first sent to register rs11; when the next data a00 arrives, a10 is passed on to register rs12; and when data a11 arrives, a10 and a00 are transferred into the processing elements. At this moment the data write control signal 963 of the constant register file 101 is set to 1 (high), so that processing element PE1 captures a00 and processing element PE2 captures a10. Data transmission and capture continue in this way until PE1 holds a00, a01, a02, a03, a20, a21, a22, a23 and PE2 holds a10, a11, a12, a13, a30, a31, a32, a33. The computation itself is carried out inside processing elements PE1 and PE2 in cooperation with register rb (see Fig. 12). The results of the matrix computation are as follows:
y00=a00x00+a01x10+a02x20+a03x30
y10=a10x00+a11x10+a12x20+a13x30
y20=a20x00+a21x10+a22x20+a23x30
y30=a30x00+a31x10+a32x20+a33x30
y01=a00x01+a01x11+a02x21+a03x31
y11=a10x01+a11x11+a12x21+a13x31
y21=a20x01+a21x11+a22x21+a23x31
y31=a30x01+a31x11+a32x21+a33x31
Since all [aij] have already been loaded into processing elements PE1 and PE2, data x00 is now sent from the multi-port memory M into register rb. When the next data x10 is sent into register rb, the original data x00 is sent into processing elements PE1 and PE2. At the same time the constant register files 101 in PE1 and PE2 output a00 and a10 respectively, and the products computed by the multipliers 104 are delivered to registers 106. The adder 109 is governed by its add/subtract control signal 94: when control signal 94 is 1 (high), the incoming data are passed directly into register 110, and register 110 feeds back to the input of adder 109 so that its content is added to the subsequent products. Control signal 12 controls multiplexers Mo1 and Mo2: when it is 0 (low), the data of O2 are loaded into register ro2 and the data of O1 into register ro1, and registers ro1 and ro2 then shift the data out. By repeating this operation, all the results of the matrix computation are sent out for use, which accomplishes the matrix computation function of the present invention.
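The row assignment and broadcast accumulation described for the matrix embodiment can be modelled in software as follows. This is a sketch under assumptions, not a cycle-accurate description of Figs. 11 and 12: the function name `broadcast_matmul` is hypothetical, the order in which results leave the array is simplified, and the processing elements that work in lockstep in hardware are simulated by an inner loop.

```python
def broadcast_matmul(a, x, n_pe=2):
    """Compute y = a * x with rows of `a` interleaved over n_pe processing elements."""
    rows, inner, cols = len(a), len(x), len(x[0])
    y = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        # With n_pe = 2, PE1 handles rows 0 and 2 and PE2 handles rows 1 and 3,
        # matching the a(ij) distribution given in the text.
        for base in range(0, rows, n_pe):
            acc = [0] * n_pe                      # register 110 of each PE
            for k in range(inner):
                rb = x[k][c]                      # value broadcast through register rb
                for p in range(n_pe):             # every PE executes the same instruction
                    r = base + p
                    if r < rows:
                        acc[p] += a[r][k] * rb    # multiplier 104 followed by adder 109
            for p in range(n_pe):
                if base + p < rows:
                    y[base + p][c] = acc[p]       # result shifted out through ro1, ro2
    return y


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = broadcast_matmul(a, x)
print(y)
```

Each processing element only needs its own rows of [aij] in its constant register file, while every x value is fetched from memory once and shared by all elements.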
Fig. 13 shows the array structure of an embodiment of the present invention applied to finite impulse response (FIR) filters. The control lines of the controller are set as follows: the internal structure of the processing elements PE is in the second mode (shown in Fig. 5); control signal 5 for multiplexers Mu11-Mu1n is at 1 (high); and control signal 7 for multiplexer Mb and control signal 13 for multiplexer Mob are also at 1 (high). Two processing elements (PE1, PE2) are again used for the explanation, so the array structure is as shown in Fig. 13, and the calculation of
yi=a0Xi+a1Xi-1+a2Xi-2+a3Xi-3
is taken as the example, so that
y0=a0X0+a1X-1+a2X-2+a3X-3
y1=a0X1+a1X0+a2X-1+a3X-2
y2=a0X2+a1X1+a2X0+a3X-1
y3=a0X3+a1X2+a2X1+a3X0
y4=a0X4+a1X3+a2X2+a3X1
y5=a0X5+a1X4+a2X3+a3X2
and so on.
During data processing (see Fig. 14), registers rs21, rs22 and rs11, rs12 handle the [X(m)] data, while register rb controls the loading of the [a(n)] data. Together with multiplexers Mu21 and Mu22 and control signal 6, this allows the data to be delivered smoothly to processing elements PE1 and PE2 for computation. On the output side, control signal 12 controls multiplexers Mo1 and Mo2 so that the results are passed on. The loading, transfer and computation of the data are described as follows:
The multi-port memory M sends data X1 to register rs21 and then the second data X0 to register rs21, the original data X1 moving on to register rs22. At this point control signal 6 for multiplexers Mu21 and Mu22 is 0 (low), so that the inputs is1 and is2 of processing elements PE1 and PE2 take the data of registers rs21 and rs22. At the same time the multi-port memory M sends a0 to register rb, so that PE1 computes with a0 and X0 and PE2 computes with a0 and X1. Control signal 6 for Mu21 and Mu22 then changes to 1 (high), and registers rs11 and rs12 take over the successive data shifts. Meanwhile the add/subtract control signal 94 of adder 109 inside PE1 and PE2 governs the accumulation of the data, and control signal 12 for multiplexers Mo1 and Mo2 governs the output of the results, which are passed to registers ro1 and ro2. The data of registers ro1 and ro2 are then transferred to the multi-port memory M or to other function processors for use, which accomplishes the function of the FIR filter.
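The FIR schedule, in which each coefficient is broadcast while the processing elements work on neighbouring outputs, can be sketched as follows. This is a simplified model under assumptions: the name `fir_two_pe` is hypothetical, samples with indices outside the input list (such as X-1) are taken as zero, and the shifting of X through the rs registers is replaced by direct indexing.

```python
def fir_two_pe(a, x, n_out, n_pe=2):
    """Compute y[i] = sum_k a[k] * x[i - k], n_pe outputs at a time."""

    def sample(i):
        # Assumption: samples before the start or past the end of x are zero.
        return x[i] if 0 <= i < len(x) else 0

    y = [0] * n_out
    for base in range(0, n_out, n_pe):
        acc = [0] * n_pe                          # accumulator (register 110) of each PE
        for k, coeff in enumerate(a):             # coeff is broadcast through register rb
            for p in range(n_pe):                 # PE1 works on y[base], PE2 on y[base + 1]
                acc[p] += coeff * sample(base + p - k)
        for p in range(n_pe):
            if base + p < n_out:
                y[base + p] = acc[p]
    return y


# A unit impulse returns the coefficients themselves.
impulse_response = fir_two_pe([1, 2, 3, 4], [1, 0, 0, 0, 0, 0], 6)
print(impulse_response)
```

In this model PE1 produces the even-numbered outputs and PE2 the odd-numbered ones, in line with PE1 starting on a0 and X0 and PE2 on a0 and X1.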
Fig. 15 shows the array structure of an embodiment of the present invention processing infinite impulse response (IIR) filters. Here the control signals of the controller C set the array structure as follows: processing elements PE1 and PE2 are in the second mode, and the reset and the input Ob signal are also controlled. In addition, control signal 2, which clears registers rs21 and rs22, control signal 6 for multiplexers Mu21 and Mu22, control signal 7 for multiplexer Mb and control signal 12 for multiplexers Mo1 and Mo2 are used to control the transfer and conversion of the data so as to realize the function of the IIR filter. The calculation of
yi+b1yi-1+b2yi-2+b3yi-3=a0Xi+a1Xi-1+a2Xi-2+a3Xi-3
is taken as the example, so that
y0+b1y-1+b2y-2+b3y-3=a0X0+a1X-1+a2X-2+a3X-3
y1+b1y0+b2y-1+b3y-2=a0X1+a1X0+a2X-1+a3X-2
y2+b1y1+b2y0+b3y-1=a0X2+a1X1+a2X0+a3X-1
y3+b1y2+b2y1+b3y0=a0X3+a1X2+a2X1+a3X0
and so on (see Fig. 16). Processing element PE1 calculates y0 (= -b1y-1-b2y-2-b3y-3+a0X0+a1X-1+a2X-2+a3X-3), y2, y4 and so on, and processing element PE2 calculates y1 (= -b1y0-b2y-1-b3y-2+a0X1+a1X0+a2X-1+a3X-2), y3, y5 and so on. The loading and transfer of the data are described as follows:
The multi-port memory M first sends data X1 to register rs21 and then sends data X0 to register rs21, the original data X1 moving to register rs22. Control signal 6 for multiplexers Mu21 and Mu22 is then set to 0 (low) as the selection, so that the inputs is1 and is2 of processing elements PE1 and PE2 take the data of registers rs21 and rs22, while register rb is loaded with a0 so that computation can begin. Loading, shifting and computation continue in this way. When PE1 has computed a0X0+a1X-1 and PE2 has computed a0X1+a1X0, control signal 2 is set to 1 (high) to clear registers rs21 and rs22 to 0. The -b(n) and y(m) data are then loaded for computation, and the data y0 is fed back and entered into the computation. When the results y0 and y1 of PE1 and PE2 are produced, control signal 12 for multiplexers Mo1 and Mo2 is set to 0 (low) so that the data are output to registers ro1 and ro2, while new data are loaded for the next computation. Repeating this cycle continuously realizes the function of the IIR filter.
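The recurrence with output feedback can be sketched as below. This is an illustrative model only: the name `iir_two_pe` is hypothetical, zero initial conditions are assumed for samples and outputs with negative indices, and the timing of the feedback path is not modelled; the sketch only records which processing element owns each output.

```python
def iir_two_pe(a, b, x, n_out):
    """Compute y[i] = sum_k a[k]*x[i-k] - sum_k b[k-1]*y[i-k], with b = [b1, b2, ...].

    Returns the outputs and, for each output, the processing element that owns it.
    """

    def sample(i):
        # Assumption: samples before the start or past the end of x are zero.
        return x[i] if 0 <= i < len(x) else 0

    y, owners = [], []
    for i in range(n_out):
        owners.append("PE1" if i % 2 == 0 else "PE2")   # even outputs on PE1, odd on PE2
        acc = sum(a[k] * sample(i - k) for k in range(len(a)))   # feed-forward part
        for k, bk in enumerate(b, start=1):              # feedback part uses earlier outputs
            if i - k >= 0:
                acc -= bk * y[i - k]
        y.append(acc)
    return y, owners


# First-order example: y[i] = x[i] + 0.5 * y[i-1], driven by a unit impulse.
outputs, owners = iir_two_pe([1.0], [-0.5], [1.0, 0, 0, 0], 4)
print(outputs, owners)
```

Because y1 needs y0, the two processing elements cannot be fully independent here, which is why the text feeds y0 back before the feedback terms are accumulated.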
Fig. 17 shows the array structure of the present invention applied to edge detection and smoothing. Four processing elements PE1, PE2, PE3 and PE4 are used in the explanation. Under the control of the controller, the interior of PE1, PE2, PE3 and PE4 is in the second mode, and the FIFO stack 100 is also used. The following calculation is taken as the example:
y30=X50·W20+X51·W21+X52·W22+X40·W10+X41·W11+X42·W12+X30·W00+X31·W01+X32·W02
y31=X51·W20+X52·W21+X53·W22+X41·W10+X42·W11+X43·W12+X31·W00+X32·W01+X33·W02
y20=X40·W20+X41·W21+X42·W22+X30·W10+X31·W11+X32·W12+X20·W00+X21·W01+X22·W02
y21=X41·W20+X42·W21+X43·W22+X31·W10+X32·W11+X33·W12+X21·W00+X22·W01+X23·W02
y10=X30·W20+X31·W21+X32·W22+X20·W10+X21·W11+X22·W12+X10·W00+X11·W01+X12·W02
y11=X31·W20+X32·W21+X33·W22+X21·W10+X22·W11+X23·W12+X11·W00+X12·W01+X13·W02
y00=X20·W20+X21·W21+X22·W22+X10·W10+X11·W11+X12·W12+X00·W00+X01·W01+X02·W02
y01=X21·W20+X22·W21+X23·W22+X11·W10+X12·W11+X13·W12+X01·W00+X02·W01+X03·W02
Processing element PE1 calculates y30 and y31, PE2 calculates y20 and y21, PE3 calculates y10 and y11, and PE4 calculates y00 and y01. The data input flow and the action of each control signal (see Figs. 18 and 19) are described as follows:
Control signal 5 controls the input data selected by multiplexers Mu11, Mu12, Mu13 and Mu14, and control signal 6 controls multiplexers Mu21, Mu22, Mu23 and Mu24. Control signals 911, 912 and 913 handle the data read, data write and reset of the FIFO stack 100. Under the action of these control signals, data loading and computation can proceed at the same time. When the computation is finished, control signal 12 is set to 0 (low) and the results are loaded into registers ro1, ro2, ro3 and ro4 for output.
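The window sums of the edge detection and smoothing example follow the pattern y(r,c) = sum over i and j of X(r+i, c+j)·W(i,j). A minimal sketch is given below; the name `window_sums` is hypothetical, the FIFO-based reuse of rows is not modelled, and the mapping of PE1 to row 3 down to PE4 to row 0 follows the assignment stated in the text.

```python
def window_sums(x, w, rows=4, cols=2):
    """Compute y[r][c] = sum_{i,j} x[r + i][c + j] * w[i][j] for a small output block."""
    kh, kw = len(w), len(w[0])
    y = [[0] * cols for _ in range(rows)]
    for pe in range(rows):
        r = rows - 1 - pe            # PE1 -> row 3, PE2 -> row 2, PE3 -> row 1, PE4 -> row 0
        for c in range(cols):
            y[r][c] = sum(x[r + i][c + j] * w[i][j]
                          for i in range(kh) for j in range(kw))
    return y


# Test image whose value encodes its position: x[r][c] = 10 * r + c.
x = [[10 * r + c for c in range(4)] for r in range(6)]
w_corner = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]   # only W22 is non-zero
y_corner = window_sums(x, w_corner)
print(y_corner)
```

With only W22 set, y30 picks out X52 and y01 picks out X23, which agrees with the W22 terms in the expansions above.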
Figure 20 shows the array architecture of the present invention for the two-dimensional discrete cosine transform (2-D DCT). Processing elements PE1, PE2 and PE3 operate in the first mode and use the constant register file 101, the data register file 113, and the control provided by the decoder 112 and the tri-state buffer 111, to calculate:
The example computes the 2-D discrete cosine transform [Zij] of the matrix [Xij], starting with the first-stage calculation.
The loading of data, the computation and the action of each control signal are described below with reference to Figures 21, 22 and 23:
As shown in Figure 21, the a(ij) data are first loaded into the constant register file 101 inside processing elements PE1, PE2 and PE3. The X(ij) data are then loaded from the multiport memory M into register rb (as shown in Figure 22), and the computation yields y00, y01, y02 (PE1), y10, y11, y12 (PE2) and y20, y21, y22 (PE3). The control signals produced by the tri-state buffer 111 and the decoder 112 then feed the y(ij) data (in the order y00, y01, y02, y10, y11, y12, y20, y21, y22) back to the processing elements as computation input, which completes the 2-D DCT.
Figure 24 shows an embodiment of the two-dimensional array architecture of the present invention, used to compute the 2-D discrete cosine transform. It is described with six processing elements (as shown in Figure 25): PE11, PE12, PE21, PE22, PE31 and PE32. The loading of data, the action of each control line and the computation (see Figures 26, 27 and 28) are as follows:
The a(ij) data are first loaded into the constant register file 101 inside processing elements PE11, PE21, PE31, PE12, PE22 and PE32 (as shown in Figure 26). The multiport memory M then sends the X(ij) data in through register rb, and processing elements PE11, PE21 and PE31 compute the y(ij) data (as shown in Figure 27). These are passed through Ob to processing elements PE12, PE22 and PE32, so that PE12 produces Z00, Z10 and Z20, PE22 produces Z01, Z11 and Z21, and PE32 produces Z02, Z12 and Z22 (as shown in Figure 28), which completes the 2-D DCT.
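Both DCT embodiments compute the transform in two passes: a first pass produces the intermediate y(ij) values, and a second pass turns them into Z(ij). The sketch below is one way to model that in Python. It assumes the usual separable form Y = A·X followed by Z = Y·Aᵀ with orthonormal DCT-II coefficients for a(ij); the source does not list the coefficients or the exact formula, so both are assumptions, and all function names are hypothetical.

```python
import math


def dct_matrix(n=3):
    """Orthonormal DCT-II coefficients a(i, k). Assumed, not given in the source."""
    rows = []
    for i in range(n):
        scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * k + 1) * i * math.pi / (2 * n))
                     for k in range(n)])
    return rows


def first_pass(A, X):
    """Y = A X. Row i of Y (y_i0, y_i1, y_i2) is the work of one first-stage PE."""
    n = len(A)
    return [[sum(A[i][k] * X[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def second_pass(A, Y):
    """Z = Y A^T. Column j of Z (Z_0j, Z_1j, Z_2j) is the work of one second-stage PE."""
    n = len(A)
    return [[sum(Y[i][k] * A[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def dct2(X):
    """2-D DCT as two passes over the same coefficient set."""
    A = dct_matrix(len(X))
    return second_pass(A, first_pass(A, X))


Z_flat = dct2([[1, 1, 1], [1, 1, 1], [1, 1, 1]])
Z_ramp = dct2([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

In the one-dimensional array the same three processing elements would run both passes, with y(ij) fed back as input; in the two-dimensional array the second pass runs on the second column of processing elements. The arithmetic is the same in either case.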
Figure 29 shows the two-dimensional array architecture of the present invention (n × m processing elements) in an embodiment applied to motion estimation and template matching, where P1, P2, ... Pm are programmable delays. A 3 × 3 processing array is used for illustration (as shown in Figure 30), in which P1 and P2 are delays of 3 clock cycles and the internal structure of processing elements PE11, PE12, PE13, PE21, PE22, PE23, PE31, PE32 and PE33 operates in the sixth mode (as shown in Figure 9), to calculate:
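$$
\begin{aligned}
Z_{20} &= |X_{20}-Y_{40}| + |X_{21}-Y_{41}| + |X_{22}-Y_{42}| + |X_{10}-Y_{30}| + |X_{11}-Y_{31}| + |X_{12}-Y_{32}| + |X_{00}-Y_{20}| + |X_{01}-Y_{21}| + |X_{02}-Y_{22}| \\
Z_{21} &= |X_{20}-Y_{41}| + |X_{21}-Y_{42}| + |X_{22}-Y_{43}| + |X_{10}-Y_{31}| + |X_{11}-Y_{32}| + |X_{12}-Y_{33}| + |X_{00}-Y_{21}| + |X_{01}-Y_{22}| + |X_{02}-Y_{23}| \\
Z_{22} &= |X_{20}-Y_{42}| + |X_{21}-Y_{43}| + |X_{22}-Y_{44}| + |X_{10}-Y_{32}| + |X_{11}-Y_{33}| + |X_{12}-Y_{34}| + |X_{00}-Y_{22}| + |X_{01}-Y_{23}| + |X_{02}-Y_{24}| \\
Z_{10} &= |X_{20}-Y_{30}| + |X_{21}-Y_{31}| + |X_{22}-Y_{32}| + |X_{10}-Y_{20}| + |X_{11}-Y_{21}| + |X_{12}-Y_{22}| + |X_{00}-Y_{10}| + |X_{01}-Y_{11}| + |X_{02}-Y_{12}| \\
Z_{11} &= |X_{20}-Y_{31}| + |X_{21}-Y_{32}| + |X_{22}-Y_{33}| + |X_{10}-Y_{21}| + |X_{11}-Y_{22}| + |X_{12}-Y_{23}| + |X_{00}-Y_{11}| + |X_{01}-Y_{12}| + |X_{02}-Y_{13}| \\
Z_{12} &= |X_{20}-Y_{32}| + |X_{21}-Y_{33}| + |X_{22}-Y_{34}| + |X_{10}-Y_{22}| + |X_{11}-Y_{23}| + |X_{12}-Y_{24}| + |X_{00}-Y_{12}| + |X_{01}-Y_{13}| + |X_{02}-Y_{14}| \\
Z_{00} &= |X_{20}-Y_{20}| + |X_{21}-Y_{21}| + |X_{22}-Y_{22}| + |X_{10}-Y_{10}| + |X_{11}-Y_{11}| + |X_{12}-Y_{12}| + |X_{00}-Y_{00}| + |X_{01}-Y_{01}| + |X_{02}-Y_{02}| \\
Z_{01} &= |X_{20}-Y_{21}| + |X_{21}-Y_{22}| + |X_{22}-Y_{23}| + |X_{10}-Y_{11}| + |X_{11}-Y_{12}| + |X_{12}-Y_{13}| + |X_{00}-Y_{01}| + |X_{01}-Y_{02}| + |X_{02}-Y_{03}| \\
Z_{02} &= |X_{20}-Y_{22}| + |X_{21}-Y_{23}| + |X_{22}-Y_{24}| + |X_{10}-Y_{12}| + |X_{11}-Y_{13}| + |X_{12}-Y_{14}| + |X_{00}-Y_{02}| + |X_{01}-Y_{03}| + |X_{02}-Y_{04}|
\end{aligned}
$$
These sums serve as the worked example.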
Referring to Figures 31 and 32, processing element PE11 calculates Z20, PE12 calculates Z21, PE13 calculates Z22, PE21 calculates Z10, PE22 calculates Z11, PE23 calculates Z12, PE31 calculates Z00, PE32 calculates Z01 and PE33 calculates Z02. Together these computations perform motion estimation and template matching.
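Each of the nine sums is a sum of absolute differences between the 3 × 3 block X and a 3 × 3 window of the 5 × 5 block Y, shifted by (p, q). The Python sketch below restates that arithmetic and the assignment of sums to processing elements given in the text. It is a functional model only: the programmable delays and data movement are not modelled, the function names are hypothetical, and choosing the smallest sum as the best match is an assumption, since the source stops at computing the Z values.

```python
def sad(X, Y, p, q):
    """Z[p][q] = sum over i, j in 0..2 of |X[i][j] - Y[i+p][j+q]|."""
    return sum(abs(X[i][j] - Y[i + p][j + q]) for i in range(3) for j in range(3))


def sad_array(X, Y):
    """Map each PE to its sum: PE11 -> Z20, PE12 -> Z21, ..., PE33 -> Z02."""
    return {f"PE{a}{b}": sad(X, Y, 3 - a, b - 1)
            for a in (1, 2, 3) for b in (1, 2, 3)}


def best_match(X, Y):
    """Offset (p, q) with the smallest sum. Selection step is an assumption."""
    return min(((p, q) for p in range(3) for q in range(3)),
               key=lambda pq: sad(X, Y, pq[0], pq[1]))


# Illustrative data (not from the source): X is cut out of Y at offset (1, 2).
Y = [[10 * r + c for c in range(5)] for r in range(5)]
X = [[Y[i + 1][j + 2] for j in range(3)] for i in range(3)]
sums = sad_array(X, Y)
offset = best_match(X, Y)
```

Because X is copied from Y at offset (1, 2), the sum Z12 held by PE23 is zero and every other sum is positive.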
Figure 33 shows an embodiment of the present invention that uses a stage-pipelined architecture. Here n pipelined, systolic, single-instruction multiple-data (SIMD) array processing architectures 2001, 2002, ... 200n are connected in pipeline fashion to form a stage-pipelined architecture, which can be combined with a microprocessor or digital signal processor (DSP) 1001 to increase computing speed. The 1008-point discrete Fourier transform (DFT) is taken as an example (as shown in Figure 34; note that 1008 = 7 × 9 × 16). Connected in series at the two ends of the microprocessor or DSP 1001 are the pipelined, systolic SIMD array processing architecture 3000, which computes the 7-point DFT, the array processing architecture 3001, which computes the 9-point DFT, and the array processing architecture 3002, which computes the 16-point DFT. Computing with this arrangement yields the 1008-point DFT at a multiplied computing speed.
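The source does not say how the 7-point, 9-point and 16-point stages are combined into a 1008-point result. One standard option, shown here purely as an assumption, is the prime-factor (Good-Thomas) index mapping, which works because 7, 9 and 16 are pairwise coprime and needs no twiddle multiplications between stages. The Python sketch below is a plain software model with hypothetical function names; it needs Python 3.8 or later for `math.prod` and the modular inverse form of `pow`.

```python
import cmath
from math import prod


def dft(x):
    """Direct DFT, standing in for one small fixed-size array stage."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def pfa_dft(x, factors):
    """DFT of length prod(factors) for pairwise coprime factors (Good-Thomas)."""
    if len(factors) == 1:
        return dft(x)
    n1, rest = factors[0], factors[1:]
    m = prod(rest)
    n = n1 * m
    # Input index map: sample (a*m + b*n1) mod n goes to row a, column b.
    rows = [[x[(a * m + b * n1) % n] for b in range(m)] for a in range(n1)]
    # Stages for the remaining factors: length-m transform of every row.
    rows = [pfa_dft(row, rest) for row in rows]
    # Stage for n1: length-n1 transform of every column.
    cols = [dft([rows[a][kb] for a in range(n1)]) for kb in range(m)]
    # Output index map from the Chinese remainder theorem.
    u = m * pow(m, -1, n1)
    v = n1 * pow(n1, -1, m)
    out = [0j] * n
    for ka in range(n1):
        for kb in range(m):
            out[(ka * u + kb * v) % n] = cols[kb][ka]
    return out


# A unit impulse at sample 1 should transform to exp(-2*pi*i*k/1008).
impulse = [0j] * 1008
impulse[1] = 1
spectrum = pfa_dft(impulse, [7, 9, 16])
```

In this model each call to `dft` corresponds to work that one fixed-size stage (7, 9 or 16 points) would perform; only short transforms are ever computed.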
Figure 35 shows an embodiment in which the present invention is combined with a systolic architecture. Systolic structures made up of a plurality of processing elements can be connected before and after the array processing architecture, together with a control microprocessor or digital signal processor (DSP). In the example of Figure 35, processing elements PE1 to PEn are added between two pipelined, systolic SIMD array processing architectures 4000 and 4001 to form a systolic architecture 4002, which can be combined with a microprocessor or DSP.
An image compression system is taken as an example (as shown in Figure 36). The pipelined, systolic SIMD array processing architecture 5000 for the 2-D discrete cosine transform (DCT) and the pipelined, systolic SIMD array processing architecture 5001 for the inverse 2-D discrete cosine transform (IDCT) are connected to the systolic structure 5002, and both 5000 and 5001 are connected to the control microprocessor or DSP 1001. Inside the systolic structure several processing elements are connected together: a quantizer PE11, a zig-zag scan processor PE21, a coder PE31, a dequantizer PE12, an inverse zig-zag scan processor PE22, a decoder PE32 and a multiplexer Mul. All of these processing elements are connected in systolic fashion, and a control signal 19 selects the data path of the multiplexer Mul, so that image compression is achieved.
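The chain of stages inside the systolic structure can be pictured as a forward path (quantizer, zig-zag scan, coder) and a reverse path (decoder, inverse zig-zag scan, dequantizer). The Python sketch below models that chain on a small block of transform coefficients. The source does not specify the quantization step, the scan order or the coding scheme, so the uniform step `q`, the JPEG-style scan and the run-length code used here are assumptions, and all function names are hypothetical.

```python
def zigzag_order(n):
    """Index pairs of an n x n block in a JPEG-style zig-zag order (assumed)."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else reversed(diag))
    return order


def quantize(block, q):
    """Quantizer stage (PE11): divide by a uniform step and round."""
    return [[round(v / q) for v in row] for row in block]


def dequantize(block, q):
    """Dequantizer stage (PE12): multiply back by the step."""
    return [[v * q for v in row] for row in block]


def zigzag(block):
    """Zig-zag scan stage (PE21): block to one-dimensional sequence."""
    return [block[i][j] for i, j in zigzag_order(len(block))]


def inverse_zigzag(seq, n):
    """Inverse zig-zag scan stage (PE22): sequence back to block."""
    block = [[0] * n for _ in range(n)]
    for v, (i, j) in zip(seq, zigzag_order(n)):
        block[i][j] = v
    return block


def rle_encode(seq):
    """Coder stage (PE31): run-length code as a simple stand-in."""
    out = []
    for v in seq:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out


def rle_decode(pairs):
    """Decoder stage (PE32): expand the run-length pairs."""
    return [v for v, run in pairs for _ in range(run)]


def compress(coeffs, q):
    return rle_encode(zigzag(quantize(coeffs, q)))


def decompress(code, n, q):
    return dequantize(inverse_zigzag(rle_decode(code), n), q)


# Illustrative coefficients (not from the source), as a DCT stage might emit.
coeffs = [[80, 12, 0], [-8, 0, 0], [0, 0, 0]]
code = compress(coeffs, 4)
restored = decompress(code, 3, 4)
```

The zig-zag scan groups the zero-valued high-frequency coefficients at the end of the sequence, which is why a run-length style coder works well after it. The round trip is exact here only because the sample values are multiples of the step.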
In summary, in the pipelined, systolic, single-instruction multiple-data (SIMD) array processing architecture and method of the present invention, data computation, data transfer and input/output can all proceed simultaneously under the control of the control signals. This saves computation time, and it also saves the large amount of time otherwise spent loading data, as well as connecting lines. The architecture can be fabricated as a single chip and has commercial value. The present invention has not been seen in the related industry and meets the requirements of the Patent Law, so this patent application is filed in accordance with the law.
Claims (6)
1. A pipelined, systolic, single-instruction multiple-data (SIMD) array processing architecture, characterized in that its main structure is a plurality of pipelined processing elements connected in series; several groups of serially connected registers are provided at the inputs of the processing elements, and every register group is connected to a multiport memory; multiplexers are provided at the data transfer ports of these registers; a register and a multiplexer are provided at the output of each processing element; and a controller is attached outside this structure, the controller being connected to and controlling each register, multiplexer and processing element and the multiport memory.
2. A pipelined, systolic, single-instruction multiple-data (SIMD) array processing method, characterized in that, at the inputs and outputs of the processing elements, the registers are chained using a mixture of broadcast and systolic connections, and multiplexers are added to select the data transferred between registers; the data output by a processing element can also be fed back to its input, so that data transfers are put to appropriate use and the transfer, computation and input/output of data can be handled simultaneously.
3. The pipelined, systolic, single-instruction multiple-data (SIMD) array processing architecture as claimed in claim 1, characterized in that the pipelined processing element comprises a first-in first-out (FIFO) buffer, a constant register file, a multiplier, an adder, an absolute-difference arithmetic unit, a data register file, a tri-state buffer, a decoder, registers and multiplexers, and is connected to a controller that gives the processing element six different operating modes.
4. The pipelined, systolic, single-instruction multiple-data (SIMD) array processing architecture as claimed in claim 1, characterized in that the main structure of processing elements is connected as a two-dimensional array.
5. The pipelined, systolic, single-instruction multiple-data (SIMD) array processing architecture as claimed in claim 1, characterized in that a stage-pipelined array architecture is combined with a microprocessor or digital signal processor.
6. The pipelined, systolic, single-instruction multiple-data (SIMD) array processing architecture as claimed in claim 1, characterized in that systolic structures made up of a plurality of processing elements are connected before and after the array processing architecture and are connected to a control microprocessor or digital signal processor.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN94101719.2A CN1107597A (en) | 1994-02-24 | 1994-02-24 | Pipeline type and palpitation type single-instruction multi-data-flow array processing structure and method |
US08/260,393 US5659780A (en) | 1994-02-24 | 1994-06-15 | Pipelined SIMD-systolic array processor and methods thereof |
GB9413501A GB2286909A (en) | 1994-02-24 | 1994-07-05 | Pipelined SIMD-systolic array processor. |
DE19504089A DE19504089A1 (en) | 1994-02-24 | 1995-02-08 | Pipelined SIMD-systolic array processor in computer, video image processing, DSP |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN94101719.2A CN1107597A (en) | 1994-02-24 | 1994-02-24 | Pipeline type and palpitation type single-instruction multi-data-flow array processing structure and method |
DE19504089A DE19504089A1 (en) | 1994-02-24 | 1995-02-08 | Pipelined SIMD-systolic array processor in computer, video image processing, DSP |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1107597A (en) | 1995-08-30 |
Family
ID=25743389
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN94101719.2A Pending CN1107597A (en) | 1994-02-24 | 1994-02-24 | Pipeline type and palpitation type single-instruction multi-data-flow array processing structure and method |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN1107597A (en) |
DE (1) | DE19504089A1 (en) |
GB (1) | GB2286909A (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7506136B2 (en) | 1999-04-09 | 2009-03-17 | Clearspeed Technology Plc | Parallel data processing apparatus |
US7526630B2 (en) | 1999-04-09 | 2009-04-28 | Clearspeed Technology, Plc | Parallel data processing apparatus |
US7627736B2 (en) | 1999-04-09 | 2009-12-01 | Clearspeed Technology Plc | Thread manager to control an array of processing elements |
US8174530B2 (en) | 1999-04-09 | 2012-05-08 | Rambus Inc. | Parallel date processing apparatus |
US8171263B2 (en) | 1999-04-09 | 2012-05-01 | Rambus Inc. | Data processing apparatus comprising an array controller for separating an instruction stream processing instructions and data transfer instructions |
GB2348982A (en) * | 1999-04-09 | 2000-10-18 | Pixelfusion Ltd | Parallel data processing system |
US8169440B2 (en) | 1999-04-09 | 2012-05-01 | Rambus Inc. | Parallel data processing apparatus |
US7966475B2 (en) | 1999-04-09 | 2011-06-21 | Rambus Inc. | Parallel data processing apparatus |
US7802079B2 (en) | 1999-04-09 | 2010-09-21 | Clearspeed Technology Limited | Parallel data processing apparatus |
WO2000062182A2 (en) | 1999-04-09 | 2000-10-19 | Clearspeed Technology Limited | Parallel data processing apparatus |
US8762691B2 (en) | 1999-04-09 | 2014-06-24 | Rambus Inc. | Memory access consolidation for SIMD processing elements using transaction identifiers |
US7788471B2 (en) | 2006-09-18 | 2010-08-31 | Freescale Semiconductor, Inc. | Data processor and methods thereof |
DE102007014132A1 (en) * | 2007-03-23 | 2008-09-25 | Siemens Audiologische Technik Gmbh | Processor system with directly interconnected ports |
BR112021016111A2 (en) | 2019-03-15 | 2021-11-09 | Intel Corp | Computing device, parallel processing unit, general-purpose graphics processing unit core, and graphics multiprocessor |
EP3938893A1 (en) | 2019-03-15 | 2022-01-19 | INTEL Corporation | Systems and methods for cache optimization |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2226899A (en) * | 1989-01-06 | 1990-07-11 | Philips Electronic Associated | An electronic circuit and signal processing arrangements using it |
GB8925720D0 (en) * | 1989-11-14 | 1990-01-04 | Amt Holdings | Processor array system |
1994
- 1994-02-24: CN application CN94101719.2A (CN1107597A), status: Pending
- 1994-07-05: GB application GB9413501A (GB2286909A), status: Withdrawn
1995
- 1995-02-08: DE application DE19504089A (DE19504089A1), status: Ceased
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102184089A (en) * | 2011-05-27 | 2011-09-14 | 清华大学 | Data stream operating method in dynamic reconfigurable processor |
CN102184089B (en) * | 2011-05-27 | 2014-01-01 | 清华大学 | Data stream operating method in dynamic reconfigurable processor |
Also Published As
Publication number | Publication date |
---|---|
DE19504089A1 (en) | 1996-08-14 |
GB2286909A (en) | 1995-08-30 |
GB9413501D0 (en) | 1994-08-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |