WO2019093352A1 - Data processing device - Google Patents

Info

Publication number
WO2019093352A1
WO2019093352A1 PCT/JP2018/041281 JP2018041281W WO2019093352A1 WO 2019093352 A1 WO2019093352 A1 WO 2019093352A1 JP 2018041281 W JP2018041281 W JP 2018041281W WO 2019093352 A1 WO2019093352 A1 WO 2019093352A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
transfer
instruction
register
internal memory
Prior art date
Application number
PCT/JP2018/041281
Other languages
French (fr)
Japanese (ja)
Inventor
悠記 小林
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Publication of WO2019093352A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/36 Handling requests for interconnection or transfer for access to common bus or bus system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode

Definitions

  • the present invention relates to a data processing apparatus that performs data transfer and arithmetic processing.
  • calculations may be repeated on data sets in which hundreds of millions of data items are aggregated into millions of entries.
  • operations such as matrix multiplication, vector-matrix multiplication, and element-by-element multiplication of vectors on a matrix of several million dimensions by several hundred dimensions may occur.
  • CPU central processing unit
  • GPGPU general-purpose computing on graphics processing units
  • the FPGA is provided with a general-purpose logic element called a LUT (Look Up Table) and a variable wiring network connecting between a plurality of LUTs.
  • LUT Look Up Table
  • various arithmetic devices can be realized by rewriting the contents of the LUT and the wiring network.
  • an FPGA in which dedicated resources such as a digital signal processor (DSP) and a static random access memory (SRAM) are mounted.
  • DSP digital signal processor
  • SRAM static random access memory
  • Such an FPGA can realize an efficient computing device.
  • the physical positions of the DSP and SRAM are fixed, so unless the architecture uses the DSP and SRAM appropriately, the wiring becomes congested and must be routed around the congested portion.
  • the problem is that the wiring length increases as a result.
  • the delay time of the wire is extended and the operating frequency of the arithmetic device is lowered.
  • FPGAs in which the SRAM module can be configured as a True Dual Port, i.e., a completely independent Dual Port RAM (Random Access Memory).
  • a True Dual Port, i.e., a completely independent Dual Port RAM (Random Access Memory).
  • Such an FPGA can be used as a memory having two systems of clock input and address input. It is desirable to make full use of these features in order to maximize the capabilities of the FPGA.
  • Patent Document 1 discloses a neural network apparatus configured by connecting a plurality of ring registers for performing a product-sum operation of a neural network in a ring.
  • the device disclosed in Patent Document 1 includes a ring register path configured by connecting a plurality of ring registers having a transfer function in a ring, a plurality of arithmetic devices each connected to at least one of the ring registers, and a plurality of storage devices connected to the arithmetic devices.
  • Non-Patent Document 1 discloses a method of performing matrix operation using an FPGA.
  • Patent Document 2 discloses a parallel computer that processes matrix products at high speed.
  • the computer of Patent Document 2 includes a plurality of processor elements, and a control device that distributes data to each processor element and collects operation results. Further, the computer of Patent Document 2 includes a first communication path connecting the control device and each processor element, and a second communication path connecting the logically adjacent processor elements.
  • matrix products can be calculated using processing elements arranged in one dimension.
  • An object of the present invention is to provide a data processing apparatus capable of continuously executing data transfer and arithmetic processing to improve the operation rate of a computing unit in order to solve the problems described above.
  • a data processing apparatus includes a first ring bus, a transfer element group including a plurality of transfer elements connected in series by the first ring bus, and transfer control means connected to at least two of the transfer elements via the first ring bus and to an external memory.
  • the data processing apparatus further includes a second ring bus independent of the first ring bus, a processing element group including a plurality of processing elements connected in series by the second ring bus, and overall control means connected to at least two of the processing elements via the second ring bus.
  • the data processing apparatus further includes an internal memory group including a plurality of internal memories, each connected to a corresponding transfer element and a corresponding processing element.
  • according to the present invention, it is possible to provide a data processing apparatus capable of continuously executing data transfer and arithmetic processing to improve the operation rate of a computing unit.
  • a data processing apparatus according to a first embodiment of the present invention will be described with reference to the drawings.
  • the data processing device of the present embodiment is mounted on an FPGA (Field-Programmable Gate Array)
  • the data processing apparatus of the present embodiment may be realized as a dedicated circuit (ASIC: Application Specific Integrated Circuit).
  • FIG. 1 is a block diagram showing the configuration of the data processing apparatus 1 of the present embodiment.
  • the data processing apparatus 1 includes a transfer element group 12, an internal memory group 13, a processing element group 14, a transfer control unit 15, an overall control unit 16, a first ring bus 17, and a second ring bus 18.
  • the transfer element group 12 includes a plurality of transfer elements 20 (transfer elements 20-1 to 20-n) connected in series by the first ring bus 17 (n is a natural number).
  • the transfer elements 20 constituting the transfer element group 12 are connected to the adjacent transfer elements 20 via the first ring bus 17. Further, the input of the transfer element 20-1 and the output of the transfer element 20-n are connected to the transfer control unit 15 via the first ring bus 17.
  • Each of the plurality of transfer elements 20 writes the data included in the transfer data into the internal memory 30 corresponding to itself according to the analysis result of the transfer data transferred by the first ring bus 17. Also, the transfer element 20 transmits transfer data to the adjacent transfer element 20 via the first ring bus 17. Also, each of the plurality of transfer elements 20 reads output data from the internal memory 30 corresponding to itself. The transfer element 20 transmits the read output data to the transfer control unit 15 through the first ring bus 17.
  • the processing element group 14 includes a plurality of processing elements 40 (processing elements 40-1 to 40-n) connected in series by the second annular bus 18 (n is a natural number).
  • the processing elements 40 constituting the processing element group 14 are connected to the adjacent processing elements 40 by the second annular bus 18.
  • the input of the processing element 40-1 and the output of the processing element 40-n are connected to the overall control unit 16 via the second annular bus 18.
  • Each of the plurality of processing elements 40 reads data from the internal memory 30 corresponding to itself in accordance with the operation instruction received from the overall control unit 16 via the second ring bus 18.
  • the processing element 40 writes the operation result of the operation using the read data into the internal memory 30 as output data.
  • the internal memory group 13 includes a plurality of internal memories 30 (internal memories 30-1 to 30-n) (n is a natural number).
  • An internal memory 30 constituting the internal memory group 13 is connected between the corresponding transfer element 20 and the corresponding processing element 40. That is, the internal memories 30-1 to 30-n are connected to the transfer elements 20-1 to 20-n and the processing elements 40-1 to 40-n, respectively.
  • the transfer control unit 15 (also referred to as transfer control means) is connected to the transfer element group 12 via the first ring bus 17. That is, the transfer control unit 15 is connected to at least two transfer elements 20 constituting the transfer element group 12 via the first ring bus 17. Since the transfer elements 20 adjacent to each other are connected via the first ring bus 17, the transfer control unit 15 is connected via the first ring bus 17 to the input of the transfer element 20-1 and the output of the transfer element 20-n.
  • the transfer control unit 15 is also connected to the external memory 100.
  • the transfer control unit 15 receives data to be processed from the external memory 100.
  • the transfer control unit 15 transmits the input data to the transfer element group 12 through the first ring bus 17.
  • the transfer control unit 15 also writes the output data received from the internal memory group 13 via the first ring bus 17 to the external memory 100.
  • the overall control unit 16 (also referred to as overall control means) is connected to the processing element group 14 via the second annular bus 18. That is, the overall control unit 16 is connected to at least two processing elements 40 via the second annular bus 18. The overall control unit 16 transmits an operation instruction to the processing element group 14 through the second ring bus 18. The transfer control unit 15 and the overall control unit 16 are connected to each other.
  • the first annular bus 17 is a one-dimensional annular bus.
  • the first ring bus 17 connects a plurality of transfer elements 20 included in the transfer element group 12 in series. Further, the first ring bus 17 is connected to the transfer control unit 15.
  • the second annular bus 18 is a one-dimensional annular bus independent of the first annular bus 17.
  • the second annular bus 18 connects a plurality of processing elements 40 included in the processing element group 14 in series.
  • the second annular bus 18 is connected to the overall control unit 16.
  • FIG. 2 is a block diagram showing the configuration of the transfer elements 20-1 to 20-n included in the transfer element group 12. Hereinafter, the transfer element 20-1, the transfer element 20-2, ..., and the transfer element 20-n will be referred to as the transfer element 20 when no distinction is needed. Although FIG. 2 shows adjacent transfer elements 20 connected to each other, the input of the transfer element 20-1 and the output of the transfer element 20-n are connected to the transfer control unit 15.
  • the transfer element 20 is connected to the first annular bus 17.
  • the transfer element 20 includes an annular bus register 21 forming a part of the first annular bus 17 and a memory interface unit 22.
  • the ring bus register 21 includes a first register unit 211, a second register unit 212, and a third register unit 213.
  • the ring bus register 21 (also referred to as a first ring bus register) analyzes transfer data transferred from the transfer element 20 in the previous stage through the first ring bus 17.
  • the ring bus register 21 issues an access instruction to the internal memory 30 to the memory interface unit 22 according to the analysis result of the transfer data.
  • the ring bus register 21 transfers the transfer data to the transfer element 20 of the next stage as it is.
  • the ring bus register 21 transfers the transfer data updated using the data read from the internal memory 30 to the transfer element 20 of the next stage.
  • FIG. 3 is a conceptual view showing a configuration example (transfer data 170) of transfer data flowing on the first ring bus 17.
  • the transfer data 170 includes a command field cmd, an identification field peid, an address field addr, and a data field data.
  • the command field cmd represents the type of data transfer (such as reading from an external memory or writing to an external memory).
  • the address field addr indicates which address in the internal memory 30 is to be accessed.
  • the data field data holds data to be read from or written to the internal memory 30.
  • FIG. 4 is a table summarizing an example of transfer data flowing on the first ring bus 17.
  • FIG. 4 shows an example of transfer data when eight 32-bit data are read from the external memory 100 and sequentially stored at address 0 of the internal memories 30-1 to 30-8.
  • when the command field cmd is 0x1, it indicates that data from the external memory 100 is written to the internal memory 30.
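  • The transfer data format and the FIG. 4 example can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the field names cmd, peid, addr, and data and the command value 0x1 come from the text, while the class name, the helper `scatter_words`, the numbering of identifiers from 1, and the sample payload values are assumptions made for the example.

```python
from dataclasses import dataclass

CMD_WRITE_INTERNAL = 0x1  # from the text: external memory -> internal memory


@dataclass(frozen=True)
class TransferData:
    cmd: int   # type of data transfer
    peid: int  # identifier of the target transfer element
    addr: int  # address inside the target internal memory
    data: int  # 32-bit payload


def scatter_words(words, addr=0, first_peid=1):
    """Build one write packet per word, each aimed at the next element (hypothetical helper)."""
    return [
        TransferData(CMD_WRITE_INTERNAL, first_peid + i, addr, w & 0xFFFFFFFF)
        for i, w in enumerate(words)
    ]


# Eight 32-bit words, one for address 0 of each of eight internal memories.
packets = scatter_words([0x10 * (i + 1) for i in range(8)])
```

  • Each packet carries its own target identifier, so the packets can be sent onto the first ring bus back to back and every transfer element picks out only the one addressed to it.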
  • the first register unit 211 (also referred to as a first register) analyzes the transfer data transferred from the transfer element 20 in the previous stage.
  • the first register unit 211 issues an access instruction to the internal memory 30 to the memory interface unit 22 according to the analysis result of the transfer data.
  • when the identification field peid of the transfer data received from the transfer element 20 at the previous stage matches the identifier of the first register unit 211, the first register unit 211 determines that the command is addressed to itself.
  • when the command field cmd is a write command to the internal memory 30, the first register unit 211 sends the value of the data field data, the address of the address field addr, and a write instruction to the memory interface unit 22.
  • when the command field cmd is a read command from the internal memory 30, the first register unit 211 sends the address of the address field addr and a read instruction to the memory interface unit 22.
  • the memory interface unit 22 accesses the internal memory 30 in accordance with the instruction received from the first register unit 211.
  • when the memory interface unit 22 receives a write instruction from the first register unit 211, it writes data in the internal memory 30 according to the received write instruction.
  • when the memory interface unit 22 receives a read instruction from the first register unit 211, it reads data from the internal memory 30 according to the received read instruction. Then, the memory interface unit 22 sends the data read from the internal memory 30 to the third register unit 213.
  • the second register unit 212 (also referred to as a second register) is a buffer that is set in accordance with the access latency of the internal memory 30.
  • the second register unit 212 transfers the transfer data transferred from the first register unit 211 to the third register unit 213.
  • the second register unit 212 may be configured as a plurality of stages of shift registers in accordance with the access latency of the internal memory 30.
  • the third register unit 213 (also referred to as a third register) transfers the transfer data transferred from the second register unit 212 to the transfer element 20 of the next stage.
  • the third register unit 213 sends the transfer data that has arrived via the second register unit 212 to the transfer element 20 of the next stage as it is.
  • the third register unit 213 replaces the data field data included in the transfer data that has arrived via the second register unit 212 with the data read from the internal memory 30, and sends the updated transfer data to the transfer element 20 of the next stage.
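  • The interplay of the three register units can be modeled with a small cycle-level sketch. It is an illustration under stated assumptions: the read command value 0x2, the class and method names, and the use of a Python dict as the internal memory are hypothetical, and the memory read is performed when the packet enters rather than after the real access latency. Only the behavior (match on peid, write or read, delay, replace the data field on a read) follows the text.

```python
from collections import deque
from dataclasses import dataclass, replace

CMD_WRITE = 0x1  # from the text
CMD_READ = 0x2   # hypothetical value, not given in the text


@dataclass(frozen=True)
class TransferData:
    cmd: int
    peid: int
    addr: int
    data: int


class TransferElement:
    def __init__(self, peid, memory, latency=1):
        self.peid = peid
        self.memory = memory  # dict standing in for the internal memory 30
        # Second register: one slot per cycle of memory access latency.
        self.delay = deque([None] * latency)

    def step(self, incoming):
        """Advance one cycle and return the packet leaving the third register."""
        read_value = None
        if incoming is not None and incoming.peid == self.peid:
            # First register: the packet is addressed to this element.
            if incoming.cmd == CMD_WRITE:
                self.memory[incoming.addr] = incoming.data
            elif incoming.cmd == CMD_READ:
                read_value = self.memory.get(incoming.addr, 0)
        self.delay.append((incoming, read_value))
        slot = self.delay.popleft()
        if slot is None:
            return None
        packet, value = slot
        if packet is not None and value is not None:
            # Third register: replace the data field with the value read.
            packet = replace(packet, data=value)
        return packet


mem = {}
te = TransferElement(peid=3, memory=mem, latency=1)
out1 = te.step(TransferData(CMD_WRITE, 3, 0, 42))  # write for this element
out2 = te.step(TransferData(CMD_READ, 3, 0, 0))    # read for this element
out3 = te.step(TransferData(CMD_WRITE, 5, 0, 7))   # addressed to another element
out4 = te.step(None)
```

  • Packets for other elements pass through unchanged, which is what lets a single stream of transfer data serve every internal memory on the ring.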
  • FIG. 5 is a block diagram showing the configuration of the internal memory 30. The arrows between the blocks shown in FIG. 5 conceptually indicate the flow of the write instruction, the address, the read data, and the write data, and do not limit their directions.
  • the internal memory 30 includes a dual port memory 31.
  • the dual port memory 31 includes two access ports: a port A 311 (hereinafter referred to as port A) and a port B 312 (hereinafter referred to as port B).
  • a signal line from the transfer element 20 is connected to the port A (also referred to as a first port).
  • a signal line from the processing element 40 is connected to the port B (also referred to as a second port).
  • These signal lines are wires for transmitting addresses for writing and reading, writing instructions, writing data, reading data, and the like.
  • FIG. 6 is a block diagram showing the configuration of the processing element 40. Hereinafter, the processing element 40-1, the processing element 40-2, ..., and the processing element 40-n will be referred to as the processing element 40 when no distinction is needed. Although FIG. 6 shows adjacent processing elements 40 connected to each other, the input of the processing element 40-1 and the output of the processing element 40-n are connected to the overall control unit 16.
  • the processing element 40 includes a ring bus register 41, an instruction decoder 42, a memory interface unit 43, and an arithmetic unit 44.
  • the ring bus register 41 (also referred to as a second ring bus register) is connected to the second ring bus 18 and forms part of the second ring bus 18.
  • the ring bus register 41 is connected to the instruction decoder 42.
  • the ring bus register 41 may be a single register or a shift register composed of a plurality of stages.
  • the ring bus register 41 receives an operation instruction from the preceding processing element 40 connected to the second ring bus 18 and sends the received operation instruction to the processing element 40 of the next stage. Among the received operation instructions, the ring bus register 41 sends the operation instruction to be processed by itself to the instruction decoder 42.
  • the instruction decoder 42 is connected to the ring bus register 41. Also, the instruction decoder 42 is connected to the memory interface unit 43 and the arithmetic unit 44. The instruction decoder 42 analyzes the operation instruction received from the ring bus register 41 and generates a control signal according to the operation instruction. The instruction decoder 42 outputs the generated control signal to the memory interface unit 43 and the arithmetic unit 44.
  • the memory interface unit 43 (also referred to as a second memory interface) is connected to the instruction decoder 42 and the arithmetic unit 44. Also, the memory interface unit 43 is connected to the internal memory 30. The memory interface unit 43 reads data from the internal memory 30 in response to a control signal from the instruction decoder 42, and transmits the read data to the arithmetic unit 44. Also, the memory interface unit 43 writes the calculation result of the arithmetic unit 44 in the internal memory 30 as output data.
  • the arithmetic unit 44 is connected to the instruction decoder 42 and the memory interface unit 43. The arithmetic unit 44 executes an operation using data received from the memory interface unit 43 in response to a control signal from the instruction decoder 42. The arithmetic unit 44 transmits the operation result to the memory interface unit 43.
  • the arithmetic unit 44 can be realized by a DSP (Digital Signal Processor) of an FPGA (Field-Programmable Gate Array).
  • a register file may be provided in the arithmetic unit 44 so that operations on the registers in the register file can be performed.
  • FIG. 7 is a conceptual diagram showing a configuration example (operation instruction 420) of an operation instruction.
  • the operation instruction 420 includes fields of an 8-bit opcode opc, a first source operand rs, a second source operand rt, a destination operand rd, and a 32-bit immediate operand imm.
  • FIG. 8 is a table summarizing an example of operation instructions.
  • the table of FIG. 8 shows the value of the opcode opc and the operation corresponding to the opcode opc.
  • the description of instructions other than the opcode opc is omitted.
  • the instruction of the operation code MACI and the operation code MACR will be described later.
  • opc 0x01
  • rs 0x00
  • rt 0x40
  • rd 0x80
  • the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to read data from the addresses 0x00 and 0x40 in the internal memory 30.
  • the instruction decoder 42 outputs a control signal instructing the arithmetic unit 44 to perform an addition operation on the input data supplied from the memory interface unit 43. Then, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to write the output data of the arithmetic unit 44 to the address 0x80 of the internal memory 30.
  • the instruction decoder 42 outputs, to the computing unit 44, a control signal instructing to perform a multiplication operation on the input data supplied from the memory interface unit 43 and the value of the immediate field imm. Then, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to write the output data of the arithmetic unit 44 at the address 0x46 of the internal memory 30.
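  • The addition example above (opc 0x01, rs 0x00, rt 0x40, rd 0x80) can be traced with a short Python sketch. Several details are assumptions rather than statements from the text: that opcode 0x01 denotes addition, that rs, rt, and rd are 8 bits wide each (which makes the instruction 64 bits together with the 8-bit opcode and the 32-bit immediate), the field order in the packed word, and the 32-bit wraparound of the result.

```python
from dataclasses import dataclass

OPC_ADD = 0x01  # assumed meaning of opcode 0x01, inferred from the example


@dataclass(frozen=True)
class OperationInstruction:
    opc: int
    rs: int
    rt: int
    rd: int
    imm: int = 0


def pack(ins):
    """Pack as opc | rs | rt | rd (8 bits each, assumed) | imm (32 bits)."""
    return (
        (ins.opc << 56) | (ins.rs << 48) | (ins.rt << 40) | (ins.rd << 32)
        | (ins.imm & 0xFFFFFFFF)
    )


def unpack(word):
    return OperationInstruction(
        opc=(word >> 56) & 0xFF,
        rs=(word >> 48) & 0xFF,
        rt=(word >> 40) & 0xFF,
        rd=(word >> 32) & 0xFF,
        imm=word & 0xFFFFFFFF,
    )


def execute(ins, memory):
    """Decode and run one instruction against a dict used as internal memory."""
    if ins.opc == OPC_ADD:
        a = memory[ins.rs]  # memory interface reads both source addresses
        b = memory[ins.rt]
        memory[ins.rd] = (a + b) & 0xFFFFFFFF  # result written to address rd
    else:
        raise NotImplementedError(f"opcode {ins.opc:#x} is not modeled")


memory = {0x00: 5, 0x40: 7}
ins = OperationInstruction(OPC_ADD, 0x00, 0x40, 0x80)
word = pack(ins)
execute(unpack(word), memory)
```

  • After the call, address 0x80 of the modeled memory holds the sum of the values at 0x00 and 0x40, matching the sequence of control signals described for the instruction decoder 42.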
  • FIG. 9 is a block diagram showing the configuration of the overall control unit 16.
  • the overall control unit 16 has a program counter 61, a command memory 62, a command decoder 63, and an overall control unit data path 64.
  • the command decoder 63 is connected to the first processing element 40-1.
  • the overall control unit data path 64 is connected to the last processing element 40-n.
  • the overall control unit 16 operates in the same manner as a general instruction set processor.
  • the program counter 61 stores a value indicating a command to be executed next. If the content of the command is other than a branch instruction, the program counter 61 is automatically incremented. On the other hand, when the content of the command is a branch instruction, the value of the program counter 61 is changed in accordance with the branch instruction.
  • the command memory 62 stores a command including a flag indicating a subject that executes an instruction.
  • the command memory 62 outputs a command corresponding to the value of the program counter 61 to the command decoder 63.
  • the command decoder 63 analyzes the command output from the command memory 62 and generates a control signal according to the analysis result. When the command decoder 63 interprets the command as an instruction of the overall control unit 16, the command decoder 63 outputs the generated control signal to the overall control unit data path 64. On the other hand, when the command decoder 63 interprets the command as an instruction of the processing element 40, the command decoder 63 outputs the generated control signal to the processing element 40-1 of the first stage included in the processing element group 14.
  • the overall control unit data path 64 (also referred to as an overall control data path) performs an operation according to the content of the command in accordance with the control signal generated by the command decoder 63.
  • the overall control unit data path 64 performs operations such as addition and branching.
  • the overall control unit data path 64 may include elements included in a general instruction set processor such as a register file. If the content of the command is a branch instruction, the overall control unit data path 64 changes the value of the program counter 61 in accordance with the branch instruction.
  • FIG. 10 is a conceptual diagram showing a configuration example of the command 620 stored in the command memory 62.
  • the command 620 in the example of FIG. 10 includes a 1-bit flag pf and a 64-bit instruction inst. When pf is 0, the command is interpreted as an instruction of the overall control unit 16. On the other hand, when pf is 1, it is interpreted as an instruction of the processing element 40. For a command 620 whose pf is 1, the command decoder 63 shown in FIG. 9 transmits the instruction inst to the first processing element 40-1 on the second ring bus 18.
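  • The routing decision made on the flag pf can be sketched as follows. The 1-bit pf and 64-bit inst widths are from the text; placing pf as the most significant bit of a 65-bit word and the function names are assumptions for illustration only.

```python
INST_BITS = 64
INST_MASK = (1 << INST_BITS) - 1


def make_command(pf, inst):
    """Build a 65-bit command with pf above the 64-bit inst (assumed layout)."""
    return ((pf & 1) << INST_BITS) | (inst & INST_MASK)


def decode_command(command):
    return (command >> INST_BITS) & 1, command & INST_MASK


def route(command_memory):
    """Split commands by pf: 1 goes to the ring bus, 0 to the local data path."""
    to_ring, to_datapath = [], []
    for command in command_memory:
        pf, inst = decode_command(command)
        (to_ring if pf == 1 else to_datapath).append(inst)
    return to_ring, to_datapath


commands = [make_command(0, 0xAA), make_command(1, 0xBB), make_command(1, 0xCC)]
ring, datapath = route(commands)
```

  • Keeping both kinds of instruction in one command memory lets a single program counter sequence the control flow and the stream of operation instructions together.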
  • the operation instruction that has arrived from the last processing element 40-n is stored in a register (not shown) in the overall control unit data path 64.
  • the storage destination of the operation instruction may be a specific register in the register file or may be a dedicated register.
  • alternatively, the overall control unit data path 64 may be provided with a dedicated FIFO (First In First Out) for storing operation instructions, or the register in the register file that serves as the storage destination may be designated separately by a flag or the like in the operation instruction.
  • FIFO First In First Out
  • FIG. 11 is a conceptual diagram showing a configuration example of the instruction 160 of the overall control unit 16.
  • the instruction 160 of the overall control unit 16 includes fields of an opcode opc, a first source operand rs, a second source operand rt, a destination operand rd, and an immediate operand imm.
  • the operation code opc is 8 bits
  • the first source operand rs is 5 bits
  • the second source operand rt is 5 bits
  • the destination operand rd is 5 bits
  • the immediate operand imm is 32 bits.
  • the instruction 160 of the overall control unit 16 of FIG. 11 may be stored left-justified in the 64-bit-wide instruction inst of the command 620.
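  • The field widths listed above total 8 + 5 + 5 + 5 + 32 = 55 bits, so left-justifying the instruction in a 64-bit word leaves the low 9 bits unused. The sketch below illustrates this; the field order (opc first, imm last) and the function names are assumptions, and only the widths and the left-justified placement come from the text.

```python
FIELDS = (("opc", 8), ("rs", 5), ("rt", 5), ("rd", 5), ("imm", 32))
INST_WIDTH = 64


def encode(**values):
    """Concatenate the fields in order, then left-justify in 64 bits."""
    word, used = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
        used += width
    return word << (INST_WIDTH - used)  # 64 - 55 = 9 padding bits


def decode(word):
    fields, shift = {}, INST_WIDTH
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


w = encode(opc=0x02, rs=1, rt=2, rd=3, imm=0xDEADBEEF)
fields = decode(w)
```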
  • FIG. 12 is a table summarizing an example of an instruction of the overall control unit 16.
  • RF[rs] represents the register value of the index specified by rs in the register file.
  • PC represents a program counter value.
  • dmactrl represents an instruction register to the transfer control unit 15.
  • dmastatus represents the status register of the transfer control unit 15.
  • “{RF[rs], RF[rt]}” represents a value obtained by concatenating the two register values RF[rs] and RF[rt].
  • although the bit width of the registers in the register file is assumed to be 32 bits in the example of FIG. 12, the bit width of the registers is not limited to 32 bits.
  • FIG. 13 is a block diagram showing the configuration of the transfer control unit 15. As shown in FIG. 13, the transfer control unit 15 includes an instruction register 51, a status register 52, and a control circuit 53. The instruction register 51 and the status register 52 are connected to the overall control unit 16. The control circuit 53 is connected to the external memory 100. The control circuit 53 is also connected to the first transfer element 20-1 and the last transfer element 20-n.
  • the status register 52 holds a value indicating whether a transfer on the first ring bus 17 is in progress or has completed.
  • the control circuit 53 is connected to the external memory 100. The control circuit 53 receives data to be processed from the external memory 100. The control circuit 53 transmits the input data to the first-stage transfer element 20-1 included in the transfer element group 12 through the first ring bus 17. The control circuit 53 also writes the output data received from the internal memory group 13 via the first ring bus 17 to the external memory 100.
  • the control circuit 53 starts a transfer if the instruction register 51 contains a valid transfer instruction. In addition, the control circuit 53 updates the status register 52 as needed with a value indicating whether the transfer is in progress or completed, and thereby notifies the overall control unit 16. That is, when the instruction register 51 contains a valid transfer instruction, the control circuit 53 transfers data between the external memory 100 and the transfer element group 12 and updates the value of the status register 52.
  • the control circuit 53 writes a value in the instruction register 51 according to the ivkdma instruction of the overall control unit 16. Further, the control circuit 53 reads the value of the status register 52 by the chkdma instruction of the overall control unit 16.
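  • The handshake between the overall control unit and the transfer control unit can be sketched as a register pair plus a polling loop. The names ivkdma and chkdma are from the text; the status encoding, the contents of the transfer instruction (source offset, word count, destination address), and the one-word-per-step timing are assumptions chosen only to make the example runnable.

```python
class TransferControl:
    IDLE, BUSY, DONE = 0, 1, 2  # hypothetical status encoding

    def __init__(self, external_memory, internal_memories):
        self.external_memory = external_memory
        self.internal_memories = internal_memories
        self.instruction = None  # instruction register 51 (None = nothing valid)
        self.status = self.IDLE  # status register 52
        self._next = 0

    def ivkdma(self, src, count, addr):
        """Write a transfer instruction (fields are hypothetical)."""
        self.instruction = (src, count, addr)
        self.status = self.BUSY
        self._next = 0

    def chkdma(self):
        """Read the status register."""
        return self.status

    def step(self):
        """Control circuit: move one word per step while an instruction is valid."""
        if self.instruction is None:
            return
        src, count, addr = self.instruction
        if self._next < count:
            word = self.external_memory[src + self._next]
            self.internal_memories[self._next][addr] = word
            self._next += 1
        if self._next >= count:
            self.instruction = None
            self.status = self.DONE


ext = list(range(100, 108))
mems = [dict() for _ in range(8)]
dma = TransferControl(ext, mems)
dma.ivkdma(src=0, count=8, addr=0)
cycles = 0
while dma.chkdma() != TransferControl.DONE:
    dma.step()
    cycles += 1
```

  • Because the transfer proceeds independently once the instruction register is written, the overall control unit is free to keep issuing operation instructions and only needs to poll the status before touching the transferred data.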
  • the above is the description of the components of the data processing device 1.
  • the above configuration of the data processing apparatus 1 is an example, and various configurations may be added or deleted as long as the functions of the data processing apparatus 1 of the present embodiment can be exhibited.
  • FIG. 15 is a flowchart for explaining the operation of the processing element 40.
  • the processing element 40 determines whether or not an operation instruction has come from the processing element 40 of the previous stage (step S11).
  • When the operation instruction has arrived (Yes in step S11), the processing element 40 receives the operation instruction (step S12). On the other hand, when the operation instruction has not arrived (No in step S11), the processing element 40 waits for the arrival of the operation instruction (return to step S11).
  • the processing element 40 performs an operation according to the received arithmetic instruction (step S13). For example, the processing element 40 performs the following operations 1 to 4 according to the received operation instruction. (1) Read values from the addresses shown in rs and rt in the internal memory 30. (2) Perform an operation on the read value. (3) Write the operation result to the address indicated by rd in the internal memory 30. (4) Rewrite imm in the operation instruction with the operation result.
  • the processing element 40 sends the updated operation instruction to the processing element 40 of the next stage (step S14).
  • If the transfer is to continue (Yes in step S15), the process returns to step S11. When the transfer is completed (No in step S15), the process according to the flowchart of FIG. 15 ends.
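  • Steps S11 to S15 can be sketched as a loop over an input queue. This is a simplified model under assumptions: the function and class names are hypothetical, the arithmetic is passed in as a Python callable instead of being selected by the opcode, and the loop ends when the queue is empty instead of waiting for further instructions as the hardware would.

```python
import operator
from collections import deque
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OperationInstruction:
    opc: int
    rs: int
    rt: int
    rd: int
    imm: int = 0


def run_processing_element(inbox, memory, operation):
    """Process queued instructions for one element and return those sent onward."""
    outbox = []
    while inbox:                                             # S11 / S15
        ins = inbox.popleft()                                # S12: receive
        result = operation(memory[ins.rs], memory[ins.rt])   # S13 (1), (2)
        memory[ins.rd] = result                              # S13 (3)
        outbox.append(replace(ins, imm=result))              # S13 (4), S14
    return outbox


# Two elements in series: the output of the first feeds the second.
memories = [{0: 1, 1: 2}, {0: 10, 1: 20}]
stream = deque([OperationInstruction(0x01, 0, 1, 2)])
for memory in memories:
    stream = deque(run_processing_element(stream, memory, operator.add))
```

  • The same instruction visits every element in turn, so each element applies it to its own internal memory while the imm field carries the most recent result down the ring.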
  • FIG. 16 is a calculation example of the matrix product by the data processing device 1.
  • The six elements (A00 to A21) of the matrix A are stored in the register file (RF[0] to RF[5]) of the overall control unit 16.
  • Although FIG. 17 illustrates an example in which the elements of the matrix A are stored in the register file, the present invention is not limited to this.
  • A memory such as a scratch pad memory may be provided in the overall control unit 16, and the elements of the matrix A may be stored in the scratch pad memory.
  • The elements of the matrix B and the matrix C are stored in the internal memory 30.
  • B17 represents an element of row 1 column 7 of the matrix B.
  • The transfer control unit 15 loads the matrix B, which is the input data, in advance into addresses 0 to 4 of the internal memory 30.
  • The matrix C, which is the output data, is stored in the area at addresses 400 to 408, which is initialized to zero.
  • the MAC Immediate instruction shown in FIG. 8 is used to calculate the matrix product.
  • In the MAC Immediate instruction, the value of the rt-th register of the register file of the overall control unit 16 is set in imm.
  • The operation instruction is then sent to the second ring bus 18.
  • In the processing element 40, imm is multiplied by the value at address rs of the internal memory 30.
  • The product is accumulated at address rd of the internal memory 30.
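The MAC Immediate behaviour described above can be illustrated with a short sketch. The function names `issue_maci` and `execute_maci`, the dictionary representation of the instruction, and the sample values are hypothetical. The text does not give the instruction encoding or the operand order of the mnemonic, so keyword arguments are used here.

```python
def issue_maci(register_file, rt, rs, rd):
    """Overall control unit side: copy RF[rt] into imm and build the instruction."""
    return {"op": "MACI", "rs": rs, "rd": rd, "imm": register_file[rt]}


def execute_maci(instr, memory):
    """Processing element side: memory[rd] += imm * memory[rs]."""
    memory[instr["rd"]] += instr["imm"] * memory[instr["rs"]]
    return instr  # imm is not rewritten, so the same instruction is passed on


register_file = [2, 7]          # RF[0] plays the role of A00 (sample value)
memory = {0x0: 10, 0x400: 0}    # address 0 plays the role of B00, 0x400 of C00
instr = issue_maci(register_file, rt=0, rs=0x0, rd=0x400)
execute_maci(instr, memory)
```

Because the instruction is passed on with the same imm, every processing element on the ring multiplies the same element of A by its own local element of B.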
  • FIG. 18 shows an example of an assembly program of matrix products (assembly program 171).
  • MACI is a mnemonic that represents a MAC Immediate instruction.
  • FIGS. 19 to 23 show the values of the internal memory 30 in each cycle of the matrix multiplication, the operation instructions (instructions 1 to 6 in FIG. 18) flowing through the second ring bus 18, and the values set in the imm field of the operation instructions.
  • the first processing element 40-1 of the second annular bus 18 receives an instruction 1.
  • Data A00 is set in the imm field of instruction 1.
  • In accordance with instruction 1, "MACI 0, 0, 0x400", the processing element 40-1 multiplies the value at address 0 (B00) of its own internal memory 30-1 by A00 and accumulates the product, which is the operation result, at address 0x400.
  • “A00 * B00” as the operation result is stored at address 400 of the internal memory 30-1.
  • “A00 * B00” indicates the product of "A00" and "B00".
  • the processing element 40-1 receives the instruction 2 and the processing element 40-2 receives the instruction 1.
  • the processing element 40-1 performs a product-sum operation of multiplying B10 and A01 and accumulating the product (A01 * B10) which is the operation result at address 400.
  • the processing element 40-2 performs multiplication of B01 and A00, and accumulates the product (A00 * B01), which is the operation result, at address 400.
  • "A00 * B00 + A01 * B10" is stored at address 400 of the internal memory 30-1, and "A00 * B01" is stored at address 400 of the internal memory 30-2.
  • "A00 * B00 + A01 * B10" indicates the sum of "A00 * B00" and "A01 * B10".
  • the processing element 40-1 receives the instruction 3
  • the processing element 40-2 receives the instruction 2
  • the processing element 40-3 receives the instruction 1.
  • the processing element 40-1 performs multiplication of B00 and A10, and accumulates the product (A10 * B00), which is the operation result, at address 404.
  • the processing element 40-2 performs multiplication of B11 and A01 and executes a product-sum operation of accumulating the product (A01 * B11) which is the operation result at address 400.
  • the processing element 40-3 performs multiplication of B02 and A00, and accumulates the product (A00 * B02), which is the operation result, at address 400.
  • "A10 * B00" is stored at address 404 of internal memory 30-1,
  • "A00 * B01 + A01 * B11" is stored at address 400 of internal memory 30-2, and
  • "A00 * B02" is stored at address 400 of the internal memory 30-3.
  • processing element 40-1 receives instruction 4
  • processing element 40-2 receives instruction 3
  • processing element 40-3 receives instruction 2
  • processing element 40-4 receives instruction 1.
  • the processing elements 40-1 to 4 execute the operation in the same manner as in FIGS. 19 to 21, and store the operation result in the designated address of the internal memory 30-1 to 4.
  • FIG. 23 shows the state of the internal memories 30-1 to 8 in cycle 14 (cyc14) when the matrix product calculation is completed. At respective addresses of the internal memories 30-1 to 8, the operation result according to the operation instruction is stored.
  • the data processing device 1 calculates the matrix product.
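The cycle-by-cycle walk-through above can be reproduced with a small software simulation. This is only a sketch under stated assumptions: the function `ring_matmul` and the sample values are hypothetical, and the memory layout is assumed (processing element j holds column j of B at addresses 0x0 and 0x4 and accumulates column j of C at 0x400, 0x404, and 0x408). Each instruction reaches the next processing element one cycle later, as in FIGS. 19 to 23.

```python
def ring_matmul(A, B, b_stride=0x4, c_base=0x400, c_stride=0x4):
    """Simulate MACI instructions flowing through a chain of processing elements."""
    rows, inner, n_pe = len(A), len(B), len(B[0])
    # One internal memory per processing element: column j of B plus zeroed column j of C.
    mems = []
    for j in range(n_pe):
        mem = {k * b_stride: B[k][j] for k in range(inner)}
        mem.update({c_base + i * c_stride: 0 for i in range(rows)})
        mems.append(mem)
    # Instruction stream in the order used in the text: A00, A01, A10, A11, ...
    program = [
        {"imm": A[i][k], "rs": k * b_stride, "rd": c_base + i * c_stride}
        for i in range(rows)
        for k in range(inner)
    ]
    # In a given cycle, processing element j executes the instruction issued j cycles earlier.
    for cycle in range(len(program) + n_pe - 1):
        for j, mem in enumerate(mems):
            idx = cycle - j
            if 0 <= idx < len(program):
                instr = program[idx]
                mem[instr["rd"]] += instr["imm"] * mem[instr["rs"]]
    return [[mems[j][c_base + i * c_stride] for j in range(n_pe)] for i in range(rows)]


A = [[1, 2], [3, 4], [5, 6]]                                       # 3 x 2, like A00 to A21
B = [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16]]    # 2 x 8
C = ring_matmul(A, B)
```

In this model, column j of the product ends up entirely in the internal memory of processing element j, so no data moves between processing elements other than the instruction itself.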
  • The processing element 40 performs an operation and stores the calculation result in the internal memory 30. Specifically, the processing element 40 stores, in its own internal memory 30, the calculation result computed using the immediate data received through the second ring bus 18 and the data stored in its own internal memory 30.
  • FIG. 24 is an example of calculation of the inner product of vectors by the data processing device 1.
  • the data processing device 1 obtains an inner product d of a 1-row-8-column matrix A and a 1-row-8-column matrix B.
  • the elements of matrix A and matrix B are stored in the internal memory 30.
  • the MAC reduction instruction of FIG. 8 is used to calculate the inner product.
  • The MAC reduction instruction causes each processing element 40 to multiply the values at addresses rs and rt of the internal memory 30, accumulate the product into the value of imm in the operation instruction, and transfer the instruction to the next processing element 40. That is, with the MAC reduction instruction, each time the operation instruction passes through a processing element 40, the operation result is accumulated into the value of the imm field of the operation instruction.
  • FIG. 26 shows an example of an inner product assembly program (assembly program 172).
  • MACR is a mnemonic that represents a MAC reduction instruction.
  • In cycle 1 shown in FIG. 27, the value of the imm field of instruction 1 that has arrived at the processing element 40-1 is zero.
  • The processing element 40-1 computes A00 * B00 and adds it to the imm field.
  • The processing element 40-1 transfers the operation result to the processing element 40-2 of the next stage.
  • As a result, "A00 * B00" is stored in the imm field of the instruction that arrives at the processing element 40-2.
  • The processing element 40-2 computes A01 * B01 and adds it to the imm field.
  • The processing element 40-2 transfers the operation result to the processing element 40-3 of the next stage.
  • As a result, "A00 * B00 + A01 * B01" is stored in the imm field of the instruction that arrives at the processing element 40-3.
  • The processing element 40-3 computes A02 * B02 and adds it to the imm field.
  • The processing element 40-3 transfers the operation result to the processing element 40-4 of the next stage.
  • As a result, "A00 * B00 + A01 * B01 + A02 * B02" is stored in the imm field of the instruction that arrives at the processing element 40-4.
  • In cycle 4 shown in FIG. 30, the processing element 40-4 computes A03 * B03 and adds it to the imm field.
  • The processing element 40-4 transfers the operation result to the processing element 40-5 of the next stage.
  • The description of cycles 5 to 7 is omitted.
  • The last processing element, the processing element 40-8, adds the calculation result of A07 * B07 to the value "A00 * B00 + A01 * B01 + ... + A06 * B06" of the imm field.
  • the operation instruction that the last processing element 40-8 outputs to the second ring bus 18 is stored in a register or the like in the overall control unit data path 64 in the overall control unit 16.
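The reduction described in cycles 1 to 8 can be sketched as a single instruction whose imm field grows as it visits each processing element. The function `ring_inner_product`, the addresses 0x0 and 0x4 for the local elements of A and B, and the sample vectors are assumptions made for this illustration.

```python
def ring_inner_product(a, b, a_addr=0x0, b_addr=0x4):
    """Pass one MACR instruction through every processing element in ring order."""
    # Processing element i holds a[i] and b[i] in its own internal memory (assumed layout).
    mems = [{a_addr: x, b_addr: y} for x, y in zip(a, b)]
    instr = {"op": "MACR", "rs": a_addr, "rt": b_addr, "imm": 0}  # imm starts at zero
    for mem in mems:
        # Each element multiplies its local operands and accumulates into imm,
        # then forwards the instruction to the next element.
        instr["imm"] += mem[instr["rs"]] * mem[instr["rt"]]
    return instr["imm"]  # value returned to the overall control unit


d = ring_inner_product([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1])
```

Unlike the MACI case, the result travels with the instruction, so the inner product is available only after the instruction has passed the last processing element.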
  • When the overall control unit 16 outputs, to the second ring bus 18, an operation instruction including a field for storing immediate data, the processing element 40 performs the operation using the immediate data. That is, the processing element 40 rewrites the immediate data with the calculation result computed using the immediate data received through the second ring bus 18 and the data stored in its own internal memory 30. The processing element 40 then outputs the rewritten immediate data to the second ring bus 18 as output data.
  • The second ring bus 18 and the first ring bus 17 can operate independently. Therefore, as shown in FIG. 32, processing and transfer can be performed in parallel. That is, while performing a given matrix multiplication, the data processing device 1 can simultaneously transfer the matrix A and the matrix B for the next matrix multiplication and transfer the matrix C output by the previous matrix multiplication.
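The overlap of transfer and processing can be pictured as a three-stage software pipeline. The sketch below is an assumption-laden illustration rather than the timing of FIG. 32: the function `overlapped_schedule` and the idea of equal-length time slots are hypothetical. In slot t, the inputs for multiplication t are loaded, multiplication t-1 is computed, and the output of multiplication t-2 is stored.

```python
def overlapped_schedule(num_stages):
    """List, per time slot, which matrix multiplication each activity works on."""
    def valid(k):
        return k if 0 <= k < num_stages else None

    slots = []
    for t in range(num_stages + 2):
        slots.append({
            "load": valid(t),         # first ring bus: transfer A and B for the next stage
            "compute": valid(t - 1),  # second ring bus: MACI instructions for the current stage
            "store": valid(t - 2),    # first ring bus: transfer C of the previous stage
        })
    return slots


schedule = overlapped_schedule(3)
```

In the steady-state slots all three activities are busy, which is the situation in which the computing units are never left waiting for a transfer.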
  • the data to be processed by the data processing apparatus of the present embodiment is not limited to a matrix, and may be data of another form.
  • the data processing apparatus according to the present embodiment may process vector data.
  • the inside of the overall control unit and the processing element included in the data processing apparatus of the present embodiment may be realized as a pipeline processor.
  • Implementing the inside of the overall control unit and the processing elements as a pipeline processor can increase the throughput of operations.
  • Since the internal memory must sometimes be accessed at several addresses simultaneously (for example, the MACI instruction must access rs and rd at the same time), the internal memory may be configured as a plurality of banks to allow simultaneous access.
  • Although the case where the number of processing elements is eight has been described as an example in the present embodiment, the number of processing elements is not limited to eight. For example, the configuration of the present embodiment is applicable even if the number of processing elements is 256 or 512. The number of processing elements may also be less than eight or more than 512.
  • the first effect of maintaining the operating rate of the computing unit can be obtained.
  • By delivering the operation content of the processing element as an operation instruction through the ring bus, it is possible to improve the operating frequency by eliminating long wiring other than the signal lines for the clock signal and the reset signal.
  • the second effect is obtained. That is, according to the present embodiment, the matrix product and the inner product of the vectors can be efficiently calculated by making the bus for data transfer and the bus for data processing independent.
  • The data processing device of the present embodiment is suited to flexible and efficient execution, on a field-programmable gate array (FPGA), of applications such as big-data analysis that perform matrix operations such as large-scale matrix products and inner products.
  • the data processing device of the present embodiment can be realized not only on the FPGA but also as a dedicated circuit (ASIC: Application Specific Integrated Circuit).

Abstract

In order to continuously perform data transfer and arithmetic processing, and to improve the operating rate of an arithmetic unit, this data processing device comprises: a first annular bus; a transfer element group that includes a plurality of transfer elements connected in series by the first annular bus; a transfer control means that is connected to at least two of the transfer elements via the first annular bus, and is connected to an external memory; a second annular bus that is independent from the first annular bus; a processing element group that includes a plurality of processing elements connected in series by the second annular bus; an overall control means that is connected to at least two of the processing elements via the second annular bus; and an internal memory group that includes a plurality of internal memories connected to corresponding transfer elements and processing elements.

Description

Data processing device
The present invention relates to a data processing device that performs data transfer and arithmetic processing.
In big-data analysis and similar processing, calculations are sometimes repeated on data consisting of several million entries, each with several hundred dimensions. For example, operations such as matrix products, vector-matrix products, and element-wise vector products may occur on a matrix of several million by several hundred dimensions. For such operations, methods using a CPU (Central Processing Unit) and methods using GPGPU (General-Purpose computing on Graphics Processing Units) have been studied. However, CPUs and GPGPUs have the problem that power consumption increases as performance improves.
To reduce power consumption, methods using an FPGA (Field Programmable Gate Array), which is a power-efficient device, have attracted attention. An FPGA includes general-purpose logic elements called LUTs (Look Up Tables) and a reconfigurable wiring network connecting the LUTs. In an FPGA, various arithmetic devices can be realized by rewriting the contents of the LUTs and the wiring network.
Some FPGAs also include dedicated resources in addition to LUTs, such as arithmetic units (DSP: Digital Signal Processor) and memories (SRAM: Static Random Access Memory). Such FPGAs make it possible to realize efficient arithmetic devices. However, because the physical positions of the DSPs and SRAMs are fixed in such FPGAs, the wiring becomes congested unless the architecture uses the DSPs and SRAMs appropriately, and the wiring becomes longer because it must detour around the congested areas. In addition, when long wires spanning the entire FPGA are formed, the wiring delay increases and the operating frequency of the arithmetic device decreases.
Some FPGAs also allow an SRAM module to be configured as True Dual Port, that is, as a fully independent dual-port RAM (Random Access Memory). In such an FPGA, the SRAM module can be used as a memory having two sets of clock inputs and address inputs. To draw out the full capability of the FPGA, it is preferable to make full use of such functions.
Thus, to draw out performance without increasing power consumption, it is desirable to realize an architecture that takes advantage of the characteristics of the FPGA.
Patent Document 1 discloses a neural network device configured by connecting, in a ring, a plurality of ring registers for performing the product-sum operations of a neural network. The device of Patent Document 1 includes a ring register path formed by connecting a plurality of ring registers having a transfer function in a ring, a plurality of arithmetic devices of which at least one is connected to each of some of the ring registers, and a plurality of storage devices each connected to one of the arithmetic devices.
Non-Patent Document 1 discloses a method of performing matrix operations using an FPGA.
Patent Document 2 discloses a parallel computer that processes matrix products at high speed. The computer of Patent Document 2 includes a plurality of processor elements and a control device that distributes data to each processor element and collects the operation results. The computer of Patent Document 2 also includes a first communication path connecting the control device to each processor element and a second communication path connecting logically adjacent processor elements.
JP-A-5-101031 (Japanese Unexamined Patent Application Publication No. H5-101031)
JP-A-9-62656 (Japanese Unexamined Patent Application Publication No. H9-62656)
According to the device of Patent Document 1, since there are two ring register paths, input data can be set using one ring register path while an operation is performed using the other. However, for processing that combines two or more types of data, the device of Patent Document 1 has to move the data out to an external memory once, or interrupt the operation in order to store the data. The device of Patent Document 1 also has little flexibility and cannot be applied to big-data analysis processing and the like, which repeats various operations. Furthermore, in the device of Patent Document 1, the group of arithmetic devices performs a single kind of operation, and the operation content has to be distributed simultaneously from the control unit that controls the whole device.
According to the method of Non-Patent Document 1, a matrix product can be computed using processing elements arranged in one dimension. However, since the method of Non-Patent Document 1 has only one data transfer path, the arithmetic units have to be stopped while input data is loaded or operation results are extracted.
According to the device of Patent Document 2, the speed of matrix product calculation can be increased without using expensive semiconductor devices as in a vector computer and without using a network that requires complicated and sophisticated mounting technology as in a massively parallel computer. However, the device of Patent Document 2 has the problem that control by the control device becomes more complicated as the number of processor elements increases.
An object of the present invention is to solve the problems described above by providing a data processing device that can continuously execute data transfer and arithmetic processing and improve the operating rate of the arithmetic units.
A data processing device according to one aspect of the present invention includes: a first ring bus; a transfer element group including a plurality of transfer elements connected in series by the first ring bus; transfer control means connected to at least two of the transfer elements via the first ring bus and connected to an external memory; a second ring bus independent of the first ring bus; a processing element group including a plurality of processing elements connected in series by the second ring bus; overall control means connected to at least two of the processing elements via the second ring bus; and an internal memory group including a plurality of internal memories each connected to a corresponding transfer element and processing element.
According to the present invention, it is possible to provide a data processing device that can continuously execute data transfer and arithmetic processing and improve the operating rate of the arithmetic units.
FIG. 1 is a block diagram showing the configuration of the data processing device according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a transfer element included in the data processing device according to the first embodiment of the present invention.
FIG. 3 is a conceptual diagram showing a configuration example of the transfer data transferred on the first ring bus of the data processing device according to the first embodiment of the present invention.
FIG. 4 is a table summarizing examples of the transfer data transferred on the first ring bus of the data processing device according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of an internal memory included in the data processing device according to the first embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of a processing element included in the data processing device according to the first embodiment of the present invention.
FIG. 7 is a conceptual diagram showing a configuration example of an operation instruction handled by the data processing device according to the first embodiment of the present invention.
FIG. 8 is a table summarizing examples of the operation instructions handled by the data processing device according to the first embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the overall control unit included in the data processing device according to the first embodiment of the present invention.
FIG. 10 is a conceptual diagram showing a configuration example of a command stored in the command memory of the overall control unit included in the data processing device according to the first embodiment of the present invention.
FIG. 11 is a conceptual diagram showing a configuration example of an instruction of the overall control unit included in the data processing device according to the first embodiment of the present invention.
FIG. 12 is a table summarizing examples of the instructions of the overall control unit included in the data processing device according to the first embodiment of the present invention.
FIG. 13 is a block diagram showing the configuration of the transfer control unit included in the data processing device according to the first embodiment of the present invention.
FIG. 14 is a conceptual diagram for explaining an example of the data transfer performed by the transfer control unit included in the data processing device according to the first embodiment of the present invention.
FIG. 15 is a flowchart for explaining the operation of a processing element included in the data processing device according to the first embodiment of the present invention.
FIG. 16 is an example of a matrix product formula calculated by the data processing device according to the first embodiment of the present invention.
FIG. 17 is a diagram for explaining an example of how data elements are held when the data processing device according to the first embodiment of the present invention calculates a matrix product.
FIG. 18 is a diagram of an example of the assembly program used when the data processing device according to the first embodiment of the present invention calculates a matrix product.
FIG. 19 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates a matrix product.
FIG. 20 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates a matrix product.
FIG. 21 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates a matrix product.
FIG. 22 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates a matrix product.
FIG. 23 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates a matrix product.
FIG. 24 is an example of a vector inner product formula calculated by the data processing device according to the first embodiment of the present invention.
FIG. 25 is a diagram for explaining an example of how data elements are held when the data processing device according to the first embodiment of the present invention calculates the inner product of vectors.
FIG. 26 is a diagram of an example of the assembly program used when the data processing device according to the first embodiment of the present invention calculates the inner product of vectors.
FIG. 27 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates the inner product of vectors.
FIG. 28 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates the inner product of vectors.
FIG. 29 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates the inner product of vectors.
FIG. 30 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates the inner product of vectors.
FIG. 31 is a diagram for explaining the operation when the data processing device according to the first embodiment of the present invention calculates the inner product of vectors.
FIG. 32 is a time chart for explaining an example in which the data processing device according to the first embodiment of the present invention performs data transfer and data processing in parallel.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. The embodiments described below include limitations that are technically preferable for carrying out the present invention, but they do not limit the scope of the invention. In all the drawings used in the following description, the same reference numerals are given to the same parts unless there is a particular reason not to. In the following embodiments, repeated description of the same configuration and operation may be omitted. The directions of the arrows in the drawings are examples and do not limit the directions of the signals between blocks.
(First Embodiment)
First, a data processing device according to a first embodiment of the present invention will be described with reference to the drawings. The following describes an example in which the data processing device of the present embodiment is implemented on an FPGA (Field-Programmable Gate Array). Note that the data processing device of the present embodiment may instead be realized as a dedicated circuit (ASIC: Application Specific Integrated Circuit).
(Configuration)
FIG. 1 is a block diagram showing the configuration of the data processing device 1 of the present embodiment. The data processing device 1 includes a transfer element group 12, an internal memory group 13, a processing element group 14, a transfer control unit 15, an overall control unit 16, a first ring bus 17, and a second ring bus 18.
The transfer element group 12 includes a plurality of transfer elements 20 (transfer elements 20-1 to 20-n) connected in series by the first ring bus 17 (n is a natural number). Each transfer element 20 in the transfer element group 12 is connected to its adjacent transfer elements 20 via the first ring bus 17. The input of the transfer element 20-1 and the output of the transfer element 20-n are connected to the transfer control unit 15 via the first ring bus 17.
Each of the transfer elements 20 analyzes the transfer data carried on the first ring bus 17 and, depending on the result, writes the data contained in the transfer data into its own internal memory 30. The transfer element 20 also sends the transfer data to the adjacent transfer element 20 via the first ring bus 17. In addition, each of the transfer elements 20 reads output data from its own internal memory 30 and sends the read output data to the transfer control unit 15 through the first ring bus 17.
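The write path of the transfer elements can be sketched as follows. The packet fields `dest`, `addr`, and `data` and the function `circulate` are hypothetical: this passage does not define the transfer data format, so a simple destination match stands in for the analysis of the transfer data.

```python
def circulate(packets, n_elements):
    """Send each packet once around the ring; matching elements write it to their memory."""
    memories = [dict() for _ in range(n_elements)]  # one internal memory per transfer element
    for packet in packets:
        for element_id, memory in enumerate(memories):
            # Analyse the transfer data: write only if it is addressed to this element.
            if packet["dest"] == element_id:
                memory[packet["addr"]] = packet["data"]
            # In either case the packet is passed on to the adjacent element.
    return memories


memories = circulate(
    [
        {"dest": 0, "addr": 0x0, "data": 11},
        {"dest": 2, "addr": 0x4, "data": 22},
    ],
    n_elements=4,
)
```

Reading output data back to the transfer control unit would follow the same ring in the same direction and is omitted from the sketch.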
The processing element group 14 includes a plurality of processing elements 40 (processing elements 40-1 to 40-n, where n is a natural number) connected in series by the second ring bus 18. Each processing element 40 in the processing element group 14 is connected to its adjacent processing elements 40 by the second ring bus 18. The input of the processing element 40-1 and the output of the processing element 40-n are connected to the overall control unit 16 via the second ring bus 18.
Each of the processing elements 40 reads data from its corresponding internal memory 30 in accordance with an operation instruction received from the overall control unit 16 via the second ring bus 18. The processing element 40 then writes the result of the operation performed on the read data into the internal memory 30 as output data.
The internal memory group 13 is composed of a plurality of internal memories 30 (internal memories 30-1 to 30-n, where n is a natural number). Each internal memory 30 is connected between its corresponding transfer element 20 and processing element 40. That is, the internal memories 30-1 to 30-n are connected to the transfer elements 20-1 to 20-n and to the processing elements 40-1 to 40-n, respectively.
The transfer control unit 15 (also referred to as transfer control means) is connected to the transfer element group 12 via the first ring bus 17. That is, the transfer control unit 15 is connected, via the first ring bus 17, to at least two of the transfer elements 20 constituting the transfer element group 12. Because adjacent transfer elements 20 are connected to each other via the first ring bus 17, the transfer control unit 15 is connected, via the first ring bus 17, to the input of the transfer element 20-1 and to the output of the transfer element 20-n.
The transfer control unit 15 is also connected to the external memory 100. The transfer control unit 15 receives data to be processed from the external memory 100 and sends the received data to the transfer element group 12 through the first ring bus 17. The transfer control unit 15 also writes the output data received from the internal memory group 13 via the first ring bus 17 to the external memory 100.
The overall control unit 16 (also referred to as overall control means) is connected to the processing element group 14 via the second ring bus 18. That is, the overall control unit 16 is connected to at least two processing elements 40 via the second ring bus 18. The overall control unit 16 sends operation instructions to the processing element group 14 through the second ring bus 18. The transfer control unit 15 and the overall control unit 16 are connected to each other.
The first ring bus 17 is a one-dimensional ring-shaped bus. The first ring bus 17 connects the plurality of transfer elements 20 included in the transfer element group 12 in series. The first ring bus 17 is also connected to the transfer control unit 15.
The second ring bus 18 is a one-dimensional ring-shaped bus independent of the first ring bus 17. The second ring bus 18 connects the plurality of processing elements 40 included in the processing element group 14 in series. The second ring bus 18 is also connected to the overall control unit 16.
The above is an outline of the configuration of the data processing device 1. The components of the data processing device 1 are described individually below.
[Transfer element]
FIG. 2 is a block diagram showing the configuration of the transfer elements 20-1 to 20-n included in the transfer element group 12. In the following, the transfer element 20-1, the transfer element 20-2, ..., and the transfer element 20-n are referred to as the transfer element 20 when no distinction is needed. Although FIG. 2 is drawn as if adjacent transfer elements 20 were simply connected to each other, the input of the transfer element 20-1 and the output of the transfer element 20-n are connected to the transfer control unit 15.
As shown in FIG. 1, the transfer element 20 is connected to the first ring bus 17. As shown in FIG. 2, the transfer element 20 includes a ring bus register 21, which forms a part of the first ring bus 17, and a memory interface unit 22. The ring bus register 21 includes a first register unit 211, a second register unit 212, and a third register unit 213.
The ring bus register 21 (also referred to as a first ring bus register) analyzes the transfer data transferred from the preceding transfer element 20 through the first ring bus 17. According to the result of the analysis, the ring bus register 21 issues an instruction to the memory interface unit 22 to access the internal memory 30. When it issues a write instruction for the internal memory 30 to the memory interface unit 22, the ring bus register 21 forwards the transfer data unchanged to the next transfer element 20. When it issues a read instruction for the internal memory 30 to the memory interface unit 22, the ring bus register 21 updates the transfer data with the data read from the internal memory 30 and forwards the updated transfer data to the next transfer element 20.
FIG. 3 is a conceptual diagram showing a configuration example (transfer data 170) of the transfer data flowing on the first ring bus 17. The transfer data 170 includes a command field cmd, an identification field peid, an address field addr, and a data field data. The command field cmd indicates the type of data transfer (for example, reading from the external memory or writing to the external memory). The address field addr indicates which address of the internal memory 30 is accessed. The data field data holds the data to be read from or written to the internal memory 30.
FIG. 4 is a table summarizing examples of transfer data flowing on the first ring bus 17. FIG. 4 shows an example of the transfer data used when eight 32-bit data items are read from the external memory 100 and stored, in order, at address 0 of the internal memories 30-1 to 30-8. Here, a command field cmd of 0x1 denotes a write from the external memory 100 to the internal memory 30.
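As a rough illustration of the transfer data format, the following Python sketch builds the eight write packets of the FIG. 4 scenario. It is an assumption-laden sketch, not the layout of FIG. 4 itself: the class name `TransferData`, the helper `make_write_packets`, the payload values, and the convention that peid values start at 1 are all hypothetical. Only the four field names and the meaning of cmd = 0x1 come from the description.

```python
from dataclasses import dataclass

CMD_WRITE_INTERNAL = 0x1  # from the text: write from external memory to internal memory


@dataclass(frozen=True)
class TransferData:
    cmd: int   # type of data transfer
    peid: int  # identifier of the target element (numbering from 1 is an assumption)
    addr: int  # address inside the internal memory
    data: int  # 32-bit payload


def make_write_packets(words, addr=0, first_peid=1):
    """Build one write packet per word, each targeting the next element in turn."""
    return [
        TransferData(CMD_WRITE_INTERNAL, first_peid + i, addr, w & 0xFFFFFFFF)
        for i, w in enumerate(words)
    ]


# Eight 32-bit words (placeholder values) destined for address 0 of eight internal memories.
packets = make_write_packets([0x10 * (i + 1) for i in range(8)])
```

Each packet carries its own destination identifier, so the sender does not need a separate wire to every element; the packets can simply be pushed onto the ring one after another.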
The first register unit 211 (also referred to as a first register) analyzes the transfer data transferred from the preceding transfer element 20. According to the result of the analysis, the first register unit 211 issues an instruction to the memory interface unit 22 to access the internal memory 30. When the identification field peid of the transfer data received from the preceding transfer element 20 matches its own identifier, the first register unit 211 determines that the command is addressed to itself. If the command field cmd is a write command for the internal memory 30, the first register unit 211 sends the value of the data field data, the address in the address field addr, and a write instruction to the memory interface unit 22. If the command field cmd is a read command for the internal memory 30, the first register unit 211 sends the address in the address field addr and a read instruction to the memory interface unit 22.
The memory interface unit 22 (also referred to as a first memory interface) accesses the internal memory 30 in accordance with the instruction received from the first register unit 211. On receiving a write instruction from the first register unit 211, the memory interface unit 22 writes data to the internal memory 30 as instructed. On receiving a read instruction from the first register unit 211, the memory interface unit 22 reads data from the internal memory 30 as instructed and sends the read data to the third register unit 213.
The second register unit 212 (also referred to as a second register) is a buffer sized to match the access latency of the internal memory 30. The second register unit 212 passes the transfer data received from the first register unit 211 on to the third register unit 213. The second register unit 212 may be configured as a multi-stage shift register, with the number of stages chosen to match the access latency of the internal memory 30.
The third register unit 213 (also referred to as a third register) forwards the transfer data received from the second register unit 212 to the next transfer element 20. When data is being written to the internal memory 30, the third register unit 213 sends the transfer data that arrived via the second register unit 212 unchanged to the next transfer element 20. When data is being read from the internal memory 30, the third register unit 213 replaces the data field data of the transfer data that arrived via the second register unit 212 with the data read from the internal memory 30, and then sends the result to the next transfer element 20.
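The behavior of the three register units can be sketched as a small cycle-level model in Python. This is a simplified illustration under stated assumptions, not the circuit of FIG. 2: the read command value `CMD_READ = 0x2`, the dictionary used for packets, the class `TransferElement`, and the helper `run_ring` are hypothetical, and the memory latency is modeled only as the depth of the shift register standing in for the second register unit 212.

```python
CMD_WRITE = 0x1  # write to internal memory (value given in the text)
CMD_READ = 0x2   # read from internal memory (hypothetical value)


class TransferElement:
    def __init__(self, peid, latency=1):
        self.peid = peid
        self.memory = {}  # stands in for internal memory 30
        # Second register unit 212: shift register whose depth matches the latency.
        self.pipeline = [(None, None)] * latency

    def step(self, packet):
        """Advance one clock: accept `packet` (dict or None), return the packet leaving."""
        read_value = None
        if packet is not None and packet["peid"] == self.peid:
            # First register unit 211: the packet is addressed to this element.
            if packet["cmd"] == CMD_WRITE:
                self.memory[packet["addr"]] = packet["data"]
            elif packet["cmd"] == CMD_READ:
                read_value = self.memory.get(packet["addr"], 0)
        self.pipeline.append((packet, read_value))
        out, value = self.pipeline.pop(0)
        if out is not None and value is not None:
            # Third register unit 213: replace the data field with the read data.
            out = dict(out, data=value)
        return out


def run_ring(elements, packets):
    """Feed packets into the first element; collect what leaves the last element."""
    flush = sum(len(e.pipeline) for e in elements)
    outputs = []
    for incoming in list(packets) + [None] * flush:
        for element in elements:
            incoming = element.step(incoming)
        if incoming is not None:
            outputs.append(incoming)
    return outputs


elements = [TransferElement(peid) for peid in (1, 2, 3)]
writes = [{"cmd": CMD_WRITE, "peid": i, "addr": 0, "data": 100 + i} for i in (1, 2, 3)]
write_out = run_ring(elements, writes)
read = {"cmd": CMD_READ, "peid": 2, "addr": 0, "data": 0}
read_out = run_ring(elements, [read])
```

In this model write packets leave the ring unchanged, while the read packet comes back with its data field filled in by the addressed element, which mirrors the two cases described for the third register unit 213.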
[Internal memory]
FIG. 5 is a block diagram showing the configuration of the internal memory 30. The arrows between the blocks in FIG. 5 conceptually indicate the flow of write instructions, addresses, read data, and write data, and do not limit their directions.
The internal memory 30 includes a dual-port memory 31. The dual-port memory 31 has two access ports: a port A 311 (hereinafter, port A) and a port B 312 (hereinafter, port B). Signal lines from the transfer element 20 are connected to port A (also referred to as a first port), and signal lines from the processing element 40 are connected to port B (also referred to as a second port). These signal lines carry the addresses for writing and reading, write instructions, write data, read data, and the like.
[Processing element]
FIG. 6 is a block diagram showing the configuration of the processing element 40. In the following, the processing element 40-1, the processing element 40-2, ..., and the processing element 40-n are referred to as the processing element 40 when no distinction is needed. Although FIG. 6 is drawn as if adjacent processing elements 40 were simply connected to each other, the input of the processing element 40-1 and the output of the processing element 40-n are connected to the overall control unit 16.
As shown in FIG. 6, the processing element 40 includes a ring bus register 41, an instruction decoder 42, a memory interface unit 43, and an arithmetic unit 44.
The ring bus register 41 (also referred to as a second ring bus register) is connected to the second ring bus 18 and is one of the elements constituting the second ring bus 18. The ring bus register 41 is connected to the instruction decoder 42. The ring bus register 41 may be a single register or a multi-stage shift register. The ring bus register 41 receives an operation instruction from the preceding processing element 40 on the second ring bus 18 and sends the received operation instruction on to the next processing element 40. Of the operation instructions it receives, the ring bus register 41 passes those that it is to process itself to the instruction decoder 42.
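The way an operation instruction moves along the second ring bus 18 can be pictured with the following Python sketch. It assumes, purely for illustration, that each ring bus register 41 is a single register (the text also allows a multi-stage shift register) and that every instruction advances by one element per clock; the function name `propagate` and the string placeholders for instructions are hypothetical.

```python
def propagate(instructions, num_elements):
    """Return, per clock cycle, the instruction held in each element's ring register."""
    registers = [None] * num_elements
    timeline = []
    # Extra empty cycles let the last instruction drain out of the chain.
    for incoming in list(instructions) + [None] * num_elements:
        # Every register takes the value of its predecessor; the first takes the input.
        registers = [incoming] + registers[:-1]
        timeline.append(list(registers))
    return timeline


timeline = propagate(["ADD", "MUL"], num_elements=3)
```

Under these assumptions, element k sees a given instruction k - 1 cycles after the first element does, so the elements execute the same instruction stream in a staggered fashion.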
The instruction decoder 42 is connected to the ring bus register 41, to the memory interface unit 43, and to the arithmetic unit 44. The instruction decoder 42 analyzes the operation instruction received from the ring bus register 41 and generates control signals corresponding to that instruction. The instruction decoder 42 outputs the generated control signals to the memory interface unit 43 and the arithmetic unit 44.
The memory interface unit 43 (also referred to as a second memory interface) is connected to the instruction decoder 42, to the arithmetic unit 44, and to the internal memory 30. In response to a control signal from the instruction decoder 42, the memory interface unit 43 reads data from the internal memory 30 and sends the read data to the arithmetic unit 44. The memory interface unit 43 also writes the result computed by the arithmetic unit 44 to the internal memory 30 as output data.
The arithmetic unit 44 is connected to the instruction decoder 42 and the memory interface unit 43. In response to a control signal from the instruction decoder 42, the arithmetic unit 44 performs an operation on the data received from the memory interface unit 43 and sends the result to the memory interface unit 43. For example, the arithmetic unit 44 can be realized by a DSP (Digital Signal Processor) of an FPGA (Field-Programmable Gate Array).
The functions of the processing element 40 are not limited to those described above, and functions that would readily occur to a person skilled in the art may be added. For example, the arithmetic unit 44 may include a register file so that operations can be performed on the registers in that register file.
FIG. 7 is a conceptual diagram showing a configuration example (operation instruction 420) of an operation instruction. For example, the operation instruction 420 consists of an opcode opc, a first source operand rs, a second source operand rt, and a destination operand rd, each 8 bits wide, followed by a 32-bit immediate operand imm.
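The field layout of the operation instruction 420 can be sketched as a pair of encode and decode helpers in Python. The bit positions below (opc in the most significant byte, then rs, rt, rd, with imm in the low 32 bits) are inferred from the worked hexadecimal examples in this description rather than taken from FIG. 7, and the function names are hypothetical.

```python
def encode_operation(opc, rs=0, rt=0, rd=0, imm=0):
    """Pack the five fields into one 64-bit word (field order is an assumption)."""
    for name, value, width in (("opc", opc, 8), ("rs", rs, 8), ("rt", rt, 8),
                               ("rd", rd, 8), ("imm", imm, 32)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    return (opc << 56) | (rs << 48) | (rt << 40) | (rd << 32) | imm


def decode_operation(word):
    """Split a 64-bit operation instruction into its five fields."""
    return {
        "opc": (word >> 56) & 0xFF,
        "rs": (word >> 48) & 0xFF,
        "rt": (word >> 40) & 0xFF,
        "rd": (word >> 32) & 0xFF,
        "imm": word & 0xFFFFFFFF,
    }


fields = decode_operation(0x0100408000000000)
word = encode_operation(0x07, rs=0x22, rd=0x46, imm=0x12345678)
```

With four 8-bit fields and one 32-bit field, the instruction fills exactly 64 bits, which matches the 64-bit inst field of the command format described for the overall control unit.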
FIG. 8 is a table summarizing examples of operation instructions. The table lists values of the opcode opc and the operation corresponding to each value. For example, opc = 0x01 denotes the addition instruction ADD, which adds the data at address rs and the data at address rt of the internal memory 30 and writes the result to address rd. Descriptions of the other instructions are omitted here, except that the MACI and MACR instructions are described later.
Two of the operation instructions in the table of FIG. 8 are described below as examples.
The first example is the operation instruction given by Expression (1) below.
0x0100408000000000 → (opc = 0x01) mem[0x80] ← mem[0x00] + mem[0x40] ... (1)
In the first example, opc = 0x01, rs = 0x00, rt = 0x40, and rd = 0x80, so the instruction adds the data at address 0x00 and the data at address 0x40 of the internal memory 30 and writes the result to address 0x80. In this case, the instruction decoder 42 first outputs a control signal instructing the memory interface unit 43 to read data from addresses 0x00 and 0x40 of the internal memory 30. The instruction decoder 42 then outputs a control signal instructing the arithmetic unit 44 to add the input data supplied from the memory interface unit 43. Finally, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to write the output data of the arithmetic unit 44 to address 0x80 of the internal memory 30.
The second example is the operation instruction given by Expression (2) below.
0x0722004612345678 → (opc = 0x07) mem[0x46] ← mem[0x22] * 0x12345678 ... (2)
In the second example, opc = 0x07, rs = 0x22, rd = 0x46, and imm = 0x12345678, so the instruction multiplies the data at address 0x22 of the internal memory 30 by the value of the immediate field imm and writes the result to address 0x46. In this case, the instruction decoder 42 first outputs a control signal instructing the memory interface unit 43 to read data from address 0x22 of the internal memory 30. The instruction decoder 42 then outputs a control signal instructing the arithmetic unit 44 to multiply the input data supplied from the memory interface unit 43 by the value of the immediate field imm. Finally, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to write the output data of the arithmetic unit 44 to address 0x46 of the internal memory 30.
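The two examples can be traced with a minimal Python model of the read, compute, and write-back sequence. This is only a sketch under assumptions: the internal memory is modeled as a list of 256 words, results are truncated to 32 bits, the field positions are inferred from the hexadecimal encodings of Expressions (1) and (2), and the constant name `OPC_MUL_IMM` is a hypothetical label, since this section gives only the opcode value 0x07 for the multiply-by-immediate instruction.

```python
MASK32 = 0xFFFFFFFF
OPC_ADD = 0x01      # addition instruction ADD (from the text)
OPC_MUL_IMM = 0x07  # multiply by immediate (value from the text, name hypothetical)


def execute(word, mem):
    """Apply one 64-bit operation instruction to the memory model `mem`."""
    opc = (word >> 56) & 0xFF
    rs = (word >> 48) & 0xFF
    rt = (word >> 40) & 0xFF
    rd = (word >> 32) & 0xFF
    imm = word & MASK32
    if opc == OPC_ADD:
        # Read two operands, add them, write the result back.
        mem[rd] = (mem[rs] + mem[rt]) & MASK32
    elif opc == OPC_MUL_IMM:
        # Read one operand, multiply by the immediate, write the result back.
        mem[rd] = (mem[rs] * imm) & MASK32
    else:
        raise ValueError(f"opcode 0x{opc:02x} is not modeled in this sketch")


mem = [0] * 256
mem[0x00], mem[0x40], mem[0x22] = 5, 7, 2
execute(0x0100408000000000, mem)  # Expression (1)
execute(0x0722004612345678, mem)  # Expression (2)
```

After both calls, address 0x80 holds the sum of the values at 0x00 and 0x40, and address 0x46 holds the value at 0x22 multiplied by the immediate.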
[Overall control unit]
FIG. 9 is a block diagram showing the configuration of the overall control unit 16. As shown in FIG. 9, the overall control unit 16 includes a program counter 61, a command memory 62, a command decoder 63, and an overall control unit data path 64. The command decoder 63 is connected to the first processing element 40-1, and the overall control unit data path 64 is connected to the last processing element 40-n. The overall control unit 16 operates in the same manner as a general instruction set processor.
The program counter 61 holds a value indicating the command to be executed next. When the command is not a branch instruction, the program counter 61 is incremented automatically. When the command is a branch instruction, the value of the program counter 61 is changed in accordance with that branch instruction.
The command memory 62 stores commands, each of which includes a flag indicating which unit is to execute the instruction. The command memory 62 outputs the command corresponding to the value of the program counter 61 to the command decoder 63.
The command decoder 63 analyzes the command output from the command memory 62 and generates a control signal according to the result of the analysis. When the command decoder 63 interprets the command as an instruction for the overall control unit 16, it outputs the generated control signal to the overall control unit data path 64. When the command decoder 63 interprets the command as an instruction for the processing elements 40, it outputs the generated control signal to the first processing element 40-1 in the processing element group 14.
The overall control unit data path 64 (also referred to as an overall control data path) operates according to the content of the command, following the control signal generated by the command decoder 63. For example, the overall control unit data path 64 performs operations such as addition and branching. The overall control unit data path 64 may include elements found in a general instruction set processor, such as a register file. When the command is a branch instruction, the overall control unit data path 64 changes the value of the program counter 61 in accordance with that branch instruction.
FIG. 10 is a conceptual diagram showing a configuration example of a command 620 stored in the command memory 62. The command 620 in the example of FIG. 10 includes a 1-bit flag pf and a 64-bit instruction inst. When pf is 0, the command is interpreted as an instruction for the overall control unit 16. When pf is 1, the command is interpreted as an instruction for the processing elements 40. For a command 620 whose pf is 1, the command decoder 63 shown in FIG. 9 sends the instruction inst to the first processing element 40-1 on the second ring bus 18.
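The role of the flag pf can be illustrated with a short Python sketch that walks a command memory and routes each instruction. Several points are assumptions made for illustration only: pf is placed in bit 64, directly above the 64-bit inst field; branch instructions are not modeled, so the program counter simply increments; and the names `pack_command` and `dispatch` are hypothetical.

```python
INST_BITS = 64
INST_MASK = (1 << INST_BITS) - 1


def pack_command(pf, inst):
    """Combine the 1-bit flag pf and the 64-bit inst (bit placement is an assumption)."""
    return ((pf & 1) << INST_BITS) | (inst & INST_MASK)


def dispatch(command_memory):
    """Route each command either to the ring of processing elements or to the local data path."""
    to_ring, to_local = [], []
    pc = 0  # program counter; no branch is modeled, so it only increments
    while pc < len(command_memory):
        command = command_memory[pc]
        pf = command >> INST_BITS
        inst = command & INST_MASK
        if pf == 1:
            to_ring.append(inst)   # instruction for the processing elements 40
        else:
            to_local.append(inst)  # instruction for the overall control unit 16
        pc += 1
    return to_ring, to_local


memory = [
    pack_command(1, 0x0100408000000000),
    pack_command(0, 0xAA00000000000000),  # placeholder control instruction
    pack_command(1, 0x0722004612345678),
]
to_ring, to_local = dispatch(memory)
```

Keeping both instruction kinds in one command memory means a single program counter sequences the whole device, with pf deciding which unit acts on each entry.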
In FIG. 9, an operation instruction arriving from the last processing element 40-n is stored in a register or the like (not shown) in the overall control unit data path 64. The storage destination may be a specific register in the register file or a dedicated register. The overall control unit data path 64 may also include a dedicated FIFO (First In First Out) for storing operation instructions, or the register in the register file used for storage may be specified separately by a flag or the like in the operation instruction.
FIG. 11 is a conceptual diagram showing a configuration example of an instruction 160 of the overall control unit 16. For example, the instruction 160 of the overall control unit 16 includes an opcode opc, a first source operand rs, a second source operand rt, a destination operand rd, and an immediate operand imm. In the example of FIG. 11, the opcode opc is 8 bits, the first source operand rs is 5 bits, the second source operand rt is 5 bits, the destination operand rd is 5 bits, and the immediate operand imm is 32 bits. The instruction 160 of FIG. 11 may be stored left-justified in the 64-bit inst field shown in FIG. 10.
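The field widths above add up to 8 + 5 + 5 + 5 + 32 = 55 bits, so left-justified storage in the 64-bit inst field leaves the low 9 bits unused. The Python sketch below illustrates that packing. The ordering of the fields from the most significant bit (opc, rs, rt, rd, imm) follows the order in which the text lists them and is an assumption, as are the function names.

```python
FIELDS = (("opc", 8), ("rs", 5), ("rt", 5), ("rd", 5), ("imm", 32))
TOTAL_BITS = sum(width for _, width in FIELDS)  # 55 bits in total
PAD_BITS = 64 - TOTAL_BITS                      # 9 unused low-order bits


def pack_control(**values):
    """Pack the fields MSB-first and left-justify them in a 64-bit word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word << PAD_BITS


def unpack_control(inst):
    """Recover the fields from a left-justified 64-bit word."""
    word = inst >> PAD_BITS
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


example = pack_control(opc=0x12, rs=3, rt=4, rd=5, imm=0xDEADBEEF)
```

Left-justifying keeps the opcode in the top byte of inst, the same position the 64-bit operation instruction of the processing elements uses for its opcode under the layout inferred earlier.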
FIG. 12 is a table summarizing examples of instructions of the overall control unit 16. RF[rs] denotes the value of the register in the register file at the index specified by rs. PC denotes the program counter value. dmactrl denotes the instruction register for the transfer control unit 15, and dmastatus denotes the status register of the transfer control unit 15. "{RF[rs], RF[rt]}" denotes the value obtained by concatenating the two register values RF[rs] and RF[rt]. Although the example of FIG. 12 assumes that the registers in the register file are 32 bits wide, the register width is not limited to 32 bits.
[Transfer control unit]
FIG. 13 is a block diagram showing the configuration of the transfer control unit 15. As shown in FIG. 13, the transfer control unit 15 includes an instruction register 51, a status register 52, and a control circuit 53. The instruction register 51 and the status register 52 are connected to the overall control unit 16. The control circuit 53 is connected to the external memory 100, to the first transfer element 20-1, and to the last transfer element 20-n.
The instruction register 51 includes a plurality of register fields: eaddr, which indicates an external memory address; iaddr, which indicates an internal memory address; num, which indicates the number of data items to transfer; and dir, which indicates the transfer direction. For example, dir == 0 denotes a transfer of num data items from address iaddr of the internal memory 30 to address eaddr of the external memory 100, and dir == 1 denotes a transfer of num data items from address eaddr of the external memory 100 to address iaddr of the internal memory 30.
The status register 52 holds a value indicating whether a transfer on the first ring bus 17 is in progress or has completed.
The control circuit 53 is connected to the external memory 100. The control circuit 53 receives data to be processed from the external memory 100 and sends the received data, through the first ring bus 17, to the first transfer element 20-1 in the transfer element group 12. The control circuit 53 also writes the output data received from the internal memory group 13 via the first ring bus 17 to the external memory 100.
The control circuit 53 starts a transfer when the instruction register 51 contains a valid transfer instruction. The control circuit 53 also keeps the status register 52 up to date with a value indicating whether the transfer is in progress or complete, and notifies the overall control unit 16 of that state. In other words, when the instruction register 51 contains a valid transfer instruction, the control circuit 53 transfers data between the external memory 100 and the transfer element group 12 and updates the value of the status register 52.
 制御回路53は、図12に示す全体制御部16のivkdma命令により、指示レジスタ51に値を書き込む。また、制御回路53は、図12に示す全体制御部16のchkdma命令により、状態レジスタ52の値を読み込む。 The control circuit 53 writes a value to the instruction register 51 in response to the ivkdma instruction of the overall control unit 16 shown in FIG. 12. Likewise, the control circuit 53 reads the value of the status register 52 in response to the chkdma instruction of the overall control unit 16 shown in FIG. 12.
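 The interplay of the instruction register 51, the status register 52, and the ivkdma and chkdma instructions described above can be sketched in software. This is only an illustrative model, not the circuit itself: the class names `TransferInstruction` and `TransferControlModel`, the `step` method, and the assumption that one data item moves per step are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TransferInstruction:
    eaddr: int  # external memory address
    iaddr: int  # internal memory address
    num: int    # number of data items to transfer
    dir: int    # 0: internal -> external, 1: external -> internal


class TransferControlModel:
    """Toy model of the transfer control unit (hypothetical, for illustration)."""

    def __init__(self):
        self.instruction = None  # stands in for instruction register 51
        self.busy = False        # stands in for status register 52
        self._remaining = 0

    def ivkdma(self, inst):
        # Writing a valid instruction starts the transfer.
        self.instruction = inst
        self._remaining = inst.num
        self.busy = inst.num > 0

    def chkdma(self):
        # Reading the status tells the caller whether the transfer is in progress.
        return self.busy

    def step(self):
        # Assumption: exactly one data item is transferred per step.
        if self.busy:
            self._remaining -= 1
            self.busy = self._remaining > 0


ctrl = TransferControlModel()
ctrl.ivkdma(TransferInstruction(eaddr=0x0, iaddr=0x400, num=12, dir=1))
steps = 0
while ctrl.chkdma():  # poll until the transfer has completed
    ctrl.step()
    steps += 1
print(steps)
```

 The polling loop mirrors how a program on the overall control unit 16 might wait on chkdma after issuing ivkdma; the real timing depends on the ring bus and is not modeled here.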
 図14は、転送制御部15による転送データの転送例を示す概念図である。図14には、処理エレメント数=8、dir=1、eaddr=0x0、iaddr=0x400、num=12の場合を例に説明する。dir=1は、外部メモリ100から内部メモリ30への転送を表す。eaddr=0x0かつnum=12は、転送するデータが外部メモリ100の0x0番地から12個のデータであることを示す。iaddr=0x400は、転送先の開始アドレスはそれぞれの内部メモリ30の400番地であることを示す。転送データ数が12であるため、内部メモリ30の8個の記憶領域(アドレス1~8)のうち、最初の4つの記憶領域(アドレス1~4)には2個ずつのデータが保持され、残りの4つの記憶領域(アドレス5~8)には1個ずつのデータが保持される。 FIG. 14 is a conceptual diagram showing an example of data transfer by the transfer control unit 15. FIG. 14 illustrates the case where the number of processing elements is 8, dir = 1, eaddr = 0x0, iaddr = 0x400, and num = 12. dir = 1 denotes a transfer from the external memory 100 to the internal memories 30. eaddr = 0x0 and num = 12 indicate that the data to be transferred are the 12 data items starting at address 0x0 of the external memory 100. iaddr = 0x400 indicates that the destination start address is address 0x400 of each internal memory 30. Because 12 data items are transferred, each of the first four of the eight storage areas of the internal memories 30 (addresses 1 to 4) holds two data items, and each of the remaining four storage areas (addresses 5 to 8) holds one data item.
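 The 2/2/2/2/1/1/1/1 split in this example can be reproduced with a short calculation. The sketch below assumes the data items are dealt out to the internal memories in round-robin order; the text gives only the resulting counts, so the ordering and the function name `distribute` are assumptions.

```python
def distribute(num, n_mem):
    """Count how many data items each internal memory receives.

    Assumption: items are assigned round-robin, item k going to memory k % n_mem.
    """
    counts = [0] * n_mem
    for k in range(num):
        counts[k % n_mem] += 1
    return counts


counts = distribute(num=12, n_mem=8)
print(counts)  # [2, 2, 2, 2, 1, 1, 1, 1]
```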
 以上が、データ処理装置1の構成要素についての説明である。なお、以上のデータ処理装置1の構成は一例であって、本実施形態のデータ処理装置1の機能を発揮できさえすれば、種々の構成を追加・削除してもよい。 The above is the description of the components of the data processing device 1. The above configuration of the data processing apparatus 1 is an example, and various configurations may be added or deleted as long as the functions of the data processing apparatus 1 of the present embodiment can be exhibited.
 (動作)
 次に、処理エレメント40の動作について図面を参照しながら説明する。図15は、処理エレメント40の動作について説明するためのフローチャートである。
(Operation)
Next, the operation of the processing element 40 will be described with reference to the drawings. FIG. 15 is a flowchart for explaining the operation of the processing element 40.
 図15において、まず、処理エレメント40は、前段の処理エレメント40から演算命令が来たかどうか判断する(ステップS11)。 In FIG. 15, first, the processing element 40 determines whether or not an operation instruction has come from the processing element 40 of the previous stage (step S11).
 演算命令が来た場合(ステップS11でYes)、処理エレメント40は、演算命令を受信する(ステップS12)。一方、演算命令が来ていない場合(ステップS11でNo)、処理エレメント40は、演算命令の到着を待機する(ステップS11に戻る)。 If an operation instruction has arrived (Yes in step S11), the processing element 40 receives it (step S12). If no operation instruction has arrived (No in step S11), the processing element 40 waits for one to arrive (returning to step S11).
 次に、処理エレメント40は、受信した演算命令に応じた動作を行う(ステップS13)。例えば、処理エレメント40は、受信した演算命令に応じて、以下のような1~4のような動作を行う。
(1)内部メモリ30中のrs、rtに示されるアドレスから値を読み込む。
(2)読み込んだ値に対して演算を実行する。
(3)内部メモリ30中のrdに示されるアドレスに演算結果を書き込む。
(4)演算命令中のimmを演算結果で書き換える。
Next, the processing element 40 performs an operation according to the received operation instruction (step S13). For example, depending on the received operation instruction, the processing element 40 performs operations such as (1) to (4) below.
(1) Read values from the addresses shown in rs and rt in the internal memory 30.
(2) Perform an operation on the read value.
(3) Write the operation result to the address indicated by rd in the internal memory 30.
(4) Rewrite imm in the operation instruction with the operation result.
 そして、処理エレメント40は、次段の処理エレメント40に更新後の演算命令を送る(ステップS14)。 Then, the processing element 40 sends the updated operation instruction to the processing element 40 of the next stage (step S14).
 転送が継続されている場合(ステップS15でYes)は、ステップS11に戻る。転送が完了した場合(ステップS15でNo)は、図15のフローチャートに沿った処理は終了とする。 If the transfer is continued (Yes in step S15), the process returns to step S11. When the transfer is completed (No in step S15), the process according to the flowchart of FIG. 15 ends.
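 The flow of steps S11 to S14 can be sketched as a small loop. This is a simplified software illustration, not the hardware: the dictionary-based instruction format, the function name `run_element`, and the use of an empty queue as the stop condition are assumptions, and all four operations (1) to (4) are performed here even though a real instruction may use only some of them.

```python
from collections import deque


def run_element(inbox, outbox, memory, operate):
    """Process every instruction waiting in inbox and forward it to outbox."""
    while inbox:                      # S11: has an operation instruction arrived?
        inst = inbox.popleft()        # S12: receive the instruction
        a = memory[inst["rs"]]        # (1) read the operands at rs and rt
        b = memory[inst["rt"]]
        result = operate(a, b)        # (2) perform the operation
        memory[inst["rd"]] = result   # (3) write the result to address rd
        inst["imm"] = result          # (4) rewrite imm with the result
        outbox.append(inst)           # S14: send the updated instruction onward


memory = {0: 3, 1: 4, 2: 0}
inbox = deque([{"rs": 0, "rt": 1, "rd": 2, "imm": 0}])
outbox = deque()
run_element(inbox, outbox, memory, lambda a, b: a * b)
print(memory[2], outbox[0]["imm"])  # 12 12
```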
 以上が、図15のフローチャートに沿った処理エレメント40の動作についての説明である。続いて、データ処理装置1による計算例について図面を参照しながら説明する。 The above is the description of the operation of the processing element 40 along the flowchart of FIG. Subsequently, an example of calculation by the data processing device 1 will be described with reference to the drawings.
 〔行列積〕
 図16は、データ処理装置1による行列積の計算例である。ここでは、データ処理装置1が、3行2列の行列Aと2行8列の行列Bとの行列積を計算し、3行8列の行列C(=AB)を求める例について説明する。
[Matrix product]
FIG. 16 is a calculation example of the matrix product by the data processing device 1. Here, an example will be described in which the data processing apparatus 1 calculates the matrix product of the matrix A of 3 rows and 2 columns and the matrix B of 2 rows and 8 columns to obtain a matrix C (= AB) of 3 rows and 8 columns.
 図17のように、行列Aの6つの要素(A00~A21)は、全体制御部16のレジスタファイル(RF[0]~RF[5])に格納される。なお、図17には、レジスタファイル中に行列Aの要素を格納する例を図示しているが、これに限らない。例えば、スクラッチパッドメモリのようなメモリを全体制御部16の中に構成し、スクラッチパッドメモリに行列Aの要素を格納するように構成してもよい。 As shown in FIG. 17, the six elements (A00 to A21) of the matrix A are stored in the register files (RF [0] to RF [5]) of the overall control unit 16. Although FIG. 17 illustrates an example in which the elements of the matrix A are stored in the register file, the present invention is not limited to this. For example, a memory such as a scratch pad memory may be configured in the general control unit 16, and elements of the matrix A may be stored in the scratch pad memory.
 図17のように、行列Bおよび行列Cの要素(B00~B17、C00~C27)は、内部メモリ30の中に格納される。例えば、B17は、行列Bの行1列7の要素を表す。図17の例では、入力データである行列Bは、転送制御部15が内部メモリ30の0番地~4番地に予め読み込んでおくものとする。出力データである行列Cは、ゼロで初期化しておいたアドレス400番地~408番地の領域に格納される。 As shown in FIG. 17, the elements of matrix B and matrix C (B00 to B17, C00 to C27) are stored in the internal memories 30. For example, B17 denotes the element in row 1, column 7 of matrix B. In the example of FIG. 17, the transfer control unit 15 is assumed to have loaded matrix B, the input data, into addresses 0 to 4 of the internal memories 30 in advance. Matrix C, the output data, is stored in the region at addresses 400 to 408, which has been initialized to zero.
 行列積の計算には、図8に示すMAC Immediate命令を使用する。MAC Immediate命令では、全体制御部16のレジスタファイルのrt番目のレジスタをimmに設定し、第2の環状バス18に演算命令を流し、処理エレメント40でimmと内部メモリrs番地の乗算を行い、積を内部メモリ30のrd番地に累積する。 The matrix product is computed with the MAC Immediate instruction shown in FIG. 8. For a MAC Immediate instruction, the value of the rt-th register of the register file of the overall control unit 16 is set in imm and the operation instruction is issued onto the second ring bus 18; each processing element 40 then multiplies imm by the value at address rs of its internal memory and accumulates the product at address rd of the internal memory 30.
 図18は、行列積のアセンブリプログラムの一例(アセンブリプログラム171)である。MACIは、MAC Immediate命令を表すニモニックである。例えば、「MACI 1、5、0x402」は、rs=1、rt=5、rd=0x402であるMAC Immediate命令を表す。 FIG. 18 shows an example of an assembly program for the matrix product (assembly program 171). MACI is the mnemonic for the MAC Immediate instruction. For example, "MACI 1, 5, 0x402" represents a MAC Immediate instruction with rs = 1, rt = 5, and rd = 0x402.
 ここで、図19~図23を用いて、行列積のサイクルごとの動作例について説明する。図19~図23には、行列積の各サイクルにおける内部メモリ30の値と、第2の環状バス18を流れる演算命令(図18における命令1~命令6)と、演算命令中のimmフィールドに設定されている値とを示す。 Here, an example of the cycle-by-cycle operation of the matrix product will be described with reference to FIGS. 19 to 23. FIGS. 19 to 23 show, for each cycle of the matrix product, the values in the internal memories 30, the operation instructions (instructions 1 to 6 in FIG. 18) flowing on the second ring bus 18, and the values set in the imm fields of those instructions.
 図19に示すサイクル1(cyc1)において、第2の環状バス18の最初の処理エレメント40-1は、命令1を受け取る。命令1のimmフィールドには、データA00が設定されている。処理エレメント40-1は、命令1「MACI 0、0、0x400」の動作に従い、自身に対応する内部メモリ30-1の0番地の値(B00)とA00との乗算を行い、演算結果である積を0x400番地に累積する。図20のように、内部メモリ30-1の400番地には、演算結果の「A00*B00」が格納される。なお、「A00*B00」は、「A00」と「B00」との積を示す。 In cycle 1 (cyc1) shown in FIG. 19, the first processing element 40-1 on the second ring bus 18 receives instruction 1. Data A00 is set in the imm field of instruction 1. Following instruction 1, "MACI 0, 0, 0x400", the processing element 40-1 multiplies the value at address 0 of its corresponding internal memory 30-1 (B00) by A00 and accumulates the resulting product at address 0x400. As shown in FIG. 20, the result "A00 * B00" is stored at address 400 of the internal memory 30-1. Here, "A00 * B00" denotes the product of A00 and B00.
 図20に示すサイクル2(cyc2)においては、処理エレメント40-1が命令2を受け取り、処理エレメント40-2が命令1を受け取る。処理エレメント40-1は、B10とA01との乗算を行い、演算結果である積(A01*B10)を400番地に累積する積和演算を実行する。処理エレメント40-2は、B01とA00との乗算を行い、演算結果である積(A00*B01)を400番地に累積する。その結果、図21のように、内部メモリ30-1の400番地には「A00*B00+A01*B10」が格納され、内部メモリ30-2の400番地には「A00*B01」が格納される。なお、「A00*B00+A01*B10」は、「A00*B00」と「A01*B10」との和を示す。 In cycle 2 (cyc2) shown in FIG. 20, the processing element 40-1 receives the instruction 2 and the processing element 40-2 receives the instruction 1. The processing element 40-1 performs a product-sum operation of multiplying B10 and A01 and accumulating the product (A01 * B10) which is the operation result at address 400. The processing element 40-2 performs multiplication of B01 and A00, and accumulates the product (A00 * B01), which is the operation result, at address 400. As a result, as shown in FIG. 21, "A00 * B00 + A01 * B10" is stored at address 400 of the internal memory 30-1, and "A00 * B01" is stored at address 400 of the internal memory 30-2. "A00 * B00 + A01 * B10" indicates the sum of "A00 * B00" and "A01 * B10".
 図21におけるサイクル3(cyc3)においては、処理エレメント40-1が命令3を受け取り、処理エレメント40-2が命令2を受け取り、処理エレメント40-3が命令1を受け取る。処理エレメント40-1は、B00とA10との乗算を行い、演算結果である積(A10*B00)を404番地に累積する。処理エレメント40-2は、B11とA01との乗算を行い、演算結果である積(A01*B11)を400番地に累積する積和演算を実行する。処理エレメント40-3は、B02とA00との乗算を行い、演算結果である積(A00*B02)を400番地に累積する。その結果、図22のように、内部メモリ30-1の404番地には「A10*B00」が格納され、内部メモリ30-2の400番地には「A00*B01+A01*B11」が格納され、内部メモリ30-3の400番地には「A00*B02」が格納される。 In cycle 3 (cyc3) in FIG. 21, the processing element 40-1 receives the instruction 3, the processing element 40-2 receives the instruction 2, and the processing element 40-3 receives the instruction 1. The processing element 40-1 performs multiplication of B00 and A10, and accumulates the product (A10 * B00), which is the operation result, at address 404. The processing element 40-2 performs multiplication of B11 and A01 and executes a product-sum operation of accumulating the product (A01 * B11) which is the operation result at address 400. The processing element 40-3 performs multiplication of B02 and A00, and accumulates the product (A00 * B02), which is the operation result, at address 400. As a result, as shown in FIG. 22, "A10 * B00" is stored at address 404 of internal memory 30-1, "A00 * B01 + A01 * B11" is stored at address 400 of internal memory 30-2, and "A00 * B02" is stored at address 400 of the memory 30-3.
 図22に示すサイクル4(cyc4)においては、処理エレメント40-1が命令4を受け取り、処理エレメント40-2が命令3を受け取り、処理エレメント40-3が命令2を受け取り、処理エレメント40-4が命令1を受け取る。処理エレメント40-1~4は、図19~図21と同様に演算を実行し、演算結果を内部メモリ30-1~4の指定されたアドレスに格納する。 In cycle 4 (cyc4) shown in FIG. 22, the processing element 40-1 receives instruction 4, the processing element 40-2 receives instruction 3, the processing element 40-3 receives instruction 2, and the processing element 40-4 receives instruction 1. The processing elements 40-1 to 40-4 execute their operations in the same manner as in FIGS. 19 to 21 and store the results at the designated addresses of the internal memories 30-1 to 30-4.
 図23は、行列積の計算が終了した時点のサイクル14(cyc14)における内部メモリ30-1~8の状態を示す。内部メモリ30-1~8のそれぞれのアドレスには、演算命令に従った演算結果が格納される。 FIG. 23 shows the state of the internal memories 30-1 to 8 in cycle 14 (cyc14) when the matrix product calculation is completed. At respective addresses of the internal memories 30-1 to 8, the operation result according to the operation instruction is stored.
 以上が、データ処理装置1が行列積を計算する例に関する説明である。図16~図23を用いて説明した行列積の演算においては、全体制御部16が即値データを格納するフィールドを含む演算命令を第2の環状バス18に出力した際に、処理エレメント40が演算を実行して演算結果を内部メモリ30に格納する。そして、処理エレメント40は、第2の環状バス18を通じて受け取った即値データと、自身に対応する内部メモリ30に格納されたデータとを用いて算出した演算結果を自身に対応する内部メモリ30に格納する。 The above describes an example in which the data processing device 1 computes a matrix product. In the matrix product operation described with reference to FIGS. 16 to 23, when the overall control unit 16 outputs an operation instruction containing a field that stores immediate data onto the second ring bus 18, the processing elements 40 execute the operation and store the results in the internal memories 30. Specifically, each processing element 40 stores, in its corresponding internal memory 30, the result computed from the immediate data received through the second ring bus 18 and the data stored in that internal memory 30.
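 The pipelined matrix product described above can be checked with a small simulation. This is a sketch under stated assumptions rather than the patented implementation: the function name `matmul_on_ring` is hypothetical, column j of B is placed at addresses 0, 1, ... of internal memory j, row i of C is accumulated at word address `c_base + i`, and instruction n is assumed to reach processing element p in cycle n + p - 1 (both counted from 1), as in the cycle-by-cycle walkthrough.

```python
def matmul_on_ring(A, B, c_base=0x400):
    """Simulate MAC Immediate instructions flowing past a chain of elements."""
    rows, inner, n_pe = len(A), len(A[0]), len(B[0])

    # Internal memory j: column j of B at addresses 0..inner-1, zeroed C area.
    mems = []
    for j in range(n_pe):
        mem = {k: B[k][j] for k in range(inner)}
        mem.update({c_base + i: 0 for i in range(rows)})
        mems.append(mem)

    # One instruction per element of A, in program order: (rs, imm, rd).
    program = [(k, A[i][k], c_base + i)
               for i in range(rows) for k in range(inner)]

    n_cycles = len(program) + n_pe - 1  # cycle of the last multiply-accumulate
    for cycle in range(1, n_cycles + 1):
        for pe in range(n_pe):
            idx = cycle - 1 - pe        # instruction currently at this element
            if 0 <= idx < len(program):
                rs, imm, rd = program[idx]
                mems[pe][rd] += imm * mems[pe][rs]

    C = [[mems[j][c_base + i] for j in range(n_pe)] for i in range(rows)]
    return C, n_cycles


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 2, 3, 4, 5, 6, 7, 8],
     [8, 7, 6, 5, 4, 3, 2, 1]]
C, n_cycles = matmul_on_ring(A, B)
expected = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(8)]
            for i in range(3)]
print(C == expected, n_cycles)  # True 13
```

 In this model every processing element performs one multiply-accumulate per cycle once the pipeline has filled, and the last one occurs in cycle 6 + 8 - 1 = 13 for six instructions and eight elements.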
 〔ベクトルの内積〕
 図24は、データ処理装置1によるベクトルの内積の計算例である。ここでは、データ処理装置1が、1行8列の行列Aと1行8列の行列Bとの内積dを求める例について説明する。
[Inner product of vectors]
FIG. 24 is an example of calculation of the inner product of vectors by the data processing device 1. Here, an example will be described in which the data processing device 1 obtains an inner product d of a 1-row-8-column matrix A and a 1-row-8-column matrix B.
 図25のように、行列Aと行列Bの要素(A00~A07、B00~B07)は、内部メモリ30の中に格納する。 As shown in FIG. 25, the elements of matrix A and matrix B (A00 to A07, B00 to B07) are stored in the internal memory 30.
 内積の計算には、図8のMAC Reduction命令を使用する。MAC Reduction命令は、各処理エレメント40において、内部メモリ30のrs番地とrt番地の乗算を行い、積を演算命令中のimmの値に累積し、次の処理エレメント40に転送する命令である。すなわち、MAC Reduction命令においては、演算命令が処理エレメント40を通過するたびに、演算命令のimmフィールドの値に演算結果が積和されていく。 The inner product is computed with the MAC Reduction instruction of FIG. 8. With a MAC Reduction instruction, each processing element 40 multiplies the values at addresses rs and rt of its internal memory 30, accumulates the product into the imm value in the operation instruction, and forwards the instruction to the next processing element 40. In other words, each time a MAC Reduction instruction passes through a processing element 40, a product is accumulated into the value of its imm field.
 図26は、内積のアセンブリプログラムの一例(アセンブリプログラム172)である。MACRは、MAC Reduction命令を表すニモニックである。例えば、「MACR 0、4」は、rs=0、rt=4であるMAC Reduction命令を表す。 FIG. 26 shows an example of an assembly program for the inner product (assembly program 172). MACR is the mnemonic for the MAC Reduction instruction. For example, "MACR 0, 4" represents a MAC Reduction instruction with rs = 0 and rt = 4.
 ここで、図27~図31を用いて、ベクトルの内積計算におけるサイクルごとの動作例について説明する。 Here, an example of the cycle-by-cycle operation of the vector inner product calculation will be described with reference to FIGS. 27 to 31.
 図27に示すサイクル1においては、処理エレメント40-1に到着した命令1のimmフィールドの値はゼロである。サイクル1においては、処理エレメント40-1は、A00*B00の演算を行い、immフィールドに加算する。処理エレメント40-1は、演算結果を次段の処理エレメント40-2に転送する。図28のように、処理エレメント40-2のimmフィールドには、「A00*B00」が格納される。 In cycle 1 shown in FIG. 27, the value of the imm field of instruction 1 that has arrived at processing element 40-1 is zero. In cycle 1, the processing element 40-1 performs an operation of A00 * B00 and adds it to the imm field. The processing element 40-1 transfers the operation result to the processing element 40-2 of the next stage. As shown in FIG. 28, “A00 * B00” is stored in the imm field of the processing element 40-2.
 図28に示すサイクル2においては、処理エレメント40-2は、A01*B01の演算を行い、immフィールドに加算する。処理エレメント40-2は、演算結果を次段の処理エレメント40-3に転送する。図29のように、処理エレメント40-3のimmフィールドには、「A00*B00+A01*B01」が格納される。 In cycle 2 shown in FIG. 28, the processing element 40-2 performs an operation of A01 * B01 and adds it to the imm field. The processing element 40-2 transfers the operation result to the processing element 40-3 of the next stage. As shown in FIG. 29, “A00 * B00 + A01 * B01” is stored in the imm field of the processing element 40-3.
 図29に示すサイクル3においては、処理エレメント40-3がA02*B02の演算を行い、immフィールドに加算する。処理エレメント40-3は、演算結果を次段の処理エレメント40-4に転送する。図30のように、処理エレメント40-4のimmフィールドには、「A00*B00+A01*B01+A02*B02」が格納される。 In cycle 3 shown in FIG. 29, the processing element 40-3 performs an operation of A02 * B02 and adds it to the imm field. The processing element 40-3 transfers the operation result to the processing element 40-4 of the next stage. As shown in FIG. 30, “A00 * B00 + A01 * B01 + A02 * B02” is stored in the imm field of the processing element 40-4.
 図30に示すサイクル4においては、処理エレメント40-4がA03*B03の演算を行い、immフィールドに加算する。処理エレメント40-4は、演算結果を次段の処理エレメント40-5に転送する。なお、サイクル5~7に関しては、説明を省略する。 In cycle 4 shown in FIG. 30, the processing element 40-4 performs the operation of A03 * B03 and adds it to the imm field. The processing element 40-4 transfers the operation result to the next stage processing element 40-5. The description of cycles 5 to 7 is omitted.
 図31に示すサイクル8においては、最後の処理エレメント40である処理エレメント40-8が、A07*B07の計算結果をimmフィールドの値「A00*B00+A01*B01+・・・+A06*B06」に加算する。これにより、内積の計算が完了する。例えば、最後の処理エレメント40-8が第2の環状バス18に出力する演算命令は、全体制御部16中の全体制御部データパス64中のレジスタ等に格納される。 In cycle 8 shown in FIG. 31, the last processing element 40, namely the processing element 40-8, adds the result of A07 * B07 to the imm field value "A00 * B00 + A01 * B01 + ... + A06 * B06". This completes the inner product calculation. For example, the operation instruction that the last processing element 40-8 outputs onto the second ring bus 18 is stored in a register or the like in the overall control unit data path 64 of the overall control unit 16.
 以上が、データ処理装置1がベクトルの内積を計算する例に関する説明である。図24~図31を用いて説明した内積の演算においては、全体制御部16が即値データを格納するフィールドを含む演算命令を第2の環状バス18に出力した際に、処理エレメント40が即値データを用いて演算を実行する。すなわち、処理エレメント40は、第2の環状バス18を通じて受け取った即値データと、自身に対応する内部メモリ30に格納されたデータとを用いて算出した演算結果によって即値データを書き換える。そして、処理エレメント40は、書き換えた即値データを出力データとして第2の環状バス18に出力する。 The above describes an example in which the data processing device 1 computes the inner product of vectors. In the inner product operation described with reference to FIGS. 24 to 31, when the overall control unit 16 outputs an operation instruction containing a field that stores immediate data onto the second ring bus 18, each processing element 40 executes its operation using the immediate data. Specifically, the processing element 40 overwrites the immediate data with the result computed from the immediate data received through the second ring bus 18 and the data stored in its corresponding internal memory 30. The processing element 40 then outputs the rewritten immediate data onto the second ring bus 18 as output data.
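 The reduction described above, in which a single instruction carries a running sum in its imm field from one processing element to the next, can be sketched as follows. The function name `inner_product_on_ring` is hypothetical, and placing the elements of A at address 0 and those of B at address 4 of each internal memory is an assumption chosen to match the "MACR 0, 4" example.

```python
def inner_product_on_ring(a, b, rs=0, rt=4):
    """Simulate one MAC Reduction instruction visiting each element in turn."""
    n_pe = len(a)
    # Internal memory j holds a[j] at address rs and b[j] at address rt.
    mems = [{rs: a[j], rt: b[j]} for j in range(n_pe)]

    imm = 0      # the imm field is zero when the instruction is issued
    trace = []   # imm value leaving each processing element, one per cycle
    for pe in range(n_pe):
        imm += mems[pe][rs] * mems[pe][rt]  # multiply and accumulate into imm
        trace.append(imm)
    return imm, trace


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
d, trace = inner_product_on_ring(a, b)
print(d)      # 120
print(trace)  # [8, 22, 40, 60, 80, 98, 112, 120]
```

 With eight processing elements the final value leaves the last element after eight steps, matching the eight cycles in the walkthrough.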
 以上が、データ処理装置1による計算例についての説明である。なお、上記の例においては小さいサイズの行列に対する行列積を説明したが、使用する内部メモリ30や処理エレメント40の数を増やすことによって、より大きな行列に対しても同様の計算を実行できる。その場合においても、処理エレメント40に含まれる演算器44は、毎サイクル連続して演算を行う。 The above is the description of the calculation example by the data processing device 1. Although the matrix multiplication for the small size matrix has been described in the above example, the same calculation can be performed for a larger matrix by increasing the number of internal memories 30 and processing elements 40 used. Even in that case, the computing unit 44 included in the processing element 40 continuously performs computations every cycle.
 また、データ処理装置1では、第2の環状バス18と第1の環状バス17とが独立して動作することができる。そのため、図32に示すように、処理と転送とを並行処理することができる。すなわち、データ処理装置1は、ある行列積演算を行うのと同時に、次段の行列積演算のための行列Aの転送と行列Bの転送や、前段の行列積演算の出力である行列Cの転送を行うことができる。 Further, in the data processing device 1, the second annular bus 18 and the first annular bus 17 can operate independently. Therefore, as shown in FIG. 32, processing and transfer can be performed in parallel. That is, the data processor 1 performs transfer of a matrix A and transfer of a matrix B for the next stage of matrix multiplication at the same time as performing certain matrix multiplication, and of the matrix C which is an output of the previous stage of matrix multiplication. Transfer can be done.
 また、本実施形態のデータ処理装置の処理対象のデータは、行列に限らず、他の形態のデータでもよい。例えば、本実施形態のデータ処理装置は、ベクトルデータを処理対象としてもよい。 Further, the data to be processed by the data processing apparatus of the present embodiment is not limited to a matrix, and may be data of another form. For example, the data processing apparatus according to the present embodiment may process vector data.
 また、本実施形態のデータ処理装置に含まれる全体制御部および処理エレメントの内部は、パイプラインプロセッサとして実現してもよい。全体制御部および処理エレメントの内部をパイプラインプロセッサとして実現すれば、演算のスループットを高めることができる。この場合、例えばMACI命令ではrsとrdに同時にアクセスする必要があるなど、内部メモリへ同時アクセスする必要があるため、内部メモリを複数バンク構成として同時アクセスを可能にしてもよい。 Further, the inside of the overall control unit and the processing element included in the data processing apparatus of the present embodiment may be realized as a pipeline processor. Implementing the entire control unit and the inside of the processing elements as a pipeline processor can increase the throughput of operations. In this case, since it is necessary to simultaneously access the internal memory, for example, it is necessary to simultaneously access rs and rd in the MACI instruction, the internal memory may be configured as a plurality of banks to allow simultaneous access.
 また、本実施の形態では、処理エレメント数が8の場合を例に説明したが、処理エレメント数に限定は加えない。例えば、処理エレメント数が256や512であっても本実施形態の構成を適用可能である。また、処理エレメント数は、8を下回る数であってもよいし、512を超える数であってもよい。 Further, although the case where the number of processing elements is eight has been described as an example in the present embodiment, the number of processing elements is not limited. For example, even if the number of processing elements is 256 or 512, the configuration of the present embodiment is applicable. Also, the number of processing elements may be less than eight or more than 512.
 以上の本実施形態によれば、データの転送とデータの演算を同時に実行することにより、演算器の稼働率を維持できるという第1の効果が得られる。また、本実施形態によれば、処理エレメントの演算内容を演算命令として環状バスを通じて送信することにより、クロック信号およびリセット信号のための信号線を除いて長い配線がなくなり、動作周波数を向上できるという第2の効果が得られる。すなわち、本実施形態によれば、データ転送のためのバスとデータ処理のためのバスとを独立させることによって、行列積やベクトルの内積を効率的に演算できる。 According to the present embodiment described above, by simultaneously executing data transfer and data calculation, the first effect of maintaining the operating rate of the computing unit can be obtained. Further, according to the present embodiment, by transmitting the operation content of the processing element as the operation instruction through the ring bus, it is possible to improve the operating frequency by eliminating the long wiring except for the signal line for the clock signal and the reset signal. The second effect is obtained. That is, according to the present embodiment, the matrix product and the inner product of the vectors can be efficiently calculated by making the bus for data transfer and the bus for data processing independent.
 本実施形態のデータ処理装置は、大規模な行列積や内積といった行列演算を行うビッグデータの解析処理等のアプリケーションに対し、FPGA(Field-Programmable Gate Array)上で柔軟かつ効率的に実行する用途に適用できる。また、本実施形態のデータ処理装置は、FPGA上のみならず、専用回路(ASIC:Application Specific Integrated Circuit)としても実現可能である。 The data processing device of this embodiment can be applied to flexibly and efficiently executing, on an FPGA (Field-Programmable Gate Array), applications such as big data analysis that perform matrix operations such as large-scale matrix products and inner products. The data processing device of this embodiment can be realized not only on an FPGA but also as a dedicated circuit (ASIC: Application Specific Integrated Circuit).
 以上、実施形態を参照して本発明を説明してきたが、本発明は上記実施形態に限定されるものではない。本発明の構成や詳細には、本発明のスコープ内で当業者が理解し得る様々な変更をすることができる。 Although the present invention has been described above with reference to the embodiments, the present invention is not limited to the above embodiments. Various modifications that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
 この出願は、2017年11月10日に出願された日本出願特願2017-217026を基礎とする優先権を主張し、その開示の全てをここに取り込む。 This application claims priority based on Japanese Patent Application No. 2017-217026 filed on Nov. 10, 2017, the entire disclosure of which is incorporated herein.
 1  データ処理装置
 12  転送エレメント群
 13  内部メモリ群
 14  処理エレメント群
 15  転送制御部
 16  全体制御部
 20  転送エレメント
 21  環状バスレジスタ
 22  メモリインタフェース部
 30  内部メモリ
 31  デュアルポートメモリ
 40  処理エレメント
 41  環状バスレジスタ
 42  命令デコーダ
 43  メモリインタフェース部
 44  演算器
 51  指示レジスタ
 52  状態レジスタ
 53  制御回路
 61  プログラムカウンタ
 62  コマンドメモリ
 63  コマンドデコーダ
 64  全体制御部データパス
 100  外部メモリ
 211  第一のレジスタ部
 212  第二のレジスタ部
 213  第三のレジスタ部
 311  ポートA
 312  ポートB
Reference Signs List
 1  data processing device
 12  transfer element group
 13  internal memory group
 14  processing element group
 15  transfer control unit
 16  overall control unit
 20  transfer element
 21  ring bus register
 22  memory interface unit
 30  internal memory
 31  dual-port memory
 40  processing element
 41  ring bus register
 42  instruction decoder
 43  memory interface unit
 44  arithmetic unit
 51  instruction register
 52  status register
 53  control circuit
 61  program counter
 62  command memory
 63  command decoder
 64  overall control unit data path
 100  external memory
 211  first register unit
 212  second register unit
 213  third register unit
 311  port A
 312  port B

Claims (10)

  1.  第1の環状バスと、
     前記第1の環状バスによって直列に接続された複数の転送エレメントを含む転送エレメント群と、
     前記第1の環状バスを介して少なくとも二つの前記転送エレメントに接続されるとともに、外部メモリに接続される転送制御手段と、
     前記第1の環状バスとは独立した第2の環状バスと、
     前記第2の環状バスによって直列に接続された複数の処理エレメントを含む処理エレメント群と、
     前記第2の環状バスを介して少なくとも二つの前記処理エレメントに接続される全体制御手段と、
     対応し合う前記転送エレメントおよび前記処理エレメントに接続される複数の内部メモリを含む内部メモリ群とを備えるデータ処理装置。
    1. A data processing device comprising:
    a first ring bus;
    a transfer element group including a plurality of transfer elements connected in series by the first ring bus;
    transfer control means connected to at least two of the transfer elements via the first ring bus and connected to an external memory;
    a second ring bus independent of the first ring bus;
    a processing element group including a plurality of processing elements connected in series by the second ring bus;
    overall control means connected to at least two of the processing elements via the second ring bus; and
    an internal memory group including a plurality of internal memories, each connected to a mutually corresponding transfer element and processing element.
  2.  前記転送制御手段は、
     前記外部メモリから読み込んだデータを、前記第1の環状バスを通じて前記転送エレメント群に転送データとして送信し、
     前記転送エレメント群に含まれる複数の前記転送エレメントのそれぞれは、
     前記転送データの解析結果に応じて、前記転送データに含まれるデータを自身に対応する前記内部メモリに書き込む請求項1に記載のデータ処理装置。
    2. The data processing device according to claim 1, wherein
    the transfer control means transmits data read from the external memory to the transfer element group as transfer data through the first ring bus, and
    each of the plurality of transfer elements included in the transfer element group writes data contained in the transfer data to its corresponding internal memory in accordance with a result of analyzing the transfer data.
  3.  前記全体制御手段は、
     前記第2の環状バスを通じて演算命令を前記処理エレメント群に送信し、
     複数の前記処理エレメントのそれぞれは、
     受信した前記演算命令に従って、自身に対応する前記内部メモリからデータを読み出し、読み出したデータを用いた演算の演算結果を出力データとして前記内部メモリに書き込み、
     複数の前記転送エレメントのそれぞれは、
     自身に対応する前記内部メモリから前記出力データを読み出して、読み出した前記出力データを前記第1の環状バスを通じて前記転送制御手段に送信し、
     前記転送制御手段は、
     受信した前記出力データを前記外部メモリへ書き出す請求項1または2に記載のデータ処理装置。
    3. The data processing device according to claim 1 or 2, wherein
    the overall control means transmits an operation instruction to the processing element group through the second ring bus,
    each of the plurality of processing elements, in accordance with the received operation instruction, reads data from its corresponding internal memory and writes the result of an operation using the read data to the internal memory as output data,
    each of the plurality of transfer elements reads the output data from its corresponding internal memory and transmits the read output data to the transfer control means through the first ring bus, and
    the transfer control means writes the received output data to the external memory.
  4.  前記転送エレメントは、
     前記第1の環状バスに接続される第1の環状バスレジスタと、
     前記第1の環状バスレジスタと前記内部メモリとに接続される第1のメモリインタフェースとを有し、
     前記処理エレメントは、
     前記第2の環状バスに接続される第2の環状バスレジスタと、
     前記第2の環状バスレジスタに接続される命令デコーダと、
     前記命令デコーダと前記内部メモリとに接続される第2のメモリインタフェースと、
     前記命令デコーダと前記第2のメモリインタフェースに接続される演算器とを有する請求項1乃至3のいずれか一項に記載のデータ処理装置。
    4. The data processing device according to any one of claims 1 to 3, wherein
    the transfer element includes:
    a first ring bus register connected to the first ring bus; and
    a first memory interface connected to the first ring bus register and the internal memory, and
    the processing element includes:
    a second ring bus register connected to the second ring bus;
    an instruction decoder connected to the second ring bus register;
    a second memory interface connected to the instruction decoder and the internal memory; and
    an arithmetic unit connected to the instruction decoder and the second memory interface.
  5.  前記第1の環状バスレジスタは、
     前段の前記転送エレメントから前記第1の環状バスを介して転送されてきた転送データを解析して解析結果に応じたアクセス指示を前記第1のメモリインタフェースに出す第1のレジスタと、
     前記内部メモリのアクセスレイテンシに合わせて設定され、前記第1のレジスタから転送されてきた前記転送データを転送する第2のレジスタと、
     前記第2のレジスタによって転送された前記転送データを後段の前記転送エレメントに転送する第3のレジスタとを含み、
      前記第1のレジスタは、
      前段の前記転送エレメントから受信した前記転送データに含まれるコマンドが自身へのコマンドであると判断した場合、
       前記コマンドが前記内部メモリへの書込みコマンドであれば、前記第1のメモリインタフェースに書き込み指示を送り、
       前記コマンドが前記内部メモリからの読み出しコマンドであれば、前記第1のメモリインタフェースに読み出し指示を送り、
      前記第1のメモリインタフェースは、
      前記第1のレジスタから前記書き込み指示を受信すると、受信した前記書き込み指示に従って前記内部メモリにデータを書き込み、
       前記第1のレジスタから前記読み出し指示を受信すると、受信した前記読み出し指示に従って前記内部メモリからデータを読み出し、読み出したデータを前記第3のレジスタに送り、
      前記第3のレジスタは、
       前記第1のメモリインタフェースによって前記内部メモリにデータが書き込まれる場合は、前記第2のレジスタを経由して到達した前記転送データをそのまま次段の前記転送エレメントへ送り、
       前記第1のメモリインタフェースによって内部メモリ30からデータが読み出される場合は、前記第2のレジスタを経由して到達した前記転送データの一部を前記内部メモリから読み出したデータで置換した上で次段の前記転送エレメントに送る請求項4に記載のデータ処理装置。
    5. The data processing device according to claim 4, wherein
    the first ring bus register includes:
    a first register that analyzes transfer data transferred from the preceding transfer element via the first ring bus and issues an access instruction corresponding to the analysis result to the first memory interface;
    a second register that is provided in accordance with the access latency of the internal memory and forwards the transfer data transferred from the first register; and
    a third register that transfers the transfer data forwarded by the second register to the succeeding transfer element,
    the first register, upon determining that a command contained in the transfer data received from the preceding transfer element is addressed to itself,
    sends a write instruction to the first memory interface if the command is a write command to the internal memory, and
    sends a read instruction to the first memory interface if the command is a read command from the internal memory,
    the first memory interface,
    upon receiving the write instruction from the first register, writes data to the internal memory in accordance with the received write instruction, and
    upon receiving the read instruction from the first register, reads data from the internal memory in accordance with the received read instruction and sends the read data to the third register, and
    the third register,
    when data is written to the internal memory by the first memory interface, sends the transfer data that arrived via the second register unchanged to the next transfer element, and
    when data is read from the internal memory by the first memory interface, replaces part of the transfer data that arrived via the second register with the data read from the internal memory and then sends it to the next transfer element.
  6.  The data processing apparatus according to claim 4 or 5, wherein
     the second ring bus register, when sending operation instructions received through the second ring bus to the processing element in the subsequent stage, outputs to the instruction decoder any received operation instruction that it is itself to process,
     the instruction decoder analyzes the operation instruction to be processed that it receives from the second ring bus register, generates a control signal corresponding to that operation instruction, and outputs the control signal to the second memory interface and the arithmetic unit,
     the second memory interface, in response to the control signal from the instruction decoder, sends data read from the internal memory to the arithmetic unit,
     the arithmetic unit, in response to the control signal from the instruction decoder, sends to the second memory interface the result of an operation that uses the data received from the second memory interface, and
     the second memory interface writes the operation result of the arithmetic unit to the internal memory as output data.
  7.  The data processing apparatus according to any one of claims 1 to 6, wherein
     the overall control means comprises:
     a program counter that holds a value indicating the command to be executed next;
     a command memory that stores the commands and outputs the command corresponding to the value of the program counter;
     a command decoder that analyzes the command output from the command memory and generates a control signal according to the analysis result; and
     an overall control data path that is connected to the last processing element in the processing element group and to the transfer control means, and that operates according to the control signal generated by the command decoder, and
     the command decoder
      outputs the generated control signal to the overall control data path when it interprets the command as an instruction for the overall control means, and
      outputs the generated control signal to the first-stage processing element in the processing element group when it interprets the command as an instruction for the processing elements.
  8.  The data processing apparatus according to any one of claims 1 to 7, wherein
     the transfer control means comprises:
     an instruction register including a plurality of register fields that indicate an external memory address, an internal memory address, the number of data items to transfer, and a transfer direction;
     a status register that holds a value indicating whether data is being transferred on the first ring bus; and
     a control circuit that is connected to the external memory and that, when the instruction register contains a valid transfer instruction, transfers data between the external memory and the transfer element group and updates the value of the status register.
  9.  The data processing apparatus according to any one of claims 1 to 8, wherein
     the overall control means outputs to the second ring bus an operation instruction including a field that stores immediate data, and
     each of the plurality of processing elements stores, in the internal memory corresponding to itself, an operation result calculated using the immediate data received through the second ring bus and the data stored in the internal memory corresponding to itself.
  10.  The data processing apparatus according to any one of claims 1 to 8, wherein
     the overall control means outputs to the second ring bus an operation instruction including a field that stores immediate data, and
     each of the plurality of processing elements rewrites the immediate data with an operation result calculated using the immediate data received through the second ring bus and the data stored in the internal memory corresponding to itself, and outputs the rewritten immediate data to the second ring bus.
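
The two immediate-data behaviors recited in claims 9 and 10 can be contrasted with a small software model. The following is only an illustrative sketch, not the claimed hardware: the class `ProcessingElement`, the helpers `broadcast_store` and `ring_reduce`, the `src`/`dst` memory offsets, and the use of addition as the operation are all hypothetical names and choices introduced here. It also assumes that, in the claim 9 case, the immediate field travels on to the next processing element unchanged, which the claims do not state.

```python
from dataclasses import dataclass, field
from operator import add
from typing import Callable, List

Op = Callable[[int, int], int]


@dataclass
class ProcessingElement:
    # Hypothetical model: each element owns a small internal memory.
    memory: List[int] = field(default_factory=lambda: [0, 0])

    def execute_store(self, imm: int, op: Op, src: int, dst: int) -> int:
        """Claim 9 style: the result is stored in this element's own memory.

        Assumption: the immediate is forwarded unchanged to the next element.
        """
        self.memory[dst] = op(imm, self.memory[src])
        return imm

    def execute_rewrite(self, imm: int, op: Op, src: int) -> int:
        """Claim 10 style: the result replaces the immediate field that is
        forwarded on the ring bus to the next element."""
        return op(imm, self.memory[src])


def broadcast_store(ring: List[ProcessingElement], imm: int, op: Op,
                    src: int, dst: int) -> None:
    # The instruction visits every element in ring order.
    for pe in ring:
        imm = pe.execute_store(imm, op, src, dst)


def ring_reduce(ring: List[ProcessingElement], imm: int, op: Op,
                src: int) -> int:
    # Each element folds its local data into the immediate and passes it on;
    # the value leaving the last element is the accumulated result.
    for pe in ring:
        imm = pe.execute_rewrite(imm, op, src)
    return imm


ring = [ProcessingElement([1, 0]), ProcessingElement([2, 0]),
        ProcessingElement([3, 0])]

# Claim 9 style: every element computes 10 + memory[0] into memory[1].
broadcast_store(ring, 10, add, src=0, dst=1)
stored = [pe.memory[1] for pe in ring]

# Claim 10 style: the immediate accumulates memory[0] across the ring.
total = ring_reduce(ring, 0, add, src=0)
```

In this model the claim 9 variant acts like a broadcast of one constant to every element, while the claim 10 variant acts like a reduction whose running value rides in the immediate field as the instruction moves from element to element.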
PCT/JP2018/041281 2017-11-10 2018-11-07 Data processing device WO2019093352A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017217026 2017-11-10
JP2017-217026 2017-11-10

Publications (1)

Publication Number Publication Date
WO2019093352A1 (en) 2019-05-16

Family

ID=66437805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/041281 WO2019093352A1 (en) 2017-11-10 2018-11-07 Data processing device

Country Status (1)

Country Link
WO (1) WO2019093352A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09185592A (en) * 1995-12-19 1997-07-15 Commiss Energ Atom Array system architecture of multiple parallel structural processor
JP2003036248A (en) * 2001-07-25 2003-02-07 Nec Software Tohoku Ltd Small scale processor to be used for single chip microprocessor
JP2010079921A (en) * 2003-07-25 2010-04-08 Rmi Corp Processor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18876682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18876682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP