WO2019093352A1

WO2019093352A1 - Data processing device

Info

Publication number: WO2019093352A1
Application number: PCT/JP2018/041281
Authority: WO
Inventors: 悠記小林
Original assignee: 日本電気株式会社
Priority date: 2017-11-10
Filing date: 2018-11-07
Publication date: 2019-05-16

Abstract

In order to continuously perform data transfer and arithmetic processing, and to improve the operating rate of an arithmetic unit, this data processing device comprises: a first annular bus; a transfer element group that includes a plurality of transfer elements connected in series by the first annular bus; a transfer control means that is connected to at least two of the transfer elements via the first annular bus, and is connected to an external memory; a second annular bus that is independent from the first annular bus; a processing element group that includes a plurality of processing elements connected in series by the second annular bus; an overall control means that is connected to at least two of the processing elements via the second annular bus; and an internal memory group that includes a plurality of internal memories connected to corresponding transfer elements and processing elements.

Description

Data processor

The present invention relates to a data processing apparatus that performs data transfer and arithmetic processing.

In analysis processing of big data and the like, calculations may be repeated on data sets in which hundreds of millions of data items are organized into millions of entries. For example, operations such as matrix multiplication, vector-matrix multiplication, and element-by-element multiplication of vectors may be performed on a matrix of several million dimensions by several hundred dimensions. For such operations, methods using a central processing unit (CPU) and methods using general-purpose computing on graphics processing units (GPGPU) are being studied. However, CPUs and GPGPUs have the problem that power consumption increases as performance improves.

In order to reduce power consumption, a method using an FPGA (Field Programmable Gate Array), which is a power efficient device, has attracted attention. The FPGA is provided with a general-purpose logic element called a LUT (Look Up Table) and a variable wiring network connecting between a plurality of LUTs. In the FPGA, various arithmetic devices can be realized by rewriting the contents of the LUT and the wiring network.

In addition to LUTs, some FPGAs also carry dedicated resources such as digital signal processors (DSPs) and static random access memories (SRAMs). Such an FPGA can realize an efficient computing device. However, because the physical positions of the DSPs and SRAMs are fixed, the wiring becomes congested unless the architecture uses them appropriately, and routes that bypass the congested portions make the wiring longer. In addition, when a long wire is routed across the entire FPGA, the wire delay increases and the operating frequency of the arithmetic device drops.

There are also FPGAs in which the SRAM module can be configured as a true dual-port RAM (Random Access Memory), i.e., a RAM whose two ports are completely independent. Such an SRAM module can be used as a memory having two independent sets of clock and address inputs. It is desirable to make full use of these features in order to maximize the capabilities of the FPGA.

As described above, in order to extract the performance without increasing the power consumption, it is desirable to realize an architecture that utilizes the characteristics of the FPGA.

Patent Document 1 discloses a neural network apparatus in which a plurality of ring registers for performing the product-sum operations of a neural network are connected in a ring. The device disclosed in Patent Document 1 includes a ring register path formed by connecting, in a ring, a plurality of ring registers having a transfer function; a plurality of arithmetic devices, at least one of which is connected to each of the ring registers; and a plurality of storage devices connected to the arithmetic devices.

Non-Patent Document 1 discloses a method of performing matrix operation using an FPGA.

Patent Document 2 discloses a parallel computer that processes matrix products at high speed. The computer of Patent Document 2 includes a plurality of processor elements, and a control device that distributes data to each processor element and collects operation results. Further, the computer of Patent Document 2 includes a first communication path connecting the control device and each processor element, and a second communication path connecting the logically adjacent processor elements.

Patent Document 1: JP-A-5-101031
Patent Document 2: JP-A-9-62656

According to the apparatus of Patent Document 1, since there are two ring register paths, input data can be set using one ring register path while an operation is performed using the other. However, in the apparatus of Patent Document 1, processing that combines two or more types of data requires either temporarily saving the data to an external memory or interrupting the calculation to store the data. Further, the apparatus of Patent Document 1 has few degrees of freedom and cannot be applied to big-data analysis processing that repeats various operations. Further, in the device of Patent Document 1, the group of arithmetic devices performs only a single type of operation, and that operation has to be distributed to all of them simultaneously from the control unit that controls the whole.

According to the method of Non-Patent Document 1, matrix products can be calculated using processing elements arranged in one dimension. However, in the method of Non-Patent Document 1, since there is only one transfer path of data, it has been necessary to stop the computing unit while inputting the input data or extracting the calculation result.

According to the apparatus of Patent Document 2, matrix product calculation can be accelerated without using expensive semiconductor devices as in a vector computer and without using a network that requires complicated and sophisticated mounting technology as in a massively parallel computer. However, the device of Patent Document 2 has a problem that the control by the control device becomes complicated as the number of processor elements is increased.

In order to solve the problems described above, an object of the present invention is to provide a data processing apparatus that can execute data transfer and arithmetic processing continuously and thereby improve the operating rate of the arithmetic unit.

A data processing apparatus according to one aspect of the present invention includes: a first annular bus; a transfer element group including a plurality of transfer elements connected in series by the first annular bus; transfer control means connected to at least two of the transfer elements via the first annular bus and connected to an external memory; a second annular bus independent of the first annular bus; a processing element group including a plurality of processing elements connected in series by the second annular bus; overall control means connected to at least two of the processing elements via the second annular bus; and an internal memory group including a plurality of internal memories connected to corresponding transfer elements and processing elements.

According to the present invention, it is possible to provide a data processing apparatus capable of continuously executing data transfer and arithmetic processing to improve the operation rate of a computing unit.

A block diagram showing the configuration of the data processing apparatus according to the first embodiment of the present invention.
A block diagram showing the configuration of a transfer element included in the data processing apparatus according to the first embodiment of the present invention.
A conceptual diagram showing a configuration example of the transfer data transferred on the first ring bus of the data processing apparatus according to the first embodiment of the present invention.
A table summarizing an example of the transfer data transferred on the first ring bus of the data processing apparatus according to the first embodiment of the present invention.
A block diagram showing the configuration of an internal memory included in the data processing apparatus according to the first embodiment of the present invention.
A block diagram showing the configuration of a processing element included in the data processing apparatus according to the first embodiment of the present invention.
A conceptual diagram showing a configuration example of an operation instruction handled by the data processing apparatus according to the first embodiment of the present invention.
A table summarizing examples of the operation instructions handled by the data processing apparatus according to the first embodiment of the present invention.
A block diagram showing the configuration of the overall control unit included in the data processing apparatus according to the first embodiment of the present invention.
A conceptual diagram showing a configuration example of a command stored in the command memory of the overall control unit included in the data processing apparatus according to the first embodiment of the present invention.
A conceptual diagram showing a configuration example of a command of the overall control unit included in the data processing apparatus according to the first embodiment of the present invention.
A table summarizing examples of the commands of the overall control unit included in the data processing apparatus according to the first embodiment of the present invention.
A block diagram showing the configuration of the transfer control unit included in the data processing apparatus according to the first embodiment of the present invention.
A conceptual diagram for explaining an example of the data transfer performed by the transfer control unit included in the data processing apparatus according to the first embodiment of the present invention.
A flowchart for explaining the operation of a processing element included in the data processing apparatus according to the first embodiment of the present invention.
An example of the matrix product formula calculated by the data processing apparatus according to the first embodiment of the present invention.
A diagram for explaining an example of how data elements are held when the data processing apparatus according to the first embodiment of the present invention calculates a matrix product.
A diagram of an example of the assembly program used when the data processing apparatus according to the first embodiment of the present invention calculates a matrix product.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates a matrix product.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates a matrix product.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates a matrix product.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates a matrix product.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates a matrix product.
An example of the formula for the inner product of vectors calculated by the data processing apparatus according to the first embodiment of the present invention.
A diagram for explaining an example of how data elements are held when the data processing apparatus according to the first embodiment of the present invention calculates the inner product of vectors.
A diagram of an example of the assembly program used when the data processing apparatus according to the first embodiment of the present invention calculates the inner product of vectors.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates the inner product of vectors.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates the inner product of vectors.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates the inner product of vectors.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates the inner product of vectors.
A diagram for explaining the operation when the data processing apparatus according to the first embodiment of the present invention calculates the inner product of vectors.
A time chart for explaining an example in which the data processing apparatus according to the first embodiment of the present invention performs data transfer and data processing in parallel.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The embodiments described below include limitations that are technically preferable for carrying out the present invention, but the scope of the invention is not limited to the following. In all the drawings used in the following description, the same parts are given the same reference numerals unless there is a particular reason not to. In the following embodiments, descriptions of the same configuration and operation may be omitted rather than repeated. Further, the directions of the arrows in the drawings are examples and do not limit the directions of the signals between the blocks.

First Embodiment
First, a data processing apparatus according to a first embodiment of the present invention will be described with reference to the drawings. In the following, an example in which the data processing apparatus of the present embodiment is mounted on an FPGA (Field-Programmable Gate Array) will be described. Note that the data processing apparatus of the present embodiment may also be realized as a dedicated circuit (ASIC: Application Specific Integrated Circuit).

(Configuration)
FIG. 1 is a block diagram showing the configuration of the data processing apparatus 1 of the present embodiment. The data processing apparatus 1 includes a transfer element group 12, an internal memory group 13, a processing element group 14, a transfer control unit 15, an overall control unit 16, a first ring bus 17, and a second ring bus 18.

The transfer element group 12 includes a plurality of transfer elements 20 (transfer elements 20-1 to 20-n) connected in series by the first ring bus 17 (n is a natural number). Each transfer element 20 constituting the transfer element group 12 is connected to its adjacent transfer elements 20 via the first ring bus 17. Further, the input of the transfer element 20-1 and the output of the transfer element 20-n are connected to the transfer control unit 15 via the first ring bus 17.

Each of the plurality of transfer elements 20 writes the data included in the transfer data into the internal memory 30 corresponding to itself according to the analysis result of the transfer data transferred by the first ring bus 17. Also, the transfer element 20 transmits transfer data to the adjacent transfer element 20 via the first ring bus 17. Also, each of the plurality of transfer elements 20 reads output data from the internal memory 30 corresponding to itself. The transfer element 20 transmits the read output data to the transfer control unit 15 through the first ring bus 17.

The processing element group 14 includes a plurality of processing elements 40 (processing elements 40-1 to 40-n) connected in series by the second annular bus 18 (n is a natural number). The processing elements 40 constituting the processing element group 14 are connected to the adjacent processing elements 40 by the second annular bus 18. The input of the processing element 40-1 and the output of the processing element 40-n are connected to the overall control unit 16 via the second annular bus 18.

Each of the plurality of processing elements 40 reads data from its corresponding internal memory 30 in accordance with the operation instruction received from the overall control unit 16 via the second ring bus 18. The processing element 40 then writes the result of the operation performed on the read data into the internal memory 30 as output data.

The internal memory group 13 includes a plurality of internal memories 30 (internal memories 30-1 to 30-n) (n is a natural number). Each internal memory 30 constituting the internal memory group 13 is connected between its corresponding transfer element 20 and processing element 40. That is, the internal memories 30-1 to 30-n are connected to the transfer elements 20-1 to 20-n and the processing elements 40-1 to 40-n, respectively.

The transfer control unit 15 (also referred to as transfer control means) is connected to the transfer element group 12 via the first ring bus 17. That is, the transfer control unit 15 is connected to at least two transfer elements 20 constituting the transfer element group 12 via the first ring bus 17. Since the transfer elements 20 adjacent to each other are connected via the first ring bus 17, the transfer control unit 15 is connected, via the first ring bus 17, to the input of the transfer element 20-1 and the output of the transfer element 20-n.

The transfer control unit 15 is also connected to the external memory 100. The transfer control unit 15 reads the data to be processed from the external memory 100 as input data and transmits the input data to the transfer element group 12 through the first ring bus 17. The transfer control unit 15 also writes, to the external memory 100, the output data of the internal memory group 13 that it receives via the first ring bus 17.

The overall control unit 16 (also referred to as overall control means) is connected to the processing element group 14 via the second annular bus 18. That is, the overall control unit 16 is connected to at least two processing elements 40 via the second annular bus 18. The overall control unit 16 transmits an operation instruction to the processing element group 14 through the second ring bus 18. The transfer control unit 15 and the overall control unit 16 are connected to each other.

The first ring bus 17 (also referred to as the first annular bus 17) is a one-dimensional ring bus. The first ring bus 17 connects the plurality of transfer elements 20 included in the transfer element group 12 in series. Further, the first ring bus 17 is connected to the transfer control unit 15.

The second ring bus 18 (also referred to as the second annular bus 18) is a one-dimensional ring bus independent of the first ring bus 17. The second ring bus 18 connects the plurality of processing elements 40 included in the processing element group 14 in series. The second ring bus 18 is also connected to the overall control unit 16.

The above is a schematic description of the configuration of the data processing device 1. The components of the data processor 1 will be individually described below.
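
As a rough illustration of the configuration described above, the following Python sketch lists the hop order of the two independent ring buses and the pairing of each internal memory with its transfer element and processing element. The function names ring_order and build_topology and the string labels are hypothetical; they only restate the connections given in the text.

```python
def ring_order(controller: str, element_prefix: str, n: int) -> list[str]:
    """Hop sequence of a one-dimensional ring bus that leaves the controller,
    passes through elements 1..n in series, and returns to the controller."""
    elements = [f"{element_prefix}-{i}" for i in range(1, n + 1)]
    return [controller] + elements + [controller]


def build_topology(n: int) -> dict:
    """Assemble both rings and the per-index memory links for n element pairs."""
    return {
        # First ring bus 17: transfer control unit 15 and transfer elements 20-1..20-n.
        "first_ring": ring_order("transfer_control_15", "transfer_element_20", n),
        # Second ring bus 18: overall control unit 16 and processing elements 40-1..40-n.
        "second_ring": ring_order("overall_control_16", "processing_element_40", n),
        # Internal memory 30-i sits between transfer element 20-i and processing element 40-i.
        "memory_links": {
            f"internal_memory_30-{i}": (
                f"transfer_element_20-{i}",
                f"processing_element_40-{i}",
            )
            for i in range(1, n + 1)
        },
    }


topology = build_topology(3)
for name, ring in (("first", topology["first_ring"]), ("second", topology["second_ring"])):
    print(name, " -> ".join(ring))
```

The two rings share no element; the only coupling between the transfer side and the processing side in this sketch is the internal memory attached to each pair.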

[Transfer element]
FIG. 2 is a block diagram showing the configuration of the transfer elements 20-1 to 20-n included in the transfer element group 12. Hereinafter, the transfer elements 20-1, 20-2, ..., 20-n will be referred to as the transfer element 20 when they need not be distinguished. Although FIG. 2 illustrates adjacent transfer elements 20 as being connected to each other, the input of the transfer element 20-1 and the output of the transfer element 20-n are connected to the transfer control unit 15.

As shown in FIG. 1, the transfer element 20 is connected to the first ring bus 17. As shown in FIG. 2, the transfer element 20 includes a ring bus register 21, which forms a part of the first ring bus 17, and a memory interface unit 22. The ring bus register 21 includes a first register unit 211, a second register unit 212, and a third register unit 213.

The ring bus register 21 (also referred to as a first ring bus register) analyzes the transfer data transferred from the transfer element 20 of the previous stage through the first ring bus 17. The ring bus register 21 issues an access instruction for the internal memory 30 to the memory interface unit 22 according to the analysis result of the transfer data. When it issues a write instruction for the internal memory 30 to the memory interface unit 22, the ring bus register 21 forwards the transfer data to the transfer element 20 of the next stage as it is. On the other hand, when it issues a read instruction for the internal memory 30 to the memory interface unit 22, the ring bus register 21 updates the transfer data with the data read from the internal memory 30 and forwards the updated transfer data to the transfer element 20 of the next stage.

FIG. 3 is a conceptual diagram showing a configuration example (transfer data 170) of the transfer data flowing on the first ring bus 17. The transfer data 170 includes a command field cmd, an identification field peid, an address field addr, and a data field data. The command field cmd represents the type of data transfer (such as reading from an external memory or writing to an external memory). The identification field peid identifies the transfer element 20 to which the transfer data is addressed. The address field addr indicates which address in the internal memory 30 is to be accessed. The data field data holds the data to be read from or written to the internal memory 30.

FIG. 4 is a table summarizing an example of transfer data flowing on the first ring bus 17. FIG. 4 shows an example of the transfer data when eight 32-bit data items are read from the external memory 100 and stored in order at address 0 of the internal memories 30-1 to 30-8. A command field cmd of 0x1 indicates that data from the external memory 100 is to be written to the internal memory 30.
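
The kind of transfer data sequence summarized in FIG. 4 can be sketched in Python as follows. This is only an illustration: the field widths other than the 32-bit data are not modeled, the mapping of peid values 1 to 8 onto the internal memories, the data values, and the names TransferData and scatter_words are assumptions made for this sketch.

```python
from typing import Iterable, NamedTuple

CMD_WRITE_INTERNAL = 0x1  # given in the text: external memory -> internal memory


class TransferData(NamedTuple):
    """One unit of transfer data on the first ring bus (fields as in FIG. 3)."""
    cmd: int   # type of data transfer
    peid: int  # identifier of the targeted transfer element (numbering assumed)
    addr: int  # address inside the internal memory
    data: int  # 32-bit payload


def scatter_words(words: Iterable[int], addr: int = 0, first_peid: int = 1) -> list[TransferData]:
    """Build one write request per word, each aimed at the next element's memory."""
    return [
        TransferData(CMD_WRITE_INTERNAL, first_peid + i, addr, word & 0xFFFFFFFF)
        for i, word in enumerate(words)
    ]


# Eight hypothetical 32-bit words read from the external memory.
for td in scatter_words([0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80]):
    print(f"cmd=0x{td.cmd:x} peid={td.peid} addr={td.addr} data=0x{td.data:08x}")
```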

The first register unit 211 (also referred to as a first register) analyzes the transfer data transferred from the transfer element 20 of the previous stage. The first register unit 211 issues an access instruction for the internal memory 30 to the memory interface unit 22 according to the analysis result of the transfer data. When the identification field peid of the transfer data received from the transfer element 20 of the previous stage matches the identifier of the first register unit 211, the first register unit 211 determines that the command is addressed to itself. Then, when the command field cmd is a write command to the internal memory 30, the first register unit 211 sends the value of the data field data, the address of the address field addr, and a write instruction to the memory interface unit 22. If the command field cmd is a read command from the internal memory 30, the first register unit 211 sends the address of the address field addr and a read instruction to the memory interface unit 22.

The memory interface unit 22 (also referred to as a first memory interface) accesses the internal memory 30 in accordance with the instruction received from the first register unit 211. When the memory interface unit 22 receives a write instruction from the first register unit 211, the memory interface unit 22 writes data in the internal memory 30 according to the received write instruction. When the memory interface unit 22 receives a read instruction from the first register unit 211, the memory interface unit 22 reads data from the internal memory 30 according to the received read instruction. Then, the memory interface unit 22 sends the data read from the internal memory 30 to the third register unit 213.

The second register unit 212 (also referred to as a second register) is a buffer that is set in accordance with the access latency of the internal memory 30. The second register unit 212 transfers the transfer data transferred from the first register unit 211 to the third register unit 213. The second register unit 212 may be configured as a plurality of stages of shift registers in accordance with the access latency of the internal memory 30.

The third register unit 213 (also referred to as a third register) transfers the transfer data received from the second register unit 212 to the transfer element 20 of the next stage. When data is written to the internal memory 30, the third register unit 213 sends the transfer data that has arrived via the second register unit 212 to the transfer element 20 of the next stage as it is. When data is read from the internal memory 30, the third register unit 213 replaces the data field data of the transfer data that has arrived via the second register unit 212 with the data read from the internal memory 30, and sends the result to the transfer element 20 of the next stage.
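
The behavior of the transfer element described above can be sketched as a small Python model. It is a functional sketch only: the pipeline latency absorbed by the second register unit 212 is not modeled, the read command value 0x2 is a hypothetical placeholder (the text gives only 0x1 for writes), and the class and function names are assumptions.

```python
from dataclasses import dataclass, replace

CMD_WRITE = 0x1  # value given in the text
CMD_READ = 0x2   # hypothetical value, not given in the text


@dataclass(frozen=True)
class TransferData:
    cmd: int
    peid: int
    addr: int
    data: int


class TransferElement:
    def __init__(self, ident: int, mem_words: int = 16) -> None:
        self.ident = ident
        self.memory = [0] * mem_words  # stands in for the internal memory 30

    def step(self, td: TransferData) -> TransferData:
        """Analyze incoming transfer data and return what is sent to the next stage."""
        if td.peid != self.ident:
            return td                      # not addressed to this element: pass through
        if td.cmd == CMD_WRITE:
            self.memory[td.addr] = td.data
            return td                      # write: forwarded as it is
        if td.cmd == CMD_READ:
            return replace(td, data=self.memory[td.addr])  # read: data field replaced
        return td


def send_around_ring(elements: list[TransferElement], td: TransferData) -> TransferData:
    """Pass one unit of transfer data through every element in series."""
    for element in elements:
        td = element.step(td)
    return td


ring = [TransferElement(i) for i in range(1, 5)]
send_around_ring(ring, TransferData(CMD_WRITE, peid=3, addr=0, data=0xABCD))
returned = send_around_ring(ring, TransferData(CMD_READ, peid=3, addr=0, data=0))
print(hex(returned.data))
```

In this sketch the transfer data that comes back to the start of the ring after a read carries the word stored in the addressed element's memory, which corresponds to the output data returned to the transfer control unit 15.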

[Internal memory]
FIG. 5 is a block diagram showing the configuration of the internal memory 30. The arrows between the blocks shown in FIG. 5 conceptually indicate the flow of the write instruction, the address, the read data, and the write data, and do not limit their directions.

The internal memory 30 includes a dual port memory 31. The dual port memory 31 has two access ports: a port A 311 (hereinafter referred to as port A) and a port B 312 (hereinafter referred to as port B). A signal line from the transfer element 20 is connected to the port A (also referred to as a first port). On the other hand, a signal line from the processing element 40 is connected to the port B (also referred to as a second port). These signal lines are wires for transmitting write and read addresses, write instructions, write data, read data, and the like.
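
The role of the two ports can be illustrated with a minimal Python model in which both ports access the same storage. The class name DualPortMemory and its method signatures are hypothetical, and the sketch does not model clocks, simultaneous-access timing, or write collisions.

```python
from typing import Optional


class DualPortMemory:
    """Toy model of the dual port memory 31: two ports share one storage array."""

    def __init__(self, words: int) -> None:
        self._cells = [0] * words

    def _access(self, addr: int, write_data: Optional[int]) -> Optional[int]:
        if write_data is not None:
            self._cells[addr] = write_data  # write access
            return None
        return self._cells[addr]            # read access

    def port_a(self, addr: int, write_data: Optional[int] = None) -> Optional[int]:
        """Port A: the side connected to the transfer element 20."""
        return self._access(addr, write_data)

    def port_b(self, addr: int, write_data: Optional[int] = None) -> Optional[int]:
        """Port B: the side connected to the processing element 40."""
        return self._access(addr, write_data)


mem = DualPortMemory(words=8)
mem.port_a(0, write_data=21)           # transfer element stores input data
operand = mem.port_b(0)                # processing element reads it
mem.port_b(1, write_data=operand * 2)  # processing element stores its result
result = mem.port_a(1)                 # transfer element reads the output data
print(result)
```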

[Processing element]
FIG. 6 is a block diagram showing the configuration of the processing element 40. Hereinafter, the processing elements 40-1, 40-2, ..., 40-n will be referred to as the processing element 40 when they need not be distinguished. Although FIG. 6 illustrates adjacent processing elements 40 as being connected to each other, the input of the processing element 40-1 and the output of the processing element 40-n are connected to the overall control unit 16.

As shown in FIG. 6, the processing element 40 includes a ring bus register 41, an instruction decoder 42, a memory interface unit 43, and an arithmetic unit 44.

The ring bus register 41 (also referred to as a second ring bus register) is connected to the second ring bus 18 and forms part of it. The ring bus register 41 is connected to the instruction decoder 42. The ring bus register 41 may be a single register or a shift register composed of a plurality of stages. The ring bus register 41 receives an operation instruction from the processing element 40 of the previous stage via the second ring bus 18 and sends the received operation instruction to the processing element 40 of the next stage. Of the received operation instructions, the ring bus register 41 sends those to be processed by its own processing element 40 to the instruction decoder 42.

The instruction decoder 42 is connected to the ring bus register 41. Also, the instruction decoder 42 is connected to the memory interface unit 43 and the arithmetic unit 44. The instruction decoder 42 analyzes the operation instruction received from the ring bus register 41 and generates a control signal according to the operation instruction. The instruction decoder 42 outputs the generated control signal to the memory interface unit 43 and the arithmetic unit 44.

The memory interface unit 43 (also referred to as a second memory interface) is connected to the instruction decoder 42 and the arithmetic unit 44. Also, the memory interface unit 43 is connected to the internal memory 30. The memory interface unit 43 reads data from the internal memory 30 in response to a control signal from the instruction decoder 42, and transmits the read data to the arithmetic unit 44. Also, the memory interface unit 43 writes the operation result of the arithmetic unit 44 to the internal memory 30 as output data.

The arithmetic unit 44 is connected to the instruction decoder 42 and the memory interface unit 43. The arithmetic unit 44 executes an operation on the data received from the memory interface unit 43 in response to a control signal from the instruction decoder 42. The arithmetic unit 44 transmits the operation result to the memory interface unit 43. For example, the arithmetic unit 44 can be realized by a DSP (Digital Signal Processor) of an FPGA.

Note that the function of the processing element 40 is not limited to the above description, and functions easily conceived by a person skilled in the art may be added. For example, a register file may be provided in the arithmetic unit 44 so that operations can be performed on the registers in the register file.

FIG. 7 is a conceptual diagram showing a configuration example (operation instruction 420) of an operation instruction. For example, the operation instruction 420 includes fields of an 8-bit opcode opc, a first source operand rs, a second source operand rt, a destination operand rd, and a 32-bit immediate operand imm.
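
As a rough illustration of the field layout in FIG. 7, the following Python sketch packs and unpacks an operation instruction. The text gives only the widths of opc (8 bits) and imm (32 bits); the 8-bit widths of rs, rt, and rd, the field order from the most significant bits, and the names OperationInstruction, encode, and decode are assumptions made here so that the instruction fits in a 64-bit word.

```python
from typing import NamedTuple


class OperationInstruction(NamedTuple):
    opc: int  # 8-bit opcode
    rs: int   # first source operand (8 bits assumed)
    rt: int   # second source operand (8 bits assumed)
    rd: int   # destination operand (8 bits assumed)
    imm: int  # 32-bit immediate operand


def encode(instr: OperationInstruction) -> int:
    """Pack the fields into a 64-bit word, opc in the most significant byte."""
    return (
        (instr.opc & 0xFF) << 56
        | (instr.rs & 0xFF) << 48
        | (instr.rt & 0xFF) << 40
        | (instr.rd & 0xFF) << 32
        | (instr.imm & 0xFFFFFFFF)
    )


def decode(word: int) -> OperationInstruction:
    """Split a 64-bit word back into its fields."""
    return OperationInstruction(
        opc=(word >> 56) & 0xFF,
        rs=(word >> 48) & 0xFF,
        rt=(word >> 40) & 0xFF,
        rd=(word >> 32) & 0xFF,
        imm=word & 0xFFFFFFFF,
    )


word = encode(OperationInstruction(opc=0x01, rs=0x00, rt=0x40, rd=0x80, imm=0))
print(f"0x{word:016x}", decode(word))
```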

FIG. 8 is a table summarizing examples of operation instructions. The table of FIG. 8 shows the values of the opcode opc and the operation corresponding to each value. For example, opc = 0x01 represents the addition instruction ADD, which adds the data at address rs of the internal memory 30 to the data at address rt and writes the operation result to address rd. Descriptions of the other instructions are omitted here, except that the instructions with the operation codes MACI and MACR will be described later.

Here, the operation instructions included in the table of FIG. 8 will be described by giving two examples.

The first example is an operation instruction represented by the following equation 1.
0x0100408000000000 → (opc = 0x01) mem[0x80] ← mem[0x00] + mem[0x40] ... (1)
In the first example, opc = 0x01, rs = 0x00, rt = 0x40, and rd = 0x80, so the instruction adds the data at address 0x00 of the internal memory 30 to the data at address 0x40 and writes the result to address 0x80. In the first example, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to read data from the addresses 0x00 and 0x40 of the internal memory 30. Thereafter, the instruction decoder 42 outputs a control signal instructing the arithmetic unit 44 to perform an addition operation on the input data supplied from the memory interface unit 43. Then, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to write the output data of the arithmetic unit 44 to the address 0x80 of the internal memory 30.

The second example is an operation instruction represented by the following Equation 2.
0x0722004612345678 → (opc = 0x07) mem[0x46] ← mem[0x22] * 0x12345678 ... (2)
In the second example, opc = 0x07, rs = 0x22, rd = 0x46, and imm = 0x12345678, so the instruction multiplies the data at address 0x22 of the internal memory 30 by the value of the immediate field imm and writes the result to address 0x46. In the second example, the instruction decoder 42 outputs, to the memory interface unit 43, a control signal instructing it to read data from the address 0x22 of the internal memory 30. Thereafter, the instruction decoder 42 outputs, to the arithmetic unit 44, a control signal instructing it to multiply the input data supplied from the memory interface unit 43 by the value of the immediate field imm. Then, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to write the output data of the arithmetic unit 44 to the address 0x46 of the internal memory 30.
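
The read, operate, and write sequence of the two examples can be traced with a short Python sketch. It takes already-separated fields as input and uses a plain list in place of the internal memory 30; the function name execute, the constant names, the memory size, and the operand values are assumptions, and data width and overflow behavior are not modeled.

```python
OPC_ADD = 0x01      # addition instruction ADD, as stated in the text
OPC_MUL_IMM = 0x07  # opcode of the second example; its mnemonic is not given in the text


def execute(memory: list[int], opc: int, rs: int, rt: int, rd: int, imm: int) -> None:
    """Carry out one operation instruction against the internal memory."""
    if opc == OPC_ADD:
        a, b = memory[rs], memory[rt]  # memory interface reads two operands
        memory[rd] = a + b             # arithmetic unit adds, result written to rd
    elif opc == OPC_MUL_IMM:
        a = memory[rs]                 # memory interface reads one operand
        memory[rd] = a * imm           # arithmetic unit multiplies by the immediate
    else:
        raise ValueError(f"opcode 0x{opc:02x} is not covered by this sketch")


memory = [0] * 256
memory[0x00], memory[0x40], memory[0x22] = 5, 7, 3  # hypothetical operand values

execute(memory, opc=0x01, rs=0x00, rt=0x40, rd=0x80, imm=0)           # first example
execute(memory, opc=0x07, rs=0x22, rt=0x00, rd=0x46, imm=0x12345678)  # second example
print(memory[0x80], hex(memory[0x46]))
```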

[Overall control unit]
FIG. 9 is a block diagram showing the configuration of the overall control unit 16. As shown in FIG. 9, the overall control unit 16 has a program counter 61, a command memory 62, a command decoder 63, and an overall control unit data path 64. The command decoder 63 is connected to the first processing element 40-1. The overall control unit data path 64 is connected to the last processing element 40-n. The overall control unit 16 operates in the same manner as a general instruction set processor.

The program counter 61 stores a value indicating a command to be executed next. If the content of the command is other than a branch instruction, the program counter 61 is automatically incremented. On the other hand, when the content of the command is a branch instruction, the value of the program counter 61 is changed in accordance with the branch instruction.

The command memory 62 stores commands, each of which includes a flag indicating the entity that is to execute the instruction. The command memory 62 outputs the command corresponding to the value of the program counter 61 to the command decoder 63.

The command decoder 63 analyzes the command output from the command memory 62 and generates a control signal according to the analysis result. When the command decoder 63 interprets the command as an instruction of the overall control unit 16, the command decoder 63 outputs the generated control signal to the overall control unit data path 64. On the other hand, when the command decoder 63 interprets the command as an instruction of the processing element 40, the command decoder 63 outputs the generated control signal to the processing element 40-1 of the first stage included in the processing element group 14.

The overall control unit data path 64 (also referred to as an overall control data path) performs an operation according to the content of the command in accordance with the control signal generated by the command decoder 63. For example, the overall control unit data path 64 performs operations such as addition and branching. The overall control unit data path 64 may include elements included in a general instruction set processor such as a register file. If the content of the command is a branch instruction, the overall control unit data path 64 changes the value of the program counter 61 in accordance with the branch instruction.

FIG. 10 is a conceptual diagram showing a configuration example of the command 620 stored in the command memory 62. The command 620 in the example of FIG. 10 includes a 1-bit flag pf and a 64-bit instruction inst. When pf is 0, the command is interpreted as an instruction of the overall control unit 16. On the other hand, when pf is 1, the command is interpreted as an instruction of the processing element 40. For a command 620 whose pf is 1, the command decoder 63 shown in FIG. 9 transmits the instruction inst to the first processing element 40-1 on the second ring bus 18.
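
The routing by the flag pf can be illustrated with a short Python sketch. Placing pf in the bit directly above the 64-bit inst, and the names make_command and route, are assumptions made for illustration; the text only states that the command consists of a 1-bit pf and a 64-bit inst.

```python
INST_BITS = 64
INST_MASK = (1 << INST_BITS) - 1


def make_command(pf: int, inst: int) -> int:
    """Pack a 1-bit flag and a 64-bit instruction (assumed: pf is bit 64)."""
    return ((pf & 1) << INST_BITS) | (inst & INST_MASK)


def route(command: int) -> tuple:
    """Return the destination of the command and the instruction it carries."""
    pf = (command >> INST_BITS) & 1
    inst = command & INST_MASK
    # pf == 0: executed by the overall control unit itself.
    # pf == 1: sent to the first processing element on the second ring bus.
    destination = "processing_element" if pf else "overall_control_unit"
    return destination, inst
```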

Further, in FIG. 9, the operation instruction that has arrived from the last processing element 40-n is stored in a register (not shown) in the overall control unit data path 64. The storage destination of the operation instruction may be a specific register in the register file or a dedicated register. Alternatively, the overall control unit data path 64 may be provided with a dedicated FIFO (First In First Out) buffer for storing operation instructions, or the destination register in the register file may be designated separately by a flag or the like in the operation instruction.

FIG. 11 is a conceptual diagram showing a configuration example of the instruction 160 of the overall control unit 16. For example, the instruction 160 of the overall control unit 16 includes fields for an operation code opc, a first source operand rs, a second source operand rt, a destination operand rd, and an immediate operand imm. In the example of FIG. 11, the operation code opc is 8 bits, the first source operand rs is 5 bits, the second source operand rt is 5 bits, the destination operand rd is 5 bits, and the immediate operand imm is 32 bits. Note that the instruction 160 of the overall control unit 16 of FIG. 11 may be stored left-justified in the 64-bit inst of the command 620.
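
The field widths above total 8 + 5 + 5 + 5 + 32 = 55 bits, so left-justified storage in a 64-bit inst leaves the 9 least significant bits unused. The following Python sketch illustrates this packing. The field order (opc first, imm last) and the names FIELDS, pack, and unpack are assumptions made for illustration.

```python
# (name, width in bits), assumed to be laid out from the most significant bit.
FIELDS = (("opc", 8), ("rs", 5), ("rt", 5), ("rd", 5), ("imm", 32))
INST_BITS = 64


def pack(**values: int) -> int:
    """Pack the fields and left-justify them in a 64-bit word."""
    word = 0
    used = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
        used += width
    return word << (INST_BITS - used)  # 55 bits used, 9 padding bits


def unpack(word: int) -> dict:
    """Recover the fields from a left-justified 64-bit word."""
    fields = {}
    shift = INST_BITS
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields
```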

FIG. 12 is a table summarizing examples of instructions of the overall control unit 16. RF[rs] represents the value of the register at the index specified by rs in the register file. PC represents the program counter value. dmactrl represents the instruction register of the transfer control unit 15. dmastatus represents the status register of the transfer control unit 15. “{RF[rs], RF[rt]}” represents a value obtained by concatenating the two register values RF[rs] and RF[rt]. Although the bit width of the registers in the register file is assumed to be 32 bits in the example of FIG. 12, the bit width of the registers is not limited to 32 bits.

[Transfer control unit]
FIG. 13 is a block diagram showing the configuration of the transfer control unit 15. As shown in FIG. 13, the transfer control unit 15 includes an instruction register 51, a status register 52, and a control circuit 53. The instruction register 51 and the status register 52 are connected to the overall control unit 16. The control circuit 53 is connected to the external memory 100. The control circuit 53 is also connected to the first transfer element 20-1 and the last transfer element 20-n.

The instruction register 51 includes a plurality of register fields such as eaddr indicating an external memory address, iaddr indicating an internal memory address, num indicating the number of transfer data, and dir indicating a transfer direction. For example, in the case of dir == 0, it represents transfer of num data from the iaddr address of the internal memory 30 to the eaddr address of the external memory 100. Also, in the case of dir == 1, it represents transfer of num data from the eaddr address of the external memory to the iaddr address of the internal memory.

The status register 52 holds a value indicating whether the transfer of transfer data on the first ring bus 17 is in progress or has been completed.

The control circuit 53 is connected to the external memory 100 and receives data to be processed from the external memory 100. The control circuit 53 transmits the input data to the first-stage transfer element 20-1 included in the transfer element group 12 through the first ring bus 17. The control circuit 53 also writes the output data received from the internal memory group 13 via the first ring bus 17 to the external memory 100.

The control circuit 53 starts a transfer if the instruction register 51 contains a valid transfer instruction. In addition, the control circuit 53 reflects, in the status register 52 as needed, a value indicating whether the transfer is in progress or has been completed, and thereby notifies the overall control unit 16 of the transfer state. That is, when the instruction register 51 contains a valid transfer instruction, the control circuit 53 transfers data between the external memory 100 and the transfer element group 12 and updates the value of the status register 52.

The control circuit 53 writes a value in the instruction register 51 according to the ivkdma instruction of the overall control unit 16 shown in FIG. 12. Further, the control circuit 53 reads the value of the status register 52 according to the chkdma instruction of the overall control unit 16 shown in FIG. 12.

FIG. 14 is a conceptual diagram showing an example of the transfer of transfer data by the transfer control unit 15. In FIG. 14, the case where the number of processing elements = 8, dir = 1, eaddr = 0x0, iaddr = 0x400, and num = 12 will be described as an example. dir = 1 represents a transfer from the external memory 100 to the internal memory 30. eaddr = 0x0 and num = 12 indicate that the data to be transferred is 12 pieces of data starting from address 0x0 of the external memory 100. iaddr = 0x400 indicates that the start address of the transfer destination is address 0x400 of each internal memory 30. Since the number of transfer data is 12, each of the first four storage areas (addresses 1 to 4) among the eight storage areas (addresses 1 to 8) of the internal memory 30 holds two pieces of data, and each of the remaining four storage areas (addresses 5 to 8) holds one piece of data.
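
The distribution in this example can be reproduced with a short Python sketch. It assumes that the data are handed out round-robin to eight destinations and that addresses advance by one per piece of data; neither assumption is stated in the text. The names TransferInstruction and distribute are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TransferInstruction:
    eaddr: int  # external memory address
    iaddr: int  # internal memory address
    num: int    # number of transfer data
    dir: int    # 0: internal to external, 1: external to internal


def distribute(instr: TransferInstruction, num_elements: int) -> dict:
    """Map each destination to a list of (internal address, external address).

    Assumption: the k-th piece of data goes to destination k mod num_elements
    and is placed at offset k // num_elements from iaddr.
    """
    placement = {element: [] for element in range(num_elements)}
    for k in range(instr.num):
        element = k % num_elements
        offset = k // num_elements
        placement[element].append((instr.iaddr + offset, instr.eaddr + k))
    return placement


placement = distribute(TransferInstruction(eaddr=0x0, iaddr=0x400, num=12, dir=1), 8)
counts = [len(placement[e]) for e in range(8)]
```

Under these assumptions, counts is [2, 2, 2, 2, 1, 1, 1, 1], which matches the two-and-one split described for FIG. 14.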

The above is the description of the components of the data processing device 1. The above configuration of the data processing apparatus 1 is an example, and various configurations may be added or deleted as long as the functions of the data processing apparatus 1 of the present embodiment can be exhibited.

(Operation)
Next, the operation of the processing element 40 will be described with reference to the drawings. FIG. 15 is a flowchart for explaining the operation of the processing element 40.

In FIG. 15, first, the processing element 40 determines whether or not an operation instruction has come from the processing element 40 of the previous stage (step S11).

When an operation instruction has arrived (Yes in step S11), the processing element 40 receives the operation instruction (step S12). On the other hand, when no operation instruction has arrived (No in step S11), the processing element 40 waits for the arrival of an operation instruction (returns to step S11).

Next, the processing element 40 performs an operation according to the received operation instruction (step S13). For example, the processing element 40 performs the following operations 1 to 4 according to the received operation instruction.
(1) Read values from the addresses shown in rs and rt in the internal memory 30.
(2) Perform an operation on the read value.
(3) Write the operation result to the address indicated by rd in the internal memory 30.
(4) Rewrite imm in the operation instruction with the operation result.

Then, the processing element 40 sends the updated operation instruction to the processing element 40 of the next stage (step S14).

If the transfer is continued (Yes in step S15), the process returns to step S11. When the transfer is completed (No in step S15), the process according to the flowchart of FIG. 15 ends.
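
Steps S13 and S14 can be sketched in Python as follows. This is a minimal model under stated assumptions: the operation instruction is represented as a dictionary, each internal memory 30 as a dictionary keyed by address, and the operation itself is passed in as a function. The names process and send_around_ring are hypothetical.

```python
import operator


def process(instr: dict, mem: dict, op) -> dict:
    """One pass of step S13 for a single processing element."""
    a = mem[instr["rs"]]             # (1) read the values at rs and rt
    b = mem[instr["rt"]]
    result = op(a, b)                # (2) perform the operation
    mem[instr["rd"]] = result        # (3) write the result to rd
    return {**instr, "imm": result}  # (4) rewrite imm with the result


def send_around_ring(instr: dict, memories: list, op=operator.add) -> dict:
    """Step S14 repeated: each element forwards the updated instruction."""
    for mem in memories:
        instr = process(instr, mem, op)
    return instr


memories = [{0: 1, 1: 2}, {0: 10, 1: 20}]
last = send_around_ring({"rs": 0, "rt": 1, "rd": 2, "imm": 0}, memories)
```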

The above is the description of the operation of the processing element 40 along the flowchart of FIG. 15. Subsequently, examples of calculation by the data processing device 1 will be described with reference to the drawings.

[Matrix product]
FIG. 16 is a calculation example of the matrix product by the data processing device 1. Here, an example will be described in which the data processing apparatus 1 calculates the matrix product of the matrix A of 3 rows and 2 columns and the matrix B of 2 rows and 8 columns to obtain a matrix C (= AB) of 3 rows and 8 columns.

As shown in FIG. 17, the six elements (A00 to A21) of the matrix A are stored in the registers RF[0] to RF[5] of the register file of the overall control unit 16. Although FIG. 17 illustrates an example in which the elements of the matrix A are stored in the register file, the present invention is not limited to this. For example, a memory such as a scratch pad memory may be provided in the overall control unit 16, and the elements of the matrix A may be stored in the scratch pad memory.

As shown in FIG. 17, the elements of the matrix B and the matrix C (B00 to B17, C00 to C27) are stored in the internal memory 30. For example, B17 represents the element in row 1, column 7 of the matrix B. In the example of FIG. 17, it is assumed that the transfer control unit 15 has read the matrix B, which is the input data, into the 0th address to the 4th address of the internal memory 30 in advance. The matrix C, which is the output data, is stored in the area of addresses 400 to 408, which is initialized to zero.

The MAC Immediate instruction shown in FIG. 8 is used to calculate the matrix product. For a MAC Immediate instruction, the overall control unit 16 sets the value of the rt-th register of its register file in imm and sends the operation instruction to the second ring bus 18. In the processing element 40, imm is multiplied by the value at address rs of the internal memory 30, and the product is accumulated at address rd of the internal memory 30.

FIG. 18 shows an example of an assembly program for the matrix product (assembly program 171). MACI is a mnemonic that represents a MAC Immediate instruction. For example, “MACI 1, 5, 0x402” represents a MAC Immediate instruction in which rs = 1, rt = 5, and rd = 0x402.

Here, an operation example for each cycle of the matrix product will be described with reference to FIGS. 19 to 23. FIGS. 19 to 23 show, for each cycle of the matrix multiplication, the values of the internal memory 30, the operation instructions (instructions 1 to 6 in FIG. 18) flowing through the second ring bus 18, and the values set in the imm fields of the operation instructions.

In cycle 1 (cyc1) shown in FIG. 19, the first processing element 40-1 on the second annular bus 18 receives instruction 1. Data A00 is set in the imm field of instruction 1. In accordance with instruction 1, "MACI 0, 0, 0x400", the processing element 40-1 multiplies the value at address 0 (B00) of the internal memory 30-1 corresponding to itself by A00 and accumulates the product, which is the operation result, at address 0x400. As shown in FIG. 20, “A00 * B00” as the operation result is stored at address 400 of the internal memory 30-1. "A00 * B00" indicates the product of "A00" and "B00".

In cycle 2 (cyc2) shown in FIG. 20, the processing element 40-1 receives the instruction 2 and the processing element 40-2 receives the instruction 1. The processing element 40-1 performs a product-sum operation of multiplying B10 and A01 and accumulating the product (A01 * B10) which is the operation result at address 400. The processing element 40-2 performs multiplication of B01 and A00, and accumulates the product (A00 * B01), which is the operation result, at address 400. As a result, as shown in FIG. 21, "A00 * B00 + A01 * B10" is stored at address 400 of the internal memory 30-1, and "A00 * B01" is stored at address 400 of the internal memory 30-2. "A00 * B00 + A01 * B10" indicates the sum of "A00 * B00" and "A01 * B10".

In cycle 3 (cyc3) shown in FIG. 21, the processing element 40-1 receives instruction 3, the processing element 40-2 receives instruction 2, and the processing element 40-3 receives instruction 1. The processing element 40-1 multiplies B00 by A10 and accumulates the product (A10 * B00), which is the operation result, at address 404. The processing element 40-2 performs a product-sum operation of multiplying B11 by A01 and accumulating the product (A01 * B11), which is the operation result, at address 400. The processing element 40-3 multiplies B02 by A00 and accumulates the product (A00 * B02), which is the operation result, at address 400. As a result, as shown in FIG. 22, "A10 * B00" is stored at address 404 of the internal memory 30-1, "A00 * B01 + A01 * B11" is stored at address 400 of the internal memory 30-2, and "A00 * B02" is stored at address 400 of the internal memory 30-3.

In cycle 4 (cyc4) shown in FIG. 22, the processing element 40-1 receives instruction 4, the processing element 40-2 receives instruction 3, the processing element 40-3 receives instruction 2, and the processing element 40-4 receives instruction 1. The processing elements 40-1 to 40-4 execute operations in the same manner as in FIGS. 19 to 21 and store the operation results at the designated addresses of the internal memories 30-1 to 30-4.

FIG. 23 shows the state of the internal memories 30-1 to 30-8 in cycle 14 (cyc14), when the matrix product calculation is completed. The operation results according to the operation instructions are stored at the respective addresses of the internal memories 30-1 to 30-8.

The above is the description regarding the example in which the data processing device 1 calculates the matrix product. In the matrix product operation described with reference to FIGS. 16 to 23, the overall control unit 16 outputs an operation instruction including a field storing immediate data to the second ring bus 18, and the processing element 40 performs an operation and stores the operation result in the internal memory 30. That is, the processing element 40 stores, in the internal memory 30 corresponding to itself, the operation result calculated using the immediate data received through the second ring bus 18 and the data stored in that internal memory 30.
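
The cycle-by-cycle behavior described above can be checked with a small Python simulation. It is a sketch under stated assumptions, not the circuit itself: each internal memory 30 is a dictionary, one instruction advances by one processing element per cycle, and the address layout (row k of B at address 4 * k, row i of C at address 0x400 + 4 * i) is assumed for illustration. The names maci and matmul_on_ring are hypothetical.

```python
def maci(instr: dict, mem: dict) -> None:
    """MAC Immediate: accumulate imm * mem[rs] at mem[rd]."""
    mem[instr["rd"]] = mem.get(instr["rd"], 0) + instr["imm"] * mem[instr["rs"]]


def matmul_on_ring(A: list, B: list) -> tuple:
    rows, inner, cols = len(A), len(B), len(B[0])
    b_base, c_base, stride = 0x0, 0x400, 4  # assumed address layout
    # Internal memory j holds column j of B (one processing element per column).
    mems = [{b_base + stride * k: B[k][j] for k in range(inner)}
            for j in range(cols)]
    # One MACI per element of A; imm carries A[i][k].
    program = [{"rs": b_base + stride * k, "rd": c_base + stride * i,
                "imm": A[i][k]}
               for i in range(rows) for k in range(inner)]
    cycles = len(program) + cols - 1
    for cyc in range(cycles):
        for pe in range(cols):
            idx = cyc - pe  # instruction currently at processing element pe
            if 0 <= idx < len(program):
                maci(program[idx], mems[pe])
    C = [[mems[j].get(c_base + stride * i, 0) for j in range(cols)]
         for i in range(rows)]
    return C, cycles


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]]
C, cycles = matmul_on_ring(A, B)
```

With six instructions and eight processing elements, the model needs 6 + 8 - 1 = 13 cycles of execution, so all results are in place when cycle 14 begins.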

[Inner product of vectors]
FIG. 24 is an example of calculation of the inner product of vectors by the data processing device 1. Here, an example will be described in which the data processing device 1 obtains an inner product d of a 1-row-8-column matrix A and a 1-row-8-column matrix B.

As shown in FIG. 25, the elements of matrix A and matrix B (A00 to A07, B00 to B07) are stored in the internal memory 30.

The MAC reduction instruction of FIG. 8 is used to calculate the inner product. The MAC reduction instruction is an instruction by which each processing element 40 multiplies the values at addresses rs and rt of the internal memory 30, accumulates the product into the value of imm in the operation instruction, and transfers the operation instruction to the next processing element 40. That is, with the MAC reduction instruction, each time the operation instruction passes through a processing element 40, the operation result is added to the value of the imm field of the operation instruction.

FIG. 26 shows an example of an assembly program for the inner product (assembly program 172). MACR is a mnemonic that represents a MAC reduction instruction. For example, “MACR 0, 4” represents a MAC reduction instruction in which rs = 0 and rt = 4.

Here, an operation example for each cycle in the inner product calculation of vectors will be described with reference to FIGS. 27 to 31.

In cycle 1 shown in FIG. 27, the value of the imm field of instruction 1 that has arrived at the processing element 40-1 is zero. In cycle 1, the processing element 40-1 calculates A00 * B00 and adds the product to the imm field. The processing element 40-1 transfers the operation result to the processing element 40-2 of the next stage. As shown in FIG. 28, “A00 * B00” is stored in the imm field of the instruction that arrives at the processing element 40-2.

In cycle 2 shown in FIG. 28, the processing element 40-2 calculates A01 * B01 and adds the product to the imm field. The processing element 40-2 transfers the operation result to the processing element 40-3 of the next stage. As shown in FIG. 29, “A00 * B00 + A01 * B01” is stored in the imm field of the instruction that arrives at the processing element 40-3.

In cycle 3 shown in FIG. 29, the processing element 40-3 calculates A02 * B02 and adds the product to the imm field. The processing element 40-3 transfers the operation result to the processing element 40-4 of the next stage. As shown in FIG. 30, “A00 * B00 + A01 * B01 + A02 * B02” is stored in the imm field of the instruction that arrives at the processing element 40-4.

In cycle 4 shown in FIG. 30, the processing element 40-4 calculates A03 * B03 and adds the product to the imm field. The processing element 40-4 transfers the operation result to the processing element 40-5 of the next stage. The description of cycles 5 to 7 is omitted.

In cycle 8 shown in FIG. 31, the last processing element 40-8 adds the calculation result of A07 * B07 to the value “A00 * B00 + A01 * B01 + ... + A06 * B06” of the imm field. This completes the calculation of the inner product. For example, the operation instruction that the last processing element 40-8 outputs to the second ring bus 18 is stored in a register or the like in the overall control unit data path 64 of the overall control unit 16.

The above is the description regarding the example in which the data processing device 1 calculates the inner product of vectors. In the calculation of the inner product described with reference to FIGS. 24 to 31, the overall control unit 16 outputs an operation instruction including a field storing immediate data to the second ring bus 18, and the processing element 40 performs an operation using the immediate data. That is, the processing element 40 rewrites the immediate data with the operation result calculated using the immediate data received through the second ring bus 18 and the data stored in the internal memory 30 corresponding to itself. Then, the processing element 40 outputs the rewritten immediate data to the second ring bus 18 as output data.
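
The reduction through the imm field can be sketched in Python as follows. As assumptions for illustration, each internal memory 30 is modelled as a dictionary, element j of A is placed at address 0 and element j of B at address 4 of internal memory j (borrowing the operands of the “MACR 0, 4” example), and the names macr and inner_product_on_ring are hypothetical.

```python
def macr(instr: dict, mem: dict) -> dict:
    """MAC reduction: return the instruction with imm += mem[rs] * mem[rt]."""
    product = mem[instr["rs"]] * mem[instr["rt"]]
    return {**instr, "imm": instr["imm"] + product}


def inner_product_on_ring(a: list, b: list) -> int:
    a_addr, b_addr = 0, 4  # assumed placement of the vector elements
    mems = [{a_addr: x, b_addr: y} for x, y in zip(a, b)]
    instr = {"rs": a_addr, "rt": b_addr, "imm": 0}  # imm starts at zero
    # One processing element per cycle; each forwards the updated instruction.
    for mem in mems:
        instr = macr(instr, mem)
    return instr["imm"]  # value returned to the overall control unit


d = inner_product_on_ring([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1])
```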

The above is the description of the calculation examples by the data processing device 1. Although matrix multiplication for small matrices has been described in the above example, the same calculation can be performed for larger matrices by increasing the number of internal memories 30 and processing elements 40 used. Even in that case, the arithmetic unit 44 included in the processing element 40 performs operations continuously in every cycle.

Further, in the data processing device 1, the second annular bus 18 and the first annular bus 17 can operate independently. Therefore, as shown in FIG. 32, processing and transfer can be performed in parallel. That is, while performing a given matrix multiplication, the data processing device 1 can simultaneously transfer the matrix A and the matrix B for the next matrix multiplication and transfer the matrix C that is the output of the previous matrix multiplication.
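
The overlap of processing and transfer can be pictured with a simple schedule, sketched here in Python. This is a generic three-stage software-pipeline model and an assumption for illustration; it does not reproduce the timing of FIG. 32. The name schedule is hypothetical.

```python
def schedule(num_jobs: int) -> list:
    """List, per time step, which matrix multiplication is loaded, computed, stored.

    Assumption: loading inputs, computing, and storing outputs each take one
    time step, and the two buses operate independently.
    """
    def job_or_none(j: int):
        return j if 0 <= j < num_jobs else None

    return [
        {
            "load": job_or_none(t),         # transfer of A and B for the next job
            "compute": job_or_none(t - 1),  # matrix multiplication in progress
            "store": job_or_none(t - 2),    # transfer of C from the previous job
        }
        for t in range(num_jobs + 2)
    ]


steps = schedule(3)
```

In this model, step 2 loads the inputs of job 2, computes job 1, and stores the output of job 0 at the same time.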

Further, the data to be processed by the data processing apparatus of the present embodiment is not limited to a matrix, and may be data of another form. For example, the data processing apparatus according to the present embodiment may process vector data.

Further, the interiors of the overall control unit and the processing elements included in the data processing apparatus of the present embodiment may be implemented as pipeline processors. Implementing the interiors of the overall control unit and the processing elements as pipeline processors can increase the throughput of operations. In this case, the internal memory must be accessed simultaneously (for example, rs and rd must be accessed simultaneously in the MACI instruction), so the internal memory may be configured as a plurality of banks to allow simultaneous access.

Further, although the case where the number of processing elements is eight has been described as an example in the present embodiment, the number of processing elements is not limited. For example, even if the number of processing elements is 256 or 512, the configuration of the present embodiment is applicable. Also, the number of processing elements may be less than eight or more than 512.

According to the present embodiment described above, executing data transfer and data calculation simultaneously provides a first effect of maintaining the operating rate of the arithmetic unit. Further, according to the present embodiment, transmitting the operation content of the processing elements as operation instructions through the ring bus eliminates long wiring other than the signal lines for the clock signal and the reset signal, which provides a second effect of improving the operating frequency. That is, according to the present embodiment, the matrix product and the inner product of vectors can be calculated efficiently by making the bus for data transfer and the bus for data processing independent.

The data processing apparatus according to the present embodiment is applicable to the flexible and efficient execution, on a field-programmable gate array (FPGA), of applications such as big-data analysis processing that perform matrix operations such as large-scale matrix products and inner products. Further, the data processing device of the present embodiment can be realized not only on an FPGA but also as a dedicated circuit (ASIC: Application Specific Integrated Circuit).

Although the present invention has been described above with reference to the embodiments, the present invention is not limited to the above embodiments. Various modifications that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2017-217026 filed on Nov. 10, 2017, the entire disclosure of which is incorporated herein.

Reference Signs List
1 Data processing device
12 Transfer element group
13 Internal memory group
14 Processing element group
15 Transfer control unit
16 Overall control unit
20 Transfer element
21 Annular bus register
22 Memory interface unit
30 Internal memory
31 Dual port memory
40 Processing element
41 Annular bus register
42 Instruction decoder
43 Memory interface unit
44 Arithmetic unit
51 Instruction register
52 Status register
53 Control circuit
61 Program counter
62 Command memory
63 Command decoder
64 Overall control unit data path
100 External memory
211 First register unit
212 Second register unit
213 Third register unit
311 Port A
312 Port B

Claims

A data processing apparatus comprising:
a first annular bus;
a transfer element group including a plurality of transfer elements connected in series by the first annular bus;
transfer control means connected to at least two of the transfer elements via the first annular bus and connected to an external memory;
a second annular bus independent of the first annular bus;
a processing element group including a plurality of processing elements connected in series by the second annular bus;
overall control means connected to at least two of the processing elements via the second annular bus; and
an internal memory group including a plurality of internal memories each connected to a corresponding transfer element and a corresponding processing element.
The data processing apparatus according to claim 1, wherein
the transfer control means transmits data read from the external memory as transfer data to the transfer element group through the first annular bus, and
each of the plurality of transfer elements included in the transfer element group writes the data included in the transfer data to the internal memory corresponding to itself in accordance with an analysis result of the transfer data.
The data processing apparatus according to claim 1, wherein
the overall control means sends an operation instruction to the processing element group through the second annular bus,
each of the plurality of processing elements reads data from the internal memory corresponding to itself in accordance with the received operation instruction and writes, to the internal memory as output data, the result of an operation using the read data,
each of the plurality of transfer elements reads the output data from the internal memory corresponding to itself and transmits the read output data to the transfer control means through the first annular bus, and
the transfer control means writes the received output data to the external memory.
The data processing apparatus according to any one of claims 1 to 3, wherein
the transfer element includes:
a first annular bus register connected to the first annular bus; and
a first memory interface connected to the first annular bus register and the internal memory, and
the processing element includes:
a second annular bus register connected to the second annular bus;
an instruction decoder connected to the second annular bus register;
a second memory interface connected to the instruction decoder and the internal memory; and
an arithmetic unit connected to the instruction decoder and the second memory interface.
The data processing apparatus according to claim 4, wherein
the first annular bus register includes:
a first register that analyzes transfer data transferred from the transfer element in the previous stage via the first annular bus and issues an access instruction according to the analysis result to the first memory interface;
a second register that is configured to match the access latency of the internal memory and that transfers the transfer data received from the first register; and
a third register that transfers the transfer data received from the second register to the transfer element in the subsequent stage,
the first register, when it determines that the command included in the transfer data received from the transfer element in the previous stage is a command to itself,
sends a write instruction to the first memory interface if the command is a write command to the internal memory, and
sends a read instruction to the first memory interface if the command is a read command from the internal memory,
the first memory interface,
when the write instruction is received from the first register, writes data to the internal memory in accordance with the received write instruction, and
when the read instruction is received from the first register, reads data from the internal memory in accordance with the received read instruction and sends the read data to the third register, and
the third register,
when data has been written to the internal memory by the first memory interface, sends the transfer data that arrived via the second register unchanged to the transfer element in the subsequent stage, and
when data has been read from the internal memory by the first memory interface, replaces a part of the transfer data that arrived via the second register with the data read from the internal memory and then sends the transfer data to the transfer element in the subsequent stage.
The data processing apparatus according to claim 4 or 5, wherein
the second annular bus register, when sending the operation instruction received through the second annular bus to the processing element in the subsequent stage, outputs the operation instruction to be processed to the instruction decoder if the received operation instruction includes an operation instruction to be processed by itself,
the instruction decoder analyzes the operation instruction to be processed received from the second annular bus register, generates a control signal according to the operation instruction to be processed, and outputs the control signal to the second memory interface and the arithmetic unit,
the second memory interface transmits the data read from the internal memory to the arithmetic unit in response to the control signal from the instruction decoder,
the arithmetic unit transmits, to the second memory interface, the result of an operation using the data received from the second memory interface in accordance with the control signal from the instruction decoder, and
the second memory interface writes the operation result of the arithmetic unit to the internal memory as output data.
The data processing apparatus according to any one of claims 1 to 6, wherein
the overall control means includes:
a program counter that stores a value indicating a command to be executed next;
a command memory that stores the command and outputs the command according to the value of the program counter;
a command decoder that analyzes the command output from the command memory and generates a control signal according to the analysis result; and
an overall control data path that is connected to the last processing element included in the processing element group and to the transfer control means and that performs an operation according to the control signal generated by the command decoder, and
the command decoder
outputs the generated control signal to the overall control data path when the command is interpreted as an instruction of the overall control means, and
outputs the generated control signal to the first-stage processing element included in the processing element group when the command is interpreted as an instruction of the processing element.
The data processing apparatus according to any one of claims 1 to 7, wherein
the transfer control means includes:
an instruction register including a plurality of register fields indicating an external memory address, an internal memory address, the number of transfer data, and a transfer direction;
a status register that holds a value indicating whether data is being transferred on the first annular bus; and
a control circuit that is connected to the external memory and that, when the instruction register includes a valid transfer instruction, transfers data between the external memory and the transfer element group and updates the value of the status register.
The data processing apparatus according to any one of the above claims, wherein
the overall control means outputs an operation instruction including a field storing immediate data to the second annular bus, and
each of the plurality of processing elements stores, in the internal memory corresponding to itself, the operation result calculated using the immediate data received through the second annular bus and the data stored in the internal memory corresponding to itself.
The data processing apparatus according to any one of claims 1 to 8, wherein
the overall control means outputs an operation instruction including a field storing immediate data to the second annular bus, and
each of the plurality of processing elements rewrites the immediate data with the operation result calculated using the immediate data received through the second annular bus and the data stored in the internal memory corresponding to itself, and outputs the rewritten immediate data to the second annular bus.