Disclosure of Invention
To address the above problems and technical needs, the present inventors propose an FPGA with a memory-compute architecture having high data transmission efficiency. The technical solution of the present application is as follows:
The memory-compute architecture FPGA with high data transmission efficiency comprises a plurality of resource modules arranged according to a preset structure and interconnection resources arranged around the resource modules. The resource modules located in the same sub-area are connected through the interconnection resources in the FPGA to form one memory-compute unit. Each memory-compute unit comprises a processor and a local storage unit connected through the interconnection resources; the local storage unit in each memory-compute unit stores the instructions and data of that memory-compute unit, and the processor performs operations according to the instructions and data stored in the connected local storage unit;
Each memory-compute unit that has a data transmission relation with another memory-compute unit further forms data transmission interfaces through the resource modules of its sub-area, so that it comprises a data transmission interface formed as a data output interface and/or a data transmission interface formed as a data input interface. The structure of every data transmission interface of every memory-compute unit is the same: each data transmission interface comprises a memory and is provided with a write data port, a write clock port, a read data port, and a read clock port;
For any two memory-compute units with a data transmission relation, the data output interface of the first memory-compute unit serving as the data sender is connected with the data input interface of the second memory-compute unit serving as the data receiver. The first memory-compute unit writes data through the write data port, under the control of the write clock signal at the write clock port, into the memory of its data output interface, and the data enters the memory of the data input interface of the second memory-compute unit in first-in first-out order; the second memory-compute unit reads the data from the memory of its data input interface through the read data port, under the control of the read clock signal at the read clock port, likewise in first-in first-out order, thereby realizing data transmission from the first memory-compute unit to the second memory-compute unit.
In a further technical solution, for any one memory-compute unit, the write clock signal of the data output interface in that memory-compute unit is synchronous with the clock signal of that memory-compute unit, and the read clock signal of the data input interface in that memory-compute unit is likewise synchronous with the clock signal of that memory-compute unit.
In a further technical solution, the write clock signal and the read clock signal of each data transmission interface may have the same or different frequencies, and the write clock signal and the read clock signal may be synchronous or asynchronous with each other.
In a further technical solution, each memory-compute unit executes a plurality of instructions sequentially according to its own instruction execution order; the first memory-compute unit sends target data, generated after executing a first target instruction, to the second memory-compute unit, and the second memory-compute unit executes a second target instruction according to the acquired target data;
while executing instructions according to its own instruction execution order, the second memory-compute unit determines, from the instruction execution order of the first memory-compute unit, that the first memory-compute unit has finished executing the first target instruction, reads the target data from the memory of its data input interface connected to the first memory-compute unit, and continues executing the second target instruction and the instructions that follow.
Each data transmission interface further comprises a write status port and a read status port;
The first memory-compute unit serving as the data sender reads the write status signal of its data output interface through the write status port of the data output interface. When a write status signal at the active level is read, it determines that the memory in the data output interface is not full and writes data into that memory; when a write status signal at the inactive level is read, it determines that the memory in the data output interface is full and continues waiting;
The second memory-compute unit serving as the data receiver reads the read status signal of its data input interface through the read status port of the data input interface. When a read status signal at the active level is read, it determines that the memory in the data input interface is not empty and reads data from that memory; when a read status signal at the inactive level is read, it determines that the memory in the data input interface is empty and continues waiting.
When, upon reaching the second target instruction, the second memory-compute unit determines that the first memory-compute unit has not yet finished executing the first target instruction, and at least one instruction scheduled after the second target instruction is independent of both the target data and the intermediate data generated by the second target instruction, the second memory-compute unit adjusts its instruction execution order: it first skips the second target instruction and executes the instructions that are independent of the target data and of the intermediate data generated by the second target instruction, until it determines that the first memory-compute unit has finished executing the first target instruction; it then reads the target data from the memory of its data input interface connected to the first memory-compute unit and continues executing the second target instruction and the instructions remaining after it.
In a further technical solution, a plurality of memory-compute units in the memory-compute architecture FPGA have pairwise data transmission relations that form a global data stream, and the memory-compute units through which the global data stream passes in sequence adjust their respective instruction execution orders in sequence: each second memory-compute unit serving as a data receiver adjusts its own instruction execution order only after the instruction execution order of the corresponding first memory-compute unit serving as the data sender has been fixed, and does so according to that sender's instruction execution order.
The second memory-compute unit serving as the data receiver reads the read status signal of its data input interface at predetermined time intervals and reads data from the memory of the data input interface when a read status signal at the active level is read;
alternatively, an interrupt signal is generated when the read status signal of the data input interface changes from the inactive level to the active level, and the second memory-compute unit serving as the data receiver reads data from the memory of the data input interface when triggered by the interrupt signal.
In a further technical solution, any data transmission interface is implemented by a dual-port FIFO memory or by asynchronous registers.
In a further technical solution, the data output interface of the first memory-compute unit and the data input interface of the second memory-compute unit that realize data transmission are both connected to a network-on-chip (NoC), and the first memory-compute unit transmits data through its data output interface, over the NoC, to the data input interface of the second memory-compute unit.
In a further technical solution, the memory-compute architecture FPGA has a multi-die structure comprising an FPGA die and a memory die. The first memory-compute unit and the second memory-compute unit are formed on the same FPGA die or on different FPGA dies; the data output interface of the first memory-compute unit is connected to the memory die, the data input interface of the second memory-compute unit is connected to the memory die, and the first memory-compute unit transmits data through its data output interface, via the memory die, to the data input interface of the second memory-compute unit.
The beneficial technical effects of the application are as follows:
the application discloses a memory-compute architecture FPGA with high data transmission efficiency. The resource modules in each sub-area of the FPGA are connected through interconnection resources in the FPGA to realize one memory-compute unit, and data are transmitted between memory-compute units through dedicated data transmission interfaces, so that a plurality of memory-compute units can complete one memory-compute operation synchronously in parallel; the overall data processing efficiency of the FPGA is therefore high and its operation speed is fast.
The data transmission interface supports asynchronous operation, which satisfies the transmission requirements of different memory-compute units while ensuring data transmission efficiency. Based on the data transmission interface, a memory-compute unit can further reduce idle waiting time by adjusting its instruction execution order; globally, the instruction execution order of each memory-compute unit is adjusted in sequence along the global data stream, so that waiting time is reduced while data transmission correctness is preserved, further improving data transmission efficiency.
Each memory-compute unit uses a data transmission interface with a common structure, which simplifies the design; the data transmission interface is easy to implement and does not occupy excessive FPGA resources.
The local storage unit in each memory-compute unit exchanges data with the outside of the FPGA chip through a global read-write channel. Because the processor and the local storage unit of each memory-compute unit are in the same sub-area, the processor only needs to read and write instructions and data in the adjacent local storage unit during execution, which reduces the routing and interconnection resources consumed by data transmission between the processor and the local storage unit, reduces data transmission delay and energy consumption, and improves the data processing efficiency of the FPGA. In addition, the FPGA can realize multi-core parallel memory-compute operation through the plurality of memory-compute units: the processors of the memory-compute units do not contend with one another, each memory-compute unit can execute at full speed without being limited by memory bandwidth, and the overall data processing efficiency and operation speed are further improved.
Detailed Description
The following describes the embodiments of the present application further with reference to the drawings.
The application discloses a memory-compute architecture FPGA with high data transmission efficiency. Like a conventional FPGA, it comprises a plurality of resource modules arranged according to a preset structure and interconnection resources arranged around the resource modules. The preset structure can be chosen according to actual conditions; for example, in the currently mainstream column-based structure, the resource modules are arranged in rows and columns to form a two-dimensional array, each column contains resource modules of the same type, and interconnection resources surround each resource module. The types of resource modules contained in the FPGA include configurable logic blocks (CLB), configurable memory blocks (BRAM), and configurable multiply-add blocks (DSP). A CLB contains units such as lookup tables and registers and can be configured to realize various logic functions. A DSP contains an arithmetic unit that can execute operations such as multiplication, addition, and multi-bit logic gates. A BRAM contains block memory cells and can realize single- and dual-port memories and FIFO memories of various widths and depths. Please refer to the schematic diagram of the resource modules in the FPGA shown in fig. 1; this part of the structure is the same as in an existing FPGA and is not described in detail in the present application.
All the resource modules arranged according to the preset structure in the FPGA are divided into a plurality of sub-areas, each containing a certain number of resource modules. The resource modules in the same sub-area are connected through interconnection resources in the FPGA to realize one memory-compute unit, so the resource modules in a plurality of different sub-areas correspondingly realize a plurality of memory-compute units.
Each memory-compute unit includes its own processor and local storage unit, the processor in each memory-compute unit being connected to the local storage unit in that memory-compute unit through interconnection resources. The local storage unit in each memory-compute unit stores the instructions and data of that memory-compute unit, and the processor operates according to the instructions and data stored in the connected local storage unit.
In one embodiment, two memory-compute units implemented by the resource modules in any two sub-areas may use the same or different instruction sets, and two memory-compute units using the same instruction set may have the same or different bit widths. In the present application, the memory-compute units formed within the FPGA can implement a variety of typical general-purpose 8-bit processors, including RISC, 8051, and 6802 cores.
In one embodiment, the number and types of the resource modules contained in any two sub-areas are the same; in another embodiment, any two sub-areas differ in at least one of the number and types of resource modules they contain. That is, the sub-areas obtained by division may or may not be identical. In addition, either every resource module in the FPGA belongs to some sub-area, so that all resource modules are divided into sub-areas, or only part of the resource modules are divided into sub-areas; in other words, the FPGA can be divided into sub-areas globally or locally. For example, in the example shown in fig. 1, the resource modules in the FPGA are divided into 8 sub-areas, each sub-area contains the same number and types of resource modules, and each area outlined by a dashed box represents one sub-area.
The present application does not limit the number and types of resource modules contained in each sub-area, but they must be sufficient to implement the processor and the local storage unit of one memory-compute unit: the processor is implemented mainly based on DSP modules, the local storage unit is implemented mainly based on BRAM modules, and in addition a certain number of CLBs need to be combined with them.
In addition to the resource modules and interconnection resources, the FPGA provided by the application adds, compared with a conventional FPGA, a global read-write channel realized by additional hardware resources. The local storage unit in each memory-compute unit exchanges data with the outside of the FPGA chip through this global read-write channel. The resource modules within each sub-area shown in fig. 1 implement one memory-compute unit; a schematic diagram of the memory-compute units implemented in fig. 1 is shown in fig. 2.
The FPGA of the present architecture internally forms a plurality of memory-compute units, each of which supports independent memory-compute operation in the form of a soft core, so the FPGA can jointly realize multi-core memory-compute operation through the plurality of internally formed memory-compute units. Because the processor and the local storage unit of each memory-compute unit are in the same sub-area, the processor only needs to read and write instructions and data in the adjacent local storage unit during execution, which reduces the routing and interconnection resources consumed by data transmission between the processor and the local storage unit, reduces data transmission delay and energy consumption, and improves the data processing efficiency of the FPGA. In addition, in a multi-core scenario, if multiple operation cores read and write the same main memory at the same time, the read-write bandwidth of the main memory becomes a bottleneck that constrains data processing efficiency. This problem is particularly obvious in high-performance computing: an FPGA supporting artificial intelligence or machine learning applications generally contains a large number of relatively simple operation cores, possibly more than 100, so the demand on main-memory read-write bandwidth is larger and the problem is more pronounced. Although the present FPGA is also a multi-core design, the processor in each memory-compute unit reads and writes its own local storage unit instead of all processors reading and writing the same memory, so the processors of the memory-compute units do not contend with one another, each memory-compute unit can execute at full speed without being limited by memory bandwidth, and the overall data processing efficiency and operation speed are improved.
The FPGA is therefore suitable for scenarios containing a large number of memory-compute units; even when the number of memory-compute units exceeds 100, the FPGA maintains a high operation rate and can better support artificial intelligence and machine learning workloads.
Based on this structure, one memory-compute operation realized by the FPGA of the present application through the plurality of memory-compute units includes the following steps. First, the parameters of each memory-compute unit are written from outside the FPGA chip, through the global read-write channel, into the local storage unit of the corresponding memory-compute unit; the parameters written into the local storage unit of each memory-compute unit include the instructions and/or data of that memory-compute unit, that is, instructions only, data only, or both may be written. Then the memory-compute units are started simultaneously for parallel operation: the processor in each memory-compute unit operates according to the instructions and data in the local storage unit of that memory-compute unit and stores the operation result back into the same local storage unit. Finally, the operation results in the local storage units of the memory-compute units are read out of the FPGA chip through the global read-write channel. In addition, according to actual needs, the local storage unit in one memory-compute unit can be connected to other BRAM modules in the FPGA through the interconnection resources of the FPGA to realize data exchange.
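The write-parameters, compute-in-parallel, read-results flow described above can be sketched in software as follows. This is a minimal behavioral model, not the disclosed hardware; all names (MemoryComputeUnit, global_write, global_read, the "mul" instruction) are illustrative assumptions.

```python
# Behavioral sketch of one memory-compute operation. Each unit holds its
# own local storage; the host touches it only via the global read-write
# channel (modeled here as global_write/global_read). All names are
# illustrative assumptions.

class MemoryComputeUnit:
    """One sub-area: a processor plus its local storage unit."""
    def __init__(self):
        self.local_mem = {}  # local storage unit: instructions and data

    def global_write(self, key, value):
        # global read-write channel: host writes parameters in
        self.local_mem[key] = value

    def global_read(self, key):
        # global read-write channel: host reads results out
        return self.local_mem[key]

    def run(self):
        # the processor reads and writes only its adjacent local storage
        a, b = self.local_mem["data"]
        if self.local_mem["instr"] == "mul":
            self.local_mem["result"] = a * b

# One memory-compute operation over several units:
units = [MemoryComputeUnit() for _ in range(4)]
for i, u in enumerate(units):
    u.global_write("instr", "mul")       # step 1: write parameters
    u.global_write("data", (i, i + 1))
for u in units:
    u.run()                              # step 2: all units compute
results = [u.global_read("result") for u in units]  # step 3: read results
# results == [0, 2, 6, 12]
```

Because each unit's `run` touches only its own `local_mem`, the loop in step 2 could be executed by all units concurrently without contention, which is the point of the architecture.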
On this basis, the FPGA can, according to actual needs, perform multiple memory-compute operations in sequence through the internally realized memory-compute units until all computation is completed. For any one memory-compute operation, the corresponding parameters are written from outside the FPGA into the local storage units of all the memory-compute units, or only of some of them; as before, the parameters include the instructions and/or data of each memory-compute unit. All or some of the memory-compute units are then started to operate in parallel, after which the operation results in the local storage units of all or some of the memory-compute units are read out of the FPGA chip. That is, when multiple memory-compute operations are performed, each operation may use all the memory-compute units formed in the FPGA, or only some of them while skipping the others; when only some are used, the same or different memory-compute units may be used across operations. Each operation can also adjust the parameters written into each memory-compute unit, and each operation may read the results of all or only some of the memory-compute units. The flows of the multiple operations are thus basically similar, but each operation can flexibly adjust the data read and written for each memory-compute unit, without having to perform the same read-write operations on the local storage units of all memory-compute units every time.
When a plurality of memory-compute units formed in the FPGA execute synchronously in parallel to complete one memory-compute operation, data transmission relations often exist between different memory-compute units; that is, intermediate data obtained by one memory-compute unit must be transmitted to other memory-compute units to participate in their operations. To reduce design difficulty and improve data transmission efficiency between memory-compute units, each memory-compute unit in the memory-compute architecture FPGA that has a data transmission relation with another memory-compute unit additionally forms data transmission interfaces through the resource modules of its sub-area. The data transmission interfaces are dedicated to data transmission between memory-compute units: a data transmission interface can be realized as a data output interface so that data is output from the memory-compute unit, or as a data input interface so that data is input to the memory-compute unit. Each such memory-compute unit comprises one or more data transmission interfaces, which may be of the same or different types when a memory-compute unit comprises several; thus each memory-compute unit comprises a data transmission interface formed as a data output interface DOUT and/or a data transmission interface formed as a data input interface DIN.
The structure of every data transmission interface in every memory-compute unit is the same; that is, the memory-compute units in the memory-compute architecture FPGA of the application all use the same data transmission interface, which gives high universality and simplifies the design. Each data transmission interface comprises a memory and has a write data port WD, a write clock port WCK, a read data port RD, and a read clock port RCK.
For any two memory-compute units with a data transmission relation, the data output interface DOUT of the first memory-compute unit serving as the data sender is connected with the data input interface DIN of the second memory-compute unit serving as the data receiver to form a data transmission path, as shown in fig. 3. Over this path, the first memory-compute unit writes data through the write data port WD, under the control of the write clock signal wclk at the write clock port WCK, into the memory in its data output interface DOUT, and the data is passed into the memory in the data input interface DIN of the second memory-compute unit in first-in first-out order; the second memory-compute unit reads the data from the memory in its data input interface DIN through the read data port RD, under the control of the read clock signal rclk at the read clock port RCK, thereby realizing data transmission from the first memory-compute unit to the second memory-compute unit.
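The DOUT-to-DIN path can be sketched as a simple FIFO model. This is a behavioral illustration only; the class name, port-method names, and FIFO depth are assumptions, and in hardware the write side runs in the wclk domain and the read side in the rclk domain.

```python
from collections import deque

# Behavioral sketch of the data transmission path of fig. 3: a FIFO
# memory written through port WD and read through port RD. Names and
# depth are illustrative assumptions.

class TransferInterface:
    """One data transmission interface: a memory with WD/RD ports."""
    def __init__(self, depth=16):
        self.mem = deque()       # FIFO memory inside the interface
        self.depth = depth

    def write_wd(self, word):    # write data port WD (wclk domain)
        self.mem.append(word)

    def read_rd(self):           # read data port RD (rclk domain)
        return self.mem.popleft()  # first-in first-out order

# First unit's DOUT connected to second unit's DIN:
path = TransferInterface()
for word in [0xA, 0xB, 0xC]:
    path.write_wd(word)                         # sender writes
received = [path.read_rd() for _ in range(3)]   # receiver reads in order
# received == [10, 11, 12]
```

The first-in first-out discipline is what guarantees that the second memory-compute unit observes the data in exactly the order the first memory-compute unit produced it.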
For any one memory-compute unit, the write clock signal wclk of the data output interface DOUT in that unit is synchronized with the clock signal of that unit, and the read clock signal rclk of the data input interface DIN in that unit is synchronized with the clock signal of that unit. Taking the data transmission path between the first and second memory-compute units shown in fig. 3 as an example, the write clock signal wclk of the data output interface DOUT of the first memory-compute unit is synchronized with the clock signal of the first memory-compute unit, and the read clock signal rclk of the data input interface DIN of the second memory-compute unit is synchronized with the clock signal of the second memory-compute unit.
However, the write clock signal wclk and the read clock signal rclk of a given data transmission interface may have the same or different frequencies, and may be synchronous or asynchronous with each other; in other words, the data transmission interface of the present application supports asynchronous data transfer. Consequently, for the first and second memory-compute units at the two ends of any formed transmission path, the data write rate of the sender and the data read rate of the receiver may be the same or different, and their reads and writes may be synchronous or asynchronous. This reduces design difficulty and meets the differing transmission requirements between different memory-compute units.
In addition, each data transmission interface further includes a write enable port WEN and a read enable port REN: any first memory-compute unit can enable a data output interface through the write enable port WEN, and any second memory-compute unit can enable a data input interface through the read enable port REN.
The data transmission method realized by the first and second memory-compute units over a data transmission path formed by their data transmission interfaces includes the following implementations:
(1) Each data transmission interface further comprises a write status port WS and a read status port RS. The write status port outputs a write status signal at the active level when the memory of the data transmission interface is not full, and at the inactive level when that memory is full; the read status port outputs a read status signal at the active level when the memory of the data transmission interface is not empty, and at the inactive level when that memory is empty.
A first memory-compute unit serving as the data sender reads the write status signal of its data output interface DOUT through the write status port. When a write status signal at the active level is read, it determines that the memory in the data output interface is not full and writes data into that memory; when a write status signal at the inactive level is read, it determines that the memory is full and continues waiting. A second memory-compute unit serving as the data receiver reads the read status signal of its data input interface DIN through the read status port. When a read status signal at the active level is read, it determines that the memory in the data input interface is not empty and reads data from that memory; when a read status signal at the inactive level is read, it determines that the memory is empty and continues waiting.
Thus, for any formed data transmission path, the first memory-compute unit writes data into its data output interface DOUT only upon reading a write status signal at the active level, and the second memory-compute unit reads data from its data input interface DIN only upon reading a read status signal at the active level.
The second memory-compute unit serving as the data receiver can determine the read status signal of the data input interface DIN by two main mechanisms: (a) in the first, the second memory-compute unit actively reads the read status signal of the data input interface DIN at predetermined time intervals and reads data from the memory of the data input interface when a read status signal at the active level is read; (b) in the second, an interrupt signal is generated when the read status signal of the data input interface changes from the inactive level to the active level, and the second memory-compute unit reads data from the memory of the data input interface when triggered by the interrupt signal.
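The full/empty handshake of mechanism (1) can be sketched as follows. This is a behavioral model under assumed names (StatusFifo, try_write, try_read); here WS is modeled as True (active level) while the memory is not full, and RS as True while it is not empty.

```python
from collections import deque

# Sketch of the WS/RS status handshake: the sender writes only when WS
# is at the active level (not full), the receiver reads only when RS is
# at the active level (not empty). Names and depth are assumptions.

class StatusFifo:
    def __init__(self, depth=4):
        self.mem = deque()
        self.depth = depth

    @property
    def ws(self):                # write status: active level = not full
        return len(self.mem) < self.depth

    @property
    def rs(self):                # read status: active level = not empty
        return len(self.mem) > 0

    def try_write(self, word):   # sender checks WS before writing
        if self.ws:
            self.mem.append(word)
            return True
        return False             # WS inactive: memory full, keep waiting

    def try_read(self):          # receiver checks RS before reading
        if self.rs:
            return self.mem.popleft()
        return None              # RS inactive: memory empty, keep waiting

fifo = StatusFifo(depth=2)
assert fifo.try_write(1) and fifo.try_write(2)
assert not fifo.try_write(3)     # WS inactive: sender must wait
assert fifo.try_read() == 1      # RS active: memory not empty
```

Mechanism (a) above corresponds to calling `try_read` at predetermined intervals; mechanism (b) would instead invoke the read on the inactive-to-active transition of `rs`.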
(2) In the data processing applications targeted by memory-compute operation, the processing performed by each memory-compute unit is known in advance, and so is the instruction execution order in which each memory-compute unit sequentially executes its instructions, including which instructions require sending data to, or receiving data from, other memory-compute units; a read-write mechanism can therefore be determined from this information. As the memory-compute units sequentially execute their instructions according to their respective instruction execution orders, consider the first and second memory-compute units at the two ends of any data transmission path: according to the processing logic of the data processing application, it is known in advance that the first memory-compute unit must send the target data generated after executing a first target instruction to the second memory-compute unit, and that the second memory-compute unit must execute a second target instruction according to the acquired target data. Then:
While executing instructions according to its own instruction execution order, the second memory-compute unit can determine, from the instruction execution order of the first memory-compute unit, that once the first memory-compute unit has finished executing the first target instruction, the target data has necessarily been sent to the data input interface of the connected second memory-compute unit; the second memory-compute unit then reads the target data from the memory of its data input interface connected to the first memory-compute unit and continues executing the second target instruction and the instructions that follow. Otherwise, after finishing the instructions preceding the second target instruction, the second memory-compute unit waits until the first memory-compute unit has finished executing the first target instruction, and continues executing the second target instruction and subsequent instructions only after reading the target data.
For example, according to the processing logic of the data processing application, it may be known in advance that the instruction execution order of the first memory-compute unit is I1, I2, I3, I4, I5, I6 and that of the second memory-compute unit is J1, J2, J3, J4, J5, J6, J7, J8, J9, where instruction I4 is the first target instruction and instruction J2 is the second target instruction. The first memory-compute unit sequentially executes I1, I2, I3, and I4 and writes the target data generated by I4 through its data output interface into the data input interface of the connected second memory-compute unit. The second memory-compute unit, after executing J1, waits until it determines that the first memory-compute unit has finished executing I4, reads the target data from its data input interface, and then continues executing J2 and the subsequent instructions J3 through J9.
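The in-order timing of this example can be sketched as a small timeline. The model assumes, purely for illustration, that both units start at cycle 0, that each instruction takes one cycle, and that the target data becomes readable the cycle after I4 completes; none of these timing figures come from the disclosure.

```python
# Timeline sketch of the in-order scheme: the second unit stalls after
# J1 until the first unit has completed I4. One instruction per cycle is
# an illustrative assumption.

first_order = ["I1", "I2", "I3", "I4", "I5", "I6"]
second_order = ["J1", "J2", "J3"]   # J2 needs the target data from I4

timeline = []                       # entries: (cycle, unit, instruction)
for cycle, instr in enumerate(first_order):
    timeline.append((cycle, "first", instr))

i4_done = first_order.index("I4")   # cycle in which I4 completes (3)
cycle = 0
for instr in second_order:
    if instr == "J2":
        cycle = max(cycle, i4_done + 1)  # idle wait for the target data
    timeline.append((cycle, "second", instr))
    cycle += 1
# J1 runs at cycle 0, J2 only at cycle 4: cycles 1-3 are idle waiting.
```

The three idle cycles between J1 and J2 are exactly the waiting time that the reordering scheme described next tries to eliminate.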
Whichever method is used to implement data transmission between the two memory computing units, the above example shows that, especially when the first target instruction is located at a later position in the instruction execution sequence of the first memory computing unit, the second memory computing unit may sit idle waiting for the target data. In the example, after executing instruction J1, the second memory computing unit must idle while the first memory computing unit sequentially executes four instructions before it can continue with the next instruction J2. When the positions of the first target instruction and the second target instruction in their respective instruction execution sequences differ greatly, this waiting delay becomes significant and affects the data transmission efficiency.
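The idle-waiting behaviour described above can be illustrated with a minimal cycle-level model. This is a hypothetical software sketch for illustration only, not part of the claimed hardware: both units execute one instruction per cycle, and the consumer stalls before its second target instruction until the producer has completed the first target instruction.

```python
# Hypothetical cycle-level sketch of the blocking scheme: the producer
# executes I1..I6 in order, one instruction per cycle; the consumer stalls
# before J2 until the producer has completed I4 (the first target instruction).
def simulate_blocking(producer, consumer, target_producer, target_consumer):
    """Return the consumer's (cycle, instruction) trace and its idle-cycle count."""
    trace, idle = [], 0
    p = c = cycle = 0          # next producer / consumer instruction indices
    done = set()               # producer instructions completed so far
    while c < len(consumer):
        if p < len(producer):  # producer advances one instruction per cycle
            done.add(producer[p])
            p += 1
        # consumer stalls at the target instruction until the data is ready
        if consumer[c] == target_consumer and target_producer not in done:
            idle += 1
        else:
            trace.append((cycle, consumer[c]))
            c += 1
        cycle += 1
    return trace, idle

trace, idle = simulate_blocking(
    ["I1", "I2", "I3", "I4", "I5", "I6"],
    ["J1", "J2", "J3", "J4", "J5", "J6", "J7", "J8", "J9"],
    target_producer="I4", target_consumer="J2")
```

In this model the consumer idles for two cycles between J1 and J2 while the producer works through I2 to I4, which is the waiting delay the next embodiment removes.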
Therefore, in another embodiment, when the second memory computing unit reaches the second target instruction and determines that the first memory computing unit has not yet completed execution of the first target instruction, and at least one of the instructions after the second target instruction depends on neither the target data sent by the first memory computing unit nor the intermediate data generated by the second target instruction, the second memory computing unit adjusts its instruction execution sequence: it first skips the second target instruction and executes those instructions that are independent of the target data and of the intermediate data generated by the second target instruction, until it determines that the first memory computing unit has completed execution of the first target instruction; it then reads the target data from the memory of its data input interface connected to the first memory computing unit and continues to execute the second target instruction and the instructions remaining after it.
For example, in the above example, assume that instructions J3, J4, J5, J6 and J7 executed by the second memory computing unit depend on neither the target data generated by the first memory computing unit executing instruction I4 nor the intermediate data generated by the second memory computing unit executing instruction J2; that is, instructions J3 to J7 have no required execution order relative to instruction J2. Therefore, after executing instruction J1, since the first memory computing unit has not yet completed instruction I4, the second memory computing unit may skip instruction J2 and directly execute instructions J3, J4, J5, J6 and J7 in sequence. Assume that by the time the second memory computing unit has executed instruction J5 it determines that the first memory computing unit has completed instruction I4; the second memory computing unit then reads the target data from the data input interface, executes instruction J2 according to the target data, and continues to execute the remaining instructions J6, J7, J8 and J9 in order. The actual instruction execution sequence of the second memory computing unit is therefore J1, J3, J4, J5, J2, J6, J7, J8 and J9, which reduces the waiting time of the second memory computing unit without affecting the data transmission process or the operation results between the two memory computing units, thereby improving data transmission efficiency and operation efficiency.
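The reordering described above can be sketched as follows. The dependency sets and the point at which the target data becomes available are illustrative assumptions for this example, not details taken from the application:

```python
# Hypothetical sketch of the instruction reordering: instructions that depend
# on neither the target data nor the intermediate result of the blocked
# instruction are hoisted ahead of it while the producer is still working.
def reorder(consumer, blocked, independent, ready_after):
    """Walk `consumer` in order, deferring `blocked` until `ready_after`
    independent instructions have run; return the actual execution order."""
    order, deferred, run = [], [], 0
    for instr in consumer:
        if instr == blocked:
            deferred.append(instr)      # producer not finished yet: skip J2
        elif instr in independent:
            order.append(instr)
            run += 1
            if run == ready_after and deferred:
                order.extend(deferred)  # target data has arrived: run J2 now
                deferred.clear()
        else:
            order.append(instr)
    return order

actual = reorder(
    ["J1", "J2", "J3", "J4", "J5", "J6", "J7", "J8", "J9"],
    blocked="J2",
    independent={"J3", "J4", "J5", "J6", "J7"},
    ready_after=3,  # assume I4 completes while J5 is executing
)
```

With these assumptions the sketch reproduces the actual execution sequence J1, J3, J4, J5, J2, J6, J7, J8, J9 given in the example.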
In practical applications, data transmission is often required between multiple groups of memory computing units in the memory architecture FPGA: each memory computing unit may act as a data sender transmitting data to one or more other memory computing units, and may also act as a data receiver receiving data transmitted by one or more other memory computing units. For example, among the 8 memory computing units illustrated in fig. 2, the data transmission relationships between different memory computing units are shown in fig. 4: memory computing unit 1 transmits data to memory computing unit 2 and memory computing unit 5, and also receives data transmitted by memory computing unit 2 and memory computing unit 5; the data transmission relationships between the other memory computing units are likewise shown in fig. 4. Based on the data transmission relationships among the different memory computing units, the memory computing units in the FPGA form global data streams; a global data stream passes through several memory computing units in sequence and represents the transmission direction and order of a group of data among them. For example, in fig. 4, the data generated by memory computing unit 1 is sent to memory computing unit 2, and the result obtained by memory computing unit 2 operating on that data is transmitted to memory computing unit 3, forming the global data stream memory computing unit 1 - memory computing unit 2 - memory computing unit 3. For any one memory computing unit, after it has adjusted its actual instruction execution sequence according to the method provided by the above embodiment, if its data sender adjusts its own instruction execution sequence again, the data processing order will become wrong.
Therefore, the application performs a global adjustment over all the memory computing units: the memory computing units through which a global data stream sequentially passes adjust their respective instruction execution sequences in turn, and a second memory computing unit acting as data receiver adjusts its own instruction execution sequence only after the corresponding first memory computing unit acting as data sender has adjusted and fixed its instruction execution sequence, and does so according to that fixed sequence. In practical application, the data transmission relationships among all the memory computing units are abstracted into a directed graph: each memory computing unit is a node, and a directed edge between two nodes represents the data transmission path formed between them. Sorting the nodes of the directed graph in topological order then determines the order in which the instruction execution sequences of the memory computing units are adjusted. For example, in the global data stream of memory computing units 1, 2 and 3, after memory computing unit 2 has adjusted and fixed its instruction execution sequence, memory computing unit 3 adjusts its own instruction execution sequence according to the adjusted sequence of memory computing unit 2, ensuring the correctness of the data order.
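The topological ordering of the directed graph can be sketched with Kahn's algorithm. The example edge set below (unit 1 feeding units 2 and 5, unit 2 feeding unit 3) is an illustrative acyclic subgraph of the data transmission relationships; a graph with cyclic transmission relationships would first need its cycles broken before a topological order exists.

```python
from collections import deque

# Sketch of the topological ordering: nodes are memory computing units,
# edges are data transmission paths, and the resulting order is the order
# in which the units adjust and fix their instruction execution sequences.
def topo_order(nodes, edges):
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:          # a unit may adjust once all its senders have
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("data transmission graph contains a cycle")
    return order

order = topo_order([1, 2, 3, 5], [(1, 2), (2, 3), (1, 5)])
```

Here unit 1 fixes its sequence first, units 2 and 5 follow, and unit 3 adjusts last, matching the rule that a receiver adjusts only after its sender.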
Based on the above design, each data transmission interface in each memory computing unit can be implemented in multiple ways. One typical implementation is a dual-port FIFO memory. The storage requirement of the dual-port FIFO memory is small, so it does not need to occupy a BRAM module and can instead be implemented with the distributed RAM of the SLICEM resources in a CLB module, so the data transmission interface does not occupy excessive FPGA resources. The read-write depth of the dual-port FIFO memory is related to the write clock signal and the read clock signal, as well as to the data transmission rate and the response speed of the memory computing units; a common practice is to set the read-write depth of the dual-port FIFO memory to 1. The data width of the dual-port FIFO memory is generally the same as the data width of the soft-core processor implemented by the corresponding memory computing unit; a common practice is to set the data width of the dual-port FIFO memory to 8.
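The behaviour of a depth-1, 8-bit-wide dual-port FIFO can be modelled as follows. This is a software sketch for illustration only; the hardware version would be built from distributed RAM in SLICEM resources as described above.

```python
# Behavioural sketch of a dual-port FIFO with read-write depth 1 and data
# width 8: a single storage slot with full/empty status, giving first-in
# first-out transfer with back-pressure on the writer.
class DepthOneFifo:
    WIDTH = 8  # data width in bits

    def __init__(self):
        self._slot = None  # the single storage location (depth = 1)

    @property
    def full(self):
        return self._slot is not None

    @property
    def empty(self):
        return self._slot is None

    def write(self, value):
        """Write port: accepts one word only when the FIFO is not full."""
        if self.full:
            return False                          # writer must wait
        self._slot = value & ((1 << self.WIDTH) - 1)  # truncate to 8 bits
        return True

    def read(self):
        """Read port: returns the stored word and empties the FIFO."""
        if self.empty:
            return None                           # reader must wait for data
        value, self._slot = self._slot, None
        return value

fifo = DepthOneFifo()
accepted = fifo.write(0x1FF)  # 9-bit value is truncated to the 8-bit width
```

With depth 1, every word written must be read before the next can be written, which matches the lock-step producer/consumer transfer described earlier.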
In addition, any data transmission interface can be implemented as an asynchronous register, that is, a register combined with handshake signals. Compared with a dual-port FIFO memory, this structure requires fewer resource modules, further reducing the FPGA resources occupied.
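The register-plus-handshake alternative can be sketched as a single register with a valid flag. This is an illustrative software model of generic valid/ready handshaking, not the specific circuit of the application:

```python
# Sketch of the asynchronous-register interface: one data register and a
# valid flag form the handshake; the sender may load only when the register
# is empty, and the receiver consuming the word clears the flag.
class HandshakeRegister:
    def __init__(self):
        self.data = None
        self.valid = False   # asserted while the register holds a new word

    def send(self, value):
        """Sender side: succeeds only when the register is ready (empty)."""
        if self.valid:
            return False     # previous word not yet consumed by the receiver
        self.data, self.valid = value, True
        return True

    def receive(self):
        """Receiver side: consumes the word and de-asserts valid."""
        if not self.valid:
            return None      # nothing to read yet
        self.valid = False
        return self.data

reg = HandshakeRegister()
```

A single register and a flag replace the FIFO's storage and full/empty logic, which is why this variant needs fewer resource modules.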
In one embodiment, the data output interface of any first memory computing unit and the data input interface of the corresponding second memory computing unit that implement data transmission are connected directly through the interconnection resources of the FPGA. In another embodiment, the data output interface of the first memory computing unit and the data input interface of the second memory computing unit that implement data transmission are both connected to a NoC (network on chip), and the data output interface of the first memory computing unit sends data to the data input interface of the second memory computing unit through the NoC, improving data transmission efficiency.
In addition, the memory architecture FPGA may have a multi-die structure. When the memory architecture FPGA has a multi-die structure comprising one or more FPGA dies and a memory die, the first memory computing unit and the second memory computing unit may be formed on the same FPGA die or on different FPGA dies. The data output interface of the first memory computing unit is connected to the memory die, the data input interface of the second memory computing unit is connected to the memory die, and the first memory computing unit transmits data through its data output interface, via the memory die, to the data input interface of the second memory computing unit.
In one embodiment, when the predetermined structure by which the resource modules in the FPGA are arranged is a column-based structure, the BRAM modules occupy one or more columns, and the local storage unit in each memory computing unit is implemented based on BRAM modules, so the BRAM modules implementing the local storage units of the memory computing units are located in one or more columns. A plurality of BRAM modules located in the same column perform data transmission with the outside of the FPGA chip through the same global read-write channel, and through each global read-write channel the outside of the FPGA chip can exchange data with any local storage unit connected to that channel. Based on this, the application provides two methods for implementing data transmission between the local storage unit in each memory computing unit and the outside of the FPGA chip:
In one embodiment, the BRAM modules located in the same column that implement local storage units are connected, through the same global read-write channel, to an on-chip read-write interface of the FPGA for data transmission with the outside of the FPGA, and different global read-write channels are connected to different on-chip read-write interfaces of the FPGA. In the example shown in fig. 2, 8 memory computing units are formed inside the FPGA. The local storage units in memory computing units 1 to 4 are implemented by BRAM modules located in the same column, and these 4 local storage units are connected through global read-write channel 1 to on-chip read-write interface 1 of the FPGA to implement data transmission with the outside of the FPGA. The local storage units in memory computing units 5 to 8 are likewise implemented by BRAM modules located in the same column, and these 4 local storage units are connected through global read-write channel 2 to on-chip read-write interface 2 of the FPGA to implement data transmission with the outside of the FPGA.
The method provided by the above embodiment often occupies several on-chip read-write interfaces of the FPGA, and on-chip read-write interface resources are precious. Therefore, in another embodiment, only one on-chip read-write interface of the FPGA is occupied: the BRAM modules located in the same column that implement local storage units are connected, through the same global read-write channel, to one active end of a multiplexer MUX, and the fixed end of the multiplexer MUX is connected to the on-chip read-write interface of the FPGA. By gating each active end of the multiplexer MUX, each global read-write channel can be gated, and the outside of the FPGA can exchange data through the gated global read-write channel with any local storage unit connected to it. For example, in the structure diagram shown in fig. 5, which corresponds to fig. 2, the local storage units in memory computing units 1 to 4 are connected through global read-write channel 1 to one active end of the multiplexer MUX, the local storage units in memory computing units 5 to 8 are connected through global read-write channel 2 to the other active end of the multiplexer MUX, and the fixed end of the multiplexer MUX is connected to an on-chip read-write interface of the FPGA to implement data transmission with the outside of the chip. Compared with the embodiment shown in fig. 2, the method of this embodiment occupies only one on-chip read-write interface of the FPGA.
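The gating of global read-write channels onto a single on-chip interface can be sketched as follows. The channel contents and unit names are illustrative assumptions:

```python
# Sketch of the shared on-chip read-write interface: a multiplexer gates one
# of several global read-write channels onto the single fixed end, so the
# outside of the FPGA reaches only the local storage units on the gated channel.
class ChannelMux:
    def __init__(self, channels):
        self.channels = channels  # active ends: one dict per global channel
        self.select = 0           # which channel is gated to the fixed end

    def gate(self, index):
        """Gate the active end `index` through to the fixed end."""
        if not 0 <= index < len(self.channels):
            raise IndexError("no such global read-write channel")
        self.select = index

    def read(self, local_unit):
        """Fixed end: the on-chip interface reads a local storage unit
        reachable through the currently gated channel."""
        return self.channels[self.select][local_unit]

mux = ChannelMux([
    {"unit1": b"\x01", "unit4": b"\x04"},   # global channel 1 (units 1-4)
    {"unit5": b"\x05", "unit8": b"\x08"},   # global channel 2 (units 5-8)
])
mux.gate(1)   # gate global read-write channel 2 onto the on-chip interface
```

One `ChannelMux` stands in for the single on-chip read-write interface; switching `gate()` between channels is what lets the design spend one interface instead of one per column.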
In the traditional von Neumann architecture, the instructions and data of each memory computing unit are stored in a mixed manner. To improve data processing efficiency, in one embodiment the instructions and data of each memory computing unit are instead stored separately according to the Harvard architecture: the local storage unit in a memory computing unit comprises an instruction local storage unit for storing the instructions of the memory computing unit and a data local storage unit for storing the data of the memory computing unit, so that instructions and data do not interfere with each other during processor operation, improving data processing efficiency. The instruction local storage unit is implemented by at least one BRAM module and the data local storage unit is implemented by at least one BRAM module, so on this basis a typical sub-area contains at least one DSP module, two BRAM modules and 300 CLB modules.
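The Harvard-style split can be illustrated with a minimal model in which instruction fetch and data access go to separate memories. The class, sizes, and values below are illustrative, not taken from the application:

```python
# Sketch of a Harvard-style local storage unit: separate instruction and
# data memories, so an instruction fetch and a data load/store never
# contend for the same storage.
class HarvardLocalStore:
    def __init__(self, imem_words, dmem_words):
        self.imem = [0] * imem_words  # instruction local storage unit
        self.dmem = [0] * dmem_words  # data local storage unit

    def fetch(self, pc):
        """Instruction port: read the instruction at program counter `pc`."""
        return self.imem[pc]

    def load(self, addr):
        """Data port: read the data word at `addr`."""
        return self.dmem[addr]

    def store(self, addr, value):
        """Data port: write `value` to the data word at `addr`."""
        self.dmem[addr] = value

store = HarvardLocalStore(imem_words=4, dmem_words=4)
store.imem[0] = 0xA1      # program word written into the instruction memory
store.store(2, 42)        # data word written through the data port
```

In hardware the two memories map onto the (at least) two BRAM modules per sub-area noted above, one for instructions and one for data.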
When the local storage unit in each memory computing unit comprises an instruction local storage unit and a data local storage unit, the storage sizes of the instruction local storage units of any two memory computing units may be the same or different, and the storage sizes of the data local storage units of any two memory computing units may likewise be the same or different.
When the local storage unit in each memory computing unit comprises an instruction local storage unit and a data local storage unit: (1) In one embodiment, the instruction local storage unit and the data local storage unit in a memory computing unit share a global read-write channel for data transmission, so that the instruction local storage units and data local storage units implemented by the BRAM modules located in the same column transmit data with the outside of the FPGA chip through the same global read-write channel. In this embodiment, the global read-write channel may be connected to an on-chip read-write interface of the FPGA according to the method of the embodiment shown in fig. 2, or to an active end of the multiplexer MUX according to the method of the embodiment shown in fig. 5. Taking the structure shown in fig. 2 as an example, as shown in fig. 6, the instruction local storage units and data local storage units in memory computing units 1 to 4 are all connected to global read-write channel 1, and the instruction local storage units and data local storage units in memory computing units 5 to 8 are all connected to global read-write channel 2. (2) In another embodiment, the instruction local storage unit and the data local storage unit use separate global read-write channels: the instruction local storage unit in a memory computing unit transmits instructions via an instruction global read-write channel, and the data local storage unit transmits data via a data global read-write channel. The instruction local storage units implemented by BRAM modules located in the same column transmit data with the outside of the FPGA chip through the same instruction global read-write channel, and the data local storage units implemented by BRAM modules located in the same column transmit data with the outside of the FPGA chip through the same data global read-write channel.
Likewise, in this embodiment, each instruction global read-write channel and each data global read-write channel may be connected to an on-chip read-write interface of the FPGA according to the method of the embodiment shown in fig. 2, or to an active end of the multiplexer MUX according to the method of the embodiment shown in fig. 5. Taking the structure shown in fig. 5 as an example, as shown in fig. 7, the instruction local storage units in memory computing units 1 to 4 are all connected to instruction global read-write channel 1, the data local storage units in memory computing units 1 to 4 are all connected to data global read-write channel 1, the instruction local storage units in memory computing units 5 to 8 are all connected to instruction global read-write channel 2, and the data local storage units in memory computing units 5 to 8 are all connected to data global read-write channel 2; instruction global read-write channel 1, data global read-write channel 1, instruction global read-write channel 2 and data global read-write channel 2 are each connected to one active end of the multiplexer MUX.
In another embodiment, the resource modules in each sub-area are further used to implement the control signals and status signals of the corresponding memory computing unit, the control signals being used to control the operating state of the memory computing unit and the status signals being used to indicate the operating state of the memory computing unit. The control signals and status signals of a memory computing unit are connected to other circuit structures in the FPGA, or are led to the outside of the FPGA. That is, the control signals and status signals required by each memory computing unit when performing memory computing operations are implemented by the resource modules within the corresponding sub-area.
The control signals of each memory computing unit include at least one of an enable signal, a clock signal, a reset signal and an interrupt signal: the enable signal controls the corresponding memory computing unit to execute instructions or stop executing instructions, the reset signal resets the memory computing unit, and the interrupt signal provides external interrupt capability. The control signals of any two memory computing units may be the same or different.
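The effect of these control signals can be sketched behaviourally. This is an illustrative software model of the enable/reset/interrupt semantics described above, not the circuit itself:

```python
# Behavioural sketch of a memory computing unit's control signals: enable
# starts or stops instruction execution, reset returns the unit to its
# initial state, and interrupt latches an external request.
class UnitControl:
    def __init__(self):
        self.running = False    # status signal: current operating state
        self.pc = 0             # position in the instruction sequence
        self.irq_pending = False

    def enable(self, on):
        self.running = on       # enable signal: execute / stop executing

    def reset(self):
        """Reset signal: return the unit to its initial state."""
        self.running, self.pc, self.irq_pending = False, 0, False

    def interrupt(self):
        self.irq_pending = True  # interrupt signal: external interrupt request

    def step(self):
        if self.running:
            self.pc += 1         # execute one instruction while enabled

ctrl = UnitControl()
ctrl.enable(True)
ctrl.step()
ctrl.step()
ctrl.interrupt()
```

Driving these signals from other circuit structures in the FPGA, or from outside the chip, corresponds to the two connection options described above.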
The above are only preferred embodiments of the present application, and the present application is not limited to the above examples. It is to be understood that other modifications and variations directly derived or conceived by those skilled in the art without departing from the spirit and concept of the present application are deemed to be included within the protection scope of the present application.