CN112734020B - Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
- Publication number
- CN112734020B (application CN202011587375.4A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- input
- data
- addition
- multiplication
- Prior art date
- 2020-12-28
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
Abstract
The invention discloses a convolution multiply-accumulate hardware acceleration device, system and method for a convolutional neural network, relating to the technical field of convolutional neural networks. It solves the problem that floating-point convolution multiply-accumulate designs based on the fixed-point DSP hard cores of an FPGA cannot reach line-rate computational throughput, improving the line-rate throughput of conventional floating-point convolution multiply-accumulate by at least a factor of three. The convolution multiplication calculation unit performs the multiplication of the input feature map floating-point data with the input parameter floating-point data and outputs the floating-point results of the convolution multiplication matrix. The convolution addition tree calculation unit performs the data accumulation associated with one output feature map. The convolution forward addition chain calculation unit completes the repeated data accumulation associated with the same output feature map. Based on this scheme, the invention removes the computational throughput bottleneck of the existing closed-loop convolution multiply-accumulate mode and achieves line-rate computational throughput in floating-point convolution multiply-accumulate computation.
Description
Technical Field
The invention relates to the technical field of convolutional neural networks, and in particular to a convolution multiply-accumulate hardware acceleration device, system and method for a convolutional neural network.
Background
In FPGA-based hardware-accelerated computation of convolutional neural networks, convolution is the dominant computational operation. Constrained by the achievable acceleration performance and the on-chip resources of the FPGA, most FPGA convolution designs use fixed-point operands, performing fixed-point convolution multiply-accumulate operations on the FPGA's fixed-point DSP hard cores; for fixed-point convolution this reaches line-rate computational throughput. The conventional FPGA convolution multiply-accumulate design based on fixed-point DSP hard cores contains a closed-loop multiply-accumulate operation. In fixed-point precision this operation has low latency, one clock cycle, so line-rate throughput is achievable. In floating-point precision, however, the latency is several clock cycles, and because of the closed-loop multiply-accumulate structure the line-rate throughput requirement cannot be met, greatly reducing the computational throughput of FPGA-accelerated floating-point convolutional neural networks.
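For illustration (the specific latency figure is an assumption, not taken from the text): if a floating-point adder has a latency of three clock cycles, a closed-loop accumulator must wait for the previous sum before adding the next product, so it can accept a new operand for the same accumulation only once every three cycles, i.e. one third of line rate. This is consistent with the at-least-threefold throughput improvement claimed for the open-loop design described below.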
Therefore, for floating-point convolutional neural network accelerated computation, where higher precision is required, a floating-point convolution multiply-accumulate calculation method capable of line-rate computational throughput needs to be designed.
Disclosure of Invention
In view of this, the present invention provides a convolution multiply-accumulate hardware acceleration device, system and method for a convolutional neural network, which solve the problem that floating-point convolution multiply-accumulate designs based on the fixed-point DSP hard cores of an FPGA cannot reach line-rate computational throughput, improving the line-rate throughput of conventional floating-point convolution multiply-accumulate by at least a factor of three.
To this end, the technical solution of the invention is as follows:
a convolution multiplication and accumulation hardware accelerator for a convolution neural network comprises a convolution multiplication unit, a convolution addition tree unit and a convolution forward addition chain unit.
The convolution multiplication unit comprises PE × SIMD floating-point multipliers. The inputs of each floating-point multiplier are one input feature map operand and one input parameter operand, and the latency of each floating-point multiplier is more than one clock cycle; PE is the number of input feature map operands, SIMD is the number of input parameter operands, and PE and SIMD are both powers of 2. The convolution multiplication unit outputs the PE × SIMD multiplication results as the input data of the convolution addition tree unit.
The convolution addition tree unit comprises PE floating-point addition trees. Its input data are the PE × SIMD multiplication results; the SIMD multiplication results corresponding to each input feature map operand form one group, and each of the PE groups is fed into one floating-point addition tree.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder is more than one clock cycle. The convolution addition tree unit outputs the PE addition tree results as the input of the convolution forward addition chain unit.
The convolution forward addition chain unit comprises PE addition chains. The input data of each addition chain is the corresponding addition tree result; an addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The output of the convolution forward addition chain unit is the output feature map data, composed of the PE floating-point values output by all addition chains; this output feature map data is the convolution multiply-accumulate result.
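To make the open-loop structure concrete, the following behavioral sketch models one clock cycle of the convolution multiplication unit and the convolution addition tree unit, following the operand counts given above; the Python modelling, function names and list-based interfaces are illustrative assumptions, not the patented RTL.

```python
# Behavioral sketch of one clock cycle of the convolution multiplication
# unit and the convolution addition tree unit (illustrative only).

PE = 4     # number of input feature map operands per cycle (power of 2)
SIMD = 4   # number of input parameter operands per cycle (power of 2)

def multiply_unit(features, params):
    """PE x SIMD open-loop floating-point multipliers: every product is
    independent, so new operands can be accepted every cycle regardless
    of the multiplier latency."""
    return [[features[p] * params[s] for s in range(SIMD)]
            for p in range(PE)]

def addition_tree(group):
    """One floating-point addition tree: SIMD-1 adders arranged as a
    binary tree reduce a group of SIMD products to one partial sum,
    again with no feedback path."""
    level = list(group)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def addition_tree_unit(products):
    """PE trees, one per group of SIMD multiplication results."""
    return [addition_tree(row) for row in products]

# One cycle: 4 feature operands x 4 parameter operands -> 4 partial sums.
partial = addition_tree_unit(multiply_unit([1.0, 2.0, 3.0, 4.0],
                                           [0.5, 0.5, 0.5, 0.5]))
print(partial)   # [2.0, 4.0, 6.0, 8.0]
```

Because no multiplier or adder output is fed back, a new operand set can enter every cycle regardless of the floating-point latency; note that the tree changes the order of additions, so results may differ from a sequential accumulation in the last rounding bits.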
Further, each addition chain accumulates input data spanning more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated; specifically:
Each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated.
For the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE.
The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
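A behavioral sketch of one forward addition chain follows; buffering the n successive tree results into a list is a modelling assumption, since in hardware the chained adders operate on a stream.

```python
import math

def forward_addition_chain(samples):
    """Behavioral sketch of one forward addition chain. Assumption:
    'samples' holds the n addition tree results produced on n
    successive clock cycles, with n a power of 2. Each of the log2(n)
    chained adders pairs outputs of the previous stage that stem from
    adjacent clock cycles, so the chain folds n per-cycle partial sums
    into one result while still accepting a new input every cycle;
    no value is ever fed back."""
    n = len(samples)
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    level = list(samples)
    for _ in range(int(math.log2(n))):   # log2(n) adders in the chain
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# n = 8 accumulation cycles -> log2(8) = 3 adders in the chain.
print(forward_addition_chain([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))  # 36.0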
The invention also provides a convolution multiply-accumulate hardware acceleration system for a convolutional neural network, comprising a host and an FPGA convolution computation system connected by a PCIe bus.
The FPGA convolution computation system consists of a memory and on-chip convolution inference computation logic.
The on-chip convolution inference computation logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a convolution multiply-accumulate calculation unit and an output data cache scheduling unit.
The input feature map data cache scheduling unit stores the input feature map data and feeds it into the convolution multiply-accumulate calculation unit.
The input parameter data cache scheduling unit stores the input parameter data and feeds it into the convolution multiply-accumulate calculation unit.
The convolution multiply-accumulate calculation unit adopts the convolution multiply-accumulate hardware acceleration device structure described above; it performs the convolution multiply-accumulate operation on the input parameter data with the input feature map data, and sends the output feature map data, i.e. the convolution multiply-accumulate result, to the output data cache scheduling unit.
The output data cache scheduling unit sends the convolution multiply-accumulate result to the memory for storage.
The host reads the convolution multiply-accumulate result from the memory through the PCIe bus.
Another embodiment of the present invention provides a convolution multiply-accumulate hardware acceleration method for a convolutional neural network, comprising the following steps:
Step one: acquire PE input feature map operands and SIMD input parameter operands, where PE and SIMD are both powers of 2; perform the floating-point multiplication of each input feature map operand with each input parameter operand using PE × SIMD floating-point multipliers, and output the PE × SIMD multiplication results as the input data of the convolution addition tree unit.
The latency of each floating-point multiplier is more than one clock cycle.
Step two: take the SIMD multiplication results corresponding to each input feature map operand as one group, dividing the PE × SIMD multiplication results into PE groups; feed each group into one floating-point addition tree, PE floating-point addition trees in total, and output the PE addition tree results as the input of the convolution forward addition chain unit.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder is more than one clock cycle.
Step three: employ PE addition chains, where the input data of each addition chain is the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The PE floating-point values output by all addition chains form the output feature map data, which is the convolution multiply-accumulate result.
Further, each addition chain accumulates input data spanning more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated; specifically:
Each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated.
For the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE.
The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
Advantageous effects:
aiming at the deep convolutional neural network reasoning calculation, the convolution calculation of each layer can be expressed as input feature map floating point data and input parameter floating point data, the output feature map floating point data is output through convolution specific calculation, and the multilayer deep convolutional neural network result is obtained by performing iterative calculation layer by layer according to the mode and finally outputting the convolutional neural network calculation result. The convolution multiplication calculation unit is mainly used for finishing multiplication operation of input feature map floating point data and input parameter floating point data and outputting convolution multiplication matrix calculation result floating point data. The floating-point multiplication calculation of the unit is an open-loop output, so that the problem that the component cannot realize linear speed throughput due to the time delay of the floating-point multiplication operation is avoided. The convolution addition tree calculation unit is mainly used for finishing a data accumulation function related to the same output characteristic graph, and the operation of the addition tree is adopted, so that the floating point accumulation calculation can be output in an open loop mode, the problem that the part cannot be subjected to linear speed throughput due to the floating point accumulation operation delay is avoided, and finally the floating point data is output as the convolution addition tree calculation matrix result floating point data. And the convolution forward addition chain calculation unit is used for finishing a data accumulation function associated with the same output characteristic diagram for multiple times, the calculation part is still an open-loop accumulated output, the problem that the part cannot perform linear speed throughput due to floating point accumulation operation delay is avoided, and finally the output characteristic diagram floating point data is output. Based on the scheme, the invention can improve the computation throughput bottleneck brought by the existing volume multiplication accumulation closed-loop operation mode, and has the effect of achieving the linear speed computation throughput capacity in floating point volume multiplication accumulation computation.
Drawings
FIG. 1 is a block diagram of a convolution multiplication matrix according to an embodiment of the present invention;
FIG. 2 is a block diagram of a convolution addition tree calculation matrix according to an embodiment of the present invention;
FIG. 3 is a diagram of a convolution forward addition chain calculation matrix structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a convolution inference calculation system based on FPGA acceleration according to an embodiment of the present invention;
FIG. 5 is a flowchart of a forward addition chain based floating point convolution multiply accumulate calculation according to an embodiment of the present invention;
FIG. 6 is a flowchart of convolution inference calculation based on FPGA acceleration according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a convolution multiply-accumulate hardware acceleration device for a convolutional neural network, comprising a convolution multiplication unit, a convolution addition tree unit and a convolution forward addition chain unit.
The structure of the convolution multiplication unit is shown in FIG. 1: it comprises PE × SIMD floating-point multipliers, the inputs of each floating-point multiplier being one input feature map operand and one input parameter operand, with a latency of more than one clock cycle per multiplier. PE (Processing Element) is the number of input feature map operands and SIMD (Single Instruction Multiple Data) is the number of input parameter operands; both are powers of 2, and in the example of FIG. 1, PE = 4 and SIMD = 4, so the multiplication result comprises PE × SIMD values. The convolution multiplication unit outputs these PE × SIMD multiplication results as the input data of the convolution addition tree unit. The convolution multiplication calculation matrix of FIG. 1 therefore contains PE × SIMD floating-point multipliers, whose inputs are the input feature map data Ipt0, Ipt1, Ipt2 and Ipt3 and the input parameter data Wgt0, Wgt1, Wgt2 and Wgt3. The latency of each floating-point multiplier may be several clock cycles; owing to the open-loop design this does not affect the computational throughput of the multiplication matrix, which can reach line-rate computational throughput.
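As a concrete throughput figure (the clock frequency is an assumption for illustration, not specified in the text): with PE = SIMD = 4 the matrix contains 16 floating-point multipliers and delivers 16 products per clock cycle; at, say, a 200 MHz FPGA clock this amounts to 3.2 × 10^9 floating-point multiplications per second, sustained every cycle because no multiplier output is fed back.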
The structure of the convolution addition tree unit is shown in FIG. 2: it comprises PE floating-point addition trees, and its input data are the PE × SIMD multiplication results; the SIMD multiplication results corresponding to each input feature map operand form one group, and each of the PE groups is fed into one floating-point addition tree. Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group, the latency of each floating-point adder being more than one clock cycle; the convolution addition tree unit outputs the PE addition tree results as the input of the convolution forward addition chain unit. The latency of each floating-point adder may be several clock cycles; owing to the open-loop design this does not affect the computational throughput of the addition tree calculation matrix, which can reach line-rate computational throughput.
The structure of the convolution forward addition chain unit is shown in FIG. 3: it comprises PE addition chains, the input data of each addition chain being the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The output of the convolution forward addition chain unit is the output feature map data, composed of the PE floating-point values output by all addition chains; this output feature map data is the convolution multiply-accumulate result. Because the accumulation is designed as an open-loop addition chain, the multi-cycle latency of the floating-point accumulation unit no longer prevents the accumulated results from being processed at line rate.
In the embodiment of the invention, each addition chain accumulates input data spanning more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated; specifically:
Each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated.
For the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE.
The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
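As a worked instance of the chain length rule: to accumulate n = 8 clock cycles of addition tree results, each chain needs log2(8) = 3 adders; the first adder pairs the results of cycles (1, 2), (3, 4), (5, 6) and (7, 8), the second pairs the four resulting sums, and the third produces the single accumulated output.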
Another embodiment of the present invention provides a convolution multiply-accumulate hardware acceleration system for a convolutional neural network, whose structure is shown in FIG. 4. The system is an example of a convolution inference computation system based on FPGA (Field Programmable Gate Array) acceleration; it comprises a host part and an FPGA-accelerated convolution inference computation part, the host and the system exchanging data and control through a PCIe (Peripheral Component Interconnect Express) link.
The host and the FPGA convolution computing system are connected by a PCIe bus.
The FPGA convolution computation system consists of a memory and on-chip convolution inference computation logic.
The on-chip convolution inference computation logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a convolution multiply-accumulate calculation unit and an output data cache scheduling unit.
The input feature map data cache scheduling unit stores the input feature map data and feeds it into the convolution multiply-accumulate calculation unit.
The input parameter data cache scheduling unit stores the input parameter data and feeds it into the convolution multiply-accumulate calculation unit.
The convolution multiply-accumulate calculation unit adopts the convolution multiply-accumulate hardware acceleration device structure described above; it performs the convolution multiply-accumulate operation on the input parameter data with the input feature map data, and sends the output feature map data, i.e. the convolution multiply-accumulate result, to the output data cache scheduling unit.
The output data cache scheduling unit sends the convolution multiply-accumulate result to the memory for storage.
The host reads the convolution multiply-accumulate result from the memory through the PCIe bus.
The FPGA-accelerated convolution inference calculation flow is shown in FIG. 6: in the first step the system is initialised; in the second step the host writes the data to be computed into the memory; in the third step the input feature map data and weight data are scheduled from the memory into the floating-point convolution multiply-accumulate calculation unit; in the fourth step the floating-point convolution multiply-accumulate calculation is performed; in the fifth step the output feature data are scheduled out; in the sixth step the output data are written into the memory; and in the seventh step the host reads the convolution calculation result from the memory.
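A host-side sketch of this flow is given below. The driver API names (FpgaDevice-style methods) and the buffer addresses are hypothetical, introduced only to illustrate the seven steps; the text does not specify a host programming interface, and a real deployment would use the vendor's PCIe/DMA driver.

```python
# Host-side sketch mirroring the seven steps of FIG. 6 (hypothetical API).

def run_conv_layer(dev, feature_map, weights):
    dev.reset()                                   # step 1: system initialisation
    dev.write_mem(addr=0x1000, data=feature_map)  # step 2: host writes input
    dev.write_mem(addr=0x2000, data=weights)      #         feature map and weights
    dev.start_conv(src=0x1000, wgt=0x2000,        # steps 3-6: memory scheduling,
                   dst=0x3000)                    # floating-point MAC, write-back
    dev.wait_done()
    return dev.read_mem(addr=0x3000)              # step 7: host reads the result
```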
Another embodiment of the present invention provides a convolution multiply-accumulate hardware acceleration method for a convolutional neural network, whose flow is shown in FIG. 5 and which comprises the following steps:
Step one: acquire PE input feature map operands and SIMD input parameter operands, where PE and SIMD are both powers of 2; perform the floating-point multiplication of each input feature map operand with each input parameter operand using PE × SIMD floating-point multipliers, and output the PE × SIMD multiplication results as the input data of the convolution addition tree unit.
The latency of each floating-point multiplier is more than one clock cycle.
Step two: take the SIMD multiplication results corresponding to each input feature map operand as one group, dividing the PE × SIMD multiplication results into PE groups; feed each group into one floating-point addition tree, PE floating-point addition trees in total, and output the PE addition tree results as the input of the convolution forward addition chain unit.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder is more than one clock cycle.
Step three: employ PE addition chains, where the input data of each addition chain is the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The PE floating-point values output by all addition chains form the output feature map data, which is the convolution multiply-accumulate result.
Each addition chain accumulates input data spanning more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated; specifically:
Each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated.
For the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE.
The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
An application example of the above convolution multiply-accumulate hardware acceleration scheme is deep-learning-based convolutional neural network image classification: the convolution multiply-accumulate hardware acceleration device completes the computation of one layer of a multi-layer convolutional neural network; the network input is picture data or feature map data, the input parameters are the specific parameters of each layer of the convolutional neural network, which can be obtained by training an image classifier with deep learning, and the output is feature map data or the final image classification result.
Another example is a convolution-based image filtering algorithm: the inputs are image data and filter parameters, the output is the filtered image data, and the convolution multiply-accumulate hardware acceleration device completes the image convolution filtering operation for the given filter parameters.
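As an illustration of how such a filtering task maps onto the device's multiply-accumulate operation (the flattened data layout is an assumption; since SIMD must be a power of 2, the nine taps of a 3 × 3 kernel would be zero-padded to 16 lanes in practice):

```python
# Illustrative mapping of a 3x3 image filter onto the accelerator's
# elementary multiply-accumulate operation (assumed layout).

def filter_pixel(window3x3, kernel3x3):
    """One output pixel of a 3x3 filter as a 9-term multiply-accumulate,
    i.e. the operation the acceleration device performs per output."""
    taps = [v for row in kernel3x3 for v in row]      # 9 filter parameters
    pixels = [v for row in window3x3 for v in row]    # 9 window values
    return sum(p * t for p, t in zip(pixels, taps))

# A 3x3 mean (blur) filter applied to one window:
blur = [[1.0 / 9.0] * 3 for _ in range(3)]
window = [[10.0, 20.0, 30.0],
          [40.0, 50.0, 60.0],
          [70.0, 80.0, 90.0]]
print(filter_pixel(window, blur))   # approximately 50.0
```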
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A convolution multiply-accumulate hardware acceleration device for a convolutional neural network, characterised by comprising a convolution multiplication unit, a convolution addition tree unit and a convolution forward addition chain unit;
the convolution multiplication unit comprises PE × SIMD floating-point multipliers; the inputs of each floating-point multiplier comprise input feature map data and input parameter data, and the latency of each floating-point multiplier is more than one clock cycle; PE is the number of input feature map operands, SIMD is the number of input parameter operands, and PE and SIMD are both powers of 2; the convolution multiplication unit outputs the PE × SIMD multiplication results as the input data of the convolution addition tree unit;
the convolution addition tree unit comprises PE floating-point addition trees; its input data are the PE × SIMD multiplication results, the SIMD multiplication results corresponding to each input feature map operand forming one group, and each of the PE groups being fed into one floating-point addition tree;
each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group, the latency of each floating-point adder being more than one clock cycle; the convolution addition tree unit outputs the PE addition tree results as the input of the convolution forward addition chain unit;
the convolution forward addition chain unit comprises PE addition chains, the input data of each addition chain being the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated; the output of the convolution forward addition chain unit is the output feature map data, composed of the PE floating-point values output by all addition chains, and this output feature map data is the convolution multiply-accumulate result.
2. The device of claim 1, characterised in that each addition chain accumulates input data spanning more than one clock cycle and the length of the addition chain is determined by the number of clock cycles to be accumulated, specifically:
each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated;
for the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE;
the first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder;
the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder;
and so on;
until the last adder outputs the result of the current addition chain.
3. A convolution multiply-accumulate hardware acceleration system for a convolutional neural network, characterised by comprising a host and an FPGA convolution computation system connected through a PCIe bus;
the FPGA convolution computation system consists of a memory and on-chip convolution inference computation logic;
the on-chip convolution inference computation logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a convolution multiply-accumulate calculation unit and an output data cache scheduling unit;
the input feature map data cache scheduling unit stores the input feature map data and feeds it into the convolution multiply-accumulate calculation unit;
the input parameter data cache scheduling unit stores the input parameter data and feeds it into the convolution multiply-accumulate calculation unit;
the convolution multiply-accumulate calculation unit adopts the convolution multiply-accumulate hardware acceleration device structure of the convolutional neural network according to claim 1 or 2, performs the convolution multiply-accumulate operation on the input parameter data with the input feature map data, and sends the output feature map data, as the convolution multiply-accumulate result, to the output data cache scheduling unit;
the output data cache scheduling unit sends the convolution multiply-accumulate result to the memory for storage;
and the host reads the convolution multiply-accumulate result from the memory through the PCIe bus.
4. A convolution multiply-accumulate hardware acceleration method for a convolutional neural network, characterised by comprising the following steps:
step one: acquire PE input feature map operands and SIMD input parameter operands, where PE and SIMD are both powers of 2; perform the floating-point multiplication of each input feature map operand with each input parameter operand using PE × SIMD floating-point multipliers, and output the PE × SIMD multiplication results as the input data of the convolution addition tree unit;
the latency of each floating-point multiplier is more than one clock cycle;
step two: take the SIMD multiplication results corresponding to each input feature map operand as one group, dividing the PE × SIMD multiplication results into PE groups; feed each group into one floating-point addition tree, PE floating-point addition trees in total, and output the PE addition tree results as the input of the convolution forward addition chain unit;
each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group, the latency of each floating-point adder being more than one clock cycle;
step three: employ PE addition chains, the input data of each addition chain being the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated; the PE floating-point values output by all addition chains form the output feature map data, which is the convolution multiply-accumulate result.
5. The method of claim 4, characterised in that each addition chain accumulates input data spanning more than one clock cycle and the length of the addition chain is determined by the number of clock cycles to be accumulated, specifically:
each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated;
for the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE;
the first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder;
the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder;
and so on;
until the last adder outputs the result of the current addition chain.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011587375.4A CN112734020B (en) | 2020-12-28 | 2020-12-28 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011587375.4A CN112734020B (en) | 2020-12-28 | 2020-12-28 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112734020A CN112734020A (en) | 2021-04-30 |
CN112734020B true CN112734020B (en) | 2022-03-25 |
Family
ID=75607380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011587375.4A Active CN112734020B (en) | 2020-12-28 | 2020-12-28 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112734020B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113298236B (en) * | 2021-06-18 | 2023-07-21 | 中国科学院计算技术研究所 | Low-precision neural network computing device and acceleration method based on data flow structure |
CN113568597B (en) * | 2021-07-15 | 2024-07-26 | 上海交通大学 | Convolution neural network-oriented DSP compact word multiplication method and system |
CN113361699B (en) * | 2021-07-16 | 2023-05-26 | 安谋科技(中国)有限公司 | Multiplication circuit, system on chip and electronic device |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102629189B * | 2012-03-15 | 2014-12-10 | 湖南大学 | Pipelined floating-point multiply-accumulate method based on FPGA |
US20180046903A1 (en) * | 2016-08-12 | 2018-02-15 | DeePhi Technology Co., Ltd. | Deep processing unit (dpu) for implementing an artificial neural network (ann) |
US20180053091A1 (en) * | 2016-08-17 | 2018-02-22 | Hawxeye, Inc. | System and method for model compression of neural networks for use in embedded platforms |
CN108133270B (en) * | 2018-01-12 | 2020-08-04 | 清华大学 | Convolutional neural network acceleration method and device |
CN109063825B (en) * | 2018-08-01 | 2020-12-29 | 清华大学 | Convolutional neural network accelerator |
CN109828744B (en) * | 2019-01-18 | 2020-09-08 | 东北师范大学 | Configurable floating point vector multiplication IP core based on FPGA |
WO2020215124A1 (en) * | 2019-04-26 | 2020-10-29 | The University Of Sydney | An improved hardware primitive for implementations of deep neural networks |
CN110852416B (en) * | 2019-09-30 | 2022-10-04 | 梁磊 | CNN hardware acceleration computing method and system based on low-precision floating point data representation form |
Also Published As
Publication number | Publication date |
---|---|
CN112734020A (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862374B (en) | Neural network processing system and processing method based on assembly line | |
CN107609641B (en) | Sparse neural network architecture and implementation method thereof | |
CN111459877B (en) | Winograd YOLOv2 target detection model method based on FPGA acceleration | |
CN112734020B (en) | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network | |
JP6857286B2 (en) | Improved performance of neural network arrays | |
CN108280514B (en) | FPGA-based sparse neural network acceleration system and design method | |
CN108108809B (en) | Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof | |
CN107239829B (en) | Method for optimizing artificial neural network | |
CN110321997B (en) | High-parallelism computing platform, system and computing implementation method | |
CN112840356A (en) | Operation accelerator, processing method and related equipment | |
CN105681628A (en) | Convolution network arithmetic unit, reconfigurable convolution neural network processor and image de-noising method of reconfigurable convolution neural network processor | |
CN113344179B (en) | IP core of binary convolution neural network algorithm based on FPGA | |
CN110543936B (en) | Multi-parallel acceleration method for CNN full-connection layer operation | |
WO2023065983A1 (en) | Computing apparatus, neural network processing device, chip, and data processing method | |
Abdelsalam et al. | An efficient FPGA-based overlay inference architecture for fully connected DNNs | |
CN110580519B (en) | Convolution operation device and method thereof | |
CN113283587A (en) | Winograd convolution operation acceleration method and acceleration module | |
CN113516236A (en) | VGG16 network parallel acceleration processing method based on ZYNQ platform | |
CN114626516A (en) | Neural network acceleration system based on floating point quantization of logarithmic block | |
CN111008691B (en) | Convolutional neural network accelerator architecture with weight and activation value both binarized | |
CN109948787B (en) | Arithmetic device, chip and method for neural network convolution layer | |
CN112149047A (en) | Data processing method and device, storage medium and electronic device | |
CN112836793B (en) | Floating point separable convolution calculation accelerating device, system and image processing method | |
CN112988229B (en) | Convolutional neural network resource optimization configuration method based on heterogeneous computation | |
CN116167425B (en) | Neural network acceleration method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | |
Effective date of registration: 2022-06-27
Address after: 100083 No. 211 Middle Fourth Ring Road, Haidian District, Beijing
Patentee after: NO.15 INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.; CLP Taiji (Group) Co., Ltd
Address before: 100083 No. 211 Middle Fourth Ring Road, Haidian District, Beijing
Patentee before: NO.15 INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.