CN112734020B - Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
- Publication number
- CN112734020B (application CN202011587375.4A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- input
- data
- addition
- multiplication
- Prior art date
- 2020-12-28
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
Abstract
The invention discloses a convolution multiply-accumulate hardware acceleration device, system and method for a convolutional neural network, relating to the technical field of convolutional neural networks. It solves the problem that floating-point convolution multiply-accumulate designs based on the fixed-point DSP hard cores of an FPGA cannot reach line-rate computational throughput, improving the line-rate throughput of conventional floating-point convolution multiply-accumulate by at least a factor of three. The convolution multiplication calculation unit performs the multiplication of the input feature map floating-point data with the input parameter floating-point data and outputs the floating-point results of the convolution multiplication matrix. The convolution addition tree calculation unit performs the data accumulation associated with one output feature map. The convolution forward addition chain calculation unit completes the repeated data accumulation associated with the same output feature map. Based on this scheme, the invention removes the computational throughput bottleneck of the existing closed-loop convolution multiply-accumulate mode and achieves line-rate computational throughput in floating-point convolution multiply-accumulate computation.
Description
Technical Field
The invention relates to the technical field of convolutional neural networks, and in particular to a convolution multiply-accumulate hardware acceleration device, system and method for a convolutional neural network.
Background
In FPGA-based hardware-accelerated computation of convolutional neural networks, convolution is the dominant computational operation. Constrained by the achievable acceleration performance and the on-chip resources of the FPGA, most FPGA convolution designs use fixed-point operands, performing fixed-point convolution multiply-accumulate operations on the FPGA's fixed-point DSP hard cores; for fixed-point convolution this reaches line-rate computational throughput. The conventional FPGA convolution multiply-accumulate design based on fixed-point DSP hard cores contains a closed-loop multiply-accumulate operation. In fixed-point precision this operation has low latency, one clock cycle, so line-rate throughput is achievable. In floating-point precision, however, the latency is several clock cycles, and because of the closed-loop multiply-accumulate structure the line-rate throughput requirement cannot be met, greatly reducing the computational throughput of FPGA-accelerated floating-point convolutional neural networks.
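For illustration (the specific latency figure is an assumption, not taken from the text): if a floating-point adder has a latency of three clock cycles, a closed-loop accumulator must wait for the previous sum before adding the next product, so it can accept a new operand for the same accumulation only once every three cycles, i.e. one third of line rate. This is consistent with the at-least-threefold throughput improvement claimed for the open-loop design described below.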
Therefore, for floating-point convolutional neural network accelerated computation, where higher precision is required, a floating-point convolution multiply-accumulate calculation method capable of line-rate computational throughput needs to be designed.
Disclosure of Invention
In view of this, the present invention provides a convolution multiply-accumulate hardware acceleration device, system and method for a convolutional neural network, which solve the problem that floating-point convolution multiply-accumulate designs based on the fixed-point DSP hard cores of an FPGA cannot reach line-rate computational throughput, improving the line-rate throughput of conventional floating-point convolution multiply-accumulate by at least a factor of three.
To this end, the technical solution of the invention is as follows:
a convolution multiplication and accumulation hardware accelerator for a convolution neural network comprises a convolution multiplication unit, a convolution addition tree unit and a convolution forward addition chain unit.
The convolution multiplication unit comprises PE × SIMD floating-point multipliers. The inputs of each floating-point multiplier are one input feature map operand and one input parameter operand, and the latency of each floating-point multiplier is more than one clock cycle; PE is the number of input feature map operands, SIMD is the number of input parameter operands, and PE and SIMD are both powers of 2. The convolution multiplication unit outputs the PE × SIMD multiplication results as the input data of the convolution addition tree unit.
The convolution addition tree unit comprises PE floating-point addition trees. Its input data are the PE × SIMD multiplication results; the SIMD multiplication results corresponding to each input feature map operand form one group, and each of the PE groups is fed into one floating-point addition tree.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder is more than one clock cycle. The convolution addition tree unit outputs the PE addition tree results as the input of the convolution forward addition chain unit.
The convolution forward addition chain unit comprises PE addition chains. The input data of each addition chain is the corresponding addition tree result; an addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The output of the convolution forward addition chain unit is the output feature map data, composed of the PE floating-point values output by all addition chains; this output feature map data is the convolution multiply-accumulate result.
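To make the open-loop structure concrete, the following behavioral sketch models one clock cycle of the convolution multiplication unit and the convolution addition tree unit, following the operand counts given above; the Python modelling, function names and list-based interfaces are illustrative assumptions, not the patented RTL.

```python
# Behavioral sketch of one clock cycle of the convolution multiplication
# unit and the convolution addition tree unit (illustrative only).

PE = 4     # number of input feature map operands per cycle (power of 2)
SIMD = 4   # number of input parameter operands per cycle (power of 2)

def multiply_unit(features, params):
    """PE x SIMD open-loop floating-point multipliers: every product is
    independent, so new operands can be accepted every cycle regardless
    of the multiplier latency."""
    return [[features[p] * params[s] for s in range(SIMD)]
            for p in range(PE)]

def addition_tree(group):
    """One floating-point addition tree: SIMD-1 adders arranged as a
    binary tree reduce a group of SIMD products to one partial sum,
    again with no feedback path."""
    level = list(group)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def addition_tree_unit(products):
    """PE trees, one per group of SIMD multiplication results."""
    return [addition_tree(row) for row in products]

# One cycle: 4 feature operands x 4 parameter operands -> 4 partial sums.
partial = addition_tree_unit(multiply_unit([1.0, 2.0, 3.0, 4.0],
                                           [0.5, 0.5, 0.5, 0.5]))
print(partial)   # [2.0, 4.0, 6.0, 8.0]
```

Because no multiplier or adder output is fed back, a new operand set can enter every cycle regardless of the floating-point latency; note that the tree changes the order of additions, so results may differ from a sequential accumulation in the last rounding bits.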
Further, each addition chain accumulates input data spanning more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated; specifically:
Each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated.
For the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE.
The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
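A behavioral sketch of one forward addition chain follows; buffering the n successive tree results into a list is a modelling assumption, since in hardware the chained adders operate on a stream.

```python
import math

def forward_addition_chain(samples):
    """Behavioral sketch of one forward addition chain. Assumption:
    'samples' holds the n addition tree results produced on n
    successive clock cycles, with n a power of 2. Each of the log2(n)
    chained adders pairs outputs of the previous stage that stem from
    adjacent clock cycles, so the chain folds n per-cycle partial sums
    into one result while still accepting a new input every cycle;
    no value is ever fed back."""
    n = len(samples)
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    level = list(samples)
    for _ in range(int(math.log2(n))):   # log2(n) adders in the chain
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# n = 8 accumulation cycles -> log2(8) = 3 adders in the chain.
print(forward_addition_chain([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))  # 36.0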
The invention also provides a convolution multiply-accumulate hardware acceleration system for a convolutional neural network, comprising a host and an FPGA convolution computation system connected by a PCIe bus.
The FPGA convolution computation system consists of a memory and on-chip convolution inference computation logic.
The on-chip convolution inference computation logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a convolution multiply-accumulate calculation unit and an output data cache scheduling unit.
The input feature map data cache scheduling unit stores the input feature map data and feeds it into the convolution multiply-accumulate calculation unit.
The input parameter data cache scheduling unit stores the input parameter data and feeds it into the convolution multiply-accumulate calculation unit.
The convolution multiply-accumulate calculation unit adopts the convolution multiply-accumulate hardware acceleration device structure described above; it performs the convolution multiply-accumulate operation on the input parameter data with the input feature map data, and sends the output feature map data, i.e. the convolution multiply-accumulate result, to the output data cache scheduling unit.
The output data cache scheduling unit sends the convolution multiply-accumulate result to the memory for storage.
The host reads the convolution multiply-accumulate result from the memory through the PCIe bus.
Another embodiment of the present invention provides a convolution multiply-accumulate hardware acceleration method for a convolutional neural network, comprising the following steps:
Step one: acquire PE input feature map operands and SIMD input parameter operands, where PE and SIMD are both powers of 2; perform the floating-point multiplication of each input feature map operand with each input parameter operand using PE × SIMD floating-point multipliers, and output the PE × SIMD multiplication results as the input data of the convolution addition tree unit.
The latency of each floating-point multiplier is more than one clock cycle.
Step two: take the SIMD multiplication results corresponding to each input feature map operand as one group, dividing the PE × SIMD multiplication results into PE groups; feed each group into one floating-point addition tree, PE floating-point addition trees in total, and output the PE addition tree results as the input of the convolution forward addition chain unit.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder is more than one clock cycle.
Step three: employ PE addition chains, where the input data of each addition chain is the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The PE floating-point values output by all addition chains form the output feature map data, which is the convolution multiply-accumulate result.
Further, each addition chain accumulates input data spanning more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated; specifically:
Each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated.
For the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE.
The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
Advantageous effects:
aiming at the deep convolutional neural network reasoning calculation, the convolution calculation of each layer can be expressed as input feature map floating point data and input parameter floating point data, the output feature map floating point data is output through convolution specific calculation, and the multilayer deep convolutional neural network result is obtained by performing iterative calculation layer by layer according to the mode and finally outputting the convolutional neural network calculation result. The convolution multiplication calculation unit is mainly used for finishing multiplication operation of input feature map floating point data and input parameter floating point data and outputting convolution multiplication matrix calculation result floating point data. The floating-point multiplication calculation of the unit is an open-loop output, so that the problem that the component cannot realize linear speed throughput due to the time delay of the floating-point multiplication operation is avoided. The convolution addition tree calculation unit is mainly used for finishing a data accumulation function related to the same output characteristic graph, and the operation of the addition tree is adopted, so that the floating point accumulation calculation can be output in an open loop mode, the problem that the part cannot be subjected to linear speed throughput due to the floating point accumulation operation delay is avoided, and finally the floating point data is output as the convolution addition tree calculation matrix result floating point data. And the convolution forward addition chain calculation unit is used for finishing a data accumulation function associated with the same output characteristic diagram for multiple times, the calculation part is still an open-loop accumulated output, the problem that the part cannot perform linear speed throughput due to floating point accumulation operation delay is avoided, and finally the output characteristic diagram floating point data is output. Based on the scheme, the invention can improve the computation throughput bottleneck brought by the existing volume multiplication accumulation closed-loop operation mode, and has the effect of achieving the linear speed computation throughput capacity in floating point volume multiplication accumulation computation.
Drawings
FIG. 1 is a block diagram of a convolution multiplication matrix according to an embodiment of the present invention;
FIG. 2 is a block diagram of a convolution addition tree calculation matrix according to an embodiment of the present invention;
FIG. 3 is a diagram of a convolution forward addition chain calculation matrix structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a convolution inference calculation system based on FPGA acceleration according to an embodiment of the present invention;
FIG. 5 is a flowchart of a forward addition chain based floating point convolution multiply accumulate calculation according to an embodiment of the present invention;
FIG. 6 is a flowchart of convolution inference calculation based on FPGA acceleration according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a convolution multiply-accumulate hardware acceleration device for a convolutional neural network, comprising a convolution multiplication unit, a convolution addition tree unit and a convolution forward addition chain unit.
The structure of the convolution multiplication unit is shown in FIG. 1: it comprises PE × SIMD floating-point multipliers, the inputs of each floating-point multiplier being one input feature map operand and one input parameter operand, with a latency of more than one clock cycle per multiplier. PE (Processing Element) is the number of input feature map operands and SIMD (Single Instruction Multiple Data) is the number of input parameter operands; both are powers of 2, and in the example of FIG. 1, PE = 4 and SIMD = 4, so the multiplication result comprises PE × SIMD values. The convolution multiplication unit outputs these PE × SIMD multiplication results as the input data of the convolution addition tree unit. The convolution multiplication calculation matrix of FIG. 1 therefore contains PE × SIMD floating-point multipliers, whose inputs are the input feature map data Ipt0, Ipt1, Ipt2 and Ipt3 and the input parameter data Wgt0, Wgt1, Wgt2 and Wgt3. The latency of each floating-point multiplier may be several clock cycles; owing to the open-loop design this does not affect the computational throughput of the multiplication matrix, which can reach line-rate computational throughput.
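As a concrete throughput figure (the clock frequency is an assumption for illustration, not specified in the text): with PE = SIMD = 4 the matrix contains 16 floating-point multipliers and delivers 16 products per clock cycle; at, say, a 200 MHz FPGA clock this amounts to 3.2 × 10^9 floating-point multiplications per second, sustained every cycle because no multiplier output is fed back.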
The structure of the convolution addition tree unit is shown in FIG. 2: it comprises PE floating-point addition trees, and its input data are the PE × SIMD multiplication results; the SIMD multiplication results corresponding to each input feature map operand form one group, and each of the PE groups is fed into one floating-point addition tree. Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group, the latency of each floating-point adder being more than one clock cycle; the convolution addition tree unit outputs the PE addition tree results as the input of the convolution forward addition chain unit. The latency of each floating-point adder may be several clock cycles; owing to the open-loop design this does not affect the computational throughput of the addition tree calculation matrix, which can reach line-rate computational throughput.
The structure of the convolution forward addition chain unit is shown in FIG. 3: it comprises PE addition chains, the input data of each addition chain being the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The output of the convolution forward addition chain unit is the output feature map data, composed of the PE floating-point values output by all addition chains; this output feature map data is the convolution multiply-accumulate result. Because the accumulation is designed as an open-loop addition chain, the multi-cycle latency of the floating-point accumulation unit no longer prevents the accumulated results from being processed at line rate.
In the embodiment of the invention, each addition chain accumulates input data spanning more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated; specifically:
Each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated.
For the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE.
The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
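As a worked instance of the chain length rule: to accumulate n = 8 clock cycles of addition tree results, each chain needs log2(8) = 3 adders; the first adder pairs the results of cycles (1, 2), (3, 4), (5, 6) and (7, 8), the second pairs the four resulting sums, and the third produces the single accumulated output.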
Another embodiment of the present invention provides a convolution multiply-accumulate hardware acceleration system for a convolutional neural network, whose structure is shown in FIG. 4. The system is an example of a convolution inference computation system based on FPGA (Field Programmable Gate Array) acceleration; it comprises a host part and an FPGA-accelerated convolution inference computation part, the host and the system exchanging data and control through a PCIe (Peripheral Component Interconnect Express) link.
The host and the FPGA convolution computing system are connected by a PCIe bus.
The FPGA convolution computation system consists of a memory and on-chip convolution inference computation logic.
The on-chip convolution inference computation logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a convolution multiply-accumulate calculation unit and an output data cache scheduling unit.
The input feature map data cache scheduling unit stores the input feature map data and feeds it into the convolution multiply-accumulate calculation unit.
The input parameter data cache scheduling unit stores the input parameter data and feeds it into the convolution multiply-accumulate calculation unit.
The convolution multiply-accumulate calculation unit adopts the convolution multiply-accumulate hardware acceleration device structure described above; it performs the convolution multiply-accumulate operation on the input parameter data with the input feature map data, and sends the output feature map data, i.e. the convolution multiply-accumulate result, to the output data cache scheduling unit.
The output data cache scheduling unit sends the convolution multiply-accumulate result to the memory for storage.
The host reads the convolution multiply-accumulate result from the memory through the PCIe bus.
The FPGA-accelerated convolution inference calculation flow is shown in FIG. 6: in the first step the system is initialised; in the second step the host writes the data to be computed into the memory; in the third step the input feature map data and weight data are scheduled from the memory into the floating-point convolution multiply-accumulate calculation unit; in the fourth step the floating-point convolution multiply-accumulate calculation is performed; in the fifth step the output feature data are scheduled out; in the sixth step the output data are written into the memory; and in the seventh step the host reads the convolution calculation result from the memory.
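A host-side sketch of this flow is given below. The driver API names (FpgaDevice-style methods) and the buffer addresses are hypothetical, introduced only to illustrate the seven steps; the text does not specify a host programming interface, and a real deployment would use the vendor's PCIe/DMA driver.

```python
# Host-side sketch mirroring the seven steps of FIG. 6 (hypothetical API).

def run_conv_layer(dev, feature_map, weights):
    dev.reset()                                   # step 1: system initialisation
    dev.write_mem(addr=0x1000, data=feature_map)  # step 2: host writes input
    dev.write_mem(addr=0x2000, data=weights)      #         feature map and weights
    dev.start_conv(src=0x1000, wgt=0x2000,        # steps 3-6: memory scheduling,
                   dst=0x3000)                    # floating-point MAC, write-back
    dev.wait_done()
    return dev.read_mem(addr=0x3000)              # step 7: host reads the result
```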
Another embodiment of the present invention provides a convolution multiply-accumulate hardware acceleration method for a convolutional neural network, whose flow is shown in FIG. 5 and which comprises the following steps:
Step one: acquire PE input feature map operands and SIMD input parameter operands, where PE and SIMD are both powers of 2; perform the floating-point multiplication of each input feature map operand with each input parameter operand using PE × SIMD floating-point multipliers, and output the PE × SIMD multiplication results as the input data of the convolution addition tree unit.
The latency of each floating-point multiplier is more than one clock cycle.
Step two: take the SIMD multiplication results corresponding to each input feature map operand as one group, dividing the PE × SIMD multiplication results into PE groups; feed each group into one floating-point addition tree, PE floating-point addition trees in total, and output the PE addition tree results as the input of the convolution forward addition chain unit.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder is more than one clock cycle.
Step three: employ PE addition chains, where the input data of each addition chain is the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The PE floating-point values output by all addition chains form the output feature map data, which is the convolution multiply-accumulate result.
Each addition chain accumulates input data spanning more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated; specifically:
Each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated.
For the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE.
The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
An application example of the above convolution multiply-accumulate hardware acceleration scheme is deep-learning-based convolutional neural network image classification: the convolution multiply-accumulate hardware acceleration device completes the computation of one layer of a multi-layer convolutional neural network; the network input is picture data or feature map data, the input parameters are the specific parameters of each layer of the convolutional neural network, which can be obtained by training an image classifier with deep learning, and the output is feature map data or the final image classification result.
Another example is a convolution-based image filtering algorithm: the inputs are image data and filter parameters, the output is the filtered image data, and the convolution multiply-accumulate hardware acceleration device completes the image convolution filtering operation for the given filter parameters.
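As an illustration of how such a filtering task maps onto the device's multiply-accumulate operation (the flattened data layout is an assumption; since SIMD must be a power of 2, the nine taps of a 3 × 3 kernel would be zero-padded to 16 lanes in practice):

```python
# Illustrative mapping of a 3x3 image filter onto the accelerator's
# elementary multiply-accumulate operation (assumed layout).

def filter_pixel(window3x3, kernel3x3):
    """One output pixel of a 3x3 filter as a 9-term multiply-accumulate,
    i.e. the operation the acceleration device performs per output."""
    taps = [v for row in kernel3x3 for v in row]      # 9 filter parameters
    pixels = [v for row in window3x3 for v in row]    # 9 window values
    return sum(p * t for p, t in zip(pixels, taps))

# A 3x3 mean (blur) filter applied to one window:
blur = [[1.0 / 9.0] * 3 for _ in range(3)]
window = [[10.0, 20.0, 30.0],
          [40.0, 50.0, 60.0],
          [70.0, 80.0, 90.0]]
print(filter_pixel(window, blur))   # approximately 50.0
```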
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A convolution multiply-accumulate hardware acceleration device for a convolutional neural network, characterised by comprising a convolution multiplication unit, a convolution addition tree unit and a convolution forward addition chain unit;
the convolution multiplication unit comprises PE × SIMD floating-point multipliers; the inputs of each floating-point multiplier comprise input feature map data and input parameter data, and the latency of each floating-point multiplier is more than one clock cycle; PE is the number of input feature map operands, SIMD is the number of input parameter operands, and PE and SIMD are both powers of 2; the convolution multiplication unit outputs the PE × SIMD multiplication results as the input data of the convolution addition tree unit;
the convolution addition tree unit comprises PE floating-point addition trees; its input data are the PE × SIMD multiplication results, the SIMD multiplication results corresponding to each input feature map operand forming one group, and each of the PE groups being fed into one floating-point addition tree;
each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group, the latency of each floating-point adder being more than one clock cycle; the convolution addition tree unit outputs the PE addition tree results as the input of the convolution forward addition chain unit;
the convolution forward addition chain unit comprises PE addition chains, the input data of each addition chain being the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated; the output of the convolution forward addition chain unit is the output feature map data, composed of the PE floating-point values output by all addition chains, and this output feature map data is the convolution multiply-accumulate result.
2. The device of claim 1, characterised in that each addition chain accumulates input data spanning more than one clock cycle and the length of the addition chain is determined by the number of clock cycles to be accumulated, specifically:
each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated;
for the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE;
the first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder;
the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder;
and so on;
until the last adder outputs the result of the current addition chain.
3. A convolution multiply-accumulate hardware acceleration system for a convolutional neural network, characterised by comprising a host and an FPGA convolution computation system connected through a PCIe bus;
the FPGA convolution computation system consists of a memory and on-chip convolution inference computation logic;
the on-chip convolution inference computation logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a convolution multiply-accumulate calculation unit and an output data cache scheduling unit;
the input feature map data cache scheduling unit stores the input feature map data and feeds it into the convolution multiply-accumulate calculation unit;
the input parameter data cache scheduling unit stores the input parameter data and feeds it into the convolution multiply-accumulate calculation unit;
the convolution multiply-accumulate calculation unit adopts the convolution multiply-accumulate hardware acceleration device structure of the convolutional neural network according to claim 1 or 2, performs the convolution multiply-accumulate operation on the input parameter data with the input feature map data, and sends the output feature map data, as the convolution multiply-accumulate result, to the output data cache scheduling unit;
the output data cache scheduling unit sends the convolution multiply-accumulate result to the memory for storage;
and the host reads the convolution multiply-accumulate result from the memory through the PCIe bus.
4. A convolution multiply-accumulate hardware acceleration method for a convolutional neural network, characterised by comprising the following steps:
step one: acquire PE input feature map operands and SIMD input parameter operands, where PE and SIMD are both powers of 2; perform the floating-point multiplication of each input feature map operand with each input parameter operand using PE × SIMD floating-point multipliers, and output the PE × SIMD multiplication results as the input data of the convolution addition tree unit;
the latency of each floating-point multiplier is more than one clock cycle;
step two: take the SIMD multiplication results corresponding to each input feature map operand as one group, dividing the PE × SIMD multiplication results into PE groups; feed each group into one floating-point addition tree, PE floating-point addition trees in total, and output the PE addition tree results as the input of the convolution forward addition chain unit;
each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results in its group, the latency of each floating-point adder being more than one clock cycle;
step three: employ PE addition chains, the input data of each addition chain being the corresponding addition tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated; the PE floating-point values output by all addition chains form the output feature map data, which is the convolution multiply-accumulate result.
5. The method of claim 4, characterised in that each addition chain accumulates input data spanning more than one clock cycle and the length of the addition chain is determined by the number of clock cycles to be accumulated, specifically:
each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated;
for the current addition chain, the corresponding input data is one of the addition tree results, namely the pe-th addition tree result, where pe ranges from 1 to PE;
the first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder;
the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder;
and so on;
until the last adder outputs the result of the current addition chain.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011587375.4A CN112734020B (en) | 2020-12-28 | 2020-12-28 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011587375.4A CN112734020B (en) | 2020-12-28 | 2020-12-28 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112734020A CN112734020A (en) | 2021-04-30 |
CN112734020B true CN112734020B (en) | 2022-03-25 |
Family
ID=75607380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011587375.4A Active CN112734020B (en) | 2020-12-28 | 2020-12-28 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112734020B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113298236B (en) * | 2021-06-18 | 2023-07-21 | 中国科学院计算技术研究所 | Low-precision neural network computing device and acceleration method based on data flow structure |
CN113568597B (en) * | 2021-07-15 | 2024-07-26 | 上海交通大学 | Convolution neural network-oriented DSP compact word multiplication method and system |
CN113361699B (en) * | 2021-07-16 | 2023-05-26 | 安谋科技(中国)有限公司 | Multiplication circuit, system on chip and electronic device |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102629189B * | 2012-03-15 | 2014-12-10 | 湖南大学 | Pipelined floating-point multiply-accumulate method based on FPGA |
US20180046903A1 (en) * | 2016-08-12 | 2018-02-15 | DeePhi Technology Co., Ltd. | Deep processing unit (dpu) for implementing an artificial neural network (ann) |
US20180053091A1 (en) * | 2016-08-17 | 2018-02-22 | Hawxeye, Inc. | System and method for model compression of neural networks for use in embedded platforms |
CN108133270B (en) * | 2018-01-12 | 2020-08-04 | 清华大学 | Convolutional neural network acceleration method and device |
CN109063825B (en) * | 2018-08-01 | 2020-12-29 | 清华大学 | Convolutional neural network accelerator |
CN109828744B (en) * | 2019-01-18 | 2020-09-08 | 东北师范大学 | Configurable floating point vector multiplication IP core based on FPGA |
WO2020215124A1 (en) * | 2019-04-26 | 2020-10-29 | The University Of Sydney | An improved hardware primitive for implementations of deep neural networks |
CN110852416B (en) * | 2019-09-30 | 2022-10-04 | 梁磊 | CNN hardware acceleration computing method and system based on low-precision floating point data representation form |
Also Published As
Publication number | Publication date |
---|---|
CN112734020A (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862374B (en) | Neural network processing system and processing method based on assembly line | |
CN107609641B (en) | Sparse neural network architecture and implementation method thereof | |
CN111459877B (en) | Winograd YOLOv2 target detection model method based on FPGA acceleration | |
CN112734020B (en) | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network | |
JP6857286B2 (en) | Improved performance of neural network arrays | |
CN108280514B (en) | FPGA-based sparse neural network acceleration system and design method | |
CN108108809B (en) | Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof | |
CN107239829B (en) | Method for optimizing artificial neural network | |
CN110321997B (en) | High-parallelism computing platform, system and computing implementation method | |
CN112840356A (en) | Operation accelerator, processing method and related equipment | |
CN105681628A (en) | Convolution network arithmetic unit, reconfigurable convolution neural network processor and image de-noising method of reconfigurable convolution neural network processor | |
CN113344179B (en) | IP core of binary convolution neural network algorithm based on FPGA | |
CN110543936B (en) | Multi-parallel acceleration method for CNN full-connection layer operation | |
WO2023065983A1 (en) | Computing apparatus, neural network processing device, chip, and data processing method | |
Abdelsalam et al. | An efficient FPGA-based overlay inference architecture for fully connected DNNs | |
CN110580519B (en) | Convolution operation device and method thereof | |
CN113283587A (en) | Winograd convolution operation acceleration method and acceleration module | |
CN113516236A (en) | VGG16 network parallel acceleration processing method based on ZYNQ platform | |
CN114626516A (en) | Neural network acceleration system based on floating point quantization of logarithmic block | |
CN111008691B (en) | Convolutional neural network accelerator architecture with weight and activation value both binarized | |
CN109948787B (en) | Arithmetic device, chip and method for neural network convolution layer | |
CN112149047A (en) | Data processing method and device, storage medium and electronic device | |
CN112836793B (en) | Floating point separable convolution calculation accelerating device, system and image processing method | |
CN112988229B (en) | Convolutional neural network resource optimization configuration method based on heterogeneous computation | |
CN116167425B (en) | Neural network acceleration method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | |
Effective date of registration: 2022-06-27
Address after: 100083 No. 211 Middle Fourth Ring Road, Haidian District, Beijing
Patentee after: NO.15 INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.; CLP Taiji (Group) Co., Ltd
Address before: 100083 No. 211 Middle Fourth Ring Road, Haidian District, Beijing
Patentee before: NO.15 INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.