
CN113516236A - VGG16 network parallel acceleration processing method based on ZYNQ platform - Google Patents

VGG16 network parallel acceleration processing method based on ZYNQ platform

Info

Publication number
CN113516236A
CN113516236A (application CN202110807193.1A)
Authority
CN
China
Prior art keywords
data
convolution
module
input
zynq
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110807193.1A
Other languages
Chinese (zh)
Inventor
王树龙
赵蓉
杜林
郑俊伟
孙承坤
孙彪
马兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110807193.1A priority Critical patent/CN113516236A/en
Publication of CN113516236A publication Critical patent/CN113516236A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0803Configuration setting
    • H04L41/0823Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • H04L41/083Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability for increasing network speed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/50Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and FPGA (field programmable gate array) design, and particularly discloses a VGG16 network parallel acceleration processing method based on a ZYNQ platform. Fixed-point quantization of the input and output data, weight data and bias reduces the huge amount of calculation they cause, thereby lowering power consumption, reducing computation and easing the limitation of on-chip resources. A resource-parallel scheme avoids, on the one hand, the parallel-computing problem posed by the data dependence between layers of the convolutional neural network and, on the other hand, reduces the demand on hardware circuit resources. Through the ZYNQ platform, the invention designs a hardware acceleration architecture that realizes high parallelism for the VGG16 network, so that acceleration performance and acceleration efficiency are improved on the basis of small resource consumption and low power consumption.

Description

VGG16 network parallel acceleration processing method based on ZYNQ platform
Technical Field
The invention relates to the technical field of artificial intelligence and FPGA (field programmable gate array) design, in particular to a radar imaging technology, and specifically relates to a VGG16 network parallel acceleration processing method based on a ZYNQ platform.
Background
With the rapid development of artificial intelligence in recent years, machine learning algorithms have become a major research focus. Convolutional neural network algorithms have important application value and notable research significance in fields such as image recognition and classification, speech analysis and retrieval, and target detection and monitoring. Among these networks, VGGNet has a very compact structure: the whole network uses the same convolution kernel size (3x3) and the same maximum pooling size (2x2) throughout. While keeping the receptive field unchanged, this increases the depth of the network and improves the effect of the neural network to a certain extent. Convolutional neural networks were originally implemented in software, and as their research developed, researchers gradually began to implement them in hardware.
The ZYNQ platform is an SOC product which supports software and hardware collaborative design and is promoted by Xilinx company. The development based on the ZYNQ platform can not only benefit from the abundant ARM resources, but also benefit from the expandability and flexibility of the FPGA.
In recent years, with the greatly improved computing power of computers, research on convolutional neural networks has advanced rapidly, and their complexity, computation and data volume have grown accordingly, so that the design of convolutional neural network accelerators is constrained by factors such as hardware computing power, hardware resource utilization and energy consumption.
Disclosure of Invention
Aiming at the problems in the prior art, the invention aims to provide a VGG16 network parallel acceleration processing method based on a ZYNQ platform, which improves the data processing speed and performance of a VGG16 network and improves the acceleration performance and the acceleration efficiency on the basis of realizing smaller resource consumption and lower power consumption.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme.
The VGG16 network parallel acceleration processing method based on the ZYNQ platform comprises the following steps:
step 1, a PS end of ZYNQ obtains characteristic diagram data and weight data of a data set to be processed through an upper computer, stores the characteristic diagram data and the weight data into a DDR storage module, then drives an AXI DMA module, and loads the characteristic data and the weight data into a cache module;
step 2, ARM controls the characteristic data and the weight data of the cache module to be transmitted to a convolution module at a PL end of ZYNQ for parallel convolution operation, and an operation result is stored in the cache module;
step 3, the data are quantized and pooled in a post-processing module, and after one layer of operation is completed, the PS end of the ZYNQ reconfigures register information required by the next layer of convolution module for calculation;
and 4, repeatedly executing the step 1 to the step 3 until all layers of operation are completed, then transmitting the final result to the DDR for storage through AXI_DMA, and transmitting the result to the upper computer through a serial port by the PS end of the ZYNQ.
Further, the convolution module performs parallel convolution operation, which specifically includes:
performing parallel operation of convolution of multiple input channels and multiple output channels by adopting a high-parallelism PE array;
and performing data slice optimization processing on the height of the input feature map and the direction of an output channel, and reducing the total memory access requirement by multiplexing the input feature data and the weight data.
Further, the performing parallel operation of convolution of multiple input channels and multiple output channels by using the high-parallelism PE array specifically includes:
performing parallel convolution of multiple input channels on one dimension, namely performing multiply-accumulate operation on multiple input channels of one characteristic point of an input characteristic diagram and corresponding channel data of one convolution kernel corresponding to the input characteristic point; and the convolution of the multiple output channels is performed in parallel in the other dimension, namely, the multiplication and accumulation operation of all the input feature maps and a plurality of convolution kernels.
Further, the data slice optimization processing is performed on the height of the input feature map and the direction of the output channel, specifically: for a plurality of layers of structures at the input end of the VGG16 network, a scheduling scheme of outputting channel direction slices first and then inputting height direction slices of a feature map is adopted for slice optimization, and the total access demand is reduced by maximizing input feature data multiplexing;
for a plurality of layers of structures of the VGG16 network close to the output end, a scheduling scheme that slicing is firstly carried out in the height direction of an input feature map and then slicing is carried out in the direction of an output channel is adopted for carrying out slicing optimization, so that the total access demand is reduced through weight data multiplexing.
Further, the slicing optimization is performed by using a scheduling scheme of outputting the channel direction slices first and then inputting the height direction slices of the feature map, specifically: calculating the convolution of all output channel directions and a first slice along the height direction, calculating the convolution of all output channel directions and a second slice along the height direction, and so on, calculating the convolution of all output channel directions and a last slice along the height direction, and finally splicing the convolution results of different slices along the height direction in sequence.
Further, the slicing optimization is performed by using a scheduling scheme of slicing in the height direction of the input feature map and then slicing in the direction of the output channel, specifically: and calculating the convolution of all the slices in the height direction and the first output channel, switching the slice of the second output channel, continuously completing the convolution operation with all the slices in the height direction, continuously switching the slices of the output channels, and repeating the operation in the same way to complete the convolution operation of all the slices of the output channels.
Further, the quantization of the data is specifically: INT8 quantization is adopted, and the quantization process is as follows:
q = round(x / scale) + zero_point
where x represents the original FP32 value; scale denotes the FP32 scaling factor; zero_point represents the offset of the value; round denotes a rounding function (round-to-nearest, or alternatively round-up or round-down); q represents the quantized INT8 value.
Further, the pooling of the data is: decomposing the two-dimensional pooling operation into a one-dimensional operation with two dimensions of horizontal and vertical;
firstly, performing transverse one-dimensional pooling operation, taking out a line of data to be pooled according to a width-height-channel mode, performing one-dimensional pooling operation according to pooling parameters to obtain a first line of pooling result, and performing local caching on the first line of pooling result according to a width-height-channel sorting mode; the one-dimensional pooling of all the row data is completed by analogy, and all the transverse one-dimensional pooling results form a temporary matrix in sequence;
and then, performing longitudinal one-dimensional pooling operation, inputting the temporary matrix which is the operation result of the transverse pooling operation, and performing one-dimensional pooling operation in the height direction on the temporary matrix to obtain an output characteristic diagram.
(II) VGG16 network parallel acceleration device based on ZYNQ platform includes: the system comprises an FPGA, an ARM and an AXI bus, wherein the FPGA is used for realizing hardware acceleration of a convolutional neural network and comprises a convolution module, a post-processing module, a full-connection module and a Softmax module; the ARM is used for preloading input characteristic data, bias and weight and configuring CNN register information, and comprises an ARM processor, a data input port, a classification result output port, a DDR storage module and a parameter configuration table, wherein the data input port, the classification result output port, the DDR storage module and the parameter configuration table are controlled by the ARM processor; the AXI bus is used to communicate with peripheral modules, including AXI4 and AXI_Lite.
Furthermore, the convolution module is a plurality of PE arrays, which perform convolution operation on the feature data and the weights taken out of the cache module and output the result through an activation function;
the post-processing module is a nonlinear processing unit and is used for data quantization and pooling operation, storing the result into the cache module, completing a layer of convolution operation and writing back to the DDR storage module;
the full-connection module is positioned near the back of the network; it first caches all the feature data in the RAM, and then reads the weight data into the RAM in stages to perform the full-connection operation with the feature data;
the Softmax module is used to convert the operation result into probabilities, give the probability and the label, and transmit the probability result back to the ARM through the DDR storage module.
Compared with the prior art, the invention has the beneficial effects that:
(1) the INT8 data quantization scheme adopted by the invention reduces the size of a network model and the requirement of memory bandwidth, and realizes processing acceleration.
(2) The invention optimizes the VGG16 network structure in terms of on-chip cache optimization, multi-input and multi-output channel parallelism, an optimized slicing scheme, pooling module optimization and the like, which improves the calculation speed, reduces the cost of moving data, reduces the number of power-hungry memory accesses and improves the accelerated calculation performance of the convolutional neural network.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
FIG. 1 is a network architecture diagram of the neural network algorithm VGG16 employed by the present invention;
FIG. 2 is a block diagram of the overall architecture of the design of the present invention;
FIG. 3 is a PE array employed by the convolution module convolution calculation unit of the present invention;
FIG. 4 is a diagram of a single PE unit structure employed by the convolution module convolution calculation unit of the present invention;
FIG. 5 is two data scheduling orders for a data slice optimization scheme employed by the present invention;
fig. 6 is a flow chart of hardware quantization employed by the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.
Referring to fig. 1, the network structure of the neural network algorithm VGG16 accelerated by the present invention consists of 13 convolutional layers, 5 downsampling (pooling) layers and 3 fully connected layers. The original input image is 224 x 224 three-channel data. The entire network builds its convolutional layers with 3 x 3 convolution kernels of stride 1, and the result of each convolutional layer is passed through a ReLU activation function for nonlinear processing. After each convolutional layer and before the activation function, a BN layer adjusts the feature map passed between layers. After every 2 or 3 convolutional layers, a 2 x 2 maximum pooling layer with stride 2 reduces the network size; after pooling, the width and height of the feature map are halved and the number of channels is unchanged. The last three layers are fully connected layers, and the output is 1000 classification categories with their probabilities. The computation of the network is concentrated in the convolutional layers and fully connected layers, so the core of the hardware acceleration design for the convolutional neural network is to accelerate the large-scale multiply-add operations.
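As a software reference only (not part of the patent), the following Python sketch walks through the standard VGG16 configuration described above and shows how the feature-map size shrinks and where the multiply-add workload concentrates:

```python
# Minimal sketch of the VGG16 topology described above: 13 conv layers (3x3, stride 1,
# padding 1), 5 max-pooling layers (2x2, stride 2) and 3 fully connected layers.
# 'M' marks a pooling layer; numbers are output-channel counts.
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

def summarize(cfg, h=224, w=224, c_in=3):
    total_macs = 0
    for layer in cfg:
        if layer == 'M':                      # 2x2 max pooling, stride 2: halves H and W
            h, w = h // 2, w // 2
        else:                                 # 3x3 convolution, stride 1, padding 1
            total_macs += h * w * layer * c_in * 3 * 3
            c_in = layer
    # Three fully connected layers: 512*7*7 -> 4096 -> 4096 -> 1000
    for fin, fout in [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]:
        total_macs += fin * fout
    return h, w, c_in, total_macs

h, w, c, macs = summarize(cfg)
print(f"final feature map: {c}x{h}x{w}, total multiply-accumulates: {macs:,}")
```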
Example 1
The invention provides a VGG16 network parallel acceleration processing method based on a ZYNQ platform, which comprises the following steps:
step 1, a PS end of ZYNQ obtains characteristic diagram data and weight data of a data set to be processed through an upper computer, stores the characteristic diagram data and the weight data into a DDR storage module, then drives an AXI _ DMA module, and loads the characteristic data and the weight data into a cache module;
step 2, ARM controls the characteristic data and the weight data of the cache module to be transmitted to a convolution module at a PL end of ZYNQ for parallel convolution operation, and an operation result is stored in the cache module;
the convolution module adopts a high parallel PE array and data slice optimization to improve the performance in the aspects of throughput, bandwidth and delay.
The convolution calculation unit adopts a rectangular multiply-add array to realize a 32 x 32 parallel design, i.e., 32 input channels and 32 output channels in parallel. In one dimension, the 32 input channels of one feature point of the input feature map are multiplied and accumulated with the corresponding 32 channels of one convolution kernel; in the other dimension, the input feature map is multiplied and accumulated with 32 convolution kernels.
Referring to fig. 3, the PE array of the convolution calculation unit of the present invention has 32 PE units to perform calculations with 32-way output parallelism and 32-way input parallelism. The weight data and the feature map data are taken from the on-chip caches weight_input_buffer and image_input_buffer respectively. There are 32 weight_input_buffer caches for the weight data, each caching the data of a different convolution kernel; reading the 32 buffers simultaneously yields 32 groups of weight data, each from a different convolution kernel. The feature data are read in a pipelined manner: each time the feature data pass through one PE unit, one stage of register (flip-flop) delay is added, so the feature data are multiplied by the weight data of different convolution kernels in a pipeline, realizing 32-way output parallelism and computing the results of 32 output feature maps at the same time.
Referring to fig. 4, a single PE unit of the convolution calculation unit of the present invention is composed of 32 DSPs and completes 32 multiplications of 16 bits x 16 bits per clock cycle. The bit width of the data stored in weight_input_buffer and image_input_buffer is 32 x 16 bits; the fetched data are split by bit width, every 16 bits being the data of one channel, and sent to the 32 DSPs to complete the multiplications, thereby realizing the parallel calculation of 32 input channels.
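For illustration only, the following numpy sketch mimics the unit of work of the 32 x 32 PE array: for one feature point, 32 input channels are multiplied and accumulated with the corresponding channels of 32 convolution kernels, producing 32 partial sums in parallel. The array shapes and variable names are assumptions for the sketch, not the RTL design.

```python
import numpy as np

TI, TO = 32, 32                        # input-channel and output-channel parallelism

# One feature point of the input feature map: 32 input channels
# (random stand-ins for the 16-bit feature words).
feature_point = np.random.randint(-128, 128, size=TI).astype(np.int32)

# One weight word per convolution kernel: 32 kernels x 32 input channels.
weights = np.random.randint(-128, 128, size=(TO, TI)).astype(np.int32)

# Each of the 32 PE units holds 32 multipliers (DSPs): PE k multiplies the feature
# point by kernel k channel-wise and accumulates, so 32*32 = 1024 MACs per step.
partial_sums = weights @ feature_point          # shape (32,), one partial sum per output channel

# Over a full 3x3 window the partial sums would be accumulated further; this single
# step is the unit of work the PE array performs in parallel.
print(partial_sums.shape, partial_sums[:4])
```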
Referring to fig. 5, the data slice optimization scheme employed by the present invention has two scheduling orders: N→M and M→N. The M→N scheduling scheme first calculates the convolution of all slices in the height direction with the first output channel slice, then switches to the second output channel slice and completes the convolution with all slices in the height direction, and keeps switching output channel slices in this way until the convolutions of all output channel slices are completed. The N→M scheduling scheme first calculates the convolutions of all output channel directions with one slice in the height direction, then moves to the next height slice, and finally splices the convolution results of the different height slices in order. From the perspective of the output feature maps, the former finishes one group of output feature maps before computing the others, while the latter finishes a 1/M portion of all output feature map planes and then computes the remaining positions of the planes in turn. With the M→N scheme, the input feature data must be reloaded N times while the weight data need not be reloaded; with the N→M scheme, the weight data must be reloaded M times, and the input feature data also have to be partially reloaded because of overlapping data. To maximize data reuse, the two slicing schemes are used together: for the first layers of the network, where the amount of feature data is large, the N→M scheduling method is adopted to reduce the total memory access requirement by maximizing input feature data reuse; for the later layers of the network, where the amount of weight data is particularly large, the M→N scheduling method is adopted to reduce the total memory access requirement through weight data reuse.
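The two scheduling orders can be summarized by the loop nests below. This is a schematic Python sketch under assumed tile counts (M height slices, N output-channel slices) intended only to show which buffer is reloaded in each scheme; it is not the controller logic of the design.

```python
# Schematic comparison of the two tile-scheduling orders (M height slices, N
# output-channel slices). The counters stand for DDR -> on-chip tile transfers.

def schedule_m_then_n(M, N):
    """M->N: keep one output-channel slice's weights on chip, sweep all height slices.
    Weights are loaded once per output-channel slice; features are reloaded N times."""
    loads = {"weight": 0, "feature": 0}
    for n in range(N):                 # output-channel slices
        loads["weight"] += 1           # weights of slice n loaded once
        for m in range(M):             # all height slices of the input
            loads["feature"] += 1      # features reloaded for every n
    return loads

def schedule_n_then_m(M, N):
    """N->M: keep one height slice of features on chip, sweep all output channels.
    Features are loaded once per height slice (plus halo overlap, ignored here);
    weights are reloaded M times."""
    loads = {"weight": 0, "feature": 0}
    for m in range(M):                 # height slices
        loads["feature"] += 1
        for n in range(N):
            loads["weight"] += 1       # weights reloaded for every m
    return loads

# Early layers (large feature maps): N->M favours feature reuse.
# Late layers (large weights):       M->N favours weight reuse.
print(schedule_m_then_n(M=8, N=16), schedule_n_then_m(M=8, N=16))
```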
Step 3, the data are quantized and pooled in a post-processing module, and after one layer of operation is completed, the PS end of the ZYNQ reconfigures register information required by the next layer of convolution module for calculation;
Specifically, the post-processing module is mainly responsible for the data quantization and pooling operations, where the data are quantized to INT8. INT8 quantization generally employs linear quantization, and the common linear quantization process is:
q = round(x / scale) + zero_point
where x represents the original FP32 value; scale denotes the FP32 scaling factor; zero_point represents the offset of the value; round represents a rounding function that maps the value to a nearby integer (besides round-to-nearest, round-up or round-down may also be used); q represents the quantized INT8 value.
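For reference, the linear quantization step can be written out as the short Python sketch below; it follows the formula above with round-to-nearest and is only an illustration of the arithmetic, not the code used in the design.

```python
import numpy as np

def quantize_int8(x, scale, zero_point=0):
    """Linear quantization: q = round(x / scale) + zero_point, clipped to the INT8 range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8(q, scale, zero_point=0):
    """Inverse mapping used to interpret the quantized values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.3, 0.0, 0.42, 2.7], dtype=np.float32)
scale = np.abs(x).max() / 127          # one simple way to pick the FP32 scaling factor
q = quantize_int8(x, scale)
print(q, dequantize_int8(q, scale))
```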
Quantization is applied not only to the weight data and the output results but also to the intermediate results. In the hardware acceleration design, since only forward propagation is performed and no backward inference, the scaling factors (scales) and the weights obtained by software training are used directly, and some special processing has to be done in combination with the network. Network training yields the scaling factors and the quantized weights, where the weight data are the result of fusing the BN layer; when the FPGA is used for hardware acceleration, the quantization method must stay consistent with the software side so that the accuracy of the result can be guaranteed.
The pooling module decomposes the two-dimensional pooling operation into one-dimensional operations in two dimensions, horizontal and vertical, to reduce the amount of computation and the on-chip storage requirement. First, the horizontal pooling operation is performed: a row of data is fetched in W-H-C (width-height-channel) order and one-dimensional pooling is carried out according to the pooling parameters to obtain the first row of pooling results, which are temporarily stored in a local cache in W-H-C order. The results of all the horizontal pooling operations form a temporary matrix whose size is determined by the pooling parameters; for 2 x 2 pooling with stride 2, the width of the temporary matrix is half that of the input feature map and its height is H. Because the data are sent to the pooling module in W-H-C order, no temporary storage in the channel direction is involved, so the horizontal pooling module does not need to buffer much data. Then the vertical pooling operation is performed; its input is the temporary matrix, i.e., the result of the horizontal pooling. After the temporary matrix is pooled in the height direction, the output feature map is obtained.
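The decomposition of two-dimensional max pooling into a horizontal pass followed by a vertical pass can be sketched in numpy as follows (2 x 2 pooling with stride 2); this is a software illustration of the data flow described above, not the hardware module.

```python
import numpy as np

def pool1d(row, k=2, s=2):
    """One-dimensional max pooling of a single row (kernel k, stride s)."""
    return np.array([row[i:i + k].max() for i in range(0, len(row) - k + 1, s)])

def pool2d_decomposed(fmap, k=2, s=2):
    """2-D max pooling done as a horizontal 1-D pass, then a vertical 1-D pass."""
    # Horizontal pass: pool every row; the results form the temporary matrix (H x W/2).
    temp = np.stack([pool1d(row, k, s) for row in fmap])          # shape (H, W//s)
    # Vertical pass: pool every column of the temporary matrix.
    out = np.stack([pool1d(temp[:, j], k, s) for j in range(temp.shape[1])], axis=1)
    return out                                                     # shape (H//s, W//s)

fmap = np.random.randint(0, 255, size=(8, 8))
# The decomposed result matches direct 2x2/stride-2 max pooling.
assert np.array_equal(pool2d_decomposed(fmap),
                      fmap.reshape(4, 2, 4, 2).max(axis=(1, 3)))
```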
Referring to fig. 6, the hardware quantization process adopted by the present invention is as follows: the quantized weight data obtained by training on the software side are convolved with the image data; the convolution result is multiplied by the scaling factor scale, which widens the bit width of the result; the result is then limited to the INT8 range with a clamp function, and after the bias is accumulated, the bit width is limited again; finally, the result is output through the ReLU activation function.
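The sketch below mirrors this flow in plain numpy: quantized weights are multiplied and accumulated with quantized features, the accumulator is rescaled, clamped to the INT8 range, the bias is added, the range is clamped again, and ReLU is applied. The variable names and the particular rescaling value are illustrative assumptions, not the exact fixed-point arithmetic of the design.

```python
import numpy as np

def clamp_int8(x):
    """Clamp to the INT8 range, as the clamp function in fig. 6 does."""
    return np.clip(x, -128, 127)

def conv_requantize(acc_int32, scale, bias_int8):
    """Post-convolution processing: rescale, clamp, add bias, clamp again, ReLU."""
    y = np.round(acc_int32 * scale)          # multiply the accumulator by the scaling factor
    y = clamp_int8(y)                        # first clamp to INT8
    y = clamp_int8(y + bias_int8)            # accumulate the bias, limit the widened result
    return np.maximum(y, 0).astype(np.int8)  # ReLU activation

# Toy example: one output point from a 3x3x32 window.
w = np.random.randint(-128, 128, size=(3, 3, 32)).astype(np.int32)   # quantized weights
x = np.random.randint(-128, 128, size=(3, 3, 32)).astype(np.int32)   # quantized features
acc = int((w * x).sum())                                             # 32-bit accumulator
print(conv_requantize(acc, scale=2 ** -10, bias_int8=3))
```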
And 4, repeatedly executing the step 1 to the step 3 until all layers of operation are completed, then transmitting the final result to the DDR for storage through AXI_DMA, and transmitting the result to the upper computer through a serial port by the PS end of the ZYNQ.
Example 2
Referring to fig. 2, the present invention further provides a VGG16 network parallel acceleration apparatus based on the ZYNQ platform, including: the system comprises an FPGA, an ARM and an AXI bus, wherein the FPGA is used for realizing hardware acceleration of a convolutional neural network and comprises a convolution module, a post-processing module, a full-connection module and a Softmax module; the ARM is used for preloading input characteristic data, bias and weight and configuring CNN register information, and comprises an ARM processor, a data input port, a classification result output port, a DDR storage module and a parameter configuration table, wherein the data input port, the classification result output port, the DDR storage module and the parameter configuration table are controlled by the ARM processor; the AXI bus is used to communicate with peripheral modules, including AXI4 and AXI_Lite.
Furthermore, 32 PE arrays are provided in the convolution module; convolution is performed on the feature data and the weights taken out of the input buffer, and the result is output through an activation function.
The post-processing module is a nonlinear processing unit mainly responsible for the data quantization and pooling operations; it stores the result into an output buffer inside the module and, once a layer of convolution operation is completed, writes it back to the DDR.
The full-connection module carries out the operation of the last three full-connection layers. Because the full-connection layers are located near the end of the network, the feature data are few and the weight data are many; in the design, all the feature data are first cached in the on-chip RAM of the module, and the weight data are then read into the on-chip RAM in stages to be operated on with the feature data.
The Softmax module completes the probability calculation, gives the probabilities and the label, and finally transmits the probabilities back to the ARM through the DDR memory.
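For completeness, the probability computation performed by the Softmax module corresponds to the standard softmax function, sketched below in numpy as a software reference (not the hardware implementation):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: turns the 1000 class scores into probabilities."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.random.randn(1000).astype(np.float32)
probs = softmax(logits)
label = int(probs.argmax())            # predicted class label
print(label, float(probs[label]))
```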
The DDR storage module is used to store the input image feature data, bias and weights, and to temporarily store the intermediate results after each layer of convolution operation is finished.
The CNN register module mainly completes the setting of the convolution registers, such as the index of the currently running layer, the sizes of the feature map and the convolution kernel, the number of image tiles, padding control, the start addresses of the read data and weights, the number of weight transfers and the like.
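As an illustration of what such a per-layer configuration might contain, the following Python dataclass groups the registers listed above; all field names and example values are hypothetical placeholders, not the actual register map of the design.

```python
from dataclasses import dataclass

@dataclass
class LayerConfig:
    """Hypothetical per-layer register settings written by the ARM before each layer."""
    layer_index: int          # index of the currently running layer
    fmap_height: int          # input feature-map height
    fmap_width: int           # input feature-map width
    in_channels: int
    out_channels: int
    kernel_size: int          # convolution kernel size (3 for VGG16)
    padding: int              # padding control
    num_tiles: int            # image blocking (slicing) count
    feature_base_addr: int    # DDR start address of the input feature data
    weight_base_addr: int     # DDR start address of the weight data
    weight_bursts: int        # number of weight transfers

# Example: configuration for the first VGG16 convolution layer (values illustrative).
conv1_1 = LayerConfig(layer_index=0, fmap_height=224, fmap_width=224,
                      in_channels=3, out_channels=64, kernel_size=3, padding=1,
                      num_tiles=8, feature_base_addr=0x1000_0000,
                      weight_base_addr=0x2000_0000, weight_bursts=4)
```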
The AXI4 is responsible for data transmission, and the AXI_Lite bus is responsible for signal transmission.
According to the invention, fixed-point quantization of the data reduces the huge amount of calculation caused by the input and output data, weight data and bias, thereby lowering power consumption, reducing computation and easing the limitation of on-chip resources; by adopting a resource-parallel mode, the parallel-computing problem caused by the data dependence between layers of the convolutional neural network is avoided on the one hand, and the requirement on hardware circuit resources is reduced on the other hand.
Although the present invention has been described in detail in this specification with reference to specific embodiments and illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the present invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (10)

1. The VGG16 network parallel acceleration processing method based on the ZYNQ platform is characterized by comprising the following steps:
step 1, a PS end of ZYNQ obtains characteristic diagram data and weight data of a data set to be processed through an upper computer, stores the characteristic diagram data and the weight data into a DDR storage module, then drives an AXI_DMA module, and loads the characteristic data and the weight data into a cache module;
step 2, ARM controls the characteristic data and the weight data of the cache module to be transmitted to a convolution module at a PL end of ZYNQ for parallel convolution operation, and an operation result is stored in the cache module;
step 3, the data are quantized and pooled in a post-processing module, and after one layer of operation is completed, the PS end of the ZYNQ reconfigures register information required by the next layer of convolution module for calculation;
and 4, repeatedly executing the step 1 to the step 3 until all layers of operation are completed, then transmitting the final result to the DDR for storage through AXI_DMA, and transmitting the result to the upper computer through a serial port by the PS end of the ZYNQ.
2. The VGG16 network parallel acceleration processing method based on the ZYNQ platform of claim 1, wherein the convolution module performs parallel convolution operation, specifically comprising:
performing parallel operation of convolution of multiple input channels and multiple output channels by adopting a high-parallelism PE array;
and performing data slice optimization processing on the height of the input feature map and the direction of an output channel, and reducing the total memory access requirement by multiplexing the input feature data and the weight data.
3. The VGG16 network parallel acceleration processing method based on the ZYNQ platform of claim 2, wherein the parallel operation of the convolution of multiple input channels and multiple output channels is performed by using a high-parallelism PE array, specifically:
performing parallel convolution of multiple input channels on one dimension, namely performing multiply-accumulate operation on multiple input channels of one characteristic point of an input characteristic diagram and corresponding channel data of one convolution kernel corresponding to the input characteristic point; and the convolution of the multiple output channels is performed in parallel in the other dimension, namely, the multiplication and accumulation operation of all the input feature maps and a plurality of convolution kernels.
4. The VGG16 network parallel acceleration processing method based on the ZYNQ platform of claim 2, wherein the data slice optimization processing is performed on the height of the input feature map and the direction of the output channel, specifically: for a plurality of layers of structures at the input end of the VGG16 network, a scheduling scheme of outputting channel direction slices first and then inputting height direction slices of a feature map is adopted for slice optimization, and the total access demand is reduced by maximizing input feature data multiplexing;
for a plurality of layers of structures of the VGG16 network close to the output end, a scheduling scheme that slicing is firstly carried out in the height direction of an input feature map and then slicing is carried out in the direction of an output channel is adopted for carrying out slicing optimization, so that the total access demand is reduced through weight data multiplexing.
5. The ZYNQ platform-based VGG16 network parallel acceleration processing method of claim 4, wherein the slicing optimization is performed by using a scheduling scheme of first outputting channel direction slices and then inputting height direction slices of the feature map, specifically: calculating the convolution of all output channel directions and a first slice along the height direction, calculating the convolution of all output channel directions and a second slice along the height direction, and so on, calculating the convolution of all output channel directions and a last slice along the height direction, and finally splicing the convolution results of different slices along the height direction in sequence.
6. The VGG16 network parallel acceleration processing method based on the ZYNQ platform of claim 4, wherein the slicing optimization is performed by using a scheduling scheme of slicing in the height direction of an input feature map and then slicing in the direction of an output channel, specifically: and calculating the convolution of all the slices in the height direction and the first output channel, switching the slice of the second output channel, continuously completing the convolution operation with all the slices in the height direction, continuously switching the slices of the output channels, and repeating the operation in the same way to complete the convolution operation of all the slices of the output channels.
7. The ZYNQ platform-based VGG16 network parallel acceleration processing method of claim 1, wherein the data quantization specifically is as follows: INT8 quantization is adopted, and the quantization process is as follows:
q = round(x / scale) + zero_point
where x represents the original FP32 value; scale denotes the FP32 scaling factor; zero_point represents the offset of the value; round denotes a rounding function (round-to-nearest, or alternatively round-up or round-down); q represents the quantized INT8 value.
8. The VGG16 network parallel acceleration processing method based on the ZYNQ platform, according to claim 1, wherein the pooling of the data is: decomposing the two-dimensional pooling operation into a one-dimensional operation with two dimensions of horizontal and vertical;
firstly, performing transverse one-dimensional pooling operation, taking out a line of data to be pooled according to a width-height-channel mode, performing one-dimensional pooling operation according to pooling parameters to obtain a first line of pooling result, and performing local caching on the first line of pooling result according to a width-height-channel sorting mode; the one-dimensional pooling of all the row data is completed by analogy, and all the transverse one-dimensional pooling results form a temporary matrix in sequence;
and then, performing longitudinal one-dimensional pooling operation, inputting the temporary matrix which is the operation result of the transverse pooling operation, and performing one-dimensional pooling operation in the height direction on the temporary matrix to obtain an output characteristic diagram.
9. A VGG16 network parallel accelerator based on a ZYNQ platform is characterized by comprising: the system comprises an FPGA, an ARM and an AXI bus, wherein the FPGA is used for realizing hardware acceleration of a convolutional neural network and comprises a convolution module, a post-processing module, a full-connection module and a Softmax module; the ARM is used for preloading input characteristic data, bias and weight and configuring CNN register information, and comprises an ARM processor, a data input port, a classification result output port, a DDR storage module and a parameter configuration table, wherein the data input port, the classification result output port, the DDR storage module and the parameter configuration table are controlled by the ARM processor; the AXI bus is used to communicate with peripheral modules, including AXI4 and AXI_Lite.
10. The VGG16 network parallel acceleration device based on the ZYNQ platform, as claimed in claim 9, wherein the convolution module is a plurality of PE arrays, and performs convolution operation on the feature data and the weight extracted from the buffer module, and outputs the result through an activation function;
the post-processing module is a nonlinear processing unit and is used for data quantization and pooling operation, storing the result into the cache module, completing a layer of convolution operation and writing back to the DDR storage module;
the full-connection module is positioned near the back of the network; all the feature data are first cached in the RAM, and then the weight data are read into the RAM in stages to be subjected to the full-connection operation with the feature data;
the Softmax module is used to convert the operation result into probabilities, give the probability and the label, and transmit the probability result back to the ARM through the DDR storage module.
CN202110807193.1A 2021-07-16 2021-07-16 VGG16 network parallel acceleration processing method based on ZYNQ platform Pending CN113516236A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110807193.1A CN113516236A (en) 2021-07-16 2021-07-16 VGG16 network parallel acceleration processing method based on ZYNQ platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110807193.1A CN113516236A (en) 2021-07-16 2021-07-16 VGG16 network parallel acceleration processing method based on ZYNQ platform

Publications (1)

Publication Number Publication Date
CN113516236A true CN113516236A (en) 2021-10-19

Family

ID=78067406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110807193.1A Pending CN113516236A (en) 2021-07-16 2021-07-16 VGG16 network parallel acceleration processing method based on ZYNQ platform

Country Status (1)

Country Link
CN (1) CN113516236A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114662660A (en) * 2022-03-14 2022-06-24 昆山市工业技术研究院有限责任公司 CNN accelerator data access method and system
CN114819120A (en) * 2022-02-25 2022-07-29 西安电子科技大学 PYNQ platform-based neural network universal acceleration processing method
CN116776945A (en) * 2023-06-26 2023-09-19 中国科学院长春光学精密机械与物理研究所 VGG16 network accelerator design realization method based on ZYNQ platform
WO2024198425A1 (en) * 2023-03-29 2024-10-03 腾讯科技(深圳)有限公司 Data processing method and apparatus based on memory access engine, device, memory access engine and computer program product


Similar Documents

Publication Publication Date Title
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
Ma et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA
US20210357736A1 (en) Deep neural network hardware accelerator based on power exponential quantization
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN111626414B (en) Dynamic multi-precision neural network acceleration unit
TW201913460A (en) Chip device and related products
CN112200300B (en) Convolutional neural network operation method and device
CN110348574A (en) A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN110851779B (en) Systolic array architecture for sparse matrix operations
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN109472734B (en) Target detection network based on FPGA and implementation method thereof
CN113792621A (en) Target detection accelerator design method based on FPGA
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN113313244B (en) Near-storage neural network accelerator for addition network and acceleration method thereof
CN109615061B (en) Convolution operation method and device
CN110766136B (en) Compression method of sparse matrix and vector
CN116187407A (en) System and method for realizing self-attention mechanism based on pulsation array
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN117035028A (en) FPGA-based convolution accelerator efficient calculation method
CN113780529B (en) FPGA-oriented sparse convolutional neural network multi-stage storage computing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination