CN118468952A - Neural network model deployment method and system based on FPGA acceleration - Google Patents
- Publication number
- CN118468952A (application CN202410927277.2A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- network model
- image
- fpga
- development board
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a neural network model deployment method and system based on FPGA acceleration. A neural network model is selected and compressed, an FPGA development board is selected, and a global framework suitable for the neural network model is designed based on the FPGA development board. A description file of the global framework is added into the device tree file, a linux system capable of running on the FPGA development board is generated, and the generated linux system is burnt into the NAND FLASH on the FPGA development board. The compressed neural network model is compiled into a binary code file and deployed on the FPGA development board to obtain the deployed FPGA development board. An image to be detected is acquired and input into the deployed FPGA development board for processing, and the prediction frame and confidence score of the target to be detected obtained after processing are loaded onto the image to be detected and displayed. The method has the advantages of high detection speed, low power consumption and high accuracy, and is suitable for low-power, low-latency applications.
Description
Technical Field
The invention relates to the technical field of target detection, in particular to a neural network model deployment method and system based on FPGA acceleration.
Background
When a neural network is deployed in an embedded setting, most hardware is unsuitable for low-power, low-latency applications because of energy constraints. The FPGA has the advantages of low power consumption, low latency and reconfigurability, and is well suited to the inference stage of a neural network. Compared with other hardware, the FPGA can change its hardware structure through configuration and be customized for a specific application, which suits deep learning scenarios that are iterated and updated continuously. The FPGA is therefore a promising choice that can well meet current development needs; however, because FPGA description languages suffer from low development efficiency, and designs can exhibit limited parallelism, inefficient processor architectures, low precision and poor generality, it is necessary to design a suitable neural network model compression and deployment method.
Disclosure of Invention
To address the shortcomings of existing methods, the invention provides a neural network model deployment method and system based on FPGA acceleration.
The invention provides a neural network model deployment method based on FPGA acceleration, which comprises the following steps:
S1, selecting and training a neural network model, and performing compression processing on the trained neural network model to obtain a compressed neural network model;
S2, selecting an FPGA development board, designing a global framework suitable for a neural network model based on the FPGA development board, adding a description file of the global framework into a device tree file, and generating a linux system capable of running on the FPGA development board;
S3, burning the generated linux system into NAND FLASH on an FPGA development board, powering on and starting, compiling the compressed neural network model into a binary code file, and deploying the binary code file on the FPGA development board to obtain a deployed FPGA development board;
S4, acquiring an image to be detected, inputting the image to be detected to a deployed FPGA development board for processing, obtaining a prediction frame and confidence score of a target to be detected, loading the prediction frame and confidence score of the target to be detected on the image to be detected, and transmitting the image to be detected to a display for displaying.
Preferably, S1 specifically includes the following:
S11, selecting a neural network model, acquiring an image to be detected, manufacturing a data set, and storing the data set on a server with a GPU;
S12, pre-training the selected neural network model on a server with a GPU through a data set to obtain a pre-trained neural network model;
S13, performing sparse training on the pre-trained neural network model, and calculating model loss through a preset loss function to obtain sparse weights;
S14, pruning treatment is carried out on the sparse weight, and fine adjustment is carried out, so that a neural network model after pruning fine adjustment is obtained;
S15, carrying out quantization compression on the neural network model subjected to pruning fine adjustment to obtain a neural network model subjected to quantization compression.
Preferably, the loss function preset in S13 is specifically:
$$L = l\big(f(x, W)\big) + \lambda \sum_{i=1}^{n} g(\gamma_i), \qquad g(\gamma_i) = \left|\gamma_i\right|;$$

wherein $L$ is the loss function of the neural network model with the scaling factors added, $l(\cdot)$ represents the loss function of the neural network model, $f(\cdot)$ represents a convolution calculation, $x$ is the input feature map, $f(x, W)$ is the convolution calculation result of the input feature map, $W$ is a weight parameter of the neural network model, $\gamma_i$ is the $i$-th scaling factor, $g(\gamma_i)$ is the penalty caused by the sparsity of the $i$-th scaling factor, $\lambda$ is the penalty factor, and $n$ is the total number of scaling factors.
Preferably, the specific procedure of S14 is as follows:
S141, calculating the mean value and variance of all parameters in the feature map parameter set;
S142, calculating the average value of each feature map parameter after normalization according to the average value and the variance;
S143, training the characteristic map parameters according to the average value after the normalization of each characteristic map parameter batch and updating the scaling factors;
S144, the pruning rate is set according to the scaling factors, pruning is carried out on the neural network model, fine tuning training is carried out on the neural network model after pruning, and the neural network model after pruning fine tuning is obtained.
Preferably, in S15, the neural network model after pruning fine tuning training is quantized and compressed, specifically, the KLD quantization method is adopted for quantization and compression, and the method includes the following steps:
S151, obtaining a calculation map of the neural network model after pruning fine adjustment, and inserting a pseudo quantization operator into the calculation map to obtain the calculation map after inserting the pseudo quantization operator;
S152, acquiring an unlabeled dataset, and inputting the unlabeled dataset into a calculation map inserted with a pseudo quantization operator to generate a calculation result distribution histogram;
S153, calculating the probability distribution of the calculation result inserted with the pseudo quantization operator and the probability distribution of the calculation result not inserted with the pseudo quantization operator according to the calculation result distribution histogram, and calculating the KL divergence of the two probability distributions;
S154, selecting a quantized value range corresponding to the smallest KL divergence, and quantizing the weight and the activation function of the pruned and fine-tuned neural network model according to the quantized value range to obtain a quantized and compressed neural network model.
Preferably, the neural network model in S1 includes an image preprocessing unit, a convolution computing unit and an image post-processing unit that are sequentially connected, and in S2, a global architecture suitable for the neural network model is designed based on an FPGA development board, and the specific design process of the global architecture is as follows:
S21, selecting an FPGA development board, wherein the FPGA development board comprises an ARM end, an FPGA end and peripheral equipment which are connected with each other, the ARM end and the FPGA end are communicated through an AXI bus, the ARM end is responsible for inputting and decoding images, and the FPGA end is responsible for preprocessing, calculating and post-processing the images by a neural network model;
S22, the ARM end comprises an ARM controller, a DDR memory and a hardware encoder, wherein an original image is encoded by the hardware encoder and then is input into the DDR memory, and the ARM controller calls the DDR memory through the DDR controller to write or read the original image;
s23, designing an image preprocessing kernel, a convolution computing kernel and an image post-processing kernel which respectively correspond to the image preprocessing unit, the convolution computing unit and the image post-processing unit at the FPGA end, connecting the image preprocessing kernel with the input end of the convolution computing kernel through a first in-chip DDR, and connecting the image post-processing kernel with the output end of the convolution computing kernel through a second in-chip DDR;
s24, the image preprocessing kernel is connected with a DDR memory for storing images at an ARM end through AXI_M, reads original images from the DDR memory and preprocesses the original images, and writes the preprocessed images into a first DDR in an FPGA end;
s25, reading the preprocessed image from the first on-chip DDR of the FPGA end by a convolution calculation kernel, calculating through the on-chip BRAM, generating a target frame position and a confidence coefficient score of each frame of image, and writing the target frame position and the confidence coefficient score of each frame of image into the second on-chip DDR of the FPGA end;
S26, the image post-processing kernel reads out the target frame position and the confidence score of each frame of image from the DDR in the second chip at the FPGA end, processes the target frame position and the confidence score, and screens out a prediction frame and a corresponding confidence score which meet preset conditions.
Preferably, the image preprocessing kernel in S24 reads the original image from the DDR memory and preprocesses the original image, and specifically includes the following steps:
S241, carrying out format conversion on the input image to obtain an image with a converted format;
S242, adjusting the size of the image after the format conversion according to the requirement of the neural network;
S243, quantizing the data bit number of the image after the resizing to obtain a preprocessed image.
Preferably, S26 specifically includes the following:
S261, reading out the target frame position and the confidence coefficient score of each frame of image from the DDR in the second chip on the FPGA end, and decoding a prediction frame of each frame of image according to the target frame position and the confidence coefficient score of each frame of image;
S262, inversely quantizing the predicted frame of each frame of image into 32-bit data, and converting the position of the predicted frame of each frame of image into actual image coordinates;
S263, screening out a prediction frame meeting a preset condition and a corresponding confidence score according to the confidence score of the target frame.
The invention further provides a neural network model deployment system based on FPGA acceleration, which comprises a camera, a personal computer, a server, a display and an FPGA development board, wherein the FPGA development board is respectively connected with the camera, the server and the display, the personal computer is respectively connected with the server and the FPGA development board through a router, the personal computer is also connected with the display through an HDMI converter, the FPGA development board is provided with a neural network model after quantization compression and a linux system capable of running on the FPGA development board,
The camera is used for collecting an image to be detected;
the server is used for training and compressing the neural network model to obtain a compressed neural network model;
the personal computer is used for starting the FPGA development board to perform reasoning;
the FPGA development board is used for receiving the image to be detected, processing the image through the compressed network model and the linux system which can be operated on the FPGA development board, obtaining a prediction frame and a confidence score of the object to be detected, loading the prediction frame and the confidence score of the object to be detected on the image to be detected, and transmitting the prediction frame and the confidence score to the display for displaying.
According to the FPGA acceleration-based neural network model deployment method and system, a neural network model is selected and trained, and the trained neural network model is compressed to obtain a compressed neural network model. An FPGA development board is selected, a global framework suitable for the neural network model is designed based on the FPGA development board, the description file of the global framework is added into the device tree file, and a linux system capable of running on the FPGA development board is generated. The generated linux system is burnt into the NAND FLASH on the FPGA development board, which is then powered on and started, and the compressed neural network model is compiled into a binary code file and deployed on the FPGA development board to obtain the deployed FPGA development board. An image to be detected is collected and input into the deployed FPGA development board for processing to obtain the prediction frame and confidence score of the object to be detected, which are loaded onto the image to be detected and transmitted to the display for display. The method has the advantages of high detection speed, low power consumption and high accuracy, and is suitable for low-power, low-latency applications.
Drawings
FIG. 1 is a flow chart of a neural network model deployment method based on FPGA acceleration in an embodiment of the invention;
FIG. 2 is a schematic diagram of pruning of a neural network model according to an embodiment of the present invention, wherein FIG. 2 (a) is a schematic diagram before pruning and FIG. 2 (b) is a schematic diagram after pruning;
FIG. 3 is a diagram illustrating unsaturated quantization and saturated quantization according to an embodiment of the invention, wherein FIG. 3 (a) is a diagram illustrating unsaturated quantization and FIG. 3 (b) is a diagram illustrating saturated quantization;
FIG. 4 is a schematic diagram of a quantization process using saturation quantization according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of convolution loop unrolling in an embodiment of the present invention;
FIG. 6 is a schematic diagram of pipeline computations in an embodiment of the present invention;
FIG. 7 is a schematic diagram of a global architecture based on FPGA acceleration in accordance with one embodiment of the present invention;
FIG. 8 is a schematic diagram of a neural network model deployment system based on FPGA acceleration in accordance with an embodiment of the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart of a neural network model deployment method based on FPGA acceleration in an embodiment of the present invention. A neural network model deployment method based on FPGA acceleration comprises the following steps:
S1, selecting and training a neural network model, and compressing the trained neural network model to obtain a compressed neural network model.
In one embodiment, S1 specifically includes the following:
S11, selecting a neural network model, acquiring an image to be detected, manufacturing a data set, and storing the data set on a server with a GPU.
S12, pre-training the selected neural network model on a server with the GPU through a data set to obtain a pre-trained neural network model.
Specifically, the neural network model YOLOv4 is selected, and the feature map extraction portion of YOLOv4 is modified: the kernel sizes of the max pooling layers (MaxPool) in the Spatial Pyramid Pooling (SPP) module are changed from 5x5, 9x9 and 13x13 to 3x3, 5x5 and 7x7, and LeakyReLU is selected as the activation function.
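For illustration only (not part of the claimed implementation), a minimal PyTorch sketch of such a modified SPP block follows; the channel count and the 1x1 fusion convolution are assumptions made for the example rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class ModifiedSPP(nn.Module):
    """SPP block with max-pool kernel sizes reduced from 5/9/13 to 3/5/7 (stride 1, 'same' padding)."""
    def __init__(self, channels: int = 512):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (3, 5, 7)]
        )
        # 1x1 convolution to fuse the concatenated branches back to `channels` (illustrative)
        self.fuse = nn.Conv2d(channels * 4, channels, kernel_size=1, bias=False)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branches = [x] + [pool(x) for pool in self.pools]   # identity + three pooled branches
        return self.act(self.fuse(torch.cat(branches, dim=1)))

if __name__ == "__main__":
    feat = torch.randn(1, 512, 13, 13)        # example backbone feature map
    print(ModifiedSPP(512)(feat).shape)        # torch.Size([1, 512, 13, 13])
```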
And S13, performing sparse training on the pre-trained neural network model, and calculating model loss through a preset loss function to obtain sparse weights.
Further, the loss function preset in S13 is specifically:

$$L = l\big(f(x, W)\big) + \lambda \sum_{i=1}^{n} g(\gamma_i), \qquad g(\gamma_i) = \left|\gamma_i\right|;$$

wherein $L$ is the loss function of the neural network model with the scaling factors added, $l(\cdot)$ represents the loss function of the neural network model (i.e. the neural network model selected in S11), $f(\cdot)$ represents a convolution calculation, $x$ is the input feature map, $f(x, W)$ is the convolution calculation result of the input feature map, $W$ is a weight parameter of the neural network model, $\lambda \sum_{i=1}^{n} g(\gamma_i)$ is the L1 regularization constraint term on the BN layers of the neural network model, $\gamma_i$ is the $i$-th scaling factor, $g(\gamma_i)$ is the penalty caused by the sparsity of the $i$-th scaling factor, $\lambda$ is the penalty factor, and $n$ is the total number of scaling factors.
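As an illustration, a minimal PyTorch sketch of adding such an L1 sparsity penalty on the BN scaling factors to the task loss is shown below; the helper name `sparsity_loss`, the placeholder `detection_loss` and the penalty value are assumptions of the example, not the patented implementation.

```python
import torch
import torch.nn as nn

def sparsity_loss(model: nn.Module, base_loss: torch.Tensor, penalty: float = 1e-4) -> torch.Tensor:
    """Add the L1 penalty on every BN scaling factor (gamma) to the task loss,
    i.e. L = l(f(x, W)) + lambda * sum_i |gamma_i|."""
    l1 = sum(m.weight.abs().sum() for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return base_loss + penalty * l1

# Usage inside a training step (detection_loss stands for the model's own loss function):
#   loss = sparsity_loss(model, detection_loss(model(images), targets))
#   loss.backward()
```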
And S14, pruning the sparse weight and fine tuning to obtain the neural network model after pruning fine tuning.
In one embodiment, the specific procedure of S14 is as follows:
S141, calculating the mean value and variance of all parameters in the feature map parameter set;
S142, calculating the normalized value of each feature map parameter according to the average value and the variance:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$

wherein

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2;$$

wherein $B = \{x_1, \dots, x_m\}$ represents the set of feature map parameters, $x_i$ represents the $i$-th feature map parameter in the set, $m$ represents the number of feature map parameters, $\hat{x}_i$ represents the value of the $i$-th feature map parameter after batch normalization, $\epsilon$ is a small positive number, $\mu_B$ represents the batch normalization mean, and $\sigma_B^2$ represents the batch normalization variance.
S143, training the feature map parameters according to the average value after the normalization of each feature map parameter batch and updating the scaling factors;
S144, setting the pruning rate according to the scaling factor, pruning the neural network model, and performing fine tuning training on the pruned neural network model to obtain the neural network model after the pruning fine tuning training.
Specifically, referring to fig. 2, fig. 2 is a schematic diagram of pruning of a neural network model according to an embodiment of the present invention, where fig. 2 (a) is a schematic diagram before pruning, and fig. 2 (b) is a schematic diagram after pruning.
The pruning rate needs to be set according to the scaling factors when pruning; for example, the larger the scaling factors, the larger the pruning rate that can be selected. After a pruning rate is selected, channels whose scaling factors approach 0 are pruned more readily; the higher the pruning rate, the smaller the resulting model, but when too many channels are pruned the accuracy of the model drops, in which case the accuracy can be recovered by a small amount of additional training.
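A simplified sketch of how such channel-level pruning decisions could be derived from the BN scaling factors is shown below; it only computes per-layer keep masks from a global threshold, while the actual removal of channels and the subsequent fine-tuning are omitted. The function name and the default pruning rate are assumptions of the example.

```python
import torch
import torch.nn as nn

def bn_prune_masks(model: nn.Module, prune_rate: float = 0.5):
    """Return a {layer_name: bool mask} dict keeping channels whose |gamma|
    lies above the global threshold implied by the pruning rate."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_rate)   # gammas below this are pruned
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            masks[name] = m.weight.data.abs() > threshold
    return masks

# Channels with mask == False would be removed from the BN layer and from the
# preceding/following convolutions, after which the network is fine-tuned briefly.
```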
S15, carrying out quantization compression on the neural network model after pruning fine tuning training to obtain a neural network model after quantization compression.
In one embodiment, in S15, the pruned and trimmed neural network model is quantized and compressed, specifically, by using a KLD quantization method, where the method includes the following steps:
S151, obtaining a calculation map of the neural network model after pruning fine adjustment, and inserting a pseudo quantization operator into the calculation map to obtain the calculation map after inserting the pseudo quantization operator;
S152, acquiring an unlabeled dataset, and inputting the unlabeled dataset into a calculation map inserted with a pseudo quantization operator to generate a calculation result distribution histogram;
S153, calculating the probability distribution of the calculation result inserted with the pseudo quantization operator and the probability distribution of the calculation result not inserted with the pseudo quantization operator according to the calculation result distribution histogram, and calculating the KL divergence of the two probability distributions;
S154, selecting a quantized value range corresponding to the smallest KL divergence, and quantizing the weight and the activation function of the pruned and fine-tuned neural network model according to the quantized value range to obtain a quantized and compressed neural network model.
Specifically, referring to fig. 3 and fig. 4, fig. 3 is a schematic diagram of unsaturated quantization and saturated quantization in an embodiment of the invention, where fig. 3 (a) is a schematic diagram of unsaturated quantization and fig. 3 (b) is a schematic diagram of saturated quantization; fig. 4 is a schematic diagram of a quantization process using saturation quantization according to an embodiment of the present invention.
Quantization can be classified into unsaturated quantization and saturated quantization: unsaturated quantization directly maps the weight range [min, max] to [-127, 127], while saturated quantization determines the clipping range of the quantized weights by running inference over a calibration data set.

In this embodiment, a KLD saturation quantization method is adopted. Because the value range of the input feature map is not fixed, several candidate value ranges are selected and quantized, each is convolved with the input feature map to obtain a calculation result, and a histogram of the calculation results is generated. From the histogram, the probability distribution of the calculation result with the pseudo quantization operator inserted and the probability distribution of the calculation result without the pseudo quantization operator are computed, the KL divergence of the two probability distributions is calculated, and the value range whose two distributions are most similar (that is, whose KL divergence is the smallest) is selected as the quantized value range. The weights and activation function of the pruned and fine-tuned neural network model are then quantized according to this value range, giving the quantized and compressed neural network model, which corresponds to the int8-quantized and compressed neural network model in fig. 4.
Taking a certain layer of the pruned and fine-tuned neural network model as an example, the similarity between the two distributions is measured with the KL divergence, calculated as:

$$D = \sum_{i=1}^{N} \sum_{j} P_i(j)\,\log\frac{P_i(j)}{Q_i(j)};$$

wherein $D$ is, for the given layer of the pruned and fine-tuned neural network model, the sum of the KL divergences between the two distributions corresponding to the calculation result histograms obtained without and with quantization, $P_i(j)$ is the probability distribution of the convolution calculation result of the $i$-th convolution kernel without quantization, $Q_i(j)$ is the probability distribution of the convolution calculation result of the $i$-th convolution kernel under the selected quantization interval, and $N$ represents the number of convolution kernels of this layer.
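For illustration, a simplified numpy sketch of a KLD-style threshold search over a histogram of calculation results is given below; the bin count, the number of quantization levels and the clip-and-rebin construction of the two distributions follow common KLD calibration practice and are assumptions of the example, not necessarily the exact procedure of this embodiment.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(P || Q) over bins where both distributions have mass (simplification)."""
    p = p / p.sum()
    q = q / q.sum()
    mask = (p > 0) & (q > 0)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kld_threshold(values: np.ndarray, num_bins: int = 2048, num_levels: int = 128) -> float:
    """Pick the clipping threshold whose int8 quantization best preserves the
    distribution of |values| in the KL sense."""
    hist, edges = np.histogram(np.abs(values), bins=num_bins)
    best_kl, best_t = np.inf, edges[-1]
    for i in range(num_levels, num_bins + 1):
        # Reference distribution P: clip everything above bin i into the last kept bin.
        p = hist[:i].astype(np.float64).copy()
        p[-1] += hist[i:].sum()
        # Candidate distribution Q: collapse the i bins into num_levels quantized bins, then expand.
        chunks = np.array_split(hist[:i].astype(np.float64), num_levels)
        q = np.concatenate([np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
                            for c in chunks])
        if q.sum() == 0:
            continue
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t   # weights/activations would then be quantized with scale = best_t / 127
```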
S2, selecting an FPGA development board, designing a global framework suitable for a neural network model based on the FPGA development board, adding a description file of the global framework into a device tree file, and generating a linux system capable of running on the FPGA development board;
In one embodiment, the neural network model in S1 includes an image preprocessing unit, a computing unit and an image post-processing unit that are sequentially connected, and in S2, a global architecture suitable for the neural network model is designed based on an FPGA development board, and the specific design process of the global architecture is as follows:
S21, selecting an FPGA development board, wherein the FPGA development board comprises an ARM end and an FPGA end which are mutually connected, the ARM end and the FPGA end are communicated through an AXI bus, the ARM end is responsible for inputting and decoding images, and the FPGA end is responsible for preprocessing, calculating and post-processing the images by a neural network model.
S22, the ARM end comprises an ARM controller, a DDR memory and a hardware encoder, the original image is encoded by the hardware encoder and then is input into the DDR memory, and the ARM controller calls the DDR memory through the DDR controller to write or read the original image.
Specifically, the original image may be video data, the video data is input to a hardware encoder at the ARM end frame by frame, the encoded image data is obtained by using h.264 line by line encoding, and the encoded image data is transmitted and stored in the NV12 format.
S23, designing an image preprocessing kernel, a convolution computing kernel and an image post-processing kernel which respectively correspond to the image preprocessing unit, the convolution computing unit and the image post-processing unit at the FPGA end, connecting the image preprocessing kernel with the input end of the convolution computing kernel through the DDR in the first chip, and connecting the image post-processing kernel with the output end of the convolution computing kernel through the DDR in the second chip.
Specifically, three kernels are designed in the FPGA development board for preprocessing, convolution calculation and post-processing, and the three kernels respectively correspond to an image preprocessing unit, a calculation unit and an image post-processing unit of the neural network model and are respectively used for preprocessing, calculating and post-processing of the image of the neural network model.
S24, the image preprocessing kernel is connected with a DDR memory for storing images at an ARM end through AXI_M, reads original images from the DDR memory and preprocesses the images, and writes the preprocessed images into a first DDR in a FPGA end.
Further, the image preprocessing kernel in S24 reads the original image from the DDR memory and preprocesses the original image, which specifically includes the following steps:
S241, carrying out format conversion on the input image to obtain an image with a converted format;
S242, adjusting the size of the image after the format conversion according to the requirement of the neural network;
S243, quantizing the data bit number of the image after the resizing to obtain a preprocessed image.
Specifically, the image preprocessing kernel converts the encoded image data stored in the format of NV12 according to the format required by the neural network model after quantization compression, specifically converts the format of the input image into the BGR format, then adjusts the image size of the BGR format to the input size and bit width required by the neural network model, and quantizes the data bit number of the BGR format image after the size adjustment to 8-bit fixed point number (int 8), thereby obtaining the preprocessed image.
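A minimal software sketch of this preprocessing step (NV12-to-BGR conversion, resizing and int8 quantization) is given below for reference; the target input size and the quantization scale are illustrative assumptions, and OpenCV stands in here only as a software analogue of the FPGA preprocessing kernel.

```python
import cv2
import numpy as np

def preprocess_nv12(nv12: np.ndarray, width: int, height: int,
                    dst_size: int = 416, scale: float = 1 / 255.0) -> np.ndarray:
    """Convert one raw NV12 frame buffer to BGR, resize it to the network input
    size and quantize it to int8.  Size and scale are illustrative values."""
    # NV12 stores the Y plane (height rows) followed by interleaved UV (height/2 rows).
    yuv = nv12.reshape(height * 3 // 2, width)
    bgr = cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR_NV12)
    resized = cv2.resize(bgr, (dst_size, dst_size))
    # Map [0, 255] pixel values to the int8 range expected by the quantized model.
    q = np.clip(np.round(resized.astype(np.float32) * scale * 127.0), -128, 127)
    return q.astype(np.int8)
```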
S25, the convolution calculation kernel reads out the preprocessed image from the first on-chip DDR of the FPGA end, calculates through the on-chip BRAM, generates the target frame position and the confidence score of each frame of image, and writes the target frame position and the confidence score of each frame of image into the second on-chip DDR of the FPGA end.
Specifically, the parameters and weights of the neural network model are loaded from NAND FLASH into the on-chip BRAM, and the convolution calculation kernel reads the preprocessed image data from the first on-chip DDR into the on-chip BRAM for calculation. The calculation kernel consists of several calculation engines, each made up of a group of multipliers and adders; calculation tasks are distributed by the FPGA driver, and each FPGA kernel represents one thread, so a pipelined design can exploit thread parallelism and one frame of image does not need to be read out completely before processing begins. This realizes loop-unrolled convolution (fig. 5) and pipelined calculation (see fig. 6), while the on-chip BRAM is used for data buffering and for input and output, which improves data reuse and reduces data accesses. The original image is input frame by frame and read row by row, each group of convolution kernels is copied and unrolled to carry out the convolution calculation, the target frame position and confidence score corresponding to each frame of image are generated, and finally the target frame position and confidence score corresponding to each frame of image are written back into the second on-chip DDR.
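As a conceptual software analogue (not the FPGA kernel itself), the sketch below shows how a convolution can be computed from a stream of image rows using a small line buffer, so that processing can begin before a whole frame has been read; in hardware the inner loops would be unrolled and the stages pipelined.

```python
import numpy as np

def row_streamed_conv(rows, kernel: np.ndarray):
    """Consume an image row by row and emit one output row as soon as enough
    rows (a k-row line buffer) are available -- a software analogue of the
    pipelined, line-buffered convolution performed in on-chip BRAM.
    Computes a cross-correlation, as CNN 'convolutions' usually do."""
    k = kernel.shape[0]
    line_buffer = []
    for row in rows:                           # rows arrive one at a time (streaming input)
        line_buffer.append(np.asarray(row, dtype=np.float32))
        if len(line_buffer) > k:
            line_buffer.pop(0)                 # keep only the last k rows "on chip"
        if len(line_buffer) == k:
            window = np.stack(line_buffer)     # k x W tile
            out_w = window.shape[1] - k + 1
            out_row = np.empty(out_w, dtype=np.float32)
            for x in range(out_w):             # these inner loops would be unrolled in hardware
                out_row[x] = float(np.sum(window[:, x:x + k] * kernel))
            yield out_row

# usage: for out in row_streamed_conv(image_rows, np.ones((3, 3)) / 9): ...
```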
S26, the image post-processing kernel reads out the target frame position and the confidence score of each frame of image from the DDR in the second chip at the FPGA end, processes the target frame position and the confidence score, and screens out a prediction frame and a corresponding confidence score which meet preset conditions.
Further, S26 specifically includes the following:
S261, reading out the target frame position and the confidence coefficient score of each frame of image from the DDR in the second chip on the FPGA end, and decoding a prediction frame of each frame of image according to the target frame position and the confidence coefficient score of each frame of image;
S262, inversely quantizing the predicted frame of each frame of image into 32-bit data, and converting the position of the predicted frame of each frame of image into actual image coordinates;
S263, screening out a prediction frame meeting a preset condition and a corresponding confidence score according to the confidence score of the target frame.
Specifically, the screening of the prediction frames meeting the preset conditions is to screen out a preset number of prediction frames corresponding to the target frames with higher confidence scores according to the confidence scores of the target frames.
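A minimal sketch of this post-processing step (inverse quantization, coordinate conversion and confidence screening) is given below; the quantization scales, the assumed box layout (normalized x1, y1, x2, y2) and the confidence threshold are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def postprocess(boxes_q: np.ndarray, scores_q: np.ndarray,
                box_scale: float, score_scale: float,
                img_w: int, img_h: int, conf_thresh: float = 0.5):
    """Dequantize int8 box/score outputs to float32, map normalized box
    coordinates to pixel coordinates and keep boxes above the confidence threshold."""
    boxes = boxes_q.astype(np.float32) * box_scale      # inverse quantization to 32-bit data
    scores = scores_q.astype(np.float32) * score_scale
    boxes[:, [0, 2]] *= img_w                           # normalized -> actual image coordinates
    boxes[:, [1, 3]] *= img_h
    keep = scores >= conf_thresh                        # screen by confidence score
    return boxes[keep], scores[keep]
```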
Specifically, referring to fig. 7, taking a ZYNQ development board (ZYNQ is the collective name for a series of development boards) as an example, the ZYNQ development board includes an ARM end and an FPGA end; the ARM end is responsible for inputting and decoding images, and the three kernels at the FPGA end are respectively responsible for the preprocessing, calculation and post-processing of images by the neural network model. The global architecture is characterized by high parallelism, convolution tiling and loop unrolling, with parallelism in three dimensions: pixel parallelism, input channel parallelism and output channel parallelism. The description file of the designed global framework is added into the device tree file to generate a linux system running on the board. The specific process is as follows:
1. Vivado platform design: the Vivado design adds the input and output parameters of the platform and adds the IP core; the IP uses an AXI interface, and the input and output parameters include the input feature map data stream, the output feature map data stream, the weight data stream and the configuration data stream, where the configuration data stream contains the input and output feature map dimension information and the type of the current layer. The physical hardware interface is provided to the Vitis compiler to create an extensible hardware platform, whose main component is the hardware description file generated by Vivado.
2. Vitis HLS design of the preprocessing IP and post-processing IP: the C/C++ preprocessing and post-processing code is compiled into FPGA preprocessing and post-processing IP using high-level synthesis (HLS); the kernels are compiled into xo files and used in the next step by the Vitis compiler.
3. Cross-compilation of library functions: the packaged library functions are added to the hardware platform generated by Vivado using the cross-compiling tool SDK and provided to the Vitis compiler.
4. Vitis generation of the device image system: the FPGA kernels generated by HLS are integrated in Vitis into the extensible hardware platform generated by Vivado, finally generating a linux system that runs on the board.
And S3, burning the generated linux system into NAND FLASH on an FPGA development board, powering on and starting, compiling the compressed neural network model into a binary code file, and deploying the binary code file on the FPGA development board to obtain a deployed FPGA development board.
Specifically, a camera is connected with an FPGA development board (corresponding to the ZYNQ development board in fig. 8), an HDMI interface on the FPGA development board is connected with a display, a network port on the FPGA development board is connected with a server, the camera is powered on, a generated linux system is burnt to NAND FLASH on the FPGA development board, and a quantized compressed neural network model is compiled into a binary code file and transmitted to the FPGA development board, so that the deployed FPGA development board is obtained.
S4, acquiring an image to be detected, inputting the image to be detected to a deployed FPGA development board for processing, obtaining a prediction frame and confidence score of a target to be detected, loading the prediction frame and confidence score of the target to be detected on the image to be detected, and transmitting the image to be detected to a display for displaying.
Specifically, an image to be detected is collected and input into a neural network model deployed on an FPGA development board, the FPGA development board performs real-time reasoning to obtain a prediction frame and a confidence score of a target to be detected, and the prediction frame and the confidence score of the target to be detected are loaded onto the image to be detected and transmitted to a display for display.
In one embodiment, referring to fig. 8, a neural network model deployment system based on FPGA acceleration includes a camera, a personal computer, a server, a display, and an FPGA development board, the FPGA development board is respectively connected with the camera, the server, and the display, the personal computer is respectively connected with the server and the FPGA development board through a router, the personal computer is further connected with the display through an HDMI converter, the FPGA development board is provided with a compressed neural network model and a linux system capable of running on the FPGA development board, wherein,
The camera is used for collecting an image to be detected;
the server is used for training and compressing the neural network model to obtain a compressed neural network model;
The personal computer is used for starting the FPGA development board to perform reasoning;
the FPGA development board is used for receiving the image to be detected, processing the image through the compressed network model and the linux system which can be operated on the FPGA development board, obtaining a prediction frame and a confidence score of the object to be detected, loading the prediction frame and the confidence score of the object to be detected on the image to be detected, and transmitting the prediction frame and the confidence score to the display for displaying.
The state of the FPGA development board can be monitored in real time over the network, and the quantized and compressed neural network model can be replaced online in real time as required (because the trained, quantized and compressed neural network model can only recognize the one or more classes of objects it was trained on, the model must be retrained and compressed before it can recognize other classes of objects, and then redeployed on the FPGA development board).
Specific limitation regarding a neural network model deployment system based on FPGA acceleration can be found in the above limitation regarding a neural network model deployment method based on FPGA acceleration, and will not be described herein.
According to the FPGA acceleration-based neural network model deployment method and system, a neural network model is selected and trained, and the trained neural network model is compressed to obtain a compressed neural network model. An FPGA development board is selected, a global framework suitable for the neural network model is designed based on the FPGA development board, the description file of the global framework is added into the device tree file, and a linux system capable of running on the FPGA development board is generated. The generated linux system is burnt into the NAND FLASH on the FPGA development board, which is then powered on and started, and the compressed neural network model is compiled into a binary code file and deployed on the FPGA development board. An image to be detected is collected and input into the FPGA development board, where it is processed by the compressed neural network model deployed on the board to obtain the prediction frame and confidence score of the object to be detected, which are loaded onto the image to be detected and transmitted to the display for display. The method has the advantages of high detection speed, low power consumption and high accuracy, and is suitable for low-power, low-latency applications.
The neural network model deployment method and the neural network model deployment system based on FPGA acceleration provided by the invention are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the core concepts of the invention. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.
Claims (9)
1. The neural network model deployment method based on FPGA acceleration is characterized by comprising the following steps:
S1, selecting and training a neural network model, and performing compression processing on the trained neural network model to obtain a compressed neural network model;
S2, selecting an FPGA development board, designing a global framework suitable for a neural network model based on the FPGA development board, adding a description file of the global framework into a device tree file, and generating a linux system capable of running on the FPGA development board;
S3, burning the generated linux system into NAND FLASH on an FPGA development board, powering on and starting, compiling the compressed neural network model into a binary code file, and deploying the binary code file on the FPGA development board to obtain a deployed FPGA development board;
S4, acquiring an image to be detected, inputting the image to be detected to a deployed FPGA development board for processing, obtaining a prediction frame and confidence score of a target to be detected, loading the prediction frame and confidence score of the target to be detected on the image to be detected, and transmitting the image to be detected to a display for displaying.
2. The method for deploying a neural network model based on FPGA acceleration as set forth in claim 1, wherein S1 specifically comprises the following steps:
S11, selecting a neural network model, acquiring an image to be detected, manufacturing a data set, and storing the data set on a server with a GPU;
S12, pre-training the selected neural network model on a server with a GPU through a data set to obtain a pre-trained neural network model;
S13, performing sparse training on the pre-trained neural network model, and calculating model loss through a preset loss function to obtain sparse weights;
S14, pruning treatment is carried out on the sparse weight, and fine adjustment is carried out, so that a neural network model after pruning fine adjustment is obtained;
S15, carrying out quantization compression on the neural network model subjected to pruning fine adjustment to obtain a neural network model subjected to quantization compression.
3. The method for deploying a neural network model based on FPGA acceleration according to claim 2, wherein the loss function preset in S13 specifically is:
$$L = l\big(f(x, W)\big) + \lambda \sum_{i=1}^{n} g(\gamma_i), \qquad g(\gamma_i) = \left|\gamma_i\right|;$$

wherein $L$ is the loss function of the neural network model with the scaling factors added, $l(\cdot)$ represents the loss function of the neural network model, $f(\cdot)$ represents a convolution calculation, $x$ is the input feature map, $f(x, W)$ is the convolution calculation result of the input feature map, $W$ is a weight parameter of the neural network model, $\gamma_i$ is the $i$-th scaling factor, $g(\gamma_i)$ is the penalty caused by the sparsity of the $i$-th scaling factor, $\lambda$ is the penalty factor, and $n$ is the total number of scaling factors.
4. The neural network model deployment method based on FPGA acceleration as set forth in claim 3, wherein the specific process of S14 is as follows:
S141, calculating the mean value and variance of all parameters in the feature map parameter set;
S142, calculating the average value of each feature map parameter after normalization according to the average value and the variance;
S143, training the characteristic map parameters according to the average value after the normalization of each characteristic map parameter batch and updating the scaling factors;
S144, the pruning rate is set according to the scaling factors, pruning is carried out on the neural network model, fine tuning training is carried out on the neural network model after pruning, and the neural network model after pruning fine tuning is obtained.
5. The FPGA acceleration-based neural network model deployment method according to claim 4, wherein the step of compressing the neural network model after pruning fine tuning training in a quantization manner, specifically compressing the neural network model in a quantization manner by using a KLD quantization method, comprises the steps of:
S151, obtaining a calculation map of the neural network model after pruning fine adjustment, and inserting a pseudo quantization operator into the calculation map to obtain the calculation map after inserting the pseudo quantization operator;
S152, acquiring an unlabeled dataset, and inputting the unlabeled dataset into a calculation map inserted with a pseudo quantization operator to generate a calculation result distribution histogram;
S153, calculating the probability distribution of the calculation result inserted with the pseudo quantization operator and the probability distribution of the calculation result not inserted with the pseudo quantization operator according to the calculation result distribution histogram, and calculating the KL divergence of the two probability distributions;
S154, selecting a quantized value range corresponding to the smallest KL divergence, and quantizing the weight and the activation function of the pruned and fine-tuned neural network model according to the quantized value range to obtain a quantized and compressed neural network model.
6. The FPGA acceleration-based neural network model deployment method according to claim 5, wherein the neural network model in S1 comprises an image preprocessing unit, a convolution computing unit and an image post-processing unit which are sequentially connected, and the global architecture suitable for the neural network model is designed based on an FPGA development board in S2, and the specific design process of the global architecture is as follows:
S21, selecting an FPGA development board, wherein the FPGA development board comprises an ARM end, an FPGA end and peripheral equipment which are connected with each other, the ARM end and the FPGA end are communicated through an AXI bus, the ARM end is responsible for inputting and decoding images, and the FPGA end is responsible for preprocessing, calculating and post-processing the images by a neural network model;
S22, the ARM end comprises an ARM controller, a DDR memory and a hardware encoder, wherein an original image is encoded by the hardware encoder and then is input into the DDR memory, and the ARM controller calls the DDR memory through the DDR controller to write or read the original image;
s23, designing an image preprocessing kernel, a convolution computing kernel and an image post-processing kernel which respectively correspond to the image preprocessing unit, the convolution computing unit and the image post-processing unit at the FPGA end, connecting the image preprocessing kernel with the input end of the convolution computing kernel through a first in-chip DDR, and connecting the image post-processing kernel with the output end of the convolution computing kernel through a second in-chip DDR;
s24, the image preprocessing kernel is connected with a DDR memory for storing images at an ARM end through AXI_M, reads original images from the DDR memory and preprocesses the original images, and writes the preprocessed images into a first DDR in an FPGA end;
s25, reading the preprocessed image from the first on-chip DDR of the FPGA end by a convolution calculation kernel, calculating through the on-chip BRAM, generating a target frame position and a confidence coefficient score of each frame of image, and writing the target frame position and the confidence coefficient score of each frame of image into the second on-chip DDR of the FPGA end;
S26, the image post-processing kernel reads out the target frame position and the confidence score of each frame of image from the DDR in the second chip at the FPGA end, processes the target frame position and the confidence score, and screens out a prediction frame and a corresponding confidence score which meet preset conditions.
7. The FPGA acceleration-based neural network model deployment method of claim 6, wherein the image preprocessing kernel in S24 reads and preprocesses the original image from the DDR memory, and specifically comprises the following steps:
S241, carrying out format conversion on the input image to obtain an image with a converted format;
S242, adjusting the size of the image after the format conversion according to the requirement of the neural network;
S243, quantizing the data bit number of the image after the resizing to obtain a preprocessed image.
8. The FPGA acceleration-based neural network model deployment method of claim 7, wherein S26 specifically comprises the following:
S261, reading out the target frame position and the confidence coefficient score of each frame of image from the DDR in the second chip on the FPGA end, and decoding a prediction frame of each frame of image according to the target frame position and the confidence coefficient score of each frame of image;
S262, inversely quantizing the predicted frame of each frame of image into 32-bit data, and converting the position of the predicted frame of each frame of image into actual image coordinates;
S263, screening out a prediction frame meeting a preset condition and a corresponding confidence score according to the confidence score of the target frame.
9. The FPGA-based acceleration neural network model deployment system is characterized by comprising a camera, a personal computer, a server, a display and an FPGA development board, wherein the FPGA development board is respectively connected with the camera, the server and the display, the personal computer is respectively connected with the server and the FPGA development board through a router, the personal computer is also connected with the display through an HDMI converter, the FPGA development board is provided with a neural network model after quantization compression and a linux system capable of running on the FPGA development board,
The camera is used for collecting an image to be detected;
the server is used for training and compressing the neural network model to obtain a compressed neural network model;
the personal computer is used for starting the FPGA development board to perform reasoning;
the FPGA development board is used for receiving the image to be detected, processing the image through the compressed network model and the linux system which can be operated on the FPGA development board, obtaining a prediction frame and a confidence score of the object to be detected, loading the prediction frame and the confidence score of the object to be detected on the image to be detected, and transmitting the prediction frame and the confidence score to the display for displaying.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410927277.2A CN118468952B (en) | 2024-07-11 | 2024-07-11 | Neural network model deployment method and system based on FPGA acceleration |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410927277.2A CN118468952B (en) | 2024-07-11 | 2024-07-11 | Neural network model deployment method and system based on FPGA acceleration |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118468952A true CN118468952A (en) | 2024-08-09 |
CN118468952B CN118468952B (en) | 2024-10-01 |
Family
ID=92160880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410927277.2A Active CN118468952B (en) | 2024-07-11 | 2024-07-11 | Neural network model deployment method and system based on FPGA acceleration |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118468952B (en) |
Citations (8)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111429486A (en) * | 2020-04-27 | 2020-07-17 | 山东万腾电子科技有限公司 | DNNDK model-based moving object real-time detection tracking system and method |
CN111709522A (en) * | 2020-05-21 | 2020-09-25 | 哈尔滨工业大学 | Deep learning target detection system based on server-embedded cooperation |
CN113780529A (en) * | 2021-09-08 | 2021-12-10 | 北京航空航天大学杭州创新研究院 | FPGA-oriented sparse convolution neural network multi-level storage computing system |
CN114580568A (en) * | 2022-03-24 | 2022-06-03 | 华南理工大学 | Fish species identification method based on deep learning |
CN115482456A (en) * | 2022-09-29 | 2022-12-16 | 河南大学 | High-energy-efficiency FPGA (field programmable Gate array) acceleration framework of YOLO (YOLO) algorithm |
US20240176938A1 (en) * | 2022-11-25 | 2024-05-30 | Dspace Gmbh | Method for preparing and providing an fpga build result of an fpga model |
CN116108896A (en) * | 2023-04-11 | 2023-05-12 | 上海登临科技有限公司 | Model quantization method, device, medium and electronic equipment |
CN117035028A (en) * | 2023-07-15 | 2023-11-10 | 淮阴工学院 | FPGA-based convolution accelerator efficient calculation method |
Non-Patent Citations (1)
Title |
---|
Li Xiangyang: "Design and Implementation of a Vehicle-Mounted Target Detection System Based on ZYNQ", Machine Design (机械设计), vol. 37, no. 1, 31 July 2020 (2020-07-31), pages 35-38 *
Also Published As
Publication number | Publication date |
---|---|
CN118468952B (en) | 2024-10-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113469073B (en) | SAR image ship detection method and system based on lightweight deep learning | |
CN108510067B (en) | Convolutional neural network quantification method based on engineering realization | |
Nagaraj et al. | Competent ultra data compression by enhanced features excerption using deep learning techniques | |
KR102721008B1 (en) | Complex binary decomposition network | |
CN107679572B (en) | Image distinguishing method, storage device and mobile terminal | |
CN113159276B (en) | Model optimization deployment method, system, equipment and storage medium | |
CN115210719A (en) | Adaptive quantization for executing machine learning models | |
CN114429208A (en) | Model compression method, device, equipment and medium based on residual structure pruning | |
CN118468952B (en) | Neural network model deployment method and system based on FPGA acceleration | |
CN112734025A (en) | Neural network parameter sparsification method based on fixed base regularization | |
CN116543214A (en) | Pulse neural network target detection method based on uniform poisson coding | |
CN114998661B (en) | Target detection method based on fixed point quantitative determination | |
CN116740808A (en) | Animal behavior recognition method based on deep learning target detection and image classification | |
Vogel et al. | Guaranteed compression rate for activations in cnns using a frequency pruning approach | |
US11657282B2 (en) | Efficient inferencing with fast pointwise convolution | |
Chai et al. | Low precision neural networks using subband decomposition | |
CN113505804A (en) | Image identification method and system based on compressed deep neural network | |
Latif et al. | Online Multimodal Compression using Pruning and Knowledge Distillation for Iris Recognition | |
CN111738084A (en) | Real-time target detection method and system based on CPU-GPU heterogeneous multiprocessor system on chip | |
Wall et al. | Real time texture classification using field programmable gate arrays | |
US20240144012A1 (en) | Method and apparatus for compressing neural network model by using hardware characteristics | |
CN113298248B (en) | Processing method and device for neural network model and electronic equipment | |
Imani et al. | Deep neural network acceleration framework under hardware uncertainty | |
Neseem | AI at the Edge: Efficient Deep Learning for Resource-Constrained Environments | |
Bansal et al. | Compressing VAE-Based Out-of-Distribution Detectors for Embedded Deployment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||