CN112465110B - Hardware accelerator for convolution neural network calculation optimization
- Publication number
- CN112465110B (application CN202011279360.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- neural network
- kernel
- convolution
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a hardware acceleration device for convolutional neural network computation optimization, comprising a parameter storage module, a scheduling control module and a plurality of acceleration kernel modules, where each acceleration kernel module comprises an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit and an output image cache unit. The invention retains a simple pipelined and parallel hardware structure while reducing the amount of computation and improving hardware acceleration performance by removing the zero values of the input feature map. The original structure of the convolutional neural network algorithm is preserved, no extra optimization of the algorithm is required to reduce the amount of computation, irregular network operation is avoided, and the device is suitable for hardware acceleration of a wide range of convolutional neural network algorithms.
Description
Technical Field
The application belongs to the technical field of computers, and particularly relates to a hardware acceleration device for convolutional neural network computation optimization.
Background
In recent years, deep neural network algorithms have been applied on a large scale in fields such as image processing and audio processing, and have had a significant influence on the world economy and social activities. Deep convolutional neural network technology has attracted wide attention in many machine learning fields; it achieves higher precision than traditional machine learning algorithms and can reach accuracy exceeding that of human beings.
Generally, the deeper a convolutional neural network (i.e., the more layers it has), the more accurate its inference results. At the same time, however, deeper networks mean that more computing resources are consumed. In convolutional neural network architectures, intra-layer computations are independent and uncorrelated, while inter-layer computations resemble a pipeline structure and are inefficient to implement on a general-purpose processor. Because of this special computation pattern, convolutional neural networks are particularly suitable for hardware acceleration.
The deep neural network has the advantage of high accuracy but the disadvantage of a huge amount of computation, so reducing the amount of computation of convolutional neural networks has long been a popular research direction in the field of artificial intelligence. Keeping a simple pipelined and parallel structure, without adding extra preprocessing, while remaining compatible with a wide range of deep neural network algorithms and reducing the amount of computation is the present difficulty of hardware acceleration.
Disclosure of Invention
The application aims to provide a hardware acceleration device for convolutional neural network computation optimization, which can significantly reduce the amount of convolution computation and improve hardware acceleration performance.
In order to achieve the purpose, the technical scheme adopted by the application is as follows:
The application provides a hardware acceleration device for convolutional neural network computation optimization, which includes a parameter storage module, a scheduling control module and a plurality of acceleration kernel modules, and each acceleration kernel module includes an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit and an output image cache unit, wherein:
the parameter storage module is used for caching the convolutional neural network to be accelerated and a convolution kernel corresponding to the convolutional neural network;
the scheduling control module is used for controlling load-balanced computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules and distributing the input feature map data to be processed to an idle acceleration kernel module;
the input image cache unit is used for receiving and caching the input feature map data input into the acceleration kernel module;
the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module;
the zero-removal processing unit is used for removing zero values in the input feature map data;
the multiply-accumulate operation array unit is used for the multiply-accumulate operation between the weight data in the convolution kernel and the input feature map data after zero removal, and for outputting a convolution operation result;
the rectified linear unit is used for correcting the negative numbers in the convolution operation result to zero values to obtain a correction result;
and the output image cache unit is used for caching the correction result as output feature map data, and the output feature map data serves as the input feature map data of the next layer of convolution operation.
Preferably, the maximum data amount that can be directly processed by the multiply-accumulate operation performed by the acceleration kernel module at one time is: a convolution operation between an input feature map of size C × R × N and a convolution kernel of size W × H × N × M; where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of groups of convolution kernels.
Preferably, the input image buffer unit is a first random access memory for buffering input feature map data, the first random access memory has C × R address spaces in total, and each address space in the first random access memory stores data of N channels of one pixel.
Preferably, the weight buffer unit is a second random access memory for buffering the weight data, the second random access memory has W × H × N address spaces in total, and each address space in the second random access memory stores the weight data of M sets of convolution kernels of one point.
Preferably, the multiply-accumulate array unit includes M parallel MAC units, and each MAC unit implements multiply-accumulate operation of the input feature map data and the weight data of a set of convolution kernels.
Preferably, if the size of the input feature map to be processed is C' × R' × N', where C' represents the width of the image to be processed, R' represents the height of the image to be processed, and N' represents the number of channels of the image to be processed;
if N' > N, the input image cache unit uses a plurality of continuous address spaces to store the data of the N' channels of one pixel point; and if C' × R' > C × R, the input feature map to be processed is split into a plurality of blocks of size C × R × N and distributed to a plurality of acceleration kernel modules for operation.
Preferably, in the parameter storage module, if the size of the convolution kernel to be processed is W' × H' × N' × M', where W' represents the width of the convolution kernel to be processed, H' represents the height of the convolution kernel to be processed, N' represents the number of channels of the convolution kernel to be processed, and M' represents the number of groups of convolution kernels to be processed;
if M' > M, the convolution kernels are split into a plurality of sets of M groups each and distributed to a plurality of acceleration kernel modules for operation;
or, if M' > M, the weight cache unit stores the weight data of the M' groups of convolution kernels for one point using consecutive addresses.
Compared with the prior art, the hardware accelerating device for the convolutional neural network calculation optimization has the following beneficial effects:
(1) The method and the device retain a simple pipelined and parallel hardware structure, reduce the amount of computation by removing the zero values of the input feature map, and improve hardware acceleration performance.
(2) The method and the device keep the original structure of the convolutional neural network algorithm, require no extra optimization of the algorithm to reduce the amount of computation, avoid irregular network operation, and are suitable for hardware acceleration of a wide range of convolutional neural network algorithms.
Drawings
FIG. 1 is a schematic structural diagram of a hardware acceleration apparatus for convolutional neural network computational optimization according to the present application;
FIG. 2 is a schematic diagram illustrating a storage manner of an input feature map according to the present application;
FIG. 3 is a schematic diagram of the input feature map zero-removal process of the present application;
FIG. 4 is a schematic diagram of a convolution kernel weight data storage method according to the present application;
FIG. 5 is a schematic diagram of the convolution multiply-accumulate operation of the present application;
FIG. 6 is a schematic diagram of the input/output feature image storage method according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In one embodiment, a hardware acceleration device for convolutional neural network computation optimization is provided, which addresses the problems that conventional convolutional neural networks consume large amounts of computing resources and that general-purpose processors implement them inefficiently.
As shown in fig. 1, the hardware acceleration apparatus for convolutional neural network computation optimization of this embodiment includes a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules, and each acceleration kernel module includes an input image buffer unit, a weight buffer unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image buffer unit.
In the convolution calculation process, the application of each module unit is as follows:
the parameter storage module is used for caching the convolutional neural network to be accelerated and the corresponding convolutional kernel. In practical application, the parameter storage module actually stores compiled network layer information, for example, a YOLO series or a MobileNet series is stored as a neural network to be accelerated and optimized in the convolution operation process.
The scheduling control module is used for controlling load-balanced computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules, and distributing the input feature map data to be processed (i.e., a new operation request) to an idle acceleration kernel module. In this embodiment, the scheduling control module supervises the working status of all the acceleration kernel modules, which effectively improves the utilization of each acceleration kernel module and reduces computation waiting time. It is easy to understand that detecting whether an acceleration kernel module is idle is a conventional detection means and may be determined by an identifier or by a state, which is not described herein again.
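As an illustration only, the dispatch behaviour described above can be modelled in software as follows. This is a minimal Python sketch of the idle-detection and distribution idea, not the patented hardware; the names (AcceleratorKernel, dispatch, busy) are assumptions introduced here for the example.

```python
from collections import deque

class AcceleratorKernel:
    """Software stand-in for one acceleration kernel module."""
    def __init__(self, kernel_id):
        self.kernel_id = kernel_id
        self.busy = False              # idle/busy flag polled by the scheduler

    def start(self, feature_map_tile):
        self.busy = True               # begin the convolution for this tile
        # ... multiply-accumulate work would run here ...

def dispatch(kernels, pending_tiles):
    """Assign each pending input-feature-map tile to the first idle kernel module."""
    queue = deque(pending_tiles)
    while queue:
        idle = next((k for k in kernels if not k.busy), None)
        if idle is None:
            break                      # all kernel modules busy for now
        idle.start(queue.popleft())
    return list(queue)                 # tiles still waiting for an idle module
```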
The input image cache unit is used for receiving and caching the input feature map data input into the acceleration kernel module, and the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module.
The zero-removing processing unit is used for removing zero values in the input feature map data.
The multiply-accumulate operation array unit is used for the multiply-accumulate operation between the weight data in the convolution kernel and the input feature map data after zero removal, and outputs the convolution operation result.
The rectified linear unit (i.e., the ReLU activation function) is used for correcting the negative numbers in the convolution operation result to zero values to obtain a correction result. In this embodiment, applying the ReLU activation function after each layer of convolution turns all negative convolution results into zero, which makes it convenient to remove the zeros from the input feature map data of each layer and greatly reduces the amount of computation of the algorithm.
The output image cache unit is used for caching the correction result as output feature map data, and the output feature map data serves as the input feature map data of the next layer of convolution operation.
The hardware acceleration device for convolutional neural network computation optimization retains the pipelined and parallel original structure of the convolutional neural network algorithm, and, in combination with the parameter storage module, can switch between and store multiple convolutional neural networks, so it is compatible with the accelerated computation of various convolutional neural networks. Meanwhile, by exploiting the property of the ReLU activation function that negative values become zero, the negative values in the output feature map data of the current layer are set to zero, so that the zero data can be removed when the next layer performs its convolution operation, which reduces the amount of computation and improves hardware acceleration performance.
During the whole convolution calculation process, the input feature map data and the convolution kernels are obtained through an input bus, and the final calculation result data is output through an output bus after the convolution calculation is finished. Transmitting the data over buses provides high stability and integrity of the data transfer.
It is easy to understand that if the whole convolution calculation is not yet complete after the current layer, the output feature map data of the current layer is used as the input feature map data of the next layer of convolution operation; if the whole convolution calculation is complete after the current layer, the output feature map data of the current layer is output through the output bus.
In order to improve the accuracy of the convolution operation, in another embodiment the hardware acceleration apparatus further includes a bias (offset) cache unit, which is configured to cache the bias data and provide bias compensation for the convolution calculation. Since the weight data must be consumed in the same order as the input feature map data, the zero-removal processing unit also has to control the read address of the weight cache unit. If the hardware acceleration device includes a bias cache unit, the zero-removal processing unit also needs to control the read address of the bias cache unit. Such read-address control is a conventional technique in convolution operation and is not described in detail in this embodiment.
Specifically, in order to keep the hardware structure simple, each acceleration kernel module has a maximum data size that can be processed at one time, and this maximum data size is also the maximum storage size of the input image buffer unit and the weight buffer unit. In one embodiment, the maximum data amount that can be directly processed by the multiply-accumulate operation performed by each acceleration kernel module is set as: a convolution operation between an input feature map of size C × R × N and a convolution kernel of size W × H × N × M; where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of groups of convolution kernels.
As shown in fig. 2, when the input feature map and the weight data are obtained and stored, the input image buffer unit is a first Random Access Memory (RAM) for buffering the input feature map data, the first random access memory has C × R address spaces in total, and each address space in the first random access memory stores data of N channels of one pixel.
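For illustration, this pixel-major layout of the first RAM can be mimicked in Python as follows. It is a sketch under the assumption that address r × C + c holds the N channel values of pixel (c, r); that ordering is consistent with Fig. 2, but the exact address formula is an assumption of this example.

```python
import numpy as np

C, R, N = 20, 20, 32                           # illustrative maximum tile size

def pack_input_feature_map(fmap):
    """fmap has shape (R, C, N); returns a (C*R, N) array, one address per pixel."""
    assert fmap.shape == (R, C, N)
    ram = np.zeros((C * R, N), dtype=fmap.dtype)
    for r in range(R):
        for c in range(C):
            ram[r * C + c, :] = fmap[r, c, :]  # all N channels of one pixel in one address
    return ram
```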
As shown in fig. 3, when removing the zero values in the input feature map, the zero-removal processing unit processes each address space of the input image cache unit in turn, removing the zeros and compacting the original N data into L data, where N is greater than or equal to L.
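A minimal sketch of this zero-removal step follows. It assumes that each surviving value keeps its channel index so that the matching weight can still be addressed; that pairing is implied by the multiply-accumulate step in Example 1 below, but it is an assumption of this example rather than an explicit statement of the text.

```python
def remove_zeros(pixel_channels):
    """Compact the N channel values of one address into L non-zero (channel, value) pairs."""
    pairs = [(ch, v) for ch, v in enumerate(pixel_channels) if v != 0]
    return pairs                                   # len(pairs) = L <= N

# Example: 8 channel values of one pixel, 3 of which survive zero removal.
print(remove_zeros([0, 5, 0, 0, 2, 0, 7, 0]))      # [(1, 5), (4, 2), (6, 7)]
```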
As shown in fig. 4, the weight buffer unit is a second Random Access Memory (RAM) for buffering weight data, the second random access memory has W × H × N address spaces in total, and each address space in the second random access memory stores weight data of M sets of convolution kernels of one point.
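Analogously, the weight RAM layout of Fig. 4 might be modelled as below; the linear address order (w, then h, then n) is an assumption of this sketch, not prescribed by the text.

```python
import numpy as np

W, H, N, M = 3, 3, 32, 16                  # illustrative kernel shape and group count

def pack_weights(weights):
    """weights has shape (W, H, N, M); returns (W*H*N, M): M group weights per address."""
    assert weights.shape == (W, H, N, M)
    return weights.reshape(W * H * N, M)   # address = (w*H + h)*N + n
```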
In order to ensure that the multiply-accumulate operation array can perform parallel calculation at the single maximum data quantity, in one embodiment the multiply-accumulate operation array unit comprises M parallel MAC units (multiplier-accumulators), and each MAC unit implements the multiply-accumulate operation of the input feature map data with the weight data of one group of convolution kernels, which guarantees the maximum parallelism of the convolution algorithm and significantly improves operation efficiency.
The working flow of the hardware acceleration device for calculating and optimizing the convolutional neural network of the present application is further described by the following embodiments.
Example 1
When convolution calculation is performed, an acceleration kernel module that is ready to work is selected according to the idle status of the acceleration kernels.
An input feature image of size C × R × N is acquired from the input bus, where C represents the width of the image, R represents the height of the image and N represents the number of channels of the image, and each address space in the input image cache unit RAM stores the data of the N channels of one pixel point.
Convolution kernels of size W × H × N × M are acquired from the input bus, where W represents the width of the convolution kernels, H represents the height of the convolution kernels, N represents the number of channels of the convolution kernels and M represents the number of groups of convolution kernels, and each address space in the weight cache unit RAM stores the weight data of the M groups of convolution kernels for one point.
N data in parallel (i.e., the data in one address space) are read from the input image buffer unit, and the zero-removal processing unit removes the zero data and serializes the remainder into L data.
As shown in fig. 5, in each cycle the multiply-accumulate operation array unit takes one of the L data as the data to be calculated, takes the weight data of the same channel number from each of the M groups of convolution kernels, and performs the multiply-accumulate operation; the multiply-accumulate operation array consists of M parallel MAC units, and each MAC unit implements the multiply-accumulate operation of one group of convolution kernels with the input feature map data. After W × H × L cycles, the output feature map data of the M channels of one pixel point is output.
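Purely as an illustration of this cycle-level behaviour, the following Python sketch reproduces the W × H × L schedule with M parallel accumulators. It is a functional model, not the hardware: stride 1 and no padding are assumed, and the (channel, value) pairing comes from the zero-removal sketch above.

```python
import numpy as np

def convolve_one_pixel(fmap, weights, out_r, out_c):
    """
    fmap:    (R, C, N) input feature map tile (already rectified, so zeros are common)
    weights: (W, H, N, M) convolution kernels, M groups
    Returns the M output channels of one output pixel, skipping zero-valued inputs.
    """
    W, H, N, M = weights.shape
    acc = np.zeros(M)                                # M parallel MAC accumulators
    for w in range(W):
        for h in range(H):
            pixel = fmap[out_r + h, out_c + w, :]    # N channels of one input pixel
            nonzero = [(n, v) for n, v in enumerate(pixel) if v != 0]  # de-zero: L <= N
            for n, v in nonzero:                     # one cycle per surviving value
                acc += v * weights[w, h, n, :]       # the M MAC units work in parallel
    return np.maximum(acc, 0)                        # ReLU: negatives corrected to zero
```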
As shown in fig. 6, the output image buffer unit is also a RAM (a third random access memory); each address space in the third random access memory stores the data of the M channels of one pixel. The output feature map data is cached there, the negative values in the output feature map data are changed to zero by the ReLU activation function, one layer of convolution operation is thereby completed, and the result is output through the output bus.
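Mirroring the input-side layout, a small sketch of this output path (one address per output pixel holding M channels, with ReLU applied before write-back) is given below; the function name and argument order are illustrative assumptions only.

```python
import numpy as np

def store_output_pixel(out_ram, out_r, out_c, mac_result, out_width):
    """Write the M-channel result of one pixel to address r*out_width + c after ReLU."""
    rectified = np.maximum(mac_result, 0)            # ReLU: negative values become zero
    out_ram[out_r * out_width + out_c, :] = rectified
    return rectified
```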
The above is the calculation process of the hardware acceleration apparatus of this embodiment for a single maximum calculation amount. It should be noted that if the input data exceeds the single maximum calculation amount, it needs to be preprocessed before the convolution operation:
for the input feature map: if the size of the input feature map to be processed is C '. R'. N ', wherein C' represents the width of the image to be processed, R 'represents the height of the image to be processed, and N' represents the number of channels of the image to be processed; and if the N '> N is obtained, the input image cache unit stores the data of N' channels of one pixel point by using a plurality of continuous address spaces, and adopts N '/N (integer) accelerating kernel modules to operate the input characteristic diagram, and when the N'/N is calculated, the integer is obtained by adding 1 to the quotient if the remainder is not zero.
For example, if the maximum storable input feature map size C × R × N of the input image cache unit is 20 × 20 × 32, then when N' is at most 32, C × R points, i.e. 20 × 20 points, can be stored; when N' is greater than 32 and at most 64, C × R × N / N' points, i.e. 20 × 20 / 2 points, can be stored, and so on for other values of N'.
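The rounding rule used here is ordinary ceiling division; the short sketch below reuses the 20 × 20 × 32 figures of this example as assumptions.

```python
def ceil_div(a, b):
    """N'/N rounded up: add 1 to the quotient when the remainder is non-zero."""
    return a // b + (1 if a % b else 0)

C, R, N = 20, 20, 32                  # maximum tile the input image cache can hold
for n_prime in (32, 48, 64):
    spaces_per_pixel = ceil_div(n_prime, N)
    storable_pixels = (C * R) // spaces_per_pixel
    print(n_prime, spaces_per_pixel, storable_pixels)   # 400, 200 and 200 pixels
```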
If C' × R' > C × R, the input feature image is split into a plurality of blocks of size C × R × N and distributed to a plurality of acceleration kernel modules for operation. When splitting the input feature image, if C' × R' is not an integer multiple of C × R, the outer edge of the C' × R' image may be zero-padded first and then split.
For example, when C' × R' is 6 × 8 and C × R is 2 × 3, the outer edge of the C' × R' image needs to be zero-padded to a size of 6 × 9 before the splitting is performed. It should be noted that, in practical applications, to be rigorous the overlap of block borders may need to be considered when splitting the input feature image.
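A minimal sketch of this pad-then-tile step is shown below (channels and border overlap are ignored, as the text itself notes the overlap may need separate handling); the function name is an assumption of the example.

```python
import numpy as np

def split_into_tiles(fmap, tile_c, tile_r):
    """Zero-pad a (height, width) map up to multiples of (tile_r, tile_c), then tile it."""
    height, width = fmap.shape
    pad_r = (-height) % tile_r                 # zero padding on the outer edge
    pad_c = (-width) % tile_c
    padded = np.pad(fmap, ((0, pad_r), (0, pad_c)))
    return [padded[r:r + tile_r, c:c + tile_c]
            for r in range(0, padded.shape[0], tile_r)
            for c in range(0, padded.shape[1], tile_c)]

# The 6 x 8 example from the text with 2 x 3 tiles: padded to 6 x 9, giving 9 blocks.
print(len(split_into_tiles(np.ones((8, 6)), tile_c=2, tile_r=3)))   # 9
```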
The splitting of the input feature image can be completed by a splitting module on the CPU side; after the splitting module performs the splitting, each data block is sent to the scheduling control module for distribution, and address jumping is supported when reading the external data. It should be noted that the splitting module may be a part of the hardware acceleration device of this embodiment, or an external module connected to the hardware acceleration device of this embodiment. Of course, the splitting of the input feature image can also be completed by the scheduling control module.
For the convolution kernel: if the size of the convolution kernel to be processed is W' × H' × N' × M', where W' represents the width of the convolution kernel to be processed, H' represents the height of the convolution kernel to be processed, N' represents the number of channels of the convolution kernel to be processed, and M' represents the number of groups of convolution kernels to be processed; if M' > M, the convolution kernels can be split into a plurality of sets of M groups each and distributed to a plurality of acceleration kernel modules for operation, and this splitting can likewise be done by the same or a similar splitting module or by the parameter storage module; or, if M' > M, the weight cache unit uses a plurality of consecutive addresses to store the weight data of the M' groups of convolution kernels for one point, which is handled similarly to the case of the input feature map with N' > N and is not described in detail.
After the input feature map data or the convolution kernels have been split, each acceleration kernel module, after processing its operation request, writes its output feature map to addresses in an external SDRAM cache, and the splicing is realized in the external SDRAM cache. The addresses jump for each row of output data belonging to each acceleration kernel module. For example, if the input feature image is split into a left half and a right half, the two outputs are each 10 × 20 (the number of channels is not considered for the moment): addresses 0-9 store the data of acceleration kernel module 1, addresses 10-19 store the data of acceleration kernel module 2, addresses 20-29 store the data of acceleration kernel module 1, addresses 30-39 store the data of acceleration kernel module 2, and so on.
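The interleaved address pattern of this example can be written out as a small sketch; the helper name and parameters are illustrative assumptions, not part of the described hardware.

```python
def output_row_base_address(kernel_index, row, row_width=10, num_kernels=2):
    """
    First SDRAM address of `row` produced by kernel module `kernel_index` (0 or 1),
    following the example: 0-9 -> module 1 row 0, 10-19 -> module 2 row 0,
    20-29 -> module 1 row 1, 30-39 -> module 2 row 1, and so on.
    """
    return (row * num_kernels + kernel_index) * row_width

for row in range(2):
    for k in range(2):
        print(k + 1, row, output_row_base_address(k, row))   # 0, 10, 20, 30
```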
This embodiment adopts a parallel computing structure and at the same time skips the multiply-accumulate operations whose input image data is zero, thereby accelerating the neural network operation.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.
Claims (7)
1. A hardware acceleration device for convolutional neural network computation optimization, comprising a parameter storage module, a scheduling control module and a plurality of acceleration kernel modules, each acceleration kernel module comprising an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit and an output image cache unit, characterized in that:
the parameter storage module is used for caching the convolutional neural network to be accelerated and a convolution kernel corresponding to the convolutional neural network;
the scheduling control module is used for controlling load-balanced computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules and distributing the input feature map data to be processed to an idle acceleration kernel module;
the input image cache unit is used for receiving and caching the input feature map data input into the acceleration kernel module;
the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module;
the zero-removal processing unit is used for removing zero values in the input feature map data;
the multiply-accumulate operation array unit is used for the multiply-accumulate operation between the weight data in the convolution kernel and the input feature map data after zero removal, and for outputting a convolution operation result;
the rectified linear unit is used for correcting the negative numbers in the convolution operation result to zero values to obtain a correction result;
and the output image cache unit is used for caching the correction result as output feature map data, and the output feature map data serves as the input feature map data of the next layer of convolution operation.
2. The hardware acceleration device for convolutional neural network computational optimization of claim 1, wherein the maximum data amount that can be directly processed by the multiply-accumulate operation performed by the acceleration kernel module at one time is: a convolution operation between an input feature map of size C × R × N and a convolution kernel of size W × H × N × M; where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of groups of convolution kernels.
3. The hardware accelerator of convolutional neural network computational optimization of claim 2, wherein the input image buffer unit is a first random access memory for buffering input feature map data, the first random access memory has C × R address spaces in total, and each address space in the first random access memory stores data of N channels of one pixel.
4. The hardware acceleration device for convolutional neural network computational optimization of claim 2, wherein the weight caching unit is a second random access memory for caching weight data, the second random access memory has W × H × N address spaces in total, and each address space in the second random access memory stores weight data of M sets of convolutional kernels of one point.
5. The hardware acceleration apparatus for convolutional neural network computational optimization of claim 2, wherein said multiply-accumulate array unit comprises M parallel MAC units, each MAC unit implementing a multiply-accumulate operation of the input profile data and the weight data of a set of convolutional kernels.
6. The hardware acceleration apparatus for convolutional neural network computational optimization of claim 3, wherein if the size of the input feature map to be processed is C' × R' × N', where C' represents the width of the image to be processed, R' represents the height of the image to be processed, and N' represents the number of channels of the image to be processed;
if N' > N, the input image cache unit uses a plurality of continuous address spaces to store the data of the N' channels of one pixel point; and if C' × R' > C × R, the input feature map to be processed is split into a plurality of blocks of size C × R × N and distributed to a plurality of acceleration kernel modules for operation.
7. The hardware acceleration device for convolutional neural network computational optimization of claim 4, wherein in the parameter storage module, if the size of the convolution kernel to be processed is W' × H' × N' × M', where W' represents the width of the convolution kernel to be processed, H' represents the height of the convolution kernel to be processed, N' represents the number of channels of the convolution kernel to be processed, and M' represents the number of groups of convolution kernels to be processed;
if M' > M, splitting the convolution kernel into a plurality of M groups of convolution kernels, and distributing the convolution kernels to a plurality of accelerated kernel modules for operation;
or, if M '> M, the weight cache unit stores the weight data of the M' sets of convolution kernels for one point using consecutive addresses.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202011279360.1A | 2020-11-16 | 2020-11-16 | Hardware accelerator for convolution neural network calculation optimization
Publications (2)
Publication Number | Publication Date
---|---
CN112465110A | 2021-03-09
CN112465110B | 2022-09-13
Family
ID=74836284
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant