CN106951961B - Coarse-grained reconfigurable convolutional neural network accelerator and system - Google Patents
Coarse-grained reconfigurable convolutional neural network accelerator and system
- Publication number: CN106951961B
- Application number: CN201710104029.8A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/063 — Computing arrangements based on biological models; neural networks; physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06T1/20 — General purpose image data processing; processor architectures; processor configuration, e.g. pipelining
- G06T1/60 — General purpose image data processing; memory management
- G06T2200/28 — Indexing scheme for image data processing or generation, in general, involving image processing hardware
Abstract
The present invention provides a coarse-grained reconfigurable convolutional neural network accelerator and system. The accelerator comprises multiple processing unit clusters; each processing unit cluster comprises several basic computational elements connected through a sub-addition unit, and the sub-addition units of the clusters are in turn connected to a mother addition unit. Each sub-addition unit generates the partial sum of its adjacent basic computational elements, and the mother addition unit accumulates the outputs of the sub-addition units. Through coarse-grained reconfiguration, different weight and picture tracks are linked by SRAM-controlled or other interconnect units to realize different convolution kernel processing structures, so that networks and convolution kernels of different sizes are supported efficiently while the reconfiguration overhead is greatly reduced.
Description
Technical field
The present invention relates to the field of energy-efficient hardware accelerator design, and more particularly to a coarse-grained reconfigurable convolutional neural network accelerator and system.
Background art
A convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding cells within a limited coverage area; CNNs perform outstandingly on large-scale image processing. They have become the most common algorithms in fields such as image recognition and speech recognition, but such methods require a very large amount of computation and therefore call for dedicated accelerators. CNNs also have good application prospects on mobile devices. However, because mobile devices are resource-constrained, the accelerators currently designed on GPU and FPGA (Field Programmable Gate Array) platforms are difficult to deploy on these low-power, resource-constrained platforms.
Since convolutional neural networks come with network structures and convolution kernels of many sizes, a dedicated convolutional network accelerator should support these different network and kernel sizes efficiently. To support the diversity of convolutional networks, traditional accelerators generally fall into two classes. The first class comprises instruction-driven accelerators, which decompose the computation of different convolution kernels into individual instructions and fetch the correct weight data and image data at each point in time; this approach needs a large amount of on-chip bandwidth and on-chip storage. It is comparatively efficient when processing small networks, but when processing large networks the weight data cannot be stored entirely on chip, so energy efficiency degrades severely. The second class supports networks and convolution kernels of different sizes by means of fine-grained reconfigurable circuits, for example by reconfiguring a network-on-chip: each processing unit is assigned an address, and data is sent to the corresponding address each time. Although this approach is more efficient than instruction-driven accelerators when handling different convolutional neural networks, the fine-grained reconfigurable circuitry brings considerable extra energy and reconfiguration overhead.
In the field of large-scale computing, reconfigurable systems are a current research hotspot in computer architecture: they combine the flexibility of general-purpose processors with the efficiency of ASICs (Application-Specific Integrated Circuits), making them a fairly ideal solution for large-scale computing. Traditional DSPs (Digital Signal Processors) suffer from low arithmetic speed, non-reconfigurable hardware structures, long development and upgrade cycles, and poor portability, and these disadvantages become even more obvious in large-scale computing. ASICs have considerable advantages in performance, area, and power consumption, but the complexity of changeable and rapidly growing application demands makes ASIC design and verification difficult and the development cycle long, so they hardly meet the requirement of rapid product deployment. Among programmable logic devices, the Virtex-6 FPGA series from Xilinx achieves a performance of more than 1000 GMACS (1×10¹² multiply-accumulate operations per second) using DSP48E1 slices at 600 MHz; but for large-scale computing the circuit scale to be configured is too large, synthesis and configuration take too long, and the actual operating frequency is not high, making it difficult to remain high-performance while pursuing flexibility and low power consumption.
Therefore, a dedicated low-power, energy-efficient accelerator architecture is urgently needed to serve low-power mobile devices.
Summary of the invention
The present invention provides a coarse-grained reconfigurable convolutional neural network accelerator and system that overcome, or at least partially solve, the above problems. Through coarse-grained reconfiguration, different weight and picture tracks are linked via SRAM (Static Random Access Memory) or other interconnect units to realize different convolution kernel processing structures, so that networks and convolution kernels of different sizes can be supported efficiently while the reconfiguration overhead is greatly reduced.
According to one aspect of the present invention, a coarse-grained reconfigurable convolutional neural network accelerator is provided, comprising multiple processing unit clusters. Each processing unit cluster comprises several basic computational elements connected through a sub-addition unit, and the sub-addition units of the clusters are connected to a mother addition unit. Each sub-addition unit generates the partial sum of its adjacent basic computational elements, and the mother addition unit accumulates the outputs of the sub-addition units.
Preferably, the basic computational element is a 3×3 convolution unit.
Preferably, there are 4 processing unit clusters, arranged in a 2×2 matrix; each processing unit cluster comprises 4 basic computational elements, likewise arranged in a 2×2 matrix.
Preferably, each basic computational element comprises 9 multipliers arranged in a 3×3 (nine-square) grid and 1 adder; the input registers of the 3 multipliers in the same column form a shift register.
Preferably, in each row of a processing unit cluster's matrix, adjacent basic computational elements are connected to a weight track through a weight interconnect unit, and in each column two adjacent basic computational elements are connected to a picture track through an image interconnect unit.
The weight interconnect unit connects each basic computational element to the weight track; under SRAM-controlled selection, it selects weight data from the weight track for each basic computational element.
The image interconnect unit connects basic computational elements to image data; under the control of the SRAM, it selects 3 consecutive data items from the set formed by the picture track and the output of the preceding basic computational element.
Preferably, multipliers and adders in each processing unit cluster are shut down when not in use, and the sub-addition units and the mother addition unit are powered off when not in use.
According to another aspect of the present invention, a coarse-grained reconfigurable convolutional neural network acceleration system is provided, comprising several such convolutional neural network accelerators operating in parallel.
The present application proposes a coarse-grained reconfigurable convolutional neural network accelerator and system. Through coarse-grained reconfiguration, different weight and picture tracks are linked by SRAM or other interconnect units to realize different convolution kernel processing structures, so that networks and convolution kernels of different sizes are supported efficiently while the reconfiguration overhead is greatly reduced. With this coarse-grained reconfigurable accelerator hardware architecture, different networks can be supported at a small reconfiguration cost: a computing unit that efficiently supports coarse-grained reconfiguration, an interconnect architecture supporting coarse-grained reconfiguration, and a mechanism for composing large convolution kernels from small ones are designed. Compared with a traditional reconfigurable FPGA, the reconfiguration speed is improved by a factor of 10⁵ and the energy efficiency reaches 18.8 times; compared with a traditional fine-grained reconfigurable ASIC, the reconfiguration time is reduced by 81.0% and the average energy efficiency is improved by 80.0%.
Detailed description of the invention
Fig. 1 is a schematic diagram of the structure of the coarse-grained reconfigurable convolutional neural network accelerator according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the operating modes in which the coarse-grained configuration supports convolution kernels of different sizes according to an embodiment of the present invention;
Fig. 3 is an equivalent circuit schematic of the accelerator architecture configured in 5×5 mode according to an embodiment of the present invention;
Fig. 4 is a schematic efficiency comparison of a fine-grained reconfigurable ASIC accelerator, a traditional reconfigurable FPGA, and the coarse-grained reconfigurable convolutional neural network accelerator of the present invention.
Specific embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
Fig. 1 shows a coarse-grained reconfigurable convolutional neural network accelerator comprising multiple processing unit clusters. Each processing unit cluster comprises several basic computational elements connected through a sub-addition unit (ADDB1-ADDB4 in Fig. 1). The sub-addition units of the clusters are connected to a mother addition unit (ADDB0 in Fig. 1); the sub-addition units and the mother addition unit have the same structure. Each sub-addition unit generates the partial sum of its adjacent basic computational elements, and the mother addition unit accumulates the outputs of the sub-addition units.
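The partial-sum hierarchy described above can be sketched behaviourally in a few lines. This is a minimal Python model with illustrative names (the real units are hardware adders), and the 16-element/4-cluster sizing follows the preferred embodiment of Fig. 1:

```python
# Behavioural sketch of the partial-sum tree: 16 basic computational
# elements -> 4 sub-addition units (ADDB1-ADDB4) -> 1 mother unit (ADDB0).
def basic_element(pixels, weights):
    """One basic computational element: 9 multipliers feeding one adder."""
    return sum(p * w for p, w in zip(pixels, weights))

def sub_addition_unit(element_outputs):
    """ADDB1-ADDB4: partial sum of the adjacent elements of one cluster."""
    return sum(element_outputs)

def mother_addition_unit(cluster_partials):
    """ADDB0: accumulates the outputs of the sub-addition units."""
    return sum(cluster_partials)

# Example: all-ones pixels and weights -> each element yields 9,
# each cluster partial sum is 36, and the mother addition unit yields 144.
clusters = [[basic_element([1.0] * 9, [1.0] * 9) for _ in range(4)]
            for _ in range(4)]
total = mother_addition_unit([sub_addition_unit(c) for c in clusters])
```

The two-level tree is what makes the coarse-grained composition possible: a sub-addition unit alone finishes kernels that fit in one cluster, while the mother unit combines clusters for larger kernels.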
Granularity refers to the operand bit width of a system's reconfigurable components (or reconfigurable processing units); the granularity of arithmetic units is classified as fine-grained, coarse-grained, or mixed-grained. In this embodiment, the basic computational element is a 3×3 convolution unit, 3×3 being the most common convolution kernel in neural networks. Since fine-grained reconfigurability brings large chip-area and power overheads, the present invention proposes an accelerator architecture that is specially optimized for 3×3 convolution kernels while supporting other kernel types through coarse-grained reconfiguration. Because the accelerator is specially optimized for 3×3, it processes 3×3 kernels efficiently, and because 3×3 kernels account for a large proportion of common neural networks, this markedly improves efficiency. Coarse-grained reconfiguration then combines these 3×3 convolution units into larger kernels, so other kernel sizes are supported without losing much performance while the reconfiguration overhead is substantially reduced.
In this embodiment there are 4 processing unit clusters arranged in a matrix, each comprising 4 basic computational elements: as shown in Fig. 1, NE11, NE12, NE21, NE22 and sub-addition unit ADDB1 form the first processing unit cluster; NE13, NE14, NE23, NE24 and ADDB2 form the second; NE31, NE32, NE41, NE42 and ADDB3 form the third; and NE33, NE34, NE43, NE44 and ADDB4 form the fourth. The 4 basic computational elements within each cluster are arranged in a 2×2 matrix. As shown in Fig. 1(e), a sub-addition unit has four inputs (inputs 0-3 in Fig. 1) and a buffer; the four inputs are connected to the first, second, third, and fourth processing unit clusters respectively, and the buffer serves as the sub-addition unit's output (the adder output in the figure).
Preferably, each basic computational element comprises 9 multipliers (MUL) arranged in a 3×3 grid and 1 adder (ADD); the 9 multipliers and the adder can be shut down when not in use to save power. The input registers of the three multipliers in the same column form a shift register, so image data can move downward from the top. In addition, each basic computational element has an output port through which image data can be moved out of the unit.
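The shift-register behaviour of one basic computational element can be sketched as follows (class and method names are illustrative, not from the patent):

```python
class Basic3x3Element:
    """Sketch of a basic computational element: 9 multipliers in a 3x3
    grid plus one adder. The three input registers of each column form a
    shift register, so image data moves downward row by row and two of
    the three held rows are reused on every step."""

    def __init__(self, weights):
        self.w = weights                            # 3x3 weight grid
        self.rows = [[0.0] * 3 for _ in range(3)]   # input registers

    def shift_in(self, new_row):
        """Push a 3-pixel row in from the top; the bottom row leaves
        through the element's output port (usable by the unit below)."""
        out_row = self.rows.pop()
        self.rows.insert(0, new_row)
        return out_row

    def mac(self):
        """9 multiplications accumulated by the adder."""
        return sum(self.rows[i][j] * self.w[i][j]
                   for i in range(3) for j in range(3))
```

After three rows have been shifted in, the element holds a full 3×3 window and `mac()` yields one convolution output; each further shift reuses two of the three rows already held, which is the vertical data reuse the shift registers provide.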
As shown in Fig. 1(d), in each row of a processing unit cluster's matrix, adjacent basic computational elements are connected to a weight track through a weight interconnect unit (FC), and in each column two adjacent basic computational elements are connected to a picture track through an image interconnect unit (IC).
The weight interconnect unit FC connects each basic computational element to the weight track; under the control of an SRAM (Static Random Access Memory), it selects weight data from the weight track for each basic computational element.
The image interconnect unit connects basic computational elements to the picture track. Since each basic computational element has three columns, the image interconnect unit selects, under the control of the SRAM, three consecutive data items from the set formed by the picture track and the image output of the preceding basic computational element. When the chip needs to be reconfigured, it suffices to load data into the configuration SRAM to complete the reconfiguration.
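The selection performed by the image interconnect unit amounts to a multiplexer. The sketch below uses an assumed function name, and the index `sel` stands in for the bits that the configuration SRAM would hold in hardware:

```python
def image_interconnect(picture_track, prev_element_output, sel):
    """Pick 3 consecutive data items from the combined set of the
    picture track and the preceding element's output, as selected by
    the configuration SRAM (modelled here by the index `sel`)."""
    candidates = list(picture_track) + list(prev_element_output)
    assert 0 <= sel <= len(candidates) - 3
    return candidates[sel:sel + 3]
```

For example, `image_interconnect([10, 11, 12], [20, 21, 22], 2)` selects a window that straddles the picture track and the previous element's output, which is what lets adjacent 3×3 elements be chained into a taller kernel.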
Preferably, multipliers and adders in each processing unit cluster are shut down when not in use, and the sub-addition units and the mother addition unit are powered off when not in use, to save power.
Fig. 2 shows the operating modes in which the coarse-grained configuration supports convolution kernels of different sizes. The present invention supports kernel sizes from 1×1 to 12×12: the array can be configured to process 16 kernels of size 1×1 to 3×3, 4 kernels of size 4×4 to 6×6, or 1 kernel of size 7×7 to 12×12. A 5×5 kernel, for example, is formed from 4 basic computational elements and one sub-addition unit; in three of the four basic computational elements some of the multipliers can be powered down, which both realizes the 5×5 kernel size and saves power.
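The 5×5 mode can be checked numerically: the 5×5 kernel is tiled onto four 3×3 elements, zero-padding in the tiles models the powered-down multipliers, and the sub-addition unit sums the four partial results. This NumPy sketch (function names are ours, not the patent's) reproduces a direct 5×5 convolution:

```python
import numpy as np

def conv2d_valid(img, k):
    """Reference 'valid' 2-D convolution (correlation form, as in CNNs)."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def conv5x5_from_3x3_elements(img, k5):
    """Map a 5x5 kernel onto four 3x3 elements at tile offsets (0,0),
    (0,3), (3,0), (3,3). Tiles hanging over the kernel edge are zero-
    padded: the zero taps model the powered-down multipliers, so three
    of the four elements run with part of their multipliers off."""
    oh, ow = img.shape[0] - 4, img.shape[1] - 4
    imgp = np.pad(img, ((0, 1), (0, 1)))   # guard row/col read only by zero taps
    acc = np.zeros((oh, ow))               # accumulated by the sub-addition unit
    for di, dj in [(0, 0), (0, 3), (3, 0), (3, 3)]:
        tile = np.zeros((3, 3))
        sub = k5[di:di + 3, dj:dj + 3]     # 3x3, 3x2, 2x3 or 2x2 piece
        tile[:sub.shape[0], :sub.shape[1]] = sub
        acc += conv2d_valid(imgp[di:di + oh + 2, dj:dj + ow + 2], tile)
    return acc
```

Of the 36 available multipliers, only the 25 carrying non-padded kernel taps contribute, matching the power-gating described above.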
Fig. 3 shows the equivalent circuit of the accelerator architecture configured in 5×5 mode. Through coarse-grained reconfiguration, taking the 5×5 kernel as an example, the accelerator forms an efficient operating structure that exploits two kinds of data reuse, thereby greatly reducing data movement and improving computational efficiency. The first is reuse within the convolution kernel: for a 5×5 kernel, coarse-grained reconfiguration lets 4 pixels be reused between adjacent convolution windows without being reloaded. The second is reuse across kernels: each image datum is shared by N convolution kernels until all N kernels have been processed, and this inter-kernel reuse reduces the movement of image data. After all N kernels have been processed, the whole image shifts down by one row and the above process repeats, which simultaneously realizes data reuse within the convolution kernel in the other direction.
Fig. 4 compares the efficiency of a fine-grained reconfigurable ASIC accelerator, a traditional reconfigurable FPGA, and the coarse-grained reconfigurable convolutional neural network accelerator of the present invention when applied to the AlexNet deep convolutional network, the Clarifai network model, the Overfeat algorithm, and the VGG16 deep convolutional neural network. As can be seen from the figure, compared with a traditional reconfigurable FPGA, the present invention improves the reconfiguration speed by a factor of 10⁵ and reaches 18.8 times the energy efficiency; compared with a traditional fine-grained reconfigurable ASIC, the reconfiguration time is reduced by 81.0% and the average energy efficiency is improved by 80.0%.
This embodiment also provides a coarse-grained reconfigurable convolutional neural network acceleration system comprising several of the above convolutional neural network accelerators in parallel. Since there is no data exchange between the different units, the gain brought by this parallel architecture is linear.
Finally, the above embodiments are only preferred embodiments and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (6)
1. A coarse-grained reconfigurable convolutional neural network accelerator, characterized in that it comprises multiple processing unit clusters, each processing unit cluster comprising several basic computational elements connected through a sub-addition unit, the sub-addition units of the multiple processing unit clusters being connected to a mother addition unit; each sub-addition unit is used to generate the partial sum of its adjacent basic computational elements, and the mother addition unit is used to accumulate the sub-addition units; there are 4 processing unit clusters, arranged in a matrix, and each processing unit cluster comprises 4 basic computational elements, likewise arranged in a matrix.
2. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 1, characterized in that the basic computational element comprises a 3×3 convolution unit.
3. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 2, characterized in that each basic computational element comprises 9 multipliers arranged in a 3×3 grid and further comprises 1 adder, the input registers of the 3 multipliers in the same column forming a shift register.
4. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 1, characterized in that, in each row of a processing unit cluster's matrix, adjacent basic computational elements are connected to a weight track through a weight interconnect unit, and in each column two adjacent basic computational elements are connected to a picture track through an image interconnect unit; the weight interconnect unit connects each basic computational element to the weight track and, under SRAM-controlled selection, selects weight data from the weight track for each basic computational element; the image interconnect unit connects basic computational elements to the picture track and, under the control of the SRAM, selects 3 consecutive data items from the set formed by the picture track and the output of the preceding basic computational element.
5. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 3, characterized in that multipliers and adders in each processing unit cluster are shut down when not in use, and the sub-addition units and the mother addition unit are powered off when not in use.
6. A coarse-grained reconfigurable convolutional neural network acceleration system, characterized in that it comprises several parallel convolutional neural network accelerators according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710104029.8A (CN106951961B) | 2017-02-24 | 2017-02-24 | Coarse-grained reconfigurable convolutional neural network accelerator and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106951961A CN106951961A (en) | 2017-07-14 |
CN106951961B true CN106951961B (en) | 2019-11-26 |
Family
ID=59466600
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710104029.8A | Coarse-grained reconfigurable convolutional neural network accelerator and system (CN106951961B, active) | 2017-02-24 | 2017-02-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106951961B (en) |
Families Citing this family (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108269224B (en) * | 2017-01-04 | 2022-04-01 | 意法半导体股份有限公司 | Reconfigurable interconnect |
US10417364B2 (en) | 2017-01-04 | 2019-09-17 | Stmicroelectronics International N.V. | Tool to create a reconfigurable interconnect framework |
CN109284827A (en) * | 2017-07-19 | 2019-01-29 | 阿里巴巴集团控股有限公司 | Neural computing method, equipment, processor and computer readable storage medium |
CN107729990B (en) * | 2017-07-20 | 2021-06-08 | 上海寒武纪信息科技有限公司 | Apparatus and method for performing forward operations in support of discrete data representations |
CN107491416B (en) * | 2017-08-31 | 2020-10-23 | 中国人民解放军信息工程大学 | Reconfigurable computing structure suitable for convolution requirement of any dimension and computing scheduling method and device |
US11609623B2 (en) | 2017-09-01 | 2023-03-21 | Qualcomm Incorporated | Ultra-low power neuromorphic artificial intelligence computing accelerator |
CN108986022A (en) * | 2017-10-30 | 2018-12-11 | 上海寒武纪信息科技有限公司 | Image beautification method and related product |
CN108256628B (en) * | 2018-01-15 | 2020-05-22 | 合肥工业大学 | Convolutional neural network hardware accelerator based on multicast network-on-chip and working method thereof |
US11468302B2 (en) | 2018-03-13 | 2022-10-11 | Recogni Inc. | Efficient convolutional engine |
CN108510066B (en) * | 2018-04-08 | 2020-05-12 | 湃方科技(天津)有限责任公司 | Processor applied to convolutional neural network |
CN108805266B (en) * | 2018-05-21 | 2021-10-26 | 南京大学 | Reconfigurable CNN high-concurrency convolution accelerator |
CN110826707B (en) * | 2018-08-10 | 2023-10-31 | 北京百度网讯科技有限公司 | Acceleration method and hardware accelerator applied to convolutional neural network |
US11990137B2 (en) | 2018-09-13 | 2024-05-21 | Shanghai Cambricon Information Technology Co., Ltd. | Image retouching method and terminal device |
CN109919826B (en) * | 2019-02-02 | 2023-02-17 | 西安邮电大学 | Graph data compression method for graph computation accelerator and graph computation accelerator |
CN109949202B (en) * | 2019-02-02 | 2022-11-11 | 西安邮电大学 | Parallel graph computation accelerator structure |
CN110399883A (en) * | 2019-06-28 | 2019-11-01 | 苏州浪潮智能科技有限公司 | Image characteristic extracting method, device, equipment and computer readable storage medium |
CN111126593B (en) * | 2019-11-07 | 2023-05-05 | 复旦大学 | Reconfigurable natural language deep convolutional neural network accelerator |
US11593609B2 (en) | 2020-02-18 | 2023-02-28 | Stmicroelectronics S.R.L. | Vector quantization decoding hardware unit for real-time dynamic decompression for parameters of neural networks |
CN111340206A (en) * | 2020-02-20 | 2020-06-26 | 云南大学 | Alexnet forward network accelerator based on FPGA |
CN111325327B (en) * | 2020-03-06 | 2022-03-08 | 四川九洲电器集团有限责任公司 | Universal convolution neural network operation architecture based on embedded platform and use method |
WO2021189209A1 (en) * | 2020-03-23 | 2021-09-30 | 深圳市大疆创新科技有限公司 | Testing method and verification platform for accelerator |
CN111652361B (en) * | 2020-06-04 | 2023-09-26 | 南京博芯电子技术有限公司 | Composite granularity near storage approximate acceleration structure system and method for long-short-term memory network |
US11531873B2 (en) | 2020-06-23 | 2022-12-20 | Stmicroelectronics S.R.L. | Convolution acceleration with embedded vector decompression |
CN111610963B (en) * | 2020-06-24 | 2021-08-17 | Shanghai Westwell Information Technology Co., Ltd. | Chip structure and multiply-add computation engine thereof |
CN111860780A (en) * | 2020-07-10 | 2020-10-30 | Fengyi Technology (Shanghai) Co., Ltd. | Hardware acceleration system and computing method for convolutional neural networks with irregular convolution kernels |
CN112183732A (en) * | 2020-10-22 | 2021-01-05 | National University of Defense Technology | Convolutional neural network acceleration method and device and computer equipment |
CN112905526B (en) * | 2021-01-21 | 2022-07-08 | Beijing Institute of Technology | FPGA implementation method for multiple types of convolution |
CN112686228B (en) * | 2021-03-12 | 2021-06-01 | Shenzhen Anruan Technology Co., Ltd. | Pedestrian attribute identification method and device, electronic equipment and storage medium |
CN115576895B (en) * | 2022-11-18 | 2023-05-02 | Moore Threads Intelligent Technology (Beijing) Co., Ltd. | Computing device, computing method, and computer-readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103984560A (en) * | 2014-05-30 | 2014-08-13 | Southeast University | Large-scale coarse-grained embedded reconfigurable system and processing method thereof |
WO2015168774A1 (en) * | 2014-05-05 | 2015-11-12 | Chematria Inc. | Binding affinity prediction system and method |
CN105453021A (en) * | 2013-08-01 | 2016-03-30 | Longitude Enterprise Flash S.a.r.l. | Systems and methods for atomic storage operations |
CN105488565A (en) * | 2015-11-17 | 2016-04-13 | Institute of Computing Technology, Chinese Academy of Sciences | Computing apparatus and method for an accelerator chip accelerating deep neural network algorithms |
CN106127302A (en) * | 2016-06-23 | 2016-11-16 | Hangzhou Huawei Digital Technologies Co., Ltd. | Circuit for processing data, image processing system, and method and apparatus for processing data |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7219085B2 (en) * | 2003-12-09 | 2007-05-15 | Microsoft Corporation | System and method for accelerating and optimizing the processing of machine learning techniques using a graphics processing unit |
- 2017-02-24 CN CN201710104029.8A patent/CN106951961B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106951961A (en) | 2017-07-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106951961B (en) | Coarse-grained reconfigurable convolutional neural network accelerator and system | |
Kwon et al. | Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects | |
JP6960700B2 (en) | Multicast Network On-Chip Convolutional Neural Network Hardware Accelerator and Its Behavior | |
Chen et al. | Dadiannao: A machine-learning supercomputer | |
CN205139973U (en) | BP neural network construction based on an FPGA device | |
CN111178519A (en) | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method | |
Kim et al. | FPGA-based CNN inference accelerator synthesized from multi-threaded C software | |
Kim et al. | A highly scalable restricted Boltzmann machine FPGA implementation | |
CN109284817A (en) | Depthwise separable convolutional neural network processing architecture/method/system and medium | |
Kim et al. | A large-scale architecture for restricted boltzmann machines | |
CN109711539A (en) | Operation method, device, and related product | |
Wu et al. | Compute-efficient neural-network acceleration | |
CN109284824A (en) | Device for accelerating convolution and pooling operations based on reconfigurable technology | |
US11645225B2 (en) | Partitionable networked computer | |
Delaye et al. | Deep learning challenges and solutions with xilinx fpgas | |
Firuzan et al. | Reconfigurable network-on-chip based convolutional neural network accelerator | |
CN109767002A (en) | Neural network acceleration method based on multi-FPGA collaborative processing | |
CN105955896A (en) | Reconfigurable DBF algorithm hardware accelerator and control method | |
Kadric et al. | Kung fu data energy: minimizing communication energy in FPGA computations | |
CN112988621A (en) | Data loading device and method for tensor data | |
Sait et al. | Engineering a memetic algorithm from discrete cuckoo search and tabu search for cell assignment of hybrid nanoscale CMOL circuits | |
Ascia et al. | Networks-on-chip based deep neural networks accelerators for iot edge devices | |
Crafton et al. | Breaking barriers: Maximizing array utilization for compute in-memory fabrics | |
CN109857024A (en) | The unit performance test method and System on Chip/SoC of artificial intelligence module | |
CN109933369A (en) | The System on Chip/SoC of integrated single-instruction multiple-data stream (SIMD) framework artificial intelligence module |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||