
CN111783933A - Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation - Google Patents

Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Info

Publication number
CN111783933A
CN111783933A (application number CN201910269465.XA)
Authority
CN
China
Prior art keywords
data
input
main memory
tensor
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910269465.XA
Other languages
Chinese (zh)
Inventor
林森
何一波
李珏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinqi Technology Co ltd
Original Assignee
Beijing Xinqi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xinqi Technology Co ltd filed Critical Beijing Xinqi Technology Co ltd
Priority to CN201910269465.XA priority Critical patent/CN111783933A/en
Publication of CN111783933A publication Critical patent/CN111783933A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to a hardware circuit design and method for a data loading device combined with a main memory, used to accelerate computation of a deep convolutional neural network. The device specifically designs a cache structure comprising: input caching and control, which applies a macro block segmentation method to input from a main memory or/and other memories and realizes regional data sharing, tensor data fusion and distribution; a parallel input register array, which converts the data in the input buffer; and a tensor data loading unit, which connects the output of the input cache to the input of the parallel input register array. The design simplifies the address decoding circuit, saving area and power consumption without reducing the high data bandwidth. The hardware device and data processing method provided by the invention include a transformation method, a macro block segmentation method and an addressing method for input data, meet the requirements of algorithm acceleration with limited hardware resources, and reduce the complexity of address management.

Description

Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation
Technical Field
The invention belongs to the fields of computer hardware and hardware acceleration for deploying artificial neural network algorithms, relates to digital integrated circuit design, and in particular to a method and device for designing the input system architecture of a deep convolutional neural network hardware acceleration chip.
Background
A deep convolutional neural network algorithm consists of a number of specific neuron algorithm layers and hidden layers, the majority of which are convolutional layers, so the main operator is matrix or vector convolution. The computation task is characterised by a large input data volume and by coupling of spatial feature information in the input: the data read by each convolution usually overlaps with data that has already been computed, and the inputs are typically extracted from tensor-format data according to a certain spatial rule. The computing power required by the convolutional layers is huge, the data they require is even larger, and the memory bottleneck becomes the main constraint.
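As an illustration only (not part of the patent text), the following Python sketch shows the overlap between neighbouring convolution windows that the paragraph above refers to: for a 3×3 kernel with stride 1, interior input elements are read nine times, which is why on-chip caching and regional data sharing matter.

```python
# Illustrative sketch: count how often each input element is read by a KxK
# convolution with stride 1. Strong overlap means the same data is fetched
# many times unless it is held in a local cache.
import numpy as np

H, W, K, stride = 8, 8, 3, 1                  # assumed small feature map and kernel

reuse = np.zeros((H, W), dtype=int)           # reads per input element
for i in range(0, H - K + 1, stride):
    for j in range(0, W - K + 1, stride):
        reuse[i:i + K, j:j + K] += 1          # this output position reads the KxK patch at (i, j)

print(reuse)
print(reuse.max())                            # -> 9: interior elements are read K*K times
```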
In recent years, deploying artificial neural network algorithms on the embedded end side has become a broad demand, but in the relevant scenarios the performance and cost of the acceleration chip are the main limiting factors. Patent document 1 (publication No. CN105488565A) discloses an arithmetic device and method for an acceleration chip that accelerates deep neural network algorithms; to overcome the problem that a large number of intermediate values are generated and must be stored, increasing the required main memory space, the arithmetic device is provided with intermediate value storage areas configured as random access memories, which reduces the number of reads and writes of intermediate values to and from the main memory, reduces the energy consumption of the accelerator chip, and avoids data loss and replacement during processing. Patent document 2 (application publication No. CN107341544A) discloses a reconfigurable accelerator based on a partitionable array and an implementation method thereof, in which a scratch pad memory buffer is designed for data reuse. Patent document 3 (publication No. US20170103316A1) discloses a method, system and apparatus for a convolutional neural network accelerator in which a Unified Buffer is designed. Patent document 4 (publication No. US20180341495A1) discloses a convolutional neural network accelerator and method in which a cache device is employed to provide the data required for parallel acceleration. These inventions are all excellent and have been deployed in servers, data centers and high-end smartphones, but they have problems in embedded end-side applications.
Deploying artificial neural network algorithms on the embedded end side has the characteristic requirement that, because the hardware resources of the acceleration chip are limited, the data must be segmented and its expansion kept as small as possible; and for the different artificial neural network algorithms commonly used across fields and industrial scenarios, the handling must be a simple and convenient method, otherwise the algorithms remain difficult to deploy in practice. In the inventions of patent documents 1 and 3, differing neural network algorithm layer sizes and degrees of data reuse waste accelerator resources, so that either other heterogeneous processors must be added to handle the data-related problems, or performance must be bought with a deeper-submicron, high-cost advanced process; the storage method of patent document 3 requires more data to be backed up, making the Buffer too large; the method of patent document 2 adopts a reconfigurable computing approach and pays great attention to avoiding resource waste, but its data segmentation and arrangement method is complex and requires a compiler deployed together with an advanced computing task to assist the application; the invention of patent document 4 is too tightly coupled with the design of the central processing unit, and its implementation complexity is too high.
Disclosure of Invention
The invention provides a hardware circuit design and method for a data loading device, combined with a main memory, that accelerates the computation of a deep convolutional neural network.
The method reduces the complexity of the hardware circuit design, reduces chip area and power consumption, provides high-throughput, high-performance parallel data bandwidth, improves the utilisation of the chip's computing resources and memory bandwidth, and reduces application complexity and cost.
To achieve the above object, an embodiment of the present invention provides a data loading apparatus combined with a main memory and built around a scalable parallel data loading structure, the data loading apparatus including:
a tensor input cache random access controller, which fuses, arranges and converts the data format of input data from the main memory or/and other memories and then distributes it to the partition areas of the input cache unit, and whose working mode can be reconfigured by software;
a divisible input cache unit, which is the local cache of the data loading apparatus, consists of a plurality of storage pages, whose design and storage method correspond to the dimensions of the input data and to the parallel input register array, and which supports the data format changes brought about by software reconfiguration;
a tensor data loading device, which completes the fusion or rearrangement of tensor data by changing the access format of each storage page of the divisible input cache unit, also provides data padding, and loads the processed data into the parallel input register array;
and a parallel input register array, which feeds high-bandwidth data to the deep convolutional neural network parallel acceleration computing unit array.
For a feature map stored in the main memory or/and other memories and output by a hidden layer preceding the current deep convolutional neural network algorithm layer, the apparatus provides a cache for data rearrangement and a fast register area, which simplifies the arrangement of input data; the divisible input cache unit can be accessed repeatedly, and the data accessed again has a more regular format; when the data in it has been invalidated, new data can be efficiently written again from the main memory or/and other memories.
The invention further provides a design method for the data loading apparatus, in which the local cache unit is divided into a plurality of storage pages and the tensor input cache random access controller can access those pages in parallel; the design of the storage pages and the tensor data loading device corresponds to the scalability of the parallel input register array and satisfies a specific design formula. This design method simplifies the hardware circuitry in the apparatus and reduces its area and power consumption.
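Purely as a reading aid (not part of the patent), the four components and their data path can be sketched as software interfaces; every class, method and type name below is hypothetical.

```python
# Hypothetical interface sketch of the data path described above:
# main memory -> tensor input cache random access controller -> divisible input
# cache (bank pages) -> tensor data loading device (LDI) -> parallel input
# register array (IRA) -> parallel acceleration computing unit array.
from typing import List, Protocol

Page = List[int]  # one storage page of the divisible input cache

class TensorInputCacheController(Protocol):
    def distribute(self, main_memory_block: List[int]) -> List[Page]:
        """Fuse/arrange/convert a block from main memory and split it over the bank pages."""

class TensorDataLoadingDevice(Protocol):
    def fill_register_array(self, pages: List[Page]) -> List[List[int]]:
        """Rearrange and pad page contents, then load them into the IRA rows."""

class ParallelInputRegisterArray(Protocol):
    def feed(self) -> List[List[int]]:
        """Present one high-bandwidth input to the parallel computing unit array."""
```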
The invention has the following effects:
1. It simplifies the connection between the hardware parallel computing unit array and the input device.
2. It simplifies the spatial complexity of arranging data between the output device and the main storage.
3. It simplifies the address calculation complexity of software-configured data and of data macro block division.
4. It improves the practical efficiency of the hardware parallel computing unit array.
5. It is better suited to implementation on a low-cost embedded ASIC chip.
Drawings
FIG. 1 is a block diagram of a data input device according to the present invention;
FIG. 2 is a diagram of the structure of, and design relationship between, the tensor data loading device, the storage pages and the scalable parallel input register array according to the present invention;
FIG. 3 is a diagram of the specific structure between the tensor data loading device, the storage pages and the scalable parallel input register array according to the present invention;
FIG. 4 is a diagram of a data loading method combined with a main memory according to the present invention.
Description of the reference numerals
1 parallel hardware computing Unit Array (Process Elements Array, PEA)
101 convolution computing Element (Process Element, PE)
2 data input device combined with main memory
201 divisible input buffer unit
202 scalable parallel input register array
204 tensor data Loading Device (LDI)
205 tensor input buffer random access controller
5 high performance on-chip data bus
6 main memory and controller thereof
Detailed Description
The invention is described in further detail below with reference to the figures and examples.
Fig. 1 is a structural diagram of the data loading apparatus combined with a main memory according to the present invention; the data loading apparatus 2 includes:
a tensor input cache random access controller 205, which fuses, arranges and converts the data format of input data from the main memory 6 or/and other memories and then distributes it to the partition areas of the input cache unit 201, and whose working mode can be reconfigured by software;
a divisible input cache unit 201, which is the local cache of the data loading apparatus of the present invention, consists of a plurality of storage pages, whose design and storage method correspond to the dimensions of the input data and to the parallel input register array 202, and which supports the data format changes brought about by software reconfiguration;
a tensor data loading device 204, which completes the fusion or rearrangement of tensor data by changing the access format of each storage page of the divisible input cache unit 201, also provides data padding, and loads the processed data into the parallel input register array 202;
and a parallel input register array 202, which feeds high-bandwidth data to the deep convolutional neural network parallel acceleration computing unit array.
For a feature map stored in the main memory 6 or/and other memories and output by a hidden layer preceding the current deep convolutional neural network algorithm layer, the apparatus provides a cache for data rearrangement and a fast register area, which simplifies the arrangement of input data; the divisible input cache unit 201 can be accessed repeatedly, and the data accessed again has a more regular format; when the data in it has been invalidated, new data can be efficiently written again from the main memory 6 or/and other memories.
The invention provides a method for designing the divisible input cache to match the scalable parallel input register array: assuming the parallel input register array 202 is instantiated with Rh rows and Rw columns of input registers, the input buffer 201 is likewise divided into Rh bank pages. Assuming the input data bit width is DW, each fill of the parallel input register array 202 provides the bits needed by the parallel accelerated computing array 1 to perform Bw × Bh consecutive accelerated computations, and the bit width of one fill of the parallel input register array 202 is computed as follows:
[formula given as an image in the original: BDA0002017907920000051]
Bw and Bh are selected, in a foldable manner, according to the parallelism P of the parallel accelerated computing unit array 1 and the minimum convolution kernel size Kmin; taking into account the buffer depth tm required by the design of the main storage system, the depth of each bank page is tm × Rw. Fig. 2 explains the correspondence of this design method.
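The sizing relations stated in this paragraph can be written out as a small sketch; the names Rh, Rw, DW, Bw, Bh and tm follow the text, while the fill-width expression Bw·Bh·DW is an assumption, since the exact formula appears only as an image in the original.

```python
# Minimal sizing sketch for the divisible input cache, under stated assumptions.
# Rh, Rw : rows and columns of the parallel input register array (IRA)
# DW     : input data bit width
# Bw, Bh : consecutive accelerated computations covered by one IRA fill
# tm     : buffer depth required by the main storage system design
def input_cache_geometry(Rh: int, Rw: int, DW: int, Bw: int, Bh: int, tm: int) -> dict:
    return {
        "bank_pages": Rh,                        # one bank page per IRA row
        "depth_per_bank_page": tm * Rw,          # stated in the text
        "fill_bits_per_ira_load": Bw * Bh * DW,  # assumption: exact formula is only an image
    }

# Example with assumed values: a 16x16 register array, 8-bit data,
# 4x4 computations per fill, and a required buffer depth of 4.
print(input_cache_geometry(Rh=16, Rw=16, DW=8, Bw=4, Bh=4, tm=4))
# -> {'bank_pages': 16, 'depth_per_bank_page': 64, 'fill_bits_per_ira_load': 128}
```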
FIG. 2 shows the structure of, and design relationship between, the tensor data loading device 204, the storage pages 201 and the scalable parallel input register array 202 of the present invention. The tensor data loading device comprises several groups of read-write units that work in parallel; the number of groups is related to the range of input data in the IRA that each PE needs to access, and satisfies
[formula given as an image in the original: BDA0002017907920000052]
Each LDI read-write group unit operates on Bh rows of the parallel input register array. It first writes the corresponding row of the parallel input register array, writing
[formula given as an image in the original: BDA0002017907920000053]
elements per access; after Bw such writes the row is complete, and the next row is written until the corresponding Bh rows have been written, which completes the current IRA fill. The Bh rows handled by one LDI read-write group unit in the tensor data loading device 204 are distributed at intervals across regions, with the region size determined by the operating characteristics of the IRA and PEA; all LDI read-write group units write into the IRA in parallel, and once a row is written it holds data sufficient for the parallel computing unit array PEA to complete at least one matrix convolution computation. Fig. 3 explains this correspondence in detail.
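A small simulation can illustrate the fill order described above; it is a sketch under assumptions, since the number of read-write group units and the elements written per access are given only as formula images in the original. The group count and the strided row distribution below are hypothetical choices consistent with each group serving Bh rows.

```python
# Hypothetical simulation of the LDI fill order: each read-write group unit owns
# Bh interleaved rows of the IRA and fills each of its rows with Bw writes before
# moving to the next row. Group count and interleaving are assumptions.
def ldi_fill_order(ira_rows: int, Bw: int, Bh: int):
    groups = ira_rows // Bh                       # assumed: rows split evenly over groups
    order = {g: [] for g in range(groups)}
    for g in range(groups):
        my_rows = range(g, ira_rows, groups)      # assumed strided (interval) distribution
        for row in my_rows:                       # fill one owned row at a time
            for w in range(Bw):                   # Bw writes complete one row
                order[g].append((row, w))
    return order

# Example with assumed sizes: an 8-row IRA, Bw = 2 writes per row, Bh = 4 rows per group.
for g, ops in ldi_fill_order(ira_rows=8, Bw=2, Bh=4).items():
    print(f"group {g}: {ops}")
# group 0 fills rows 0, 2, 4, 6 and group 1 fills rows 1, 3, 5, 7, each row in 2 writes.
```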
FIG. 4 is a flowchart of a data loading method combined with a main memory according to the present invention:
First, the input data is placed in the main memory in the normal scanning order and addressed in a 2-dimensional format; in the figure, r denotes the input data map and the numbers denote addresses.
According to the register scale of the parallel input register array in the apparatus, the input data is cut into macro blocks.
The tensor input cache random access controller 205 is started and configured with the head address and tensor read mode of each cut input data block; the access controller completes tensor operations on the input data, such as fusion and transposition.
The write mode of the tensor input cache random access controller 205 is configured so that the data is written sequentially by bank page while being rearranged according to a fixed rule, satisfying the cache data arrangement provided by the invention (a minimal sketch of this flow is given below).
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (3)

1. A data loading apparatus combined with a main memory, for accelerating computation of a deep convolutional neural network, the hardware apparatus comprising:
a tensor input cache random access controller, which fuses, arranges and converts the data format of input data from the main memory or/and other memories and then distributes it to the partition areas of the input cache unit, and whose working mode can be reconfigured by software;
a divisible input cache unit, which is the local cache of the data loading apparatus, consists of a plurality of storage pages, whose design and storage method correspond to the dimensions of the input data and to the parallel input register array, and which supports the data format changes brought about by software reconfiguration;
a tensor data loading device, which completes the fusion or rearrangement of tensor data by changing the access format of each storage page of the divisible input cache unit, also provides data padding, and loads the processed data into the parallel input register array;
and a parallel input register array, which feeds high-bandwidth data to the deep convolutional neural network parallel acceleration computing unit array.
2. The data loading apparatus according to claim 1, wherein, for a feature map stored in the main memory or/and other memories and output by a hidden layer preceding the deep convolutional neural network algorithm layer, the apparatus provides a cache for data rearrangement and a fast register area, which simplifies the arrangement of input data; the divisible input cache unit can be accessed repeatedly, and the data accessed again has a more regular format; when the data in it has been invalidated, new data can be efficiently written again from the main memory or/and other memories.
3. A method for designing a data loading apparatus according to claims 1-2, wherein the local cache unit is divided into a plurality of storage pages and the tensor input cache random access controller accesses those pages in parallel; the design of the storage pages and the tensor data loading device corresponds to the scalability of the parallel input register array and satisfies a specific design formula, which simplifies the hardware circuitry in the apparatus and reduces its area and power consumption.
CN201910269465.XA 2019-04-04 2019-04-04 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation Withdrawn CN111783933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910269465.XA CN111783933A (en) 2019-04-04 2019-04-04 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910269465.XA CN111783933A (en) 2019-04-04 2019-04-04 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Publications (1)

Publication Number Publication Date
CN111783933A true CN111783933A (en) 2020-10-16

Family

ID=72755181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910269465.XA Withdrawn CN111783933A (en) 2019-04-04 2019-04-04 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Country Status (1)

Country Link
CN (1) CN111783933A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925727A (en) * 2021-03-16 2021-06-08 杭州慧芯达科技有限公司 Tensor cache and access structure and method thereof
CN112925727B (en) * 2021-03-16 2023-03-03 杭州慧芯达科技有限公司 Tensor cache and access structure and method thereof
CN113543045A (en) * 2021-05-28 2021-10-22 平头哥(上海)半导体技术有限公司 Processing unit, correlation device, and tensor operation method
CN113543045B (en) * 2021-05-28 2022-04-26 平头哥(上海)半导体技术有限公司 Processing unit, correlation device, and tensor operation method
WO2023000136A1 (en) * 2021-07-19 2023-01-26 华为技术有限公司 Data format conversion apparatus and method
EP4357973A4 (en) * 2021-07-19 2024-08-14 Huawei Tech Co Ltd Data format conversion apparatus and method
WO2023179619A1 (en) * 2022-03-25 2023-09-28 中山大学 Neural network caching method, system, and device and storage medium

Similar Documents

Publication Publication Date Title
US11544191B2 (en) Efficient hardware architecture for accelerating grouped convolutions
CN111783933A (en) Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation
KR102663759B1 (en) System and method for hierarchical sort acceleration near storage
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
CN108765247A (en) Image processing method, device, storage medium and equipment
US10936230B2 (en) Computational processor-in-memory with enhanced strided memory access
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
US7844630B2 (en) Method and structure for fast in-place transformation of standard full and packed matrix data formats
US20220179823A1 (en) Reconfigurable reduced instruction set computer processor architecture with fractured cores
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
EP3938917B1 (en) Moving data in a memory and command for memory control
CN102279818A (en) Vector data access and storage control method supporting limited sharing and vector memory
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
CN103760525A (en) Completion type in-place matrix transposition method
CN112988621A (en) Data loading device and method for tensor data
CN117273099A (en) Data multiplexing scheme and implementation method of transducer network model under reconfigurable array
CN101061460A (en) Micro processor device and method for shuffle operations
CN116774968A (en) Efficient matrix multiplication and addition with a set of thread bundles
CN111522776B (en) Computing architecture
US20210255793A1 (en) System and method for managing conversion of low-locality data into high-locality data
CN102289424B (en) Configuration stream working method for dynamic reconfigurable array processor
CN115965067A (en) Neural network accelerator for ReRAM
CN113448624A (en) Data access method, device and system and AI accelerator
CN110766150A (en) Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
CN110659118B (en) Configurable hybrid heterogeneous computing core system for multi-field chip design

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20201016