
CN111783933A - Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation - Google Patents

Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Info

Publication number
CN111783933A
CN111783933A (application number CN201910269465.XA)
Authority
CN
China
Prior art keywords
data
input
main memory
tensor
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910269465.XA
Other languages
Chinese (zh)
Inventor
林森
何一波
李珏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinqi Technology Co ltd
Original Assignee
Beijing Xinqi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xinqi Technology Co ltd filed Critical Beijing Xinqi Technology Co ltd
Priority to CN201910269465.XA priority Critical patent/CN111783933A/en
Publication of CN111783933A publication Critical patent/CN111783933A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to a hardware circuit design and method for a data loading device combined with a main memory, used to accelerate computation of a deep convolutional neural network. The device specifically designs a cache structure comprising: input caching and control, which applies a macro block segmentation method to input from a main memory or/and other memories and realizes regional data sharing, tensor data fusion and distribution; a parallel input register array, which converts the data in the input buffer; and a tensor data loading unit, which connects the output of the input cache to the input of the parallel input register array. The design simplifies the address decoding circuit, saving area and power consumption without reducing the high data bandwidth. The hardware device and data processing method provided by the invention include a transformation method, a macro block segmentation method and an addressing method for input data, meet the requirements of algorithm acceleration with limited hardware resources, and reduce the complexity of address management.

Description

Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation
Technical Field
The invention belongs to the fields of computer hardware and hardware acceleration for deploying artificial neural network algorithms, relates to digital integrated circuit design, and in particular to a method and device for designing the input system architecture of a deep convolutional neural network hardware acceleration chip.
Background
A deep convolutional neural network algorithm consists of a number of specific neuron algorithm layers and hidden layers, the majority of which are convolutional layers, so the main operator is matrix or vector convolution. The computation task is characterised by a large input data volume and by coupling of spatial feature information in the input: the data read by each convolution usually overlaps with data that has already been computed, and the inputs are typically extracted from tensor-format data according to a certain spatial rule. The computing power required by the convolutional layers is huge, the data they require is even larger, and the memory bottleneck becomes the main constraint.
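As an illustration only (not part of the patent text), the following Python sketch shows the overlap between neighbouring convolution windows that the paragraph above refers to: for a 3×3 kernel with stride 1, interior input elements are read nine times, which is why on-chip caching and regional data sharing matter.

```python
# Illustrative sketch: count how often each input element is read by a KxK
# convolution with stride 1. Strong overlap means the same data is fetched
# many times unless it is held in a local cache.
import numpy as np

H, W, K, stride = 8, 8, 3, 1                  # assumed small feature map and kernel

reuse = np.zeros((H, W), dtype=int)           # reads per input element
for i in range(0, H - K + 1, stride):
    for j in range(0, W - K + 1, stride):
        reuse[i:i + K, j:j + K] += 1          # this output position reads the KxK patch at (i, j)

print(reuse)
print(reuse.max())                            # -> 9: interior elements are read K*K times
```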
In recent years, deploying artificial neural network algorithms on the embedded end side has become a broad demand, but in the relevant scenarios the performance and cost of the acceleration chip are the main limiting factors. Patent document 1 (publication No. CN105488565A) discloses an arithmetic device and method for an acceleration chip that accelerates deep neural network algorithms; to overcome the problem that a large number of intermediate values are generated and must be stored, increasing the required main memory space, the arithmetic device is provided with intermediate value storage areas configured as random access memories, which reduces the number of reads and writes of intermediate values to and from the main memory, reduces the energy consumption of the accelerator chip, and avoids data loss and replacement during processing. Patent document 2 (application publication No. CN107341544A) discloses a reconfigurable accelerator based on a partitionable array and an implementation method thereof, in which a scratch pad memory buffer is designed for data reuse. Patent document 3 (publication No. US20170103316A1) discloses a method, system and apparatus for a convolutional neural network accelerator in which a Unified Buffer is designed. Patent document 4 (publication No. US20180341495A1) discloses a convolutional neural network accelerator and method in which a cache device is employed to provide the data required for parallel acceleration. These inventions are all excellent and have been deployed in servers, data centers and high-end smartphones, but they have problems in embedded end-side applications.
Deploying artificial neural network algorithms on the embedded end side has the characteristic requirement that, because the hardware resources of the acceleration chip are limited, the data must be segmented and its expansion kept as small as possible; and for the different artificial neural network algorithms commonly used across fields and industrial scenarios, the handling must be a simple and convenient method, otherwise the algorithms remain difficult to deploy in practice. In the inventions of patent documents 1 and 3, differing neural network algorithm layer sizes and degrees of data reuse waste accelerator resources, so that either other heterogeneous processors must be added to handle the data-related problems, or performance must be bought with a deeper-submicron, high-cost advanced process; the storage method of patent document 3 requires more data to be backed up, making the Buffer too large; the method of patent document 2 adopts a reconfigurable computing approach and pays great attention to avoiding resource waste, but its data segmentation and arrangement method is complex and requires a compiler deployed together with an advanced computing task to assist the application; the invention of patent document 4 is too tightly coupled with the design of the central processing unit, and its implementation complexity is too high.
Disclosure of Invention
The invention provides a hardware circuit design and method for a data loading device, combined with a main memory, that accelerates the computation of a deep convolutional neural network.
The method reduces the complexity of the hardware circuit design, reduces chip area and power consumption, provides high-throughput, high-performance parallel data bandwidth, improves the utilisation of the chip's computing resources and memory bandwidth, and reduces application complexity and cost.
To achieve the above object, an embodiment of the present invention provides a data loading apparatus combined with a main memory and built around a scalable parallel data loading structure, the data loading apparatus including:
a tensor input cache random access controller, which fuses, arranges and converts the data format of input data from the main memory or/and other memories and then distributes it to the partition areas of the input cache unit, and whose working mode can be reconfigured by software;
a divisible input cache unit, which is the local cache of the data loading apparatus, consists of a plurality of storage pages, whose design and storage method correspond to the dimensions of the input data and to the parallel input register array, and which supports the data format changes brought about by software reconfiguration;
a tensor data loading device, which completes the fusion or rearrangement of tensor data by changing the access format of each storage page of the divisible input cache unit, also provides data padding, and loads the processed data into the parallel input register array;
and a parallel input register array, which feeds high-bandwidth data to the deep convolutional neural network parallel acceleration computing unit array.
For a feature map stored in the main memory or/and other memories and output by a hidden layer preceding the current deep convolutional neural network algorithm layer, the apparatus provides a cache for data rearrangement and a fast register area, which simplifies the arrangement of input data; the divisible input cache unit can be accessed repeatedly, and the data accessed again has a more regular format; when the data in it has been invalidated, new data can be efficiently written again from the main memory or/and other memories.
The invention further provides a design method for the data loading apparatus, in which the local cache unit is divided into a plurality of storage pages and the tensor input cache random access controller can access those pages in parallel; the design of the storage pages and the tensor data loading device corresponds to the scalability of the parallel input register array and satisfies a specific design formula. This design method simplifies the hardware circuitry in the apparatus and reduces its area and power consumption.
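Purely as a reading aid (not part of the patent), the four components and their data path can be sketched as software interfaces; every class, method and type name below is hypothetical.

```python
# Hypothetical interface sketch of the data path described above:
# main memory -> tensor input cache random access controller -> divisible input
# cache (bank pages) -> tensor data loading device (LDI) -> parallel input
# register array (IRA) -> parallel acceleration computing unit array.
from typing import List, Protocol

Page = List[int]  # one storage page of the divisible input cache

class TensorInputCacheController(Protocol):
    def distribute(self, main_memory_block: List[int]) -> List[Page]:
        """Fuse/arrange/convert a block from main memory and split it over the bank pages."""

class TensorDataLoadingDevice(Protocol):
    def fill_register_array(self, pages: List[Page]) -> List[List[int]]:
        """Rearrange and pad page contents, then load them into the IRA rows."""

class ParallelInputRegisterArray(Protocol):
    def feed(self) -> List[List[int]]:
        """Present one high-bandwidth input to the parallel computing unit array."""
```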
The invention has the following effects:
1. It simplifies the connection between the hardware parallel computing unit array and the input device.
2. It simplifies the spatial complexity of arranging data between the output device and the main storage.
3. It simplifies the address calculation complexity of software-configured data and of data macro block division.
4. It improves the practical efficiency of the hardware parallel computing unit array.
5. It is better suited to implementation on a low-cost embedded ASIC chip.
Drawings
FIG. 1 is a block diagram of a data input device according to the present invention;
FIG. 2 is a diagram of the structure of, and design relationship between, the tensor data loading device, the storage pages and the scalable parallel input register array according to the present invention;
FIG. 3 is a diagram of the specific structure between the tensor data loading device, the storage pages and the scalable parallel input register array according to the present invention;
FIG. 4 is a diagram of a data loading method combined with a main memory according to the present invention.
Description of the reference numerals
1 parallel hardware computing Unit Array (Process Elements Array, PEA)
101 convolution computing Element (Process Element, PE)
2 data input device combined with main memory
201 divisible input buffer unit
202 scalable parallel input register array
204 tensor data Loading Device (LDI)
205 tensor input buffer random access controller
5 high performance on-chip data bus
6 main memory and controller thereof
Detailed Description
The invention is described in further detail below with reference to the figures and examples.
Fig. 1 is a structural diagram of the data loading apparatus combined with a main memory according to the present invention; the data loading apparatus 2 includes:
a tensor input cache random access controller 205, which fuses, arranges and converts the data format of input data from the main memory 6 or/and other memories and then distributes it to the partition areas of the input cache unit 201, and whose working mode can be reconfigured by software;
a divisible input cache unit 201, which is the local cache of the data loading apparatus of the present invention, consists of a plurality of storage pages, whose design and storage method correspond to the dimensions of the input data and to the parallel input register array 202, and which supports the data format changes brought about by software reconfiguration;
a tensor data loading device 204, which completes the fusion or rearrangement of tensor data by changing the access format of each storage page of the divisible input cache unit 201, also provides data padding, and loads the processed data into the parallel input register array 202;
and a parallel input register array 202, which feeds high-bandwidth data to the deep convolutional neural network parallel acceleration computing unit array.
For a feature map stored in the main memory 6 or/and other memories and output by a hidden layer preceding the current deep convolutional neural network algorithm layer, the apparatus provides a cache for data rearrangement and a fast register area, which simplifies the arrangement of input data; the divisible input cache unit 201 can be accessed repeatedly, and the data accessed again has a more regular format; when the data in it has been invalidated, new data can be efficiently written again from the main memory 6 or/and other memories.
The invention provides a method for designing the divisible input cache to match the scalable parallel input register array: assuming the parallel input register array 202 is instantiated with Rh rows and Rw columns of input registers, the input buffer 201 is likewise divided into Rh bank pages. Assuming the input data bit width is DW, each fill of the parallel input register array 202 provides the bits needed by the parallel accelerated computing array 1 to perform Bw × Bh consecutive accelerated computations, and the bit width of one fill of the parallel input register array 202 is computed as follows:
[formula given as an image in the original: BDA0002017907920000051]
Bw and Bh are selected, in a foldable manner, according to the parallelism P of the parallel accelerated computing unit array 1 and the minimum convolution kernel size Kmin; taking into account the buffer depth tm required by the design of the main storage system, the depth of each bank page is tm × Rw. Fig. 2 explains the correspondence of this design method.
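The sizing relations stated in this paragraph can be written out as a small sketch; the names Rh, Rw, DW, Bw, Bh and tm follow the text, while the fill-width expression Bw·Bh·DW is an assumption, since the exact formula appears only as an image in the original.

```python
# Minimal sizing sketch for the divisible input cache, under stated assumptions.
# Rh, Rw : rows and columns of the parallel input register array (IRA)
# DW     : input data bit width
# Bw, Bh : consecutive accelerated computations covered by one IRA fill
# tm     : buffer depth required by the main storage system design
def input_cache_geometry(Rh: int, Rw: int, DW: int, Bw: int, Bh: int, tm: int) -> dict:
    return {
        "bank_pages": Rh,                        # one bank page per IRA row
        "depth_per_bank_page": tm * Rw,          # stated in the text
        "fill_bits_per_ira_load": Bw * Bh * DW,  # assumption: exact formula is only an image
    }

# Example with assumed values: a 16x16 register array, 8-bit data,
# 4x4 computations per fill, and a required buffer depth of 4.
print(input_cache_geometry(Rh=16, Rw=16, DW=8, Bw=4, Bh=4, tm=4))
# -> {'bank_pages': 16, 'depth_per_bank_page': 64, 'fill_bits_per_ira_load': 128}
```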
FIG. 2 shows the structure of, and design relationship between, the tensor data loading device 204, the storage pages 201 and the scalable parallel input register array 202 of the present invention. The tensor data loading device comprises several groups of read-write units that work in parallel; the number of groups is related to the range of input data in the IRA that each PE needs to access, and satisfies
[formula given as an image in the original: BDA0002017907920000052]
Each LDI read-write group unit operates on Bh rows of the parallel input register array. It first writes the corresponding row of the parallel input register array, writing
[formula given as an image in the original: BDA0002017907920000053]
elements per access; after Bw such writes the row is complete, and the next row is written until the corresponding Bh rows have been written, which completes the current IRA fill. The Bh rows handled by one LDI read-write group unit in the tensor data loading device 204 are distributed at intervals across regions, with the region size determined by the operating characteristics of the IRA and PEA; all LDI read-write group units write into the IRA in parallel, and once a row is written it holds data sufficient for the parallel computing unit array PEA to complete at least one matrix convolution computation. Fig. 3 explains this correspondence in detail.
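A small simulation can illustrate the fill order described above; it is a sketch under assumptions, since the number of read-write group units and the elements written per access are given only as formula images in the original. The group count and the strided row distribution below are hypothetical choices consistent with each group serving Bh rows.

```python
# Hypothetical simulation of the LDI fill order: each read-write group unit owns
# Bh interleaved rows of the IRA and fills each of its rows with Bw writes before
# moving to the next row. Group count and interleaving are assumptions.
def ldi_fill_order(ira_rows: int, Bw: int, Bh: int):
    groups = ira_rows // Bh                       # assumed: rows split evenly over groups
    order = {g: [] for g in range(groups)}
    for g in range(groups):
        my_rows = range(g, ira_rows, groups)      # assumed strided (interval) distribution
        for row in my_rows:                       # fill one owned row at a time
            for w in range(Bw):                   # Bw writes complete one row
                order[g].append((row, w))
    return order

# Example with assumed sizes: an 8-row IRA, Bw = 2 writes per row, Bh = 4 rows per group.
for g, ops in ldi_fill_order(ira_rows=8, Bw=2, Bh=4).items():
    print(f"group {g}: {ops}")
# group 0 fills rows 0, 2, 4, 6 and group 1 fills rows 1, 3, 5, 7, each row in 2 writes.
```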
FIG. 4 is a flowchart of a data loading method combined with a main memory according to the present invention:
First, the input data is placed in the main memory in the normal scanning order and addressed in a 2-dimensional format; in the figure, r denotes the input data map and the numbers denote addresses.
According to the register scale of the parallel input register array in the apparatus, the input data is cut into macro blocks.
The tensor input cache random access controller 205 is started and configured with the head address and tensor read mode of each cut input data block; the access controller completes tensor operations on the input data, such as fusion and transposition.
The write mode of the tensor input cache random access controller 205 is configured so that the data is written sequentially by bank page while being rearranged according to a fixed rule, satisfying the cache data arrangement provided by the invention (a minimal sketch of this flow is given below).
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (3)

1. A data loading apparatus combined with a main memory, for accelerating computation of a deep convolutional neural network, the hardware apparatus comprising:
a tensor input cache random access controller, which fuses, arranges and converts the data format of input data from the main memory or/and other memories and then distributes it to the partition areas of the input cache unit, and whose working mode can be reconfigured by software;
a divisible input cache unit, which is the local cache of the data loading apparatus, consists of a plurality of storage pages, whose design and storage method correspond to the dimensions of the input data and to the parallel input register array, and which supports the data format changes brought about by software reconfiguration;
a tensor data loading device, which completes the fusion or rearrangement of tensor data by changing the access format of each storage page of the divisible input cache unit, also provides data padding, and loads the processed data into the parallel input register array;
and a parallel input register array, which feeds high-bandwidth data to the deep convolutional neural network parallel acceleration computing unit array.
2. The data loading apparatus according to claim 1, wherein, for a feature map stored in the main memory or/and other memories and output by a hidden layer preceding the deep convolutional neural network algorithm layer, the apparatus provides a cache for data rearrangement and a fast register area, which simplifies the arrangement of input data; the divisible input cache unit can be accessed repeatedly, and the data accessed again has a more regular format; when the data in it has been invalidated, new data can be efficiently written again from the main memory or/and other memories.
3. A method for designing a data loading apparatus according to claims 1-2, wherein the local cache unit is divided into a plurality of storage pages and the tensor input cache random access controller accesses those pages in parallel; the design of the storage pages and the tensor data loading device corresponds to the scalability of the parallel input register array and satisfies a specific design formula, which simplifies the hardware circuitry in the apparatus and reduces its area and power consumption.
CN201910269465.XA 2019-04-04 2019-04-04 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation Withdrawn CN111783933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910269465.XA CN111783933A (en) 2019-04-04 2019-04-04 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910269465.XA CN111783933A (en) 2019-04-04 2019-04-04 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Publications (1)

Publication Number Publication Date
CN111783933A true CN111783933A (en) 2020-10-16

Family

ID=72755181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910269465.XA Withdrawn CN111783933A (en) 2019-04-04 2019-04-04 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Country Status (1)

Country Link
CN (1) CN111783933A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925727A (en) * 2021-03-16 2021-06-08 杭州慧芯达科技有限公司 Tensor cache and access structure and method thereof
CN112925727B (en) * 2021-03-16 2023-03-03 杭州慧芯达科技有限公司 Tensor cache and access structure and method thereof
CN113543045A (en) * 2021-05-28 2021-10-22 平头哥(上海)半导体技术有限公司 Processing unit, correlation device, and tensor operation method
CN113543045B (en) * 2021-05-28 2022-04-26 平头哥(上海)半导体技术有限公司 Processing unit, correlation device, and tensor operation method
WO2023000136A1 (en) * 2021-07-19 2023-01-26 华为技术有限公司 Data format conversion apparatus and method
EP4357973A4 (en) * 2021-07-19 2024-08-14 Huawei Tech Co Ltd Data format conversion apparatus and method
WO2023179619A1 (en) * 2022-03-25 2023-09-28 中山大学 Neural network caching method, system, and device and storage medium

Similar Documents

Publication Publication Date Title
US11544191B2 (en) Efficient hardware architecture for accelerating grouped convolutions
CN111783933A (en) Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation
KR102663759B1 (en) System and method for hierarchical sort acceleration near storage
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
CN108765247A (en) Image processing method, device, storage medium and equipment
US10936230B2 (en) Computational processor-in-memory with enhanced strided memory access
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
US7844630B2 (en) Method and structure for fast in-place transformation of standard full and packed matrix data formats
US20220179823A1 (en) Reconfigurable reduced instruction set computer processor architecture with fractured cores
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
EP3938917B1 (en) Moving data in a memory and command for memory control
CN102279818A (en) Vector data access and storage control method supporting limited sharing and vector memory
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
CN103760525A (en) Completion type in-place matrix transposition method
CN112988621A (en) Data loading device and method for tensor data
CN117273099A (en) Data multiplexing scheme and implementation method of transducer network model under reconfigurable array
CN101061460A (en) Micro processor device and method for shuffle operations
CN116774968A (en) Efficient matrix multiplication and addition with a set of thread bundles
CN111522776B (en) Computing architecture
US20210255793A1 (en) System and method for managing conversion of low-locality data into high-locality data
CN102289424B (en) Configuration stream working method for dynamic reconfigurable array processor
CN115965067A (en) Neural network accelerator for ReRAM
CN113448624A (en) Data access method, device and system and AI accelerator
CN110766150A (en) Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
CN110659118B (en) Configurable hybrid heterogeneous computing core system for multi-field chip design

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20201016