Disclosure of Invention
Accordingly, in order to overcome the above-described drawbacks of the prior art, the present invention provides a neural network acceleration device, a neural network acceleration method, and a communication device.
To achieve the above object, the present invention provides a neural network acceleration device comprising: a main memory for receiving and storing feature map data and weight data of an image to be processed; a main controller for parsing compiled instructions of a neural network program and generating configuration information and operation instructions according to structural parameters of the neural network; a data caching module comprising a feature data caching unit for caching feature row data extracted from the feature map data, and a convolution kernel caching unit for caching convolution kernel data extracted from the weight data; a neural network computing module comprising a data controller, a data extractor and neural network computing units, wherein the data controller adjusts the data paths according to the configuration information and instruction information and, according to the instruction information, directs the data streams extracted by the data extractor into the corresponding neural network computing units, and each neural network computing unit completes at least the convolution of one convolution kernel with the feature map data and accumulates a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data reuse; and an accumulator for accumulating the convolution results over the feature maps of the multiple input channels obtained by the convolution operation units and outputting the output feature map data corresponding to the convolution kernel.
In one embodiment, the neural network computing unit includes a plurality of neural network acceleration slices, each of which includes a plurality of convolution multiply-add arrays. Each acceleration slice completes at least the convolution of the feature map data of one input channel with one set of convolution kernel data, and the plurality of acceleration slices together complete the convolution of the feature map data of multiple input channels with one set of convolution kernel data.
In one embodiment, a plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix. Each first neural network operation matrix within a second neural network acceleration matrix completes the convolution of the feature data of multiple input channels with one convolution kernel, and a plurality of second neural network acceleration matrices complete, in parallel, the convolution of the feature data of multiple input channels with multiple convolution kernels.
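The parallel structure described above can be illustrated with a small software model. The sketch below is purely illustrative, not the claimed hardware: the shapes C_IN, C_OUT, H, W, K and the helper slice_conv are assumptions introduced for the example. Each acceleration slice corresponds to one input-channel/one-kernel convolution, one operation matrix accumulates the slices over the input channels of a kernel, and several such matrices handle different kernels in parallel.

```python
import numpy as np

C_IN, C_OUT, H, W, K = 4, 2, 8, 8, 3          # channels, spatial size, kernel size
fmap = np.random.rand(C_IN, H, W)             # input feature maps
kernels = np.random.rand(C_OUT, C_IN, K, K)   # one kernel per output channel

def slice_conv(channel, kernel_2d):
    """One acceleration slice: 2-D convolution of a single input channel."""
    out = np.zeros((H - K + 1, W - K + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(channel[y:y + K, x:x + K] * kernel_2d)
    return out

# First-level matrix: all input channels of one kernel, results accumulated.
# Second-level matrices: the same feature data reused for every kernel.
outputs = np.zeros((C_OUT, H - K + 1, W - K + 1))
for co in range(C_OUT):                       # parallel across second-level matrices
    for ci in range(C_IN):                    # parallel across slices in one matrix
        outputs[co] += slice_conv(fmap[ci], kernels[co, ci])
```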
In one embodiment, each group of convolution multiply-add arrays receives feature row data through parallel input, and receives convolution kernel data through serial input.
In one embodiment, the neural network acceleration slice includes a plurality of first multiplexers and a plurality of second multiplexers; the first multiplexers are coupled in parallel, in one-to-one correspondence, with the convolution multiply-add arrays, and the second multiplexers are coupled in series, in one-to-one correspondence, with the convolution multiply-add arrays. Each first multiplexer obtains, via a data selection signal, the feature row data corresponding to its convolution multiply-add array and inputs that data in parallel to every stage of the array, while each second multiplexer obtains the convolution kernel data corresponding to its array and inputs that data serially to every stage of the array to complete the convolution multiply-add operation.
In one embodiment, the neural network computing module further includes a first shift register set and a second shift register set, and the neural network computing unit includes a multiply-add subunit and a partial-sum buffer subunit. The first shift register set operates in a serial-in, parallel-out mode and outputs the feature row data to the multiply-add subunit through a first multiplexer. The second shift register set operates in a serial-in, one-of-N-out mode selected according to the stride, and outputs the convolution kernel data through a second multiplexer to the next-stage convolution multiply-add array and to the multiply-add subunit. The multiply-add subunit multiplies the incoming feature row data by the corresponding convolution kernel data and accumulates the products in the partial-sum buffer subunit; when the convolution of a kernel row with the feature row data of the corresponding convolution window is complete, the partial sums of the several row convolutions of that window are accumulated, realizing one sliding-window convolution of the kernel. The convolution multiply-add arrays at the different stages of each group output their row results to the accumulator once per kernel-row period, and the accumulator sums, through an adder tree, the row results output by the same-stage arrays of all groups covering all rows of the current convolution kernel, thereby completing the convolution of one kernel.
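The row-wise decomposition of one sliding-window convolution described in this embodiment can be modelled in a few lines. In the sketch below, row_partial_sum and adder_tree are hypothetical names standing in for one PE group and the accumulator's adder tree; it only checks that summing the per-row partial sums reproduces the full window convolution.

```python
import numpy as np

def row_partial_sum(feature_row, kernel_row):
    """One PE group: multiply-accumulate of one kernel row over one feature row."""
    return float(np.dot(feature_row, kernel_row))

def adder_tree(partial_sums):
    """Accumulator: sum the per-row partial sums of one convolution window."""
    return sum(partial_sums)

window = np.random.rand(3, 3)      # one 3x3 convolution window of the feature map
kernel = np.random.rand(3, 3)
partials = [row_partial_sum(window[r], kernel[r]) for r in range(3)]
assert np.isclose(adder_tree(partials), np.sum(window * kernel))
```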
In one embodiment, the data controller is configured to obtain, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing unit, and at the same time to instruct the multiplexers to switch the data paths so that the feature data and corresponding weight data are input into the appropriate neural network computing units according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature row data from the feature data caching unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel caching unit according to the instruction information and transfers it to the neural network computing module.
In one embodiment, the feature data caching unit includes a plurality of feature data caching groups; each caching group caches part of the feature data of one input channel and is coupled with at least one neural network acceleration slice, while the plurality of acceleration slices share one convolution kernel caching unit. Each acceleration slice therefore obtains the feature map data of one input channel from its corresponding feature data caching group, and the same convolution kernel data is distributed to all of the acceleration slices.
In one embodiment, the data controller obtains, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing unit; at the same time, it instructs the first multiplexers to output the feature row data from the first shift register set to the neural network computing unit in a serial-in, parallel-out manner, and instructs the second multiplexers to output the convolution kernel data from the second shift register set, in a serial-in, one-of-N-out manner, to the current convolution multiply-add array and to the next convolution multiply-add array of the neural network computing unit.
The invention also provides a neural network acceleration method, which comprises the following steps: receiving and storing, in a main memory, the feature map data and weight data of an image to be processed; parsing the compiled instructions of the neural network program with a main controller and generating configuration information and operation instructions according to the structural parameters of the neural network; caching, in the feature data caching unit of a data caching module, the feature row data extracted from the feature map data, and caching, in the convolution kernel caching unit of the data caching module, the convolution kernel data extracted from the weight data; using the data controller of the neural network computing module to switch the data paths according to the configuration information and instruction information and to direct the data streams extracted by the data extractor of the neural network computing module into the corresponding neural network computing units according to the instruction information, each neural network computing unit completing at least the convolution of one convolution kernel with the feature map data and accumulating a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data reuse; and accumulating, with an accumulator, the convolution results over the feature maps of the multiple input channels obtained by the convolution operation units, and outputting the output feature map data corresponding to the convolution kernel.
The invention also provides a communication device comprising a CPU (central processing unit), a DDR SDRAM (double data rate SDRAM) and the above neural network acceleration device, which are communicatively connected, wherein the CPU controls the neural network acceleration device to start the convolution operation, and the DDR SDRAM inputs the feature map data and weight data into the data caching module of the neural network acceleration device.
Compared with the prior art, the invention has the following advantages. The values fetched from the input image by the convolution kernel are processed as row (or column) operations, so that the convolution of the current data can be completed with a single read from main memory, without unrolling the data into matrix form. This reduces memory accesses, improves the energy efficiency of data access, achieves temporal reuse of feature data and improves operation speed. In addition, the feature map data of each input channel is processed in blocks: only part of the feature map data is fetched at a time, and the cached data is updated once its computation is finished, which reduces the on-chip feature map cache requirement and allows the compute cores to run without stalling.
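The following sketch illustrates the row-wise, block-based processing claimed above in plain Python (the tile size of 4 rows and the function name conv2d_rowwise are assumptions introduced for the example): the feature map is convolved row by row, one kernel row at a time, without ever being expanded into an im2col matrix.

```python
import numpy as np

def conv2d_rowwise(fmap, kernel, tile_rows=4):
    K = kernel.shape[0]
    H, W = fmap.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for tile_start in range(0, H - K + 1, tile_rows):   # block the feature map
        tile_end = min(tile_start + tile_rows, H - K + 1)
        for oy in range(tile_start, tile_end):
            for r in range(K):                           # one kernel row at a time
                frow = fmap[oy + r]                      # whole feature row is reused
                for ox in range(out.shape[1]):
                    out[oy, ox] += np.dot(frow[ox:ox + K], kernel[r])
    return out

fmap = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
ref = np.array([[np.sum(fmap[y:y + 3, x:x + 3] * kernel) for x in range(6)]
                for y in range(6)])
assert np.allclose(conv2d_rowwise(fmap, kernel), ref)
```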
Detailed Description
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present application, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such an apparatus may be implemented and/or such a method practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the application by way of illustration, and only the components related to the application are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
The embodiment of the application provides a communication device whose hardware architecture comprises a central processing unit (CPU), a DDR SDRAM memory and a neural network acceleration device, which are communicatively connected. The CPU controls the neural network acceleration device to start the convolution operation, and the DDR SDRAM inputs the feature map data and weight data (the convolution data and convolution parameters) into the data caching module of the neural network acceleration device. The neural network acceleration device then completes the convolution operation on the acquired convolution data and convolution parameters, writes the result back to a memory address agreed with the DDR SDRAM, and notifies the CPU that the convolution operation is complete.
As shown in fig. 1 and 2, an embodiment of the present application provides a neural network acceleration device 100 including a main memory 102, a main controller 104, a data caching module 106, a neural network computing module 108, and an accumulator 110.
The main memory 102 is used for receiving and storing feature map data and weight data of an image to be processed. The main memory 102 may receive and store images to be processed as well as weight data.
The main controller 104 is configured to parse the compiled instructions of the neural network program and to generate configuration information and operation instructions according to the structural parameters of the neural network. Based on these structural parameters, the main controller 104 generates the instructions for loading the feature map data and weight parameters into the corresponding cache units, and at the same time sends the configuration information and execution instructions for the PE matrix queues in the PU slices to the data controller of the neural network computing module 108.
The data caching module 106 includes a feature data caching unit and a convolution kernel caching unit. The feature data caching unit caches the feature row data extracted from the feature map data, and the convolution kernel caching unit caches the convolution kernel data extracted from the weight data. After the PU slices receive the instructions, the interconnect circuit paths on the PU slices are configured according to the configuration and instruction information, and the feature map data is input row by row into the PE accelerators of the multiple PU slices according to a preset dataflow rule to complete the convolution operation (the configuration and instruction information specify, for example, the arrangement of the feature data and weights, the data paths, the block sizes, and which PU slices and PE accelerators the data enter).
The neural network computing module 108 includes a data controller 1081, a data extractor 1082 and a neural network computing unit 1083. The data controller adjusts the data paths according to the configuration information and instruction information and, according to the instruction information, directs the data streams extracted by the data extractor into the corresponding neural network computing units; each neural network computing unit completes at least the convolution of one convolution kernel with the feature map data and accumulates a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data reuse.
The accumulator 110 is configured to accumulate the convolution results over the feature maps of the multiple input channels obtained by the convolution operation units, and to output the output feature map data corresponding to the convolution kernel.
The neural network acceleration device may also include an activation/pooling unit, an output buffer unit and the like, which cooperate with the neural network computing module 108 to complete the subsequent processing of the convolutional neural network.
With this neural network acceleration device, the data does not need to be unrolled into matrix form: the convolution and accumulation of the current data can be completed by reading the feature row data and convolution kernel data from main memory only once, which reduces the memory access bandwidth and storage space, improves the energy efficiency of data access, achieves efficient reuse of feature map data, and improves operation speed. In addition, the feature map data of each input channel is processed in blocks, i.e., only part of the feature map data is fetched at a time and the cached data is updated after its computation is finished, which reduces the on-chip feature map cache requirement and allows the compute cores to run without stalling.
In one embodiment, as shown in fig. 2, the neural network computing unit 1083 includes a plurality of neural network acceleration slices 1084, each of which includes a plurality of convolution multiply-add arrays. Each acceleration slice completes at least the convolution of the feature map data of one input channel with one set of convolution kernel data, and the plurality of acceleration slices together complete the convolution of the feature map data of multiple input channels with one set of convolution kernel data.
A neural network acceleration slice (PU slice) may be composed of a plurality of PE acceleration processing units (PE accelerators). Multiple PU slices can perform data-parallel computation along different dimensions through system configuration. The NN acceleration circuit may include a plurality of PU slices, an activation/pooling circuit, an accumulation unit, and so on. The PE accelerator is the most basic unit of neural network acceleration: each PE contains at least a multiplier, an adder and a partial-sum/result buffer, can complete at least one convolution of a weight parameter with input feature data, and can accumulate a plurality of convolution results within at least one cycle. In this application, a PU slice may contain PE accelerators arranged in an array. The feature data caching unit comprises a plurality of feature data caching groups; each caching group caches part of the feature data of one input channel and is coupled with at least one PU slice, i.e., each PU slice obtains the feature map data of one input channel from its corresponding caching group. At the same time, multiple PU slices share one convolution kernel caching unit, i.e., the data of the same convolution kernel is broadcast to multiple PU slices, realizing multi-channel-input, single-channel-output parallel convolution.
To realize data reuse and reduce the pressure on the main memory read/write bandwidth, each PU slice may first compute, along the row direction of the input feature map, the convolution results of the currently input single-channel feature row data (the data collected by the convolution kernel along the row direction of the image to be processed) with several convolution kernels; the convolution kernel data is then updated and the convolution restarted until the single feature row has been convolved with all convolution kernels. After the feature row data of the single input channel has been convolved with all kernels, the input feature rows are updated, i.e., the convolution window moves down, and these steps are repeated until the current input feature rows have been convolved with all kernels, producing the multi-channel output feature rows for those input rows. After the single-channel input feature map has been convolved with all kernels, the feature map input channel is updated. This operation order can be flexibly configured through the configuration information or instruction information according to the actual situation.
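One possible schedule implementing this loop order is sketched below; since the text notes that the order is configurable, the nesting shown (channel outermost, then row block, then kernel) is only one assumed configuration, and the function name schedule is illustrative.

```python
def schedule(num_channels, num_kernels, num_row_blocks):
    """Yield the order in which (channel, row block, kernel) convolutions run."""
    for channel in range(num_channels):          # update the input channel last
        for row_block in range(num_row_blocks):  # then slide the window down the feature map
            for kernel in range(num_kernels):    # reuse the cached rows for every kernel
                yield channel, row_block, kernel

for step in schedule(num_channels=2, num_row_blocks=3, num_kernels=2):
    print("convolve channel %d, row block %d, kernel %d" % step)
```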
In one embodiment, a plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first neural network operation matrices are coupled in parallel to form a second neural network acceleration matrix; each first operation matrix within a second acceleration matrix completes the convolution of the feature data of multiple input channels with one convolution kernel, and a plurality of second acceleration matrices complete, in parallel, the convolution of the feature data of multiple input channels with multiple convolution kernels. A single acceleration slice computes the accelerated convolution of one set of feature map data with one convolution kernel. The PU slices within the same second acceleration matrix can simultaneously compute the convolution of the same convolution kernel with different input feature data, with those slices sharing one convolution kernel caching unit and drawing on the shared feature data caching unit through its separate caching groups. The PU slices of different second acceleration matrices can simultaneously compute the convolutions of several different convolution kernels with the same input feature data, the slices within each such matrix still sharing a convolution kernel caching unit while the feature data caching unit is shared among the matrices.
In other words, a plurality of PU slices form a first sub-matrix of PU operations, and a plurality of such first sub-matrices form a second PU matrix; each sub-matrix within the second PU matrix completes the convolution of the feature data of multiple input channels with one convolution kernel, and the several sub-matrices together complete, in parallel, the convolution of the feature data of multiple input channels with multiple convolution kernels.
In one embodiment, each group of convolution multiply-add arrays receives feature row data through parallel input and receives convolution kernel data through serial input. A PU slice contains at least one group of PEs, and each group of PEs is responsible for the convolution of one convolution kernel row with the corresponding feature map row data; multiple groups of PEs can therefore perform the convolution of multiple kernel rows with the corresponding feature map rows, i.e., each group of PEs forms a column, and the several columns of PEs complete the convolution of at least one convolution kernel row with the corresponding feature data. Each group of PEs receives the feature row data through parallel input, i.e., each element of the feature row is broadcast simultaneously to every PE stage of the current group; at the same time, each group of PEs receives the convolution kernel row data through serial input, i.e., each kernel row element flows from the first PE of the group to the next PE on every clock cycle.
In one embodiment, as shown in FIG. 3, the neural network acceleration slice 1084 comprises a plurality of first multiplexers 1085, coupled in parallel in one-to-one correspondence with the convolution multiply-add arrays, and a plurality of second multiplexers 1086, coupled in series in one-to-one correspondence with the convolution multiply-add arrays. Each first multiplexer obtains, via a data selection signal, the feature row data corresponding to its convolution multiply-add array and inputs it in parallel to the corresponding arrays, while each second multiplexer obtains the convolution kernel data corresponding to its array and inputs it serially to every stage of the array to complete the convolution multiply-add operation.
The first multiplexers are each coupled in parallel with the corresponding PE group, and each PE group can select at least one of two feature rows through its selection signal; the six PE groups in FIG. 3 can thus select among 6 different rows of data. The second multiplexers are coupled in series with the corresponding PE groups, and each PE group can select at least one of two convolution kernel rows through its selection signal; the six PE groups in FIG. 3 can select among 6 different convolution kernel rows. Each first multiplexer obtains the corresponding feature row data via the data selection signal (provided by the configuration information or the data-loading instruction information) and inputs it in parallel to every PE stage of the corresponding PE group, while the second multiplexer selects the corresponding convolution kernel data and inputs it serially to the PE stages to complete the convolution multiply-add operation.
The remaining idle PE groups can obtain, through the multiplexers, the feature row data and convolution kernel data needed for the convolution in the column direction of the feature map, thereby reusing the input data. For example, for a 3×3 convolution kernel with a stride of 1, taking the matrix of fig. 3 as an example, the first three columns of PE groups complete the parallel accelerated convolution in the row direction of the convolution kernel (PE00, PE10, PE20 form one group; PE01, PE11, PE21 form one group; PE02, PE12, PE22 form one group). The fourth group of PEs can then reuse the feature row data of the second feature extractor 1 together with the kernel row data of the first weight acquirer 0, the fifth group can reuse the feature row data of the third feature extractor 2 with the kernel row data of the second weight acquirer 1, and the sixth group can reuse the feature row data of the fourth feature extractor 3 with the kernel row data of the third weight acquirer 2. These data pairs are fed into the corresponding PE groups to complete the sliding convolution of the convolution window at the next position in the column direction of the feature map (see the sketch below), which improves the reuse of feature map data, reduces the occupation of the main-memory read/write bandwidth, and keeps the whole PE array running efficiently.
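The reuse pattern just described can be checked with a small model. The sketch below assumes a 3×3 kernel, stride 1, four cached feature rows and six PE groups, with group_row_conv standing in for one PE group; groups 0-2 produce output row 0 and groups 3-5 reuse feature rows 1-3 to produce output row 1. These names and sizes are assumptions for illustration only.

```python
import numpy as np

fmap = np.random.rand(4, 8)          # four cached feature rows of one input channel
kernel = np.random.rand(3, 3)

def group_row_conv(feature_row, kernel_row):
    """One PE group: 1-D valid convolution of a kernel row over a feature row."""
    return np.array([np.dot(feature_row[x:x + 3], kernel_row) for x in range(6)])

# Groups 0-2: kernel rows 0-2 on feature rows 0-2 (output row 0).
out_row0 = sum(group_row_conv(fmap[r], kernel[r]) for r in range(3))
# Groups 3-5: the same kernel rows reused on feature rows 1-3 (output row 1).
out_row1 = sum(group_row_conv(fmap[r + 1], kernel[r]) for r in range(3))

assert np.allclose(out_row0, [np.sum(fmap[0:3, x:x + 3] * kernel) for x in range(6)])
assert np.allclose(out_row1, [np.sum(fmap[1:4, x:x + 3] * kernel) for x in range(6)])
```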
In one embodiment, the neural network computing module further includes a first shift register set and a second shift register set, and the neural network computing unit includes a multiply-add subunit and a partial-sum buffer subunit. The first shift register set operates in a serial-in, parallel-out mode and outputs the feature row data to the multiply-add subunit through a first multiplexer. The second shift register set likewise takes its input serially and, according to the stride signal, outputs the convolution kernel data through a second multiplexer to the next-stage convolution multiply-add array and to the multiply-add subunit. The multiply-add subunit multiplies the incoming feature row data by the corresponding convolution kernel data and accumulates the products in the partial-sum buffer subunit; when the convolution of a kernel row with the feature row data of the corresponding convolution window is complete, the partial sums of the several row convolutions of that window are accumulated, realizing one sliding-window convolution of the kernel. The convolution multiply-add arrays (PEs) at the different stages of each group output their row results to the accumulator once per kernel-row period, and the accumulator sums, through an adder tree, the kernel-row convolution results output by the arrays covering all rows of the current convolution kernel, thereby completing one window convolution of one kernel. The PEs of the different stages compute, in parallel, several sliding convolutions of the current kernel row along the feature map row; during the convolution, the feature map rows are input in sequence and loaded in parallel into every stage of each group of PEs, while the corresponding convolution kernel data is loaded serially into the PE stages in a cyclic, periodic manner.
As shown in fig. 4, the first shift register set 1087 is connected serially and outputs in parallel, feeding the feature map data to the PEs of each stage through the multiplexer; the current convolution kernel element is fed into the multiply-add unit at the same time as it enters the first-stage shift register.
The second shift register set 1088 likewise uses a serial connection with parallel output, and passes the convolution kernel data through the multiplexer to the next multiply-add unit.
The feature row data and convolution kernel row data fed into the multiply-add unit are multiplied element by element, and the products are accumulated in the partial-sum buffer unit. Once the convolution of one kernel row with the feature row data of the corresponding convolution window is finished, the partial sum of this row convolution is output and accumulated with the convolution results of the other rows of the kernel, realizing one sliding-window convolution of the kernel.
The feature row data (X00, …, X0n) is fed into the PE group in parallel in row order, and the convolution kernel row data (F00/F01/F02) is fed into the PE group serially in the cyclic order F00/F01/F02-F00/F01/F02-F00/F01/F02 for the convolution. After one kernel-row period (the kernel row size is 3, so the kernel-row period is 3 cycles), each PE stage outputs the partial sum corresponding to the current kernel row. The PEs of the different stages output the partial sums of the kernel row's sliding convolution over the feature row according to the stride s. After the PEs of the different stages of each group have output their partial sums within one kernel-row period, the partial sums output by the same-stage PEs of all the groups covering all rows of the current convolution kernel are accumulated through an adder tree, realizing the convolution of one kernel and yielding the convolution computation shown in fig. 5.
In fig. 5, PE00 outputs, over three consecutive cycles, the partial convolution result of the first row of the convolution kernel with the feature row of the corresponding window; PE10 outputs, over three consecutive cycles, the partial result of the first kernel row with the window adjacent to the one computed by PE00; and PE20 outputs, over three consecutive cycles, the partial result of the first kernel row with the window adjacent to the one computed by PE10. The several groups of PEs thus perform the accelerated convolution of the different kernel rows with the corresponding feature data rows, which is equivalent to the sliding traversal of the convolution kernel along the feature map rows. The first shift register set selects the output of the appropriate shift register, according to the stride parameter s selection signal, as the input of the next PE, so that the convolution window traverses the row direction with stride s, while the other PE groups, by reusing part of the input feature rows and kernel rows, realize the window-sliding convolution along the column direction of the feature map according to the stride.
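The division of one kernel row among the PE stages can be written out as follows; pe_stage is a hypothetical stand-in for one PE of the group, and the window offset is simply the stage index times the stride s.

```python
import numpy as np

feature_row = np.random.rand(10)
kernel_row = np.random.rand(3)
s = 1                                                  # sliding-window stride

def pe_stage(stage_index):
    """Partial sum of the kernel row over the window starting at stage_index * s."""
    start = stage_index * s
    return float(np.dot(feature_row[start:start + 3], kernel_row))

partials = [pe_stage(i) for i in range(3)]             # PE00, PE10, PE20 in parallel
```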
In one embodiment, the data controller obtains, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing unit, and at the same time instructs the multiplexers to switch the data paths so that the feature data and corresponding weight data are input into the appropriate neural network computing units according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature row data from the feature data caching unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel caching unit according to the instruction information and transfers it to the neural network computing module.
In one embodiment, the feature data caching unit includes a plurality of feature data caching groups; each caching group caches part of the feature data of one input channel and is coupled with at least one neural network acceleration slice, the acceleration slices share one convolution kernel caching unit, each acceleration slice obtains the feature map data of one input channel from its corresponding caching group, and the same convolution kernel data is distributed to all of the acceleration slices. Besides connecting weight acquirer 0 to PE column 0, the data controller also broadcasts the same data to the 4th PE column: the first row of weight parameters performs a row convolution not only with the 1st feature row but also with the 2nd feature row (the 2nd feature row being connected, through a multiplexer, to the 4th PE column in addition to the 2nd PE column), i.e., the convolution kernel operates on adjacent rows each time. The data controller of the neural network computing module 108 thus uses several acceleration slices to compute, along the row direction of the image to be processed, the convolution of the single-channel input feature rows with the convolution kernel.
In one embodiment, the data controller obtains, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing unit; at the same time, it instructs the multiplexers to output the feature data from the first shift register set to the neural network computing unit in a serial-in, parallel-out manner, and to output the convolution kernel data from the second shift register set, in a serial-in, one-of-N-out manner selected according to the stride, to the current convolution multiply-add array and the next convolution multiply-add array of the neural network computing unit.
As shown in fig. 6, when the convolution kernel row size (for example 5) is larger than the number of PEs in each PE group (for example 3), the addresses at which the different PE stages need the cached feature row data from the feature row data extractor may conflict within a kernel-row period: since every PE stage synchronously receives the same feature element, in the next kernel-row cycle the feature data address required by some PEs conflicts with the address currently being accessed by the feature extractor. However, the data needed by those PEs has already been accessed in an earlier cycle and is held in the first shift register set. The data controller therefore obtains, from the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit, instructs the multiplexers to adjust the data paths, inputs the feature data and corresponding weight data to the appropriate computing units according to the instruction information, and resolves the conflict between these accesses and the feature extractor's current address by selecting the output of the appropriate register in the first shift register set. In this way the PE units can obtain the conflicting data through the multiplexers under control of the data controller, so the scheme is applicable to convolution kernels of different sizes and avoids PE stalls caused by address conflicts between PE stages when the accessed data is not aligned.
As shown in fig. 4, the feature extractor may load data into the PE array in row order according to the data addresses. Six feature extractors are provided for the PE array; each is a FIFO unit or an addressing unit (which addresses the cache unit sequentially and loads the data into the PEs). If two of the FIFOs are empty, the signals of the currently empty feature extractors can be masked to reduce power consumption, and the weight acquirers behave similarly. For a 5×5 or 7×7 convolution kernel, all feature extractors and weight acquirers can be run at full load. The number of feature extractors can be flexibly configured, i.e., reduced or increased, with the PE array reduced or enlarged correspondingly.
The multiplexers at the inputs select an input channel according to the configuration information from the data controller, so that the convolution of the convolution kernel with the corresponding feature data is completed. For example, for a 3×3 convolution kernel, PE columns 4, 5 and 6 reuse the data output by feature data extractors 2, 3 and 4, respectively. For a 5×5 convolution kernel, PE columns 4 and 5 are connected to the corresponding feature data extractors 4 and 5, and PE column 6 is connected to feature data extractor 2. Because the kernel size is then not aligned with the number of PE columns, the input paths of the multiplexers must be rearranged after each row of convolution: in the first row period, 5+1 rows are processed at once (one row period of the 5×5 kernel over the feature map plus one more row, since when the kernel moves down by one row only one new feature row needs to be convolved), so that feature extractors 1 to 4 and kernel weight rows 2 to 5 are convolved on PE columns 1 to 4, while PE columns 5 and 6 reuse the data of feature extractors 1 and 2. Only part of these connections is shown in fig. 5; the remainder is omitted.
As shown in fig. 5, the weight parameters flow into the PE units serially in the column direction (three PE units are arranged per column to match the 3×3 convolution kernel), and the PE units of the same column are linked row to row by a group of delay registers through a multiplexer; this delay structure is used to adjust the sliding stride of the convolution window. Each column of PE units receives the feature data of a specific row through its input multiplexer and distributes it in parallel to every PE of the column. Because the weight parameters flow into the PEs serially, in the initial state the feature data must be delayed by 1 and 2 cycles before reaching the PEs of the 2nd and 3rd rows, respectively, before being convolved with the corresponding weight parameters. A group of feature data delay circuits is also provided inside each PE unit: it aligns the feature data with the weight parameters in the initial state, and, when the convolution kernel is larger than 3×3, allows the PE unit to reuse the corresponding feature data through the multi-input selection circuit, preventing address conflicts when reading data.
The feature data extractor obtains at least one row of contiguous feature row data from the feature cache circuit on the PU slice (since this embodiment uses a 3×3 convolution kernel, the feature extraction circuit requests 4 contiguous feature rows from the feature cache at a time in order to make full use of the multiplier resources in the PE array). Meanwhile, the weight acquirer obtains at least one row of weight parameters at a time (here, three rows of weight parameters are fetched directly and loaded into the weight acquirer). With 4 input feature rows and 3 convolution kernel rows, at least two convolution results in the column direction of the feature map are output in each kernel-row cycle.
In this embodiment, the feature extractors and weight acquirers form the data extractor, which may be built as a FIFO structure or as an addressing circuit. The extractors respectively fetch, row by row, the first row of the weights (F00, F01, F02) and the first feature row (X00, X01, X02, X03, X04, X05, X06, …), and fetch the other rows of feature data again after each row period.
In the initial state, in the first cycle F00 is fed to register 1 and the multiplier of PE00, and the feature element X00 is fed to PE00, PE10 and PE20. In PE00, X00 is multiplied by F00 and the result is stored in the partial-sum buffer; F00 will reach PE10 in the next cycle, and X00 is stored in feature register 1 of each PE. In the second cycle, X01 is sent to PE00, PE10 and PE20; X00 shifts to feature register 2 and X01 enters feature register 1. At the same time F01 is sent to PE00 and the F00 held in PE00 moves on to PE10. X01 and F01 are multiplied in PE00, and the result is added to the X00*F00 value buffered in the previous cycle; meanwhile, X01 is multiplied by F00 in PE10 and the result is stored in its partial-sum buffer. In the third cycle, F02 enters PE00, the F01 in PE00 enters PE10, and the F00 in PE10 enters PE20, while X02 is broadcast synchronously to every PE and multiplied and accumulated with the corresponding weight parameter. These three cycles complete the convolution of one weight row with the corresponding feature data; after four cycles the kernel row has slid once along the feature row, and after eight cycles the convolution results of the kernel row sliding 6 times along the corresponding feature row are obtained. Since the three rows of the 3×3 convolution kernel are input in parallel to the three PE columns, the results of 6 convolution sliding windows are produced within eight cycles, and once the initialization is over the entire multiply-add array is fully loaded.
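The cycle-by-cycle behaviour described above can be reproduced with a short simulation. The sketch below is an illustrative software model, not the hardware itself: the kernel row elements are shifted serially through three stages while each feature element is broadcast to all stages, and after the pipeline fills, six kernel-row partial sums are produced within eight cycles, matching the description.

```python
import numpy as np

X = np.random.rand(8)                      # feature row elements X00, X01, ...
F = np.random.rand(3)                      # kernel row elements F00, F01, F02

stages = [{"w": None, "acc": 0.0, "n": 0} for _ in range(3)]   # PE00, PE10, PE20
completed = []                             # (cycle, stage, kernel-row partial sum)

for cycle, x in enumerate(X):
    # the kernel row streams serially: each weight moves one stage per cycle,
    # and a new element (cyclic F00/F01/F02) enters the first stage
    for s in (2, 1):
        stages[s]["w"] = stages[s - 1]["w"]
    stages[0]["w"] = F[cycle % 3]
    # the feature element is broadcast to every active stage in the same cycle
    for s, st in enumerate(stages):
        if st["w"] is None:
            continue
        st["acc"] += st["w"] * x
        st["n"] += 1
        if st["n"] == 3:                   # one kernel-row period finished
            completed.append((cycle, s, st["acc"]))
            st["acc"], st["n"] = 0.0, 0

# stage 0 finishes window 0 first; its result equals the direct dot product
assert np.isclose(completed[0][2], np.dot(X[0:3], F))
print(len(completed), "window partial sums after", len(X), "cycles")   # 6 after 8
```

Running the same loop with longer feature rows simply keeps emitting one partial sum per stage every three cycles, which is the full-load state mentioned above.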
In another embodiment, as shown in fig. 6, the convolution kernel is 5×5 with a stride of 1, and the 3 PE stages accelerate in parallel the kernel-row convolutions of three convolution windows along the feature map row. When PE00 completes one kernel-row convolution it starts the kernel-row convolution of the fourth window (the kernel-row convolutions of the second and third windows being accelerated by PE10 and PE20 of the same group). At this point the feature data address for PE00 must start from X03, whereas PE10 and PE20 need to access X05, so the feature data extractor would face conflicting accesses to X03 and X05. To avoid the conflict, X03 can be obtained through the feature data register selection circuit (the feature data X04, X03, X02, X01 has been cached during the preceding 4 clock cycles), i.e., X03 is taken from register 2. When the register selection signals of the 3 PEs become consistent again, the address pointer of the feature data extractor is updated; for example, in the 8th cycle (where every PE needs X05 and the addresses accessed in the two following cycles are also common to all PEs) the address pointer is updated to X05, and after the next kernel-row cycle (for example, at the 11th clock cycle) the feature data selection circuit is reconfigured. Through this configuration, full-load computation of the 5×5 convolution kernel can be achieved flexibly.
In another embodiment, as shown in fig. 7, the convolution kernel is 5×5 with a stride of 2. After 5 cycles every PE has been loaded with data, but PE00 has completed the kernel-row operation of one convolution window; the row address of its next window is X06, while the feature extractor address pointer is still occupied by PE10 and PE20 and points to X05 (which must be fetched from the feature data cache circuit). PE00 therefore performs the kernel-row convolution of the fourth window, i.e., X06*F00, only when the feature extractor address pointer reaches X06. Whenever the feature data needed by one PE stage conflicts with the extractor address currently serving another PE, the PE with the conflict stays idle in that cycle and performs its multiplication in the following cycle, once the extractor address supplies the corresponding feature data. As a result, some PE units are not fully loaded during part of the address-conflict period; nevertheless, for a 5×5 convolution kernel with a stride of 2, the PE utilization is (3×6 − 3) / (3×6) × 100% ≈ 83%, i.e., (number of PE units × iteration cycles − idle PE-cycles per iteration period) / (number of PE units × iteration cycles). Here the iteration period runs from the first idle cycle of PE00 (cycle 6) to cycle 11 after initialization, i.e., the idle pattern of each PE repeats every 6 cycles, and the idle count is the total number of idle PE-cycles of the group within that period. For convolutions with a stride greater than 1, the larger the kernel size, the higher the overall PE utilization.
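The quoted utilization figure follows directly from the numbers in the text, as the short calculation below shows (the variable names are illustrative).

```python
pe_stages = 3
iteration_cycles = 6       # the idle pattern repeats every 6 cycles after initialization
idle_pe_cycles = 3         # one idle cycle per PE stage within each iteration period
utilization = (pe_stages * iteration_cycles - idle_pe_cycles) / (pe_stages * iteration_cycles)
print(f"{utilization:.0%}")   # -> 83%
```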
As shown in fig. 8, the present embodiment further provides a neural network acceleration method, including the following steps:
Step 602: receive and store, in the main memory, the feature map data and weight data of the image to be processed;
Step 604: parse the compiled instructions of the neural network program with the main controller, and generate configuration information and operation instructions according to the structural parameters of the neural network;
Step 606: cache, in the feature data caching unit of the data caching module, the feature row data extracted from the feature map data, and cache, in the convolution kernel caching unit of the data caching module, the convolution kernel data extracted from the weight data;
Step 608: use the data controller of the neural network computing module to switch the data paths according to the configuration information and instruction information and to direct the data streams extracted by the data extractor of the neural network computing module into the corresponding neural network computing units according to the instruction information, each computing unit completing at least the convolution of one convolution kernel with the feature map data and accumulating a plurality of convolution results within at least one cycle, thereby realizing circuit reconfiguration and data reuse;
Step 610: accumulate, with the accumulator, the convolution results over the feature maps of the multiple input channels obtained by the convolution operation units, and output the output feature map data corresponding to the convolution kernel.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.