
CN110033086A - Hardware accelerator for neural network convolution algorithm - Google Patents


Info

Publication number
CN110033086A
Authority
CN
China
Prior art keywords
data
processing engine
weighted
matrix
hardware accelerator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910301389.6A
Other languages
Chinese (zh)
Other versions
CN110033086B (en)
Inventor
陈柏纲
许喆
丁雪立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NOVUMIND Ltd.
Original Assignee
Beijing Isomerism Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Isomerism Intelligent Technology Co Ltd
Priority to CN201910301389.6A
Publication of CN110033086A
Application granted
Publication of CN110033086B
Status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)

Abstract

A hardware accelerator for neural network convolution operations includes a processing engine matrix, a weight data buffer, a threshold data buffer, an input data buffer, and a control module. The control module reads input data from the input data buffer and sends it to the processing engines in the processing engine matrix that match the input data; it reads weight data from the weight data buffer and configures it to the processing engines that match the weight data; and it reads threshold data from the threshold data buffer and configures it to every processing engine. Each processing engine performs a convolution operation on its matched input data and weight data to obtain an intermediate result, compares the intermediate result with the threshold data, and, based on the comparison, selects whether to output the intermediate result or the regularized result obtained by regularizing it. The hardware accelerator processes input tensor data at high speed and can flexibly handle input tensor data of different dimensions.

Description

Hardware accelerator for neural network convolution algorithm
Technical field
The present disclosure relates to tensor processors for neural network convolution operations, and more particularly to a hardware accelerator for neural network convolution operations.
Background
Neural networks build model structures by simulating the neural connection structure of the human brain, and are a current focus of academic research and corporate R&D. Today's neural networks, especially the convolutional neural networks used for image processing and object recognition, must process large amounts of data expressed as third-order or higher-order tensors, and must also handle tensor data of different shapes and sizes. A dedicated neural network computing device capable of processing third-order or higher-order tensor data of different shapes at high speed is therefore needed. In addition, a binarized neural network is a neural network whose weight values and/or input data have been binarized. No high-precision computing device dedicated to binarized neural networks currently exists.
Summary of the invention
Accordingly, there is a need for a dedicated neural network computing device capable of processing third-order or higher-order tensor data at high speed, and also for a high-precision computing device for binarized neural networks. To this end, the disclosure provides a tensor processor comprising multiple processing engines (Processing Engines, hereinafter PEs) and a ping-pong controller connected to the PEs. According to actual needs (for example, based on information such as the dimensions of the input tensor data and the dimensions of the convolution kernel), the tensor processor determines the number of PEs to call and the dimensions of the two-dimensional array formed by the called PEs, and calls all or part of the PEs to form a PE two-dimensional array. Further, the tensor processor configures the interconnections and data flows among the PEs of the array, and can also partition the input tensor data according to the dimensions of the PE two-dimensional array, thereby processing input tensor data at high speed while flexibly handling input tensors of different dimensions. For inference in binary neural networks, the tensor processor performs the convolution operation in hardware and additionally applies a threshold operation to the convolution results, realizing a binary neural network computing device that combines high speed with high precision.
According to one aspect of the disclosure, a hardware accelerator for neural network convolution operations is provided, comprising: an N×M processing engine matrix composed of multiple processing engines, where N and M are positive integers greater than or equal to 2 and N is not greater than M; a weight data buffer storing weight data; a threshold data buffer storing threshold data; an input data buffer storing input data; and a control module. The control module reads the input data from the input data buffer and sends it to the processing engines in the processing engine matrix that match the input data; the control module reads the weight data from the weight data buffer and configures it to the processing engines in the matrix that match the weight data; and the control module reads the threshold data from the threshold data buffer and configures it to every processing engine in the matrix. Each processing engine in the matrix performs a convolution operation on the input data and weight data matched to it to obtain an intermediate result, compares the intermediate result with the threshold data, and, according to the result of the comparison, selects whether to output the intermediate result or the regularized result obtained by regularizing the intermediate result.
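To make the per-engine behavior concrete, here is a minimal Python sketch of the compute-compare-select step described above. The function name, the one-row dot product standing in for the 1-D convolution, and the boolean switch are all illustrative assumptions, not the patent's hardware implementation:

```python
# Minimal sketch of one processing engine's step, assuming a PE holds one row
# of weights and one configured threshold (names are illustrative).
def pe_step(input_row, weight_row, threshold, emit_regularized=True):
    # Convolve the matched input data with the matched weight data
    # (a plain dot product over one row stands in for the 1-D convolution).
    psum = sum(x * w for x, w in zip(input_row, weight_row))
    # Compare the intermediate result with the threshold and select either
    # the raw intermediate result or the regularized (binarized) result.
    if emit_regularized:
        return 1 if psum >= threshold else 0
    return psum

# Example: a 3-tap row segment against threshold 1.
print(pe_step([1, 0, 1], [1, 1, -1], threshold=1))  # -> 0 (psum = 0 < 1)
```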
Brief description of the drawings
Embodiments of the disclosure have other advantages and features that will become more readily apparent from the following detailed description and the appended claims when read in conjunction with the accompanying drawings, in which:
Fig. 1 shows the architecture of a tensor processor of one embodiment, including an input/output bus, a ping-pong controller, and a PE two-dimensional array.
Fig. 2 shows the architecture and data flows of the PE two-dimensional array of a tensor processor of one embodiment.
Fig. 3 shows the architecture and data flows of the PE two-dimensional array of a tensor processor of another embodiment.
Fig. 4 shows the PE two-dimensional array of the embodiment of Fig. 3 deriving the first row of the operation result.
Fig. 5 shows the PE two-dimensional array of the embodiment of Fig. 3 deriving the second row of the operation result.
Fig. 6 shows the PE two-dimensional array of the embodiment of Fig. 3 deriving the third row of the operation result.
Fig. 7 shows a tensor processor of one embodiment configuring the PEs of its PE two-dimensional array to fit the dimensions of the input data matrix.
Fig. 8 shows a tensor processor of one embodiment partitioning the input data matrix to fit the dimensions of its PE two-dimensional array.
Fig. 9 shows the first way of matching convolution kernels with image data inputs after a tensor processor of one embodiment partitions the input data matrix.
Fig. 10 shows the second way of matching convolution kernels with image data inputs after a tensor processor of one embodiment partitions the input data matrix.
Fig. 11 shows the third way of matching convolution kernels with image data inputs after a tensor processor of one embodiment partitions the input data matrix.
Fig. 12 shows a tensor processor of one embodiment configuring data by multicast.
Fig. 13 shows the data flow of a tensor processor of one embodiment performing a fully connected operation.
Fig. 14 shows the PE configuration parameters of a tensor processor of one embodiment.
Fig. 15 shows a PE of a tensor processor of one embodiment performing binary neural network convolution and threshold operations.
Fig. 16 shows the architecture of a tensor processor of another embodiment.
Fig. 17 shows the architecture of the ping-pong controller of a tensor processor of one embodiment.
Detailed description
The accompanying drawings and the following description are by way of example only. It should be appreciated from the following discussion that alternative embodiments of the structures and methods disclosed herein will readily be recognized as viable alternatives that may be employed without departing from the claimed principles.
Referring to Fig. 1, a tensor processor of one embodiment includes an input/output bus 100, multiple PEs, and a ping-pong controller 102 connected to the PEs. The input/output bus 100 receives input data from outside (for example, image data expressed as a third-order or higher-order tensor, or a feature tensor containing image feature values), transfers the input data to the ping-pong controller 102, and outputs to the outside the output data it receives from the ping-pong controller 102. The input/output bus 100 can also receive convolution kernel data from outside (the kernel data can be a set of weight values or a single weight value, or a kernel tensor). In other embodiments, the kernel data can come from the tensor processor itself; for example, the kernel data is pre-stored in the PE configuration parameter registers of the ping-pong controller 102 (not shown). The ping-pong controller 102 determines the number of PEs to call and the dimensions of the two-dimensional array they form according to information about the input data and the kernel data (for example, the dimensions of each), then calls all or part of the PEs connected to it to form the PE two-dimensional array. In the embodiment shown in Fig. 1, the ping-pong controller 102 has determined a 4-row, 4-column two-dimensional array composed of 16 PEs (any uncalled PEs are not shown). Further, the ping-pong controller 102 configures the interconnections and data flows among the 16 PEs (for example, as shown in Fig. 1, operation results are passed vertically downward to the PEs of the bottom row). In other embodiments, the ping-pong controller 102 determines N times M PEs (N and M positive integers greater than or equal to 2) forming an array of N rows and M columns, M rows and N columns, or other dimensions, and configures the interconnections and data flows among the N times M PEs (including, but not limited to, the flow of image data, the flow of weight data, and the flow of operation results).
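As a rough illustration of how a controller might size the PE array from the input and kernel dimensions, consider the sketch below. The sizing rule (one chain of kernel_h PEs per output row, modeled on the row mapping of Figs. 3 to 6) is an assumption; the patent leaves the exact policy open:

```python
def plan_pe_array(image_h, image_w, kernel_h, kernel_w, max_rows=12, max_cols=14):
    """Sketch: derive a PE-array shape from input and kernel dimensions.
    Assumption patterned on Figs. 3-6: one chain of kernel_h PEs per output
    row, so the array has (image_h - kernel_h + 1) rows and kernel_h columns."""
    out_rows = image_h - kernel_h + 1
    out_cols = image_w - kernel_w + 1   # length of each chain's output row
    rows, cols = out_rows, kernel_h
    if rows > max_rows or cols > max_cols:
        raise ValueError("partition the input to fit the physical array")
    return rows, cols, out_cols

print(plan_pe_array(5, 5, 3, 3))  # -> (3, 3, 3): the 3x3 array of Fig. 3
```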
With continued reference to Fig. 1, the ping-pong controller 102 configures the kernel data to the PEs of the PE two-dimensional array and also transfers the input data to them. A specific PE in the array obtains an operation result by performing a convolution on the input data transferred to it and the kernel data configured to it. Notably, configuring the kernel data to a PE occurs before transmitting input data to it; that is, the kernel data is configured first and the input data is transmitted afterwards. Because kernel data and weight values are highly reused in convolution, pre-configuring them allows input data such as image feature values to stream continuously into the PE two-dimensional array for computation, increasing the amount of data the tensor processor handles per batch. In other embodiments, configuring the kernel data can also occur at the same time as, or after, transmitting the input data. According to the configured interconnections and data flows among the PEs, the ping-pong controller 102 selects the operation results of certain PEs as the output. In the embodiment shown in Fig. 1, the ping-pong controller 102 selects the operation results of the four bottom-row PEs as the output. In other embodiments, the PE two-dimensional array can, as needed, have different dimensions or architectures, the interconnections and data flows among the PEs can be configured differently, and any PE in the array may be designated to provide output in a given architecture. In addition, according to some embodiments of the disclosure, the ping-pong controller 102 can transmit input data and perform computation while configuring a new convolution kernel (or new weight data), accelerating the tensor processor: within the array, the controller can configure new kernel data to one part of the PEs while feeding input data to another part. That is, while the kernel data of some PEs is being updated, the kernel data of the other PEs remains unchanged and those PEs continue computing, speeding up processing.
With continued reference to Fig. 1, the operation result a PE obtains after convolving the input data transferred to it with the kernel data (or weight data) configured to it can be the intermediate result Psum of the neural network inference, or the regularized result obtained by regularizing Psum. For a binarized convolutional neural network (Binary CNN), the PE's result can be the intermediate result Psum, or a 1-bit result of 1 or 0 after regularization. In some embodiments, the intermediate result Psum a PE obtains is transferred to the ping-pong controller 102. In other embodiments, the PE two-dimensional array operates as a fully connected layer, and the intermediate result Psum does not need to be transferred to the ping-pong controller 102. That is, Psum is not read and written through the ping-pong controller 102; instead, the accumulation and forwarding of Psum is completed directly among the PEs of the array. A PE two-dimensional array configured as a fully connected layer supports both fully connected operations and convolutions. Moreover, because such an array does not read and write Psum through the ping-pong controller 102 but completes the computation inside the array, latency is reduced, which benefits high-speed inference. The ping-pong controller 102 can adjust the interconnections and data flows among the PEs as needed, controlling whether Psum is read and written through the controller and thereby switching between fully connected operation and convolution. In addition, according to some embodiments of the disclosure, by configuring the PEs, the pooling layer of the neural network model can be fused with the convolution; that is, the configured PE carries out the pooling operation itself.
Referring to Fig. 2, the data flows of the PE two-dimensional array of a tensor processor of one embodiment include, but are not limited to, the flow of image data, the flow of weight data, and the flow of operation results. The embodiment shown in Fig. 2 provides a 3-row, 3-column PE two-dimensional array composed of nine PEs. Image data enters at the leftmost column of the array and is passed, PE by PE, to the adjacent PE on the right in the same row, i.e., from the first column to the second column and then to the third column. Weight data also enters at the leftmost column, but is passed diagonally, PE by PE, to the nearest PE to the upper right, which is neither in the same row nor in the same column. The operation result each PE produces is passed vertically downward, PE by PE, to the nearest PE below in the same column, i.e., from the first row to the second row and then to the third row. The data flows shown in Fig. 2 merely illustrate that the tensor processor can control the flows of image data, weight data, and operation results separately: when the tensor processor configures the interconnections and data flows among the PEs, these three flows can be configured independently. The embodiment of Fig. 2 merely illustrates one way of configuring the data flows among the PEs of a 3-row, 3-column PE two-dimensional array and should not be taken to limit the disclosure with respect to other possible configurations of the array.
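The three flows of Fig. 2 can be summarized as per-hop moves on the grid. A small sketch traces where a datum injected at the leftmost column travels; the (row, col) coordinates and the hop rules are read off the figure description and are illustrative only:

```python
# Sketch of the three data-flow directions of Fig. 2 on a 3x3 array.
# Coordinates are (row, col) with (0, 0) at the top-left.
HOPS = {
    "image":  (0, +1),   # image data moves right along its row
    "weight": (-1, +1),  # weight data moves to the nearest upper-right PE
    "result": (+1, 0),   # partial results move down the column
}

def trace(kind, start, n=3):
    r, c = start
    dr, dc = HOPS[kind]
    path = []
    while 0 <= r < n and 0 <= c < n:
        path.append((r, c))
        r, c = r + dr, c + dc
    return path

print(trace("image",  (1, 0)))  # [(1, 0), (1, 1), (1, 2)]
print(trace("weight", (2, 0)))  # [(2, 0), (1, 1), (0, 2)]
print(trace("result", (0, 1)))  # [(0, 1), (1, 1), (2, 1)]
```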
The PE two-dimensional arrays shown in Fig. 1, Fig. 2, and the other embodiments of the disclosure merely illustrate possible interconnections and data flows among PEs and should not be taken to limit the disclosure with respect to other possible configurations of the PE two-dimensional array. The relative relationships such as up, down, left, and right among the PEs mentioned in the embodiments, position information such as the PE of a given row and column, and statements such as "the leftmost column" or "the bottom row" serve only to facilitate the description of the interconnections and data flows among PEs; they should not be construed as requiring the PEs to be arranged in exactly those relative positions, nor as limiting other possible configurations of the array. Likewise, the arrows indicating the various data flows in the drawings are provided only to facilitate the description of the data flows among PEs and should not be taken to limit other possible configurations of the PE two-dimensional array.
Referring to Figs. 3 to 6, Fig. 3 shows the tensor processor of another embodiment, and Figs. 4 to 6 show an input image data matrix of 5 rows and 5 columns and a weight data matrix of 3 rows and 3 columns. The tensor processor includes a 3-row, 3-column two-dimensional array composed of nine PEs, numbered PE1 to PE9. Fig. 3 also shows the interconnections and data flows among the nine PEs, including the flow of image data, the flow of weight data, and the flow of operation results. Image data is transferred to the corresponding PEs in the manner shown in Fig. 3. Specifically, each PE corresponds to one row of image data: row 1 of the image data is transferred to PE1; row 2 to PE2 and PE4; row 3 to PE3, PE5, and PE7; row 4 to PE6 and PE8; and row 5 to PE9. Weight data is configured to the corresponding PEs in the manner shown in Fig. 3. Specifically, each PE corresponds to one row of weight data: row 1 of the weight data is configured to PE1, PE4, and PE7; row 2 to PE2, PE5, and PE8; and row 3 to PE3, PE6, and PE9. Operation results are accumulated and forwarded in the manner shown in Fig. 3. Specifically, the result of PE1 is accumulated and passed to PE2, and then further to PE3, finally yielding row 1 of the convolution output; the result of PE4 is accumulated and passed to PE5, and then to PE6, yielding row 2 of the convolution output; and the result of PE7 is accumulated and passed to PE8, and then to PE9, yielding row 3 of the convolution output.
Referring to Figs. 3 and 4, taking a binary convolutional neural network as an example, PE1 performs a binary convolution on row 1 of the input image data and row 1 of the configured weight data; PE2 on row 2 of the image data and row 2 of the weight data; and PE3 on row 3 of the image data and row 3 of the weight data. The result of PE1 is accumulated and passed to PE2, and then further to PE3, finally yielding row 1 of the binary neural network convolution output. A given PE's convolution of its input image data with its configured weight data may complete before, after, or at the same time as it receives the accumulated result forwarded by another PE; the PE then adds its completed convolution result to the forwarded result and passes the sum on to the third PE.
Referring to Figs. 3 and 5, again taking a binary convolutional neural network as an example, PE4 performs a binary convolution on row 2 of the input image data and row 1 of the configured weight data; PE5 on row 3 of the image data and row 2 of the weight data; and PE6 on row 4 of the image data and row 3 of the weight data. The result of PE4 is accumulated and passed to PE5, and then further to PE6, finally yielding row 2 of the binary neural network convolution output. As before, each PE may complete its convolution before, after, or at the same time as it receives the forwarded result, then adds its own result and passes the sum on.
Referring to Figs. 3 and 6, PE7 performs a binary convolution on row 3 of the input image data and row 1 of the configured weight data; PE8 on row 4 of the image data and row 2 of the weight data; and PE9 on row 5 of the image data and row 3 of the weight data. The result of PE7 is accumulated and passed to PE8, and then further to PE9, finally yielding row 3 of the binary neural network convolution output.
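The row mapping of Figs. 3 to 6 amounts to decomposing the 2-D convolution into 1-D row convolutions and accumulating along each PE chain. The short Python sketch below reproduces the three output rows; the pure-software model and the hard-coded row assignment are illustrative stand-ins for the hardware PEs:

```python
def conv1d(x, w):
    """Valid 1-D convolution (correlation) of one image row with one weight row."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def conv2d_row_mapped(image, weights):
    """Sketch of the Fig. 3-6 mapping: chain r accumulates the 1-D
    convolutions of image rows r..r+2 with weight rows 0..2."""
    kh = len(weights)
    out = []
    for r in range(len(image) - kh + 1):       # one PE chain per output row
        acc = conv1d(image[r], weights[0])     # first PE in the chain
        for j in range(1, kh):                 # accumulate-and-forward
            part = conv1d(image[r + j], weights[j])
            acc = [a + p for a, p in zip(acc, part)]
        out.append(acc)
    return out

image = [[(r * 5 + c) % 2 for c in range(5)] for r in range(5)]  # 5x5 toy input
weights = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]                      # 3x3 kernel
print(conv2d_row_mapped(image, weights))  # 3x3 output, one row per PE chain
```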
Referring to Figs. 3 to 6, the input image data matrix is 5 rows by 5 columns and the weight data matrix is 3 rows by 3 columns; the tensor processor is configured with a 3-row, 3-column PE two-dimensional array of nine PEs, with the interconnections and data flows among them configured as described. Further, the tensor processor inputs a given row of the image data to a specific PE and also configures a given row of the weight data to that same PE. The PE convolves its input image data with its configured weight data and outputs an operation result. The results of multiple PEs, accumulated and forwarded in the specified manner, yield one row of the neural network convolution output. In other embodiments, the PE two-dimensional array can have different dimensions or sizes; for example, it can be 12 by 14. In other embodiments, the tensor processor adjusts the size of the PE two-dimensional array according to information about the input image data matrix and the weight data matrix (for example, the matrix dimensions). The embodiments of Figs. 3 to 6 merely illustrate one architecture of a PE two-dimensional array and one way of configuring it, and should not be taken to limit the disclosure with respect to other possible architectures and configurations. In other embodiments, the convolution kernel (or weight data matrix) used for the convolution can be 3 by 3, or alternatively 1 by 1, 5 by 5, or 7 by 7.
According to other embodiments of the disclosure, by configuring the size and architecture of the PE two-dimensional array and the interconnections and data flows among the PEs, the image data used for a convolution can be input synchronously to multiple PEs, and the weight data configured for the convolution can likewise be sent synchronously to multiple PEs, thereby optimizing data transmission. According to some embodiments, and with reference to the architecture of Fig. 3, PE3, PE5, and PE7 can synchronously receive row 3 of the image data from the ping-pong controller or from a buffer outside the PE two-dimensional array, and PE1, PE4, and PE7 can synchronously receive row 1 of the weight data from the ping-pong controller or from a buffer outside the array. Figs. 3 to 6 merely illustrate one architecture and one configuration of a PE two-dimensional array and should not be taken to limit other possible architectures and configurations.
In the embodiments of Figs. 3 to 6, the result of the first PE is accumulated and passed to the second PE, and then further to the third PE. In other embodiments, after the result of the first PE has been accumulated and passed to the second PE, the second PE, once its convolution finishes, passes the accumulated result back to the first PE rather than on to a third PE. The first PE then receives new input image data, optionally configures new weight data (or keeps the configured weight data unchanged), convolves the new image data, and outputs the result.
In the embodiments of Figs. 3 to 6, a binary convolutional neural network is used as the example: each PE performs a binary convolution on its input image data and configured weight data. In other embodiments, a PE can perform a fully connected operation on its input image data and configured weight data. In other embodiments, the tensor processor can be used for inference in non-binary convolutional neural networks, for example networks whose data type is INT4, INT8, INT16, or INT32, with each PE performing the convolution corresponding to the network's data type on its input image data and configured weight data.
Referring to Fig. 7, the PE two-dimensional array of a tensor processor of one embodiment is a 12-row, 14-column matrix array, and the input data matrix is 3 rows by 13 columns. The tensor processor adjusts the PE two-dimensional array so that the unused PEs are placed in an inactive state to reduce power consumption.
Referring to Fig. 8, the PE two-dimensional array of a tensor processor of one embodiment is a 12-row, 14-column matrix array, and the input data matrix is 5 rows by 27 columns. The tensor processor partitions the input data matrix into two input data matrices of 5 rows by 14 columns and 5 rows by 13 columns, to fit the dimensions of the PE two-dimensional array.
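A minimal sketch of the two adjustments of Figs. 7 and 8 follows: deactivate surplus PEs when the input is smaller than the array, or split the input by columns when it is wider. The column split below deliberately ignores any halo columns a convolution at the seam would need, since the patent does not detail that handling:

```python
def fit_input_to_array(in_rows, in_cols, pe_rows=12, pe_cols=14):
    """Sketch: report inactive PEs if the input fits (Fig. 7); otherwise split
    the input into column chunks of at most pe_cols (Fig. 8). Seam/halo
    handling is deliberately omitted here."""
    if in_rows <= pe_rows and in_cols <= pe_cols:
        inactive = pe_rows * pe_cols - in_rows * in_cols
        return {"chunks": [(in_rows, in_cols)], "inactive_pes": inactive}
    chunks, c = [], in_cols
    while c > 0:
        w = min(c, pe_cols)
        chunks.append((in_rows, w))
        c -= w
    return {"chunks": chunks, "inactive_pes": 0}

print(fit_input_to_array(3, 13))  # Fig. 7: one 3x13 chunk, surplus PEs idle
print(fit_input_to_array(5, 27))  # Fig. 8: chunks (5, 14) and (5, 13)
```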
With continued reference to Figs. 7 and 8, according to some embodiments of the disclosure, the tensor processor can use information about the input image data matrix and the weight data matrix (or convolution kernel), such as the matrix dimensions, to determine the dimensions (or size) of the PE two-dimensional array as well as the interconnections and data flows among the PEs. The tensor processor can also partition the input data matrix according to the determined dimensions of the PE two-dimensional array, and, if desired, can re-adjust the dimensions of a previously determined array. The tensor processor of the disclosure can therefore, while keeping the dimensions of the current PE two-dimensional array unchanged, gain the flexibility to handle input data matrices of different dimensions by partitioning them. Since the tensor data a neural network must process unfold into data matrices of different dimensions, this flexibility is conducive to high-speed neural network inference. On the other hand, when the dimensions of the input data matrices remain fairly consistent, or according to other actual needs, the tensor processor can re-adjust the previously determined PE two-dimensional array according to information such as the dimensions of the input data matrix, selecting array dimensions, interconnections, and data flows better suited to the current input. For example, referring to Figs. 3 to 6, when the input image data matrix is 5 rows by 5 columns and the weight data matrix is 3 rows by 3 columns, the tensor processor configures a 3-row, 3-column PE two-dimensional array of nine PEs, achieving high-speed inference on the input image data matrix. According to some embodiments of the disclosure, the tensor processor both partitions the input image data matrix and adjusts the dimensions and other configurations of the current PE two-dimensional array, which is conducive to high-speed processing of complex and varied input tensor data.
Referring to Fig. 9, a tensor processor of one embodiment adopts the first way of matching convolution kernels with image data inputs after partitioning the input data matrix. In the first way, the same convolution kernel is applied to different image data inputs. As shown in Fig. 9, the image data of the first row differs from the image data of the second row, while the convolution kernel (weight data) of the first row and that of the second row are the same filter, i.e., the same convolution kernel. The output results of the first and second rows both enter channel 1.
Referring to Fig. 10, a tensor processor of one embodiment adopts the second way of matching convolution kernels with image data inputs after partitioning the input data matrix. In the second way, the same image data input corresponds to different convolution kernels. As shown in Fig. 10, the convolution kernel (weight data) of the first row differs from that of the second row, while the image data of the first and second rows are identical. The output results of the first and second rows both enter channel 1.
Referring to Fig. 11, a tensor processor of one embodiment adopts the third way of matching convolution kernels with image data inputs after partitioning the input data matrix. In the third way, the two different image data inputs produced by the partition are fed to two different convolution kernels. As shown in Fig. 11, the convolution kernel (weight data) of the first row differs from that of the second row, and the image data of the first row differs from that of the second row. The output result of the first row enters channel 1, and the output result of the second row enters channel 2.
Referring to Fig. 12, a tensor processor of one embodiment configures data by multicast to optimize data transmission. Multicast means that a single read operation can read data from the ping-pong controller, or from a buffer outside the PE two-dimensional array, and send it concurrently to multiple PEs. In other words, a variable number of PEs can receive a new data configuration by multicast within one instruction cycle, so the tensor processor can configure the same data into multiple PEs in one instruction cycle. The multiple PEs configured with the same data by multicast can be located in the same row or the same column of the PE two-dimensional array, or occupy any combination of positions within it. For example, referring to Figs. 3 and 12, the tensor processor configures row 1 of the weight data by multicast to PE1, PE4, and PE7 simultaneously; row 2 to PE2, PE5, and PE8 simultaneously; and row 3 to PE3, PE6, and PE9 simultaneously. The data configured by multicast can include convolution kernels for convolution operations and weight data for neural network inference. According to some embodiments of the disclosure, the data configured by multicast can also include the threshold values required for the threshold operation of binary neural network convolution; the threshold data used for configuration can be trained threshold values. The tensor processor transmits input data to the PE two-dimensional array for convolution only after the convolution kernels (or weight data) and thresholds have been configured by multicast; that is, input data such as image features is fed into the PE array for computation during actual operation. In this way, after the trained thresholds are configured into the PE two-dimensional array, the tensor processor continuously configures the input data matrices and weight data matrices to the corresponding PEs by multicast for convolution, achieving faster inference.
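A toy sketch of the multicast step follows: one read from a buffer, fanned out to an arbitrary set of PEs in a single cycle. The dictionary-of-PEs container and method names are illustrative assumptions, not the hardware mechanism:

```python
# Toy model of multicast configuration: one buffer read is fanned out to an
# arbitrary set of PEs in a single step.
class PE:
    def __init__(self):
        self.weights = None

def multicast_configure(pes, targets, data):
    for t in targets:              # same data, many PEs, one instruction cycle
        pes[t].weights = data

pes = {f"PE{i}": PE() for i in range(1, 10)}
multicast_configure(pes, ["PE1", "PE4", "PE7"], data=[1, 0, 1])  # weight row 1
multicast_configure(pes, ["PE2", "PE5", "PE8"], data=[0, 1, 0])  # weight row 2
multicast_configure(pes, ["PE3", "PE6", "PE9"], data=[1, 1, 1])  # weight row 3
print(pes["PE4"].weights)  # [1, 0, 1]
```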
Referring to Fig. 13, a tensor processor of one embodiment has an architecture supporting fully connected operations. By adjusting the interconnections and data flows within the PE two-dimensional array, the tensor processor realizes the data streams of the fully connected layers of a neural network.
Fig. 14 shows the parameter table a tensor processor of one embodiment assigns to each PE. Each row of the input image feature data is assigned a feature_row_id, and each row of the weight data is assigned a weight_row_id. Each PE is assigned a weight_row_id_local and a feature_row_id_local associated with that PE. When weight data is configured, the weight_row_id carried by the weight data is compared with the PE's weight_row_id_local; the PE accepts the configured weight values if they match and rejects them otherwise. When input image feature data is configured, the feature_row_id carried by the input is compared with the PE's feature_row_id_local; the PE accepts the configured input image feature data if they match and rejects it otherwise. The tensor processor computes the assigned feature_row_id, weight_row_id, feature_row_id_local, and weight_row_id_local from information about the input image feature data and the weight data (or convolution kernel). For example, image feature data of dimension 3 has three dimensions (width, height, and depth), and convolution kernel data of dimension 4 has four dimensions (width, height, depth, and the number of kernels). From the three dimensions of the image feature data and the four dimensions of the kernel data, the tensor processor can compute the number of PEs to call and the dimensions of the two-dimensional array they form, and then compute the feature_row_id assigned to each row of the image feature data, the weight_row_id assigned to each row of the kernel data, and the feature_row_id_local and weight_row_id_local assigned to the PEs.
Referring to Figs. 3, 12, and 14, the tensor processor configures row 1 of the weight data by multicast to PE1, PE4, and PE7. The tensor processor compares the weight_row_id of weight data row 1 with the weight_row_id_local of each PE. Only PE1, PE4, and PE7 have a weight_row_id_local matching the weight_row_id of row 1, so only PE1, PE4, and PE7 accept weight data row 1.
With continued reference to Fig. 14, among the configuration parameters of a specific PE, the parameter model_set sets the PE's operating mode to RGB operation or fully binary operation; Psum_set sets whether the PE receives another PE's operation result and adds it to its own; Pool_en sets whether the PE performs the pooling operation itself after the convolution; Dout_on sets whether the PE outputs its operation result; and Row_on sets whether the PE participates in computation. The parameter K_num specifies the dimensions of the convolution kernel participating in the convolution, and Psum_num specifies the number of intermediate results Psum to accumulate. By pre-configuring the PEs' configuration parameters, the tensor processor can control each PE's operating mode and working state and control the interconnections and data flows among the PEs. According to other embodiments of the disclosure, the configuration parameters in the PEs can be re-adjusted as needed, thereby re-adjusting the interconnections and data flows among the PEs. Further, because a PE decides whether to accept input data or weight data by matching its pre-assigned parameters against the parameters carried by the data, this matching mechanism improves the efficiency of configuring data by multicast. The tensor processor thus accelerates the tensor computation of a neural network by arranging the PEs as a two-dimensional matrix, optimizes data transmission and controls computation through the PEs' configuration parameters, and adjusts the dimensions of the tensors it can handle by partitioning the input matrix and adjusting the PE two-dimensional array, realizing a dynamically adaptive high-speed neural network computing unit. A sketch of these parameters as a record appears below.
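The per-PE parameter word of Fig. 14 can be pictured as a small record. The field names follow the patent; the types and defaults are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class PEConfig:
    # Field names follow Fig. 14; types and defaults are assumptions.
    model_set: str = "binary"      # operating mode: "rgb" or fully binary
    Psum_set: bool = False         # add another PE's forwarded result to ours
    Pool_en: bool = False          # perform pooling after the convolution
    Dout_on: bool = False          # drive this PE's result onto the output
    Row_on: bool = True            # whether this PE participates at all
    K_num: int = 3                 # kernel dimension used in the convolution
    Psum_num: int = 3              # number of intermediate results to accumulate
    weight_row_id_local: int = -1  # IDs used for multicast matching
    feature_row_id_local: int = -1

# Example: the last PE of a 3-PE chain accumulates and outputs the row result.
tail = PEConfig(Psum_set=True, Dout_on=True)
print(tail)
```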
The embodiment shown in Fig. 14 merely illustrates one possible combination of PE configuration parameters and should not be taken to limit other possible configurations of the PEs. According to some embodiments of the disclosure, from the information about the input image feature data and the convolution kernel, the tensor processor can compute, for each PE, a feature_row_id_local and feature_column_id_local for matching image feature data, and a weight_row_id_local and weight_column_id_local for matching the convolution kernel or weight data. Each piece of image feature data to be input to a PE is assigned a pair consisting of a feature_row_id and a feature_column_id, and each convolution kernel or piece of weight data is assigned a pair consisting of a weight_row_id and a weight_column_id. When image feature data is configured, the PE compares its feature_row_id_local with the data's feature_row_id and its feature_column_id_local with the data's feature_column_id; only when both comparisons match does the PE accept the image feature data, and it refuses the data if either comparison fails. Similarly, when a convolution kernel or weight data is configured, the PE compares its weight_row_id_local with the data's weight_row_id and its weight_column_id_local with the data's weight_column_id; only when both comparisons match does the PE accept the kernel or weight data, refusing it if either comparison fails.
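The accept/reject rule reduces to a simple predicate: a PE latches a broadcast datum only when every ID carried by the datum equals the PE's corresponding pre-assigned local ID. A sketch, with dictionaries standing in for the registers:

```python
def accepts(pe_local_ids, data_ids):
    """Accept only if every ID carried by the broadcast datum equals the PE's
    corresponding pre-assigned local ID (suffix '_local' on the PE side)."""
    return all(pe_local_ids.get(k + "_local") == v for k, v in data_ids.items())

pe5 = {"weight_row_id_local": 2, "weight_column_id_local": 1}
row2_weights = {"weight_row_id": 2, "weight_column_id": 1}
row3_weights = {"weight_row_id": 3, "weight_column_id": 1}
print(accepts(pe5, row2_weights))  # True  -> PE5 latches weight row 2
print(accepts(pe5, row3_weights))  # False -> PE5 refuses weight row 3
```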
Referring to Fig. 15, a tensor processor of one embodiment performs binary neural network convolution and threshold operations. Specifically, both the feature image data and the weight data have been binarized and are represented as 0 and 1. The binarized feature image data and weight data therefore each need only a single storage bit (representing them as 1 and -1 would require two bits: one bit to store the sign and another to store the value), saving a large amount of storage. Further, because the binarized feature image data and weight data are represented by a single bit 0 or 1, the multiplications of the neural network convolution can be replaced with XNOR (exclusive-NOR) logic gates, and the additions of the convolution can be replaced with a popcount operation. Popcount counts the number of bit positions in the result whose value is 1. For example, if the number of 1 bits among all bit positions of a 32-bit operand is a, then the number of 0 bits (each representing -1 in the 1 and -1 representation) is 32 - a, and the final result is a - (32 - a) = 2a - 32.
With continued reference to Fig. 15, take the convolution of 32 bits of feature image data with 32 bits of weight data as an example. The multiply-accumulate of the two 32-bit operands can be replaced with: XNOR the two 32-bit operands, then popcount the result. Specifically, as shown in Fig. 15, each image data bit and its corresponding weight data bit pass through an XNOR gate; the outputs of the multiple XNOR gates pass through a 32-bit popcount; and the outputs of multiple 32-bit popcounts are summed to obtain the 16-bit intermediate result Psum of the binary neural network convolution. Because gate operations and popcount replace the multiply-accumulate of the convolution, a large number of floating-point operations are avoided, and the tensor processor achieves accelerated binary neural network convolution. The accelerated binary convolution can be performed by any PE of the tensor processor's PE two-dimensional array, or by designated PEs. According to some embodiments of the disclosure, the tensor processor can also forgo gate operations and popcount and instead use the floating-point convolution of a general neural network.
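The XNOR/popcount substitution is easy to check in software. The sketch below packs 32 binarized values (bit 1 standing for +1, bit 0 for -1) into a Python int and verifies the 2a - 32 identity against an unpacked reference; masking to 32 bits is needed because Python's bitwise NOT is signed:

```python
import random

def binary_dot32(x_bits, w_bits):
    """Dot product of two 32-element {+1, -1} vectors packed as 32-bit words:
    XNOR, then popcount a, then 2*a - 32 (the identity derived above)."""
    xnor = ~(x_bits ^ w_bits) & 0xFFFFFFFF  # mask: Python ints are unbounded
    a = bin(xnor).count("1")                # popcount
    return 2 * a - 32

x = random.getrandbits(32)
w = random.getrandbits(32)
# Reference computation on the unpacked +/-1 vectors:
to_pm1 = lambda bits: [1 if (bits >> i) & 1 else -1 for i in range(32)]
ref = sum(a * b for a, b in zip(to_pm1(x), to_pm1(w)))
assert binary_dot32(x, w) == ref
print(binary_dot32(x, w))
```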
The tensor processor shown in Fig. 15 applies a threshold operation to the intermediate result Psum to improve computational precision. Specifically, the tensor processor compares the intermediate result Psum of the convolution with a trained threshold, and the output is either the 16-bit intermediate result Psum or the 1-bit 0 or 1 obtained after regularization.
Assume the result after convolution is a, and the batch normalization function is BatchNorm(a) = γ(a - μ)/σ + B, where μ is the mean, σ is the variance, γ is the scale coefficient, and B is the bias. Binarization, i.e., the sign function, is then applied:
Binarization: sign(x) = 1 if x ≥ 0, otherwise 0.
To simplify the computation, the batch normalization and the binarization are merged into a single threshold operation. Observe that BatchNorm(a) = 0 is the dividing point: when BatchNorm(a) is greater than or equal to 0 the result is 1, and otherwise the result is 0. Setting BatchNorm(a) = 0 gives a = μ - Bσ/γ. Denote this a as Tk, meaning BatchNorm(Tk) = γ(Tk - μ)/σ + B = 0. The output is therefore 1 when the convolution result is greater than or equal to Tk, and 0 otherwise.
Assume the training result of the batch normalization is γ = 4, μ = 5, σ = 8, B = 2. Then Tk = μ - Bσ/γ = 5 - (2×8)/4 = 1.
Simplified calculation before merging: when the convolution result is 0, substituting into the batch normalization formula gives 4×(0 - 5)/8 + 2 = -0.5, which is less than 0, so the result is 0. When the convolution result is 2, substituting gives 4×(2 - 5)/8 + 2 = 0.5, which is greater than 0, so the result is 1.
Calculation after merging: the threshold Tk = 1. When the convolution result is 0, it is less than Tk, so the result is 0. When the convolution result is 2, it is greater than Tk, so the result is 1. The outputs before and after the simplification are therefore consistent.
Therefore, merging the batch normalization and binarization into a threshold operation yields results consistent with the unsimplified computation, while saving a large number of floating-point operations. After the training of the binary convolutional neural network completes, Tk is derived from the formula Tk = μ - Bσ/γ, and thereafter the convolution result need only be compared with Tk. Because the convolution result is compared with the trained threshold Tk, the tensor processor does not need to perform a large number of floating-point operations, which both improves computational precision and accelerates neural network inference. The threshold operation can be performed by any PE of the tensor processor's PE two-dimensional array, or by designated PEs.
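The merged-threshold equivalence is straightforward to verify numerically. This sketch re-runs the worked example above; note the equivalence as written relies on γ/σ being positive (a negative scale would flip the comparison direction), an assumption implicit in the derivation:

```python
def batchnorm(a, gamma, mu, sigma, B):
    return gamma * (a - mu) / sigma + B

def sign_bit(x):
    return 1 if x >= 0 else 0

def threshold_tk(gamma, mu, sigma, B):
    # Solving BatchNorm(a) = 0 for a gives the merged threshold Tk.
    return mu - B * sigma / gamma

gamma, mu, sigma, B = 4, 5, 8, 2       # training result from the example above
Tk = threshold_tk(gamma, mu, sigma, B)
print(Tk)                               # 1.0
for psum in (0, 2):
    merged = 1 if psum >= Tk else 0
    full = sign_bit(batchnorm(psum, gamma, mu, sigma, B))
    assert merged == full               # holds whenever gamma/sigma > 0
    print(psum, "->", merged)           # 0 -> 0, 2 -> 1
```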
According to some embodiments of the disclosure, the tensor processor replaces the multiply-accumulate of the binary neural network convolution with gate operations and popcount, and then, in a threshold operation, compares the convolution results with trained thresholds to improve precision, realizing a binary neural network tensor computation device that combines high speed with high precision. In other embodiments, the tensor processor further performs computation while configuring the PE two-dimensional array through the ping-pong controller. In other embodiments, the tensor processor further begins feeding feature image data only after the weight values and thresholds have been pre-configured to the PEs, realizing rapid processing of the input data. In other embodiments, the tensor processor further configures PEs so that they carry out the pooling layer themselves. In other embodiments, the tensor processor further uses full connection so that the intermediate result Psum is not read and written through the ping-pong controller but is accumulated and forwarded directly among the PEs of the array. In other embodiments, the tensor processor further handles input tensor data of different dimensions by partitioning the input data matrix and adjusting the dimensions of the PE two-dimensional array. In other embodiments, the tensor processor further optimizes data transmission by configuring data via multicast.
According to some embodiments of the disclosure, the trained threshold can be obtained through a predetermined number of iterations, through a convergent regression algorithm, or by comparison with labeled test images. According to other embodiments of the disclosure, the trained threshold can be obtained by conventional neural network training and machine learning methods.
Referring to Fig. 16, a tensor processor of one embodiment includes a PE two-dimensional array matrix 500, a control module 502, a weight data buffer 504, a threshold data buffer 506, an image data buffer 508, and an input/output bus 510. According to some embodiments of the disclosure, a ping-pong controller 512 includes, but is not limited to, the control module 502, the weight data buffer 504, the threshold data buffer 506, and the image data buffer 508. According to other embodiments, the ping-pong controller 512 includes the control module 502, the weight data buffer 504, and the image data buffer 508. The input/output bus 510 receives data from outside, for example third-order or higher-order tensor matrix data. According to the type and purpose of the received data (for example, weight data, threshold data, or image data), the input/output bus 510 writes the received data into the weight data buffer 504, the threshold data buffer 506, or the image data buffer 508, respectively. The weight data buffer 504, the threshold data buffer 506, and the image data buffer 508 transfer their stored data to the control module 502, which reads the weight data, threshold data, and image data from them respectively. The input data of this embodiment takes image data as an example, but the input data is not limited to image data; it can also be voice data, data types suitable for object recognition, or other data types. In other embodiments, the image data buffer 508 can be replaced by an input data buffer 514 (not shown), which receives various types of input data from the input/output bus, including images, sound, or other data types, transfers its stored data to the control module 502, and has the corresponding data read by the control module 502.
With continued reference to Fig. 16, the control module 502 determines the number of PEs to call according to information about the weight data and the image data, then determines the interconnections and data flows among the called PEs, and thereby constructs the PE two-dimensional array matrix 500. The control module 502 can also determine the number of PEs to call and the interconnections and data flows among them based only on the weight data information, only on the image data information, or only on a program pre-configured in the control module 502. The control module 502 transfers the weight data, threshold data, and image data to the constructed matrix 500. According to the dimensions of the image data matrix and of the PE array, the control module 502 can adjust the PEs in the matrix 500 or partition the image data matrix. After partitioning the image data matrix, the control module 502 can match the weight data (or convolution kernels) with the image data in different ways: the same convolution kernel applied to different image data inputs, the same image data input corresponding to different convolution kernels, or the two different partitioned image data inputs fed respectively to two different convolution kernels.
With continued reference to Figure 16, the control module 502 can also readjust, once or repeatedly and according to actual needs, the connection relationships and data flows among the PEs of the matrix 500. Within one instruction cycle, the control module 502 can configure the same data (weighted data, threshold value data, or image data) into multiple PEs by multicast, and the PEs configured with the same data may be located in the same row or the same column of the matrix 500, or may be any combination of arbitrary positions in the matrix 500. The control module 502 may configure the convolution kernels or weighted data for the convolution operation, and may also configure the threshold value data. The threshold value data used for configuration may be a trained threshold value. Employing the scheme described above, the tensor processor, after the trained threshold values have been configured into the matrix 500 of the PE array, performs the convolution operation on the image data matrix and the weighted data matrix and thereby achieves a higher operation speed.
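The multicast configuration within one instruction cycle can be sketched as follows. The multicast function and the dictionary-per-PE representation are illustrative assumptions, not the claimed circuit.

def multicast(pe_matrix, positions, kind, payload):
    # Write the same payload (weighted / threshold / image data) into every
    # listed PE position; positions may form a row, a column, or any mix.
    for r, c in positions:
        pe_matrix[r][c][kind] = payload

pes = [[{} for _ in range(3)] for _ in range(3)]            # a 3x3 PE matrix
row0 = [(0, c) for c in range(3)]                           # one row of the matrix
multicast(pes, row0, "weight", [1, -1, 1])                  # same kernel to a whole row
multicast(pes, [(r, c) for r in range(3) for c in range(3)],
          "threshold", 0.0)                                 # same trained threshold to all PEs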
Referring to Figure 17, a ping-pong storage device of a tensor processor according to one embodiment includes PE configuration parameter registers 600. The PE configuration parameter registers 600 store configuration parameters such as weighted data or threshold values. Before performing a convolution operation of any dimension, the ping-pong storage device needs to read the configuration parameters and configure the PEs. The ping-pong storage device can be configured on one side while performing operations on the other; to this end, a CONSUMER pointer 602 and a PRODUCER pointer 604 are resident in a single register group. The CONSUMER pointer 602 is a read-only register field that can be inspected to determine which ping-pong group's data path has been selected, while the PRODUCER pointer 604 is controlled entirely by the tensor processor. In another embodiment, the PE configuration parameter registers 600 also store the various parameters for configuring the PEs, such as the PE configuration parameters shown in Figure 15.
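A software analogue of this double-buffered configuration scheme is sketched below. The PingPongConfig class is an illustrative assumption; actual register widths and pointer encodings are not specified here.

class PingPongConfig:
    # Two configuration banks: the PE array consumes one bank while the
    # controller fills the other, then the roles swap.
    def __init__(self):
        self.banks = [dict(), dict()]
        self.consumer = 0            # read-only view: bank driving the PEs
        self.producer = 1            # bank being written by the controller

    def write(self, key, value):     # configure while the other bank runs
        self.banks[self.producer][key] = value

    def swap(self):                  # commit: the new configuration becomes active
        self.consumer, self.producer = self.producer, self.consumer

    def active(self):
        return self.banks[self.consumer]

cfg = PingPongConfig()
cfg.write("weights", [1, -1, 1])     # load the next layer's kernel
cfg.swap()                           # the PE array now computes with it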
In the several embodiments of the disclosure, the input data takes image data as an example, but the input data is not limited to image data. The input data may also be voice data, a data type suitable for target recognition, or another data type. Likewise, the input data of the several embodiments of the disclosure takes tensor data of order three or higher as an example, but the input data is not limited to tensors of order three or higher; the input data may also be tensor data of order two, one, or zero.
According to some embodiments of the disclosure, the ping-pong storage device can perform operations while being configured; that is, new convolution kernels (or weighted data) and/or trained threshold values can be configured on one side while convolution operations or fully connected operations are carried out on the input data matrix on the other side, thereby accelerating the processing speed of the tensor processor.
According to some embodiments of the disclosure, the configuration of the input data, the weighted data, and the threshold values is mutually independent and can proceed simultaneously.
According to some embodiments of the disclosure, by adjusting the configuration, the PE array can perform fully connected operations as well as convolution operations without reading and writing the intermediate result Psum externally; the operation is instead completed directly within the PE array. Switching between fully connected operation and convolution operation can likewise be realized by adjusting the configuration.
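One way to picture why the same PE array serves both operations: a fully connected layer is a convolution whose kernel spans the entire input, so each output is produced in one pass with no externally staged intermediate result. The function below is a sketch under that reading, not the claimed circuit.

import numpy as np

def fully_connected_as_conv(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # A dense layer is a 'valid' convolution whose kernel covers the whole
    # input, so each kernel yields exactly one output: y[j] = sum_i w[j,i]*x[i].
    return np.array([np.sum(w_row * x) for w_row in w])

x = np.array([1.0, -1.0, 1.0, 1.0])
w = np.array([[1.0, 0.5, -0.5, 1.0],
              [0.0, 1.0,  1.0, 0.0]])
print(fully_connected_as_conv(x, w))   # -> [ 1.  0.]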
According to some embodiments of the disclosure, by configuring the PEs, the pooling operation of a pooling layer in the neural network model can be fused with the convolution operation; in other words, a configured PE itself carries the pooling operation.
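The fusion of pooling into a configured PE can be sketched as one pass that convolves and immediately max-pools. Both helper functions below are illustrative assumptions; the disclosure does not fix the pooling type or window size.

import numpy as np

def conv2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    # Plain 'valid' 2-D convolution (correlation form) used by the fused step.
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i+kh, j:j+kw] * k)
                      for j in range(ow)] for i in range(oh)])

def conv_then_maxpool(x: np.ndarray, k: np.ndarray, p: int = 2) -> np.ndarray:
    # Fused step: convolve, then max-pool p x p windows before emitting output.
    y = conv2d_valid(x, k)
    oh, ow = y.shape[0] // p, y.shape[1] // p
    return np.array([[y[i*p:(i+1)*p, j*p:(j+1)*p].max()
                      for j in range(ow)] for i in range(oh)])

out = conv_then_maxpool(np.arange(36.0).reshape(6, 6), np.ones((3, 3)))
print(out.shape)   # (2, 2): the pooled feature map leaves the PE directly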
According to some embodiments of the disclosure, the PEs to be invoked can be selected flexibly, and the connection relationships and data flows among the PEs can be set by adjusting the configuration so as to obtain, according to actual needs, a specifically configured PE array (including control of the data flow between PEs). The input data matrix can further be partitioned according to the configured PE array, or a portion of the PEs that are not used can be placed in a disabled state.
According to some embodiments of the disclosure, the tensor processor can employ the algorithm described above: after the trained threshold values are configured into the PE array, the convolution operation is performed on the input data matrix and the weighted data matrix, achieving a higher operation speed.
According to some embodiments of the disclosure, the PEs used in the tensor processor may be common neural network processors such as FPGAs or GPUs, or may be specially designed processors, as long as they meet the minimum functional requirements for realizing the various embodiments of the disclosure.
According to some embodiments of the disclosure, the tensor processor is used for binary convolutional neural networks, and the PEs of its PE array perform the convolution operation of the binary convolutional neural network on the input image data and the configured weighted data. In another embodiment, a PE can perform a fully connected operation on the input image data and the configured weighted data. In yet another embodiment, the tensor processor can be used for non-binary convolutional neural networks, for example the inference operations of neural networks with data type INT4, INT8, INT16, or INT32, in which case the PEs perform convolution operations or fully connected operations on the input image data and the configured weighted data in accordance with the data type of the neural network.
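For the binary case, claims 3 and 4 below describe the multiplication as an XNOR and the accumulation as a count of 1 bits. A minimal sketch of one such dot product, under the usual convention that bit 1 encodes +1 and bit 0 encodes -1, is given here; the packing convention and function name are assumptions for illustration.

def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    # XNOR-popcount dot product over n packed bits. Bit 1 encodes +1 and
    # bit 0 encodes -1, so the real-valued result is 2*popcount(XNOR) - n.
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask   # 1 wherever activation and weight agree
    matches = bin(xnor).count("1")     # popcount: the "count the 1 bits" step
    return 2 * matches - n

# activations +1,-1,+1,+1 -> 0b1011 ; weights +1,+1,-1,+1 -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))   # -> 0, i.e. (+1) + (-1) + (-1) + (+1)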
The technical features of the embodiments described above can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The embodiments described above shall not be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these belong to the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

1. A hardware accelerator for neural network convolution operations, comprising:
an N x M processing engine matrix composed of a plurality of processing engines, wherein N and M are positive integers greater than or equal to 2 and N is not greater than M;
a weighted data buffer storing weighted data;
a threshold value data buffer storing threshold value data;
an input data buffer storing input data; and
a control module, wherein the control module reads the input data from the input data buffer and transmits the input data to the processing engines in the processing engine matrix that match the input data, the control module reads the weighted data from the weighted data buffer and configures the weighted data into the processing engines in the processing engine matrix that match the weighted data, and the control module reads the threshold value data from the threshold value data buffer and configures the threshold value data into each processing engine in the processing engine matrix;
wherein each processing engine in the processing engine matrix performs a convolution operation on the input data matched to that processing engine and the weighted data matched to that processing engine to obtain an intermediate result, compares the intermediate result with the threshold value data, and, according to the result of the comparison, selects to output either the intermediate result or a regularization result obtained by regularizing the intermediate result.
2. The hardware accelerator according to claim 1, characterized in that the weighted data and the input data are binarized data, and the processing engine performs a binarized neural network convolution operation on the input data and the weighted data to obtain a binarized neural network convolution intermediate result.
3. The hardware accelerator according to claim 2, characterized in that the weighted data and the input data consist of single bits each expressed as 0 or 1.
4. The hardware accelerator according to claim 3, characterized in that the multiplication of the binarized neural network convolution operation is realized by an XNOR gate operation, and the addition of the binarized neural network convolution operation is realized by an operation counting the number of 1 bits.
5. The hardware accelerator according to claim 3, characterized in that the threshold value data is a trained threshold value, the threshold value satisfying the batch-processing function BatchNorm(a) = 0, where a is the convolution result of the neural network used to train the threshold value.
6. The hardware accelerator according to claim 5, characterized in that the batch-processing function is BatchNorm(a) = γ(a - μ)/σ + B, where μ is the mean of the vector, σ is the variance, γ is the scaling coefficient, and B is the bias.
7. The hardware accelerator according to claim 3, characterized in that the comparison of the binarized neural network convolution intermediate result with the threshold value data is realized by a simplified, fused batch-processing operation and binarization operation.
8. The hardware accelerator according to claim 7, characterized in that the binarization operation is the sign function operation: sign(x) = +1 when x >= 0, and sign(x) = -1 when x < 0.
9. The hardware accelerator according to claim 1, characterized in that the control module configures the input data, the weighted data, and the threshold value data into the processing engine matrix by multicast transmission, respectively and independently of one another.
10. The hardware accelerator according to claim 1, characterized in that the control module updates the configured weighted data of a first portion of the processing engines in the processing engine matrix, keeps the configured weighted data of a second portion of the processing engines in the processing engine matrix unchanged, and updates the configured input data of the second portion of the processing engines.
11. The hardware accelerator according to claim 1, characterized in that the processing engine matrix reads and writes the intermediate result through the control module.
12. The hardware accelerator according to claim 1, characterized in that the processing engine matrix has a fully connected structure, and the intermediate result is propagated and accumulated within the processing engine matrix without being read or written by the control module.
13. The hardware accelerator according to claim 1, characterized in that at least one processing engine in the processing engine matrix performs a pooling operation after the convolution operation.
14. The hardware accelerator according to claim 1, characterized in that the hardware accelerator slices the input data according to the dimensions of the processing engine matrix.
15. The hardware accelerator according to claim 1, characterized in that the control module sets a portion of the processing engines in the processing engine matrix to a standby state according to the information of the input data and the dimensions of the processing engine matrix.
16. The hardware accelerator according to claim 1, characterized in that the control module adjusts the connection relationships and data flows among the processing engines in the processing engine matrix according to the information of the input data and the information of the weighted data.
17. The hardware accelerator according to claim 1, characterized in that the hardware accelerator changes the number of processing engines in the processing engine matrix and the dimensions of the processing engine matrix according to the information of the input data and the information of the weighted data.
18. The hardware accelerator according to claim 1, characterized in that the weighted data is divided into N groups of weight values, each group containing the same number of weight values, and the input data is divided into P groups of feature values, each group containing the same number of feature values, where P is a positive integer greater than or equal to N; the control module configures the P groups of feature values and the N groups of weight values into the processing engine matrix, and each processing engine in the processing engine matrix receives one group of feature values and one group of weight values.
19. The hardware accelerator according to claim 18, characterized in that the dimensions of the processing engine matrix are 3x3, the weighted data is divided into 3 groups of weight values, and each group has 3 weight values.
20. The hardware accelerator according to claim 1, characterized in that the hardware accelerator is used for the convolution operations of neural networks with data type INT4, INT8, INT16, or INT32.
CN201910301389.6A 2019-04-15 2019-04-15 Hardware accelerator for neural network convolution operations Active CN110033086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910301389.6A CN110033086B (en) 2019-04-15 2019-04-15 Hardware accelerator for neural network convolution operations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910301389.6A CN110033086B (en) 2019-04-15 2019-04-15 Hardware accelerator for neural network convolution operations

Publications (2)

Publication Number Publication Date
CN110033086A true CN110033086A (en) 2019-07-19
CN110033086B CN110033086B (en) 2022-03-22

Family

ID=67238490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910301389.6A Active CN110033086B (en) 2019-04-15 2019-04-15 Hardware accelerator for neural network convolution operations

Country Status (1)

Country Link
CN (1) CN110033086B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328647A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Bit width selection for fixed point neural networks
US20160350647A1 (en) * 2015-05-26 2016-12-01 International Business Machines Corporation Neuron peripheral circuits for neuromorphic synaptic memory array based on neuron models
CN107851066A (en) * 2015-07-16 2018-03-27 高通股份有限公司 Hardware counter and the offline adaptable caching architecture for establishing profile to application during based on operation
US20180114569A1 (en) * 2016-03-11 2018-04-26 Hewlett Packard Enterprise Development Lp Hardware accelerators for calculating node values of neural networks
CN108009627A (en) * 2016-10-27 2018-05-08 谷歌公司 Neutral net instruction set architecture
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107153873A (en) * 2017-05-08 2017-09-12 中国科学院计算技术研究所 A kind of two-value convolutional neural networks processor and its application method
CN108875956A (en) * 2017-05-11 2018-11-23 广州异构智能科技有限公司 Primary tensor processor
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
TW201921294A (en) * 2017-08-09 2019-06-01 美商谷歌有限責任公司 Accelerating neural networks in hardware using interconnected crossbars
CN108009634A (en) * 2017-12-21 2018-05-08 美的集团股份有限公司 A kind of optimization method of convolutional neural networks, device and computer-readable storage medium
US10254760B1 (en) * 2017-12-29 2019-04-09 Apex Artificial Intelligence Industries, Inc. Self-correcting controller systems and methods of limiting the operation of neural networks to be within one or more conditions
CN109214504A (en) * 2018-08-24 2019-01-15 北京邮电大学深圳研究院 A kind of YOLO network forward inference accelerator design method based on FPGA

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
K. Kiningham et al.: "Design and Analysis of a Hardware CNN Accelerator", VISION.STANFORD.EDU *
Qiao Ruixiu et al.: "A high-performance reconfigurable deep convolutional neural network accelerator", Journal of Xidian University *
Li Zongling et al.: "A CNN accelerator based on multiple parallel computation and storage", Computer Technology and Development *
Liang Shuang: "Research on key technologies for the design of reconfigurable neural network accelerators", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Yan Ming: "FPGA-based hardware implementation of neural networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378471A (en) * 2019-07-24 2019-10-25 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN110458285A (en) * 2019-08-14 2019-11-15 北京中科寒武纪科技有限公司 Data processing method, device, computer equipment and storage medium
CN110458285B (en) * 2019-08-14 2021-05-14 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110598855A (en) * 2019-09-23 2019-12-20 Oppo广东移动通信有限公司 Deep learning model generation method, device, equipment and storage medium
CN110598855B (en) * 2019-09-23 2023-06-09 Oppo广东移动通信有限公司 Deep learning model generation method, device, equipment and storage medium
US12039430B2 (en) * 2019-11-15 2024-07-16 Samsung Electronics Co., Ltd. Electronic device and method for inference binary and ternary neural networks
US20210150313A1 (en) * 2019-11-15 2021-05-20 Samsung Electronics Co., Ltd. Electronic device and method for inference binary and ternary neural networks
TWI743710B (en) * 2020-03-18 2021-10-21 國立中山大學 Method, electric device and computer program product for convolutional neural network
US11604975B2 (en) * 2020-04-09 2023-03-14 Apple Inc. Ternary mode of planar engine for neural processor
US20210319290A1 (en) * 2020-04-09 2021-10-14 Apple Inc. Ternary mode of planar engine for neural processor
CN111915001A (en) * 2020-08-18 2020-11-10 腾讯科技(深圳)有限公司 Convolution calculation engine, artificial intelligence chip and data processing method
CN111915001B (en) * 2020-08-18 2024-04-12 腾讯科技(深圳)有限公司 Convolution calculation engine, artificial intelligent chip and data processing method
CN112183732A (en) * 2020-10-22 2021-01-05 中国人民解放军国防科技大学 Convolutional neural network acceleration method and device and computer equipment
CN112596912B (en) * 2020-12-29 2023-03-28 清华大学 Acceleration operation method and device for convolution calculation of binary or ternary neural network
CN112596912A (en) * 2020-12-29 2021-04-02 清华大学 Acceleration operation method and device for convolution calculation of binary or ternary neural network
CN112965931A (en) * 2021-02-22 2021-06-15 北京微芯智通科技合伙企业(有限合伙) Digital integration processing method based on CNN cell neural network structure
CN113159285A (en) * 2021-04-14 2021-07-23 广州放芯科技有限公司 Neural network accelerator
CN113159285B (en) * 2021-04-14 2023-09-05 广州放芯科技有限公司 neural network accelerator
WO2023065748A1 (en) * 2021-10-19 2023-04-27 海飞科(南京)信息技术有限公司 Accelerator and electronic device

Also Published As

Publication number Publication date
CN110033086B (en) 2022-03-22

Similar Documents

Publication Publication Date Title
CN110033086A (en) Hardware accelerator for neural network convolution algorithm
CN110046705A (en) Device for convolutional neural networks
CN110059805A (en) Method for two value arrays tensor processor
CN110033085A (en) Tensor processor
CN105930902B (en) A kind of processing method of neural network, system
EP3859543B1 (en) Parallel processing of reduction and broadcast operations on large datasets of non-scalar data
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN105956659A (en) Data processing device, data processing system and server
CN107451654A (en) Acceleration operation method, server and the storage medium of convolutional neural networks
CN110197253A (en) The arithmetical unit accelerated for deep learning
CN107609641A (en) Sparse neural network framework and its implementation
CN106951395A (en) Towards the parallel convolution operations method and device of compression convolutional neural networks
CN105892989A (en) Neural network accelerator and operational method thereof
CN101925877A (en) Apparatus and method for performing permutation operations on data
WO2017117186A1 (en) Conditional parallel processing in fully-connected neural networks
CN106201651A (en) The simulator of neuromorphic chip
CN108470009A (en) Processing circuit and its neural network computing method
CN107423816A (en) A kind of more computational accuracy Processing with Neural Network method and systems
CN110300944A (en) Image processor with configurable number of active cores and supporting internal networks
CN110414672B (en) Convolution operation method, device and system
US20230316057A1 (en) Neural network processor
CN107894957A (en) Memory data towards convolutional neural networks accesses and zero insertion method and device
CN205827367U (en) Data processing equipment and server
US12117946B2 (en) Neural network processor
CN111291884B (en) Neural network pruning method, device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Xu Zhe

Inventor after: Ding Xueli

Inventor after: Chen Baigang

Inventor before: Chen Baigang

Inventor before: Xu Zhe

Inventor before: Ding Xueli

TA01 Transfer of patent application right

Effective date of registration: 20200623

Address after: Room 1202-1204, No.8, Jingang Avenue, Nansha street, Nansha District, Guangzhou City, Guangdong Province

Applicant after: NOVUMIND Ltd.

Address before: 100191 9th floor 908 Shining Building, 35 College Road, Haidian District, Beijing

Applicant before: NOVUMIND Ltd.

GR01 Patent grant