CN110033085A - Tensor processor - Google Patents

Tensor processor

Info

Publication number
CN110033085A
Authority
CN
China
Prior art keywords
tensor
data
dimensional array
processing engine
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910301388.1A
Other languages
Chinese (zh)
Other versions
CN110033085B (en
Inventor
陈柏纲
许喆
丁雪立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NOVUMIND Ltd.
Original Assignee
Beijing Isomerism Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Isomerism Intelligent Technology Co Ltd
Priority to CN201910301388.1A
Publication of CN110033085A
Application granted
Publication of CN110033085B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

A tensor processor. The tensor processor includes a ping-pong controller and multiple processing engines connected to the ping-pong controller. The ping-pong controller receives an input tensor, calculates the number of processing engines to be called according to the dimension information of the input tensor and the dimension information of a weight tensor, and calls that number of processing engines to form a two-dimensional array of processing engines. The ping-pong controller configures the connection relationships and data flows between the processing engines in the two-dimensional array. The ping-pong controller supplies the input tensor and the weight tensor to the two-dimensional array. The two-dimensional array performs a convolution operation on the input tensor and the weight tensor to obtain an output result, which is transferred to the ping-pong controller. The tensor processor processes input tensor data at high speed and can flexibly handle input tensor data of different dimensions.

Description

Tensor processor
Technical field
This disclosure relates to a tensor processor for neural network convolution operations.
Background art
Neural networks build model structures by simulating the neural connection structure of the human brain, and are a current hot spot of academic research and corporate R&D. Current neural networks, especially convolutional neural networks used for image processing and object recognition, need to process large amounts of data expressed as third-order or higher-order tensors, and also need to process tensor data of different shapes and sizes. A dedicated neural network computing device is therefore needed that can process third-order or higher-order tensor data of different shapes at high speed. In addition, a binarized neural network is a neural network whose weight values and/or input data have undergone binarization. At present there is no high-precision computing device aimed at binarized neural networks.
Summary of the invention
In view of this, it is necessary to provide a dedicated neural network computing device capable of processing third-order or higher-order tensor data at high speed, and also a high-precision computing device for binarized neural networks. To this end, the disclosure provides a tensor processor that includes multiple processing engines (Processing Engine, hereinafter PE) and a ping-pong controller connected to the PEs. According to actual needs (for example, information such as the dimensions of the input tensor data and the dimensions of the convolution kernel), the tensor processor determines the number of PEs to be called and the dimensions of the two-dimensional array formed by the called PEs, and calls all or some of the PEs to form a PE two-dimensional array. Further, the tensor processor configures the connection relationships and data flows among the PEs of the PE two-dimensional array, and can also partition the input tensor data according to the dimensions of the PE two-dimensional array, so that input tensor data is processed at high speed and input tensor data of different dimensions can be handled flexibly. For inference in binary neural networks, the tensor processor implements the convolution operation in hardware and also applies a thresholding operation to the convolution result, yielding a binary neural network computing device that combines high speed with high precision.
According to one aspect of the disclosure, a tensor processor is provided, comprising: a ping-pong controller that receives an input tensor; and multiple processing engines connected to the ping-pong controller; wherein the ping-pong controller calculates, according to the dimensions of the input tensor, the number of processing engines to be called, calls that number of processing engines to form a processing engine two-dimensional array, configures the connection relationships and data flows between the processing engines in the processing engine two-dimensional array, and supplies the input tensor and a weight tensor to the processing engine two-dimensional array; and the processing engine two-dimensional array performs a convolution operation on the input tensor and the weight tensor to obtain an output result, which is transferred to the ping-pong controller.
Brief description of the drawings
Embodiments of the disclosure have other advantages and features, which will become more apparent from the following detailed description and the appended claims when read in conjunction with the accompanying drawings, in which:
Fig. 1 shows the architecture of a tensor processor according to one embodiment; the tensor processor includes an input/output bus, a ping-pong controller, and a PE two-dimensional array.
Fig. 2 shows the architecture and data flow of the PE two-dimensional array of a tensor processor according to one embodiment.
Fig. 3 shows the architecture and data flow of the PE two-dimensional array of a tensor processor according to another embodiment.
Fig. 4 shows the PE two-dimensional array of the tensor processor in the embodiment of Fig. 3 computing the first row of the operation result.
Fig. 5 shows the PE two-dimensional array of the tensor processor in the embodiment of Fig. 3 computing the second row of the operation result.
Fig. 6 shows the PE two-dimensional array of the tensor processor in the embodiment of Fig. 3 computing the third row of the operation result.
Fig. 7 shows a tensor processor according to one embodiment configuring the PEs of the PE two-dimensional array to fit the dimensions of the input data matrix.
Fig. 8 shows a tensor processor according to one embodiment partitioning the input data matrix to fit the dimensions of the PE two-dimensional array.
Fig. 9 shows a first way of matching the convolution kernel to the image data input after a tensor processor according to one embodiment partitions the input data matrix.
Fig. 10 shows a second way of matching the convolution kernel to the image data input after a tensor processor according to one embodiment partitions the input data matrix.
Fig. 11 shows a third way of matching the convolution kernel to the image data input after a tensor processor according to one embodiment partitions the input data matrix.
Fig. 12 shows a tensor processor according to one embodiment configuring data by multicast.
Fig. 13 shows the data flow of a fully connected operation performed by a tensor processor according to one embodiment.
Fig. 14 shows the PE configuration parameters of a tensor processor according to one embodiment.
Fig. 15 shows a PE of a tensor processor according to one embodiment performing a binary neural network convolution operation and a thresholding operation.
Fig. 16 shows the architecture of a tensor processor according to another embodiment.
Fig. 17 shows the architecture of the ping-pong controller of a tensor processor according to one embodiment.
Detailed description of embodiments
The drawings and the following description are given by way of example only. It should be understood from the following discussion that alternative embodiments of the structures and methods disclosed herein will readily be recognized as viable alternatives that may be used without departing from the claimed principles.
Referring to Fig. 1, a tensor processor according to one embodiment includes an input/output bus 100, multiple PEs, and a ping-pong controller 102 connected to the PEs. The input/output bus 100 receives input data from outside (for example, image data expressed as a third-order or higher-order tensor, or a feature tensor containing image feature values), transfers the input data to the ping-pong controller 102, and, after receiving output data from the ping-pong controller 102, outputs it to the outside. The input/output bus 100 can also receive convolution kernel data from outside (the convolution kernel data may be a group of weight values, a single weight value, or a convolution kernel tensor). In other embodiments, the convolution kernel data may come from the tensor processor itself; for example, the convolution kernel data is stored in advance in a PE configuration parameter register of the ping-pong controller 102 (not shown). According to information about the input data and the convolution kernel data (such as the dimensions of each), the ping-pong controller 102 determines the number of PEs to be called and the dimensions of the two-dimensional array formed by the called PEs, and then calls all or some of the PEs connected to it to form the PE two-dimensional array. In the embodiment shown in Fig. 1, the ping-pong controller 102 has determined a two-dimensional array of 4 rows and 4 columns made up of 16 PEs (any PEs that are not called are not shown). Further, the ping-pong controller 102 configures the connection relationships and data flows among these 16 PEs (for example, as shown in Fig. 1, operation results are passed vertically from top to bottom to the PEs of the bottom row). In other embodiments, the ping-pong controller 102 determines that N × M PEs are needed (N and M each being a positive integer greater than or equal to 2) to form a PE two-dimensional array of N rows and M columns, M rows and N columns, or other dimensions, and configures the connection relationships and data flows among the N × M PEs (including but not limited to the data flow of image data, the data flow of weight data, and the data flow of operation results).
Continuing with Fig. 1, the ping-pong controller 102 also configures convolution kernel data to the PEs in the PE two-dimensional array and transfers input data to the PEs in the PE two-dimensional array. A given PE in the array performs a convolution operation on the input data transferred to it and the convolution kernel data configured to it, and obtains an operation result. In particular, configuring the convolution kernel data to a PE occurs before input data is transferred to that PE; that is, the convolution kernel data is configured first and the transfer of input data starts afterwards. Because convolution kernel data (weight values) is heavily reused in convolution operations, configuring the convolution kernel data or weight values in advance lets input data such as image feature values be streamed into the PE two-dimensional array without pause, which increases the amount of data the tensor processor handles per batch. In other embodiments, configuring the convolution kernel data may also occur at the same time as, or after, the transfer of input data. According to the configured connection relationships and data flows among the PEs, the ping-pong controller 102 selects the operation results of certain PEs as the output result. In the embodiment shown in Fig. 1, the ping-pong controller 102 selects the operation results of the 4 PEs in the bottom row as the output result. In other embodiments, according to actual needs, the PE two-dimensional array can have different dimensions or architectures, and the connection relationships and data flows among the PEs can be configured differently, so any PE in the array may be designated to provide the output result in some particular architecture. In addition, according to some embodiments of the disclosure, the ping-pong controller 102 can transfer input data for computation while new convolution kernel data (or weight data) is being configured, which speeds up the tensor processor. Specifically, across the whole PE two-dimensional array, the ping-pong controller 102 can configure new convolution kernel data to one part of the PEs while also transferring input data to another part. That is, while the convolution kernel data of one part of the PEs is updated, the convolution kernel data of the other part is held constant and those PEs continue computing, which speeds up the tensor processor.
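The idea of configuring weights first and then streaming input data through a PE can be illustrated with a minimal Python sketch. The `PE` class and its `configure` and `process` methods are hypothetical names introduced only for this illustration; the one-dimensional sliding-window product is an assumption about what a single PE computes and is not taken from the patent text.

```python
from typing import List, Optional


class PE:
    """Hypothetical processing engine that holds one row of weights."""

    def __init__(self) -> None:
        self.weights: Optional[List[int]] = None

    def configure(self, weights: List[int]) -> None:
        # Weights are configured once, before any input data is streamed in.
        self.weights = list(weights)

    def process(self, row: List[int]) -> List[int]:
        # Slide the configured weight row across the input row
        # (1-D convolution written as cross-correlation, as is usual in CNNs).
        if self.weights is None:
            raise RuntimeError("configure() must be called before process()")
        k = len(self.weights)
        return [
            sum(w * x for w, x in zip(self.weights, row[i:i + k]))
            for i in range(len(row) - k + 1)
        ]


pe = PE()
pe.configure([1, 0, -1])
# The same configured weights are reused for every input row that follows.
outputs = [pe.process(row) for row in ([1, 2, 3, 4], [4, 3, 2, 1])]
```

Because `configure` is called once and `process` many times, the cost of loading weights is spread over all the input rows, which is the reuse argument made in the paragraph above.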
Continuing with Fig. 1, the operation result that a PE in the PE two-dimensional array obtains by convolving the input data transferred to it with the convolution kernel data (or weight data) configured to it can be an intermediate result Psum of the neural network inference, or a regularized result obtained by regularizing the intermediate result Psum. For a binary convolutional neural network (Binary CNN), the operation result of a PE can be the intermediate result Psum, or a 1-bit result of 1 or 0 after regularization. In some embodiments, the intermediate result Psum obtained by a PE is transferred to the ping-pong controller 102. In other embodiments, the PE two-dimensional array uses a fully connected (fully-connected) layer arrangement, and the intermediate result Psum obtained by a PE does not need to be transferred to the ping-pong controller 102. That is, the intermediate result Psum is not read and written through the ping-pong controller 102; instead it is accumulated and passed on directly between the PEs of the PE two-dimensional array. A PE two-dimensional array using the fully connected arrangement supports both fully connected operations and convolution operations. Moreover, because such an array completes the operation inside the two-dimensional array rather than reading and writing the intermediate result Psum through the ping-pong controller 102, latency is reduced, which favors high-speed inference. The ping-pong controller 102 can adjust the connection relationships and data flows among the PEs according to actual needs, thereby controlling whether the intermediate result Psum is read and written through the ping-pong controller 102 and thus switching between fully connected operations and convolution operations. In addition, according to some embodiments of the disclosure, by configuring the PEs, the pooling-layer (pooling) operation of a neural network model can be merged with the convolution operation, that is, the configured PEs carry out the pooling operation.
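For the binary-network case mentioned above, the step from an intermediate result Psum to a 1-bit result can be sketched as follows. This is an illustration under stated assumptions, not the patent's circuit: the bit encoding (1 for +1, 0 for -1), the XNOR-and-count formulation, and the simple comparison used as the thresholding step are all common choices for binary networks that this excerpt does not confirm. The function names `binary_dot` and `threshold` are hypothetical.

```python
from typing import List


def binary_dot(x_bits: List[int], w_bits: List[int]) -> int:
    """Dot product of two {-1, +1} vectors encoded as bits (1 -> +1, 0 -> -1)."""
    if len(x_bits) != len(w_bits):
        raise ValueError("inputs must have the same length")
    # XNOR is 1 where the bits agree. Each agreement contributes +1 and each
    # disagreement -1, so the signed sum equals 2 * matches - n.
    matches = sum(1 for x, w in zip(x_bits, w_bits) if x == w)
    return 2 * matches - len(x_bits)


def threshold(psum: int, limit: int) -> int:
    """Reduce an intermediate result Psum to a single bit (assumed rule)."""
    return 1 if psum >= limit else 0


psum_a = binary_dot([1, 0, 1, 1], [1, 1, 0, 1])  # two agreements out of four
psum_b = binary_dot([1, 1, 1], [0, 0, 1])        # one agreement out of three
bit_a = threshold(psum_a, 0)
bit_b = threshold(psum_b, 0)
```

Keeping Psum as an integer until the final comparison is what allows several PEs to accumulate their partial results before the 1-bit decision is made.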
Referring to Fig. 2, the data flows of the PE two-dimensional array of a tensor processor according to one embodiment include, but are not limited to, the data flow of image data, the data flow of weight data, and the data flow of operation results. The embodiment shown in Fig. 2 provides a PE two-dimensional array of 3 rows and 3 columns made up of 9 PEs. Image data enters at the PEs of the leftmost column of the array and is then passed one PE at a time to the adjacent PE on the right in the same row, that is, from left to right, from the first column to the second and from the second to the third. Weight data also enters at the PEs of the leftmost column and is then passed one PE at a time along the diagonal to the nearest PE to the upper right, which is in neither the same row nor the same column. The operation result computed by each PE is passed one PE at a time vertically to the nearest PE below it in the same column, that is, from top to bottom, from the first row to the second and from the second to the third. The data flows shown in Fig. 2 merely illustrate that the tensor processor can control the data flow of image data, the data flow of weight data, and the data flow of operation results separately. That is, when configuring the connection relationships and data flows among the PEs, the tensor processor can configure each of the three data flows on its own. The embodiment shown in Fig. 2 merely illustrates one way of configuring the data flows between PEs in a 3-row, 3-column PE two-dimensional array and should not be taken to limit other possible configurations of the PE two-dimensional array.
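The three independently configured data flows of Fig. 2 can be written down as three small neighbor functions. This is a sketch only: the zero-based `(row, column)` coordinates with row 0 at the top, and the function names, are assumptions made for illustration.

```python
from typing import Optional, Tuple

ROWS, COLS = 3, 3
Coord = Tuple[int, int]  # (row, column), zero-based, row 0 at the top


def _inside(r: int, c: int) -> Optional[Coord]:
    # Return the coordinate if it lies inside the array, otherwise None.
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else None


def next_image(pe: Coord) -> Optional[Coord]:
    # Image data moves left to right within the same row.
    r, c = pe
    return _inside(r, c + 1)


def next_weight(pe: Coord) -> Optional[Coord]:
    # Weight data moves diagonally to the nearest upper-right PE.
    r, c = pe
    return _inside(r - 1, c + 1)


def next_result(pe: Coord) -> Optional[Coord]:
    # Operation results move top to bottom within the same column.
    r, c = pe
    return _inside(r + 1, c)
```

Each flow is a separate function, which mirrors the point that the three data flows can be configured independently of one another.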
The PE two-dimensional array shown in Fig. 1, the PE two-dimensional array shown in Fig. 2, and the PE two-dimensional arrays of other embodiments of the disclosure merely illustrate connection relationships and data flows between PEs and should not be taken to limit other possible configurations of the PE two-dimensional array. Relative relationships between PEs mentioned in embodiments of the disclosure, such as up, down, left, right, front and back, positional information such as which row and which column a PE is in, and expressions such as "the PEs of the leftmost column" or "the bottom row", serve only to make the connection relationships and data flows between PEs easier to describe. They should not be construed as requiring PEs to be physically arranged in strict accordance with those relative and positional relationships, still less as limiting other possible configurations of the PE two-dimensional array. Likewise, the PE two-dimensional arrays shown in the drawings indicate various data flows by arrows; these arrows only help to illustrate the data flows between PEs and should not be taken to limit other possible configurations of the PE two-dimensional array.
Referring to Fig. 3 to Fig. 6, Fig. 3 shows a tensor processor according to another embodiment, and Fig. 4 to Fig. 6 show the case in which the input image data matrix has 5 rows and 5 columns and the weight data matrix has 3 rows and 3 columns. The tensor processor includes a two-dimensional array of 3 rows and 3 columns made up of 9 PEs, numbered PE1 to PE9. Fig. 3 also shows the connection relationships and data flows among the 9 PEs, including the data flow of image data, the data flow of weight data, and the data flow of operation results. Image data is transferred to the corresponding PEs in the manner shown in Fig. 3. Specifically, each PE corresponds to one row of image data: row 1 of the image data is transferred to PE1; row 2 to PE2 and PE4; row 3 to PE3, PE5 and PE7; row 4 to PE6 and PE8; and row 5 to PE9. Weight data is configured to the corresponding PEs in the manner shown in Fig. 3. Specifically, each PE corresponds to one row of weight data: row 1 of the weight data is configured to PE1, PE4 and PE7; row 2 to PE2, PE5 and PE8; and row 3 to PE3, PE6 and PE9. Operation results are accumulated and passed on in the manner shown in Fig. 3. Specifically, the operation result of PE1 is passed to PE2 and then on to PE3, finally yielding row 1 of the convolution output. The operation result of PE4 is passed to PE5 and then on to PE6, yielding row 2 of the convolution output. The operation result of PE7 is passed to PE8 and then on to PE9, yielding row 3 of the convolution output.
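The row assignments listed above follow a regular pattern that can be checked with a short Python sketch. The closed-form numbering in `pe_assignment` is inferred from the listed assignments and is an assumption, as are the function names and the integer (non-binary) arithmetic; the sketch only shows that accumulating per-row results along each chain of PEs reproduces an ordinary 2-D convolution (written as cross-correlation).

```python
from typing import Dict, List, Tuple

Matrix = List[List[int]]


def pe_assignment(kernel_rows: int, out_rows: int) -> Dict[int, Tuple[int, int]]:
    """Map PE number -> (image row, weight row), one-based as in the text."""
    mapping = {}
    for out in range(out_rows):        # one chain of PEs per output row
        for k in range(kernel_rows):   # one PE per kernel row in the chain
            mapping[out * kernel_rows + k + 1] = (out + k + 1, k + 1)
    return mapping


def row_conv(row: List[int], weights: List[int]) -> List[int]:
    # What a single PE computes: one image row against one weight row.
    n = len(weights)
    return [sum(w * x for w, x in zip(weights, row[i:i + n]))
            for i in range(len(row) - n + 1)]


def convolve(image: Matrix, kernel: Matrix) -> Matrix:
    out_rows = len(image) - len(kernel) + 1
    out_cols = len(image[0]) - len(kernel[0]) + 1
    mapping = pe_assignment(len(kernel), out_rows)
    result = []
    for out in range(out_rows):
        psum = [0] * out_cols
        for k in range(len(kernel)):
            img_row, w_row = mapping[out * len(kernel) + k + 1]
            partial = row_conv(image[img_row - 1], kernel[w_row - 1])
            # Accumulate this PE's result and pass it on to the next PE.
            psum = [a + b for a, b in zip(psum, partial)]
        result.append(psum)
    return result


image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
assignment = pe_assignment(3, 3)
output = convolve(image, kernel)
```

With a 5 × 5 image and a 3 × 3 kernel, `assignment` reproduces the nine pairings given in the text, for example PE5 receiving image row 3 and weight row 2.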
Referring to Fig. 3 and Fig. 4, taking a binary convolutional neural network as an example, PE1 performs the binary convolution operation on row 1 of the input image data and row 1 of the configured weight data. PE2 performs the binary convolution operation on row 2 of the input image data and row 2 of the configured weight data. PE3 performs the binary convolution operation on row 3 of the input image data and row 3 of the configured weight data. The operation result of PE1 is passed to PE2 and then on to PE3, finally yielding row 1 of the binary neural network convolution output. For a given PE, the convolution of its input image data with its configured weight data may finish before, after, or at the same time as it receives the operation result passed from another PE. The PE combines its own completed convolution result with the result passed from the other PE and passes the combined result to a third PE.
Referring to Fig. 3 and Fig. 5, taking a binary convolutional neural network as an example, PE4 performs the binary convolution operation on row 2 of the input image data and row 1 of the configured weight data. PE5 performs the binary convolution operation on row 3 of the input image data and row 2 of the configured weight data. PE6 performs the binary convolution operation on row 4 of the input image data and row 3 of the configured weight data. The operation result of PE4 is passed to PE5 and then on to PE6, finally yielding row 2 of the binary neural network convolution output. For a given PE, the convolution of its input image data with its configured weight data may finish before, after, or at the same time as it receives the operation result passed from another PE. The PE combines its own completed convolution result with the result passed from the other PE and passes the combined result to a third PE.
Referring to Fig. 3 and Fig. 6, taking a binary convolutional neural network as an example, PE7 performs the binary convolution operation on row 3 of the input image data and row 1 of the configured weight data. PE8 performs the binary convolution operation on row 4 of the input image data and row 2 of the configured weight data. PE9 performs the binary convolution operation on row 5 of the input image data and row 3 of the configured weight data. The operation result of PE7 is passed to PE8 and then on to PE9, finally yielding row 3 of the binary neural network convolution output. For a given PE, the convolution of its input image data with its configured weight data may finish before, after, or at the same time as it receives the operation result passed from another PE. The PE combines its own completed convolution result with the result passed from the other PE and passes the combined result to a third PE.
Referring to Fig. 3 to Fig. 6, the input image data matrix has 5 rows and 5 columns and the weight data matrix has 3 rows and 3 columns; the tensor processor configures a PE two-dimensional array of 3 rows and 3 columns made up of 9 PEs and sets the connection relationships and data flows among the 9 PEs. Further, the tensor processor inputs a given row of image data to a specific PE and also configures a row of weight data to that PE. The PE performs a convolution operation on the input image data and the configured weight data and outputs an operation result. The operation results of several PEs, after being accumulated and passed on in a specific way, yield a given row of the neural network convolution output. In other embodiments, the PE two-dimensional array can have different dimensions or sizes; for example, it can be 12 × 14. In other embodiments, the tensor processor adjusts the size of the PE two-dimensional array according to information (such as matrix dimensions) about the input image data matrix and the weight data matrix. The embodiment shown in Fig. 3 to Fig. 6 merely illustrates one architecture of the PE two-dimensional array and one way of configuring it, and should not be taken to limit other possible architectures and configurations of the PE two-dimensional array. In other embodiments, the convolution kernel (or weight data matrix) used for the convolution operation can be 3 × 3, and can also be 1 × 1, 5 × 5 or 7 × 7.
According to other embodiments of the disclosure, by configuring the size and architecture of the PE two-dimensional array and the connection relationships and data flows between PEs, the tensor processor can input the image data used for the convolution operation to multiple PEs synchronously, and can likewise configure the weight data used for the convolution operation to multiple PEs synchronously, thereby optimizing data transfer. According to some embodiments of the disclosure, and with reference to the architecture of the PE two-dimensional array shown in Fig. 3, PE3, PE5 and PE7 can synchronously receive row 3 of the image data through a buffer outside the ping-pong controller or the PE two-dimensional array, and PE1, PE4 and PE7 can synchronously receive row 1 of the weight data through such a buffer. The PE two-dimensional array shown in Fig. 3 to Fig. 6 merely illustrates one architecture of the PE two-dimensional array and one way of configuring it, and should not be taken to limit other possible architectures and configurations of the PE two-dimensional array.
In the embodiment shown in Fig. 3 to Fig. 6, the operation result of a first PE is passed to a second PE and then on to a third PE. In other embodiments, after the operation result of the first PE is passed to the second PE, the second PE, once it has finished its convolution operation, passes the result back to the first PE rather than to a third PE. The first PE then receives new input image data, is configured with new weight data if needed or keeps its configured weight data unchanged, performs the convolution operation on the new image data, and outputs the result.
In the embodiment shown in Fig. 3 to Fig. 6, a binary convolutional neural network is taken as the example, and each PE performs the binary convolution operation on the input image data and the configured weight data. In other embodiments, a PE can perform a fully connected operation on the input image data and the configured weight data. In other embodiments, the tensor processor can also be used for non-binary convolutional neural networks, for example inference for neural networks whose data type is INT4, INT8, INT16 or INT32, in which case the PE performs on the input image data and the configured weight data a convolution operation corresponding to the data type of the neural network.
Referring to Fig. 7, in one embodiment the PE two-dimensional array of the tensor processor is a matrix array of 12 rows and 14 columns, and the input data matrix has 3 rows and 13 columns. The tensor processor adjusts the PE two-dimensional array so that some of the PEs are placed in an inactive state to reduce energy consumption.
Referring to Fig. 8, in one embodiment the PE two-dimensional array of the tensor processor is a matrix array of 12 rows and 14 columns, and the input data matrix has 5 rows and 27 columns. The tensor processor cuts the input data matrix into two input data matrices, one of 5 rows and 14 columns and one of 5 rows and 13 columns, to fit the dimensions of the PE two-dimensional array.
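The cutting described for Fig. 8 can be sketched as follows. This is a minimal illustration only: the function name `split_columns` and the greedy left-to-right cutting rule are assumptions made for the example, since the disclosure states only the resulting tile sizes and not how the cut positions are chosen.

```python
def split_columns(n_rows, n_cols, array_cols):
    """Cut an n_rows x n_cols input matrix into column tiles.

    Each tile is at most array_cols wide. Greedy left-to-right cutting
    is an assumption of this sketch, not a rule stated in the disclosure.
    Returns a list of (rows, cols) tile shapes.
    """
    tiles = []
    start = 0
    while start < n_cols:
        width = min(array_cols, n_cols - start)
        tiles.append((n_rows, width))
        start += width
    return tiles


# Fig. 8 case: a 5 x 27 input on an array that is 14 columns wide.
print(split_columns(5, 27, 14))
# Fig. 7 case: a 3 x 13 input fits without cutting.
print(split_columns(3, 13, 14))
```

Under these assumptions the Fig. 8 input yields a 5 x 14 tile and a 5 x 13 tile, which matches the split described above, while the Fig. 7 input is left as a single tile and the unused PEs can be made inactive.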
With continued reference to Fig. 7 and Fig. 8, according to some embodiments of the disclosure, the tensor processor can determine the dimensions (or size) of the PE two-dimensional array, as well as the connection relationships and data flow between the PEs, according to information (such as matrix dimensions) about the input image data matrix and the weight data matrix (or convolution kernel). The tensor processor can also cut the input data matrix according to the dimensions determined for the PE two-dimensional array. If necessary, the tensor processor can readjust the previously determined dimensions of the PE two-dimensional array. The tensor processor of the disclosure can therefore handle input data matrices of different dimensions by cutting the input data matrix while keeping the dimensions of the current PE two-dimensional array unchanged. The tensor data to be processed by a neural network become data matrices of different dimensions after being unfolded, so the flexibility of the tensor processor in handling input data matrices of different dimensions is conducive to high-speed neural network inference. On the other hand, when the dimensions of the input data matrices of the neural network are largely consistent, or according to other practical needs, the tensor processor can readjust the previously determined dimensions of the PE two-dimensional array according to information such as the dimensions of the input data matrix, thereby selecting dimensions of the PE two-dimensional array, together with connection relationships and data flow between the PEs, that are better suited to the current input data matrix. For example, referring to the embodiments shown in Fig. 3 to Fig. 6, when the input image data matrix has 5 rows and 5 columns and the weight data matrix has 3 rows and 3 columns, the tensor processor configures a PE two-dimensional array of 3 rows and 3 columns consisting of 9 PEs in total, thereby realizing high-speed inference on the input image data matrix. According to some embodiments of the disclosure, the tensor processor can both cut the input image data matrix and adjust the dimensions and other configuration of the current PE two-dimensional array, which is conducive to high-speed processing of complex and variable input tensor data.
Referring to Fig. 9, in one embodiment the tensor processor, after cutting the input data matrix, uses a first way of matching convolution kernels with image data inputs. In the first way, the same convolution kernel is applied to different image data inputs. As shown in Fig. 9, the image data of the first row differs from the image data of the second row, while the convolution kernel or weight data of the first row and that of the second row are the same filter, that is, the same convolution kernel. The output results of the first row and the second row both go to channel 1.
Referring to Fig. 10, in one embodiment the tensor processor, after cutting the input data matrix, uses a second way of matching convolution kernels with image data inputs. In the second way, the same image data input is paired with different convolution kernels. As shown in Fig. 10, the convolution kernel or weight data of the first row differs from that of the second row, while the image data of the first row and the image data of the second row are the same image data. The output results of the first row and the second row both go to channel 1.
Referring to Fig. 11, in one embodiment the tensor processor, after cutting the input data matrix, uses a third way of matching convolution kernels with image data inputs. In the third way, the two different image data inputs obtained by cutting are fed to two different convolution kernels respectively. As shown in Fig. 11, the convolution kernel or weight data of the first row differs from that of the second row, and the image data of the first row differs from the image data of the second row. The output result of the first row goes to channel 1, and the output result of the second row goes to channel 2.
Referring to Fig. 12, in one embodiment the tensor processor configures data by multicast (Multicast) to optimize data transmission. Multicast means that a single read operation can read data from the ping-pong controller or from a buffer outside the PE two-dimensional array and send it to multiple PEs at once. In other words, a variable number of PEs can receive newly configured data through Multicast in one instruction cycle, so that the tensor processor can configure the same data into multiple PEs within one instruction cycle. The PEs that are configured with the same data through Multicast can be located in the same row or in the same column of the PE two-dimensional array, or can be any combination of positions in the two-dimensional array. For example, referring to Fig. 3 and Fig. 12 together, the tensor processor uses Multicast to configure the 1st row of the weight data simultaneously to the PEs numbered PE1, PE4 and PE7, to configure the 2nd row of the weight data simultaneously to the PEs numbered PE2, PE5 and PE8, and to configure the 3rd row of the weight data simultaneously to the PEs numbered PE3, PE6 and PE9. The data configured through Multicast can include convolution kernels for the convolution operation and can also include weight data for neural network inference. According to some embodiments of the disclosure, the data configured through Multicast can also include the threshold values required by the thresholding operation of the binary neural network convolution operation. The threshold data used for configuration can be trained threshold values. Only after the convolution kernels (or weight data) and threshold values have been configured through Multicast does the tensor processor transmit the input data to the PE two-dimensional array for the convolution operation. That is, input data such as image feature data are fed into the PE two-dimensional array only during the actual operation. The tensor processor can use a state algorithm: after the trained threshold values have been configured to the PE two-dimensional array, the input data matrix and the weight data matrix are configured through Multicast, without interruption, to the corresponding PEs for the convolution operation, achieving a faster inference speed.
Referring to Fig. 13, in one embodiment the tensor processor has an architecture that supports fully connected operations. By adjusting the connection relationships and data flow within the PE two-dimensional array, the tensor processor realizes the fully connected data flow required to support the fully connected layers of a neural network.
Fig. 14 shows the list of parameters that the tensor processor of one embodiment configures to each PE. Each row of the input image feature data is assigned a feature_row_id, and each row of the weight data is assigned a weight_row_id. Each PE is assigned a weight_row_id_local and a feature_row_id_local associated with that PE. When weight data is configured, the weight_row_id carried by the weight data is compared with the weight_row_id_local of the PE; the PE accepts the configured weight data if they match and does not accept it if they do not. When input image feature data is configured, the feature_row_id carried by the image feature data is compared with the feature_row_id_local of the PE; the PE accepts the configured image feature data if they match and does not accept it if they do not. The tensor processor calculates and assigns feature_row_id, weight_row_id, feature_row_id_local and weight_row_id_local according to information about the input image feature data and the weight data (or convolution kernel). For example, 3-dimensional image feature data has three dimensions in total, namely length, width and depth, while 4-dimensional convolution kernel data has four dimensions in total, namely length, width, depth and the number of convolution kernels. According to the information on the three dimensions of the image feature data and the four dimensions of the convolution kernel data, the tensor processor can calculate the number of PEs to be called and the dimensions of the two-dimensional array formed by the called PEs, and then calculate the feature_row_id assigned to each row of the image feature data, the weight_row_id assigned to each row of the convolution kernel data, and the feature_row_id_local and weight_row_id_local assigned to each PE.
Referring to Fig. 3, Fig. 12 and Fig. 14, the tensor processor configures the 1st row of the weight data through Multicast to the PEs numbered PE1, PE4 and PE7. The tensor processor does so by comparing the weight_row_id of the 1st row of the weight data with the weight_row_id_local of each PE. Only the weight_row_id_local values of the PEs numbered PE1, PE4 and PE7 match the weight_row_id of the 1st row of the weight data, so only the PEs numbered PE1, PE4 and PE7 accept the 1st row of the weight data.
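The ID comparison can be sketched in software as follows. This is an illustrative model, not the hardware of the disclosure: the class `PE`, the helper functions, and in particular the assignment rule (PEs numbered column by column, weight_row_id_local equal to the PE's row r, feature_row_id_local equal to r + c - 1) are assumptions chosen so that the result agrees with the PE numbers quoted in the text for a 3 x 3 array.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    name: str
    weight_row_id_local: int
    feature_row_id_local: int
    weights: list = field(default_factory=list)
    features: list = field(default_factory=list)


def build_array(rows=3, cols=3):
    """Build a rows x cols PE array, numbered column by column (assumed)."""
    pes = []
    for c in range(1, cols + 1):
        for r in range(1, rows + 1):
            number = (c - 1) * rows + r
            pes.append(PE(f"PE{number}",
                          weight_row_id_local=r,
                          feature_row_id_local=r + c - 1))
    return pes


def multicast_weight(pes, weight_row_id, data):
    """Broadcast one weight row; only PEs with a matching local ID accept it."""
    receivers = []
    for pe in pes:
        if pe.weight_row_id_local == weight_row_id:
            pe.weights.append(data)
            receivers.append(pe.name)
    return receivers


def multicast_feature(pes, feature_row_id, data):
    """Broadcast one image row; only PEs with a matching local ID accept it."""
    receivers = []
    for pe in pes:
        if pe.feature_row_id_local == feature_row_id:
            pe.features.append(data)
            receivers.append(pe.name)
    return receivers


array = build_array()
print(multicast_weight(array, 1, [1, 0, 1]))         # weight row 1
print(multicast_feature(array, 3, [0, 1, 1, 0, 1]))  # image row 3
```

With the assumed assignment, weight row 1 is accepted by PE1, PE4 and PE7, and image row 3 is accepted by PE3, PE5 and PE7, which are the same PEs named in the description. A single broadcast therefore reaches every intended PE without addressing each one individually.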
With continued reference to Fig. 14, regarding the configuration parameters of a specific PE, the tensor processor can set the parameter model_set to select whether the operating mode of the PE is RGB operation or fully binary operation, set the parameter Psum_set to select whether the PE receives the operation result of another PE and adds it to its own operation result, set the parameter Pool_en to select whether the PE performs a pooling layer operation after the convolution operation, set the parameter Dout_on to select whether the PE outputs its operation result, and set the parameter Row_on to select whether the PE participates in the operation. The tensor processor can set the parameter K_num to indicate the dimension of the convolution kernel participating in the convolution operation, and set the parameter Psum_num to indicate the number of intermediate results Psum to be accumulated. By configuring the parameters of the PEs in advance, the tensor processor can control the operating mode and working state of the PEs and control the connection relationships and data flow between the PEs. According to other embodiments of the disclosure, the configuration parameters in the PEs can be readjusted according to actual needs, thereby readjusting the connection relationships and data flow between the PEs. Further, because whether a PE should accept input data or weight data is decided by matching the parameters configured in the PE in advance against the parameters assigned to the input data or weight data, the tensor processor can improve the efficiency of data configuration through Multicast. The tensor processor therefore accelerates the tensor computation of a neural network by arranging the PEs as a two-dimensional matrix, optimizes data transmission and operation control by configuring the parameters of the PEs, and adjusts the dimensions of the tensors it can handle by cutting the input matrix and adjusting the PE two-dimensional array, thereby realizing a dynamically adaptable high-speed neural network computing device.
The embodiment shown in Fig. 14 merely illustrates one possible combination of PE configuration parameters and should not be taken to limit the disclosure with respect to other possible ways of configuring the PEs. According to some embodiments of the disclosure, the tensor processor can calculate, according to information about the input image feature data and the convolution kernel, a feature_row_id_local and a feature_column_id_local to be configured to each PE for matching image feature data, as well as a weight_row_id_local and a weight_column_id_local for matching convolution kernels or weight data. Each piece of image feature data to be input to the PEs is assigned a feature_row_id and feature_column_id pair, and each convolution kernel or piece of weight data is assigned a weight_row_id and weight_column_id pair. When image feature data is configured, the PE compares its feature_row_id_local with the feature_row_id of the image feature data and its feature_column_id_local with the feature_column_id of the image feature data. The PE accepts the image feature data only when both comparisons match, and refuses it if at least one comparison does not match. Similarly, when a convolution kernel or weight data is configured, the PE compares its weight_row_id_local with the weight_row_id and its weight_column_id_local with the weight_column_id. The PE accepts the convolution kernel or weight data only when both comparisons match, and refuses it if at least one comparison does not match.
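The paired row and column comparison can be sketched as below. The helper `matching_pes` and the sample ID assignments are hypothetical values chosen for illustration; the disclosure does not give concrete ID values for this variant.

```python
def matching_pes(pe_local_ids, row_id, col_id):
    """Return the PEs whose local (row, column) ID pair matches the data.

    pe_local_ids maps a PE name to its (row_id_local, column_id_local)
    pair. A PE accepts the data only if both comparisons match.
    """
    return [name for name, (r_local, c_local) in pe_local_ids.items()
            if r_local == row_id and c_local == col_id]


# Hypothetical local IDs for four PEs; two PEs share the pair (1, 1).
local_ids = {"PE1": (1, 1), "PE2": (1, 1), "PE3": (1, 2), "PE4": (2, 1)}

print(matching_pes(local_ids, 1, 1))  # both comparisons match for PE1 and PE2
print(matching_pes(local_ids, 2, 2))  # no PE matches both IDs
```

A PE that matches only the row ID or only the column ID is excluded, which mirrors the rule that a single mismatch causes the PE to refuse the data.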
Referring to Fig. 15, in one embodiment the tensor processor performs the binary neural network convolution operation and the thresholding operation. Specifically, the tensor processor binarizes both the feature image data and the weight data and represents them as 0 and 1. The binarized feature image data and weight data therefore each require only a single storage bit (if they were represented as 1 and -1, two storage bits would be needed, one bit storing the sign and the other storing the value), which saves a large amount of storage space. Further, because the binarized feature image data and weight data are represented with a single bit of 0 or 1, the multiplication in the neural network convolution operation can be replaced by an exclusive-NOR (XNOR) logic gate, and the summation in the neural network convolution operation can be replaced by a popcount operation. The popcount operation counts the number of bit positions in the result whose value is 1. For example, if the number of bit positions with value 1 in a 32-bit operand is a, then the number of 0s (which stand for -1 in the 1 and -1 representation) is 32-a, and the final result is a-(32-a)=2a-32.
With continued reference to Fig. 15, taking as an example a convolution operation between 32 bits of feature image data and 32 bits of weight data, the multiply-add operation on the two 32-bit operands can be replaced as follows: an XNOR operation is applied to the two 32-bit operands, and a popcount operation is applied to the result. Specifically, as shown in Fig. 15, one bit of image data and the corresponding bit of weight data are passed through an XNOR logic gate, the results of multiple XNOR logic gates are passed through a 32-bit popcount, and the results output by multiple 32-bit popcounts are then added together to obtain the 16-bit intermediate result Psum of the binary neural network convolution operation. Because gate operations and popcount operations replace the multiply-add operations of the convolution, a large number of floating-point operations are saved, so that the tensor processor realizes accelerated convolution for binary neural networks. The accelerated convolution operation of the binary neural network can be performed by any PE of the PE two-dimensional array of the tensor processor, or by designated PEs. According to some embodiments of the disclosure, the tensor processor can also dispense with gate operations and popcount operations and instead use the floating-point convolution operation of a general neural network.
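The substitution of XNOR and popcount for multiply-add can be checked with a short sketch. The function names are hypothetical, and the encoding assumed here is that bit value 1 stands for +1 and bit value 0 stands for -1, as in the popcount example above.

```python
def binary_dot(x_bits, w_bits, n_bits=32):
    """Dot product of two +1/-1 vectors packed as n_bits-wide integers.

    XNOR marks the positions where the two operands agree (product +1);
    popcount counts them; 2*a - n_bits converts the count to the sum.
    """
    mask = (1 << n_bits) - 1
    xnor = ~(x_bits ^ w_bits) & mask
    a = bin(xnor).count("1")  # popcount
    return 2 * a - n_bits


def reference_dot(x_bits, w_bits, n_bits=32):
    """Same dot product computed element by element with multiply-add."""
    total = 0
    for i in range(n_bits):
        x = 1 if (x_bits >> i) & 1 else -1
        w = 1 if (w_bits >> i) & 1 else -1
        total += x * w
    return total


x, w = 0xF0F0F0F0, 0xFFFF0000
print(binary_dot(x, w), reference_dot(x, w))
```

Both functions return the same value for any pair of operands, while `binary_dot` uses only bitwise logic and a bit count, which is the saving the description attributes to the XNOR and popcount data path.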
The tensor processor shown in Fig. 15 applies a thresholding operation to the intermediate result Psum to improve operation precision. Specifically, the tensor processor compares the intermediate result Psum of the convolution operation with a trained threshold value, and the output is either the 16-bit intermediate result Psum or a 1-bit value of 0 or 1 after normalization.
Suppose the result after convolution is a, and the batch normalization function is BatchNorm(a) = γ(a-μ)/σ + B, where μ is the mean of the vector, σ is the variance, γ is the scale coefficient, and B is the bias. The binarization operation, that is, the sign function operation, is then applied:
Binarization:
$$\mathrm{Binarize}(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
To simplify the computation, the batch normalization operation and the binarization operation are merged and simplified into a single thresholding operation. BatchNorm(a)=0 is the dividing point: when BatchNorm(a) is greater than or equal to 0 the result is 1, and otherwise the result is 0. Setting BatchNorm(a)=0 gives a=μ-(Bσ)/γ. Denote this value of a as Tk, so that BatchNorm(Tk)=γ(Tk-μ)/σ+B=0. Then, provided γ/σ is positive so that BatchNorm(a) increases with a, the result is 1 when the convolution result is greater than or equal to Tk, and 0 otherwise.
Suppose the training result of batch normalization is γ=4, μ=5, σ=8, B=2. Then Tk=μ-(Bσ)/γ=5-(2*8)/4=1.
Calculation before merging and simplification: when the convolution result is 0, substituting into the batch normalization formula gives 4*(0-5)/8+2=-0.5, which is less than 0, so the result is 0. When the convolution result is 2, substituting into the batch normalization formula gives 4*(2-5)/8+2=0.5, which is greater than 0, so the result is 1.
Calculation after merging and simplification: the threshold is Tk=1. When the convolution result is 0, it is less than Tk, so the result is 0. When the convolution result is 2, it is greater than Tk, so the result is 1. The output is therefore the same before and after simplification.
Therefore, merging and simplifying the batch normalization operation and the binarization operation into a thresholding operation yields the same result as before simplification, while the thresholding operation saves a large amount of floating-point computing resources. After the training of the binary convolutional neural network is complete, Tk is derived from the formula Tk=μ-(Bσ)/γ, and the convolution result then only needs to be compared with the value of Tk. Because the tensor processor compares the convolution result with the trained threshold Tk, it does not need to perform a large number of floating-point operations, which both improves operation precision and shortens the inference time of the neural network. The thresholding operation can be performed by any PE of the PE two-dimensional array of the tensor processor, or by designated PEs.
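The equivalence of the two computations can be checked numerically. The sketch below uses the example values γ=4, μ=5, σ=8, B=2 from the description; the function names are hypothetical, and the check assumes γ/σ is positive, since otherwise the direction of the comparison with Tk would flip.

```python
def batchnorm(a, gamma, mu, sigma, bias):
    """Batch normalization as written in the description."""
    return gamma * (a - mu) / sigma + bias


def binarize(x):
    """Sign-style binarization: 1 when x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def threshold_tk(gamma, mu, sigma, bias):
    """Threshold at which BatchNorm(a) crosses zero: Tk = mu - bias*sigma/gamma."""
    return mu - bias * sigma / gamma


def thresholded(a, tk):
    """Merged operation: a single comparison with Tk (assumes gamma/sigma > 0)."""
    return 1 if a >= tk else 0


gamma, mu, sigma, bias = 4, 5, 8, 2
tk = threshold_tk(gamma, mu, sigma, bias)
print(tk)
for a in (0, 1, 2):
    before = binarize(batchnorm(a, gamma, mu, sigma, bias))
    after = thresholded(a, tk)
    print(a, before, after)
```

For every integer convolution result the two columns agree, so at inference time only the comparison with the precomputed Tk is needed.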
According to some embodiments of the disclosure, the tensor processor replaces the multiply-add operations of the binary neural network convolution with gate operations and popcount operations, and then uses the thresholding operation to compare the convolution result with a trained threshold value to improve operation precision, thereby realizing a binary neural network tensor computing device that is both fast and precise. In other embodiments, the tensor processor further uses the ping-pong controller to perform operations while the PE two-dimensional array is being configured. In other embodiments, the tensor processor further configures the weight values and threshold values to the PEs in advance and only then starts to input the feature image data, realizing the state algorithm and fast processing of input data. In other embodiments, the tensor processor further configures the PEs so that they perform pooling layer (pooling) processing themselves. In other embodiments, the tensor processor further uses full connection so that the intermediate result Psum is not read and written through the ping-pong controller but is instead passed on directly between the PEs of the PE two-dimensional array. In other embodiments, the tensor processor further handles input tensor data of different dimensions by cutting the input data matrix and adjusting the dimensions of the PE two-dimensional array. In other embodiments, the tensor processor further optimizes data transmission by configuring data through Multicast.
According to some embodiments of the disclosure, the trained threshold values can be obtained through a predetermined number of iterations, through a convergent regression algorithm, through comparison with labeled test images, or by similar methods. According to other embodiments of the disclosure, the trained threshold values can be obtained by general methods of neural network training and machine learning.
Referring to Fig. 16, in one embodiment the tensor processor includes a matrix 500 of the PE two-dimensional array, a control module 502, a weight data buffer 504, a threshold data buffer 506, an image data buffer 508 and an input/output bus 510. According to some embodiments of the disclosure, the ping-pong controller 512 includes, but is not limited to, the control module 502, the weight data buffer 504, the threshold data buffer 506 and the image data buffer 508. According to other embodiments of the disclosure, the ping-pong controller 512 includes the control module 502, the weight data buffer 504 and the image data buffer 508. The input/output bus 510 receives data from outside, for example tensor matrix data of third or higher order. According to the type and purpose of the received data (such as weight data, threshold data or image data), the input/output bus 510 writes the received data into the weight data buffer 504, the threshold data buffer 506 and the image data buffer 508 respectively. The weight data buffer 504, the threshold data buffer 506 and the image data buffer 508 transfer the data they each store to the control module 502 and read the weight data, threshold data and image data, respectively, from the control module 502. The input data of this embodiment takes image data as an example, but the input data is not limited to image data; the input data can also be voice data, a data type suitable for target recognition, or another data type. In other embodiments, the image data buffer 508 can be replaced with an input data buffer 514 (not shown). The input data buffer 514 receives various types of input data from the input/output bus, including images, sound or other data types. The input data buffer 514 also transfers the data it stores to the control module 502 and reads the corresponding data from the control module 502.
With continued reference to Fig. 16, the control module 502 determines the number of PEs to be called according to information about the weight data and the image data, then determines the connection relationships and data flow between the called PEs, and thereby constructs the matrix 500 of the PE two-dimensional array. The control module 502 can also determine the number of PEs to be called and the connection relationships and data flow between the called PEs according to the information about the weight data alone, according to the information about the image data alone, or solely according to a program configured in the control module 502 in advance. The control module 502 transfers the weight data, threshold data and image data to the constructed matrix 500 of the PE two-dimensional array. According to the dimensions of the image data matrix and the dimensions of the matrix of the PE two-dimensional array, the control module 502 can adjust the PEs in the matrix 500 of the PE two-dimensional array or cut the image data matrix. After cutting the image data matrix, the control module 502 can match weight data or convolution kernels with image data in different ways, including applying the same convolution kernel to different image data inputs, pairing the same image data input with different convolution kernels, and feeding the two different image data inputs obtained by cutting to two different convolution kernels respectively.
With continued reference to Fig. 16, the control module 502 can also adjust, once or repeatedly according to actual needs, the connection relationships and data flow between the PEs of the matrix 500 of the PE two-dimensional array. The control module 502 can use Multicast to configure the same data (weight data, threshold data or image data) into multiple PEs within one instruction cycle, and the PEs configured with the same data can be located in the same row or in the same column of the matrix 500 of the PE two-dimensional array, or can be any combination of positions in the matrix 500 of the PE two-dimensional array. The control module 502 can configure convolution kernels or weight data for the convolution operation and can also configure threshold data. The threshold data used for configuration can be trained threshold values. The tensor processor can use a state algorithm: after the trained threshold values have been configured to the matrix 500 of the PE two-dimensional array, the convolution operation is performed on the image data matrix and the weight data matrix, achieving a faster operation speed.
Referring to Fig. 17, in one embodiment the ping-pong controller of the tensor processor includes PE configuration parameter registers 600. The PE configuration parameter registers 600 are used to store configuration parameters such as weight data or threshold values. Before performing a convolution operation of any dimension, the ping-pong controller needs to read the configuration parameters and configure the PEs. The ping-pong controller can perform operations while configuring, and for this purpose a CONSUMER pointer 602 and a PRODUCER pointer 604 are kept resident. The CONSUMER pointer 602 is a read-only register field that the tensor processor can check to determine which ping-pong group the data path has selected, while the PRODUCER pointer 604 is controlled entirely by the tensor processor. In other embodiments, the PE configuration parameter registers 600 are also used to store the various parameters for configuring the PEs, such as the PE configuration parameters shown in Fig. 15.
The input data in the embodiments of the disclosure takes image data as an example, but the input data is not limited to image data. The input data can also be voice data, a data type suitable for target recognition, or another data type. The input data in the embodiments of the disclosure takes tensor data of third or higher order as an example, but the input data is not limited to tensor data of third or higher order. The input data can also be tensor data of second order, first order or zeroth order.
According to some embodiments of the disclosure, the ping-pong controller can perform operations while configuring. That is, new convolution kernels (or weight data) and/or trained threshold values can be configured while the convolution operation or fully connected operation is being performed on the input data matrix, thereby increasing the processing speed of the tensor processor.
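Configuring while computing can be pictured as a two-group double buffer. The sketch below is a simplified software analogy, not the register layout of the disclosure: the class `PingPongConfig`, its method names and the swap rule are assumptions, with only the consumer and producer roles borrowed from the CONSUMER and PRODUCER pointers described for Fig. 17. Sequential Python cannot show the hardware overlap itself; it shows only that writing the next configuration does not disturb the one in use.

```python
class PingPongConfig:
    """Two configuration groups: one in use, one being prepared."""

    def __init__(self):
        self.groups = [None, None]
        self.consumer = 0  # group the data path currently computes with
        self.producer = 1  # group that receives the next configuration

    def configure_next(self, params):
        """Write the next configuration into the group not in use."""
        self.groups[self.producer] = params

    def swap(self):
        """Make the prepared group active and free the other for writing."""
        self.consumer, self.producer = self.producer, self.consumer

    def active(self):
        """Return the configuration the data path is computing with."""
        return self.groups[self.consumer]


cfg = PingPongConfig()
cfg.configure_next({"weights": "W0", "threshold": 1})
cfg.swap()                             # W0 becomes the active configuration
cfg.configure_next({"weights": "W1", "threshold": 3})
print(cfg.active())                    # still W0 while W1 is being prepared
cfg.swap()
print(cfg.active())                    # now W1
```

The point of the design is that the active group is never written while it is in use, so the next layer's kernels and thresholds can be loaded during the current computation.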
According to some embodiments of the disclosure, the configuration of the input data, the weight data and the threshold values are independent of one another and can be carried out simultaneously.
According to some embodiments of the disclosure, by adjusting the configuration, the PE two-dimensional array can perform fully connected operations or convolution operations without reading and writing the intermediate result Psum, which is instead handled directly within the PE two-dimensional array. Switching between fully connected operations and convolution operations can also be achieved by adjusting the configuration.
According to some embodiments of the disclosure, by configuring the PEs, the pooling layer (pooling) operation of the neural network model can be combined with the convolution operation, that is, the configured PEs perform the pooling operation themselves.
According to some embodiments of the disclosure, the PEs to be called can be selected flexibly, and the connection relationships and data flow between the PEs can be adjusted through configuration so as to set up, according to actual needs, a PE two-dimensional array with a specific configuration (including control of the data flow between the PEs). According to the PE two-dimensional array that has been set up, the input data matrix can further be divided, or the PEs that are not used can be placed in a disabled state.
According to some embodiments of the disclosure, the tensor processor can use a state algorithm: after the trained threshold values have been configured to the PE two-dimensional array, the convolution operation is performed on the input data matrix and the weight data matrix, achieving a faster operation speed.
According to some embodiments of the disclosure, the PEs used in the tensor processor can be common neural network processors such as FPGAs and GPUs, or can be specially designed processors, as long as they meet the minimum functional requirements for implementing the various embodiments of the disclosure.
According to some embodiments of the disclosure, the tensor processor is used for a binary convolutional neural network, and the PEs of the PE two-dimensional array of the tensor processor perform the convolution operation of the binary convolutional neural network on the input image data and the configured weight data. In other embodiments, the PEs can perform a fully connected operation on the input image data and the configured weight data. In still other embodiments, the tensor processor can also be used for non-binary convolutional neural networks, for example for inference of a neural network whose data type is INT4, INT8, INT16 or INT32, in which case the PEs perform on the input image data and the configured weight data a convolution operation or fully connected operation corresponding to the data type of the neural network.
The technical features of the embodiments described above can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The embodiments described above shall not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

1. A tensor processor, comprising:
a ping-pong controller that receives an input tensor; and
a plurality of processing engines connected to the ping-pong controller;
wherein the ping-pong controller calculates, according to the dimensions of the input tensor, the number of processing engines to be called, calls that number of processing engines to form a processing engine two-dimensional array, configures the connection relationships and data flow between the processing engines in the processing engine two-dimensional array, and configures the input tensor and a weight tensor to the processing engine two-dimensional array; and
the processing engine two-dimensional array performs a convolution operation on the input tensor and the weight tensor to obtain an output result, and the output result is transferred to the ping-pong controller.
2. The tensor processor according to claim 1, wherein the ping-pong storage device calculates the number according to the dimensions of the input tensor and the dimensions of the weight tensor.
3. The tensor processor according to claim 1, wherein the ping-pong storage device configures the connection relationships and data flow between the processing engines in the processing engine two-dimensional array according to the dimensions of the input tensor and the dimensions of the weight tensor.
4. The tensor processor according to claim 1, wherein the ping-pong storage device partitions the input tensor according to the dimensions of the processing engine two-dimensional array.
5. The tensor processor according to claim 1, wherein the ping-pong storage device sets a portion of the processing engines in the processing engine two-dimensional array to a standby state according to the dimensions of the input tensor and the dimensions of the processing engine two-dimensional array.
6. The tensor processor according to claim 3, wherein the ping-pong storage device adjusts, according to the dimensions of the input tensor and the dimensions of the weight tensor, the connection relationships and data flow between the processing engines in the already configured processing engine two-dimensional array.
7. The tensor processor according to claim 2, wherein the ping-pong storage device calculates, according to the dimensions of a new input tensor and the dimensions of a new weight tensor, a new number different from the number, and calls the new number of processing engines to form a new processing engine two-dimensional array.
8. The tensor processor according to claim 1, wherein the ping-pong storage device updates the configured weight tensor of a first portion of the processing engines in the processing engine two-dimensional array, keeps the configured weight tensor of a second portion of the processing engines in the processing engine two-dimensional array unchanged, and updates the configured input tensor of the second portion of the processing engines.
9. The tensor processor according to claim 1, wherein the ping-pong storage device configures each processing engine of the processing engine two-dimensional array to carry out a pooling operation by itself.
10. The tensor processor according to claim 1, wherein the processing engine two-dimensional array has two dimensions, one of which is N, where N is a positive integer greater than or equal to 2; the weight tensor is divided into N groups of weight data, each of the N groups containing the same amount of weight data; the input tensor is divided into M groups of input data, where M is a positive integer greater than or equal to N, each of the M groups containing the same amount of input data; the ping-pong storage device configures the M groups of input data and the N groups of weight data to the processing engine two-dimensional array; and each processing engine of the processing engine two-dimensional array receives one group of input data and one group of weight data and performs a convolution operation on the input data and the weight data configured to that processing engine to obtain an intermediate result.
11. The tensor processor according to claim 10, wherein the dimensions of the processing engine two-dimensional array are 3 × 3, the weight tensor is divided into 3 groups of weight data, each group having 3 weight data, and the input tensor is divided into 5 groups of input data, each group having 5 input data.
12. The tensor processor according to claim 10, wherein the processing engine two-dimensional array reads and writes the intermediate result through the ping-pong storage device.
13. The tensor processor according to claim 10, wherein the processing engine two-dimensional array has a fully connected structure, and the intermediate result is accumulated and passed inside the processing engine two-dimensional array without being read from or written to the ping-pong storage device.
14. The tensor processor according to claim 1, wherein the ping-pong storage device configures the input tensor and the weight tensor to the processing engine two-dimensional array by multicast transmission.
15. The tensor processor according to claim 1, wherein the ping-pong storage device assigns an input data local ID and a weight data local ID to each processing engine of the processing engine two-dimensional array, assigns an input data ID to each component of the input tensor, and assigns a weight data ID to each component of the weight tensor; each processing engine of the processing engine two-dimensional array receives the matching components of the input tensor by comparing its input data local ID with the input data ID, and receives the matching components of the weight tensor by comparing its weight data local ID with the weight data ID.
16. The tensor processor according to claim 10, wherein the input tensor and the weight tensor are both binarized, and each processing engine of the processing engine two-dimensional array receives one group of binarized input data and one group of binarized weight data and performs a binarized neural network convolution operation on the input data and the weight data configured to that processing engine to obtain an intermediate result of the binarized neural network convolution operation.
17. The tensor processor according to claim 16, wherein the multiplication operation of the binarized neural network convolution operation is implemented by an XNOR gate operation, and the addition operation of the binarized neural network convolution operation is implemented by counting the number of 1s.
18. The tensor processor according to claim 16, wherein the intermediate result of the binarized neural network convolution operation is compared with a trained threshold;
when the intermediate result of the binarized neural network convolution operation is greater than the threshold, the intermediate result is transmitted to the ping-pong storage device;
when the intermediate result of the binarized neural network convolution operation is less than or equal to the threshold, a regularization operation is performed on the intermediate result, and the regularized result is transmitted to the ping-pong storage device; and
the ping-pong storage device configures the threshold to the processing engine two-dimensional array.
19. The tensor processor according to claim 18, wherein the ping-pong storage device configures the input tensor, the weight tensor, and the threshold to the processing engine two-dimensional array independently of one another.
20. The tensor processor according to claim 18, wherein the comparison of the intermediate result of the binarized neural network convolution operation with the trained threshold is implemented by simplifying and merging the batch processing operation and the binarization operation.
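As a non-limiting sketch of the merging recited in claims 18 and 20, the following example folds a batch normalization step and a sign-based binarization into a single comparison against a precomputed threshold. Reading the "batch processing operation" as batch normalization is an assumption, as are the parameter names (`gamma`, `beta`, `mean`, `var`, `eps`) and the function names; none of them are taken from the claims.

```python
import math


def fold_threshold(gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (x - mean) / sqrt(var + eps) + beta into one threshold.

    Returns (threshold, gamma_positive). For gamma > 0, y >= 0 exactly when
    x >= threshold; for gamma < 0 the inequality flips to x <= threshold.
    """
    if gamma == 0:
        raise ValueError("gamma must be non-zero to fold into a threshold")
    std = math.sqrt(var + eps)
    return mean - beta * std / gamma, gamma > 0


def binarize_folded(x, threshold, gamma_positive):
    """One comparison replaces the normalization followed by the sign test."""
    return int(x >= threshold) if gamma_positive else int(x <= threshold)


def binarize_reference(x, gamma, beta, mean, var, eps=1e-5):
    """Unfused reference: batch normalization, then sign binarization."""
    y = gamma * (x - mean) / math.sqrt(var + eps) + beta
    return int(y >= 0)


# Hypothetical trained parameters, chosen so the arithmetic is exact.
threshold, positive = fold_threshold(gamma=0.5, beta=-1.0, mean=3.0, var=4.0, eps=0.0)
outputs = [binarize_folded(x, threshold, positive) for x in range(15)]
```

Because the threshold depends only on trained parameters, it can be computed once ahead of time and configured to the array, so each intermediate result needs only a comparison at run time.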
CN201910301388.1A 2019-04-15 2019-04-15 Tensor processor Active CN110033085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910301388.1A CN110033085B (en) 2019-04-15 2019-04-15 Tensor processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910301388.1A CN110033085B (en) 2019-04-15 2019-04-15 Tensor processor

Publications (2)

Publication Number Publication Date
CN110033085A true CN110033085A (en) 2019-07-19
CN110033085B CN110033085B (en) 2021-08-31

Family

ID=67238496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910301388.1A Active CN110033085B (en) 2019-04-15 2019-04-15 Tensor processor

Country Status (1)

Country Link
CN (1) CN110033085B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328647A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Bit width selection for fixed point neural networks
US20160350647A1 (en) * 2015-05-26 2016-12-01 International Business Machines Corporation Neuron peripheral circuits for neuromorphic synaptic memory array based on neuron models
CN108009627A (en) * 2016-10-27 2018-05-08 谷歌公司 Neural network instruction set architecture
CN106563645A (en) * 2016-11-01 2017-04-19 上海师范大学 Intelligent piezoelectric film sensor sorting method based on tensor decomposition
US20180165577A1 (en) * 2016-12-13 2018-06-14 Google Inc. Performing average pooling in hardware
CN106952291A (en) * 2017-03-14 2017-07-14 哈尔滨工程大学 The scene flows vehicle flowrate and speed-measuring method driven based on 3-dimensional structure tensor Anisotropic-Flow
CN107153873A (en) * 2017-05-08 2017-09-12 中国科学院计算技术研究所 A kind of two-value convolutional neural networks processor and its application method
CN108875956A (en) * 2017-05-11 2018-11-23 广州异构智能科技有限公司 Primary tensor processor
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
WO2019069304A1 (en) * 2017-10-06 2019-04-11 DeepCube LTD. System and method for compact and efficient sparse neural networks
CN107944556A (en) * 2017-12-12 2018-04-20 电子科技大学 Deep neural network compression method based on block item tensor resolution
CN108009634A (en) * 2017-12-21 2018-05-08 美的集团股份有限公司 A kind of optimization method of convolutional neural networks, device and computer-readable storage medium
CN108921049A (en) * 2018-06-14 2018-11-30 华东交通大学 Tumour cell pattern recognition device and equipment based on quantum gate transmission line neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
NORMAN P. JOUPPI et al.: "In-Datacenter Performance Analysis of a Tensor Processing Unit", 《ISCA '17: PROCEEDINGS OF THE 44TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE》 *
周思楚: "基于张量理论的短时交通流预测算法", 《中国优秀硕士学位论文全文数据库_工程科技Ⅱ辑》 *
梁爽: "可重构神经网络加速器设计关键技术研究", 《中国博士学位论文全文数据库_信息科技辑》 *
胡惠贤: "基于异构平台的视频图像识别算法研究与实现", 《中国优秀硕士学位论文全文数据库_信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114026554A (en) * 2019-07-31 2022-02-08 三星电子株式会社 Processor and control method thereof
CN114026554B (en) * 2019-07-31 2024-05-24 三星电子株式会社 Processor and control method thereof
US20210150313A1 (en) * 2019-11-15 2021-05-20 Samsung Electronics Co., Ltd. Electronic device and method for inference binary and ternary neural networks
US12039430B2 (en) * 2019-11-15 2024-07-16 Samsung Electronics Co., Ltd. Electronic device and method for inference binary and ternary neural networks
US20210319290A1 (en) * 2020-04-09 2021-10-14 Apple Inc. Ternary mode of planar engine for neural processor
US11604975B2 (en) * 2020-04-09 2023-03-14 Apple Inc. Ternary mode of planar engine for neural processor
JP7582593B2 (en) 2021-04-26 2024-11-13 テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド Image data processing method and device, electronic device, and computer program
CN117993443A (en) * 2024-04-03 2024-05-07 腾讯科技(深圳)有限公司 Model processing method, apparatus, computer device, storage medium, and program product
CN117993443B (en) * 2024-04-03 2024-08-30 腾讯科技(深圳)有限公司 Model processing method, apparatus, computer device, storage medium, and program product

Also Published As

Publication number Publication date
CN110033085B (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN110033086A (en) Hardware accelerator for neural network convolution algorithm
CN110046705A (en) Device for convolutional neural networks
CN110059805A (en) Method for two value arrays tensor processor
CN110033085A (en) Tensor processor
CN105930902B (en) A kind of processing method of neural network, system
CN105956659B (en) Data processing equipment and system, server
EP3364306B1 (en) Parallel processing of reduction and broadcast operations on large datasets of non-scalar data
CN110390384A (en) A kind of configurable general convolutional neural networks accelerator
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN107609641A (en) Sparse neural network framework and its implementation
CN106951395A (en) Towards the parallel convolution operations method and device of compression convolutional neural networks
CN106203621A (en) The processor calculated for convolutional neural networks
CN107506823A (en) A kind of construction method for being used to talk with the hybrid production style of generation
CN108470009A (en) Processing circuit and its neural network computing method
CN110175506A (en) Pedestrian based on parallel dimensionality reduction convolutional neural networks recognition methods and device again
CN110300944A (en) Image processor with configurable number of active cores and supporting internal networks
CN110414672B (en) Convolution operation method, device and system
Song et al. Design and implementation of convolutional neural networks accelerator based on multidie
CN205983537U (en) Data processing device and system, server
Camuñas-Mesa et al. Low-power hardware implementation of SNN with decision block for recognition tasks
US20230316057A1 (en) Neural network processor
CN107894957A (en) Memory data towards convolutional neural networks accesses and zero insertion method and device
CN113762480A (en) Time sequence processing accelerator based on one-dimensional convolutional neural network
CN111971692A (en) Convolutional neural network
CN205827367U (en) Data processing equipment and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Xu Zhe
Inventor after: Ding Xueli
Inventor after: Chen Baigang
Inventor before: Chen Baigang
Inventor before: Xu Zhe
Inventor before: Ding Xueli
TA01 Transfer of patent application right
Effective date of registration: 20200629
Address after: Room 1202-1204, No.8, Jingang Avenue, Nansha street, Nansha District, Guangzhou City, Guangdong Province
Applicant after: NOVUMIND Ltd.
Address before: 100191 9th floor 908 Shining Building, 35 College Road, Haidian District, Beijing
Applicant before: NOVUMIND Ltd.
GR01 Patent grant