Detailed Description of Embodiments
The accompanying drawings and the following description are provided by way of example only. It should be appreciated from the following discussion that alternative embodiments of the structures and methods disclosed herein will readily be recognized as viable alternatives that may be employed without departing from the principles of what is claimed.
Referring to Fig. 1, a tensor processor according to one embodiment includes an input/output bus 100, a plurality of PEs (processing elements), and a ping-pong controller 102 connected to the plurality of PEs. The input/output bus 100 receives input data from the outside (for example, image data expressed as a tensor of order three or higher, or a feature tensor containing image feature values), transfers the input data to the ping-pong controller 102, and outputs output data to the outside after receiving it from the ping-pong controller 102. The input/output bus 100 may also receive convolution kernel data from the outside (the convolution kernel data may be a set of weight values, a single weight value, or a convolution kernel tensor). In some other embodiments, the convolution kernel data may come from the tensor processor itself; for example, the convolution kernel data may be pre-stored in a PE configuration parameter register (not shown) of the ping-pong controller 102. Based on information about the input data and the convolution kernel data (such as the dimensions of the input data and the dimensions of the convolution kernel data), the ping-pong controller 102 determines the number of PEs to be called and the dimensions of the two-dimensional array formed by the called PEs, and then calls all or a part of the plurality of PEs connected to it to form a two-dimensional PE array. In the embodiment shown in Fig. 1, the ping-pong controller 102 has determined a two-dimensional array of 4 rows and 4 columns composed of 16 PEs (PEs that are not called, if any, are not shown). Further, the ping-pong controller 102 configures the connection relationships and data flows among these 16 PEs (for example, as shown in Fig. 1, operation results are passed vertically from top to bottom to the PEs of the bottom row). In other embodiments, the ping-pong controller 102 determines that N times M PEs are needed (N and M being positive integers greater than or equal to 2) to form a two-dimensional PE array of N rows and M columns, M rows and N columns, or other dimensions, and configures the connection relationships and data flows among the N times M PEs (including but not limited to the data flow of image data, the data flow of weight data, and the data flow of operation results).
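By way of illustration only, the following minimal Python sketch shows one plausible sizing rule, consistent with the 5 x 5 image and 3 x 3 kernel example described later with reference to Figs. 3 to 6; the function name, the mapping of kernel rows to array rows, and the orientation are assumptions for illustration, not the controller's actual algorithm.

```python
def plan_pe_array(image_h: int, image_w: int, kernel_h: int, kernel_w: int):
    """Hypothetical sizing rule: one PE row per kernel row and one PE
    column per output row, as in the 3x3 array of Figs. 3-6."""
    out_h = image_h - kernel_h + 1  # output rows of a valid convolution
    out_w = image_w - kernel_w + 1  # output columns, computed inside each PE
    return kernel_h, out_h          # (PE rows, PE columns); orientation assumed

# A 5x5 image with a 3x3 kernel yields a 3x3 PE array, matching Figs. 3-6
assert plan_pe_array(5, 5, 3, 3) == (3, 3)
```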
With continued reference to Fig. 1, the ping-pong controller 102 also configures the convolution kernel data to the PEs in the two-dimensional PE array and transfers the input data to the PEs in the array. A specific PE in the array performs a convolution operation on the input data transferred to it and the convolution kernel data configured to it, and obtains an operation result. In particular, configuring the convolution kernel data to a PE occurs before the input data is transferred to that PE; that is, the convolution kernel data is configured first, and only then does transfer of the input data begin. Because convolution kernel data, or weight values, are highly reusable in convolution operations, pre-configuring the convolution kernel data or weight values allows input data such as image feature values to be streamed continuously into the PE array for computation, thereby increasing the amount of data the tensor processor handles per batch. In other embodiments, configuring the convolution kernel data may also occur simultaneously with, or after, transferring the input data. According to the configured connection relationships and data flows among the PEs, the ping-pong controller 102 selects the operation results of certain PEs as the output result. In the embodiment shown in Fig. 1, the ping-pong controller 102 selects the operation results of the 4 PEs of the bottom row as the output result. In some other embodiments, depending on actual needs, the PE array may have different dimensions or architectures, the connection relationships and data flows among the PEs may be configured differently, and any PE in the array may be designated to provide the output result under some specific architecture. In addition, according to some embodiments of the present disclosure, the ping-pong controller 102 can transfer input data for computation while configuring a new convolution kernel (or weight data), thereby accelerating the processing speed of the tensor processor. Specifically, across the whole PE array, the ping-pong controller 102 can configure new convolution kernel data to one part of the PEs while transferring input data to another part. That is, while the convolution kernel data of one part of the PEs is being updated, the convolution kernel data of the other part can be kept unchanged so that those PEs continue computing, thereby accelerating the processing speed of the tensor processor.
With continued reference to Fig. 1, the operation result obtained by a PE in the array after performing a convolution operation on the input data transferred to it and the convolution kernel data (or weight data) configured to it may be an intermediate result Psum of a neural network inference operation, or it may be the regularized result obtained by applying regularization to the intermediate result Psum. For a binarized convolutional neural network (Binary CNN), the operation result of a PE may be the intermediate result Psum, or the 1-bit result of 1 or 0 obtained after regularization. In some embodiments, the intermediate result Psum obtained by a PE is transferred to the ping-pong controller 102. In other embodiments, the PE array implements a fully-connected layer (fully-connected), and the intermediate results Psum obtained by the PEs do not need to be transferred to the ping-pong controller 102. That is, intermediate results Psum are not read and written through the ping-pong controller 102; instead, the accumulate-and-forward passing of intermediate results Psum is completed directly between the PEs of the array. A PE array configured as a fully-connected layer supports both fully-connected operations and convolution operations. Moreover, because such an array does not read and write intermediate results Psum through the ping-pong controller 102 but completes the computation inside the array, latency is reduced, which is beneficial for high-speed inference. The ping-pong controller 102 can adjust the connection relationships and data flows among the PEs according to actual needs, so as to control whether intermediate results Psum are read and written through the ping-pong controller 102, and thereby switch between fully-connected operations and convolution operations. In addition, according to some embodiments of the present disclosure, by configuring the PEs, the pooling layer (pooling) operation of a neural network model can be fused with the convolution operation; that is, a configured PE carries out the pooling operation itself.
Referring to Fig. 2, the data flows of the two-dimensional PE array of a tensor processor according to one embodiment include, but are not limited to, the data flow of image data, the data flow of weight data, and the data flow of operation results. The embodiment shown in Fig. 2 provides a PE array of 3 rows and 3 columns composed of 9 PEs. Image data enters through the leftmost column of the array and is then passed, one PE at a time, to the adjacent PE on the right in the same row; that is, from left to right, from the first column to the second column and from the second column to the third column. Weight data enters through the leftmost column and is then passed diagonally, one PE at a time, to the nearest PE above and to the right, which is in neither the same row nor the same column. The operation result computed by each PE is passed vertically, one PE at a time, to the nearest PE below in the same column; that is, from top to bottom, from the first row to the second row and from the second row to the third row. The data flows of the PE array shown in Fig. 2 merely illustrate that the tensor processor can separately control the data flow of image data, the data flow of weight data, and the data flow of operation results. When the tensor processor configures the connection relationships and data flows among the PEs, it can configure the data flow of image data, the data flow of weight data, and the data flow of operation results independently. The embodiment shown in Fig. 2 merely illustrates one way of configuring the data flows among the PEs of a 3-row, 3-column PE array, and should not be construed as limiting the present disclosure to the exclusion of other possible configurations of the PE array.
The PE arrays shown in Fig. 1 and Fig. 2 and in other embodiments of the present disclosure merely illustrate the connection relationships and data flows among PEs, and should not be construed as limiting the present disclosure to the exclusion of other possible configurations of the PE array. The relative relationships such as up, down, left, and right among PEs mentioned in the various embodiments of the present disclosure, positional descriptions such as the PE of a particular row and column, and expressions such as the leftmost column or the bottom row, are used only for convenience in describing the connection relationships and data flows among the PEs; they should not be construed as requiring the PEs to be arranged strictly according to the stated relative and positional relationships, much less as limiting the present disclosure to the exclusion of other possible configurations of the PE array. Likewise, the PE arrays shown in the several drawings of the present disclosure have various data flows indicated by arrows; these arrows are only for convenience in describing the data flows among the PEs and should not be construed as limiting the present disclosure to the exclusion of other possible configurations of the PE array.
Referring to Fig. 3 to Fig. 6, Fig. 3 shows a tensor processor of another embodiment, and Fig. 4 to Fig. 6 illustrate a case where the input image data matrix has 5 rows and 5 columns and the weight data matrix has 3 rows and 3 columns. The tensor processor includes a two-dimensional array of 3 rows and 3 columns composed of 9 PEs, numbered PE1 through PE9. Fig. 3 also shows the connection relationships and data flows among the 9 PEs, including the data flow of image data, the data flow of weight data, and the data flow of operation results. The image data is transferred to the corresponding PEs in the manner shown in Fig. 3. Specifically, each PE corresponds to one row of image data: row 1 of the image data is transferred to the PE numbered PE1; row 2 is transferred to the PEs numbered PE2 and PE4; row 3 is transferred to the PEs numbered PE3, PE5, and PE7; row 4 is transferred to the PEs numbered PE6 and PE8; and row 5 is transferred to the PE numbered PE9. The weight data is configured to the corresponding PEs in the manner shown in Fig. 3. Specifically, each PE corresponds to one row of weight data: row 1 of the weight data is configured to the PEs numbered PE1, PE4, and PE7; row 2 is configured to the PEs numbered PE2, PE5, and PE8; and row 3 is configured to the PEs numbered PE3, PE6, and PE9. The operation results are accumulated and forwarded in the manner shown in Fig. 3. Specifically, the operation result of the PE numbered PE1 is passed to the PE numbered PE2 and accumulated, then passed on to the PE numbered PE3, finally yielding row 1 of the convolution output. The operation result of the PE numbered PE4 is passed to the PE numbered PE5 and accumulated, then passed on to the PE numbered PE6, finally yielding row 2 of the convolution output. The operation result of the PE numbered PE7 is passed to the PE numbered PE8 and accumulated, then passed on to the PE numbered PE9, finally yielding row 3 of the convolution output.
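To make this mapping concrete, the following sketch emulates the accumulate-and-forward dataflow of Figs. 3 to 6 in plain Python/NumPy and checks it against a direct 2-D valid convolution; the function names and the sample image and kernel values are hypothetical, and the sketch models only the arithmetic of the mapping, not the cycle-level behavior of the PEs.

```python
import numpy as np

def row_conv(image_row, weight_row):
    """1-D valid convolution (as correlation) of one image row with one
    weight row: the per-PE computation in the mapping of Figs. 3-6."""
    k = len(weight_row)
    return np.array([image_row[i:i + k] @ weight_row
                     for i in range(len(image_row) - k + 1)])

def pe_array_conv(image, weights):
    """Emulates the 3x3 PE array: in the chain producing output row r, the
    PE holding weight row k receives image row r + k; partial sums are
    passed along the chain and accumulated (e.g. PE1 -> PE2 -> PE3)."""
    out_h = image.shape[0] - weights.shape[0] + 1
    out_w = image.shape[1] - weights.shape[1] + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):                    # one PE chain per output row
        psum = np.zeros(out_w)
        for k in range(weights.shape[0]):     # accumulate-and-forward
            psum += row_conv(image[r + k], weights[k])
        out[r] = psum
    return out

image = np.arange(25.0).reshape(5, 5)         # hypothetical 5x5 input
kernel = np.array([[1.0, 0.0, -1.0]] * 3)     # hypothetical 3x3 kernel
direct = np.array([[np.sum(image[i:i + 3, j:j + 3] * kernel)
                    for j in range(3)] for i in range(3)])
assert np.allclose(pe_array_conv(image, kernel), direct)
```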
Referring to Fig. 3 and Fig. 4, taking a binary convolutional neural network as an example, the PE numbered PE1 performs the convolution operation of the binary convolutional neural network on row 1 of the input image data and row 1 of the configured weight data. The PE numbered PE2 performs the convolution operation on row 2 of the input image data and row 2 of the configured weight data. The PE numbered PE3 performs the convolution operation on row 3 of the input image data and row 3 of the configured weight data. The operation result of the PE numbered PE1 is passed to the PE numbered PE2 and accumulated, then passed on to the PE numbered PE3, finally yielding row 1 of the binary neural network convolution output. Here, a specific PE may complete its convolution operation on the input image data and the configured weight data before, after, or simultaneously with receiving the accumulated operation result forwarded by another PE. The specific PE then adds its completed convolution result to the operation result forwarded by the other PE and passes the sum on to a third PE.

Referring to Fig. 3 and Fig. 5, again taking a binary convolutional neural network as an example, the PE numbered PE4 performs the convolution operation on row 2 of the input image data and row 1 of the configured weight data. The PE numbered PE5 performs the convolution operation on row 3 of the input image data and row 2 of the configured weight data. The PE numbered PE6 performs the convolution operation on row 4 of the input image data and row 3 of the configured weight data. The operation result of the PE numbered PE4 is passed to the PE numbered PE5 and accumulated, then passed on to the PE numbered PE6, finally yielding row 2 of the binary neural network convolution output. As above, each PE may complete its own convolution before, after, or simultaneously with receiving the accumulated result forwarded by another PE, adds the two together, and passes the sum on to the next PE.

Referring to Fig. 3 and Fig. 6, again taking a binary convolutional neural network as an example, the PE numbered PE7 performs the convolution operation on row 3 of the input image data and row 1 of the configured weight data. The PE numbered PE8 performs the convolution operation on row 4 of the input image data and row 2 of the configured weight data. The PE numbered PE9 performs the convolution operation on row 5 of the input image data and row 3 of the configured weight data. The operation result of the PE numbered PE7 is passed to the PE numbered PE8 and accumulated, then passed on to the PE numbered PE9, finally yielding row 3 of the binary neural network convolution output. As above, each PE may complete its own convolution before, after, or simultaneously with receiving the accumulated result forwarded by another PE, adds the two together, and passes the sum on to the next PE.
Referring to Fig. 3 to Fig. 6, the input image data matrix has 5 rows and 5 columns and the weight data matrix has 3 rows and 3 columns; the tensor processor accordingly configures a PE array of 3 rows and 3 columns composed of 9 PEs in total, and configures the connection relationships and data flows among the 9 PEs. Further, the tensor processor inputs a particular row of the image data to a specific PE and also configures a row of the weight data to that specific PE. The specific PE performs a convolution operation on the input image data and the configured weight data and outputs the operation result. The operation results of multiple PEs are accumulated and forwarded in a particular pattern to obtain a particular row of the neural network convolution output. In other embodiments, the PE array may have different dimensions or sizes; for example, the PE array may be 12 x 14. In other embodiments, the tensor processor adjusts the size of the PE array according to information about the input image data matrix and weight data matrix (such as the matrix dimensions). The embodiments shown in Fig. 3 to Fig. 6 are only intended to illustrate one architecture of the PE array and one way of configuring it, and should not be construed as limiting the present disclosure to the exclusion of other possible architectures and configurations of the PE array. In other embodiments, the convolution kernel (or weight data matrix) used for the convolution operation may be of size 3 x 3, or may be 1 x 1, 5 x 5, or 7 x 7.

According to other embodiments of the present disclosure, by configuring the size and architecture of the PE array and by configuring the connection relationships and data flows among the PEs, the tensor processor can synchronously deliver the image data used for a convolution operation to multiple PEs, and can likewise synchronously configure the weight data used for the convolution operation to multiple PEs, thereby optimizing data transfer. According to some embodiments of the present disclosure, and with reference to the PE array architecture shown in Fig. 3, the PEs numbered PE3, PE5, and PE7 can synchronously receive row 3 of the image data from the ping-pong controller or from a buffer outside the PE array, and the PEs numbered PE1, PE4, and PE7 can synchronously receive row 1 of the weight data from the ping-pong controller or from a buffer outside the PE array. Figs. 3 to 6 merely illustrate one architecture of the PE array and one way of configuring it, and should not be construed as limiting the present disclosure to the exclusion of other possible architectures and configurations.

In the embodiments shown in Fig. 3 to Fig. 6, the operation result of a first PE is passed to a second PE and accumulated, then passed on to a third PE. In other embodiments, after the operation result of the first PE is passed to the second PE and accumulated, it is passed back to the first PE, rather than on to a third PE, once the second PE finishes its convolution operation. The first PE then receives new input image data, may if needed be configured with new weight data or keep the configured weight data unchanged, performs a convolution operation on the new image data, and outputs the result.

In the embodiments shown in Fig. 3 to Fig. 6, taking a binary convolutional neural network as an example, the PEs perform the convolution operation of a binary convolutional neural network on the input image data and the configured weight data. In other embodiments, the PEs may perform a fully-connected operation on the input image data and the configured weight data. In other embodiments, the tensor processor may be used for non-binary convolutional neural networks, for example inference of neural networks whose data type is INT4, INT8, INT16, or INT32, with the PEs performing, on the input image data and the configured weight data, convolution operations corresponding to the data type of the neural network.
Referring to Fig. 7, in one embodiment the PE array of the tensor processor is a matrix array of 12 rows and 14 columns, and the input data matrix has 3 rows and 13 columns. The tensor processor adjusts the PE array so that some of the PEs are placed in an inactive state, thereby reducing power consumption.

Referring to Fig. 8, in one embodiment the PE array of the tensor processor is a matrix array of 12 rows and 14 columns, and the input data matrix has 5 rows and 27 columns. The tensor processor splits the input data matrix into two input data matrices of 5 rows by 14 columns and 5 rows by 13 columns, so as to fit the dimensions of the PE array.
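A minimal sketch of the column-wise splitting of Fig. 8, assuming the simple non-overlapping cut described above (5 x 27 into 5 x 14 plus 5 x 13); whether tiles would need to overlap at the seam depends on how the kernel is applied across tiles, which the figure does not specify.

```python
import numpy as np

def split_columns(matrix: np.ndarray, max_cols: int):
    """Split an input matrix into column tiles of at most max_cols columns,
    mirroring Fig. 8's cut of a 5x27 input for a 14-column PE array."""
    tiles = []
    start = 0
    while start < matrix.shape[1]:
        tiles.append(matrix[:, start:start + max_cols])
        start += max_cols
    return tiles

tiles = split_columns(np.zeros((5, 27)), 14)
assert [t.shape for t in tiles] == [(5, 14), (5, 13)]
```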
With continued reference to Fig. 7 and Fig. 8, according to some embodiments of the present disclosure, the tensor processor can determine the dimensions (or size) of the PE array, as well as the connection relationships and data flows among the PEs, according to information about the input image data matrix and the weight data matrix (or convolution kernel), such as the matrix dimensions. The tensor processor can also split the input data matrix according to the dimensions of the PE array it has determined. If desired, the tensor processor can again adjust the dimensions of a PE array it previously determined. The tensor processor of the present disclosure can therefore, while keeping the dimensions of the current PE array unchanged, gain the flexibility to handle input data matrices of different dimensions by splitting the input data matrix. Since the tensor data to be processed by a neural network unfold into data matrices of different dimensions, this flexibility in handling input data matrices of different dimensions is conducive to high-speed neural network inference. On the other hand, when the dimensions of the input data matrices of the neural network remain fairly consistent, or according to other actual needs, the tensor processor can readjust the dimensions of the previously determined PE array according to information such as the dimensions of the input data matrix, thereby selecting dimensions of the PE array, as well as connection relationships and data flows among the PEs, that are better suited to the current input data matrix. For example, referring to the embodiments shown in Fig. 3 to Fig. 6, when the input image data matrix has 5 rows and 5 columns and the weight data matrix has 3 rows and 3 columns, the tensor processor configures a PE array of 3 rows and 3 columns composed of 9 PEs in total, thereby achieving high-speed inference on the input image data matrix. According to some embodiments of the present disclosure, the tensor processor both splits the input image data matrix and adjusts the dimensions and other configurations of the current PE array, which is conducive to high-speed processing of complex and varied input tensor data.
Referring to Fig. 9, a tensor processor of one embodiment adopts, after splitting the input data matrix, a first way of matching convolution kernels to image data inputs. The first way means that the same convolution kernel is used for different image data inputs. As shown in Fig. 9, the image data of the first row differs from the image data of the second row, while the convolution kernel or weight data of the first row and that of the second row are the same filter, that is, the same convolution kernel. The output results of the first row and the second row both enter channel 1.

Referring to Fig. 10, a tensor processor of one embodiment adopts, after splitting the input data matrix, a second way of matching convolution kernels to image data inputs. The second way means that the same image data input corresponds to different convolution kernels. As shown in Fig. 10, the convolution kernel or weight data of the first row differs from that of the second row, while the image data of the first row and the image data of the second row are the same. The output results of the first row and the second row both enter channel 1.

Referring to Fig. 11, a tensor processor of one embodiment adopts, after splitting the input data matrix, a third way of matching convolution kernels to image data inputs. The third way means that the two different image data inputs obtained after splitting are fed to two different convolution kernels, respectively. As shown in Fig. 11, the convolution kernel or weight data of the first row differs from that of the second row, and the image data of the first row differs from the image data of the second row. The output result of the first row enters channel 1, and the output result of the second row enters channel 2.
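The three matching ways of Figs. 9 to 11 amount to a simple pairing rule between image tiles, kernels, and output channels. The sketch below is illustrative only; the enum and function names are hypothetical, and the channel numbering follows the figures.

```python
from enum import Enum

class MatchMode(Enum):
    SHARED_KERNEL = 1   # Fig. 9: different image tiles, one shared kernel
    SHARED_IMAGE = 2    # Fig. 10: one image tile, different kernels
    INDEPENDENT = 3     # Fig. 11: each image tile gets its own kernel

def pair_tiles(mode, image_tiles, kernels):
    """Return (image_tile, kernel, output_channel) triples per PE row."""
    if mode is MatchMode.SHARED_KERNEL:
        return [(img, kernels[0], 1) for img in image_tiles]
    if mode is MatchMode.SHARED_IMAGE:
        return [(image_tiles[0], k, 1) for k in kernels]
    return [(img, k, ch + 1)
            for ch, (img, k) in enumerate(zip(image_tiles, kernels))]
```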
Referring to Fig. 12, a tensor processor of one embodiment configures data in a multicast delivery mode (Multicast) to optimize data transfer. Multicast means that a single read operation can read data from the ping-pong controller, or from a buffer outside the PE array, and send it concurrently to multiple PEs. In other words, a variable number of PEs can receive a new data configuration by multicast within one instruction cycle, so that the tensor processor can configure the same data into multiple PEs within one instruction cycle. The multiple PEs configured with the same data by Multicast may be located in the same row or the same column of the PE array, or may be any combination of arbitrary positions in the array. For example, referring to Fig. 3 and Fig. 12 together, the tensor processor uses Multicast to configure row 1 of the weight data simultaneously to the PEs numbered PE1, PE4, and PE7, row 2 of the weight data simultaneously to the PEs numbered PE2, PE5, and PE8, and row 3 of the weight data simultaneously to the PEs numbered PE3, PE6, and PE9. The data configured by Multicast may include convolution kernels for convolution operations, and may also include weight data for neural network inference. According to some embodiments of the present disclosure, the data configured by Multicast may also include the thresholds required by the thresholding operation of binary neural network convolution. The threshold data used for configuration may be trained thresholds. The tensor processor transfers input data to the PE array for convolution only after the convolution kernels (or weight data) and the thresholds have been configured by Multicast. That is, input data such as image features can be fed into the PE array for computation immediately during actual operation. The tensor processor can pre-configure the trained thresholds into the PE array, and then configure the input data matrices and weight data matrices to the corresponding PEs by Multicast without interruption for convolution, achieving a faster inference speed.

Referring to Fig. 13, a tensor processor of one embodiment has an architecture supporting fully-connected operations. The tensor processor adjusts the connection relationships and data flows of the PE array to implement the fully-connected data flow that supports the fully-connected layers of a neural network.
Fig. 14 shows the parameter table that a tensor processor of one embodiment allocates to each PE. Each row of the input image feature data is assigned a feature_row_id, and each row of the weight data is assigned a weight_row_id. Each PE is assigned a weight_row_id_local and a feature_row_id_local associated with that PE. When weight data is configured, the weight_row_id carried by the weight data being configured is compared with the PE's weight_row_id_local; the PE accepts the configured weight values if they match and rejects them if they do not. When input image feature data is configured, the feature_row_id carried by the input image feature data being configured is compared with the PE's feature_row_id_local; the PE accepts the configured input image feature data if they match and rejects it if they do not. The tensor processor computes and assigns feature_row_id, weight_row_id, feature_row_id_local, and weight_row_id_local according to information about the input image feature data and the weight data (or convolution kernel). For example, 3-dimensional image feature data has three dimensions in total, namely height, width, and depth, and 4-dimensional convolution kernel data has four dimensions in total, namely height, width, depth, and the number of kernels. From the information of the three dimensions of the image feature data and the four dimensions of the convolution kernel data, the tensor processor can compute the number of PEs to be called and the dimensions of the two-dimensional array formed by the called PEs, and can then compute the feature_row_id assigned to each row of the image feature data, the weight_row_id assigned to each row of the convolution kernel data, and the feature_row_id_local and weight_row_id_local assigned to the PEs.

Referring to Fig. 3, Fig. 12, and Fig. 14, the tensor processor configures row 1 of the weight data by Multicast to the PEs numbered PE1, PE4, and PE7. The tensor processor compares the weight_row_id of row 1 of the weight data with the weight_row_id_local of each PE. Only the weight_row_id_local of the PEs numbered PE1, PE4, and PE7 matches the weight_row_id of row 1 of the weight data, so only those PEs accept row 1 of the weight data.
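A minimal sketch of this multicast-with-matching rule, under the assumption (for illustration) that a PE simply compares the broadcast row id against its local id; the class and field names follow the parameter names of Fig. 14, but the id assignment formula below is only a compact encoding of the Fig. 3 mapping, not the processor's actual assignment logic.

```python
class PE:
    def __init__(self, name, weight_row_id_local, feature_row_id_local):
        self.name = name
        self.weight_row_id_local = weight_row_id_local
        self.feature_row_id_local = feature_row_id_local
        self.weights = None

    def offer_weights(self, weight_row_id, weights):
        # Accept the multicast weight row only if the row ids match
        if weight_row_id == self.weight_row_id_local:
            self.weights = weights
            return True
        return False

# Fig. 3 / Fig. 12 assignment: PE1, PE4 and PE7 hold weight row 1, and
# PE1..PE9 hold image feature rows 1,2,3 / 2,3,4 / 3,4,5 respectively
pes = [PE("PE%d" % (i + 1),
          weight_row_id_local=(i % 3) + 1,
          feature_row_id_local=(i // 3) + (i % 3) + 1) for i in range(9)]

def multicast_weights(pes, weight_row_id, weights):
    """One 'read' broadcast to every PE; each PE filters by its local id."""
    return [pe.name for pe in pes if pe.offer_weights(weight_row_id, weights)]

assert multicast_weights(pes, 1, "weight row 1") == ["PE1", "PE4", "PE7"]
```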
With continued reference to Fig. 14, among the configuration parameters of a specific PE, the parameter model_set can be set to select whether the PE's operating mode is RGB operation or fully binary operation; the parameter Psum_set can be set to select whether the PE receives the operation result of another PE and adds it to its own operation result; the parameter Pool_en can be set to select whether the PE carries out a pooling-layer operation after its convolution operation; the parameter Dout_on can be set to select whether the PE outputs its operation result; and the parameter Row_on can be set to select whether the PE participates in computation at all. The tensor processor can set the parameter K_num to specify the dimension of the convolution kernel participating in the convolution operation, and the parameter Psum_num to specify the number of intermediate results Psum to accumulate. By pre-configuring the configuration parameters of the PEs, the tensor processor can control the operating mode and working state of each PE and control the connection relationships and data flows among the PEs. According to other embodiments of the present disclosure, the configuration parameters in the PEs can be readjusted according to actual needs, thereby readjusting the connection relationships and data flows among the PEs. Further, because a PE decides whether to accept input data or weight data by matching its pre-assigned parameters against the parameters carried by the input data or weight data, this design improves the efficiency of configuring data by Multicast.
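The per-PE configuration of Fig. 14 can be pictured as a small register record. The sketch below reuses the parameter names from the figure; the field types and default values are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class PEConfig:
    model_set: str = "binary"   # operating mode: "RGB" or fully binary
    Psum_set: bool = True       # add the Psum forwarded by the upstream PE
    Pool_en: bool = False       # run a pooling step after the convolution
    Dout_on: bool = False       # drive the operation result onto the output
    Row_on: bool = True         # participate in computation at all
    K_num: int = 3              # dimension of the convolution kernel
    Psum_num: int = 3           # number of intermediate Psums to accumulate

# e.g. the last PE of an accumulation chain outputs the finished row,
# while an unused PE is parked in the inactive state of Fig. 7
tail = PEConfig(Dout_on=True)
idle = PEConfig(Row_on=False)
```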
In summary, the tensor processor accelerates neural network tensor computation by arranging the PEs as a two-dimensional matrix, optimizes data transfer and controls computation by configuring the parameters of the PEs, and adjusts the dimensions of the tensors it can handle by splitting the input matrix and adjusting the PE array, thereby realizing a dynamically adaptable high-speed neural network computing unit.
The embodiment shown in Fig. 14 merely illustrates one possible combination of PE configuration parameters and should not be construed as limiting the present disclosure to the exclusion of other possible PE configurations. According to some embodiments of the present disclosure, from the information of the input image feature data and the convolution kernel, the tensor processor can compute, for allocation to each PE, a feature_row_id_local and a feature_column_id_local for matching image feature data, as well as a weight_row_id_local and a weight_column_id_local for matching convolution kernel or weight data. Each piece of image feature data to be input to a PE is assigned a paired feature_row_id and feature_column_id, and each convolution kernel or piece of weight data is assigned a paired weight_row_id and weight_column_id. When image feature data is configured, a PE separately compares its feature_row_id_local with the feature_row_id of the image feature data, and its feature_column_id_local with the feature_column_id of the image feature data. Only when both comparisons match does the PE accept the image feature data; if either comparison fails, the PE refuses it. Similarly, when a convolution kernel or weight data is configured, a PE separately compares its weight_row_id_local with the weight_row_id, and its weight_column_id_local with the weight_column_id. Only when both comparisons match does the PE accept the convolution kernel or weight data; if either comparison fails, the PE refuses it.
Referring to Fig. 15, a tensor processor of one embodiment performs binary neural network convolution and thresholding operations. Specifically, the tensor processor binarizes both the feature image data and the weight data, which are further represented as 0 and 1. The binarized feature image data and weight data therefore each need only one storage bit (if represented as 1 and -1 instead, two bits would be needed: one bit to store the sign and another to store the value), which saves a large amount of storage space. Further, because the binarized feature image data and weight data are represented by a single bit of 0 or 1, the multiplications of the neural network convolution operation can be replaced with XNOR (exclusive-NOR) logic gates, and the summation of the convolution operation can be replaced with a popcount operation. The popcount operation counts the number of bit positions in the result whose value is 1. For example, if the number of bit positions with value 1 in a 32-bit operand is a, then the number of 0 bits (each representing -1 in the 1 and -1 encoding) is 32-a, and the final result is a-(32-a) = 2a-32.
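A minimal sketch of this XNOR-plus-popcount substitution on 32-bit words, assuming the encoding described above in which a 1 bit stands for +1 and a 0 bit for -1; it checks the 2a-32 identity against an explicit plus-or-minus-one dot product.

```python
import random

def binary_dot32(x: int, w: int) -> int:
    """Dot product of two 32-element +/-1 vectors packed as 32-bit words:
    XNOR the words, popcount the agreements a, then return 2a - 32."""
    xnor = ~(x ^ w) & 0xFFFFFFFF      # 1 wherever the two operands agree
    a = bin(xnor).count("1")          # popcount
    return 2 * a - 32

def reference_dot(x: int, w: int) -> int:
    bit = lambda v, i: 1 if (v >> i) & 1 else -1
    return sum(bit(x, i) * bit(w, i) for i in range(32))

for _ in range(1000):
    x, w = random.getrandbits(32), random.getrandbits(32)
    assert binary_dot32(x, w) == reference_dot(x, w)
```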
With continued reference to Fig. 15, take the convolution of 32-bit feature image data with 32-bit weight data as an example: the multiply-accumulate of two 32-bit operands is replaced by XNORing the two 32-bit operands and performing a popcount on the result. Specifically, as shown in Fig. 15, each image data bit and its corresponding weight data bit pass through an XNOR (exclusive-NOR) logic gate; the outputs of the multiple XNOR gates pass through a 32-bit popcount; and the outputs of multiple 32-bit popcounts are then summed to obtain the 16-bit intermediate result Psum of the binary neural network convolution. Because the multiply-accumulate of the convolution is replaced by gate operations and popcount operations, a large number of floating-point operations are saved, so that the tensor processor achieves accelerated convolution for binary neural networks. The accelerated binary convolution can be carried out by any PE of the tensor processor's PE array, or by designated PEs. According to some embodiments of the present disclosure, the tensor processor may also forgo the gate and popcount operations and instead use the floating-point-based convolution of a general neural network.

The tensor processor shown in Fig. 15 performs a thresholding operation on the intermediate result Psum to improve operational precision. Specifically, the tensor processor compares the intermediate result Psum of the convolution with a trained threshold, and the output result is either the 16-bit intermediate result Psum or the 1-bit 0 or 1 obtained after regularization.
Suppose the result after convolution is a, and the batch normalization function is BatchNorm(a) = γ(a-μ)/σ + B, where μ is the mean of the vector, σ is the variance, γ is the scale coefficient, and B is the bias. A binarization operation, namely the sign-function operation, is then performed:

Binarization: y = 1 if BatchNorm(a) >= 0, and y = 0 otherwise.

To simplify computation, the batch normalization operation and the binarization operation are merged into a single thresholding operation. Observe that BatchNorm(a) = 0 is the dividing point: when BatchNorm(a) is greater than or equal to 0 the result is 1, and in all other cases the result is 0. Setting BatchNorm(a) = 0 therefore gives a = μ - Bσ/γ. Denote this a by Tk, meaning BatchNorm(Tk) = γ(Tk-μ)/σ + B = 0. Then, when the result of the convolution operation is greater than or equal to Tk, the output is 1, and in all other cases the output is 0.

Suppose the training result of batch normalization is γ = 4, μ = 5, σ = 8, B = 2. Then Tk = μ - Bσ/γ = 5 - (2*8)/4 = 1.

Calculation before merging: when the convolution result is 0, substituting into the batch normalization formula gives 4*(0-5)/8 + 2 = -0.5, which is less than 0, so the result is 0. When the convolution result is 2, substituting gives 4*(2-5)/8 + 2 = 0.5, which is greater than 0, so the result is 1.

Calculation after merging: the threshold Tk = 1; when the convolution result is 0 it is less than Tk, so the result is 0; when the convolution result is 2 it is greater than Tk, so the result is 1. The output results before and after the simplification are therefore consistent. Merging the batch normalization operation and the binarization operation into a thresholding operation thus yields results consistent with those before the simplification, while the thresholding operation saves a large amount of floating-point computing resources.
After training of the binary convolutional neural network is complete, Tk is derived from the formula Tk = μ - Bσ/γ, and thereafter the convolution result only needs to be compared with the value of Tk. Because the tensor processor compares the convolution result with the trained threshold Tk instead of performing a large number of floating-point operations, it both improves operational precision and shortens the inference time of the neural network. The thresholding operation can be carried out by any PE of the tensor processor's PE array, or by designated PEs.
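The merged thresholding rule can be checked mechanically against the unmerged batch-normalize-then-sign pipeline. The sketch below uses the worked numbers from above (γ = 4, μ = 5, σ = 8, B = 2, hence Tk = 1) and sweeps a range of hypothetical convolution results.

```python
gamma, mu, sigma, B = 4.0, 5.0, 8.0, 2.0

def batchnorm_then_sign(a: float) -> int:
    """Unmerged pipeline: batch normalization followed by binarization."""
    return 1 if gamma * (a - mu) / sigma + B >= 0 else 0

Tk = mu - B * sigma / gamma        # merged threshold; here 5 - 16/4 = 1.0

def thresholded(a: float) -> int:
    """Merged pipeline: one comparison against the trained threshold."""
    return 1 if a >= Tk else 0

for a in [-4, 0, 0.5, 1, 2, 7, 100]:
    assert batchnorm_then_sign(a) == thresholded(a)
```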
According to some embodiments of the present disclosure, the tensor processor replaces the multiply-accumulate of binary neural network convolution with gate operations and popcount operations, and then compares the convolution result with a trained threshold through the thresholding operation to improve operational precision, realizing a binary neural network tensor computing device that is both fast and accurate. In other embodiments, the tensor processor further performs computation while the ping-pong controller configures the PE array. In other embodiments, the tensor processor further pre-configures the weight values and thresholds to the PEs before beginning to feed in feature image data, thereby processing input data rapidly from a pre-configured state. In other embodiments, the tensor processor further configures the PEs to carry out the pooling layer (pooling) themselves. In other embodiments, the tensor processor further uses full connection so that intermediate results Psum are not read and written through the ping-pong controller but are accumulated and forwarded directly between the PEs of the array. In other embodiments, the tensor processor further handles input tensor data of different dimensions by splitting the input data matrix and adjusting the dimensions of the PE array. In other embodiments, the tensor processor further optimizes data transfer by configuring data through Multicast.

According to some embodiments of the present disclosure, the trained thresholds may be obtained by a predetermined number of iterations, by a convergent regression algorithm, or by methods such as comparison against labeled test images. According to other embodiments of the present disclosure, the trained thresholds may be obtained by ordinary neural network training and machine learning methods.
Referring to Fig. 16, a tensor processor of one embodiment includes a PE array matrix 500, a control module 502, a weight data buffer 504, a threshold data buffer 506, an image data buffer 508, and an input/output bus 510. According to some embodiments of the present disclosure, a ping-pong controller 512 includes, but is not limited to, the control module 502, the weight data buffer 504, the threshold data buffer 506, and the image data buffer 508. According to other embodiments of the present disclosure, the ping-pong controller 512 includes the control module 502, the weight data buffer 504, and the image data buffer 508. The input/output bus 510 receives data from the outside, such as tensor matrix data of order three or higher. According to the type and purpose of the received data (such as weight data, threshold data, or image data), the input/output bus 510 writes the received data into the weight data buffer 504, the threshold data buffer 506, or the image data buffer 508, respectively. The weight data buffer 504, the threshold data buffer 506, and the image data buffer 508 transfer the data they hold to the control module 502, from which the control module 502 reads the weight data, threshold data, and image data respectively. The input data of this embodiment takes image data as an example, but the input data is not limited to image data; it may also be voice data, a data type suitable for target recognition, or other data types. In other embodiments, the image data buffer 508 may be replaced by an input data buffer 514 (not shown). The input data buffer 514 receives various types of input data from the input/output bus, including images, sound, or other data types. The input data buffer 514 likewise transfers the data it holds to the control module 502, and the control module 502 reads the corresponding data from it.
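As an illustration of the bus-side dispatch in Fig. 16, the sketch below routes incoming payloads to the weight, threshold, and image buffers by a type tag; the tag names and the queue-based buffer stand-ins are assumptions for illustration only.

```python
from collections import deque

# Hypothetical stand-ins for buffers 504, 506 and 508 of Fig. 16
buffers = {
    "weight": deque(),     # weight data buffer 504
    "threshold": deque(),  # threshold data buffer 506
    "image": deque(),      # image data buffer 508
}

def io_bus_write(data_type: str, payload) -> None:
    """Write data received from outside into the buffer matching its type."""
    buffers[data_type].append(payload)

io_bus_write("weight", [[1, 0, 1]])
io_bus_write("image", [[0, 1, 1, 0, 1]])
assert len(buffers["weight"]) == 1 and len(buffers["threshold"]) == 0
```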
With continued reference to Fig. 16, the control module 502 determines the number of PEs to be called according to information about the weight data and the image data, then determines the connection relationships and data flows among the multiple called PEs, and thereby constructs the PE array matrix 500. The control module 502 may also determine the number of PEs to be called and the connection relationships and data flows among them based only on the information of the weight data, or only on the information of the image data, or solely on a program pre-configured in the control module 502. The control module 502 transfers the weight data, the threshold data, and the image data to the constructed PE array matrix 500. According to the dimensions of the image data matrix and the dimensions of the PE array matrix, the control module 502 can adjust the PEs in the PE array matrix 500 or split the image data matrix. After splitting the image data matrix, the control module 502 can adopt different ways of matching the weight data or convolution kernels to the image data, including using the same convolution kernel for different image data inputs, using different convolution kernels for the same image data input, and feeding the two different image data inputs obtained after splitting to two different convolution kernels, respectively.

With continued reference to Fig. 16, the control module 502 can also, according to actual needs, adjust the connection relationships and data flows among the PEs of the PE array matrix 500 again, or repeatedly. The control module 502 can configure the same data (weight data, threshold data, or image data) into multiple PEs within one instruction cycle by Multicast, and the multiple PEs configured with the same data may be located in the same row or the same column of the PE array matrix 500, or may be any combination of arbitrary positions in the matrix 500. The control module 502 may configure convolution kernels or weight data for convolution operations, and may also configure threshold data. The threshold data used for configuration may be trained thresholds. The tensor processor can pre-configure the trained thresholds into the PE array matrix 500 and then perform convolution operations on the image data matrices and weight data matrices, achieving a faster computation speed.
Referring to Fig. 17, the ping-pong controller of a tensor processor according to one embodiment includes a PE configuration parameter register 600. The PE configuration parameter register 600 stores configuration parameters such as weight data or thresholds. The ping-pong controller needs to read the configuration parameters and configure the PEs before performing a convolution operation of any dimension. The ping-pong controller can compute while configuring; to this end, a CONSUMER pointer 602 and a PRODUCER pointer 604 are maintained. The CONSUMER pointer 602 is a read-only register field that can be checked to determine which ping-pong bank the data path has selected, while the PRODUCER pointer 604 is controlled entirely by the tensor processor. In other embodiments, the PE configuration parameter register 600 also stores the various parameters for configuring the PEs, such as the PE configuration parameters shown in Fig. 14.
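A minimal sketch of this configure-while-computing arrangement, assuming a plain two-bank double buffer in which CONSUMER names the bank the datapath reads and PRODUCER names the bank being rewritten; the swap discipline shown is an assumption, since Fig. 17 does not spell it out.

```python
class PingPongConfigRegister:
    """Two banks of PE configuration parameters (cf. register 600, Fig. 17)."""

    def __init__(self):
        self.banks = [{}, {}]
        self._consumer = 0   # read-only to software: bank the datapath uses
        self.producer = 1    # bank the tensor processor is free to rewrite

    @property
    def consumer(self) -> int:
        return self._consumer

    def write(self, key: str, value) -> None:
        # New configuration lands in the PRODUCER bank while the
        # CONSUMER bank keeps feeding the ongoing computation.
        self.banks[self.producer][key] = value

    def swap(self) -> None:
        # Flip banks once the new configuration is complete.
        self._consumer, self.producer = self.producer, self._consumer

reg = PingPongConfigRegister()
reg.write("weights", "new kernel")   # computation continues on bank 0
reg.swap()                           # the datapath now reads the new kernel
assert reg.consumer == 1
```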
The input data in the various embodiments of the present disclosure takes image data as an example, but the input data is not limited to image data; it may also be voice data, a data type suitable for target recognition, or other data types. The input data of the various embodiments of the present disclosure likewise takes tensor data of order three or higher as an example, but the input data is not limited to tensor data of order three or higher; it may also be tensor data of order two, one, or zero.

According to some embodiments of the present disclosure, the ping-pong controller can compute while configuring; that is, it can configure new convolution kernels (or weight data) and/or trained thresholds while performing convolution or fully-connected operations on the input data matrices, thereby accelerating the processing speed of the tensor processor.

According to some embodiments of the present disclosure, the configuration of input data, weight data, and thresholds are mutually independent and can proceed concurrently.

According to some embodiments of the present disclosure, through adjustments in configuration, the PE array can perform fully-connected operations or convolution operations without reading and writing the intermediate results Psum externally, completing them directly within the PE array; switching between fully-connected operations and convolution operations can likewise be achieved by adjusting the configuration.

According to some embodiments of the present disclosure, by configuring the PEs, the pooling layer (pooling) operation of a neural network model can be fused with the convolution operation; that is, a configured PE carries out the pooling operation itself.

According to some embodiments of the present disclosure, the PEs to be called can be selected flexibly, and the connection relationships and data flows among the PEs can be set by adjusting the configuration so as to obtain a specifically configured PE array according to actual needs (including controlling the data flows among the PEs); further, the input data matrix can be split according to the configured PE array, or a part of the PEs that are not needed can be placed in a disabled state.

According to some embodiments of the present disclosure, the tensor processor can pre-configure the trained thresholds into the PE array and then perform convolution operations on the input data matrices and weight data matrices, achieving a faster computation speed.

According to some embodiments of the present disclosure, the PEs used in the tensor processor may be common neural network processors such as FPGAs or GPUs, or may be specially designed processors, as long as they meet the minimum functional requirements for realizing the various embodiments of the present disclosure.

According to some embodiments of the present disclosure, the tensor processor is used for binary convolutional neural networks, and the PEs of its PE array perform the convolution operation of a binary convolutional neural network on the input image data and the configured weight data. In other embodiments, the PEs may perform fully-connected operations on the input image data and the configured weight data. In other embodiments, the tensor processor may be used for non-binary convolutional neural networks, for example inference of neural networks whose data type is INT4, INT8, INT16, or INT32, with the PEs performing, on the input image data and the configured weight data, convolution or fully-connected operations corresponding to the data type of the neural network.
The technical features of the embodiments described above can be combined in any manner. For brevity, not all possible combinations of the technical features of the above embodiments have been described; nevertheless, as long as a combination of these technical features involves no contradiction, it should be regarded as within the scope of this specification.

The embodiments described above should not be construed as limiting the scope of this patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and such modifications and improvements all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.