
WO2019136764A1 - Convolutor and artificial intelligent processing device applied thereto - Google Patents

Convolutor and artificial intelligent processing device applied thereto

Info

Publication number
WO2019136764A1
Authority
WO
WIPO (PCT)
Prior art keywords
buffer
data
pixel data
convolution operation
output
Prior art date
Application number
PCT/CN2018/072678
Other languages
French (fr)
Chinese (zh)
Inventor
肖梦秋
Original Assignee
深圳鲲云信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司 filed Critical 深圳鲲云信息科技有限公司
Priority to CN201880002156.XA priority Critical patent/CN109416756A/en
Priority to PCT/CN2018/072678 priority patent/WO2019136764A1/en
Publication of WO2019136764A1 publication Critical patent/WO2019136764A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/06Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
    • G06F5/10Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor having a sequence of storage locations each being individually accessible for both enqueue and dequeue operations, e.g. using random access memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations

Definitions

  • the present invention relates to the field of processor technologies, and in particular, to the field of artificial intelligence processor technologies, and specifically to a convolver and an artificial intelligence processing device to which the same is applied.
  • a convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within a limited receptive field; it performs well on large-scale image processing. The convolutional neural network includes convolutional layers and pooling layers.
  • the convolutional neural network includes a convolutional layer and a pooling layer.
  • CNN has become one of the research hotspots in many scientific fields, especially pattern classification. Because the network avoids complicated image pre-processing and can take the original image directly as input, it has been widely adopted.
  • the basic structure of a CNN includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, from which local features are extracted. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map being a plane on which all neurons share equal weights.
  • the feature mapping structure uses a sigmoid function with a small influence kernel as the activation function of the convolutional network, so that the feature maps have shift invariance. Because the neurons on one mapping plane share weights, the number of free parameters in the network is reduced.
  • each convolutional layer in the convolutional neural network is followed by a computational layer for local averaging and secondary extraction. This distinctive two-stage feature extraction structure reduces the feature resolution.
  • CNN is mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Since the feature detection layer of a CNN learns from the training data, explicit feature extraction is avoided when the CNN is used; features are learned implicitly from the training data. Furthermore, because the neurons on the same feature mapping plane share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over fully connected networks.
  • with its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing. Its layout is closer to that of a real biological neural network, and weight sharing reduces the complexity of the network; in particular, the fact that a multidimensional input image can be fed directly into the network avoids the complexity of data reconstruction during feature extraction and classification.
  • at present, convolutional neural networks are implemented in software running on one processor or on multiple distributed processors. As the complexity of the convolutional neural network increases, processing becomes relatively slow, and the performance demands on the processor grow ever higher.
  • the object of the present invention is to provide a convolver and an artificial intelligence processing device using the same, to solve the prior-art problems that software implementations of convolutional neural networks are slow and place high performance demands on the processor.
  • the present invention provides a convolver electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters. The convolver includes a parameter buffer, an input buffer, a convolution operation circuit, and an output buffer.
  • the parameter buffer is configured to receive and output the weight parameters.
  • the input buffer includes a plurality of connected line buffers for receiving and outputting the data to be processed; each time every line buffer outputs one element, the outputs are assembled into one column of output data.
  • the convolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, perform a convolution operation accordingly, and output the convolution result.
  • the output buffer is configured to receive the convolution result and output it to the external memory.
  • the input buffer includes a first line buffer that receives the pixel data of the feature map to be processed element by element, outputs a row of pixel data in parallel after filtering, and stores the input feature map of each convolution layer; the number of data elements per output row equals the number of parallel filters.
  • the first line buffer outputs the row pixel data of each convolution layer in turn and, while outputting the row pixel data of a given convolution layer, outputs the row pixel data of each channel in turn.
  • the input buffer further includes at least one second line buffer comprising a plurality of FIFO memories connected in series, each FIFO storing one row of pixel data of the feature map; each row of pixel data is stored into the FIFOs in turn along the path formed by the serial FIFOs. The second line buffer outputs pixel data in the form of a Pf×K matrix, where Pf is the number of parallel filters and K is the size of the convolution kernel.
  • the input buffer further includes at least one matrix buffer, each consisting of a plurality of registers arranged as a matrix for storing pixel data; the register array has size Pf×K×2. When the number of columns of input pixel data exceeds K, the matrix buffer outputs pixel data in the form of a Pf×K×K matrix.
  • the convolution operation circuit includes a plurality of convolution kernels running in parallel, each containing multipliers for performing the convolution, and an adder tree that accumulates the outputs of the multipliers. Each convolution kernel takes pixel data in the form of a K×K matrix and, by convolving the input pixel data with the weight parameters, outputs pixel data element by element.
  • the output buffer includes two FIFO memories in parallel; channel data passing through the same filter is accumulated and stored in the same FIFO. A data selector returns each partial accumulation to the adder tree until the adders output the final accumulated result.
  • the convolver further includes a pooling operation circuit, connected between the output buffer and the external memory, for pooling the convolution result before outputting it to the external memory.
  • the internal components of the convolver, as well as the convolver and the external memory, are connected through first-in first-out data interfaces.
  • the present invention also provides an artificial intelligence processing apparatus including the convolver as described above.
  • the convolver of the present invention and the artificial intelligence processing apparatus to which the same is applied have the following advantageous effects:
  • the convolver of the invention is composed of hardware: a parameter buffer, an input buffer, a convolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces. It can process highly complex convolutional neural network algorithms at high speed, effectively solving the prior-art problems that software implementations are slow and place high performance demands on the processor.
  • Figure 1 shows a schematic diagram of the overall principle of a convolver in the prior art.
  • FIG. 2 is a schematic diagram showing the input and output of a convolver of the present invention.
  • FIG. 3 is a schematic diagram showing a second line buffer in a convolver of the present invention.
  • FIG. 4 is a schematic diagram showing the input and output of a matrix buffer in a convolver of the present invention.
  • FIG. 5 is a schematic structural diagram of an output buffer in a convolver according to the present invention.
  • as shown in FIG. 1 to FIG. 5, the illustrations provided in the following embodiments describe the basic concept of the present invention only schematically. The drawings show only the components related to the present invention rather than the actual number, shapes, and sizes of components in a real implementation; in practice the type, number, and proportion of each component may vary arbitrarily, and the component layout may be more complicated.
  • the purpose of this embodiment is to provide a convolver and an artificial intelligence processing device using the same, to solve the prior-art problems that software implementations of convolutional neural networks are slow and place high performance demands on the processor.
  • the principles and implementation of the convolver of this embodiment and of the artificial intelligence processing device to which it is applied are described in detail below, so that those skilled in the art can understand them without creative effort.
  • the embodiment provides a convolver 100 electrically connected to an external memory, wherein the external memory stores the data to be processed and the weight parameters.
  • the convolver 100 includes a parameter buffer 110, an input buffer, a convolution operation circuit 150, and an output buffer 160.
  • the first data to be processed contains a plurality of channels; the first weight parameter contains multiple layers of sub-parameters, each layer corresponding one-to-one to a channel. There are a plurality of convolution operation circuits 150, which compute the convolution results of the respective channels in parallel, in one-to-one correspondence.
  • the parameter buffer 110 (Con_reg shown in FIG. 2) is used to receive and output the weight parameter (the weight shown in FIG. 2).
  • the parameter buffer 110 includes a FIFO memory, and the weight parameter is stored in the FIFO memory.
  • once configured, the parameters of the input buffer, the convolution operation circuit 150, and the output buffer 160 are also stored in the parameter buffer 110.
  • the input buffer includes a plurality of connected row buffers for receiving and outputting the data to be processed; each time every row buffer outputs one element, the outputs are assembled into one column of output data.
  • the input buffer includes a first line buffer 120 (Conv_in_cache shown in FIG. 2), a second line buffer 130 (Conv_in_buffer shown in FIG. 2), and a matrix buffer 140 (Con_in_matrix shown in FIG. 2). Together they convert a 1×1 pixel-data input stream into Pv×K² pixel data, where Pv is the row vector length and K is the size of the convolution kernel.
  • the first line buffer 120 receives the pixel data of the feature map to be processed element by element, outputs a row of pixel data in parallel after filtering, and stores the input feature map of each convolution layer; the number of data elements per output row equals the number of parallel filters.
  • the first line buffer 120 includes a BRAM; the input pixel data of the feature map of each convolution layer is cached in the BRAM to improve the locality of pixel-data storage.
  • the first line buffer 120 outputs the row pixel data of each convolution layer in turn and, within each convolution layer, outputs the row pixel data of each channel in turn. That is, the first line buffer 120 starts by outputting the pixel data of the first channel; after the pixel data of the first channel has been processed, it begins outputting the pixel data of the second channel; and once the pixel data of all channels of one convolution layer has been output, it outputs the pixel data of the channels of the next convolution layer. The first line buffer 120 iterates the computation and output from the first convolution layer to the last, using different filters.
  • the input buffer further includes at least one second row buffer 130, comprising a plurality of FIFO memories connected in series, each FIFO storing one row of pixel data of the feature map; each row of pixel data is stored into the FIFOs in turn along the path formed by the serial FIFOs. The second row buffer 130 receives Pf rows of pixel data and outputs pixel data in the form of a Pf×K matrix, where Pf is the number of parallel filters and K is the size of the convolution kernel.
  • the first row of pixel data is stored in the first FIFO; the first FIFO then begins receiving the second row of pixel data while outputting the first row into the second FIFO.
  • in this way, the two FIFOs store two consecutive rows of pixel data and output them together.
  • each matrix buffer 140 consists of a plurality of registers arranged as a matrix for storing pixel data; the register array has size Pf×K×2. The matrix buffer 140 takes pixel data in the form of a Pf×K matrix; when the number of columns of input pixel data exceeds K, it outputs pixel data in the form of a Pf×K×K matrix.
  • the convolution operation circuit 150 is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer 110, perform a convolution operation accordingly, and output the convolution result.
  • the convolution operation circuit 150 includes a plurality of convolution kernels running in parallel, each containing multipliers for performing the convolution, and an adder tree that accumulates the outputs of the multipliers. Each convolution kernel takes pixel data in the form of a K×K matrix and, by convolving the input pixel data with the weight parameters, outputs pixel data element by element.
  • for example, an image has three channels of data, R, G, and B, that is, three two-dimensional matrices. Suppose the first weight parameter, that is, the filter, has a depth of 3, that is, three layers of sub-weight parameters (three two-dimensional matrices), each of size K×K, with K an odd number such as 3; each layer is convolved with its corresponding channel. When a Pv×K×3 data cube (Pv>K) is taken from the first data to be processed, assuming Pv is 5, the filter must pass through the convolution operation circuit 150 three times to complete the computation. Preferably, a corresponding number of three convolution operation circuits 150 can be provided, so that the convolution of each circuit's assigned channel can proceed in parallel within one clock cycle.
  • the output buffer 160 is configured to receive the convolution operation result and output the convolution operation result to the external memory.
  • the output buffer 160 receives the convolution result of each channel and then accumulates the convolution results of all channel data; the result is temporarily stored in the output buffer 160.
  • the output buffer 160 includes two FIFO memories in parallel; channel data passing through the same filter is accumulated and stored in the same FIFO. A data selector (MUX) returns each partial accumulation to the adder tree until the adders output the final accumulated result.
  • the number of adders equals Pf×Pv. In addition, the data selector (MUX) also throttles the data stream down to 1×1, outputting pixels one at a time.
  • the convolver 100 further includes a pooling operation circuit 170, connected between the output buffer 160 and the external memory, for pooling the convolution result before outputting it to the external memory.
  • the pooling operation circuit 170 pools every two rows of pixel data; it also contains a FIFO memory for storing each row of pixel data.
  • the pooling mode may be max pooling or average pooling, and either may be implemented by a logic circuit.
  • the internal components of the convolver 100, as well as the convolver 100 and the external memory, are connected through first-in first-out data interfaces (the several SIFs shown in FIG. 2).
  • the first-in first-out data interface includes: a first-in first-out memory, a first logic unit, and a second logic unit.
  • the first-in first-out memory includes, on the upstream side, a write-enable pin, a data input pin, and a memory-full status pin; and, on the downstream side, a read-enable pin, a data output pin, and a memory-empty status pin.
  • the first logic unit is connected to the upstream object, the write-enable pin, and the memory-full status pin. On receiving a write request from the upstream object, it determines from the signal on the memory-full status pin whether the first-in first-out memory is full; if not, it sends an enable signal to the write-enable pin to make the memory writable; otherwise it makes the memory unwritable.
  • the first logic unit includes a first inverter, whose input is connected to the memory-full status pin and whose output provides a first status terminal for connecting the upstream object, and a first AND gate, whose first input is connected to the first data-valid status terminal, whose second input is connected to the upstream data-valid terminal of the upstream object, and whose output is connected to the write-enable pin.
  • the second logic unit is connected to the downstream object, the read-enable pin, and the memory-empty status pin. On receiving a read request from the downstream object, it determines from the signal on the memory-empty status pin whether the first-in first-out memory is empty; if not, it sends an enable signal to the read-enable pin to make the memory readable; otherwise it makes the memory unreadable.
  • the second logic unit includes a second inverter, whose input is connected to the memory-empty status pin and whose output provides the downstream data-valid terminal for connecting the downstream object, and a second AND gate, whose first input is connected to the downstream data-valid terminal and whose second input is connected to the downstream data-valid status terminal of the downstream object.
  • the running process of the convolver 100 is as follows:
  • the data to be processed is read from the external memory through the first-in first-out data interface (several SIFs shown in FIG. 2) and stored in the BRAM of the first line buffer 120 (Conv_in_cache shown in FIG. 2).
  • a K×K weight parameter (one channel) is read from the external memory through the first-in first-out data interfaces (the several SIFs shown in FIG. 2) and then stored in the parameter buffer 110.
  • once the parameter buffer 110 has been loaded with a weight parameter, the convolver begins receiving the pixel data of the feature map, and after processing by the input buffer stages, the convolution operation circuit 150 receives Pv×K² pixel data every clock cycle.
  • the convolution operation circuit 150 convolves and accumulates the input data of each channel (the feature map input to each channel having height H and width W), then outputs each channel's result to the output buffer 160.
  • the different input channels are visited cyclically, and the output buffer 160 accumulates the results of each channel until the (H-K+1)*(W-K+1) feature map corresponding to the filter is obtained.
  • the pooling operation circuit 170 can then receive the (H-K+1)*(W-K+1) pixel data for pooling before the feature map is output, or the feature map can be output directly from the output buffer 160.
  • after a feature map processed by one filter has been output, the parameter buffer 110 is reloaded with a new weight parameter, and the above pixel processing is iterated with different filters until the pixel processing of all convolution layers is completed.
  • the embodiment also provides an artificial intelligence processing apparatus including the convolver 100 as described above.
  • the convolver 100 has been described in detail above and will not be described again.
  • the artificial intelligence processor includes: a programmable logic circuit (PL) and a processing system circuit (PS).
  • the processing system circuit includes a central processing unit, which can be implemented by an MCU, SoC, FPGA, or DSP, for example an embedded processor chip of the ARM architecture. The central processing unit is communicatively coupled to an external memory 200, for example RAM or ROM such as DDR3 or DDR4 SDRAM; the central processing unit can read data from and write data to the external memory.
  • in summary, the convolver of the present invention is composed of hardware: a parameter buffer, an input buffer, a convolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces. It can process highly complex convolutional neural network algorithms at high speed, effectively solving the prior-art problems that software implementations are slow and place high performance demands on the processor. The present invention thus effectively overcomes various shortcomings of the prior art and has high industrial value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

A convolutor (100) and an artificial intelligent processing device applied thereto, the convolutor being electrically connected to an external memory, the external memory storing data to be processed and weighting parameters. The convolutor (100) comprises: a parameter buffer (110), an input buffer, a convolution operation circuit (150) and an output buffer (160). The parameter buffer (110) is used for receiving and outputting the weighting parameters. The input buffer comprises: a plurality of connected row buffers for receiving and outputting the data to be processed. Every time each row buffer outputs one bit, the bits are collected to form a column of data output. The convolution operation circuit (150) is used for receiving from the input buffer the data to be processed, receiving the weighting parameters from the parameter buffer (110), performing a convolution operation according to the data to be processed and the weighting parameters, and outputting a convolution operation result. The output buffer (160) is used for receiving the convolution operation result and outputting the convolution operation result to an external memory. The method can solve the problems of a low processing speed and high requirements for the processor brought about by software operations.

Description

Convolver and artificial intelligence processing device applied thereto

Technical Field
The present invention relates to the field of processor technologies, and in particular to the field of artificial intelligence processor technologies; specifically, it relates to a convolver and an artificial intelligence processing device to which the convolver is applied.
Background
A convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within a limited receptive field; it performs well on large-scale image processing. A convolutional neural network includes convolutional layers and pooling layers.
Nowadays, CNN has become one of the research hotspots in many scientific fields, especially pattern classification. Because the network avoids complicated image pre-processing and can take the original image directly as input, it has been widely adopted.
Generally, the basic structure of a CNN includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, from which local features are extracted; once a local feature is extracted, its positional relationship to other features is also determined. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map being a plane on which all neurons share equal weights. The feature mapping structure uses a sigmoid function with a small influence kernel as the activation function of the convolutional network, so that the feature maps have shift invariance. In addition, because the neurons on one mapping plane share weights, the number of free parameters in the network is reduced. Each convolutional layer in a convolutional neural network is followed by a computational layer for local averaging and secondary extraction; this distinctive two-stage feature extraction structure reduces the feature resolution.
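The convolution step described above can be made concrete with a short sketch. The following Python is purely illustrative and not part of the patent; all names are hypothetical. It computes a "valid" K×K convolution, producing the (H-K+1)×(W-K+1) output size that the embodiment refers to later, and it is reused by later sketches in this text.

```python
# Minimal illustration of a single-channel "valid" 2D convolution.
# Hypothetical names; plain Python, no external dependencies.
def conv2d_valid(fmap, kernel):
    """An H x W map and a K x K kernel yield an
    (H-K+1) x (W-K+1) output, as in the text."""
    H, W = len(fmap), len(fmap[0])
    K = len(kernel)
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            acc = 0
            for di in range(K):
                for dj in range(K):
                    acc += fmap[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out
```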
CNN is mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Since the feature detection layer of a CNN learns from the training data, explicit feature extraction is avoided when the CNN is used; features are learned implicitly from the training data. Furthermore, because the neurons on the same feature mapping plane share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over fully connected networks. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing. Its layout is closer to that of a real biological neural network, and weight sharing reduces the complexity of the network; in particular, the fact that a multidimensional input image can be fed directly into the network avoids the complexity of data reconstruction during feature extraction and classification.
At present, convolutional neural networks are implemented in software running on one processor or on multiple distributed processors. As the complexity of the convolutional neural network increases, processing becomes relatively slow, and the performance demands on the processor grow ever higher.
Summary of the Invention
In view of the above shortcomings of the prior art, the object of the present invention is to provide a convolver and an artificial intelligence processing device using the same, to solve the prior-art problems that software implementations of convolutional neural networks are slow and place high performance demands on the processor.
To achieve the above and other related objects, the present invention provides a convolver electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters. The convolver includes a parameter buffer, an input buffer, a convolution operation circuit, and an output buffer. The parameter buffer is configured to receive and output the weight parameters. The input buffer includes a plurality of connected line buffers for receiving and outputting the data to be processed; each time every line buffer outputs one element, the outputs are assembled into one column of output data. The convolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, perform a convolution operation accordingly, and output the convolution result. The output buffer is configured to receive the convolution result and output it to the external memory.
In an embodiment of the invention, the input buffer includes a first line buffer that receives the pixel data of the feature map to be processed element by element, outputs a row of pixel data in parallel after filtering, and stores the input feature map of each convolution layer; the number of data elements per output row equals the number of parallel filters.
In an embodiment of the invention, the first line buffer outputs the row pixel data of each convolution layer in turn and, while outputting the row pixel data of a given convolution layer, outputs the row pixel data of each channel in turn.
In an embodiment of the invention, the input buffer further includes at least one second line buffer comprising a plurality of FIFO memories connected in series, each FIFO storing one row of pixel data of the feature map; each row of pixel data is stored into the FIFOs in turn along the path formed by the serial FIFOs. The second line buffer outputs pixel data in the form of a Pf×K matrix, where Pf is the number of parallel filters and K is the size of the convolution kernel.
In an embodiment of the invention, the input buffer further includes at least one matrix buffer, each consisting of a plurality of registers arranged as a matrix for storing pixel data; the register array has size Pf×K×2. When the number of columns of input pixel data exceeds K, the matrix buffer outputs pixel data in the form of a Pf×K×K matrix.
In an embodiment of the invention, the convolution operation circuit includes a plurality of convolution kernels running in parallel, each containing multipliers for performing the convolution, and an adder tree that accumulates the outputs of the multipliers. Each convolution kernel takes pixel data in the form of a K×K matrix and, by convolving the input pixel data with the weight parameters, outputs pixel data element by element.
In an embodiment of the invention, the output buffer includes two FIFO memories in parallel; channel data passing through the same filter is accumulated and stored in the same FIFO. A data selector returns each partial accumulation to the adder tree until the adders output the final accumulated result.
In an embodiment of the invention, the convolver further includes a pooling operation circuit, connected between the output buffer and the external memory, for pooling the convolution result before outputting it to the external memory.
In an embodiment of the invention, the internal components of the convolver, as well as the convolver and the external memory, are connected through first-in first-out data interfaces.
The present invention also provides an artificial intelligence processing apparatus including the convolver described above.
As described above, the convolver of the present invention and the artificial intelligence processing apparatus to which it is applied have the following advantageous effects:
The convolver of the invention is composed of hardware: a parameter buffer, an input buffer, a convolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces. It can process highly complex convolutional neural network algorithms at high speed, effectively solving the prior-art problems that software implementations are slow and place high performance demands on the processor.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the overall principle of a convolver in the prior art.
FIG. 2 is a schematic diagram of the input and output of a convolver of the present invention.
FIG. 3 is a schematic diagram of the second line buffer in a convolver of the present invention.
FIG. 4 is a schematic diagram of the input and output of the matrix buffer in a convolver of the present invention.
FIG. 5 is a schematic structural diagram of the output buffer in a convolver of the present invention.
Description of Reference Numerals
100      Convolver
110      Parameter buffer
120      First line buffer
130      Second line buffer
140      Matrix buffer
150      Convolution operation circuit
160      Output buffer
170      Pooling operation circuit
Detailed Description of the Embodiments
The embodiments of the present invention are described below through specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the disclosure of this specification. The present invention may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, where there is no conflict, the following embodiments and the features in the embodiments may be combined with one another.
It should be noted that, as shown in FIG. 1 to FIG. 5, the illustrations provided in the following embodiments describe the basic concept of the present invention only schematically. The drawings show only the components related to the present invention rather than the actual number, shapes, and sizes of components in a real implementation; in practice the type, number, and proportion of each component may vary arbitrarily, and the component layout may be more complicated.
The purpose of this embodiment is to provide a convolver and an artificial intelligence processing device using the same, to solve the prior-art problems that software implementations of convolutional neural networks are slow and place high performance demands on the processor. The principles and implementation of the convolver of this embodiment and of the artificial intelligence processing device to which it is applied are described in detail below, so that those skilled in the art can understand them without creative effort.
Specifically, as shown in FIG. 1, this embodiment provides a convolver 100 electrically connected to an external memory, wherein the external memory stores the data to be processed and the weight parameters. The convolver 100 includes a parameter buffer 110, an input buffer, a convolution operation circuit 150, and an output buffer 160.
The first data to be processed contains a plurality of channels; the first weight parameter contains multiple layers of sub-parameters, each layer corresponding one-to-one to a channel. There are a plurality of convolution operation circuits 150, which compute the convolution results of the respective channels in parallel, in one-to-one correspondence.
In this embodiment, the parameter buffer 110 (Con_reg in FIG. 2) is used to receive and output the weight parameters (Weight in FIG. 2). The parameter buffer 110 includes a FIFO memory in which the weight parameters are stored. Once configured, the parameters of the input buffer, the convolution operation circuit 150, and the output buffer 160 are also stored in the parameter buffer 110.
In this embodiment, the input buffer includes a plurality of connected row buffers for receiving and outputting the data to be processed; each time every row buffer outputs one element, the outputs are assembled into one column of output data.
The input buffer includes a first line buffer 120 (Conv_in_cache in FIG. 2), a second line buffer 130 (Conv_in_buffer in FIG. 2), and a matrix buffer 140 (Con_in_matrix in FIG. 2). Together they convert a 1×1 pixel-data input stream into Pv×K² pixel data, where Pv is the row vector length and K is the size of the convolution kernel. The input buffer is described in detail below.
Specifically, in this embodiment, the first line buffer 120 (Conv_in_cache in FIG. 2) receives the pixel data of the feature map to be processed element by element, outputs a row of pixel data in parallel after filtering, and stores the input feature map of each convolution layer; the number of data elements per output row equals the number of parallel filters.
In this embodiment, the first line buffer 120 includes a BRAM; the input pixel data of the feature map of each convolution layer is cached in the BRAM to improve the locality of pixel-data storage.
In this embodiment, the first line buffer 120 outputs the row pixel data of each convolution layer in turn and, within each convolution layer, outputs the row pixel data of each channel in turn. That is, the first line buffer 120 starts by outputting the pixel data of the first channel; after the pixel data of the first channel has been processed, it begins outputting the pixel data of the second channel; and once the pixel data of all channels of one convolution layer has been output, it outputs the pixel data of the channels of the next convolution layer. The first line buffer 120 iterates the computation and output from the first convolution layer to the last, using different filters.
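This nested ordering (all rows of one channel, channel after channel, layer after layer) can be summarized by a small sketch; the names are hypothetical and the sketch is only a software restatement of the traversal order:

```python
# Sketch of the first line buffer's output ordering.
def first_line_buffer_order(layers):
    """layers: per-layer lists of per-channel row sequences."""
    for layer_idx, channels in enumerate(layers):
        for channel_idx, rows in enumerate(channels):
            for row in rows:
                yield layer_idx, channel_idx, row
```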
In this embodiment, the input buffer further includes at least one second row buffer 130. As shown in FIG. 3, the second row buffer 130 comprises a plurality of FIFO memories connected in series, each FIFO storing one row of pixel data of the feature map; each row of pixel data is stored into the FIFOs in turn along the path formed by the serial FIFOs. The second row buffer 130 receives Pf rows of pixel data and outputs pixel data in the form of a Pf×K matrix, where Pf is the number of parallel filters and K is the size of the convolution kernel.
The first row of pixel data is stored in the first FIFO; the first FIFO then begins receiving the second row of pixel data while outputting the first row into the second FIFO. In this way, the two FIFOs store two consecutive rows of pixel data and output them together.
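The following behavioral sketch models this FIFO chain in Python for a single filter path (Pf = 1). It is an illustrative simplification, not the circuit itself, and all names are hypothetical. Pushing one pixel returns the K vertically adjacent pixels ending at that pixel once K-1 full rows are buffered:

```python
from collections import deque

class RowBufferChain:
    """Behavioral sketch of the serial-FIFO row buffer: K-1 FIFOs,
    each holding one image row of width W. Pushing a pixel returns
    the K-tall column of vertically adjacent pixels (oldest row
    first) once K-1 full rows are buffered, else None."""
    def __init__(self, k, width):
        self.k, self.width = k, width
        self.fifos = [deque() for _ in range(k - 1)]  # fifos[0] = newest row

    def push(self, pixel):
        column = None
        if all(len(f) == self.width for f in self.fifos):
            # FIFO heads are the K-1 pixels directly above `pixel`.
            column = [self.fifos[i][0] for i in range(self.k - 2, -1, -1)]
            column.append(pixel)
        # Shift: each full FIFO spills its oldest pixel into the next one.
        carry = pixel
        for fifo in self.fifos:
            fifo.append(carry)
            if len(fifo) <= self.width:
                break
            carry = fifo.popleft()
        return column
```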
In this embodiment, the input buffer further includes at least one matrix buffer 140. Each matrix buffer 140 consists of a plurality of registers arranged as a matrix for storing pixel data; the register array has size Pf×K×2, and FIG. 4 shows the registers for K=3. The matrix buffer 140 takes pixel data in the form of a Pf×K matrix; when the number of columns of input pixel data exceeds K, it outputs pixel data in the form of a Pf×K×K matrix.
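Continuing the same simplification (Pf = 1, single filter path), a matrix buffer can be sketched as a small register window that retains the most recent K columns; again this is illustrative only and the names are hypothetical:

```python
from collections import deque

class MatrixBuffer:
    """Behavioral sketch of the matrix buffer: retains the most recent
    K column vectors of one row sweep and emits a K x K window whenever
    K columns are present (the Pf filter dimension is omitted)."""
    def __init__(self, k):
        self.k = k
        self.cols = deque(maxlen=k)   # each entry is one K-tall column

    def push(self, column, col_index):
        if col_index == 0:
            self.cols.clear()         # new image row: drop stale columns
        self.cols.append(column)
        if len(self.cols) == self.k:
            # Assemble the K x K window, row-major.
            return [[col[r] for col in self.cols] for r in range(self.k)]
        return None
```

Chained together, the two sketches reproduce the (H-K+1)×(W-K+1) window count stated later for the convolution output:

```python
# Stream a 5x5 image through the chain and collect 3x3 windows.
chain, matrix = RowBufferChain(3, 5), MatrixBuffer(3)
windows = []
for r in range(5):
    for c in range(5):
        col = chain.push(r * 5 + c)
        if col is not None:
            window = matrix.push(col, c)
            if window is not None:
                windows.append(window)
assert len(windows) == (5 - 3 + 1) * (5 - 3 + 1)  # 9 windows
```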
In this embodiment, the convolution operation circuit 150 is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer 110, perform a convolution operation accordingly, and output the convolution result.
Specifically, in this embodiment, the convolution operation circuit 150 includes a plurality of convolution kernels running in parallel, each containing multipliers for performing the convolution, and an adder tree that accumulates the outputs of the multipliers. Each convolution kernel takes pixel data in the form of a K×K matrix and, by convolving the input pixel data with the weight parameters, outputs pixel data element by element.
For example, an image has three channels of data, R, G, and B, that is, three two-dimensional matrices. Suppose the first weight parameter, that is, the filter, has a depth of 3, that is, three layers of sub-weight parameters (three two-dimensional matrices), each of size K×K, with K an odd number such as 3; each layer is convolved with its corresponding channel. When a Pv×K×3 data cube (Pv>K) is taken from the first data to be processed, assuming Pv is 5, the filter must pass through the convolution operation circuit 150 three times to complete the computation. Preferably, a corresponding number of three convolution operation circuits 150 can be provided, so that the convolution of each circuit's assigned channel can proceed in parallel within one clock cycle.
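A software-level sketch of one such kernel (K×K multipliers feeding a pairwise adder tree), and of three kernels working on the R, G, B channels in parallel, might look as follows; it is illustrative only and the names are hypothetical:

```python
def conv_kernel(window, weights):
    """One convolution kernel: K*K multipliers and an adder tree
    reduce a K x K window and K x K weights to one output pixel."""
    k = len(weights)
    products = [window[i][j] * weights[i][j]
                for i in range(k) for j in range(k)]
    while len(products) > 1:          # pairwise adder-tree levels
        products = [sum(products[i:i + 2])
                    for i in range(0, len(products), 2)]
    return products[0]

def convolve_channels(windows, filt):
    """Three circuits in parallel, one per channel (R, G, B): each
    channel's window meets its own sub-weight layer of the filter."""
    return [conv_kernel(w, sub) for w, sub in zip(windows, filt)]
```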
In this embodiment, the output buffer 160 is configured to receive the convolution result and output it to the external memory.
Specifically, the output buffer 160 receives the convolution result of each channel and then accumulates the convolution results of all channel data; the result is temporarily stored in the output buffer 160.
Specifically, in this embodiment, as shown in FIG. 5, the output buffer 160 includes two FIFO memories in parallel; channel data passing through the same filter is accumulated and stored in the same FIFO. A data selector (MUX) returns each partial accumulation to the adder tree until the adders output the final accumulated result.
The number of adders equals Pf×Pv. In addition, the data selector (MUX) also throttles the data stream down to 1×1, outputting pixels one at a time.
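Behaviorally, the accumulate-and-feed-back role of the FIFOs and the MUX reduces to a running per-pixel sum followed by a 1×1 drain; the sketch below states only that behavior, with hypothetical names and no attempt to model the two physical FIFOs:

```python
class OutputBuffer:
    """Behavioral sketch: accumulate per-channel results of one filter
    (the MUX feedback into the adder tree), then drain pixel by pixel."""
    def __init__(self, out_h, out_w):
        self.acc = [[0] * out_w for _ in range(out_h)]

    def accumulate(self, channel_result):
        """Add one channel's (H-K+1) x (W-K+1) result to the sum."""
        for i, row in enumerate(channel_result):
            for j, value in enumerate(row):
                self.acc[i][j] += value

    def drain(self):
        """Throttle back to a 1x1 stream: one pixel at a time."""
        for row in self.acc:
            yield from row
```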
In this embodiment, the convolver 100 further includes a pooling operation circuit 170, connected between the output buffer 160 and the external memory, for pooling the convolution result before outputting it to the external memory.
The pooling operation circuit 170 pools every two rows of pixel data; it also contains a FIFO memory for storing each row of pixel data.
Specifically, the pooling mode may be max pooling or average pooling, and either may be implemented by a logic circuit.
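A sketch of this two-row scheme, with a FIFO holding the first row of each pair until the second arrives, is shown below (max pooling shown; average pooling is the same structure with a different reducer; names are hypothetical):

```python
from collections import deque

def pool_rows(rows, reduce2x2=max):
    """Pool every two rows with non-overlapping 2x2 blocks. A FIFO
    buffers the first row of each pair until the second arrives."""
    fifo, pooled = deque(), []
    for index, row in enumerate(rows):
        if index % 2 == 0:
            fifo.append(row)                  # buffer the first row
        else:
            prev = fifo.popleft()
            pooled.append([reduce2x2((prev[c], prev[c + 1],
                                      row[c], row[c + 1]))
                           for c in range(0, len(row) - 1, 2)])
    return pooled

# Average pooling: same structure, different reducer, e.g.
# pool_rows(rows, reduce2x2=lambda block: sum(block) / 4)
```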
In this embodiment, the internal components of the convolver 100, as well as the convolver 100 and the external memory, are connected through first-in first-out data interfaces (the several SIFs shown in FIG. 2).
Specifically, the first-in first-out data interface includes a first-in first-out memory, a first logic unit, and a second logic unit.
The first-in first-out memory includes, on the upstream side, a write-enable pin, a data input pin, and a memory-full status pin; and, on the downstream side, a read-enable pin, a data output pin, and a memory-empty status pin.
The first logic unit is connected to the upstream object, the write-enable pin, and the memory-full status pin. On receiving a write request from the upstream object, it determines from the signal on the memory-full status pin whether the first-in first-out memory is full; if not, it sends an enable signal to the write-enable pin to make the memory writable; otherwise it makes the memory unwritable.
Specifically, the first logic unit includes a first inverter, whose input is connected to the memory-full status pin and whose output provides a first status terminal for connecting the upstream object, and a first AND gate, whose first input is connected to the first data-valid status terminal, whose second input is connected to the upstream data-valid terminal of the upstream object, and whose output is connected to the write-enable pin.
The second logic unit is connected to the downstream object, the read-enable pin, and the memory-empty status pin. On receiving a read request from the downstream object, it determines from the signal on the memory-empty status pin whether the first-in first-out memory is empty; if not, it sends an enable signal to the read-enable pin to make the memory readable; otherwise it makes the memory unreadable.
Specifically, the second logic unit includes a second inverter, whose input is connected to the memory-empty status pin and whose output provides the downstream data-valid terminal for connecting the downstream object, and a second AND gate, whose first input is connected to the downstream data-valid terminal and whose second input is connected to the downstream data-valid status terminal of the downstream object.
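Reduced to Boolean logic, each side of the interface is one inverter plus one AND gate; the following sketch restates that handshake with illustrative names:

```python
def write_enable(upstream_valid, fifo_full):
    """Write side: first inverter (NOT full) and first AND gate."""
    return upstream_valid and not fifo_full

def read_enable(downstream_ready, fifo_empty):
    """Read side: second inverter (NOT empty) and second AND gate."""
    return downstream_ready and not fifo_empty
```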
In this embodiment, the convolver 100 runs as follows:
The data to be processed is read from the external memory through the first-in first-out data interfaces (the several SIFs shown in FIG. 2) and stored in the BRAM of the first line buffer 120 (Conv_in_cache in FIG. 2).
A K×K weight parameter (one channel) is read from the external memory through the first-in first-out data interfaces (the several SIFs shown in FIG. 2) and then stored in the parameter buffer 110.
Once the parameter buffer 110 has been loaded with a weight parameter, the convolver begins receiving the pixel data of the feature map; after processing by the first line buffer 120 (Conv_in_cache in FIG. 2), the second line buffer 130 (Conv_in_buffer in FIG. 2), and the matrix buffer 140 (Con_in_matrix in FIG. 2), the convolution operation circuit 150 receives Pv×K² pixel data every clock cycle.
The convolution operation circuit 150 convolves and accumulates the input data of each channel (the feature map input to each channel having height H and width W), then outputs each channel's result to the output buffer 160.
The different input channels are visited cyclically, and the output buffer 160 accumulates the results of each channel until the (H-K+1)×(W-K+1) feature map corresponding to the filter is obtained.
The pooling operation circuit 170 can then receive the (H-K+1)×(W-K+1) pixel data for pooling before the feature map is output, or the feature map can be output directly from the output buffer 160.
当所述池化运算电路170或所述输出缓存器160输出经过一个滤波器处理的特征图谱之 后,所述参数缓存器110重新加载到一个权重参数,通过不同的滤波器重复迭代上述像素处理过程,直至完成所有卷积层的像素处理。After the pooling operation circuit 170 or the output buffer 160 outputs the feature map processed by one filter, the parameter buffer 110 is reloaded into a weight parameter, and the pixel processing process is iteratively repeated by different filters. Until the pixel processing of all convolutional layers is completed.
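To make the data flow above concrete, here is a minimal NumPy sketch of the same processing order: for each filter, slide a K×K window over each input channel, accumulate the per-channel partial sums (the role of the output buffer), and optionally apply pooling before emitting the (H−K+1)×(W−K+1) map. This is a sequential behavioral model under assumed array shapes, not the parallel hardware (the per-cycle Pv parallelism is not modeled), and all names are illustrative.

```python
import numpy as np

def convolver_model(feature_maps, weights, pool=False):
    """Behavioral model of the dataflow. feature_maps: (C, H, W);
    weights: (F, C, K, K). Returns one (H-K+1, W-K+1) map per filter."""
    C, H, W = feature_maps.shape
    F, _, K, _ = weights.shape
    out_h, out_w = H - K + 1, W - K + 1
    outputs = []
    for f in range(F):                       # one filter at a time (parameter reload)
        acc = np.zeros((out_h, out_w))       # role of the output buffer
        for c in range(C):                   # cycle through the input channels
            for i in range(out_h):
                for j in range(out_w):       # K x K window from the line/matrix buffers
                    window = feature_maps[c, i:i+K, j:j+K]
                    acc[i, j] += np.sum(window * weights[f, c])
        if pool:                             # optional pooling stage (assumed 2x2 max)
            acc = acc[:out_h // 2 * 2, :out_w // 2 * 2]
            acc = acc.reshape(out_h // 2, 2, out_w // 2, 2).max(axis=(1, 3))
        outputs.append(acc)
    return outputs
```

For example, a 3-channel 8×8 input convolved with 3×3 kernels yields 6×6 maps, or 3×3 maps after the assumed 2×2 max pooling.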
This embodiment also provides an artificial intelligence processing apparatus that includes the convolver 100 described above. The convolver 100 has already been described in detail and is not repeated here.
The artificial intelligence processor includes a programmable logic circuit (PL) and a processing system circuit (PS). The processing system circuit includes a central processor, which may be implemented by an MCU, SoC, FPGA, or DSP, for example an ARM-architecture embedded processor chip. The central processor is communicatively connected to the external memory 200, which may be a RAM or ROM memory such as DDR3 or DDR4 SDRAM, and the central processor can read data from and write data to the external memory.
In summary, the convolver of the present invention is composed of hardware (a parameter buffer, an input buffer, a convolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces) and can process highly complex convolutional neural network algorithms at high speed. It effectively solves the problems in the prior art of slow processing speed and high processor performance requirements caused by software-based implementations. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

  1. A convolver electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters, the convolver comprising: a parameter buffer, an input buffer, a convolution operation circuit, and an output buffer;
    the parameter buffer is configured to receive and output the weight parameters;
    the input buffer comprises a plurality of connected line buffers configured to receive and output the data to be processed, wherein each time each of the line buffers outputs one data element, the elements together form one column of output data;
    the convolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, to perform a convolution operation accordingly, and to output a convolution operation result;
    the output buffer is configured to receive the convolution operation result and output it to the external memory.
  2. The convolver of claim 1, wherein the input buffer comprises:
    a first line buffer that receives, element by element, the pixel data of the feature map to be processed, simultaneously outputs row pixel data after passing through the filter, and stores the input feature maps of the convolutional layers, wherein the number of data elements per row of the row pixel data equals the number of parallel filters.
  3. The convolver of claim 2, wherein the first line buffer outputs the row pixel data of each convolutional layer in sequence and, when outputting the row pixel data of each convolutional layer, outputs the row pixel data of each channel in sequence.
  4. The convolver of claim 2, wherein the input buffer further comprises:
    at least one second line buffer comprising a plurality of serially connected FIFO memories, each FIFO memory storing one row of pixel data of a feature map, wherein each row of pixel data is stored in turn into the FIFO memories along the path formed by their serial connection; the second line buffer outputs pixel data in the form of a Pf×K matrix, where Pf is the number of parallel filters and K is the size of the convolution kernel.
  5. The convolver of claim 4, wherein the input buffer further comprises:
    at least one matrix buffer, each matrix buffer consisting of a plurality of registers arranged in a matrix for storing pixel data, the register array having size Pf×K×2; when the number of columns of the input pixel data exceeds K, the matrix buffer outputs pixel data in the form of a Pf×K×K matrix.
  6. The convolver of claim 5, wherein the convolution operation circuit comprises:
    a plurality of convolution kernels running in parallel, each convolution kernel comprising multipliers for performing the convolution operation;
    an adder tree that accumulates the output results of the plurality of multipliers;
    wherein each convolution kernel receives pixel data in the form of a K×K matrix and, based on the input pixel data and the weight parameters, outputs pixel data element by element through the convolution operation.
  7. The convolver of claim 6, wherein the output buffer comprises:
    two parallel FIFO memories, wherein the channel data processed by the same filter are accumulated and stored in the same FIFO memory;
    a data selector configured to return each accumulation result to the adder tree until the adder tree outputs the final accumulated result.
  8. The convolver of claim 1, further comprising:
    a pooling operation circuit, connected between the output buffer and the external memory, configured to pool the convolution operation result and output it to the external memory.
  9. The convolver of claim 1, wherein the internal components of the convolver are connected to one another, and the convolver is connected to the external memory, through first-in first-out data interfaces.
  10. An artificial intelligence processing apparatus, comprising the convolver of any one of claims 1 to 9.
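For illustration only, and not as part of the claims, the following Python sketch shows the arithmetic structure recited in claims 6 and 7: a K×K multiplier array feeding an adder tree, with the previous channel's partial sum fed back through the data selector until the accumulation across all channels is complete. All function names are ours and the reduction is a software stand-in for the hardware tree.

```python
def adder_tree(values):
    """Pairwise reduction, as an adder tree would sum the multiplier outputs."""
    values = list(values)
    while len(values) > 1:
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

def convolution_kernel(window, kernel_weights, partial_sum=0.0):
    """One K x K kernel: K*K multipliers plus the adder tree; the data
    selector feeds the previous partial sum back into the tree."""
    products = [p * w for row_p, row_w in zip(window, kernel_weights)
                for p, w in zip(row_p, row_w)]
    return adder_tree(products + [partial_sum])

# Usage sketch: accumulate one output pixel across the input channels.
# acc = 0.0
# for window, w in zip(channel_windows, channel_weights):
#     acc = convolution_kernel(window, w, acc)   # final acc is the output pixel
```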