WO2019136764A1 - Convolutor and artificial intelligent processing device applied thereto - Google Patents
Convolutor and artificial intelligent processing device applied thereto
- Publication number
- WO2019136764A1 (PCT/CN2018/072678)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- buffer
- data
- pixel data
- convolution operation
- output
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/06—Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
- G06F5/10—Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor having a sequence of storage locations each being individually accessible for both enqueue and dequeue operations, e.g. using random access memory
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
Abstract
A convolutor (100) and an artificial intelligence processing device using it. The convolutor is electrically connected to an external memory that stores the data to be processed and the weight parameters. The convolutor (100) comprises a parameter buffer (110), an input buffer, a convolution operation circuit (150), and an output buffer (160). The parameter buffer (110) receives and outputs the weight parameters. The input buffer comprises a plurality of connected row buffers for receiving and outputting the data to be processed; each time the row buffers each output one item of data, those items are assembled into a column of output data. The convolution operation circuit (150) receives the data to be processed from the input buffer and the weight parameters from the parameter buffer (110), performs a convolution operation accordingly, and outputs a convolution result. The output buffer (160) receives the convolution result and outputs it to the external memory. The device can solve the problems of low processing speed and high processor requirements brought about by software implementations.
Description
The present invention relates to the field of processor technology, and in particular to the field of artificial intelligence processors, specifically to a convolver and an artificial intelligence processing device to which it is applied.
A Convolutional Neural Network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within a limited receptive field, and which performs well on large-scale image processing. A convolutional neural network includes convolutional layers and pooling layers.
CNNs have become a research hotspot in many scientific fields, especially pattern classification. Because the network avoids complicated image pre-processing and can take the original image directly as input, it has been widely adopted.
Generally, the basic structure of a CNN includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and the local features are extracted. Once a local feature is extracted, its positional relationship to other features is determined as well. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map is a plane, and the weights of all neurons on the plane are equal. The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, so that the feature maps are shift-invariant. In addition, since the neurons on one mapping plane share weights, the number of free network parameters is reduced. Each convolutional layer in the network is followed by a computing layer for local averaging and secondary extraction; this distinctive two-stage feature extraction structure reduces the feature resolution.
CNNs are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Because the feature detection layer of a CNN learns from training data, explicit feature extraction is avoided when the CNN is used; learning happens implicitly from the training data. Moreover, since the neurons on the same feature mapping plane share weights, the network can learn in parallel, which is a major advantage of convolutional networks over fully interconnected neural networks. With its special structure of locally shared weights, a convolutional neural network has unique advantages in speech recognition and image processing; its layout is closer to an actual biological neural network, weight sharing reduces the complexity of the network, and the fact that a multi-dimensional input image can be fed directly into the network avoids the complexity of data reconstruction during feature extraction and classification.
At present, convolutional neural networks are implemented in software running on one processor or on multiple distributed processors. As the complexity of convolutional neural networks grows, processing speed slows down accordingly, and the performance demands on the processor keep rising.
Summary of the Invention
In view of the above shortcomings of the prior art, the object of the present invention is to provide a convolver and an artificial intelligence processing device using it, so as to solve the problems in the prior art that convolutional neural networks implemented purely in software suffer slow processing speed and place high performance demands on the processor.
To achieve the above and other related objects, the present invention provides a convolver electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters. The convolver includes a parameter buffer, an input buffer, a convolution operation circuit, and an output buffer. The parameter buffer is configured to receive and output the weight parameters. The input buffer includes a plurality of connected line buffers for receiving and outputting the data to be processed; each time the line buffers each output one item of data, those items are assembled into a column of output data. The convolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, perform the convolution operation accordingly, and output the convolution result. The output buffer is configured to receive the convolution result and output it to the external memory.
In an embodiment of the invention, the input buffer includes a first line buffer that receives the pixel data of the feature map to be processed pixel by pixel, outputs whole rows of pixel data in parallel after filtering, and stores the input feature map of each convolutional layer; the number of data items in each output row equals the number of parallel filters.
In an embodiment of the invention, the first line buffer outputs the row pixel data of each convolutional layer in turn and, while outputting the row pixel data of each convolutional layer, outputs the row pixel data of each channel in turn.
In an embodiment of the invention, the input buffer further includes at least one second line buffer comprising a plurality of FIFO memories connected in series, each FIFO memory storing one row of pixel data of the feature map; each row of pixel data is stored into the successive FIFO memories along the path formed by the serial FIFOs. The second line buffer outputs pixel data in the form of a Pf×K matrix, where Pf is the number of parallel filters and K is the size of the convolution kernel.
In an embodiment of the invention, the input buffer further includes at least one matrix buffer, each consisting of a plurality of registers arranged as a matrix for storing pixel data; the register matrix has size Pf×K×2. When the number of columns of input pixel data exceeds K, the matrix buffer outputs pixel data in the form of a Pf×K×K matrix.
In an embodiment of the invention, the convolution operation circuit includes a plurality of convolution kernels running in parallel, each containing multipliers for performing the convolution, and an adder tree that accumulates the outputs of the multipliers. Each convolution kernel takes pixel data in the form of a K×K matrix as input and, from the input pixel data and the weight parameters, outputs pixel data one pixel at a time through the convolution operation.
In an embodiment of the invention, the output buffer includes two parallel FIFO memories; the channel data passing through the same filter are accumulated and stored in the same FIFO memory. A data selector returns each partial accumulation result to the adder tree until the adder tree outputs the final accumulated result.
In an embodiment of the invention, the convolver further includes a pooling operation circuit connected between the output buffer and the external memory for pooling the convolution results before outputting them to the external memory.
In an embodiment of the invention, the internal components of the convolver are interconnected, and the convolver is connected to the external memory, through first-in first-out data interfaces.
The present invention also provides an artificial intelligence processing apparatus including the convolver described above.
As described above, the convolver of the present invention and the artificial intelligence processing apparatus using it have the following beneficial effects:
The convolver of the invention is built from hardware: a parameter buffer, an input buffer, a convolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces. It can process highly complex convolutional neural network algorithms at high speed, effectively solving the prior art problems of slow processing and high processor performance requirements brought about by software implementations.
Fig. 1 is a schematic diagram of the overall principle of a convolver in the prior art.
Fig. 2 is a schematic input/output diagram of a convolver according to the present invention.
Fig. 3 is a schematic diagram of the second line buffer in a convolver according to the present invention.
Fig. 4 is a schematic input/output diagram of the matrix buffer in a convolver according to the present invention.
Fig. 5 is a schematic structural diagram of the output buffer in a convolver according to the present invention.
Description of element reference numerals
100 Convolver
110 Parameter buffer
120 First line buffer
130 Second line buffer
140 Matrix buffer
150 Convolution operation circuit
160 Output buffer
170 Pooling operation circuit
The embodiments of the present invention are described below by way of specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the disclosure herein. The invention may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the invention. It should be noted that, where no conflict arises, the following embodiments and the features within them may be combined with one another.
It should be noted that the drawings provided in the following embodiments (Figs. 1 to 5) merely illustrate the basic concept of the invention in a schematic manner; they show only the components related to the invention rather than the actual number, shapes, and sizes of components in a real implementation. In practice the type, quantity, and proportion of each component may vary freely, and the component layout may be more complicated.
The purpose of this embodiment is to provide a convolver and an artificial intelligence processing device using it, so as to solve the prior art problems that software implementations of convolutional neural networks are slow and demand high processor performance. The principle and implementation of the convolver of this embodiment, and of the artificial intelligence processing device using it, are described in detail below, so that those skilled in the art can understand them without creative effort.
Specifically, as shown in Fig. 1, this embodiment provides a convolver 100 electrically connected to an external memory, where the external memory stores the data to be processed and the weight parameters. The convolver 100 includes a parameter buffer 110, an input buffer, a convolution operation circuit 150, and an output buffer 160.
The first data to be processed contains multiple channels of data; the first weight parameter contains multiple layers of sub-parameters, with each layer of sub-parameters corresponding one-to-one to a channel. There are multiple convolution operation circuits 150 for computing the convolution results of the respective channels in parallel, one circuit per channel.
In this embodiment, the parameter buffer 110 (Con_reg in Fig. 2) is used to receive and output the weight parameters (Weight in Fig. 2). The parameter buffer 110 includes a FIFO memory in which the weight parameters are stored. The configured parameters of the input buffer, the convolution operation circuit 150, and the output buffer 160 are also stored in the parameter buffer 110.
In this embodiment, the input buffer includes a plurality of connected line buffers for receiving and outputting the data to be processed; each time the line buffers each output one item of data, those items are assembled into one column of output data.
The input buffer includes a first line buffer 120 (Conv_in_cache in Fig. 2), a second line buffer 130 (Conv_in_buffer in Fig. 2), and a matrix buffer 140 (Con_in_matrix in Fig. 2). Together these three stages turn the 1×1 (one pixel per cycle) input stream into an output of Pv×K² pixel data, where Pv is the row vector length and K is the size of the convolution kernel. The input buffer is described in detail below.
Specifically, in this embodiment, the first line buffer 120 (Conv_in_cache in Fig. 2) receives the pixel data of the feature map to be processed pixel by pixel, outputs whole rows of pixel data in parallel after filtering, and stores the input feature map of each convolutional layer; the number of data items in each output row equals the number of parallel filters.
In this embodiment, the first line buffer 120 includes a BRAM; the input pixel data of the feature map of each convolutional layer are cached in the BRAM to improve the locality of pixel-data storage.
In this embodiment, the first line buffer 120 outputs the row pixel data of each convolutional layer in turn and, while outputting the row pixel data of one convolutional layer, outputs the row pixel data of each channel in turn. That is, the first line buffer 120 first outputs the pixel data of the first channel; when the pixel data of the first channel have been processed, it starts to output the pixel data of the second channel; and once the pixel data of all channels of one convolutional layer have been output, it proceeds to the pixel data of the channels of the next convolutional layer. Using different filters, the first line buffer 120 iterates from the first convolutional layer to the last. A loop-nest sketch of this ordering is given below.
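A minimal sketch of this output ordering, under the assumption that whole feature maps are available in memory; the names `layers`, `filters`, and `channels` are illustrative, not from the patent:

```python
# Sketch of the first line buffer's output ordering: layer by layer,
# filter by filter, channel by channel, row by row. The data layout
# (dicts of lists) is an assumption made for illustration only.
def emit_rows(layers):
    for layer in layers:                        # first to last convolutional layer
        for flt in layer["filters"]:            # iterate over the layer's filters
            for channel in layer["channels"]:   # channel 1 completes before channel 2
                for row in channel:             # one row of pixel data at a time
                    yield flt, row
```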
In this embodiment, the input buffer further includes at least one second line buffer 130. As shown in Fig. 3, the second line buffer 130 contains a plurality of FIFO memories connected in series, each FIFO memory storing one row of pixel data of the feature map; each row of pixel data is stored into the successive FIFO memories along the path formed by the serial FIFOs. The second line buffer 130 receives Pf rows of pixel data and outputs pixel data in the form of a Pf×K matrix, where Pf is the number of parallel filters and K is the size of the convolution kernel.
The first row of pixel data is stored in the first FIFO; the first FIFO then begins to receive the second row of pixel data while passing the first row on to the second FIFO. In this way, two FIFOs hold two consecutive rows of pixel data and output them together. A behavioural sketch of such a serial-FIFO line buffer follows.
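A minimal sketch with Python deques standing in for the hardware FIFOs; it models only the dataflow, not clocking, and assumes each FIFO holds exactly one image row:

```python
from collections import deque

# Behavioural model of a chain of K-1 row FIFOs plus the incoming row:
# pushing one pixel yields a column of K vertically adjacent pixels once
# the FIFOs are primed.
class LineBuffer:
    def __init__(self, width, k):
        self.width = width
        self.fifos = [deque() for _ in range(k - 1)]  # row FIFOs in series

    def push(self, pixel):
        column = [pixel]
        for fifo in self.fifos:
            fifo.append(column[-1])
            if len(fifo) <= self.width:    # FIFO still priming: no column yet
                return None
            column.append(fifo.popleft())  # pixel from the row above
        return column[::-1]                # oldest row first: K pixels, one per row
```

Priming takes K-1 full rows; after that, every pushed pixel yields one column of K vertically adjacent pixels, which is the column stream the matrix buffer 140 consumes.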
In this embodiment, the input buffer further includes at least one matrix buffer 140, each consisting of a plurality of registers arranged as a matrix for storing pixel data; the register matrix has size Pf×K×2 (Fig. 4 shows the registers for K=3). The matrix buffer 140 takes pixel data in the form of a Pf×K matrix as input; once the number of columns of input pixel data exceeds K, it outputs pixel data in the form of a Pf×K×K matrix.
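A sketch of the sliding window for a single filter lane (the hardware replicates this Pf times); it assumes the K×2-wide register bank simply keeps enough recent columns for a K-wide window to be valid on every cycle:

```python
# Sketch of the matrix buffer, one filter lane shown. Columns of K pixels
# shift in from the line buffer; once K columns are present, every new
# column yields a full K x K window for the convolution circuit.
class MatrixBuffer:
    def __init__(self, k):
        self.k = k
        self.cols = []                  # sliding window of the most recent columns

    def push(self, column):             # column: K vertically adjacent pixels
        self.cols.append(column)
        if len(self.cols) > self.k:
            self.cols.pop(0)            # keep only the K most recent columns
        if len(self.cols) == self.k:
            # K x K window, laid out rows x columns
            return [[self.cols[c][r] for c in range(self.k)]
                    for r in range(self.k)]
        return None
```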
In this embodiment, the convolution operation circuit 150 is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer 110, perform the convolution operation accordingly, and output the convolution result.
Specifically, in this embodiment, the convolution operation circuit 150 includes a plurality of convolution kernels running in parallel, each containing multipliers for performing the convolution, and an adder tree that accumulates the outputs of the multipliers. Each convolution kernel takes pixel data in the form of a K×K matrix as input and, from the input pixel data and the weight parameters, outputs pixel data one pixel at a time through the convolution operation. A sketch of one such kernel follows.
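A minimal sketch of one convolution kernel, assuming the adder tree can be modelled as a log2-depth pairwise reduction over the K×K products:

```python
# One convolution kernel: K*K multipliers followed by an adder tree.
def adder_tree(values):
    """Pairwise reduction, mirroring a log2-depth hardware adder tree."""
    vals = list(values)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])        # odd element passes through to next stage
        vals = nxt
    return vals[0]

def conv_kernel(window, weights):
    """window, weights: K x K matrices; returns one output pixel."""
    products = [w * p for wrow, prow in zip(weights, window)
                      for w, p in zip(wrow, prow)]   # K*K parallel multipliers
    return adder_tree(products)
```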
For example, an image has three channels of data, R, G, and B, i.e., three two-dimensional matrices. Suppose the depth of the first weight parameter, i.e., of the filter, is 3, so that it has three layers of sub-weight parameters, i.e., three two-dimensional matrices, each of size K×K, with K an odd number, say 3, convolved with the three channels respectively. When a data cube of size Pv×K×3 is taken from the first data to be processed (Pv>K), say with Pv=5, the filter must pass through a single convolution operation circuit 150 three times to finish computing on that data cube. Preferably, a corresponding number of three convolution operation circuits 150 can be provided, so that the convolution of each circuit's assigned channel can be performed in parallel within one clock cycle.
In this embodiment, the output buffer 160 is configured to receive the convolution results and output them to the external memory.
Specifically, the output buffer 160 receives the convolution result of each channel and then accumulates the convolution results of all channel data; the result is stored temporarily in the output buffer 160.
Specifically, in this embodiment, as shown in Fig. 5, the output buffer 160 includes two parallel FIFO memories; the channel data passing through the same filter are accumulated and stored in the same FIFO memory. A data selector (MUX) returns each partial accumulation result to the adder tree until the adder tree outputs the final accumulated result.
The number of adders equals Pf×Pv. In addition, the data selector (MUX) also throttles the data stream back down to a 1×1, pixel-by-pixel output. A behavioural sketch of this accumulate-and-return scheme follows.
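A minimal sketch of the channel accumulation performed by the output buffer, assuming a deque stands in for the FIFO and the feedback through the data selector is modelled as in-place addition:

```python
from collections import deque

# Output buffer model: partial sums for one filter live in a FIFO; each new
# channel's result is added to the stored partial sum (the MUX feedback path
# to the adder tree) until all channels have been accumulated.
def accumulate_channels(channel_results):
    """channel_results: iterable of per-channel output maps (flat lists)."""
    fifo = deque()
    for result in channel_results:
        if not fifo:
            fifo.extend(result)                 # first channel: store as-is
        else:
            for i, value in enumerate(result):  # MUX returns the stored partial
                fifo[i] += value                # sum to the adder tree
    return list(fifo)                           # final accumulated feature map
```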
In this embodiment, the convolver 100 further includes a pooling operation circuit 170 connected between the output buffer 160 and the external memory, for pooling the convolution results before outputting them to the external memory.
The pooling operation circuit 170 performs maximum pooling over every two rows of pixel data; it also contains a FIFO memory for storing each row of pixel data.
Specifically, the pooling mode may be max pooling or average pooling, and either may be implemented by a logic circuit. A sketch of 2×2 pooling built on a one-row buffer follows.
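A minimal sketch of 2×2 max pooling as described above (average pooling would replace `max` with a mean); stride 2 and an even row width are assumed:

```python
# 2x2 max pooling over every two rows: the first row of each pair is held
# in a one-row buffer (the FIFO in hardware) until its partner row arrives.
def max_pool_rows(rows):
    pooled, held = [], None
    for row in rows:
        if held is None:
            held = row                  # buffer the first row of the pair
        else:
            pooled.append([max(held[c], held[c + 1], row[c], row[c + 1])
                           for c in range(0, len(row) - 1, 2)])
            held = None                 # pair consumed
    return pooled
```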
In this embodiment, the internal components of the convolver 100 are interconnected, and the convolver 100 is connected to the external memory, through first-in first-out data interfaces (the several SIF blocks shown in Fig. 2).
Specifically, the first-in first-out data interface includes a first-in first-out memory, a first logic unit, and a second logic unit.
The first-in first-out memory includes, on the upstream side, a write-enable pin, a data input pin, and a memory-full status pin; and, on the downstream side, a read-enable pin, a data output pin, and a memory-empty status pin.
The first logic unit is connected to the upstream object, the write-enable pin, and the memory-full status pin. Upon receiving a write request from the upstream object, it determines from the signal on the memory-full status pin whether the first-in first-out memory is full; if not, it sends an enable signal to the write-enable pin to make the memory writable; otherwise, it makes the memory unwritable.
Specifically, the first logic unit includes a first inverter, whose input is connected to the memory-full status pin and whose output is led out as a first status terminal for connecting the upstream object; and a first AND gate, whose first input is connected to the first data-valid status terminal, whose second input is connected to the upstream data-valid terminal for connecting the upstream object, and whose output is connected to the write-enable pin.
The second logic unit is connected to the downstream object, the read-enable pin, and the memory-empty status pin. Upon receiving a read request from the downstream object, it determines from the signal on the memory-empty status pin whether the first-in first-out memory is empty; if not, it sends an enable signal to the read-enable pin to make the memory readable; otherwise, it makes the memory unreadable.
Specifically, the second logic unit includes a second inverter, whose input is connected to the memory-empty status pin and whose output is led out as the downstream data-valid terminal for connecting the downstream object; and a second AND gate, whose first input is connected to the downstream data-valid terminal and whose second input is connected to the downstream data-valid status terminal for connecting the downstream object. A gate-level sketch of this handshake follows.
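A minimal sketch with boolean signals; the inverter and AND gate on each side are the two gates named above, and the signal names are descriptive assumptions:

```python
# First logic unit (write side): write_enable = upstream_valid AND NOT full.
def write_side(upstream_valid: bool, fifo_full: bool) -> bool:
    not_full = not fifo_full                 # first inverter
    return upstream_valid and not_full       # first AND gate -> write-enable pin

# Second logic unit (read side): data_valid = NOT empty; the AND gate
# combines it with the downstream request to form read-enable.
def read_side(downstream_ready: bool, fifo_empty: bool) -> bool:
    data_valid = not fifo_empty              # second inverter -> data-valid terminal
    return downstream_ready and data_valid   # second AND gate -> read-enable pin
```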
In this embodiment, the convolver 100 operates as follows:
The data to be processed are read from the external memory through the first-in first-out data interfaces (the several SIF blocks in Fig. 2) and stored into the BRAM of the first line buffer 120 (Conv_in_cache in Fig. 2).
A K×K weight parameter (one channel) is read from the external memory through the first-in first-out data interfaces (the several SIF blocks in Fig. 2) and then stored into the parameter buffer 110.
Once the parameter buffer 110 has been loaded with a weight parameter, the convolver begins receiving and processing the pixel data of the feature map. After the data pass through the first line buffer 120 (Conv_in_cache in Fig. 2), the second line buffer 130 (Conv_in_buffer in Fig. 2), and the matrix buffer 140 (Con_in_matrix in Fig. 2), the convolution operation circuit 150 receives Pv×K² pixel data per clock cycle.
The convolution operation circuit 150 performs convolution and accumulation on the input data of each channel (the feature map input to each channel has height H and width W), then outputs the result for each channel to the output buffer 160.
The different input channels are visited cyclically, and the output buffer 160 accumulates the per-channel results until the feature map corresponding to the filter, of size (H-K+1)×(W-K+1), is obtained.
The pooling operation circuit 170 may then receive the (H-K+1)×(W-K+1) pixel data, pool them, and output the feature map; alternatively, the feature map may be output directly from the output buffer 160.
After the pooling operation circuit 170 or the output buffer 160 has output the feature map processed by one filter, the parameter buffer 110 is reloaded with a new weight parameter, and the pixel-processing procedure above is repeated iteratively with different filters until the pixel processing of all convolutional layers is complete. A top-level sketch of this run loop follows.
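A behavioural sketch of the run loop in pure Python; it is an illustrative model under simplifying assumptions (unit stride, no padding, pooling and write-back omitted), not the hardware itself:

```python
# End-to-end behavioural model of one layer's run loop (illustrative only).
def convolve2d(chan, kernel):
    """chan: H x W map, kernel: K x K; returns (H-K+1) x (W-K+1) map."""
    k, h, w = len(kernel), len(chan), len(chan[0])
    return [[sum(kernel[i][j] * chan[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(w - k + 1)] for r in range(h - k + 1)]

def run_layer(channels, filters):
    """channels: list of H x W maps; filters: per-filter lists of kernels."""
    fmaps = []
    for flt in filters:                          # reload parameter buffer per filter
        acc = None
        for chan, kernel in zip(channels, flt):  # cyclic channel access
            part = convolve2d(chan, kernel)      # convolution circuit 150
            acc = part if acc is None else [     # accumulation in buffer 160
                [a + b for a, b in zip(ra, rb)] for ra, rb in zip(acc, part)]
        fmaps.append(acc)                        # (H-K+1) x (W-K+1) map per filter
    return fmaps
```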
This embodiment also provides an artificial intelligence processing apparatus that includes the convolver 100 described above. The convolver 100 has been described in detail above and is not described again here.
The artificial intelligence processor includes a programmable logic circuit (PL) and a processing system circuit (PS). The processing system circuit includes a central processing unit, which may be implemented by an MCU, SoC, FPGA, or DSP, for example an embedded processor chip of the ARM architecture. The central processing unit is communicatively connected to an external memory 200, for example RAM or ROM such as third- or fourth-generation DDR SDRAM; the central processing unit can read data from and write data to the external memory.
In summary, the convolver of the present invention is composed of hardware such as a parameter buffer, an input buffer, a convolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces, and can process highly complex convolutional neural network algorithms at high speed. It can effectively solve the prior-art problems of slow processing and high demands on processor performance caused by software-based implementations. Therefore, the present invention effectively overcomes various shortcomings in the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles of the present invention and its effects, and are not intended to limit the present invention. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.
Claims (10)
- A convolver, electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters; characterized in that the convolver comprises: a parameter buffer, an input buffer, a convolution operation circuit, and an output buffer; the parameter buffer is configured to receive and output the weight parameters; the input buffer comprises a plurality of connected line buffers configured to receive and output the data to be processed, wherein each time each of the line buffers outputs one piece of data, these outputs together form one column of data output; the convolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, perform a convolution operation accordingly, and output a convolution operation result; and the output buffer is configured to receive the convolution operation result and output the convolution operation result to the external memory.
- The convolver according to claim 1, wherein the input buffer comprises: a first line buffer, which receives the pixel data of the feature map to be processed bit by bit, simultaneously outputs row pixel data after passing through the filter, and stores the input feature map of each convolutional layer; wherein the number of data items per row of the row pixel data is the number of parallel filters.
- The convolver according to claim 2, wherein the first line buffer outputs the row pixel data of each of the convolutional layers in turn, and, when outputting the row pixel data of each convolutional layer, outputs the row pixel data of each channel in turn.
- The convolver according to claim 2, wherein the input buffer further comprises: at least one second line buffer comprising a plurality of serially connected FIFO memories, each of the FIFO memories storing one row of pixel data of the feature map; wherein each row of pixel data is stored into the successive FIFO memories along the path formed by the serially connected FIFO memories; the second line buffer outputs pixel data in the form of a Pf×K matrix, where Pf is the number of parallel filters and K is the size of the convolution kernel.
- The convolver according to claim 4, wherein the input buffer further comprises: at least one matrix buffer, each of the matrix buffers consisting of a plurality of registers arranged in a matrix for storing pixel data, the size of the register matrix being Pf×K×2; when the number of columns of input pixel data is greater than K, the matrix buffer outputs pixel data in the form of a Pf×K×K matrix.
- The convolver according to claim 5, wherein the convolution operation circuit comprises: a plurality of convolution kernels running in parallel, each of the convolution kernels comprising multipliers for performing the convolution operation; and an adder tree that accumulates the output results of the plurality of multipliers; each of the convolution kernels receives pixel data in the form of a K×K matrix and, according to the input pixel data and the weight parameters, outputs pixel data bit by bit after the convolution operation.
- The convolver according to claim 6, wherein the output buffer comprises: two parallel FIFO memories, wherein the channel data processed by the same filter are accumulated and stored in the same FIFO memory; and a data selector configured to return each accumulation result to the adder tree until the adder tree outputs the final accumulation result.
- The convolver according to claim 1, wherein the convolver further comprises: a pooling operation circuit, connected between the output buffer and the external memory, configured to pool the convolution operation result and output it to the external memory.
- The convolver according to claim 1, wherein the internal components of the convolver are connected to one another, and the convolver is connected to the external memory, through first-in first-out data interfaces.
- An artificial intelligence processing apparatus, characterized in that the artificial intelligence processing apparatus comprises the convolver according to any one of claims 1 to 9.
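(As a reading aid only, not part of the claims.) To make the dataflow of claims 6 and 7 concrete, the following sketch assumes pairwise summation in the adder tree and feedback of a single running total through the data selector; both are illustrative choices, since the claims do not fix these details.

```python
import numpy as np

def adder_tree(values):
    """Pairwise reduction, as an adder tree would sum multiplier outputs."""
    vals = list(values)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:              # odd element passes through this stage
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

def accumulate_channels(windows, kernels):
    """Claims 6 and 7 in miniature: per-channel K x K products go through
    the adder tree; the data selector feeds the running total (held in the
    output FIFO) back until all channels of one filter are summed."""
    total = 0.0                                    # value held in the output FIFO
    for win, ker in zip(windows, kernels):         # one channel per pass
        products = (win * ker).flatten()           # K*K multipliers
        total = adder_tree(list(products) + [total])   # feedback via data selector
    return total

rng = np.random.default_rng(1)
wins = rng.standard_normal((2, 3, 3))              # 2 channels, K = 3
kers = rng.standard_normal((2, 3, 3))
assert np.isclose(accumulate_channels(wins, kers),
                  sum(np.sum(w * k) for w, k in zip(wins, kers)))
```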
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201880002156.XA CN109416756A (en) | 2018-01-15 | 2018-01-15 | Acoustic convolver and its applied artificial intelligence process device |
PCT/CN2018/072678 WO2019136764A1 (en) | 2018-01-15 | 2018-01-15 | Convolutor and artificial intelligent processing device applied thereto |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/072678 WO2019136764A1 (en) | 2018-01-15 | 2018-01-15 | Convolutor and artificial intelligent processing device applied thereto |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019136764A1 true WO2019136764A1 (en) | 2019-07-18 |
Family
ID=65462114
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/072678 WO2019136764A1 (en) | 2018-01-15 | 2018-01-15 | Convolutor and artificial intelligent processing device applied thereto |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109416756A (en) |
WO (1) | WO2019136764A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109978161B (en) * | 2019-03-08 | 2022-03-04 | 吉林大学 | Universal convolution-pooling synchronous processing convolution kernel system |
CN109992225B (en) * | 2019-04-04 | 2022-02-22 | 中科寒武纪科技股份有限公司 | Data output method and related device |
CN110866597B (en) * | 2019-09-27 | 2021-07-27 | 珠海博雅科技有限公司 | Data processing circuit and data processing method |
CN114503126A (en) * | 2019-10-18 | 2022-05-13 | 北京希姆计算科技有限公司 | Matrix operation circuit, device and method |
CN112784973B (en) * | 2019-11-04 | 2024-09-13 | 广州希姆半导体科技有限公司 | Convolution operation circuit, device and method |
CN111814675B (en) * | 2020-07-08 | 2023-09-29 | 上海雪湖科技有限公司 | Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA |
CN112101178B (en) * | 2020-09-10 | 2023-03-24 | 电子科技大学 | Intelligent SOC terminal assisting blind people in perceiving external environment |
CN112464150A (en) * | 2020-11-06 | 2021-03-09 | 苏州浪潮智能科技有限公司 | Method, device and medium for realizing data convolution operation based on FPGA |
CN114692073A (en) * | 2021-05-19 | 2022-07-01 | 神盾股份有限公司 | Data processing method and circuit based on convolution operation |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10049322B2 (en) * | 2015-05-21 | 2018-08-14 | Google Llc | Prefetching weights for use in a neural network processor |
CN104915322B (en) * | 2015-06-09 | 2018-05-01 | 中国人民解放军国防科学技术大学 | A kind of hardware-accelerated method of convolutional neural networks |
US10614354B2 (en) * | 2015-10-07 | 2020-04-07 | Altera Corporation | Method and apparatus for implementing layers on a convolutional neural network accelerator |
US10846591B2 (en) * | 2015-12-29 | 2020-11-24 | Synopsys, Inc. | Configurable and programmable multi-core architecture with a specialized instruction set for embedded application based on neural networks |
CN106228240B (en) * | 2016-07-30 | 2020-09-01 | 复旦大学 | Deep convolution neural network implementation method based on FPGA |
CN107229967B (en) * | 2016-08-22 | 2021-06-15 | 赛灵思公司 | Hardware accelerator and method for realizing sparse GRU neural network based on FPGA |
CN106970896B (en) * | 2017-03-30 | 2020-05-12 | 中国人民解放军国防科学技术大学 | Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution |
- 2018-01-15 CN CN201880002156.XA patent/CN109416756A/en active Pending
- 2018-01-15 WO PCT/CN2018/072678 patent/WO2019136764A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160379109A1 (en) * | 2015-06-29 | 2016-12-29 | Microsoft Technology Licensing, Llc | Convolutional neural networks on hardware accelerators |
US20170221176A1 (en) * | 2016-01-29 | 2017-08-03 | Fotonation Limited | Convolutional neural network |
WO2017129325A1 (en) * | 2016-01-29 | 2017-08-03 | Fotonation Limited | A convolutional neural network |
CN107403221A (en) * | 2016-05-03 | 2017-11-28 | 想象技术有限公司 | The hardware of convolutional neural networks is realized |
CN107392309A (en) * | 2017-09-11 | 2017-11-24 | 东南大学—无锡集成电路技术研究所 | A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210081770A1 (en) * | 2019-09-17 | 2021-03-18 | GOWN Semiconductor Corporation | System architecture based on soc fpga for edge artificial intelligence computing |
US11544544B2 (en) * | 2019-09-17 | 2023-01-03 | Gowin Semiconductor Corporation | System architecture based on SoC FPGA for edge artificial intelligence computing |
CN111047010A (en) * | 2019-11-25 | 2020-04-21 | 天津大学 | Method and device for reducing first-layer convolution calculation delay of CNN accelerator |
CN112734024A (en) * | 2020-04-17 | 2021-04-30 | 神亚科技股份有限公司 | Processing apparatus for performing convolutional neural network operations and method of operation thereof |
Also Published As
Publication number | Publication date |
---|---|
CN109416756A (en) | 2019-03-01 |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18900184; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26.11.2020) |
122 | Ep: pct application non-entry in european phase | Ref document number: 18900184; Country of ref document: EP; Kind code of ref document: A1 |