CN111582451A - Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
- Publication number
- CN111582451A (application CN202010383601.0A; granted publication CN111582451B)
- Authority
- CN
- China
- Prior art keywords
- layer
- calculation
- convolution
- convolutional
- layers
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Image Analysis (AREA)
- Complex Calculations (AREA)
- Image Processing (AREA)
Abstract
The invention discloses an image recognition interlayer parallel pipelined binarized convolutional neural network array architecture, comprising five calculation layers, M1, M2, M3, M4 and M5, arranged in sequence to form an interlayer pipeline. The M1, M2 and M3 layers each contain the calculations of two convolutional layers, forming a two-stage pipeline within each layer, and each ends with a maximum pooling layer that completes the pooling calculation. The M4 layer contains one fully-connected layer calculation and the M5 layer contains two. Each convolutional layer and each fully-connected layer is provided internally with a control unit connected to the global controller and with a memory storing the weight parameters and the binary coding parameters. The architecture improves the parallelism of image recognition calculation, reduces the weight storage requirement, effectively avoids multiplication, reduces power consumption and improves energy efficiency.
Description
Technical Field
The invention relates to the field of binarized convolutional neural networks, and in particular to an image recognition interlayer parallel pipelined binarized convolutional neural network array architecture.
Background
In biology, the neurons and synapses of an organism's brain are considered to form a network that produces biological awareness and helps the organism think and act. Inspired by this, researchers of artificial neural networks abstracted a mathematical model from it: the neurons of the human brain are abstracted, from the viewpoint of information processing, into simple mathematical units, which form networks according to different connection patterns. Artificial neural networks are now widely applied, with uses in fields such as speech recognition, image recognition and object detection. In the course of this research, the concept of the convolutional neural network was proposed: an artificial neural network with a deep structure, consisting of a feedforward computation and a feedback (error backpropagation) computation, of which only the feedforward computation is performed during recognition, while the feedback computation is additionally required during training. The study of convolutional neural networks was inspired by research on visual cells: neurons in the primary visual cortex were found to respond to simple features of the visual environment. The visual cortex contains simple cells and complex cells; simple cells respond strongly to specific spatial positions and preferred orientations, and spatial invariance can be achieved by pooling the inputs of simple cells in complex cells. It follows that the basic computations in a convolutional neural network are convolution and pooling. Convolution uses a kernel of a specific size to extract features from a specific region, mainly through multiply-accumulate operations. Pooling is a down-sampling process: down-sampling removes unimportant feature elements, reduces the size of the feature map and the number of computation parameters, while retaining the important features of the feature map so that subsequent computation is not affected.
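As a concrete illustration of these two basic computations (a minimal sketch, not part of the patent; all names and data are illustrative), the following Python fragment implements a single-channel convolution by multiply-accumulate and a 2 × 2 maximum pooling:

```python
# A minimal sketch (not from the patent) of the two basic computations
# described above: single-channel convolution via multiply-accumulate,
# and 2x2 maximum pooling as down-sampling. All names are illustrative.
from typing import List

Matrix = List[List[float]]

def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid convolution: slide the kernel and multiply-accumulate."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            out[r][c] = acc
    return out

def max_pool2x2(fmap: Matrix) -> Matrix:
    """2x2 maximum pooling: keep the largest value of each 2x2 region."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

img = [[float((r * 4 + c) % 3) for c in range(4)] for r in range(4)]
edge = [[1.0, 0.0, -1.0]] * 3
print(conv2d(img, edge))   # 2x2 feature map from multiply-accumulate
print(max_pool2x2(img))    # 4x4 map down-sampled to 2x2
```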
As research has progressed, the scale of convolutional neural networks has grown, so that they require more storage resources and ever more computing resources. Reducing the storage and computation requirements of convolutional neural networks has therefore become a research hotspot. The mainstream approaches currently include pruning, singular value decomposition, quantization and spiking neural networks. Pruning finds relatively unimportant connections between two adjacent layers during training and resets their weights to 0, which is equivalent to cutting those connections, thereby reducing the storage of weight parameters and the number of calculations. Singular value decomposition is generally applied to fully-connected layers; by means of singular value decomposition, a large-scale matrix multiplication can be converted into three smaller-scale matrix multiplications, reducing both storage and computation requirements. A quantized neural network represents the original floating-point values with fewer bits, generally 11-bit, 8-bit, 5-bit, 3-bit, 2-bit, 1-bit and the like; a network that uses 1 bit and completes its calculations with the states +1 and -1 is called a binarized convolutional neural network. A spiking neural network is closer to the working mode of a biological neural network: in computation, a pulse is propagated backwards only if the membrane potential of a presynaptic neuron exceeds a preset voltage threshold; otherwise the corresponding postsynaptic neuron remains idle for lack of an input pulse. In hardware acceleration, no pulse means no dynamic power consumption, only static power consumption, so the total power consumption can be reduced.
To achieve real-time image processing, researchers generally design accelerators on GPUs, FPGAs and ASICs. However, limited by the large storage and computation requirements of convolutional neural networks, image recognition consumes considerable resources: much hardware struggles to meet the storage requirement, the computation parallelism is low, and high energy efficiency cannot be achieved. It is therefore very important to design an interlayer parallel pipelined array architecture for image recognition based on the binarized convolutional neural network.
Disclosure of Invention
The invention aims to provide an image recognition interlayer parallel pipelined binarized convolutional neural network array architecture that improves the parallelism of image recognition computation, reduces the weight storage requirement, effectively avoids multiplication, reduces power consumption and improves energy efficiency.
The purpose of the invention is realized by the following technical scheme:
An image recognition interlayer parallel pipelined binarized convolutional neural network array architecture comprises five calculation layers, M1, M2, M3, M4 and M5, arranged in sequence to form an interlayer pipeline, wherein:
the M1, M2 and M3 layers each contain the calculations of two convolutional layers, forming a two-stage pipeline within each layer, and each ends with a maximum pooling layer that completes the pooling calculation; the M4 layer contains one fully-connected layer calculation and the M5 layer contains two; and each convolutional layer and each fully-connected layer is provided internally with a control unit connected to the global controller and with a memory storing the weight parameters and the binary coding parameters.
According to the technical scheme provided by the invention, hardware-accelerated computation of the binarized convolutional neural network for image recognition can reduce the hardware storage requirement, avoid multiplication, reduce energy consumption and improve parallelism, thereby improving recognition speed and energy efficiency.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the image recognition interlayer parallel pipelined binarized convolutional neural network array architecture according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the interlayer parallel pipeline calculation according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the first type of C unit (the convolution calculation part of a PE unit) according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the second type of C unit (the convolution calculation part of a PE unit) according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the conversion of binarized multiply-accumulate calculation into XNOR-accumulate calculation according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a PE unit for a convolution kernel of 3 × 3 size according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The embodiment of the invention provides an image recognition interlayer parallel pipelined binarized convolutional neural network array architecture, which mainly comprises five calculation layers, M1, M2, M3, M4 and M5, arranged in sequence to form an interlayer pipeline, wherein:
the M1, M2 and M3 layers each contain the calculations of two convolutional layers, forming a two-stage pipeline within each layer, and each ends with a maximum pooling layer that completes the pooling calculation; the M4 layer contains one fully-connected layer calculation and the M5 layer contains two; and each convolutional layer and each fully-connected layer is provided internally with a control unit connected to the global controller and with a memory storing the weight parameters and the binary coding parameters.
As shown in fig. 1, the M1 layer contains the two convolutional layers C1 and C2 and forms a two-stage pipeline, the M2 layer contains the two convolutional layers C3 and C4 and forms a two-stage pipeline, the M3 layer contains the two convolutional layers C5 and C6 and forms a two-stage pipeline, the M4 layer contains the fully-connected layer F1, and the M5 layer contains the fully-connected layers F2 and F3.
As shown in fig. 1, the whole binarized convolutional neural network array architecture is divided into 9 blocks, corresponding to the 6 convolutional layers (C1-C6) and the 3 fully-connected layers (F1-F3); Global Control is the global controller, used to realize global control. In fig. 1, Images in convolutional layer C1 stores the pictures to be recognized, Weights & K_H stores the weight parameters and the K_H parameters required for binary coding, Control is the control unit of convolutional layer C1 (receiving the control signals issued by the global controller), and the Ping-Pong Buffer is a ping-pong buffer unit containing two identical storage units, used to store the array calculation results of convolutional layer C1. The layers are organized as follows (see the sketch after this list):
- Convolutional layer C1 completes its calculation with 24 PEs (processing elements); the inputs of the PEs are connected to the control unit and to the memory storing the weight parameters and the binary coding parameters.
- Convolutional layer C2 also contains control signals, weights and binary coding parameters; its calculation array uses 256 PEs, its input is fed from the ping-pong buffer of C1, and its output is stored in the ping-pong buffer of C2.
- Convolutional layer C3 contains control signals, weights and binary coding parameters; its array uses 256 PEs, and its input is fed from the ping-pong buffer of C2.
- Convolutional layer C4 contains control signals, weights and binary coding parameters; its array uses 512 PEs, its input is fed from the ping-pong buffer of C3, and its output is stored in the ping-pong buffer of C4.
- Convolutional layer C5 contains control signals, weights and binary coding parameters; its array uses 256 PEs, its input is fed from the ping-pong buffer of C4, and its output is stored in the ping-pong buffer of C5.
- Convolutional layer C6 contains control signals, weights and binary coding parameters; its array uses 512 PEs, its input is fed from the ping-pong buffer of C5, and its output is stored in the ping-pong buffer of C6.
- Fully-connected layer F1 contains control signals, weight parameters and binary coding parameters; its array calculation does not use convolution PEs but completes the operation with 512 XNOR calculation units and an adder tree; its input is fed from the ping-pong buffer of C6 and its output is stored in the ping-pong buffer of F1.
- Fully-connected layer F2 contains control signals, weight parameters and binary coding parameters; it completes its calculation with 86 XNOR calculation units and an adder tree; its input is the ping-pong buffer of F1, and its output is passed directly to F3 for calculation.
- Fully-connected layer F3 contains control parameters, weight parameters and binary coding parameters; its input is the array calculation result of F2, which is computed and accumulated element by element; the calculation result of F2 is therefore not stored in a ping-pong buffer.
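As a hedged illustration of the ping-pong buffering between adjacent layers (a minimal software sketch, not the patent's circuit; names and data are illustrative), two identical banks swap roles each pass, so the producing layer writes one bank while the consuming layer reads the other:

```python
# Minimal sketch of a ping-pong buffer: two identical banks; the producer
# writes one bank while the consumer reads the other, then roles swap.
# Names and sizes are illustrative, not taken from the patent.
class PingPongBuffer:
    def __init__(self):
        self.banks = [[], []]   # two identical storage units
        self.write_sel = 0      # bank currently being written

    def write(self, values):
        self.banks[self.write_sel] = list(values)

    def read(self):
        return self.banks[1 - self.write_sel]  # consumer reads the other bank

    def swap(self):
        self.write_sel = 1 - self.write_sel

# C1 writes results for image t while C2 reads the results of image t-1:
buf = PingPongBuffer()
buf.write([1, 0, 1])   # C1 fills bank 0
buf.swap()
buf.write([0, 0, 1])   # C1 fills bank 1 ...
assert buf.read() == [1, 0, 1]  # ... while C2 consumes bank 0
```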
In the above description and the structure shown in fig. 1, the number of PEs in each convolutional layer is given by way of example, not limitation; in practical applications a user can adjust it to match actual conditions.
The binarized convolutional neural network array architecture provided by the embodiment of the invention forms a five-stage pipeline. The first convolutional layer C1 of the M1 layer is calculated first; once part of its results are available, the first convolutional layer C1 and the second convolutional layer C2 of the M1 layer calculate simultaneously. The calculation of the M1 layer takes N clock cycles in total: the M1 layer calculates during the first N clock cycles, the M1 and M2 layers work simultaneously during the second N clock cycles, and so on, until in the fifth N clock cycles the M1, M2, M3, M4 and M5 layers all work simultaneously, forming a five-stage pipeline.
As shown in fig. 2, there are 9 rows from top to bottom: C1, C2, C3, C4, C5, C6, F1, F2 and F3. n1 denotes the time C1 needs to complete its calculation, n2 the time C2 needs, and n3 the time the M1 layer needs. Since the calculation of C2 uses the results of C1 as input, C2 starts calculating as soon as C1 has completed a partial calculation; this both guarantees the correctness of the results and improves the calculation parallelism. C3 and C4 in the M2 layer start calculating in sequence only after C2 has completely finished; C5 and C6 in M3 complete their calculations in the same way. M4 contains only the single layer F1, while M5 contains the two layers F2 and F3; this split is chosen because the operating speed of a pipeline structure depends on its slowest stage.
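To make this schedule concrete, the following sketch (illustrative only; the uniform stage time of N cycles per M-layer is an assumption for the steady state) simulates the five-stage interlayer pipeline occupancy of fig. 2:

```python
# Sketch of the five-stage interlayer pipeline occupancy. Each stage takes
# one "stage time" of N clock cycles; image t occupies stage s during
# stage-time t+s. Purely illustrative of Fig. 2, not cycle-accurate.
STAGES = ["M1", "M2", "M3", "M4", "M5"]

def pipeline_schedule(num_images: int):
    total = num_images + len(STAGES) - 1
    for t in range(total):
        active = [f"{s}:img{t - i}" for i, s in enumerate(STAGES)
                  if 0 <= t - i < num_images]
        print(f"stage-time {t}: " + ", ".join(active))

pipeline_schedule(6)
# From stage-time 4 on, all five layers are busy at once (steady state).
```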
In the embodiment of the present invention, the first convolutional layer C1 inputs a different type of data than the subsequent convolutional layers. Specifically: the input of the first convolutional layer C1 is floating-point data, while the input of the second convolutional layer C2 of the M1 layer is the binarized data output by C1; likewise, the inputs of the convolutional layers in the M2 and M3 layers are binarized data. Consequently, the structure of the convolution calculation part differs between the first convolutional layer C1 of the M1 layer on the one hand, and the second convolutional layer C2 of the M1 layer and the convolutional layers of the M2 and M3 layers on the other.
As shown in fig. 1, the first convolutional layer C1 takes the images as input, and some input pictures are in RGB color, so the input itself cannot be binarized. The first convolutional layer C1 therefore uses the C unit shown in fig. 3, which routes the input data through a gating unit, stores it in a register, and then completes the subsequent accumulation step; the use of this unit eliminates multiplication. The formula is:

y = ∑_{i=1}^{k} ∑_{j=1}^{k} in_{i,j} × w_{i,j}

where in is an input value (floating-point data), w is a weight (binarized data, generally in one of the two states +1 or -1, so it can be represented by 1 bit in hardware), and y is the result of the convolution for one pixel. One convolution kernel generally accumulates over several pixels, e.g. kernels of 3 × 3, 5 × 5 or 7 × 7 size; by way of example, a convolution kernel of 3 × 3 size can be used, so k = 3. Since w is ±1, the gating unit simply adds in when w = +1 and subtracts it when w = -1, so no multiplier is needed. Because the input picture is floating-point data, the C unit of fig. 3 stores data in a 16-bit register with 1 sign bit, 5 integer bits and 10 fractional bits.
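A minimal software sketch of this multiplier-free first-layer accumulation (an illustration under the stated 1-bit weight encoding, not the patent's circuit; names are assumptions) is:

```python
# Sketch of the fig. 3 style C unit: 1-bit weights gate floating-point
# inputs between add and subtract, so the per-pixel convolution needs no
# multiplier. Names are illustrative.
def c_unit_first_layer(inputs, weight_bits):
    """inputs: k*k floating-point values; weight_bits: k*k bits,
    where bit 1 encodes weight +1 and bit 0 encodes weight -1."""
    acc = 0.0
    for x, w in zip(inputs, weight_bits):
        acc += x if w else -x   # gating unit instead of multiplication
    return acc

# 3x3 example: all-(+1) weights just sum the window.
window = [0.5, -1.25, 2.0, 0.0, 1.5, -0.5, 0.75, 1.0, -2.0]
print(c_unit_first_layer(window, [1] * 9))  # 2.0
```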
Unlike the structure of fig. 3, the structure of fig. 4 is applied to the calculations of convolutional layers C2 to C6. For the second convolutional layer C2 of the M1 layer and the convolutional layers of the M2 and M3 layers, the input is binarized data, meaning that both the input and the weight are binarized, so an XNOR operation is used. The XNOR operation can be expressed by the following formula:

y_xnor = ∑_{i=1}^{k} ∑_{j=1}^{k} in_{i,j} ⊙ w_{i,j}

In the above formula, k denotes the size of the convolution kernel (for a 3 × 3 convolution, k = 3), i and j denote the position coordinates of the convolved feature-map pixel and of the weight respectively, in denotes the input feature map, w denotes the weight, ⊙ is the XNOR operator, and y_xnor is the result of the XNOR-accumulate operation. By this formula, the multiplication operation in a binarized convolutional neural network can be replaced by an XNOR operation. With reference to fig. 5, for a convolution of 3 × 3 size, the multiply-accumulate result in the left diagram is -3, while after conversion to XNOR-accumulate the right diagram yields +3. To keep the calculation consistent in the hardware circuit, the final convolution result is obtained by the following formula:
y′ = 2 × y_xnor − L_conv
where y_xnor is the result of the convolution using the XNOR operation; L_conv is the number of accumulated elements, i.e. the convolution kernel size; assuming 3 × 3 kernels are used, and since one neuron has more than one convolution kernel, L_conv is typically a multiple of 9; y′ is the final convolution result. In the fully-connected layer calculation, L_conv is the column size of the weight matrix. In the example above, the multiply-accumulate result in the left diagram of fig. 5 is -3 and the XNOR-accumulate result in the right diagram is 3; substituting the XNOR-accumulate result into the above formula gives y′ = 2 × 3 − 9 = −3, so the convolution result remains correct after converting multiply-accumulate into XNOR-accumulate.
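A short consistency check of this conversion (an illustrative sketch; the bit encoding 0 ↔ −1, 1 ↔ +1 and the test data are assumptions) can be run in Python: XNOR counts the positions where input and weight agree, and 2 × y_xnor − L_conv reproduces the ±1 multiply-accumulate result:

```python
# Verify y' = 2 * y_xnor - L_conv against +-1 multiply-accumulate.
# Bit encoding: 0 <-> -1, 1 <-> +1. Values are arbitrary test data.
import random

def xnor(a: int, b: int) -> int:
    return 1 - (a ^ b)          # 1 when the bits agree

def to_pm1(bit: int) -> int:
    return 1 if bit else -1

random.seed(0)
for _ in range(1000):
    L = 9                        # 3x3 kernel, L_conv = 9
    ins = [random.randint(0, 1) for _ in range(L)]
    ws  = [random.randint(0, 1) for _ in range(L)]
    mac    = sum(to_pm1(a) * to_pm1(b) for a, b in zip(ins, ws))
    y_xnor = sum(xnor(a, b) for a, b in zip(ins, ws))
    assert 2 * y_xnor - L == mac
print("2*y_xnor - L_conv matches multiply-accumulate for all cases")
```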
If the calculation were performed directly according to this formula, one multiplication would still remain. In the embodiment of the invention, however, the outputs of each convolutional layer (C1-C6) of the M1, M2 and M3 layers, of the fully-connected layer of the M4 layer, and of the first fully-connected layer of the M5 layer all undergo a batch normalization (BN) operation, which is used during training to accelerate training. The formula of the batch normalization operation is:

Y = γ × (x − μ) / √(σ² + ε) + β

where μ is the mean (expectation), σ² is the variance, γ and β are the weight and bias of the batch normalization operation, ε is a small positive constant (much less than 1) added to prevent division by zero when σ² equals 0, and x denotes the output of the convolutional layer or fully-connected layer.
Combining the multiply-accumulate-to-XNOR-accumulate formula with the BN-layer formula, the following formula can be derived:

Y = k_f × (X − h_f)

In the above formula, X is the result of the convolution calculation without the bias, bias is the bias term, and k_f and h_f are two derived calculation parameters obtained by combining the convolution calculation with the BN-layer calculation formula; both are floating-point numbers.
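The patent text does not write out k_f and h_f explicitly; one consistent derivation (an assumption, obtained by substituting x = X + bias into the batch normalization formula above) is:

Y = γ × (X + bias − μ) / √(σ² + ε) + β = k_f × (X − h_f)

with

k_f = γ / √(σ² + ε),  h_f = μ − bias − β × √(σ² + ε) / γ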
The convolutional layers C1-C6 and the fully-connected layers F1-F2 all produce binarized outputs, so binary coding must be performed. Combining the above formula with the activation function, the binary coding is expressed as:

sign(x) = +1 if x ≥ 0, −1 otherwise

In the above formula, sign(x) is the sign function; x refers to the input information, i.e. the Y produced by the batch normalization.
In conjunction with the activation function, the derived function can be simplified to:

Y = k_i ⊙ (x ≥ h_f)

In the above formula, k_i represents the sign of k_f, encoded as 1 when k_f is positive and 0 when it is negative. Both k_i and h_f can be processed off-line before hardware acceleration and then loaded into the accelerator, where they enter the hardware circuit directly to participate in the calculation. In this way the binary coding is completed, the multiplication is avoided, and the calculation process is simplified.
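As a hedged consistency check (a sketch; the bit encoding 1 ↔ +1, 0 ↔ −1 is the same assumption as above, and the test values are arbitrary), the simplified form k_i ⊙ (x ≥ h_f) can be compared against the sign of k_f × (x − h_f):

```python
# Check that the bit k_i XNOR (x >= h_f) equals the sign of k_f*(x - h_f),
# with bit 1 encoding +1 and bit 0 encoding -1. Test values are arbitrary.
import random

def xnor(a: int, b: int) -> int:
    return 1 - (a ^ b)

random.seed(1)
for _ in range(1000):
    k_f = random.uniform(-2, 2) or 1.0   # avoid exactly 0
    h_f = random.uniform(-2, 2)
    x   = random.uniform(-4, 4)
    k_i = 1 if k_f > 0 else 0            # off-line encoded sign of k_f
    bit = xnor(k_i, int(x >= h_f))       # hardware-style binary coding
    ref = 1 if k_f * (x - h_f) >= 0 else 0
    assert bit == ref or x == h_f        # boundary x == h_f is sign-ambiguous
print("k_i XNOR (x >= h_f) matches sign(k_f * (x - h_f))")
```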
As shown in fig. 1, each of the convolutional layers C1 to C6 contains a number of PE units (processing elements), each operating in parallel. Fig. 6 illustrates the structure of a PE unit for a 3 × 3 convolution, which mainly comprises three parts. The first part is the input buffer; since the convolution kernel size in this example is 3 × 3, the input buffer must buffer 3 lines. The second part consists of several convolution calculation units: fig. 6 contains 9 C units of the kind shown in fig. 3, i.e. fig. 6 shows the PE unit of convolutional layer C1; for the other convolutional layers C2-C6, these are replaced by the C unit shown in fig. 4. The third part is an adder-tree unit that accumulates the results output by the second part. The weight register unit of the XNOR calculation part completes its buffering before the calculation in broadcast mode, and the adder tree starts to work and outputs the result after the XNOR calculation part completes its calculation.
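The following sketch (illustrative, assuming the bit encoding above; function names and data are assumptions) shows how such a 3 × 3 PE for the binarized layers could be organized, with nine XNOR C units feeding an adder tree:

```python
# Sketch of a 3x3 PE for the binarized layers (C2-C6 style): a 3-line input
# buffer feeds nine XNOR C units whose outputs an adder tree accumulates.
# Bit encoding: 1 <-> +1, 0 <-> -1. Structure is illustrative only.
def adder_tree(values):
    """Pairwise reduction, as an adder tree would accumulate in hardware."""
    vals = list(values)
    while len(vals) > 1:
        vals = [vals[i] + (vals[i + 1] if i + 1 < len(vals) else 0)
                for i in range(0, len(vals), 2)]
    return vals[0]

def pe_3x3(line_buffer, weights, col):
    """line_buffer: 3 buffered image lines (bits); weights: 3x3 bits
    broadcast into the weight registers; col: left column of the window."""
    products = [1 - (line_buffer[i][col + j] ^ weights[i][j])  # XNOR C units
                for i in range(3) for j in range(3)]
    y_xnor = adder_tree(products)
    return 2 * y_xnor - 9            # y' = 2*y_xnor - L_conv, L_conv = 9

lines = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
w = [[1, 1, 1], [0, 0, 0], [1, 1, 1]]
print(pe_3x3(lines, w, 0), pe_3x3(lines, w, 1))   # two adjacent windows
```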
To better illustrate the calculation process of the embodiment, the specific structure of the neural network of the embodiment is given below:
| Layer | Input | Padding | Convolution kernel | Output | Weight size | Output size |
|---|---|---|---|---|---|---|
| C1 | 3×32×32 | 1 (−1) | 64×3×3×3 | 64×32×32 | 1728b | 64kb |
| C2 | 64×32×32 | 1 (−1) | 64×64×3×3 | 64×32×32 | 36kb | 64kb |
| MP1 | 64×32×32 | - | - | 64×16×16 | - | 16kb |
| C3 | 64×16×16 | 1 (−1) | 128×64×3×3 | 128×16×16 | 72kb | 32kb |
| C4 | 128×16×16 | 1 (−1) | 128×128×3×3 | 128×16×16 | 144kb | 32kb |
| MP2 | 128×16×16 | - | - | 128×8×8 | - | 8kb |
| C5 | 128×8×8 | 1 (−1) | 256×128×3×3 | 256×8×8 | 288kb | 16kb |
| C6 | 256×8×8 | 1 (−1) | 256×256×3×3 | 256×8×8 | 576kb | 16kb |
| MP3 | 256×8×8 | - | - | 256×4×4 | - | 4kb |
| F1 | 4096 | - | 4096 | 1024 | 4Mb | 1kb |
| F2 | 1024 | - | 1024 | 1024 | 1Mb | 1kb |
| F3 | 1024 | - | 1024 | 10 | 10kb | 10b |

Table 1. Neural network structure
In the above table, C1-C6 denote the 6 convolutional layers, F1-F3 the 3 fully-connected layers, and MP1-MP3 the 3 pooling layers; the pooling layers retain the maximum value in each 2 × 2 region, i.e. maximum pooling is adopted. Specifically, to match the hardware circuit architecture during acceleration, the padding value is changed from 0 to -1 during software training: 0 is usually used as padding in ordinary training, but modifying the padding value to -1 avoids a ternary (+1, 0, -1) operation when the XNOR operation replaces multiplication in the hardware circuit. In the column "Convolution kernel", the first value denotes the number of convolution kernels, the second the number of input channels, and the last two values the height and width of the convolution kernel.
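A small sketch of this -1 padding convention (illustrative; bit 0 again encodes -1, as in the examples above) shows that padded positions remain ordinary two-state XNOR operands rather than a third value:

```python
# Pad a binarized feature map with -1 (bit 0) instead of 0, so padded
# positions stay within the two-state {+1, -1} encoding used by the XNOR
# datapath. Purely illustrative.
def pad_minus_one(fmap, p=1):
    w = len(fmap[0]) + 2 * p
    top = [[0] * w for _ in range(p)]            # bit 0 encodes -1
    body = [[0] * p + row + [0] * p for row in fmap]
    return top + body + [[0] * w for _ in range(p)]

fm = [[1, 0], [1, 1]]
for row in pad_minus_one(fm):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 0, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 0]
```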
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that can readily be conceived by a person skilled in the art within the technical scope disclosed by the present invention falls within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (8)
1. An image recognition interlayer parallel pipelined binarization convolutional neural network array architecture, characterized by comprising five calculation layers, M1, M2, M3, M4 and M5, arranged in sequence to form an interlayer pipeline, wherein:
the M1, M2 and M3 layers each contain the calculations of two convolutional layers, forming a two-stage pipeline within each layer, and each ends with a maximum pooling layer that completes the pooling calculation; the M4 layer contains one fully-connected layer calculation and the M5 layer contains two; and each convolutional layer and each fully-connected layer is provided internally with a control unit connected to the global controller and with a memory storing the weight parameters and the binary coding parameters.
2. The array architecture of claim 1, wherein the first convolutional layer C1 of the M1 layer is calculated first, and once part of its results are available, the first convolutional layer C1 and the second convolutional layer C2 of the M1 layer calculate simultaneously; the calculation of the M1 layer takes N clock cycles in total, the M1 layer calculates during the first N clock cycles, the M1 and M2 layers work simultaneously during the second N clock cycles, and so on, so that in the fifth N clock cycles the M1, M2, M3, M4 and M5 layers work simultaneously, forming a five-stage pipeline.
3. The array architecture of claim 1, wherein a plurality of processing units are disposed in each of the M1, M2 and M3 layers, and the input of each processing unit is connected to a control unit and to a memory storing the weight parameters and the binary coding parameters; each processing unit comprises three parts: the first part is an input buffer, the second part is a set of convolution calculation units, and the third part is an adder-tree unit for accumulating the results output by the second part.
4. The image recognition interlayer parallel pipelined binarization convolutional neural network array architecture as claimed in claim 1, 2 or 3, wherein the input of the first convolutional layer C1 of the M1 layer is floating-point data and the input of the second convolutional layer C2 of the M1 layer is the binarized data output by the first convolutional layer C1; likewise, the inputs of the convolutional layers in the M2 and M3 layers are binarized data;
the structure of the convolution calculation part of the processing unit differs between the first convolutional layer C1 of the M1 layer on the one hand, and the second convolutional layer C2 of the M1 layer and the convolutional layers of the M2 and M3 layers on the other.
5. The image recognition interlayer parallel pipelined binarization convolutional neural network array architecture of claim 4, wherein
the first convolutional layer C1 routes its input data through a gating unit, then stores it in a register and completes the subsequent accumulation step, according to the formula:

y = ∑_{i=1}^{k} ∑_{j=1}^{k} in_{i,j} × w_{i,j}

where in is an input value, being floating-point data, w is a weight, being binarized data, and y is the result of the convolution for one pixel.
6. The array architecture of claim 5, wherein the input of the second convolutional layer C2 of the M1 layer and the inputs of the convolutional layers of the M2 and M3 layers are binarized data, meaning that both the input and the weight are binarized; the convolution is completed by an XNOR-accumulate operation, and the final convolution result is obtained by the transformation:

y′ = 2 × y_xnor − L_conv

where y_xnor is the result of the XNOR-accumulate operation, L_conv is the convolution kernel size, and y′ is the final convolution result.
7. The array architecture of claim 1, wherein ping-pong buffer units are provided at the ends of the convolutional layers of the M1, M2 and M3 layers and of the fully-connected layer of the M4 layer, for storing the calculation results of the corresponding layers.
8. The array architecture of claim 1, wherein the outputs of each convolutional layer of the M1, M2 and M3 layers, of the fully-connected layer of the M4 layer, and of the first fully-connected layer of the M5 layer are batch normalized and binary coded by the following formula:

Y = k_i ⊙ (x ≥ h_f)

where k_i represents the sign of k_f, encoded as 1 for positive and 0 for negative, and ⊙ is the XNOR operator.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010383601.0A (granted as CN111582451B) | 2020-05-08 | 2020-05-08 | Image recognition interlayer parallel pipeline type binary convolution neural network array architecture |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010383601.0A (granted as CN111582451B) | 2020-05-08 | 2020-05-08 | Image recognition interlayer parallel pipeline type binary convolution neural network array architecture |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111582451A | 2020-08-25 |
| CN111582451B | 2022-09-06 |
Family
ID=72125412
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010383601.0A (Active) | Image recognition interlayer parallel pipeline type binary convolution neural network array architecture | 2020-05-08 | 2020-05-08 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN111582451B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6148101A (en) * | 1995-11-27 | 2000-11-14 | Canon Kabushiki Kaisha | Digital image processor |
CN106909970A (en) * | 2017-01-12 | 2017-06-30 | 南京大学 | A kind of two-value weight convolutional neural networks hardware accelerator computing module based on approximate calculation |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN108647773A (en) * | 2018-04-20 | 2018-10-12 | 复旦大学 | A kind of hardwired interconnections framework of restructural convolutional neural networks |
CN108665063A (en) * | 2018-05-18 | 2018-10-16 | 南京大学 | Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system |
CN109784489A (en) * | 2019-01-16 | 2019-05-21 | 北京大学软件与微电子学院 | Convolutional neural networks IP kernel based on FPGA |
CN110780923A (en) * | 2019-10-31 | 2020-02-11 | 合肥工业大学 | Hardware accelerator applied to binary convolution neural network and data processing method thereof |
CN110782022A (en) * | 2019-10-31 | 2020-02-11 | 福州大学 | Method for implementing small neural network for programmable logic device mobile terminal |
CN111008691A (en) * | 2019-11-06 | 2020-04-14 | 北京中科胜芯科技有限公司 | Convolutional neural network accelerator architecture with weight and activation value both binarized |
Non-Patent Citations (2)
| Title |
|---|
| Daniel Gibert et al., "A Hierarchical Convolutional Neural Network for Malware Classification," 2019 International Joint Conference on Neural Networks (IJCNN) |
| Wang Wei et al., "FPGA parallel structure design of the convolutional neural network (CNN) algorithm," Microelectronics & Computer |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112396176A (en) * | 2020-11-11 | 2021-02-23 | 华中科技大学 | Hardware neural network batch normalization system |
CN112396176B (en) * | 2020-11-11 | 2022-05-20 | 华中科技大学 | Hardware neural network batch normalization system |
WO2022160310A1 (en) * | 2021-01-30 | 2022-08-04 | 华为技术有限公司 | Data processing method and processor |
CN113254206A (en) * | 2021-05-25 | 2021-08-13 | 北京一流科技有限公司 | Data processing system and method thereof |
CN113344179A (en) * | 2021-05-31 | 2021-09-03 | 哈尔滨理工大学 | IP core of binary convolution neural network algorithm based on FPGA |
CN113344179B (en) * | 2021-05-31 | 2022-06-14 | 哈尔滨理工大学 | IP core of binary convolution neural network algorithm based on FPGA |
CN113688983A (en) * | 2021-08-09 | 2021-11-23 | 上海新氦类脑智能科技有限公司 | Convolution operation implementation method, circuit and terminal for reducing weight storage in impulse neural network |
WO2023240842A1 (en) * | 2022-06-15 | 2023-12-21 | 奥比中光科技集团股份有限公司 | Method and apparatus for solving pipelining computation conflicts |
Also Published As
Publication number | Publication date |
---|---|
CN111582451B (en) | 2022-09-06 |
Similar Documents
Publication | Title |
---|---|
CN111582451B (en) | Image recognition interlayer parallel pipeline type binary convolution neural network array architecture | |
You et al. | Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks | |
CN108805270B (en) | Convolutional neural network system based on memory | |
CN108427990B (en) | Neural network computing system and method | |
Liang et al. | FP-BNN: Binarized neural network on FPGA | |
CN107729989B (en) | Device and method for executing artificial neural network forward operation | |
EP3407266B1 (en) | Artificial neural network calculating device and method for sparse connection | |
Jiao et al. | Accelerating low bit-width convolutional neural networks with embedded FPGA | |
CN110163353B (en) | Computing device and method | |
Cai et al. | Low bit-width convolutional neural network on RRAM | |
CN108090565A (en) | Accelerated method is trained in a kind of convolutional neural networks parallelization | |
US11983616B2 (en) | Methods and apparatus for constructing digital circuits for performing matrix operations | |
CN110383300A (en) | A kind of computing device and method | |
CN105488563A (en) | Deep learning oriented sparse self-adaptive neural network, algorithm and implementation device | |
CN113010213B (en) | Simplified instruction set storage and calculation integrated neural network coprocessor based on resistance change memristor | |
CN107203808A (en) | A kind of two-value Convole Unit and corresponding two-value convolutional neural networks processor | |
KR20190089685A (en) | Method and apparatus for processing data | |
Zhang et al. | A practical highly paralleled ReRAM-based DNN accelerator by reusing weight pattern repetitions | |
CN111275167A (en) | High-energy-efficiency pulse array framework for binary convolutional neural network | |
Adel et al. | Accelerating deep neural networks using FPGA | |
US12086453B2 (en) | Memory for an artificial neural network accelerator | |
CN111178492A (en) | Computing device, related product and computing method for executing artificial neural network model | |
CN112836793A (en) | Floating point separable convolution calculation accelerating device, system and image processing method | |
Kong et al. | A high efficient architecture for convolution neural network accelerator | |
CN112749799B (en) | Hardware accelerator, acceleration method and image classification method of full-frequency-domain convolutional neural network based on self-adaptive ReLU |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |