CN111639751A - Non-zero padding training method for binary convolutional neural network - Google Patents
- Publication number
- CN111639751A (application number CN202010455024.1A)
- Authority
- CN
- China
- Prior art keywords
- network
- binary
- zero
- training
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a non-zero padding training method for binary convolutional neural networks. First, the loss function used in ordinary neural network training is improved: a joint loss function is constructed using knowledge distillation theory, so that a zero-padded binary network guides the training of the non-zero-padded binary network. A progressive training strategy is then used to train the non-zero-padded binary network: starting from the zero-padded binary network, the number of non-zero-padded binary activations is gradually increased, which lowers the training difficulty of the non-zero-padded binary network. The invention corrects the pseudo-binary activation problem of zero-padded binary networks and, by combining the joint loss function with progressive training, effectively reduces the training difficulty of the non-zero-padded binary network and greatly alleviates the performance degradation that correcting the padding value would otherwise cause.
Description
Technical Field
The invention belongs to the field of image processing, and in particular relates to a non-zero padding training method for binary convolutional neural networks.
Background
Over the past decade, deep learning has attracted growing attention from researchers because of its large advantages over shallow models in feature extraction and model construction, and it has developed rapidly in fields such as computer vision and text recognition. Deep learning is mainly embodied in deep neural networks, among which the convolutional neural network (CNN) is a pioneering achievement inspired by biological neuroscience. Compared with traditional methods, CNNs feature weight sharing, local connectivity, and pooling, which effectively reduce the number of globally optimized training parameters, lower model complexity, and give the network model a certain degree of invariance to scaling, translation, and distortion of the input. Thanks to these properties, CNNs have shown excellent performance in many computer vision tasks, including image classification, object detection, and tracking.
Although deep convolutional networks have shown reliable results in many vision tasks, their huge storage and computation costs limit their use on today's widely popular portable devices. To broaden the application of CNNs, model compression and acceleration have become hot topics in computer vision. Current compression methods for CNNs fall mainly into three categories:
The first is network pruning. Its basic idea is that well-performing CNNs usually have more complex structures, yet some of their parameters contribute little to the final output and are therefore redundant. For an existing CNN, one can look for an effective criterion to judge the importance of convolution kernel channels and prune the corresponding redundant kernel parameters to improve the efficiency of the network. In this method, the importance criterion has a decisive influence on model performance.
The second is neural architecture search. Within a certain search space, a good search algorithm lets the machine automatically find a fast and accurate network, thereby achieving compression. The key is to build a large space of network architectures, explore it with efficient search algorithms, and find the optimal CNN architecture under a specific combination of training data and computational constraints (e.g., network size and latency).
The third is network quantization. This method quantizes the 32-bit full-precision weight parameters into lower-bit parameters (e.g., 8-bit, 4-bit, 1-bit) to obtain a low-bit network. It effectively reduces parameter redundancy, thereby lowering storage, communication bandwidth, and computational complexity, and facilitates the deployment of deep networks in lightweight scenarios such as AI chips.
Among quantization methods, network binarization achieves the greatest compression: the activations and weights of the model are quantized to 1 bit, so model storage can theoretically be reduced by about 32 times, and because the floating-point additions and multiplications in the network become bit-level operations, inference can theoretically be sped up by about 58 times. After binary quantization the network is therefore compressed dramatically. Current binary convolutional neural networks typically use the sign function (sign(·)) to quantize weights and activations to the two values {-1, +1}. For a given convolutional layer, a "padding" operation is usually introduced to keep the feature map size unchanged during convolution: the feature map fed into the layer is first padded with a value, and the padded feature map is then convolved. In mainstream deep learning frameworks such as PyTorch, the default padding value is 0, so after the sign function quantizes the layer's activations to {-1, +1}, the padding operation introduces a third value, 0; this padding process is shown in Figure 1. Because the binary network research community has not paid attention to this problem, the results reported in the literature are in fact based on "pseudo-binary networks" with mixed {-1, +1, 0} activations, which cannot enjoy the extremely high compression ratio of truly binary networks in practical applications. Experiments show that when the padding value is 1, i.e., a non-zero-padded binary network whose activations after quantization are {-1, +1}, performance drops noticeably compared with a zero-padded binary network of the same structure. How to compensate for the performance degradation of the non-zero-padded binary network caused by correcting the padding value, so that it can run with both high accuracy and a high compression ratio when actually deployed, remains a problem requiring further study.
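The effect described above can be reproduced with a few lines of PyTorch; the following sketch is illustrative only (the tensor shape and the convention sign(0) = +1 are assumptions, not taken from the patent):

```python
# Illustrative sketch: zero padding after sign quantization reintroduces a third value.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)              # full-precision activation map
x_bin = torch.sign(x)
x_bin[x_bin == 0] = 1                     # treat sign(0) as +1 so values are exactly {-1, +1}

zero_padded = F.pad(x_bin, (1, 1, 1, 1), value=0.0)   # default framework padding
one_padded = F.pad(x_bin, (1, 1, 1, 1), value=1.0)    # non-zero padding considered here

print(zero_padded.unique())   # contains -1, 0 and 1: a pseudo-binary activation map
print(one_padded.unique())    # contains only -1 and 1: a truly binary activation map
```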
Summary of the Invention
Considering the "pseudo-binary" problem caused by the incorrect padding value of convolutional layers in current binary network research, the invention constructs a non-zero-padded binary convolutional neural network with activations in {-1, +1} and proposes a non-zero padding training method for binary convolutional neural networks to compensate for the performance degradation of the binary network caused by correcting the padding value.
The non-zero padding training method for binary convolutional neural networks of the invention comprises the following steps:
Step 1: Prepare the training data set for the vision task.
Step 2: Construct a deep convolutional neural network for the vision task, binarize both the weights and the activations of the network to 1-bit values, and initialize the network with the weights of a pre-trained zero-padded binary network whose padding value is 0.
Step 3: Construct a joint loss function for the non-zero-padded binary network using knowledge distillation theory.
Step 4: Train the non-zero-padded binary network on the training data set progressively, gradually increasing the number of channels padded with 1.
Step 5: Apply the fully non-zero-padded binary network obtained in Step 4 to the task's test set and test its classification performance.
The advantages of the invention are:
(1) The non-zero padding training method for binary convolutional neural networks takes into account the contradiction between binarization and the padding value that previous research overlooked, constructs a non-zero-padded binary deep network with activations in {-1, +1}, and corrects the pseudo-binary activation problem of zero-padded binary networks.
(2) Using knowledge distillation theory, the method constructs a joint loss function for training the non-zero-padded binary network, so that the zero-padded binary network guides the training of the non-zero-padded binary network, compensating for the performance degradation caused by correcting the padding value.
(3) The method trains the non-zero-padded binary network progressively, which provides a transition period for the evolution from the zero-padded binary network to the non-zero-padded binary network and reduces the accuracy loss of the non-zero-padded network.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the padding process in the convolution operation of a zero-padded binary network;
Figure 2 is the overall flowchart of the non-zero padding training method for binary convolutional neural networks of the invention;
Figure 3 illustrates the construction of the joint loss function in the method;
Figure 4 illustrates the progressive training process in the method;
Figure 5 shows the training and test curves of the proposed training method and an ordinary training method on the CIFAR-100 test set.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The non-zero padding training method for binary convolutional neural networks of the invention, shown in Figure 2, comprises the following steps:
Step 1: Prepare training samples for a specific vision task.
The invention takes image classification as the training task and uses the training set of the CIFAR-100 data set as training samples. The training samples cover 100 classes with 500 images per class, 50,000 samples in total. The i-th training sample is denoted (x_i, y_i), where x_i is the image data and y_i is the class label of that image. During training, the image data of each sample are randomly cropped and flipped for data augmentation and to alleviate overfitting of the network.
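A minimal data-preparation sketch with torchvision might look as follows; the crop size, the padding of 4, and the other hyper-parameters are assumptions, since the patent only specifies random cropping and flipping of CIFAR-100 training images:

```python
# Illustrative sketch of Step 1: CIFAR-100 training data with random crop and flip.
import torchvision
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),     # random cropping (augmentation)
    T.RandomHorizontalFlip(),        # random flipping (augmentation)
    T.ToTensor(),
])

train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=train_transform)
```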
Step 2: Construct the binary convolutional neural network and initialize it.
First, a full-precision CNN with L convolutional layers is defined as a triple <I, W, Conv>, where I and W are sets of tensors whose elements are 32-bit floating-point numbers, i.e., full-precision values. Let I_l denote the input feature map (or activation) of the l-th layer and W_l the convolution kernels of the l-th layer; the operator Conv denotes the convolution between the input feature map I_l and all convolution kernels W_l of the same layer.
To construct the binary convolutional neural network, the full-precision feature maps and convolution kernels are binarized with the sign function:

$$\hat{I}_l = \mathrm{sign}(I_l), \qquad \hat{W}_l = \mathrm{sign}(W_l)$$

where $\hat{I}_l$ and $\hat{W}_l$ denote the binarized feature map and convolution kernel. For any input a, the sign function sign(·) is

$$\mathrm{sign}(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$
When performing the convolution, the network structure may require a padding operation on $\hat{I}_l$, which can be described as

$$\tilde{I}_l = \mathrm{Pad}(\hat{I}_l, \mathrm{value})$$

where $\tilde{I}_l$ is the feature map after being padded with the value $\mathrm{value}$.
After the binary convolutional neural network has been constructed, it is initialized as follows: the weights of the constructed binary convolutional neural network are initialized with the weights of a pre-trained zero-padded binary network (padding value 0); at this point the value in the padding formula above is set to 0.
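A minimal sketch of a binary convolution layer with a configurable padding value is given below; the class and helper names are illustrative, and the straight-through gradient estimator normally used to train binary networks is omitted for brevity:

```python
# Illustrative sketch of Step 2: a binary convolution whose padding value can be 0 or 1.
import torch
import torch.nn as nn
import torch.nn.functional as F

def binarize(t: torch.Tensor) -> torch.Tensor:
    """Quantize a tensor to {-1, +1} with the sign function (sign(0) treated as +1)."""
    b = torch.sign(t)
    b[b == 0] = 1
    return b

class BinaryConv2d(nn.Conv2d):
    def __init__(self, *args, pad_value: float = 0.0, **kwargs):
        super().__init__(*args, padding=0, **kwargs)   # padding is applied manually below
        self.pad_value = pad_value                     # 0.0 -> zero padding, 1.0 -> non-zero padding

    def forward(self, x):
        x_b = binarize(x)                              # binarize the activation
        w_b = binarize(self.weight)                    # binarize the kernel
        p = self.kernel_size[0] // 2                   # "same" padding for an odd square kernel
        x_b = F.pad(x_b, (p, p, p, p), value=self.pad_value)
        return F.conv2d(x_b, w_b, self.bias, self.stride)
```

Initializing such a network from a pre-trained zero-padded binary network would then amount to loading that network's weights, e.g., via load_state_dict, before training begins.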
Step 3: Construct the joint loss function.
Based on knowledge distillation theory, the loss function of the network training process is improved through a joint training scheme, so that the network padded with 0 guides the training of the network padded with 1 and the classification-accuracy gap between the two is reduced. The construction of the joint loss function is shown in Figure 3. The input image is fed into both the non-zero-padded and the zero-padded binary networks. The non-zero-padded network is stacked from basic blocks consisting of quantization, non-zero-padded binary convolution, batch normalization, and activation, while the zero-padded network is stacked from basic blocks consisting of quantization, zero-padded binary convolution, batch normalization, and activation. The cross-entropy loss between the outputs of the two networks gives the loss based on knowledge distillation theory. Meanwhile, the cross-entropy loss between the output of the non-zero-padded binary network and the ground-truth label of the image (the hard target) gives the hard-target loss. The weighted sum of the distillation-based cross-entropy loss and the hard-target cross-entropy loss is the total loss of the training process, which is the joint loss function constructed by the invention.
For the loss based on knowledge distillation theory, the pre-trained binary network padded with 0 serves as the teacher network and the binary network to be trained, padded with 1, serves as the student network. The input image is fed into both networks, the output of the teacher network serves as the soft target, and its cross-entropy with the output of the student network is computed as

$$L_{st}(x_i; W_s) = -\sum_{j=1}^{n} T_j \log S_j$$

where x_i is the image data of the i-th training sample, W_s are the weight parameters of the student network, L_st(x_i; W_s) is the cross-entropy loss between the outputs of the student and teacher networks, n is the number of classes of the classification task, and S_j and T_j are the j-th outputs of the student and teacher networks, respectively. The total output of the student network is S = F_s(x; W_s), i.e., the output produced by feeding the input image into the student network F_s with parameters W_s; the total output of the teacher network is T = F_t(x; W_t), i.e., the output produced by feeding the input image into the teacher network F_t with parameters W_t, also called the soft target.
The joint loss function used in network training is

$$L(x; W_s) = L_s(x; W_s) + \theta L_{st}(x; W_s)$$

The left-hand side is the improved joint loss function of the method; on the right-hand side, L_s(x; W_s) is the cross-entropy loss between the student network output and the hard target, and θ is the weighting coefficient of the distillation loss. A larger coefficient means that training depends more on the contribution of the teacher network, which helps the student network recognize easy samples more readily, but a coefficient that is too large reduces the influence of the ground-truth labels and makes it difficult for the network to recognize hard samples. In the invention, θ is set to 0.25, which lets the joint training scheme work well.
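A minimal sketch of this joint loss is given below; whether the network outputs S and T are raw logits or softmax probabilities is not stated explicitly, so the softmax normalization used here is an assumption:

```python
# Illustrative sketch of Step 3: joint loss = hard-target loss + theta * distillation loss.
import torch
import torch.nn.functional as F

def joint_loss(student_logits, teacher_logits, labels, theta: float = 0.25):
    # Hard-target loss: cross-entropy between the student output and the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Distillation loss: cross-entropy between the teacher soft target and the student output.
    soft_target = F.softmax(teacher_logits.detach(), dim=1)     # teacher is not updated
    log_student = F.log_softmax(student_logits, dim=1)
    distill_loss = -(soft_target * log_student).sum(dim=1).mean()
    return hard_loss + theta * distill_loss
```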
Step 4: Perform progressive network training.
Considering that in some quantization-related work, progressively quantizing a network or gradually reducing the number of bits it uses has a positive effect on training low-bit networks, the method follows the same idea and further improves the training of the non-zero-padded binary network of Step 3. Starting from the initialized student network (i.e., initially the feature maps of all channels are padded with 0), the number of feature-map channels padded with 1 is increased progressively. The progressive training process of the invention is shown in Figure 4. Let c denote the proportion of feature-map channels padded with 1 during training and let epoch denote the number of passes over the data set in the current training process; c grows as epoch increases according to
$$c = \tanh(\mathrm{epoch}/50)$$
where, for any input a, the output of the tanh function is

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$$

At a given number of iterations, the number of feature-map channels padded with 1 in the binary network is
$$cnum = [c_1 \times c]$$
where c_1 is the total number of channels of the feature map and [·] denotes rounding. When the number of iterations is 0, i.e., at the initial moment, cnum = 0, meaning that the network is in fact a zero-padded binary network. As the number of iterations grows, cnum increases and the network gradually evolves toward a non-zero-padded binary network; once the number of iterations is large enough, the network finally becomes a completely non-zero-padded binary network. Table 1 shows, for a convolutional layer whose input feature map has c_1 = 5 channels, how the proportion c of channels padded with 1, their number cnum, and the number of channels padded with 0 change with the number of data-set epochs.
Table 1. Changes of c, cnum, and the number of zero-padded channels with epoch
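The schedule can be expressed compactly in code; the following sketch is illustrative, and the rounding convention of [·] is assumed here to be round-to-nearest:

```python
# Illustrative sketch of Step 4: progressive schedule for channels padded with 1.
import math

def channels_padded_with_one(epoch: int, total_channels: int) -> int:
    """Return cnum = [c1 * c] with c = tanh(epoch / 50)."""
    c = math.tanh(epoch / 50)
    return round(total_channels * c)

# Example for a layer with c1 = 5 input channels (the epochs listed are arbitrary).
for epoch in (0, 10, 50, 150):
    cnum = channels_padded_with_one(epoch, 5)
    print(epoch, cnum, 5 - cnum)   # epoch, channels padded with 1, channels padded with 0
```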
Step 5: Apply the finally obtained non-zero-padded binary deep network to the task's test set and test its classification performance.
Figure 5 shows the training and test accuracy curves on the CIFAR-100 test set of the proposed training method and of directly training the non-zero-padded binary network (initialized with the weight parameters of the zero-padded binary network). Here train1 and test1 refer to the training method of the invention, train2 and test2 refer to the ordinary training method, and the vertical axis (accuracy) gives the accuracy value. It can be seen that the method effectively preserves the performance of the zero-padded binary network, improves the test results of the non-zero-padded binary network, and compensates for the performance degradation of the non-zero-padded binary network caused by correcting the padding value.
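For completeness, a minimal sketch of the test-set evaluation in Step 5 might look as follows; the model and data-loader objects are assumed to exist and are not specified by the patent:

```python
# Illustrative sketch of Step 5: top-1 classification accuracy on the test set.
import torch

@torch.no_grad()
def test_accuracy(model, test_loader, device="cpu"):
    model.eval()
    correct, total = 0, 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```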
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010455024.1A CN111639751A (en) | 2020-05-26 | 2020-05-26 | Non-zero padding training method for binary convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010455024.1A CN111639751A (en) | 2020-05-26 | 2020-05-26 | Non-zero padding training method for binary convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111639751A true CN111639751A (en) | 2020-09-08 |
Family
ID=72329699
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010455024.1A Pending CN111639751A (en) | 2020-05-26 | 2020-05-26 | Non-zero padding training method for binary convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111639751A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113159273A (en) * | 2021-01-30 | 2021-07-23 | 华为技术有限公司 | Neural network training method and related equipment |
CN113159273B (en) * | 2021-01-30 | 2024-04-30 | 华为技术有限公司 | Neural network training method and related equipment |
CN113792871A (en) * | 2021-08-04 | 2021-12-14 | 北京旷视科技有限公司 | Neural network training method, target identification method, device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113011499B (en) | Hyperspectral remote sensing image classification method based on double-attention machine system | |
US12131258B2 (en) | Joint pruning and quantization scheme for deep neural networks | |
CN110163258B (en) | Zero sample learning method and system based on semantic attribute attention redistribution mechanism | |
CN107229757B (en) | Video retrieval method based on deep learning and hash coding | |
CN111507521B (en) | Electric power load forecasting method and forecasting device in Taiwan area | |
CN114402596B (en) | Neural network model decoding method, device, system and medium | |
CN112949740B (en) | A Small Sample Image Classification Method Based on Multi-Level Metric | |
CN108304920B (en) | Method for optimizing multi-scale learning network based on MobileNet | |
WO2021042857A1 (en) | Processing method and processing apparatus for image segmentation model | |
CN114186672A (en) | Efficient high-precision training algorithm for impulse neural network | |
CN113421237A (en) | No-reference image quality evaluation method based on depth feature transfer learning | |
CN107423747A (en) | A kind of conspicuousness object detection method based on depth convolutional network | |
CN116912708A (en) | Remote sensing image building extraction method based on deep learning | |
CN109918507B (en) | An Improved Text Classification Method Based on TextCNN | |
CN107563430A (en) | A kind of convolutional neural networks algorithm optimization method based on sparse autocoder and gray scale correlation fractal dimension | |
CN113240683A (en) | Attention mechanism-based lightweight semantic segmentation model construction method | |
CN113239949A (en) | Data reconstruction method based on 1D packet convolutional neural network | |
CN114821050A (en) | Named image segmentation method based on transformer | |
CN114972753A (en) | A lightweight semantic segmentation method and system based on contextual information aggregation and assisted learning | |
CN107480723A (en) | Texture Recognition based on partial binary threshold learning network | |
CN113283524A (en) | Anti-attack based deep neural network approximate model analysis method | |
CN112232395A (en) | Semi-supervised image classification method for generating confrontation network based on joint training | |
CN114780767A (en) | A large-scale image retrieval method and system based on deep convolutional neural network | |
CN111639751A (en) | Non-zero padding training method for binary convolutional neural network | |
CN114677545A (en) | Lightweight image classification method based on similarity pruning and efficient module |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200908 |